The Design and Architecture of Urchin

1. Purpose

The first purpose of Urchin is to be a web-based RSS aggregator and filter. The software allows the creation of new RSS feeds that consist of a collection of items from many sources, with the new RSS feeds themselves being URI addressable.

A secondary purpose is to facilitate the creation of new primary RSS feeds - the software distribution contains modules that help simplify the creation of RSS feeds from information stored in scrapable text formats or relational databases.

A supplementary purpose is to help in the interchange of information between RSS and non-RSS formats. Urchin can produce the results of a query in any output format that can be defined as an XSL transformation of an RSS 1.0 document or by using an HTML::Template template. These outputs are also URI addressable, enabling Urchin to function as a simple RSS newsreader or to run a simple news portal.

2. Design

The development of Urchin has followed two design goals:

Be as modular as possible, to allow the easy plugging together of different modules to meet the different purposes.
Capture as much information from incoming RSS feeds as possible - for example, store all the 'extra' namespaced elements in an RSS 1.0 feed.

As always, the code is still a work in progress. See the wishlist for a list of desirable features that are not yet implemented.

3. Architecture

Click on the image below to see a schematic of the architecture of Urchin.

4. Urchin Data Model

For a full entity relationship diagram, see LogicalModel.jpg

Here is a brief description of the purpose of the tables:

Tables channel, skip_hours, skip_days, cloud

These store the channel-level elements that are defined in the RSS 0.91, 0.92, 1.0 and 2.0 specifications.

Tables item, enclosure, source, global_identifier

These store the item-level elements that are defined in the RSS 0.91, 0.92, 1.0 and 2.0 specifications.

Tables channel_category, item_category, category

These associate RSS 0.9x/2.0 style categories with channels and items.

Tables module_element, channel_module_element, item_module_element, namespace

These four tables form a basic RDF triple store for capturing information contained in RSS 1.0 modules. For more information on RDF and RSS 1.0 modules, see http://www.w3.org/RDF/ and http://web.resource.org/rss/1.0/.

The predicate of an RDF triple is described by the fields module_element.namespace_id and module_element.name – namespace_id is a foreign key to the namespace table. The full URI for the predicate is the namespace.namespace_uri concatenated with the module_element.name.
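The concatenation described above can be sketched as follows. This is a minimal illustration in Python using an in-memory SQLite database, not Urchin's own Perl code. Only the names namespace.namespace_uri, module_element.namespace_id and module_element.name come from the text; the key column module_element_id, the sample rows and the predicate_uri helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The key column names below are assumed, not taken from Urchin's schema.
    CREATE TABLE namespace (
        namespace_id  INTEGER PRIMARY KEY,
        namespace_uri TEXT NOT NULL
    );
    CREATE TABLE module_element (
        module_element_id INTEGER PRIMARY KEY,
        namespace_id      INTEGER REFERENCES namespace(namespace_id),
        name              TEXT NOT NULL
    );
    INSERT INTO namespace VALUES (1, 'http://purl.org/dc/elements/1.1/');
    INSERT INTO module_element VALUES (10, 1, 'creator');
""")


def predicate_uri(conn, module_element_id):
    """Return namespace_uri concatenated with name, or None if no such row."""
    row = conn.execute(
        """
        SELECT n.namespace_uri || m.name
        FROM module_element AS m
        JOIN namespace AS n ON n.namespace_id = m.namespace_id
        WHERE m.module_element_id = ?
        """,
        (module_element_id,),
    ).fetchone()
    return row[0] if row else None


print(predicate_uri(conn, 10))
```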

The object of the triple is described by the module_element.value and module_element.value_type fields – value is the string value of the object, value_type indicates whether that string should be interpreted as a URI, a literal, or a blank node identifier.

The subject of the triple is indicated by the module_element.parent_element_id field. If the subject of the triple is an item, parent_element_id will be a foreign key to the item_module_element table – that table in turn associates an item.item_id with the triple. The URI for the subject is (in this case) the value of item.link. Likewise, if the subject of the triple is a channel, the channel_module_element table is used in this way. If the subject of the triple is neither an item nor a channel then it must be (in Urchin) the object of a previously stored triple. In this case, module_element.parent_element_id refers to another row in the module_element table – the subject of the triple being the same as the triple object stored in that referred-to row.

This arrangement forms a very basic RDF triple store - but one that is well suited to Urchin's purposes, as it effectively isolates triples from different sources. The provenance of any RDF statement encoded in this way can be efficiently traced to its source, and RDF graphs from different sources are not merged. This avoids the possibility of Urchin outputting statements about an item that were not made by that item's source. Of course, this is not the only way that aggregators can work, but it is (we believe) appropriate for the messy world of RSS.
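The subject lookup and the provenance property described above can be sketched roughly as follows. This is an illustrative Python model using plain dictionaries in place of the tables, not Urchin's Perl implementation. The row layouts, the value_type strings, the use of a channel link as the channel's URI and the two helper functions are all assumptions; in particular, the way a row is recognised as belonging to an item or a channel is simplified to a dictionary lookup.

```python
# Hypothetical in-memory stand-ins for the tables; all row layouts are assumed.
item = {1: {"link": "http://example.org/story/1"}}
channel = {7: {"link": "http://example.org/feed"}}

# Assumed shape of the link tables: element id -> owning item or channel id.
item_module_element = {10: 1}
channel_module_element = {20: 7}

module_element = {
    # Triple whose subject is item 1 and whose object is a blank node.
    10: {"name": "creator", "value": "_:b1", "value_type": "blank",
         "parent_element_id": None},
    # Triple whose subject is the object of element 10.
    11: {"name": "name", "value": "A. Writer", "value_type": "literal",
         "parent_element_id": 10},
    # Triple whose subject is channel 7.
    20: {"name": "publisher", "value": "Example Press", "value_type": "literal",
         "parent_element_id": None},
}


def subject_of(element_id):
    """Return the subject of the triple stored as element_id."""
    if element_id in item_module_element:
        return item[item_module_element[element_id]]["link"]
    if element_id in channel_module_element:
        return channel[channel_module_element[element_id]]["link"]
    # Otherwise the subject is the object of the referred-to triple.
    parent = module_element[element_id]["parent_element_id"]
    return module_element[parent]["value"]


def owner_of(element_id):
    """Walk up the parent chain to the item or channel that made the statement."""
    current = element_id
    while current is not None:
        if current in item_module_element:
            return ("item", item_module_element[current])
        if current in channel_module_element:
            return ("channel", channel_module_element[current])
        current = module_element[current]["parent_element_id"]
    raise LookupError("element %r is not attached to an item or channel" % element_id)


print(subject_of(11), owner_of(11))
```

Because every chain of parent references ends at exactly one item or channel, each statement in this model has a single traceable source, which is the isolation property the text describes.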

Tables aggregate, channel_aggregation

The aggregate table lists the names of aggregations of channels, and the channel_aggregation table lists the channels that are in those aggregations. See the usage documentation for more information on how these are used.
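As a rough illustration of how these two tables relate, the Python sketch below resolves an aggregation name to its channels with a join. The table names come from the text; every column name, the sample data and the channels_in helper are assumptions, and the usage documentation remains the authority on how Urchin actually uses aggregations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- All column names here are assumed for the sake of the example.
    CREATE TABLE channel (channel_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE aggregate (aggregate_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE channel_aggregation (
        aggregate_id INTEGER REFERENCES aggregate(aggregate_id),
        channel_id   INTEGER REFERENCES channel(channel_id),
        PRIMARY KEY (aggregate_id, channel_id)
    );
    INSERT INTO channel VALUES (1, 'Physics News');
    INSERT INTO channel VALUES (2, 'Chemistry News');
    INSERT INTO channel VALUES (3, 'Sports');
    INSERT INTO aggregate VALUES (1, 'science');
    INSERT INTO channel_aggregation VALUES (1, 1);
    INSERT INTO channel_aggregation VALUES (1, 2);
""")


def channels_in(conn, aggregate_name):
    """Return the titles of the channels that belong to a named aggregation."""
    rows = conn.execute(
        """
        SELECT c.title
        FROM aggregate AS a
        JOIN channel_aggregation AS ca ON ca.aggregate_id = a.aggregate_id
        JOIN channel AS c ON c.channel_id = ca.channel_id
        WHERE a.name = ?
        ORDER BY c.title
        """,
        (aggregate_name,),
    )
    return [row[0] for row in rows]


print(channels_in(conn, "science"))
```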

Table output_filter

This lists named filters - i.e. search strings that have been labeled. See Usage.html for more information on how these are used.

Table output_format

This table lists named custom output styles, how to generate them, and what mime_type they are. See the usage documentation for more information on how these are used.

Table output

This table lists named outputs - i.e. it associates a named filter with a named output format so that a specific search, with the results in a specific style, can be referenced by a simple label.
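The relationship between the output_filter, output_format and output tables might be pictured as below. This Python sketch uses dictionaries keyed by label; the field names, the labels, the placeholder search string and the resolve_output helper are all hypothetical, since the text does not give the column names or the search syntax.

```python
# Hypothetical rows keyed by label; the real column names are not given in the text.
output_filter = {
    "physics-only": "physics",  # label -> search string (placeholder, not real syntax)
}
output_format = {
    "headlines-html": {"method": "template", "mime_type": "text/html"},
}
output = {
    # A named output pairs one named filter with one named format.
    "physics-page": {"filter": "physics-only", "format": "headlines-html"},
}


def resolve_output(label):
    """Expand an output label into its search string and format details."""
    entry = output[label]
    fmt = output_format[entry["format"]]
    return {
        "search": output_filter[entry["filter"]],
        "method": fmt["method"],
        "mime_type": fmt["mime_type"],
    }


print(resolve_output("physics-page"))
```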

5. Urchin Perl Modules

For details of each module's API, see the perldocs. A brief description of the function of each module is given here.

Urchin

This is the base class from which all other modules inherit. It implements subroutines for Urchin-wide tasks - loading the configuration file, database access and fetching resources from the Web.

Urchin::Import

This is the base class for the data importing modules. It implements subroutines for preparing an XML::RSS for saving to the Urchin database - i.e. for mapping RSS 0.9x/2.0 elements to RSS 1.0 elements and generating a set of RDF statements.

Urchin::Import::RSS

This module is used if data is to be extracted from an RSS feed.

Urchin::Import::Scrape

This module is used for extracting RSS data from a resource using regular expressions. It implements a subroutine that constructs an XML::RSS object by evaluating the regular expressions. See Usage.html for more details on how to use this.
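The general idea of building feed items from regular expression matches can be sketched as follows. This is a Python illustration of the technique only: the sample page, the pattern, and the scrape_items helper are assumptions, and the sketch produces plain dictionaries rather than the XML::RSS object that the module constructs. Usage.html describes how the expressions are actually configured.

```python
import re

# Hypothetical page fragment standing in for a fetched resource.
PAGE = """
<li><a href="http://example.org/a">First headline</a></li>
<li><a href="http://example.org/b">Second headline</a></li>
"""

# One match per item; named groups become the item's fields.
ITEM_RE = re.compile(r'<a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>')


def scrape_items(text, pattern=ITEM_RE):
    """Return one dict per match, keyed by the pattern's named groups."""
    return [match.groupdict() for match in pattern.finditer(text)]


for entry in scrape_items(PAGE):
    print(entry["title"], entry["link"])
```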

Urchin::Import::DBI

This module is used for extracting RSS data from DBI sources. It implements a subroutine that constructs an XML::RSS object by evaluating a specially constructed SQL statement. See Usage.html for more details on how to use this.
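One plausible reading of a "specially constructed SQL statement" is a query whose result columns are aliased to RSS element names. The Python sketch below illustrates that idea with SQLite; the aliasing convention, the news table, and the items_from_query helper are assumptions rather than a description of Urchin's actual behaviour, which is documented in Usage.html.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be read by column name
conn.executescript("""
    -- Hypothetical source table with non-RSS column names.
    CREATE TABLE news (headline TEXT, url TEXT, summary TEXT, posted TEXT);
    INSERT INTO news VALUES
        ('Older story', 'http://example.org/1', 'First summary', '2004-01-01');
    INSERT INTO news VALUES
        ('Newer story', 'http://example.org/2', 'Second summary', '2004-01-02');
""")

# Assumed convention: column aliases name the RSS elements to populate.
FEED_SQL = """
    SELECT headline AS title, url AS link, summary AS description
    FROM news
    ORDER BY posted DESC
"""


def items_from_query(conn, sql):
    """Run the query and return one dict per row, keyed by column alias."""
    return [dict(row) for row in conn.execute(sql)]


items = items_from_query(conn, FEED_SQL)
print(items[0]["title"])
```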

Urchin::SaveData

This module performs the task of inserting data from an RSS file into the database. Its basic function is to traverse an RDF graph in order to save the data to the database.

Urchin::OutputFeed

This is the base class for the output modules. Its basic function is to extract information from the Urchin database - it implements subroutines for parsing the Urchin search syntax and constructing an XML::RSS object from the results of a SQL query.

Urchin::OutputFeed::XSLT

This module is used to generate output using an XSL document. It runs the XSL transformation on an RSS 1.0 serialization of an XML::RSS object.

Urchin::OutputFeed::Template

This module is used to generate output using an HTML::Template file. It sets up the channel_title, channel_description, channel_link, items, item_title, item_description and item_link variables for use in the template.
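To make the shape of these variables concrete, the Python sketch below builds a parameter structure using the variable names listed above and renders it with a small stand-in function. It assumes that items is a loop whose rows carry the item_title, item_description and item_link values; that nesting, the sample data and the render function are illustrative assumptions, and the sketch does not use HTML::Template itself.

```python
from html import escape

# Variable names follow the text; the nesting of item_* under items is assumed.
params = {
    "channel_title": "Example feed",
    "channel_description": "Stories about examples",
    "channel_link": "http://example.org/",
    "items": [
        {"item_title": "First story", "item_description": "About A & B",
         "item_link": "http://example.org/1"},
        {"item_title": "Second story", "item_description": "About C",
         "item_link": "http://example.org/2"},
    ],
}


def render(params):
    """Stand-in for a template: emit a heading and one list entry per item."""
    out = ['<h1><a href="{}">{}</a></h1>'.format(
        escape(params["channel_link"], quote=True),
        escape(params["channel_title"]),
    )]
    out.append("<p>{}</p>".format(escape(params["channel_description"])))
    out.append("<ul>")
    for row in params["items"]:
        out.append('<li><a href="{}">{}</a>: {}</li>'.format(
            escape(row["item_link"], quote=True),
            escape(row["item_title"]),
            escape(row["item_description"]),
        ))
    out.append("</ul>")
    return "\n".join(out)


print(render(params))
```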

Urchin::OutputFeed::RDF::Core::Serializer
Urchin::OutputFeed::RDF::Core::Model::Serializer

These two modules are based on, and extend the functionality of, RDF::Core::Serializer and RDF::Core::Model::Serializer in order to be able to serialise an RDF model in the stricter RSS 1.0 format, and to allow the Urchin database to be used as an RDF model.

Specifically, the following capabilities were added to Urchin::OutputFeed::RDF::Core::Model::Serializer in order to be able to constrain the output to the RSS 1.0 style:

to set the default namespace
to specify an order in which elements from different namespaces are serialized
to set a preferred subject type – i.e. triples whose subject is of that type will be serialised first
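The effect of the last two capabilities can be illustrated with a simple sort. The Python sketch below is not the serializer's code: the namespace prefixes, the dictionary layout of a triple and the serialization_order function are hypothetical, and it only shows how a preferred subject type and a namespace order could together determine output order.

```python
# Hypothetical settings: prefixes in the order they should be serialized.
NAMESPACE_ORDER = ["rss", "dc", "syn"]
PREFERRED_TYPE = "channel"


def serialization_order(triples, preferred_type=PREFERRED_TYPE,
                        namespace_order=NAMESPACE_ORDER):
    """Sort triples: preferred subject type first, then by namespace rank."""
    rank = {prefix: index for index, prefix in enumerate(namespace_order)}

    def key(triple):
        return (
            triple["subject_type"] != preferred_type,    # False sorts first
            rank.get(triple["namespace"], len(rank)),    # unknown prefixes last
        )

    # sorted() is stable, so ties keep their original relative order.
    return sorted(triples, key=key)


triples = [
    {"subject_type": "item", "namespace": "dc", "name": "creator"},
    {"subject_type": "channel", "namespace": "dc", "name": "publisher"},
    {"subject_type": "item", "namespace": "rss", "name": "title"},
    {"subject_type": "channel", "namespace": "rss", "name": "title"},
]

for triple in serialization_order(triples):
    print(triple["subject_type"], triple["namespace"], triple["name"])
```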

Urchin::OutputFeed::RDFSerialize

This module is a subclass of Urchin::OutputFeed::RDF::Core::Model::Serializer. It constrains RDF serialisation to the RSS 1.0 format, and allows the use of the Urchin database as a source of namespace abbreviation to URI mappings.

Urchin::OutputFeed::RDF::Core::Storage::Urchin

This module mimics the API of RDF::Core::Storage to allow the Urchin database to be seen as an RDF::Core::Storage triple store. Note that it currently only allows read access to that store.