These instructions cover the basics of setting up an RSS aggregating and filtering service using Urchin – how to import RSS feeds, how to set up filters, and how to add new output styles.

The installation instructions are elsewhere, as are further details on how to write new output style templates.

1. Adding and updating RSS feeds

The urchinadm script can be used to add or update feeds, and to refresh all the feeds in the database:

urchinadm [-X] command
    add         Add/Update channels to the database. Reads one
                URL per line from stdin or a list given as
                arguments.
    remove      Remove channels in the database. Reads one URL
                or channel_id per line from stdin.
    list        List all channel source URL's in the database.
                Writes one URL per line to stdout.
    id [-q]     Look up the identification numbers for a given URL
                (channel or item) and display in query form with title.
    refresh     Update all channels in the database.
    rebuild     Save the current list of source URL's, wipe the
                database, and reimport all channels.
                (Interactive!)
    wipe        Delete all data in the database, including
                channel metadata. (Interactive!)
    clrcache    Clear the downloaded files cache (i.e. force
                later reload of all documents from network).
    docgrep     Search the cached documents for a given string
                or regex.
    xdomgrep    Search the cached documents for a given XML tagname
                and optionally an attribute of that tag.
    xpathgrep   Search the cached documents for a given XPath.
    -X          Turns on debugging output.

e.g.
$ perl urchinadm add < feeds.txt
$ perl urchinadm add http://www.nature.com/nsu/rss.rdf \
http://slashdot.org/slashdot.rss
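
Since urchinadm add reads one URL per line from stdin, the input file can be prepared by a small script. The following is a minimal sketch in Python, used here only for illustration; the helper name feed_list_text and the choice to keep only HTTP(S) URLs are assumptions, not part of Urchin.

```python
from urllib.parse import urlparse


def feed_list_text(urls):
    """Return one URL per line, dropping blanks, duplicates and non-HTTP(S) entries."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue
        if urlparse(url).scheme not in ("http", "https"):
            continue  # assumption: only HTTP(S) feeds are wanted
        seen.add(url)
        lines.append(url)
    return "".join(line + "\n" for line in lines)


# The two feeds from the example above, plus a duplicate that is dropped.
print(feed_list_text([
    "http://www.nature.com/nsu/rss.rdf",
    "http://slashdot.org/slashdot.rss",
    "http://slashdot.org/slashdot.rss",
]), end="")
```

Redirecting this output to feeds.txt gives a file suitable for: perl urchinadm add < feeds.txt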

2. Adding and updating other data sources

The Urchin source code includes two modules for extracting RSS-like data from non-RSS sources.

The Urchin::Import::Scrape module constructs an XML::RSS object when given the URL of a document to scrape and some Perl regular expressions to extract the relevant parts of the document. The Urchin::Import::DBI module can construct an XML::RSS object when given a DBI connection and a SQL statement for extracting the relevant data. Currently, this functionality is not incorporated into the urchinadm automated import and refresh script, so you must write separate scripts, using these modules, to create or import feeds for non-RSS sources. Note that the channel.import_format_id for such sources must be set to a non-RSS value, so that the automated refresh process skips them.

Using Urchin::Import::Scrape

The match() subroutine should be passed a hash reference of the following form:

'workarea' => $workarea_regex,
'channel' => $channelarea_regex,
'channel_title' => $channeltitle_regex,
'channel_link' => $channellink_regex,
'channel_description' => $channeldescription_regex,
'channel_last_build_date' => $channellastbuild_regex,
'copyright' => $copyright_regex,
'language' => $language_regex,
'itemarea' => $itemarea_regex,
'item_title' => $itemtitle_regex,
'item_link' => $itemlink_regex,
'item_description' => $itemdescription_regex,
'author_name' => $authorname_regex

The workarea regular expression is evaluated first, and all subsequent regular expressions are evaluated against the matched area. For items, the itemarea regular expression is evaluated as a global match, and the subsequent item regular expressions are evaluated in turn against each of the matched areas. Note that every regular expression should have exactly one set of capturing parentheses: the content of that captured portion is what is returned as the result of the match.
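
The evaluation order described above can be sketched as follows. This is an illustration in Python of the matching flow only, not the Urchin::Import::Scrape implementation; the function name scrape, the sample page and the sample patterns are all hypothetical, and only a few of the hash keys are used.

```python
import re


def scrape(document, patterns):
    """Apply workarea first, then itemarea globally, then the per-item patterns."""

    def first_group(pattern, text):
        # Each pattern has exactly one capturing group; return its content.
        match = re.search(pattern, text, re.DOTALL)
        return match.group(1) if match else None

    work = first_group(patterns["workarea"], document) or ""
    items = []
    # Global match: with one capturing group, findall returns each captured area.
    for area in re.findall(patterns["itemarea"], work, re.DOTALL):
        items.append({
            "title": first_group(patterns["item_title"], area),
            "link": first_group(patterns["item_link"], area),
        })
    return items


page = """<html><div id="news">
<li><a href="/a1">First story</a></li>
<li><a href="/a2">Second story</a></li>
</div><li><a href="/x">Outside the work area</a></li></html>"""

patterns = {
    "workarea": r'<div id="news">(.*?)</div>',
    "itemarea": r"<li>(.*?)</li>",
    "item_title": r"<a [^>]*>(.*?)</a>",
    "item_link": r'href="([^"]+)"',
}

for item in scrape(page, patterns):
    print(item["link"], item["title"])
```

The link outside the news block is never seen by the item patterns, because they are only evaluated against the area captured by workarea.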

Using Urchin::Import::DBI

This module uses the Urchin::OutputFeed module to construct an XML::RSS object. The trick is to write your SQL statement so that Urchin::OutputFeed thinks it is seeing data from the Urchin database. Urchin::OutputFeed expects a result set with field names of the form tablename__fieldname, where tablename and fieldname correspond to tables and their fields in the Urchin database. So, in your SQL query, you should assign aliases of that form to the columns of the result set. For example, in a hypothetical article database:

select articles.title as item__title,
articles.author as item__author_name,
CONCAT('http://www.example.org/articles/',
articles.issueid, '/', articles.articleid)
as item__link,
CONCAT('Publication Issue ', articles.issueid,
' Page ', articles.startpage) as item__description
from ...
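
To make the tablename__fieldname convention concrete, the following Python sketch groups one result row by table name. It only illustrates the naming convention; it is not how Urchin::OutputFeed is implemented, and the function name group_row and the sample values are hypothetical.

```python
def group_row(row):
    """Split 'table__field' keys of one result row into {table: {field: value}}."""
    grouped = {}
    for column, value in row.items():
        table, separator, field = column.partition("__")
        if not separator or not field:
            raise ValueError(f"column {column!r} is not of the form table__field")
        grouped.setdefault(table, {})[field] = value
    return grouped


# A row as the aliased query above might return it (values are made up).
row = {
    "item__title": "An example article",
    "item__author_name": "A. N. Author",
    "item__link": "http://www.example.org/articles/12/345",
    "item__description": "Publication Issue 12 Page 7",
}
print(group_row(row))
```

A column without the double underscore, such as a plain title, raises an error here, which is a reminder that every selected column needs an alias.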

3. Adding custom output styles

Custom outputs can be generated either by an XSL transformation of an Urchin RSS 1.0 output document, or using an HTML::Template template. Currently (version 0.92), the information on how to generate a custom output style must be manually inserted into the database. To fully specify an output style, the following information should be in the output_format table:

format_id – primary key, id for this custom output style
title – the output style label that will be used in the fmt= part of a query string (see below)
description – a short description of this output style
template_url – location of the HTML::Template or XSL file
template_type – 'xsl' or 'htmltemplate'
mime_type – the MIME type string for this output style, e.g. application/xml or text/html
embeddable – 1 or 0, this indicates whether the output should be embedded in the Urchin query screen, or served as a full document. 0 means serve as a full document.
base_rss_ver – this should always be '1.0'

For information on generating custom output styles, see CustomOutputs.html.
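
As a rough illustration of the manual insert, the sketch below adds one output style row. It uses Python with an in-memory SQLite database as a stand-in for the real Urchin database; the column types, the format_id value, the title, the description and the template URL are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real Urchin database
conn.execute("""
    CREATE TABLE output_format (      -- column types are assumptions
        format_id     INTEGER PRIMARY KEY,
        title         TEXT,
        description   TEXT,
        template_url  TEXT,
        template_type TEXT,
        mime_type     TEXT,
        embeddable    INTEGER,
        base_rss_ver  TEXT
    )
""")

style = {
    "format_id": 100,                       # hypothetical id
    "title": "headlines",                   # used as fmt=headlines
    "description": "Plain HTML list of headlines",
    "template_url": "http://www.example.org/templates/headlines.xsl",
    "template_type": "xsl",                 # 'xsl' or 'htmltemplate'
    "mime_type": "text/html",
    "embeddable": 0,                        # 0 = serve as a full document
    "base_rss_ver": "1.0",                  # should always be '1.0'
}
conn.execute(
    "INSERT INTO output_format (format_id, title, description, template_url,"
    " template_type, mime_type, embeddable, base_rss_ver)"
    " VALUES (:format_id, :title, :description, :template_url,"
    " :template_type, :mime_type, :embeddable, :base_rss_ver)",
    style,
)
conn.commit()

stored = conn.execute(
    "SELECT title, template_type, base_rss_ver FROM output_format"
).fetchall()
print(stored)
```

Against a real installation the same INSERT statement would be run with whatever database client is in use there.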

4. Urchin query syntax

The Urchin query syntax is made up of the following components:

Search keywords – plain old words to search for
Search regular expressions – these are marked with a preceding ~
e.g. ~bio.*?ogy, ~hot|heat
Functional keywords – keywords, in uppercase, that provide special functionality
e.g. the boolean operators AND, NOT and OR, and search restrictors like NEW and AGGREGATE.
A full list of functional keywords is given below.
Search field indicators – these specify which database fields to search. If no field is given, item title and description are assumed.
e.g. author_name:cowboyneal finds items written by 'cowboyneal', c.link:~www\.nature\.com finds items that are from a channel whose link matches the regular expression www\.nature\.com
A full list of built-in search fields is given below.

In addition, the search field indicator can be any predicate used in an RSS 1.0 source document where the subject of the triple is either an item or a channel. See below for more details on this.
Grouping markers – parentheses, square brackets and single quotes allow grouping of search terms, operator precedence and delimiting of search phrases
e.g. (nuclear OR fusion) AND NOT dna
e.g. author_name:[cowboyneal OR simoniker]
e.g. author_name:'George Martin'


5. Adding named filters and aggregations

Predefined filters can be stored in the database, and added to a query using the functional keyword FILTER. Currently the filter must be added manually to the output_filter table, as follows:

filter_id – primary key, the id of this named filter
filter_string – the Urchin query syntax string
title – the name of this filter
description – a short description of the filter's purpose

Aggregations of channels can also be defined, and can be used to narrow the search to a subset of the channels in the database using the AGGREGATE functional keyword. Currently, these have to be manually defined in the aggregate and channel_aggregate tables as follows:

aggregate.aggregate_id – primary key, the id of this aggregate
aggregate.title – the name of the aggregate
aggregate.description – an optional piece of text describing the channels included in the aggregate
aggregate.inserted_on – the time the aggregate name was added
aggregate.inserted_by – the user_register.user_id of the user who added the aggregate
aggregate.updated_on – the time the aggregate was last changed
aggregate.updated_by – the user_register.user_id of the user who last changed the aggregate
channel_aggregate.channel_id and
channel_aggregate.aggregate_id – a list of paired foreign key ids to the channel and aggregate tables, associating a channel with an aggregate
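
The rows described above could be inserted along the following lines. This is a sketch in Python against an in-memory SQLite database standing in for the real Urchin database; the column types, ids, names, timestamp and channel ids are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in; column types below are assumptions
conn.executescript("""
    CREATE TABLE output_filter (
        filter_id INTEGER PRIMARY KEY, filter_string TEXT,
        title TEXT, description TEXT);
    CREATE TABLE aggregate (
        aggregate_id INTEGER PRIMARY KEY, title TEXT, description TEXT,
        inserted_on TEXT, inserted_by INTEGER,
        updated_on TEXT, updated_by INTEGER);
    CREATE TABLE channel_aggregate (channel_id INTEGER, aggregate_id INTEGER);
""")

# A named filter holding an Urchin query string.
conn.execute(
    "INSERT INTO output_filter (filter_id, filter_string, title, description)"
    " VALUES (?, ?, ?, ?)",
    (1, "(nuclear OR fusion) AND NOT dna", "physicsnews", "Nuclear physics items"),
)

# A named aggregate, added by user 1 (hypothetical user_register.user_id).
now = "2004-01-01 00:00:00"  # hypothetical timestamp
conn.execute(
    "INSERT INTO aggregate (aggregate_id, title, description,"
    " inserted_on, inserted_by, updated_on, updated_by)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "sciencefeeds", "Science news channels", now, 1, now, 1),
)

# Associate two channels (hypothetical channel ids) with the aggregate.
conn.executemany(
    "INSERT INTO channel_aggregate (channel_id, aggregate_id) VALUES (?, ?)",
    [(10, 1), (11, 1)],
)
conn.commit()

members = [row[0] for row in conn.execute(
    "SELECT channel_id FROM channel_aggregate WHERE aggregate_id = 1"
    " ORDER BY channel_id")]
print(members)
```

With rows like these in place, a query could refer to them as FILTER physicsnews or AGGREGATE sciencefeeds.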

6. Adding composite outputs

Composite outputs are labels associating a particular stored search with a particular custom output style. Currently, the information on how to generate a composite output must be manually inserted into the database. To fully specify a composite output, the following information should be in the output table:

output_id – primary key, id for this composite output
title – the label for this composite output
format_id – a foreign key to the output_format table, indicating the output style
filter_id – a foreign key to the output_filter table, indicating the filter to use
aggregate_id – a foreign key to the aggregate table, indicating the aggregate on which to apply the filter
description – an optional description, explaining this output
inserted_by – the user_register.user_id of the user who inserted this output
inserted_on – the time the output was inserted
updated_by – the user_register.user_id of the user who last updated this output
updated_on – the time the output was last updated

7. Urchin query string fields

The Urchin query string has the following keys:

q – The search string, URI encoded
fmt – A label for the output type requested, e.g. rss10 for RSS 1.0, rss091 for RSS 0.91, or any other type that has been specified in the output_format table
out – A label for the composite output type – any name that's been specified in the output table
max – The maximum number of items to return. A blank value indicates that the number is unlimited. This is automatically set to 15 if RSS 0.91 output is requested.
ord – How to sort the items returned by the query. Possible values are date (order by item publication date), rand (order at random), raw (as the items are ordered in the Urchin database) or title (alphabetical order by item title). If omitted, the default is to order by publication date.

Note that this ordering is done before the results list is cropped to the maximum number of items. So, for example, ord=date&max=15 will return the 15 most recently published items matching a particular query, whereas ord=rand&max=15 will return 15 random items that match the query.
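
The effect of ordering before cropping can be sketched as follows. This Python function only models the behaviour described above; it is not Urchin code, and the function name order_and_crop, the item layout and the sample data are assumptions.

```python
import random
from datetime import date


def order_and_crop(items, order="date", max_items=None):
    """Sort all matching items first, then keep at most max_items of them."""
    if order == "date":
        ordered = sorted(items, key=lambda item: item["date"], reverse=True)
    elif order == "title":
        ordered = sorted(items, key=lambda item: item["title"])
    elif order == "rand":
        ordered = random.sample(items, len(items))
    elif order == "raw":
        ordered = list(items)  # keep the stored order
    else:
        raise ValueError(f"unknown ord value: {order!r}")
    return ordered if max_items is None else ordered[:max_items]


sample = [
    {"title": "Beta", "date": date(2004, 1, 2)},
    {"title": "Alpha", "date": date(2004, 1, 3)},
    {"title": "Gamma", "date": date(2004, 1, 1)},
]
# Equivalent of ord=date&max=2: the two most recently published items.
print([item["title"] for item in order_and_crop(sample, "date", 2)])
```

Cropping first and sorting afterwards would instead return whichever items happened to come first in storage order, which is why the order of the two steps matters.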

For example, a request for an RSS 1.0 feed, with a maximum of 50 items that include the word 'nasa' in the title would look like:

urchin?q=title%3Anasa&fmt=rss10&max=50
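
A client can build such a request with any URL-encoding library. The following Python sketch uses the standard urllib.parse module; the function name urchin_url and its parameter names are illustrative, and the relative base "urchin" is simply the script name used in the example above.

```python
from urllib.parse import quote, urlencode


def urchin_url(q, fmt=None, out=None, max_items=None, order=None, base="urchin"):
    """Build an Urchin query URL from the keys q, fmt, out, max and ord."""
    params = {"q": q}
    if fmt is not None:
        params["fmt"] = fmt
    if out is not None:
        params["out"] = out
    if max_items is not None:
        params["max"] = max_items
    if order is not None:
        params["ord"] = order
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base + "?" + urlencode(params, quote_via=quote)


print(urchin_url("title:nasa", fmt="rss10", max_items=50))
print(urchin_url("(nuclear OR fusion) AND NOT dna", fmt="rss091", order="date"))
```

The first call reproduces the request shown above; the second shows that parentheses and spaces in a grouped query are percent-encoded as well.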

8. Other admin database tables

In addition to the functionality offered in version 0.92, the Urchin database has a number of tables for planned future functionality. Their purpose is briefly described here:

user_register

This is populated with one admin user by the seed_date.sql script. It will be used in future to hold details of other users with other privileges.

user_group_access

This will be used to associate users with user groups that have certain permissions.

group_access

This will be used to list different user groups.

output

Planned development includes the ability to label a particular combination of named filter and named output_format. This will allow queries like:

urchin?out=biologynews

Where biologynews is a label defined in this table to mean 'do a particular search, and present the output using a particular custom output style'.

9. Urchin functional keywords

RECENT
...all items published in the last 3 days.

OLD
...all items published at least 6 months ago.

ENGLISH
...all items marked English, British English, American English, etc.

CURRENT
...all items present on last channel refresh.

NEW
...all items inserted on last channel refresh.

TLD
...all items displaying a title, link and description.

NEWCHANNEL
...all items from channels added in the last 3 days.

ALL
...all items.

NOT, AND, OR
...boolean operators

FILTER filtername
...recall the stored filter filtername and incorporate it into the current query.

AGGREGATE aggregatename
...recall the stored aggregate aggregatename and limit the current query to only those channels.

10. Urchin search fields

title: The item's title
description: The item's description
link: The item's link
author_name: The item's author's name, as given by a dc:creator element
author_email: The item's author's email address, as given by the rss20:author element
publication_date: The publication date of the item, as given by the rss091:pubDate element
language: The language of the item, e.g., en-gb, inherited from its parent channel. The value is drawn from dc:language or rss091:language elements
channel_id: The channel_id, in the Urchin database, of the item's parent channel
current_ind: Indicates whether the item was present in a feed the last time that feed was checked. Has a value of 1 or 0 - the CURRENT keyword means that all result items must have current_ind:1
new_ind: Indicates whether the item was new in a feed the last time that feed was checked. Has a value of 1 or 0 - the NEW keyword means that all result items must have new_ind:1
comment_url: The value of the rss20:comments or annotate:reference element


c.title: The title of the parent channel of an item
c.description: The parent channel's description
c.link: The parent channel's link
c.source_url: The URL of the resource from which the item was extracted
c.language: The language of the item's channel, as specified by dc:language or rss091:language elements in the source RSS feed
c.last_updated_on: The date the item's channel was last updated in the Urchin database
c.last_build_date: The value of a channel level dc:date element in the source RSS feed
c.generator: The value of the rss20:generator element
c.rating: The value of the rss091:rating element
c.copyright: The value of the rss091:copyright element
c.managing_editor: The value of the rss091:managingEditor element
c.webmaster: The value of the rss091:webmaster element
c.docs_url: The value of the rss20:docs element

11. Simple extensible search

In addition to storing the core RSS data items, the Urchin database stores all extra RDF-modeled data in imported RSS 1.0 feeds. From version 0.92, the Urchin search syntax has been extended to allow simple searching of this data using arbitrary search field restrictions. For example, if an imported RSS 1.0 feed contained the following data:

<rdf:RDF
  xmlns="http://purl.org/rss/1.0/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:reqv="http://purl.org/rss/1.0/modules/richequiv/"
>
...
<item rdf:about="http://www.example.com/item1">
  <title>Searching arbitrary data fields</title>
  <link>http://www.example.com/item1</link>
  <description>A short tutorial on how to search
  arbitrary data fields in Urchin.</description>
  <dc:creator>Ben Lund</dc:creator>
  <dc:contributor>Martin Flack</dc:contributor>
  <reqv:description rdf:parseType="Literal"
    xmlns="http://www.w3.org/1999/xhtml">
    <p>A short tutorial on how to search arbitrary
    data fields in
    <a href="http://urchin.sourceforge.net/">
    Urchin</a>.</p>
  </reqv:description>
</item>
...
</rdf:RDF>

The core RSS data fields (title, link, description) would be searchable using the built in field restrictors. For example, the searches

title:arbitrary
link:='http://www.example.com/item1'
description:~data\sfields?

would all include this item in the output.

In addition, RDF predicates can be used as search field restrictors. For example, the searches

<dc:contributor>:='Martin Flack'
<reqv:description>:Urchin

would also match this item. The syntax of these simple RDF queries is:

<namespace_abbreviation:predicate_name>:pattern

The search will match any RDF triples whose subject is an item or a channel and whose value matches pattern. As with other Urchin queries, pattern can be an exact match, a keyword, or a regular expression.

There are a few restrictions on the use of this syntax – the namespace abbreviation must be mapped to a namespace URI in the Urchin database, and multiple arbitrary search restrictors cannot be combined using the AND operator, although the use of the OR operator will work.
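
The structure of such a restrictor can be illustrated with a small parser. This Python sketch is not Urchin's parser; the function name parse_restrictor is hypothetical, the reading of a leading = as an exact match and a leading ~ as a regular expression is inferred from the examples in this document, and the namespace table holds only the two prefixes from the sample feed above.

```python
import re

# Assumption: stands in for the prefix-to-URI mapping held in the Urchin database.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "reqv": "http://purl.org/rss/1.0/modules/richequiv/",
}

RESTRICTOR = re.compile(r"^<([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)>:(.+)$")


def parse_restrictor(term):
    """Split '<prefix:predicate>:pattern' into its parts and classify the pattern."""
    match = RESTRICTOR.match(term)
    if not match:
        raise ValueError(f"not an RDF search restrictor: {term!r}")
    prefix, predicate, pattern = match.groups()
    if prefix not in NAMESPACES:
        raise ValueError(f"namespace abbreviation {prefix!r} is not mapped to a URI")
    if pattern.startswith("="):
        kind, value = "exact", pattern[1:].strip("'")
    elif pattern.startswith("~"):
        kind, value = "regex", pattern[1:]
    else:
        kind, value = "keyword", pattern.strip("'")
    return {
        "predicate_uri": NAMESPACES[prefix] + predicate,
        "kind": kind,
        "value": value,
    }


print(parse_restrictor("<dc:contributor>:='Martin Flack'"))
print(parse_restrictor("<reqv:description>:Urchin"))
```

An unmapped prefix is rejected, mirroring the restriction that the abbreviation must already be known to the database.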

12. RDF search

The Urchin database can be treated as an RDF triple store that can be queried using the RDF::Core::Query language (hereafter referred to as RCQL). For details of the language syntax, see that module's page on CPAN; here we will describe how to use this language with Urchin.

There are two modes of using RCQL – to generate tabular data output, or to construct an RSS feed (which could then be used to generate other output styles that are XSL transformations of RSS 1.0 feeds). The RCQL query option for the default Urchin CGI script (urchin?cmd=rcql) offers both these options. The default behaviour is to output tabular data either as a CSV file, or as a general results page. If the option for RSS 1.0 output is selected, the query must generate a list of Urchin channel and item IDs. This is done as follows:

Select ?item->urchin:channel_id, ?item->urchin:item_id
From some graph query involving ?item
Where some conditions

For example, the following query generates a list of channel and item ids for items that were written by a foaf:Person called 'James Bond'.

Select ?item->urchin:channel_id, ?item->urchin:item_id
From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z}
Where ?z = 'James Bond'

Note that, as with the simple extensible search above, the namespace abbreviations are pre-set – any used in a query must match an abbreviation mapped to a URI in the Urchin database.

The Urchin distribution includes a command line tool, rdfq, for querying the Urchin database using RCQL. Its usage is as follows:

rdfq [output options] [-v] [-d] [Query]
  Output Options
   -o | --output	'=s' (where s = 'text'|'csv'|'html'|'rss')
   -t | --text		tab-delimited text
   -c | --csv		CSV text
   -h | --html		HTML output
   -r | --rss		RSS 1.0 output
        For this option the query must have
        a Select clause of
        ?item->urchin:channel_id,
        ?item->urchin:item_id

   -v | --verbose	Verbose output
   -d | --debug		Switch on debugging

  The RCQL query is taken either from the command line,
  or from STDIN.

RCQL can also be used instead of the Urchin search syntax in the standard RSS filtering query string. In this case, the query must be prefixed by RCQL:. In this mode of operation the query is solely generating RSS data – either outputting an RSS feed or one of the other pre-defined or customised output formats. Therefore, the Select clause can be omitted from the query and Urchin will prepend a clause selecting the urchin:channel_id and urchin:item_id. Urchin analyses the rest of the query to determine which RCQL variable to use in the Select clause. If there is a ?item, ?i, or ?x in the query, that is assumed to be the item variable; otherwise the first variable found is used in the Select clause.

For example, the following query is equivalent to the full RCQL query above.

RCQL:From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z} Where ?z = 'James Bond'
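
The rule for choosing the item variable can be sketched as follows. This Python function models the behaviour described above and is not Urchin's own code; the function name complete_rcql is hypothetical, and the order of preference among ?item, ?i and ?x when more than one is present is an assumption.

```python
import re


def complete_rcql(query):
    """Prepend the channel_id/item_id Select clause to a query that lacks one."""
    body = query[len("RCQL:"):] if query.startswith("RCQL:") else query
    body = body.strip()
    if re.match(r"select\b", body, re.IGNORECASE):
        return body  # a Select clause is already present
    # Naive variable scan; it would also pick up a '?' inside a quoted literal.
    variables = re.findall(r"\?[A-Za-z_]\w*", body)
    if not variables:
        raise ValueError("query contains no variables")
    for preferred in ("?item", "?i", "?x"):
        if preferred in variables:
            item_var = preferred
            break
    else:
        item_var = variables[0]  # fall back to the first variable found
    return (f"Select {item_var}->urchin:channel_id, "
            f"{item_var}->urchin:item_id {body}")


print(complete_rcql(
    "RCQL:From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z} "
    "Where ?z = 'James Bond'"
))
```

For a query that uses none of the three preferred names, such as From ?a->dc:title{?b}, the sketch selects ?a because it is the first variable found.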

In addition, Urchin offers a keyword and regular expression matching extension to the RDF::Core::Query language. Literal strings can be prepended with LIKE:, with a % on either side of the keyword to do a keyword match, or prepended with RLIKE: to do a regular expression match. For example:

Select ?item->urchin:channel_id, ?item->urchin:item_id
From ?item->dc:category{?y}
Where ?y = 'LIKE:%cancer%'


This query would produce a list of items that have a Dublin Core category field that includes the word 'cancer'.

Similarly:

Select ?item->urchin:channel_id, ?item->urchin:item_id
From ?item->ag:timestamp=>'RLIKE:.+T(0[0-9])'


This query would produce a list of items that were collected by Urchin between 00:00 and 09:59 on any day.
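
The intended semantics of the LIKE: and RLIKE: prefixes can be approximated as follows. This Python sketch is an illustration only and does not reflect how Urchin evaluates these literals; the function name literal_matches is hypothetical, and treating % as "any run of characters", matching case-sensitively, and assuming ISO 8601 style timestamps are all assumptions.

```python
import re


def literal_matches(literal, value):
    """Test value against a plain, LIKE:-prefixed or RLIKE:-prefixed literal."""
    if literal.startswith("RLIKE:"):
        # Regular expression match anywhere in the value.
        return re.search(literal[len("RLIKE:"):], value) is not None
    if literal.startswith("LIKE:"):
        # Translate % wildcards into '.*' and require the whole value to match.
        parts = literal[len("LIKE:"):].split("%")
        regex = ".*".join(re.escape(part) for part in parts)
        return re.fullmatch(regex, value, re.DOTALL) is not None
    return literal == value  # no prefix: exact comparison


print(literal_matches("LIKE:%cancer%", "breast cancer research"))
print(literal_matches("RLIKE:.+T(0[0-9])", "2004-03-01T07:15:00Z"))
print(literal_matches("RLIKE:.+T(0[0-9])", "2004-03-01T14:15:00Z"))
```

In the timestamp pattern, the T separates date from time, so T followed by 0 and one more digit restricts the hour to 00 through 09.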