Using Urchin - Notes for webmasters

1. Adding and updating RSS feeds

The urchinadm script can be used to add or update feeds, and to refresh all the feeds in the database:

urchinadm [-X] command
   add       Add/Update channels to the database. Reads one
             URL per line from stdin or a list given as
             arguments.
   remove    Remove channels in the database. Reads one URL
             or channel_id per line from stdin.
   list      List all channel source URL's in the database.
             Writes one URL per line to stdout.
   id [-q]   Look up the identification numbers for a given URL
             (channel or item) and display in query form with title.
   refresh   Update all channels in the database.
   rebuild   Save the current list of source URL's, wipe the
             database, and reimport all channels.
             (Interactive!)
   wipe      Delete all data in the database, including
             channel metadata. (Interactive!)
   clrcache  Clear the downloaded files cache (i.e. force
             later reload of all documents from network).
   docgrep   Search the cached documents for a given string
             or regex.
   xdomgrep  Search the cached documents for a given XML tagname
             and optionally an attribute of that tag.
   xpathgrep Search the cached documents for a given XPath.
   -X        Turns on debugging output.


e.g.
$ perl urchinadm add < feeds.txt
$ perl urchinadm add http://www.nature.com/nsu/rss.rdf \
       http://slashdot.org/slashdot.rss

2. Adding and updating other data sources

The Urchin source code includes two modules for extracting RSS-like data from non-RSS sources.

The Urchin::Import::Scrape module constructs an XML::RSS object when given the URL of a document to scrape and some Perl regular expressions to extract the relevant parts of the document. The Urchin::Import::DBI module can construct an XML::RSS object when given a DBI connection and a SQL statement for extracting the relevant data. Currently, this functionality is not incorporated into the urchinadm automated import and refresh script, so you must write separate scripts, using these modules, to create or import feeds for non-RSS sources. It is important to note that the channel.import_format_id for such sources must be set to a non-RSS value, so that the automated refresh process will skip over them.

Using Urchin::Import::Scrape

The match() subroutine should be passed a hash reference of the following form:

'workarea' => $workarea_regex,
'channel' => $channelarea_regex,
'channel_title' => $channeltitle_regex,
'channel_link' => $channellink_regex,
'channel_description' => $channeldescription_regex,
'channel_last_build_date' => $channellastbuild_regex,
'copyright' => $copyright_regex,
'language' => $language_regex,
'itemarea' => $itemarea_regex,
'item_title' => $itemtitle_regex,
'item_link' => $itemlink_regex,
'item_description' => $itemdescription_regex,
'author_name' => $authorname_regex

The workarea regular expression is evaluated first, and all subsequent regular expressions are evaluated against the matched area. For items, the itemarea regular expression is evaluated as a global match: the subsequent item regular expressions are evaluated in turn against each of the matched areas. Note that every regular expression should have exactly one set of tagging (capturing) parentheses. It is the content of that tagged portion that is returned as the result of the match.
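
The evaluation order described above can be illustrated outside Perl. The following Python sketch is not the Urchin::Import::Scrape implementation; it only mimics the described order (workarea first, then a global itemarea match, then the item expressions against each matched area). The sample HTML, the function name scrape and the patterns are all hypothetical.

```python
import re


def scrape(document, patterns):
    """Apply the regexes in the order described above (illustrative only)."""

    def first_group(regex, text):
        # Every regex carries one capturing group; return its content or None.
        match = re.search(regex, text, re.DOTALL)
        return match.group(1) if match else None

    # 1. Narrow the document down to the work area.
    work = first_group(patterns["workarea"], document) or ""

    # 2. Global match: each itemarea hit becomes one item.
    items = []
    for area in re.findall(patterns["itemarea"], work, re.DOTALL):
        # 3. The item regexes only ever see their own matched area.
        items.append({
            "title": first_group(patterns["item_title"], area),
            "link": first_group(patterns["item_link"], area),
        })
    return items


# Hypothetical page and patterns, for illustration only.
html = """<html><body><div id="news">
<li><a href="http://www.example.org/a">First story</a></li>
<li><a href="http://www.example.org/b">Second story</a></li>
</div></body></html>"""

patterns = {
    "workarea": r'<div id="news">(.*?)</div>',
    "itemarea": r"<li>(.*?)</li>",
    "item_title": r"<a [^>]*>(.*?)</a>",
    "item_link": r'href="([^"]+)"',
}

print(scrape(html, patterns))
```

Keeping a single capturing group per expression is what makes the result of each step unambiguous: the captured text is either the area passed to the next step or the final field value.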

Using Urchin::Import::DBI

This module uses the Urchin::OutputFeed module to construct an XML::RSS object. The trick is to write your SQL statement so that Urchin::OutputFeed thinks it is seeing data from the Urchin database. Urchin::OutputFeed expects a result set with field names of the form tablename__fieldname, where tablename and fieldname correspond to tables and their fields in the Urchin database. So, in your SQL query, you should assign aliases of that form to the columns of the result set. For example, in a hypothetical article database:

select articles.title as item__title,
  articles.author as item__author_name,
  CONCAT('http://www.example.org/articles/',
  articles.issueid, '/', articles.articleid)
  as item__link,
  CONCAT('Publication Issue ', articles.issueid,
  ' Page ', articles.startpage) as item__description
  from ...
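
To make the tablename__fieldname convention concrete, here is a small Python sketch that splits aliased column names back into a table and a field. It is only an illustration of the naming scheme, not a description of how Urchin::OutputFeed processes its input; the function name group_by_table and the sample row are hypothetical.

```python
def group_by_table(row):
    """Group a result row keyed by 'table__field' aliases into per-table dicts."""
    grouped = {}
    for column, value in row.items():
        # Split on the first double underscore: 'item__author_name'
        # becomes table 'item' and field 'author_name'.
        table, _, field = column.partition("__")
        grouped.setdefault(table, {})[field] = value
    return grouped


# Hypothetical row, shaped like the result of the aliased query above.
row = {
    "item__title": "An example article",
    "item__author_name": "A. N. Author",
    "item__link": "http://www.example.org/articles/12/345",
    "item__description": "Publication Issue 12 Page 7",
}

print(group_by_table(row))
```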

3. Adding custom output styles

Custom outputs can be generated either by an XSL transformation of an Urchin RSS 1.0 output document, or using an HTML::Template template. Currently (version 0.92), the information on how to generate a custom output style must be manually inserted into the database. To fully specify an output style, the following information should be in the output_format table:

`format_id`	– primary key, id for this custom output style
`title`	– the output style label that will be used in the `fmt=` part of a query string (see below)
`description`	– a short description of this output style
`template_url`	– location of the `HTML::Template` or XSL file
`template_type`	– 'xsl' or 'htmltemplate'
`mime_type`	– the mime_type string for this output style, e.g. `application/xml` or `text/html`
`embeddable`	– 1 or 0, this indicates whether the output should be embedded in the Urchin query screen, or served as a full document. 0 means serve as a full document.
`base_rss_ver`	– this should always be '1.0'
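
As a rough sketch of the manual step, the following Python helper builds a parameterised INSERT for one output_format row and checks the two constrained fields first. The column names come from the list above; the helper name, the `?` placeholder style and all sample values are assumptions, and the statement has not been checked against a real Urchin schema.

```python
ALLOWED_TEMPLATE_TYPES = {"xsl", "htmltemplate"}

COLUMNS = [
    "format_id", "title", "description", "template_url",
    "template_type", "mime_type", "embeddable", "base_rss_ver",
]


def output_format_insert(row):
    """Return (sql, params) for one output_format row (illustrative only)."""
    if row["template_type"] not in ALLOWED_TEMPLATE_TYPES:
        raise ValueError("template_type must be 'xsl' or 'htmltemplate'")
    if row["embeddable"] not in (0, 1):
        raise ValueError("embeddable must be 1 or 0")
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = (
        f"INSERT INTO output_format ({', '.join(COLUMNS)}) "
        f"VALUES ({placeholders})"
    )
    return sql, [row[column] for column in COLUMNS]


# Hypothetical output style; every value here is made up.
example_row = {
    "format_id": 100,
    "title": "headlines",
    "description": "Plain HTML list of headlines",
    "template_url": "http://www.example.org/templates/headlines.xsl",
    "template_type": "xsl",
    "mime_type": "text/html",
    "embeddable": 0,          # 0 means serve as a full document
    "base_rss_ver": "1.0",    # should always be '1.0'
}

sql, params = output_format_insert(example_row)
print(sql)
print(params)
```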

For information on generating custom output styles, see CustomOutputs.html.

4. Urchin query syntax

The Urchin query syntax is made up of the following components:

Search keywords	– plain old words to search for
Search regular expressions	– these are marked with a preceding `~` e.g. `~bio.*?ogy`, `~hot\|heat`
Functional keywords	– keywords, in uppercase, that provide special functionality e.g. the boolean operators `AND`, `NOT`, `OR`, search restrictors like `NEW` and `AGGREGATE`. A full list of functional keywords is given below.
Search field indicators	– these specify which database fields to search. If no field is given, item title and description are assumed. For example, `author_name:cowboyneal` finds items written by 'cowboyneal', and `c.link:~www\.nature\.com` finds items from a channel whose link matches the regular expression `www\.nature\.com`. A full list of built-in search fields is given below. In addition, the search field indicator can be any predicate used in an RSS 1.0 source document where the subject of the triple is either an item or a channel. See below for more details on this.
Grouping markers	– parentheses, square brackets and single quotes allow grouping of search terms, operator precedence and delimiting of search phrases, e.g. `(nuclear OR fusion) AND NOT dna`, `author_name:[cowboyneal OR simoniker]`, `author_name:'George Martin'`

5. Adding named filters and aggregations

Predefined filters can be stored in the database, and added to a query using the functional keyword FILTER. Currently the filter must be added manually to the output_filter table, as follows:

`filter_id`	– primary key, the id of this named filter
`filter_string`	– the Urchin query syntax string
`title`	– the name of this filter
`description`	– a short description of the filter's purpose

Aggregations of channels can also be defined, and they can be used to narrow the search to a subset of the channels in the database using the AGGREGATE functional keyword. Currently, these have to be manually defined in the aggregate and channel_aggregate tables as follows:

`aggregate.aggregate_id`	– primary key, the id of this aggregate
`aggregate.title`	– the name of the aggregate
`aggregate.description`	– an optional piece of text describing the channels included in the aggregate
`aggregate.inserted_on`	– the time the aggregate name was added
`aggregate.inserted_by`	– the `user_register.user_id` of the user who added the aggregate
`aggregate.updated_on`	– the time the aggregate was last changed
`aggregate.updated_by`	– the `user_register.user_id` of the user who last changed the aggregate

`channel_aggregate.channel_id`	and
`channel_aggregate.aggregate_id`	– a list of paired foreign key ids to the channel and aggregate tables, associating a channel with an aggregate
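
The relationship between the two tables can be tried out in isolation. The sketch below uses an in-memory SQLite database as a stand-in, with heavily simplified tables: the audit columns (inserted_on, inserted_by, updated_on, updated_by) are left out, the channel table is reduced to an id and a title, and the column types, sample data and the function name channels_in are assumptions. It only illustrates how channel_aggregate pairs channels with an aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE channel (channel_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE aggregate (
    aggregate_id INTEGER PRIMARY KEY, title TEXT, description TEXT
);
CREATE TABLE channel_aggregate (channel_id INTEGER, aggregate_id INTEGER);

INSERT INTO channel VALUES (1, 'Nature Science Update');
INSERT INTO channel VALUES (2, 'Slashdot');
INSERT INTO aggregate VALUES (10, 'sciencenews', 'Science news channels');
-- One row per (channel, aggregate) pair.
INSERT INTO channel_aggregate VALUES (1, 10);
""")


def channels_in(aggregate_title):
    """List the titles of the channels associated with a named aggregate."""
    rows = conn.execute(
        "SELECT c.title FROM channel c "
        "JOIN channel_aggregate ca ON ca.channel_id = c.channel_id "
        "JOIN aggregate a ON a.aggregate_id = ca.aggregate_id "
        "WHERE a.title = ? ORDER BY c.channel_id",
        (aggregate_title,),
    )
    return [title for (title,) in rows]


print(channels_in("sciencenews"))
```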

6. Adding composite outputs

Composite outputs are labels associating a particular stored search with a particular custom output style. Currently, the information on how to generate a composite output must be manually inserted into the database. To fully specify a composite output, the following information should be in the output table:

`output_id`	– primary key, id for this composite output
`title`	– the label for this composite output
`format_id`	– a foreign key to the `output_format` table, indicating the output style
`filter_id`	– a foreign key to the `output_filter` table, indicating the filter to use
`aggregate_id`	– a foreign key to the `aggregate` table, indicating the aggregate on which to apply the filter
`description`	– an optional description, explaining this output
`inserted_by`	– the `user_register.user_id` of the user who inserted this output
`inserted_on`	– the time the output was inserted
`updated_by`	– the `user_register.user_id` of the user who last updated this output
`updated_on`	– the time the output was last updated

7. Urchin query string fields

The urchin query string has the following keys:

`q`	– The search string, URI encoded
`fmt`	– A label for the output type requested, e.g. `rss10` for RSS 1.0, `rss091` for RSS 0.91, or any other type that has been specified in the `output_format` table
`out`	– A label for the composite output type – any name that's been specified in the `output` table
`max`	– The maximum number of items to return. A blank value indicates that the number is unlimited. This is automatically set to 15 if RSS 0.91 output is requested.
`ord`	– How to sort the items returned by the query. Possible values are `date` (order by item publication date), `rand` (order at random), `raw` (as the items are ordered in the Urchin database) or `title` (alphabetical order by item title). If omitted, the default is to order by publication date. Note that this ordering is done before the results list is cropped to the maximum number of items. So, for example, `ord=date&max=15` will return the 15 most recently published items matching a particular query, whereas `ord=rand&max=15` will return 15 random items that match the query.
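
The interaction between `ord` and `max` (sort first, crop second) can be sketched as follows. This is a Python illustration of the described behaviour, not Urchin code; the function name, the item dictionaries and the date format are hypothetical.

```python
import random


def order_and_crop(items, order="date", max_items=None, rng=random):
    """Order the matching items first, then crop to max_items."""
    if order == "date":
        # Most recently published first.
        ordered = sorted(items, key=lambda item: item["date"], reverse=True)
    elif order == "title":
        ordered = sorted(items, key=lambda item: item["title"])
    elif order == "rand":
        ordered = rng.sample(items, len(items))
    else:
        # "raw": keep the order in which the items were stored.
        ordered = list(items)
    # A max_items of None stands for a blank max value (unlimited).
    return ordered if max_items is None else ordered[:max_items]


# Hypothetical matching items.
items = [
    {"title": "Beta", "date": "2004-01-02"},
    {"title": "Alpha", "date": "2004-01-03"},
    {"title": "Gamma", "date": "2004-01-01"},
]

print([i["title"] for i in order_and_crop(items, "date", 2)])
print([i["title"] for i in order_and_crop(items, "title", 2)])
```

Because the crop happens last, the same `max` value returns a different subset of the matches depending on `ord`.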

For example, a request for an RSS 1.0 feed, with a maximum of 50 items that include the word 'nasa' in the title would look like:

urchin?q=title%3Anasa&fmt=rss10&max=50
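
A query string like this can be assembled with any URI-encoding library. As a minimal sketch, the following Python helper uses the standard library; the helper name urchin_url and the relative base path are assumptions.

```python
from urllib.parse import urlencode


def urchin_url(base="urchin", **params):
    """Build an Urchin request URL, URI-encoding each value."""
    return f"{base}?{urlencode(params)}"


print(urchin_url(q="title:nasa", fmt="rss10", max=50))
# prints: urchin?q=title%3Anasa&fmt=rss10&max=50
```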

8. Other admin database tables

In addition to the functionality offered in version 0.92, the Urchin database has a number of tables for planned future functionality. Their purpose is briefly described here:

user_register

This is populated with one admin user by the seed_date.sql script. It will be used in future to hold details of other users with other privileges.

user_group_access

This will be used to associate users with user groups that have certain permissions.

group_access

This will be used to list different user groups.

output

Planned development includes the ability to label a particular combination of named filter and named output_format. This will allow queries like:

urchin?out=biologynews

Here, biologynews is a label defined in this table to mean 'do a particular search, and present the output using a particular custom output style'.

9. Urchin functional keywords

RECENT
...all items published in the last 3 days.

OLD
...all items published at least 6 months ago.

ENGLISH
...all items marked English, British English, American English, etc.

CURRENT
...all items present on last channel refresh.

NEW
...all items inserted on last channel refresh.

TLD
...all items displaying a title, link and description.

NEWCHANNEL
...all items from channels added in the last 3 days.

ALL
...all items.

NOT, AND, OR
...boolean operators

FILTER filtername
...recall the stored filter filtername and incorporate it into the current query.

AGGREGATE aggregatename
...recall the stored aggregate aggregatename and limit the current query to only those channels.

10. Urchin search fields

`title:`	The item's `title`
`description:`	The item's `description`
`link:`	The item's `link`
`author_name:`	The item's author's name, as given by a `dc:creator` element
`author_email:`	The item's author's email address, as given by the `rss20:author` element
`publication_date:`	The publication date of the item, as given by the `rss091:pubDate` element
`language:`	The language of the item, e.g., `en-gb`, inherited from its parent channel. The value is drawn from `dc:language` or `rss091:language` elements
`channel_id:`	The `channel_id`, in the Urchin database, of the item's parent channel
`current_ind:`	Indicates whether the item was present in a feed the last time that feed was checked. Has a value of 1 or 0 - the `CURRENT` keyword means that all result items must have `current_ind:1`
`new_ind:`	Indicates whether the item was new in a feed the last time that feed was checked. Has a value of 1 or 0 - the `NEW` keyword means that all result items must have `new_ind:1`
`comment_url:`	The value of the `rss20:comments` or `annotate:reference` element

`c.title:`	The `title` of the parent channel of an item
`c.description:`	The parent channel's `description`
`c.link:`	The parent channel's `link`
`c.source_url:`	The URL of the resource from which the item was extracted
`c.language:`	The language of the item's channel, as specified by `dc:language` or `rss091:language` elements in the source RSS feed
`c.last_updated_on:`	The date the item's channel was last updated in the Urchin database
`c.last_build_date:`	The value of a channel level `dc:date` element in the source RSS feed
`c.generator:`	The value of the `rss20:generator` element
`c.rating:`	The value of the `rss091:rating` element
`c.copyright:`	The value of the `rss091:copyright` element
`c.managing_editor:`	The value of the `rss091:managingEditor` element
`c.webmaster:`	The value of the `rss091:webmaster` element
`c.docs_url:`	The value of the `rss20:docs` element

11. Simple extensible search

In addition to storing the core RSS data items, the Urchin database stores all extra RDF-modeled data in imported RSS 1.0 feeds. From version 0.92, the Urchin search syntax has been extended to allow simple searching of this data using arbitrary search field restrictions. For example, if an imported RSS 1.0 feed contained the following data:

<rdf:RDF
  xmlns="http://purl.org/rss/1.0/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:reqv="http://purl.org/rss/1.0/modules/richequiv/"
>
...
<item rdf:about="http://www.example.com/item1">
  <title>Searching arbitrary data fields</title>
  <link>http://www.example.com/item1</link>
  <description>A short tutorial on how to search
  arbitrary data fields in Urchin.</description>
  <dc:creator>Ben Lund</dc:creator>
  <dc:contributor>Martin Flack</dc:contributor>
  <reqv:description rdf:parseType="Literal"
    xmlns="http://www.w3.org/1999/xhtml">
    <p>A short tutorial on how to search arbitrary
    data fields in
    <a href="http://urchin.sourceforge.net/">
    Urchin</a>.</p>
  </reqv:description>
</item>
...
</rdf:RDF>

The core RSS data fields (title, link, description) would be searchable using the built in field restrictors. For example, the searches

title:arbitrary
link:='http://www.example.com/item1'
description:~data\sfields?

would all include this item in the output.

In addition, RDF predicates can be used as search field restrictors. For example, the searches

<dc:contributor>:='Martin Flack'
<reqv:description>:Urchin

would also match this item. The syntax of these simple RDF queries is:

<namespace_abbreviation:predicate_name>:pattern

The search will match any RDF triples whose subject is an item or a channel and whose value matches pattern. As with other Urchin queries, pattern can be an exact match, a keyword, or a regular expression.

There are a few restrictions on the use of this syntax – the namespace abbreviation must be mapped to a namespace URI in the Urchin database, and multiple arbitrary search restrictors cannot be combined using the AND operator, although the use of the OR operator will work.
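
The three pattern forms used in the examples above (keyword, exact match with `='...'`, regular expression with `~`) can be generated mechanically. The following Python sketch builds such restrictors; the helper name restrict and its mode argument are hypothetical, and how Urchin treats single quotes inside an exact-match value is not covered by this document, so the sketch does no escaping.

```python
def restrict(field, pattern, mode="keyword"):
    """Build one search field restrictor in the syntax shown above."""
    # RDF predicates (namespace_abbreviation:predicate_name) are wrapped in
    # angle brackets; built-in fields such as title or c.link are not.
    name = f"<{field}>" if ":" in field else field
    if mode == "exact":
        return f"{name}:='{pattern}'"
    if mode == "regex":
        return f"{name}:~{pattern}"
    return f"{name}:{pattern}"


print(restrict("title", "arbitrary"))
print(restrict("description", r"data\sfields?", mode="regex"))
print(restrict("dc:contributor", "Martin Flack", mode="exact"))

# Arbitrary restrictors may be combined with OR, but not with AND.
print(" OR ".join([
    restrict("dc:contributor", "Martin Flack", mode="exact"),
    restrict("reqv:description", "Urchin"),
]))
```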

12. RDF search

The Urchin database can be treated as an RDF triple store that can be queried using the RDF::Core::Query language (hereafter referred to as RCQL). For details of the language syntax, see that module's page on CPAN; here we will describe how to use this language with Urchin.

There are two modes of using RCQL – to generate tabular data output, or to construct an RSS feed (which could then be used to generate other output styles that are XSL transformations of RSS 1.0 feeds). The RCQL query option for the default Urchin CGI script (urchin?cmd=rcql) offers both of these options. The default behaviour is to output tabular data either as a CSV file or as a general results page. If the option for RSS 1.0 output is selected, the query must generate a list of Urchin channel and item IDs. This is done as follows:

Select ?item->urchin:channel_id, ?item->urchin:item_id From some graph query involving ?item Where some conditions

For example, the following query generates a list of channel and item ids for items that were written by a foaf:Person called 'James Bond'.

Select ?item->urchin:channel_id, ?item->urchin:item_id From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z} Where ?z = 'James Bond'

Note that, as with the simple extensible search above, the namespace abbreviations are pre-set – any used in a query must match an abbreviation mapped to a URI in the Urchin database.

The Urchin distribution includes a command line tool, rdfq, for querying the Urchin database using RCQL. Its usage is as follows:

rdfq [output options] [-v] [-d] [Query]
  Output Options
   -o | --output	'=s' (where s = 'text'|'csv'|'html'|'rss')
   -t | --text		tab-delimited text
   -c | --csv		CSV text
   -h | --html		HTML output
   -r | --rss		RSS 1.0 output
        For this option the query must have
        a Select clause of
        ?item->urchin:channel_id,
        ?item->urchin:item_id


   -v | --verbose	Verbose output
   -d | --debug		Switch on debugging

  The RCQL query is taken either from the command line,
  or from STDIN.

RCQL can also be used instead of the Urchin search syntax in the standard RSS filtering query string. In this case, the query must be prefixed by RCQL:. In this mode of operation the query is solely generating RSS data – either outputting an RSS feed or one of the other pre-defined or customised output formats. Therefore, the Select clause can be omitted from the query and Urchin will prepend a clause selecting the urchin:channel_id and urchin:item_id. Urchin analyses the rest of the query to determine which RCQL variable to use in the Select clause. If there is a ?item, ?i, or ?x in the query, that is assumed to be the item variable; otherwise the first variable found is used in the Select clause.

For example, the following query is equivalent to the full RCQL query above.

RCQL:From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z} Where ?z = 'James Bond'
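
The variable-selection rule for a prefixed query can be sketched in Python as follows. This is a reading of the rule as described in this section, not Urchin's actual code; in particular, the order of preference when more than one of ?item, ?i and ?x appears is an assumption, as is the function name prepend_select.

```python
import re


def prepend_select(query):
    """Prepend the channel_id/item_id Select clause to a Select-less query."""
    if query.startswith("RCQL:"):
        query = query[len("RCQL:"):]
    # Collect variable names in order of appearance.
    variables = re.findall(r"\?(\w+)", query)
    if not variables:
        raise ValueError("query contains no variables")
    # Prefer ?item, ?i or ?x (tried in that order, which is an assumption);
    # otherwise fall back to the first variable found.
    preferred = ("item", "i", "x")
    item_var = next((v for v in preferred if v in variables), variables[0])
    return (
        f"Select ?{item_var}->urchin:channel_id, "
        f"?{item_var}->urchin:item_id {query}"
    )


short = ("RCQL:From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z} "
         "Where ?z = 'James Bond'")
print(prepend_select(short))
```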

In addition, Urchin offers a keyword and regular expression matching extension to the RDF::Core::Query language. Literal strings can be prepended with LIKE:, with a % on either side of the keyword to do a keyword match, or prepended with RLIKE: to do a regular expression match. For example:

Select ?item->urchin:channel_id, ?item->urchin:item_id From ?item->dc:category{?y} Where ?y = 'LIKE:%cancer%'

Would produce a list of items that have a Dublin Core category field that includes the word 'cancer'.

and:

Select ?item->urchin:channel_id, ?item->urchin:item_id From ?item->ag:timestamp=>'RLIKE:.+T(0[0-9])'

Would produce a list of items that had been collected by Urchin between 00:00 and 09:59 on any day.
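
The intended effect of the two prefixes can be approximated in a few lines of Python. This is a simplified illustration, not Urchin's implementation, which may well pass these literals through to the database: only the % wildcard is handled, matching is case-sensitive here, and the function name and the sample timestamps (assumed to be in a YYYY-MM-DDThh:mm:ss form) are hypothetical.

```python
import re


def literal_matches(literal, value):
    """Approximate how a LIKE: or RLIKE: literal is matched against a value."""
    if literal.startswith("RLIKE:"):
        # Regular expression match anywhere in the value.
        return re.search(literal[len("RLIKE:"):], value) is not None
    if literal.startswith("LIKE:"):
        # Treat % as "any run of characters"; everything else is literal.
        parts = literal[len("LIKE:"):].split("%")
        regex = ".*".join(re.escape(part) for part in parts)
        return re.fullmatch(regex, value, re.DOTALL) is not None
    # No prefix: plain equality.
    return literal == value


print(literal_matches("LIKE:%cancer%", "Lung cancer research"))
print(literal_matches("RLIKE:.+T(0[0-9])", "2004-05-01T07:30:00"))
print(literal_matches("RLIKE:.+T(0[0-9])", "2004-05-01T17:30:00"))
```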