1. Adding and updating RSS feeds
The urchinadm script can be used to add or update feeds, and to refresh all the feeds in the database:
urchinadm [-X] command
add Add/Update channels to the database. Reads one
URL per line from stdin or a list given as
arguments.
remove Remove channels in the database. Reads one URL
or channel_id per line from stdin.
list List all channel source URL's in the database.
Writes one URL per line to stdout.
id [-q] Look up the identification numbers for a given URL
(channel or item) and display in query form with title.
refresh Update all channels in the database.
rebuild Save the current list of source URL's, wipe the
database, and reimport all channels.
(Interactive!)
wipe Delete all data in the database, including
channel metadata. (Interactive!)
clrcache Clear the downloaded files cache (i.e. force
later reload of all documents from network).
docgrep Search the cached documents for a given string
or regex.
xdomgrep Search the cached documents for a given XML tagname
and optionally an attribute of that tag.
xpathgrep Search the cached documents for a given XPath.
-X Turns on debugging output.
e.g.
$ perl urchinadm add < feeds.txt
$ perl urchinadm add http://www.nature.com/nsu/rss.rdf \
http://slashdot.org/slashdot.rss
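The add command can also be driven from another program. The following is a minimal Python sketch, assuming urchinadm sits in the current directory and is run through perl as in the examples above; the helper names are hypothetical and not part of Urchin.

```python
import subprocess


def build_add_input(urls):
    """Return the stdin payload for `urchinadm add`: one URL per line, duplicates dropped."""
    seen = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.append(url)
    return "".join(u + "\n" for u in seen)


def build_add_command(debug=False):
    """Return the argv list; -X (debugging output) goes before the command name."""
    argv = ["perl", "urchinadm"]
    if debug:
        argv.append("-X")
    argv.append("add")
    return argv


def add_feeds(urls, debug=False):
    """Run `urchinadm add`, feeding the URLs on stdin (assumes urchinadm is in the cwd)."""
    return subprocess.run(
        build_add_command(debug), input=build_add_input(urls), text=True, check=True
    )
```

Calling add_feeds(["http://www.nature.com/nsu/rss.rdf"]) would be equivalent to piping a one-line feeds.txt into the script.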
2. Adding and updating other data sources
The Urchin source code includes two modules for extracting RSS-like data from non-RSS sources.
The Urchin::Import::Scrape module constructs an XML::RSS object when given the URL of a document to scrape and some Perl regular expressions to extract the relevant parts of the document. The Urchin::Import::DBI module can construct an XML::RSS object when given a DBI connection and a SQL statement for extracting the relevant data. Currently, this functionality is not incorporated into the urchinadm automated import and refresh script, so you must write separate scripts, using these modules, to create or import feeds for non-RSS sources. It is important to note that the channel.import_format_id for such sources must be set to a non-RSS value, so that the automated refresh process will skip over them.
Using Urchin::Import::Scrape
The match() subroutine should be passed a hash reference of the following form:
'workarea' => $workarea_regex,
'channel' => $channelarea_regex,
'channel_title' => $channeltitle_regex,
'channel_link' => $channellink_regex,
'channel_description' => $channeldescription_regex,
'channel_last_build_date' => $channellastbuild_regex,
'copyright' => $copyright_regex,
'language' => $language_regex,
'itemarea' => $itemarea_regex,
'item_title' => $itemtitle_regex,
'item_link' => $itemlink_regex,
'item_description' => $itemdescription_regex,
'author_name' => $authorname_regex
The workarea regular expression is evaluated first, and all subsequent regular expressions are evaluated against the matched area. For items, the itemarea regular expression is evaluated as a global match, and the subsequent item regular expressions are evaluated in turn against each of the matched areas. Note that every regular expression should have exactly one set of capturing parentheses; the content of that captured portion is returned as the result of the match.
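The evaluation order described above can be illustrated outside Perl. The following Python sketch emulates it for a subset of the keys; it is not the module's code, and the sample document and patterns are invented for illustration.

```python
import re


def first(pattern, text):
    """Return the single captured group of the first match, or None."""
    m = re.search(pattern, text, re.S)
    return m.group(1) if m else None


def scrape(document, patterns):
    """Apply workarea first, then every other pattern against the matched area."""
    workarea = first(patterns["workarea"], document)
    if workarea is None:
        return None
    channel = {"title": first(patterns["channel_title"], workarea)}
    items = []
    # itemarea is a global match: one captured area per item.
    for area in re.findall(patterns["itemarea"], workarea, re.S):
        items.append({
            "title": first(patterns["item_title"], area),
            "link": first(patterns["item_link"], area),
        })
    return {"channel": channel, "items": items}


html = """<html><body><div id="news"><h1>Lab News</h1>
<li><a href="http://www.example.org/a1">First story</a></li>
<li><a href="http://www.example.org/a2">Second story</a></li>
</div><div id="footer"><li><a href="/about">About</a></li></div></body></html>"""

patterns = {
    "workarea": r'<div id="news">(.*?)</div>',
    "channel_title": r"<h1>(.*?)</h1>",
    "itemarea": r"<li>(.*?)</li>",
    "item_title": r">([^<]+)</a>",
    "item_link": r'href="([^"]+)"',
}

result = scrape(html, patterns)
```

Because the item patterns only ever see text inside the work area, the link in the footer is never reported as an item.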
Using Urchin::Import::DBI
This module uses the Urchin::OutputFeed module to construct an XML::RSS object. The trick is to write your SQL statement so that Urchin::OutputFeed thinks it's seeing data from the Urchin database. Urchin::OutputFeed expects a result set with field names of the form tablename__fieldname, where tablename and fieldname correspond to tables and their fields in the Urchin database. So, in your SQL query, you should assign aliases of that form to the columns of the result set. For example, in a hypothetical article database:
select articles.title as item__title,
articles.author as item__author_name,
CONCAT('http://www.example.org/articles/',
articles.issueid, '/', articles.articleid)
as item__link,
CONCAT('Publication Issue ', articles.issueid,
' Page ', articles.startpage) as item__description
from ...
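To make the tablename__fieldname convention concrete, the following Python sketch groups one result row by table name. It only illustrates the naming scheme; how Urchin::OutputFeed consumes the result set internally is not shown here, and the row values are invented.

```python
def group_by_table(row):
    """Split tablename__fieldname keys into a {table: {field: value}} mapping."""
    grouped = {}
    for column, value in row.items():
        table, sep, field = column.partition("__")
        if not sep or not table or not field:
            raise ValueError(f"column {column!r} is not of the form tablename__fieldname")
        grouped.setdefault(table, {})[field] = value
    return grouped


# One row as the aliased query above might return it (hypothetical values).
row = {
    "item__title": "A hypothetical article",
    "item__author_name": "A. N. Author",
    "item__link": "http://www.example.org/articles/12/345",
    "item__description": "Publication Issue 12 Page 101",
}

grouped = group_by_table(row)
```

A column without the double underscore raises an error, which is a cheap way to catch a forgotten alias before the row reaches the feed builder.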
3. Adding custom output styles
Custom outputs can be generated either by an XSL transformation of an Urchin RSS 1.0 output document, or using an HTML::Template template. Currently (version 0.92), the information on how to generate a custom output style must be manually inserted into the database. To fully specify an output style, the following information should be in the output_format table:
format_id | – primary key, id for this custom output style |
title | – the output style label that will be used in the fmt= part of a query string (see below) |
description | – a short description of this output style |
template_url | – location of the HTML::Template or XSL file |
template_type | – 'xsl' or 'htmltemplate' |
mime_type | – the mime_type string for this output style, e.g. application/xml or text/html |
embeddable | – 1 or 0, this indicates whether the output should be embedded in the Urchin query screen, or served as a full document. 0 means serve as a full document. |
base_rss_ver | – this should always be '1.0' |
For information on generating custom output styles, see CustomOutputs.html.
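As an illustration of the manual insert described above, the following Python sketch registers one output style in an in-memory SQLite table. The column types, the use of SQLite, and the row values (including the template URL) are assumptions; only the column names come from the table above, and a real installation would run the equivalent INSERT against the Urchin database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names follow the output_format description; the types are assumed.
conn.execute("""
    CREATE TABLE output_format (
        format_id     INTEGER PRIMARY KEY,
        title         TEXT NOT NULL,
        description   TEXT,
        template_url  TEXT,
        template_type TEXT,
        mime_type     TEXT,
        embeddable    INTEGER,
        base_rss_ver  TEXT
    )
""")

# A hypothetical XSL-based style served as a full HTML document.
conn.execute(
    "INSERT INTO output_format "
    "(title, description, template_url, template_type, mime_type, embeddable, base_rss_ver) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("headlines", "Plain HTML list of headlines",
     "http://www.example.org/templates/headlines.xsl", "xsl", "text/html", 0, "1.0"),
)

row = conn.execute(
    "SELECT title, template_type, base_rss_ver FROM output_format WHERE title = ?",
    ("headlines",),
).fetchone()
```

With such a row in place, the style would be requested with fmt=headlines in a query string.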
4. Urchin query syntax
The Urchin query syntax is made up of the following components:
Search keywords | – plain old words to search for |
Search regular expressions | – these are marked with a preceding ~ e.g. ~bio.*?ogy , ~hot|heat |
Functional keywords | – keywords, in uppercase, that provide special functionality, e.g. the boolean operators AND, NOT, OR, and search restrictors like NEW and AGGREGATE. A full list of functional keywords is given below. |
Search field indicators | – these specify which database fields to search. If no field is given, item title and description are assumed. E.g. author_name:cowboyneal finds items written by 'cowboyneal', and c.link:~www\.nature\.com finds items from a channel whose link matches the regular expression www\.nature\.com. A full list of built-in search fields is given below. In addition, the search field indicator can be any predicate used in an RSS 1.0 source document where the subject of the triple is either an item or a channel. See below for more details on this. |
Grouping markers | – parentheses, square brackets and single quotes allow grouping of search terms, operator precedence and delimiting of search phrases, e.g. (nuclear OR fusion) AND NOT dna; author_name:[cowboyneal OR simoniker]; author_name:'George Martin' |
5. Adding named filters and aggregations
Predefined filters can be stored in the database, and added to a query using the functional keyword FILTER. Currently the filter must be added manually to the output_filter table, as follows:
filter_id | – primary key, the id of this named filter |
filter_string | – the Urchin query syntax string |
title | – the name of this filter |
description | – a short description of the filter's purpose |
Aggregations of channels can also be defined, and used to narrow the search to a subset of the channels in the database with the AGGREGATE functional keyword. Currently, these have to be manually defined in the aggregate and channel_aggregate tables as follows:
aggregate.aggregate_id | – primary key, the id of this aggregate |
aggregate.title | – the name of the aggregate |
aggregate.description | – an optional piece of text describing the channels included in the aggregate |
aggregate.inserted_on | – the time the aggregate name was added |
aggregate.inserted_by | – the user_register.user_id of the user who added the aggregate |
aggregate.updated_on | – the time the aggregate was last changed |
aggregate.updated_by | – the user_register.user_id of the user who last changed the aggregate |
channel_aggregate.channel_id and channel_aggregate.aggregate_id | – a list of paired foreign key ids to the channel and aggregate tables, associating a channel with an aggregate |
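The relationship between the two aggregate tables can be sketched with an in-memory SQLite database in Python. The column types, the reduced channel table, the omission of the audit columns, and the aggregate name 'science' are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced, assumed schema: only the columns needed for the join.
    CREATE TABLE channel (channel_id INTEGER PRIMARY KEY, source_url TEXT);
    CREATE TABLE aggregate (aggregate_id INTEGER PRIMARY KEY, title TEXT, description TEXT);
    CREATE TABLE channel_aggregate (channel_id INTEGER, aggregate_id INTEGER);

    INSERT INTO channel VALUES (1, 'http://www.nature.com/nsu/rss.rdf');
    INSERT INTO channel VALUES (2, 'http://slashdot.org/slashdot.rss');
    INSERT INTO aggregate VALUES (1, 'science', 'Science news channels');
    -- Each channel_aggregate row pairs one channel with one aggregate.
    INSERT INTO channel_aggregate VALUES (1, 1);
""")

# List the channels that belong to the aggregate named 'science'.
urls = [
    r[0]
    for r in conn.execute(
        """
        SELECT c.source_url
        FROM channel c
        JOIN channel_aggregate ca ON ca.channel_id = c.channel_id
        JOIN aggregate a ON a.aggregate_id = ca.aggregate_id
        WHERE a.title = ?
        ORDER BY c.channel_id
        """,
        ("science",),
    )
]
```

A query using AGGREGATE science would then be restricted to the channels this join returns.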
6. Adding composite outputs
Composite outputs are labels associating a particular stored search with a particular custom output style. Currently, the information on how to generate a composite output must be manually inserted into the database. To fully specify a composite output, the following information should be in the output table:
output_id | – primary key, id for this composite output |
title | – the label for this composite output |
format_id | – a foreign key to the output_format table, indicating the output style |
filter_id | – a foreign key to the output_filter table, indicating the filter to use |
aggregate_id | – a foreign key to the aggregate table, indicating the aggregate on which to apply the filter |
description | – an optional description, explaining this output |
inserted_by | – the user_register.user_id of the user who inserted this output |
inserted_on | – the time the output was inserted |
updated_by | – the user_register.user_id of the user who last updated this output |
updated_on | – the time the output was last updated |
7. Urchin query string fields
The Urchin query string has the following keys:
q | – The search string, URI encoded |
fmt | – A label for the output type requested, e.g. rss10 for RSS 1.0, rss091 for RSS 0.91, or any other type that has been specified in the output_format table |
out | – A label for the composite output type: any name that has been specified in the output table |
max | – The maximum number of items to return. A blank value indicates that the number is unlimited. This is automatically set to 15 if RSS 0.91 output is requested. |
ord | – How to sort the items returned by the query. Possible values are date (order by item publication date), rand (order at random), raw (as the items are ordered in the Urchin database) or title (alphabetical order by item title). If omitted, the default is to order by publication date. Note that this ordering is done before the results list is cropped to the maximum number of items. So, for example, ord=date&max=15 will return the 15 most recently published items matching a particular query, whereas ord=rand&max=15 will return 15 random items that match the query. |
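The interaction between ord and max can be sketched in Python as order first, then crop. This is only a model of the behaviour described in the table; the item dictionaries, the ISO date strings, and the function name are assumptions.

```python
import random


def order_and_crop(items, order="date", max_items=None, rng=None):
    """Order the matching items, then crop the list to max_items."""
    if order == "date":
        # Most recently published first, so cropping keeps the newest items.
        ordered = sorted(items, key=lambda i: i["date"], reverse=True)
    elif order == "title":
        ordered = sorted(items, key=lambda i: i["title"].lower())
    elif order == "rand":
        ordered = list(items)
        (rng or random).shuffle(ordered)
    elif order == "raw":
        ordered = list(items)  # keep the stored order
    else:
        raise ValueError(f"unknown ord value: {order!r}")
    return ordered if max_items is None else ordered[:max_items]


items = [
    {"title": "b", "date": "2004-01-02"},
    {"title": "a", "date": "2004-01-03"},
    {"title": "c", "date": "2004-01-01"},
]

latest = [i["title"] for i in order_and_crop(items, "date", 2)]
```

Cropping after ordering is what makes ord=date&max=15 mean "the 15 newest matches" rather than "15 arbitrary matches, sorted by date".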
For example, a request for an RSS 1.0 feed, with a maximum of 50 items that include the word 'nasa' in the title would look like:
urchin?q=title%3Anasa&fmt=rss10&max=50
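A query string like the one above can be assembled programmatically. The following Python sketch uses the standard library's URL encoding; the function name and the relative base path "urchin" are assumptions, while the key names are the ones listed in the table.

```python
from urllib.parse import quote, urlencode


def urchin_url(q, fmt=None, out=None, max_items=None, order=None, base="urchin"):
    """Build an Urchin query URL; the search string is percent-encoded."""
    params = [("q", q)]
    if fmt is not None:
        params.append(("fmt", fmt))
    if out is not None:
        params.append(("out", out))
    if max_items is not None:
        params.append(("max", max_items))
    if order is not None:
        params.append(("ord", order))
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base + "?" + urlencode(params, quote_via=quote)


url = urchin_url("title:nasa", fmt="rss10", max_items=50)
# url == "urchin?q=title%3Anasa&fmt=rss10&max=50"
```

Encoding the search string matters as soon as it contains spaces, parentheses or regular expression characters, as in (nuclear OR fusion) AND NOT dna.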
8. Other admin database tables
In addition to the functionality offered in version 0.92, the Urchin database has a number of tables for planned future functionality. Their purpose is briefly described here:
user_register
This is populated with one admin user by the seed_date.sql script. It will be used in future to hold details of other users with other privileges.
user_group_access
This will be used to associate users with user groups that have certain permissions.
group_access
This will be used to list different user groups.
output
Planned development includes the ability to label a particular combination of named filter and named output_format. This will allow queries like:
urchin?out=biologynews
Where biologynews is a label defined in this table to mean 'do a particular search, and present the output using a particular custom output style'.
9. Urchin functional keywords
RECENT
...all items published in the last 3 days.
OLD
...all items published at least 6 months ago.
ENGLISH
...all items marked English, British English, American English, etc.
CURRENT
...all items present on last channel refresh.
NEW
...all items inserted on last channel refresh.
TLD
...all items displaying a title, link and description.
NEWCHANNEL
...all items from channels added in the last 3 days.
ALL
...all items.
NOT, AND, OR
...boolean operators.
FILTER filtername
...recall the stored filter filtername and incorporate it into the current query.
AGGREGATE aggregatename
...recall the stored aggregate aggregatename and limit the current query to only those channels.
10. Urchin search fields
title: | The item's title |
description: | The item's description |
link: | The item's link |
author_name: | The item's author's name, as given by a dc:creator element |
author_email: | The item's author's email address, as given by the rss20:author element |
publication_date: | The publication date of the item, as given by the rss091:pubDate element |
language: | The language of the item, e.g. en-gb, inherited from its parent channel. The value is drawn from dc:language or rss091:language elements |
channel_id: | The channel_id, in the Urchin database, of the item's parent channel |
current_ind: | Indicates whether the item was present in a feed the last time that feed was checked. Has a value of 1 or 0; the CURRENT keyword means that all result items must have current_ind:1 |
new_ind: | Indicates whether the item was new in a feed the last time that feed was checked. Has a value of 1 or 0; the NEW keyword means that all result items must have new_ind:1 |
comment_url: | The value of the rss20:comments or annotate:reference element |
c.title: | The title of the parent channel of an item |
c.description: | The parent channel's description |
c.link: | The parent channel's link |
c.source_url: | The URL of the resource from which the item was extracted |
c.language: | The language of the item's channel, as specified by dc:language or rss091:language elements in the source RSS feed |
c.last_updated_on: | The date the item's channel was last updated in the Urchin database |
c.last_build_date: | The value of a channel level dc:date element in the source RSS feed |
c.generator: | The value of the rss20:generator element |
c.rating: | The value of the rss091:rating element |
c.copyright: | The value of the rss091:copyright element |
c.managing_editor: | The value of the rss091:managingEditor element |
c.webmaster: | The value of the rss091:webmaster element |
c.docs_url: | The value of the rss20:docs element |
11. Simple extensible search
In addition to storing the core RSS data items, the Urchin database stores all extra RDF-modeled data in imported RSS 1.0 feeds. From version 0.92, the Urchin search syntax has been extended to allow simple searching of this data using arbitrary search field restrictions. For example, if an imported RSS 1.0 feed contained the following data:
<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:reqv="http://purl.org/rss/1.0/modules/richequiv/" >
  ...
  <item rdf:about="http://www.example.com/item1">
    <title>Searching arbitrary data fields</title>
    <link>http://www.example.com/item1</link>
    <description>A short tutorial on how to search arbitrary data fields in Urchin.</description>
    <dc:creator>Ben Lund</dc:creator>
    <dc:contributor>Martin Flack</dc:contributor>
    <reqv:description rdf:parseType="Literal" xmlns="http://www.w3.org/1999/xhtml">
      <p>A short tutorial on how to search arbitrary data fields in <a href="http://urchin.sourceforge.net/"> Urchin</a>.</p>
    </reqv:description>
  </item>
  ...
</rdf:RDF>
The core RSS data fields (title, link, description) would be searchable using the built in field restrictors. For example, the searches
title:arbitrary
link:='http://www.example.com/item1'
description:~data\sfields?
would all include this item in the output.
In addition, RDF predicates can be used as search field restrictors. For example, the searches
<dc:contributor>:='Martin Flack'
<reqv:description>:Urchin
would also match this item. The syntax of these simple RDF queries is:
<namespace_abbreviation:predicate_name>:pattern
The search will match any RDF triples whose subject is an item or a channel and whose value matches pattern. As with other Urchin queries,
There are a few restrictions on the use of this syntax: the namespace abbreviation must be mapped to a namespace URI in the Urchin database, and multiple arbitrary search restrictors cannot be combined using the AND operator, although the use of the OR operator will work.
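The restrictor syntax and its two restrictions can be sketched as small Python helpers. The prefix table below is a stand-in for the mapping held in the Urchin database (its two entries are taken from the sample feed above), and the function names are hypothetical.

```python
# Stand-in for the namespace abbreviations mapped in the Urchin database.
KNOWN_PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "reqv": "http://purl.org/rss/1.0/modules/richequiv/",
}


def rdf_restrictor(prefix, predicate, pattern):
    """Format <namespace_abbreviation:predicate_name>:pattern."""
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"namespace abbreviation {prefix!r} is not mapped to a URI")
    return f"<{prefix}:{predicate}>:{pattern}"


def any_of(restrictors):
    """Combine RDF restrictors with OR, the only combination that is supported."""
    return " OR ".join(restrictors)


query = any_of([
    rdf_restrictor("dc", "contributor", "='Martin Flack'"),
    rdf_restrictor("reqv", "description", "Urchin"),
])
```

The resulting string can be used as the q value of a query string once it has been URI encoded.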
12. RDF search
The Urchin database can be treated as an RDF triple store that can be queried using the RDF::Core::Query language (hereafter referred to as RCQL). For details of the language syntax, see that module's page on CPAN; here we will describe how to use this language with Urchin.
There are two modes of using RCQL: to generate tabular data output, or to construct an RSS feed (which could then be used to generate other output styles that are XSL transformations of RSS 1.0 feeds). The RCQL query option for the default Urchin CGI script (urchin?cmd=rcql) offers both these options. The default behaviour is to output tabular data either as a CSV file or as a general results page. If the option for RSS 1.0 output is selected, the query must generate a list of Urchin channel and item IDs. This is done as follows:
Select ?item->urchin:channel_id, ?item->urchin:item_id
From some graph query involving ?item
Where some conditions
For example, the following query generates a list of channel and item ids for items that were written by a foaf:Person called 'James Bond'.
Select ?item->urchin:channel_id, ?item->urchin:item_id
From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z}
Where ?z = 'James Bond'
Note that, as with the simple extensible search above, the namespace abbreviations are pre-set – any used in a query must match an abbreviation mapped to a URI in the Urchin database.
The Urchin distribution includes a command line tool, rdfq, for querying the Urchin database using RCQL. Its usage is as follows:
rdfq [output options] [-v] [-d] [Query]
Output Options
-o | --output '=s' (where s = 'text'|'csv'|'html'|'rss')
-t | --text tab-delimited text
-c | --csv CSV text
-h | --html HTML output
-r | --rss RSS 1.0 output
For this option the query must have
a Select clause of
?item->urchin:channel_id,
?item->urchin:item_id
-v | --verbose Verbose output
-d | --debug Switch on debugging
The RCQL query is taken either from the command line,
or from STDIN.
RCQL can also be used instead of the Urchin search syntax in the standard RSS filtering query string. In this case, the query must be prefixed by RCQL: (including the colon). In this mode of operation the query solely generates RSS data, either outputting an RSS feed or one of the other pre-defined or customised output formats. Therefore, the Select clause can be omitted from the query, and Urchin will prepend a clause selecting the urchin:channel_id and urchin:item_id. Urchin analyses the rest of the query to determine which RCQL variable to use in the Select clause. If there is a ?item, ?i, or ?x in the query, that is assumed to be the item variable; otherwise the first variable found is used in the Select clause.
For example, the following query is equivalent to the full RCQL query above.
RCQL:From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z} Where ?z = 'James Bond'
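The way the item variable is chosen can be sketched in Python. This models the rule as stated, not Urchin's actual code; in particular, the preference order among ?item, ?i and ?x when more than one is present is an assumption, and a question mark inside a quoted literal would be mistaken for a variable by this simple pattern.

```python
import re

# Assumed preference order when several candidate names are present.
PREFERRED = ("?item", "?i", "?x")


def complete_rcql(query):
    """Prepend the Select clause to an RCQL: query that omits it."""
    body = query[len("RCQL:"):] if query.startswith("RCQL:") else query
    body = body.strip()
    if re.match(r"select\b", body, re.I):
        return body  # already a full query
    variables = re.findall(r"\?[A-Za-z_]\w*", body)
    if not variables:
        raise ValueError("no RCQL variable found in query")
    item_var = next((v for v in PREFERRED if v in variables), variables[0])
    return f"Select {item_var}->urchin:channel_id, {item_var}->urchin:item_id {body}"


full = complete_rcql(
    "RCQL:From ?item->dc:creator{?y}, foaf:Person::?y->foaf:name{?z} Where ?z = 'James Bond'"
)
```

For a query that uses none of the three conventional names, such as From ?a->dc:title{?b}, the first variable found (?a) ends up in the Select clause.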
In addition, Urchin offers a keyword and regular expression matching extension to the RDF::Core::Query language. Literal strings can be prepended with LIKE:, with a % on either side of the keyword, to do a keyword match, or prepended with RLIKE: to do a regular expression match. For example:
Select ?item->urchin:channel_id, ?item->urchin:item_id
From ?item->dc:category{?y}
Where ?y = 'LIKE:%cancer%'
This would produce a list of items that have a Dublin Core category field that includes the word 'cancer'.
and:
Select ?item->urchin:channel_id, ?item->urchin:item_id
From ?item->ag:timestamp=>'RLIKE:.+T(0[0-9])'
This would produce a list of items that were collected by Urchin between 00:00 and 09:59 on any day.
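The intended matching behaviour of the two prefixes can be modelled in Python. This is an emulation under assumptions: % is treated as "any run of characters", RLIKE: as an unanchored regular expression search, matching as case-sensitive, and timestamps as ISO-style strings; Urchin's own matching may differ in these details.

```python
import re


def literal_matches(value, literal):
    """Emulate LIKE: / RLIKE: matching of an RCQL literal against a value."""
    if literal.startswith("RLIKE:"):
        # Unanchored regular expression search.
        return re.search(literal[len("RLIKE:"):], value) is not None
    if literal.startswith("LIKE:"):
        # Each % stands for any run of characters; the rest is literal text.
        parts = literal[len("LIKE:"):].split("%")
        regex = ".*".join(re.escape(p) for p in parts)
        return re.fullmatch(regex, value, re.S) is not None
    return value == literal  # plain literal: exact comparison


keyword_hit = literal_matches("breast cancer research", "LIKE:%cancer%")
early = literal_matches("2004-03-01T07:15:00", "RLIKE:.+T(0[0-9])")
late = literal_matches("2004-03-01T17:15:00", "RLIKE:.+T(0[0-9])")
```

Here keyword_hit and early are true and late is false, mirroring the two example queries: the first matches on a contained keyword, the second only on hours 00 to 09.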