Urchin requires a web server, and comes with a CGI script and a mod_perl
handler for Apache. The CGI script should work with any CGI-enabled web
server, while the mod_perl handler requires Apache as the web server.
Urchin uses MySQL for its database functionality. Installation
instructions for these system software components can be found at the
relevant sites.

For the Apache server, see http://httpd.apache.org/
Note that for mod_perl functionality Urchin requires Apache 2.0. This
version of Urchin has been tested with Apache 2.0.40.

For mod_perl, see http://perl.apache.org/
Note that because Urchin's mod_perl functionality uses Apache 2.0, it
requires mod_perl 2.0. (mod_perl 2.0 is currently in development, so the
actual version to use as of May 21, 2004 would be 1.99_14 or higher.)

Finally, for MySQL, see http://www.mysql.com/
Note that the MySQL version must be 4.0.13 or higher.
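As a quick sanity check of the version requirements above, the following shell sketch compares dotted version strings. The function name version_ge is invented for this example and is not part of Urchin; it assumes a sort(1) that supports -V (GNU coreutils), and the version string is typed in by hand rather than parsed from any tool's output.

```shell
#!/bin/sh
# version_ge HAVE WANT: succeed if dotted version HAVE is >= WANT.
# Hypothetical helper, not part of Urchin. Assumes sort -V (GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: 4.0.18 stands in for whatever your MySQL server reports.
if version_ge "4.0.18" "4.0.13"; then
    echo "MySQL version OK"
else
    echo "MySQL too old: need 4.0.13 or higher"
fi
```

The same function can be reused for the Apache requirement, for example version_ge 2.0.40 2.0.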
0. Untar / CVS checkout
To install from tarball:
- Untar Urchin-0.92.tar.gz and enter the directory
$ tar xvzf Urchin-0.92.tar.gz
$ cd Urchin-0.92
To install from CVS:
- In a directory of your choice, create and enter a directory called
cvs_urchin:
$ mkdir cvs_urchin
$ cd cvs_urchin
- In cvs_urchin, run:
$ cvs -d:pserver:anonymous@cvs.sf.net:/cvsroot/urchin checkout urchin
- Enter the directory
$ cd urchin
Note that there are currently (August 20, 2004) some problems with SourceForge CVS access. During the outage, you can get the latest development code snapshot by downloading Urchin-dev-20040820.tar.gz from the project download page.
Then:
- Install the Urchin Perl modules
$ perl Makefile.PL
$ make
$ make test
$ make install
1. Install CPAN modules
Urchin requires these CPAN modules:
MODULE                VER TESTED  STOCK IN PERL VER
------------------------------------------------------------
Apache::compat        -
Apache::Const         0.01
Apache::Emulator      0.04
Apache2               -
Carp                  1.01        5.00307, 5.007003
CGI                   2.81        5.004, 5.008
DBI                   1.30
Data::Dumper          2.12        5.005, 5.007003
Encode                1.83        5.007003, 5.008001
ExtUtils::MakeMaker   6.03        5.00307, 5.008
File::Path            1.05        5.00307, 5.007003
FindBin               1.43        5.00307, 5.007003
HTML::Entities        1.23
HTML::LinkExtractor   0.11
HTML::Sanitizer       0.04
HTML::Template        2.6
HTTP::Request         1.30
HTTP::Response        1.41
HTTP::Status          1.26
LWP::RobotUA          1.18
LWP::UserAgent        2.001
POSIX                 1.05        5.00307, 5.007003
Parse::RecDescent     1.94
RDF::Core             0.30
Set::Array            0.11
Sys::Hostname::Long   1.2
Text::CSV             0.01
Time::ParseDate       2003.1126
Time::Stopwatch       1.00
URI                   1.21
XML::DOM              1.27
XML::RSS              1.02
XML::RSS::Tools       0.13
XML::XPath            1.13
XML::XSLT             0.45
The second column shows the version tested with Urchin; much earlier
versions are unlikely to work. The third column lists two Perl versions:
the Perl release in which the module first appeared at all, and the
Perl release that first shipped the version listed in the second column
(this data is from Module::CoreList 1.96). If you have a recent version
of Perl such as 5.8.0, you can skip installation of the modules that
have a Perl version listed.
You can install Perl CPAN modules with the CPAN shell, e.g.:
perl -MCPAN -e shell
cpan> install RDF::Core
If you prefer system packages, Red Hat users can check
http://rpmpan.sourceforge.net/ and Debian users can check their apt
repository or man dh-make-perl.
XML::RSS::Tools will need XML::LibXML and XML::LibXSLT. These in turn
require you to install the system libraries that they need (libxml2
and libxslt).
2. Database setup
***Your MySQL version must support InnoDB - use version 4.0.13 or higher***
To create a database, run the setup script. Running setup without parameters will provide help text:
cd db/scripts/mysql
./setup
An example of a working setup command might be:
./setup 'mysql -u root -p' create urchin urchin rss
3. System setup
Run the following commands as root to set up a group and directories for Urchin to write into:
groupadd urchin
# add apache user to urchin group; username may be www-data instead
gpasswd -a apache urchin
# add yourself to urchin group (replace myuser with your username)
gpasswd -a myuser urchin
mkdir /var/cache/urchin
chgrp urchin /var/cache/urchin
chmod 2775 /var/cache/urchin
mkdir /var/log/urchin
chgrp urchin /var/log/urchin
chmod 2775 /var/log/urchin
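After running the commands above, you may want to confirm that the group memberships took effect. The sketch below relies on id(1); in_group is a helper name invented for this example, and apache is the assumed web server user (it may be www-data on your system).

```shell
#!/bin/sh
# in_group USER GROUP: succeed if USER is listed as a member of GROUP.
# Hypothetical helper; relies on `id -nG`, which prints group names.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# "apache" is an assumption; substitute your web server's user.
if in_group apache urchin; then
    echo "apache is in the urchin group"
else
    echo "apache is NOT in the urchin group"
fi
```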
4. Configuration
Copy the stock configuration file to the default location:
cp config /etc/urchin.conf
Edit /etc/urchin.conf to:
- use the correct database name, username and password for the urchin database:
    dbi.source = mysql:urchin
    dbi.username = user
    dbi.password = password
- specify the directories where Urchin should cache downloaded data and write its logs:
    cache.path = /path/to/urchin/cache
    log.path = /var/log/urchin
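To double-check the values you just edited, a short shell helper can print what is assigned to each key. This is a sketch only: conf_get is a name invented here, it assumes plain "key = value" lines like the ones shown above, and it is not Urchin's own configuration parser.

```shell
#!/bin/sh
# conf_get KEY [FILE]: print the value assigned to KEY in FILE.
# Hypothetical helper; assumes plain "key = value" lines and defaults
# to /etc/urchin.conf when FILE is omitted.
conf_get() {
    key=$(printf '%s' "$1" | sed 's/\./\\./g')   # escape dots for the regex
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" \
        "${2:-/etc/urchin.conf}" | tail -n 1
}

# Example: show the settings edited in this step (empty if not set).
for k in dbi.source dbi.username cache.path log.path; do
    echo "$k -> $(conf_get "$k" 2>/dev/null)"
done
```

If a key appears more than once, the helper prints the last assignment; whether Urchin itself resolves duplicates the same way is not something this sketch can tell you.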
5. CGI setup
Copy the CGI script into the appropriate directory:
cp urchin.cgi /var/www/cgi-bin
chmod a+x /var/www/cgi-bin/urchin.cgi
6. mod_perl setup (if used)
You should have the following items in your Apache configuration:
If the Urchin libraries are being installed to a special location you
must get them loaded by Perl. You can accomplish this by placing the
following command in your Apache config outside of all VirtualHost
blocks. If your distro has a /etc/httpd/conf.d/perl.conf, place this
command (modified with the right directory name) near the bottom of
that file; otherwise add it somewhere in /etc/httpd/conf/httpd.conf.
PerlSwitches -I/var/www/cgi-bin/urchin_lib
All Urchin mod_perl instances will need the following; it can go
inside or outside a VirtualHost:
PerlModule Apache::Urchin
<Location /urchinsearch>
    SetHandler perl-script
    PerlHandler Apache::Urchin
</Location>
The PerlModule line is not strictly required, but it's recommended.
You may have to say PerlHandler Apache::Urchin::handler on some
systems if Apache gives you an error about not being able to find or
initialize the handler.
And this can be added for some web administration:
<Location /urchinadmin>
    SetHandler perl-script
    PerlHandler Apache::Urchin
    PerlSetVar UrchinAdmin On
    PerlAuthenHandler Apache::Urchin::authen_handler
    AuthName "Urchin Administrative Commands"
    AuthType basic
    require valid-user
</Location>
If you intend to lock out the public from even the non-admin section,
and require all users to supply a valid username/password, then you
should use a Location block for /urchinsearch that looks more like the
/urchinadmin example: copy the PerlAuthenHandler, AuthName, AuthType
and require lines over. Then in /etc/urchin.conf you need to set
web.public_access = NO.
7. Import RSS feeds
Prepare a file of RSS feed URLs to import - for an example see feeds.txt
Add to the urchin database:
$ perl urchinadm add < feeds.txt
Alternatively, specify URLs to import on the command line:
$ perl urchinadm add http://www.nature.com/news/rss.rdf
$ perl urchinadm add http://slashdot.org/slashdot.rss
The urchinadm command offers other convenient administrative commands;
run it without any parameters to see help text. You may find it useful
to create a symlink to urchinadm somewhere in your PATH.
8. Set up cron job for database refresh
Edit /etc/aliases to include urchinadm. The database refresh shell
script urchin_refresh.cron mails error reports to a user called
'urchinadm'. You should set up an alias or aliases for this address:
urchinadm: john, paul, george, ringo
Be nice in how often you run the refresh script. More than once an hour is too often – and for most sites, even that will be unnecessary.
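As an illustration of a polite schedule, the sketch below prints a crontab line that runs the refresh script once every few hours. The helper name cron_line and the script path /usr/local/bin/urchin_refresh.cron are assumptions made for this example; use the location where you actually installed urchin_refresh.cron.

```shell
#!/bin/sh
# cron_line HOURS SCRIPT: print a crontab entry that runs SCRIPT at
# minute 0 of every HOURS-th hour. Hypothetical helper for illustration.
cron_line() {
    printf '0 */%s * * * %s\n' "$1" "$2"
}

# The path below is an assumption; adjust it to your installation.
cron_line 6 /usr/local/bin/urchin_refresh.cron
```

The printed line can be pasted into the file opened by crontab -e, for a user that belongs to the urchin group.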