
Snatch

Grabbing Data off the Web with Perl

(and then doing something useful with it)

Snatch keeps its data in an XML file. The <sub> elements contain Perl code to retrieve, parse and format the snatched information. The output goes to an HTML page (but it doesn't have to). Snatch can use LWP (built in) or sockets to retrieve data, and it works fine on local files too. I usually parse the page with regexes, but you could use HTML::Parser just as easily.
   The Main Idea here is to keep all the useful bits in one convenient location. You can grab the whole item and plug it into another page with no problems.

A fragment of the XML file looks like this:
Population
<item>
 <name>population</name>
  <url>http://www.census.gov/cgi-bin/ipc/popclockw</url>
  <method>LWP</method>
  <sub>
     m!<h1>(.+?)</h1>!si;
     $_="<h3>$1 Estimated Population</h3>";
  </sub>
</item>

Snatch will go get the url using LWP and use whatever is in sub to get something from the page. Sub can contain any Perl that will run in an eval, which is apparently anything, and it's only a little harder to debug buried inside there. You can also define a <lurl> pair, which can reference a copy of the page on your hard disk till you get it right. This doesn't necessarily mean the code will work over a live HTTP connection, but it helps. The sub code can also be encased in a comment block. The only thing this does is ensure that the Perlish line noise doesn't break the XML somehow.
   BTW, my definition of well formed XML is that IE5.0 will display it properly, which fits this rather loose approach.
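
To make the eval step concrete, here is a minimal sketch of how a sub's text might be run against a fetched page. The helper name run_sub is hypothetical and this is not Snatch's actual code; it only assumes what is described above: the page arrives in $_, the optional comment wrapper is stripped, and the value of the last statement is what gets printed.

```perl
use strict;
use warnings;

# Hypothetical helper, not Snatch's actual code: run one item's <sub>
# text against a page that has already been fetched.
sub run_sub {
    my ($code, $page) = @_;
    $code =~ s/^\s*<!--//;      # drop an optional comment wrapper
    $code =~ s/-->\s*$//;
    no strict 'vars';           # sub code is free to use undeclared globals
    local $_ = $page;           # the sub code sees the page in $_
    my $result = eval $code;    # string eval: any Perl is allowed
    return "Error in sub: $@" if $@;
    return $result;             # value of the last statement in the sub
}

my $page = "<html><h1>6,000,000,000</h1></html>";
my $code = q{
    m!<h1>(.+?)</h1>!si;
    $_ = "<h3>$1 Estimated Population</h3>";
};
print run_sub($code, $page), "\n";
```

Returning the error text instead of dying keeps one broken item from taking the whole page down, which is presumably why most of the examples hand back $_.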

Note I've put some HTML formatting on what is returned. This is probably very politically incorrect, as XML is supposed to be pure data without any of that presentation stuff in it. But it was awfully convenient to do it right there. Some data demands to be run into a table immediately, for instance.
It was so convenient that I got downright wicked and defined the rest of the page as well.

Page Layout  
<page>
  <options> debug=2 headers lurl</options> <!-- debug headers lurl -->
   <outfile>c:/perl/news.html</outfile>
   <head>
      $_= "<html\><head\><title\>The News of the Day</title\></head\>\n";
   </head>
   <body>
      $_="<H1>The News of the Day</H1>\n";
   </body>
   <footer>$_= "</body></html>\n"; </footer>
</page>

Snatch has a little micro parser that looks at the XML config file and tucks the essentials into a list of hashes. The parser uses a DTD to know what to look for. In this case I feed it the page element, then fiddle with options as needed (options went into the config because it was easy), then print the head and (upper) body portions of the page. At this point I feed the parser the item element and send the hashes off to LWP::UserAgent to be retrieved and printed. Then I print the footer and it's done.

The parser is really the star of the show. It will pick up anything in the DTD and return the contents in a hash. This makes it quite easy to add, rename or delete fields. Adding the options took all of 10 minutes, including changing the code to look in the hash instead of a $variable.
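
For readers who want a feel for the idea, here is a rough sketch of a DTD-driven micro parser. It is a stand-in written for illustration, not the parser shipped in Snatch.pl: the name micro_parse, the <!DOCTYPE snatch ...> wrapper and the sample items are all assumptions. It reads the field names for one element out of an <!ELEMENT ...> declaration and returns one hash per element found.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Snatch's micro parser: pull the field names
# for one element out of the DTD, then return a list of hash references.
sub micro_parse {
    my ($xml, $element) = @_;
    # Field names come from <!ELEMENT element (a, b, c)> in the DTD.
    my ($fields) = $xml =~ /<!ELEMENT\s+\Q$element\E\s*\(([^)]*)\)/
        or return;
    my @fields = $fields =~ /(\w+)/g;
    my @items;
    while ($xml =~ m{<\Q$element\E>(.*?)</\Q$element\E>}sg) {
        my $body = $1;
        my %item;
        for my $f (@fields) {
            # A field missing from this item is simply left out of the hash.
            next unless $body =~ m{<$f>\s*(.*?)\s*</$f>}s;
            $item{$f} = $1;
        }
        push @items, \%item;
    }
    return @items;
}

my $config = <<'XML';
<!DOCTYPE snatch [
   <!ELEMENT item (name, url, method, sub)>
]>
<item>
  <name>population</name>
  <method>LWP</method>
  <sub>$_ = "hello";</sub>
</item>
<item>
  <name>clock</name>
  <sub>scalar localtime</sub>
</item>
XML

for my $item (micro_parse($config, 'item')) {
    print join(', ', map { "$_=$item->{$_}" } sort keys %$item), "\n";
}
```

Because the field list lives in the DTD, adding a field means editing one declaration and nothing else. Like the real thing, this sketch does not cope with nested items.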

You don't need to use the page layout, if you'd rather not. But it has some useful things in it.
<options>
     debug - up to 4, which only prints out the hash and retreats
     headers - prints out the retrieved headers to a log file if you need them.
     lurl - use the local url instead of fresh data.
</options>
<outfile> - specify where to print (or STDOUT)</outfile>
<dontwant> skip these items </dontwant>

Note!
lurl still uses LWP to fetch the file! So you need a webserver running locally or it won't work.

If no <method> is specified we just eval whatever is in sub.
As an example this gets foreign currency rates using sockets:
Bank Rates with Sockets
 
<item>
  <name>BankRates</name>
  <sub><!--
use IO::Socket;
my ($request_string,$rate,$reply,$conn,$len); 
my $base="THB";
my @quotes=("USD","MYR","GBP","AUD");
foreach (@quotes)  {
  $s= $s . sprintf "%s %s ",$_, fxp($base,$_);
}
$_="<b>Thai Baht to $s</b>";

sub fxp  {
my ($base,$quotecurrency) = @_;
$conn=IO::Socket::INET->new(
PeerAddr => "www.oanda.com",
PeerPort => 5011,
Proto => 'tcp');
die "Couldn't connect to host www.oanda.com\n" unless $conn;

$request_string="fxp/1.1\nbasecurrency: $base\nquotecurrency: $quotecurrency\n\n";
$len = length($request_string);
unless (syswrite($conn,$request_string,$len) == $len) {
  print "www.oanda.com closed connection\n";
  $conn->close();
  die "No connection to www.oanda.com\n";
  }
  while ($reply=<$conn>)  {
    if ($reply=~/^\d+\.\d*/)  {    # first line that looks like a rate
        $rate=$reply;
        last;
    }
  }
  $conn->close();
  $rate;
}
  --></sub>
</item>

<method>LWP chunk</method> and <method>LWP</method> are equivalent because chunk is the default. Note: Make sure that sub returns false when chunking till you have what you want or it won't work right!
This example gets weather from Yahoo. I grab 4 of these for various locations and have a picture of the whole country.
Weather from Yahoo
<item>
<name>WeatherBKK</name>
<url>http://weather.yahoo.com/forecast/Bangkok_TH_c.html</url>
<lurl>http://localhost/weatherpage/Bangkok_TH_c.html</lurl>
<method>LWP chunk</method>
<sub>
my $rebegin='<!--Begin Extended Forecast-->';
my $reend='<!--End forecast table-->';
m/^.*?$rebegin\s*(.*?)\s*$reend/is;
my $s=$1;
$s=~ s!/?graphics/new_icons/!!sg if $s;
$s;
</sub>
</item>
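
The rule about returning false while chunking can be shown without a network connection. The sketch below is an assumption about how a chunk loop might be driven, not Snatch's code: feed_chunks is a hypothetical name, the chunks are plain strings standing in for what LWP would deliver, and the buffer is assumed to grow with each chunk. The sub is called after every chunk and reading stops at the first true result.

```perl
use strict;
use warnings;

# Hypothetical driver, not Snatch's code: feed chunks to a sub until it
# returns something true. In real use the chunks would come from LWP.
sub feed_chunks {
    my ($sub, @chunks) = @_;
    my $buffer = '';
    my $used   = 0;
    for my $chunk (@chunks) {
        $buffer .= $chunk;
        $used++;
        local $_ = $buffer;                   # the sub sees everything so far
        my $result = $sub->();
        return ($result, $used) if $result;   # got it: stop reading
    }
    return (undef, $used);                    # never matched
}

# Returns false (undef) until both markers have arrived.
my $sub = sub {
    return m/<!--Begin-->\s*(.*?)\s*<!--End-->/is ? $1 : undef;
};

my @chunks = ('<html><!--Beg', 'in--> Sunny, 33C <!--E', 'nd--></html>', 'trailing');
my ($forecast, $used) = feed_chunks($sub, @chunks);
print "got '$forecast' after $used chunks\n" if defined $forecast;
```

Testing the match before using $1 matters here: after a failed match $1 may still hold a value from an earlier successful one, and that stale value could make the sub return true too early.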

While I'm getting the news and weather I take care of setting the PC's clock too. This needs the Net::Time module.
Setting the Clock
<item>
   <name></name>
   <sub>

use Net::Time qw(inet_time);    #set the pc clock
my $host="mozart.inet.co.th";
my $t=inet_time($host,'udp');
my ($sec, $min, $hour) = (localtime($t))[0,1,2];
my($old_sec, $old_min, $old_hour) = (localtime(time))[0,1,2];
my $min_diff = $min - $old_min;
$hour = $old_hour;    # keep the local hour; only minutes and seconds are adjusted
if (abs($min_diff) < 30 ) {    #if more than 30 minutes out set by hand!
  $sec_diff = 60*(60*$hour + $min) + $sec - (60*(60*$old_hour + $old_min) + $old_sec);
  if (abs($sec_diff) > 3) {    # set if more than 3 seconds off
    my $time_new = "$hour:$min:$sec";
    my $rc=system("time $time_new");
   }
}
$timestr=sprintf("%s %02d:%02d:%02d %+d",substr($now,0,16),$hour,$min,$sec,$sec_diff);
   </sub>
</item>

<method>LWP all</method> is available for shorter pages. It just grabs everything in one shot and returns it in a scalar. And <method>LWP file=/perl/somefile.htm</method> is also available. Note this last returns the filename only (or an error message if something went wrong). The Perl sub must open it to do something useful, unless you only want to store the file somewhere, like this example that returns a GIF.

Bank exchange Rates
<item>
   <name>BangkokBank</name>
   <url>http://www.bbl.co.th/cgi-bin/cgiwrap/nabbl/bankrates.cgi</url>
   <method>LWP file=/localweb/images/bankrates.gif</method>
   <sub>$_</sub>
</item>
That $_ in the sub will print out any error messages, as most of them do.

Calling the parser from a <sub>
slashdot.xml is a kind of headline news in XML format. Here is an example that will grab this XML file, attach a DTD (so we know what to look for), parse it on the fly and then print it.
Parse XML
<item>
    <name>slashdotxml</name>
    <url>http://slashdot.org/slashdot.xml</url>
    <lurl>http://localhost/savesite/slashdot.xml</lurl>
    <method>LWP all</method>
    <sub><!--
my $dtd=<< "DTD";
<?xml version="1.0"?>
<!DOCTYPE ultramode [
   <!ELEMENT story   (title, url, time, author, department, topic, comments, section, image)>
]>
DTD
$s= $dtd . $_;
my @ll=parseXML($s,"story");
die "Parse Failed" if not @ll;
my (@l, $temp, $text);
push @l,"<h2> Slashdot XML Mode</h2>\n";
push @l,"<table> \n";
foreach $temp (@ll)  {
     $text= << "ARTICLE";
     <tr>
     <td><font size='-1' >$temp->{'department'}</font></td>
     <td><font size='-1'><b>$temp->{'topic'}</b></font></td>
     <td><font size='-1'><A HREF=\"$temp->{'url'}\">$temp->{'title'}</a></font></td>
     <td><font size='-1' color='red'>$temp->{'author'}</font></td></tr>
ARTICLE
push @l, $text;
}
push @l,"</table>\n";
$_=join' ',@l;
    --></sub>
</item>


Linking multiple XML files
The link method just stuffs the XML files into a list and calls them. There is a %sys hash that can be used to pass along information.
<item>

     <name>do all these</name>
     <method>link news.xml weather.xml</method>
</item>

Timers
Named timers are useful items, and starting one up is easy.
<item>

     <name>start the timer</name>
     <method>timer able</method>
</item>
Get the results in any later routine, or even in a linked file, like so:
    $_=sprintf "%.3f seconds",$sys{'timer able'}/1000;
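
Here is one way a named timer could be kept in %sys, sketched under assumptions: the helpers timer_start and timer_read are invented for this example, and the only thing taken from the text above is that $sys{'timer able'} divided by 1000 gives seconds, so the stored value is treated as milliseconds.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

our %sys;   # stand-in for Snatch's %sys hash

# Hypothetical helpers; Snatch's own timer code may differ.
my %started;

sub timer_start {
    my ($name) = @_;
    $started{$name} = time;     # high-resolution start time
    return;
}

sub timer_read {
    my ($name) = @_;
    return unless exists $started{$name};
    # Store elapsed milliseconds under "timer NAME".
    return $sys{"timer $name"} = int(1000 * (time - $started{$name}));
}

timer_start('able');
sleep 0.05;                     # stand-in for fetching a few pages
timer_read('able');
printf "%.3f seconds\n", $sys{'timer able'} / 1000;
```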

The Snatch code is quite small for what it does. Since most of the Perl code and config stuff is buried in the XML file and everything gets passed around in hashes, it's still only about 400 lines.

Snatch.pl   config.xml
package Snatch.zip 4k
Limitations:
The microparser doesn't handle nested items.
This was developed and tested using Activestate's 519 release of Perl on Win32.
Changes:
Timestamp: 11 June, 1999
LWP required on method.
Various changes to make it easier to add routines later.
Timestamp: 26 November, 1999
Page element de-emphasised and folded in a bit more, though it still gets called first. <body> is now deprecated but still supported.
Option now works for any item.
Timer and Link added.

Where this is going

While I'm happy with it for the moment, I really should do something about checking headers to see whether something has changed. It will probably also be useful to embed POST and GET and some mail or news routines. It should be fairly easy to load a different parser, e.g. fiddle XML::Parser to return a list of anonymous hashes, but I haven't needed it yet. I'm trying to keep this simple.

Background processing of various kinds both local and remote. Needs a sleep hook, time handling routines and a proper scheduler built in. I want the minimum built in.

Second thing is a way to catalog and store the code fragments. For one, I've got too many little XML files running around. These are adding up and I am losing track of them. This needs a general mechanism for storing and locating code with comments and addenda of various types, which probably means a fancier DTD and parser. Another thing is an include element that will probably fall out of the above.

Comments and suggestions here: yzorderex@yahoo.com
Anything not explained will be explained later especially if you ask.