Snatch keeps its data in an XML file. The <sub> elements
contain Perl code to retrieve, parse and format snatched information. The
output goes to an HTML page (but it doesn't have to). Snatch can use LWP
(built in) or sockets to retrieve data, and works fine on local files too. I
usually use regexes to parse the page, but you could use HTML::Parser just
as easily. The main idea here is to keep all the useful bits in one
convenient location. You can grab the whole item and plug it into another
page with no problems.
A fragment of the XML file looks like this:
Population
<item>
<name>population</name>
<url>http://www.census.gov/cgi-bin/ipc/popclockw</url>
<method>LWP</method>
<sub>
m!<h1>(.+?)</h1>!si;
$_="<h3>$1 Estimated Population</h3>";
</sub>
</item>
Snatch will go get the url using LWP and run whatever is in the
<sub> to pull something out of the page. The <sub> can contain any Perl
that will run in an eval, which is apparently anything; it's just a little
harder to debug buried in there. You can also define a <lurl> (local url)
which can reference a page on your hard disk until you get the code right.
That doesn't guarantee the code will work over a live HTTP connection, but it
helps. The <sub> code can also be encased in a comment block. The only
thing this does is ensure that the perlish line noise doesn't break the XML
somehow.
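The whole trick can be sketched in a few lines (this is a sketch of the
idea, not the actual Snatch source): the page lands in $_, the <sub> code
runs in an eval and rewrites $_, and whatever is left over gets printed.

```perl
use strict;
use warnings;

# Sketch of the eval step: the fetched page goes into $_, the <sub>
# code runs in an eval and rewrites $_, and the result is returned.
sub run_sub {
    my ($page, $code) = @_;
    local $_ = $page;
    eval $code;                        # any Perl that runs in an eval
    return $@ ? "<!-- eval failed: $@ -->" : $_;
}

# The population <sub> from above, run against a saved page:
my $html = "<html><h1>6,034,567,890</h1></html>";
print run_sub($html,
    'm!<h1>(.+?)</h1>!si; $_ = "<h3>$1 Estimated Population</h3>";');
```

Swap the saved page for an LWP fetch and you have the live version.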
By the way, my definition of well-formed XML is that IE 5.0 will display
it properly, which fits this rather loose approach.
Note I've put some HTML formatting on what is returned. This is probably
very politically incorrect, since XML is supposed to be pure data without any
of that presentation stuff in it, but it was awfully convenient to
do it right there. Some data demands to be run into a table immediately, for
instance.
It was so convenient that I got downright wicked and defined the rest of the
page as well.
Page Layout
<page>
<options> debug=2 headers lurl</options>
<!-- debug headers lurl -->
<outfile>c:/perl/news.html</outfile>
<head>
$_= "<html\><head\><title\>The
News of the Day</title\></head\>\n";
</head>
<body>
$_="<H1>The News of the
Day</H1>\n";
</body>
<footer>$_= "</body></html>\n";
</footer>
</page>
Snatch has a little micro parser that looks at the XML config file
and tucks the essentials into a list of hashes. The parser uses a DTD to know
what to look for. In this case I feed it the page element, then fiddle
with the options as needed (the options went into the config because it was
easy), then print the head and (upper) body portions of the
page. At this point I feed the parser the item element and send the
hashes off to LWP::UserAgent to be retrieved and printed. Then I print the
footer and it's done.
The parser is really the star of the show. It will pick up anything in the
DTD and return the contents in a hash. This makes it quite easy to add,
rename or delete fields. Adding the options took all of ten minutes,
including changing the code to look in the hash instead of a $variable.
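The micro parser idea can be sketched like this (a sketch only -- the real
one takes its field list from the DTD; here any <tag>text</tag> pair inside
the named element is picked up):

```perl
use strict;
use warnings;

# Minimal sketch of a micro parser: given raw XML text and an element
# name, pull out each matching block and return a list of hashrefs
# keyed by the child element names found inside.
sub parse_xml_sketch {
    my ($xml, $element) = @_;
    my @records;
    while ($xml =~ m!<\Q$element\E>(.*?)</\Q$element\E>!sg) {
        my $body = $1;
        my %rec;
        while ($body =~ m!<(\w+)>(.*?)</\1>!sg) {
            $rec{$1} = $2;           # child tag name => its contents
        }
        push @records, \%rec;
    }
    return @records;
}

my @items = parse_xml_sketch(
    "<item><name>population</name><url>http://x</url></item>", "item");
```

Each hash then carries everything the main loop needs: the name, the url,
the method and the <sub> code.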
You don't need to use the page layout, if you'd rather not. But it
has some useful things in it.
<options>
debug - up to 4, which only prints out the hash and retreats
headers - prints out the retrieved headers to a log file if you need them
lurl - use the local url instead of fresh data
</options>
<outfile> - specify where to print (or STDOUT)</outfile>
<dontwant> skip these items </dontwant>
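Judging from "debug=2 headers lurl", the options string is just
whitespace-separated words, with an optional =value. Turning that into a
hash is a one-liner (an assumption about the format, not the real code):

```perl
use strict;
use warnings;

# Sketch: bare words become flags set to 1, name=value pairs keep
# their value, e.g. "debug=2 headers lurl" => (debug=>2, headers=>1, lurl=>1)
sub parse_options {
    my ($opts) = @_;
    my %o;
    for my $word (split ' ', $opts) {
        my ($k, $v) = split /=/, $word, 2;
        $o{$k} = defined $v ? $v : 1;
    }
    return %o;
}
```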
While I'm getting the news and weather I take care of setting the PC's
clock too; this needs the Net::Time module.
Setting the Clock
<item>
<name></name>
<sub>
use Net::Time qw(inet_time);     # set the pc clock
my $host = "mozart.inet.co.th";
my $t = inet_time($host, 'udp');
my ($sec, $min, $hour) = (localtime($t))[0,1,2];
my ($old_sec, $old_min, $old_hour) = (localtime(time))[0,1,2];
my $min_diff = $min - $old_min;
$hour = $old_hour;               # keep the local hour (timezone offset)
my $sec_diff = 0;
if (abs($min_diff) < 30) {       # if more than 30 minutes out, set by hand!
    $sec_diff = 60*(60*$hour + $min) + $sec
              - (60*(60*$old_hour + $old_min) + $old_sec);
    if ($sec_diff > 3) {         # set if more than 3 seconds off
        my $time_new = "$hour:$min:$sec";
        my $rc = system("time $time_new");   # DOS time command
    }
}
my $now = scalar localtime(time);
$timestr = sprintf("%s %02d:%02d:%02d %+d",
    substr($now,0,16), $hour, $min, $sec, $sec_diff);
</sub>
</item>
<method>LWP all</method> is available for shorter pages. It
just grabs everything in one shot and returns it in a scalar. And
<method>LWP file=/perl/somefile.htm</method> is also available.
Note that this last returns only the filename (or an error message if
something went wrong). The Perl sub must open the file to do something
useful, unless you only want to store it somewhere, like this example that
returns a GIF.
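A <sub> body for the file method might look like this sketch: the method
has left the filename in $_, and the sub slurps the saved file to do
something with it (the helper name and report format are mine, not Snatch's):

```perl
use strict;
use warnings;

# Sketch of a <sub> for <method>LWP file=...</method>: the method
# hands over a filename, and the sub must open it itself.
sub slurp_saved {
    my ($file) = @_;
    open my $fh, '<', $file
        or return "<!-- could not open $file: $! -->";
    local $/;                    # slurp mode
    my $contents = <$fh>;
    close $fh;
    return "<p>saved " . length($contents) . " bytes to $file</p>";
}
```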
Bank Exchange Rates
<item>
<name>BangkokBank</name>
<url>http://www.bbl.co.th/cgi-bin/cgiwrap/nabbl/bankrates.cgi</url>
<method>LWP
file=/localweb/images/bankrates.gif</method>
<sub>$_</sub>
</item>
That $_ in the sub will print out any error messages, as most of them do.
Calling the parser from a <sub>
slashdot.xml is a kind of
headline news in XML format. Here is an example that will grab this XML file,
attach a DTD (so we know what to look for), parse it on the fly and then print
it.
Parse XML
<item>
<name>slashdotxml</name>
<url>http://slashdot.org/slashdot.xml</url>
<lurl>http://localhost/savesite/slashdot.xml</lurl>
<method>LWP all</method>
<sub><!--
my $dtd=<< "DTD";
<?xml version="1.0"?>
<!DOCTYPE ultramode [
<!ELEMENT story (title, url, time, author, department, topic, comments, section, image)>
]>
DTD
my $s = $dtd . $_;
my @ll=parseXML($s,"story");
die "Parse Failed" if not @ll;
my (@l, $temp, $text);
push @l,"<h2> Slashdot XML Mode</h2>\n";
push @l,"<table> \n";
foreach $temp (@ll) {
$text= << "ARTICLE";
<tr>
<td><font size='-1' >$temp->{'department'}</font></td>
<td><font size='-1'><b>$temp->{'topic'}</b></font></td>
<td><font size='-1'><A HREF=\"$temp->{'url'}\">$temp->{'title'}</a></font></td>
<td><font size='-1' color='red'>$temp->{'author'}</font></td></tr>
ARTICLE
push @l, $text;
}
push @l,"</table>\n";
$_=join' ',@l;
--></sub>
</item>
Linking multiple XML files
The link method just stuffs the XML files into a list and calls them. There is a
%sys hash that can be used to pass along information.
<item>
<name>do all these</name>
<method>link news.xml weather.xml</method>
</item>
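Dispatching that method string is simple enough to sketch (run_file() is
hypothetical here, standing in for whatever processes one config file):

```perl
use strict;
use warnings;

# Sketch of the link method: peel the file names off the <method>
# string and hand each one back to the main loop, passing %sys along
# so the files can share state (timers, results and so on).
sub link_files {
    my ($method, $sys) = @_;
    my @files = split ' ', $method;
    shift @files;                  # drop the leading "link"
    for my $file (@files) {
        # run_file($file, $sys);   # hypothetical: process one config file
    }
    return @files;
}
```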
Timers
Named timers are useful items, and starting one up is easy.
<item>
<name>start the timer</name>
<method>timer able</method>
</item>
Get the results in any later routine, or even a linked file, like so:
$_=sprintf "%.3f seconds",$sys{'timer able'}/1000;
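One way to build named timers is on Time::HiRes (an assumption -- the
article doesn't say how Snatch keeps time): starting a timer records the
moment, and reading it stores elapsed milliseconds in %sys under the
timer's name, matching the $sys{'timer able'} usage above.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my (%sys, %timer_start);

# Record the start moment for a named timer.
sub timer_start {
    my ($name) = @_;
    $timer_start{$name} = [gettimeofday];
}

# Store elapsed milliseconds in %sys under "timer <name>".
sub timer_read {
    my ($name) = @_;
    $sys{"timer $name"} = 1000 * tv_interval($timer_start{$name});
    return $sys{"timer $name"};
}
```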
The Snatch code is quite small for what it does. Since most of the Perl
code and config stuff is buried in the XML file, and everything gets passed
around in hashes, it's still only about 400 lines.
Snatch.pl config.xml

While I'm happy with it for the moment, I really should do something about
checking headers to see whether something has changed. And it will probably
be useful to embed POST and GET and some mail or news routines.
It should be fairly easy to load a different parser, e.g. fiddle
XML::Parser to return a list of anonymous hashes, but I haven't needed
it yet. I'm trying to keep this simple.
Background processing of various kinds, both local and remote, needs a
sleep hook, time handling routines and a proper scheduler, though I want the
minimum built in.
The second thing is a way to catalog and store the code fragments. For one,
I've got too many little XML files running around; they are adding up and I
am losing track of them. This needs a general mechanism for storing and
locating code with comments and addenda of various types, which probably
means a fancier DTD and parser. Another thing is an include element, which
will probably fall out of the above.
Comments and suggestions: robert@bangkokwizard.com
Anything not explained will be explained later, especially if you ask.