

Grabbing Data off the Web with Perl

(and then doing something useful with it)

WebSnatch runs off an XML config file that looks like this:
     $_="<h3>$1 Estimated Population</h3>";

WebSnatch will go get the url using LWP and use whatever is in sub to get something from the page. Sub can contain any Perl that will run in an eval, which is apparently anything, and it's only a little harder to debug buried inside there. You can also define a lurl pair, which can reference a copy of the page on your hard disk till you get it right. This doesn't necessarily mean the code will work over an HTTP connection, but it helps. The sub code can also be encased in a comment block. The only thing this does is ensure that the perlish line noise doesn't break the XML somehow. BTW, my definition of well-formed XML is that IE5.0 will display it properly, which fits this rather loose approach.
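
As a rough illustration of the eval idea (a sketch, not the actual WebSnatch code), the snippet below keeps the sub text in a string, puts the page in $_, and evals it. The page text, the number in it, and the helper name run_sub are all made up for the example; no network access is involved.

```perl
use strict;
use warnings;

# Invented stand-in for a fetched page (WebSnatch would get this via LWP).
my $page = "<p>World population: 1,234,567</p>";

# Hypothetical contents of a sub element: plain Perl text that works on $_.
my $sub_code = q{
    my $found = /population:\s*([\d,]+)/i;
    $_ = "<h3>$1 Estimated Population</h3>" if $found;
    $found;    # true only when the pattern matched
};

# Hypothetical helper: run the snippet against some text.
sub run_sub {
    my ($code, $text) = @_;
    local $_ = $text;              # the snippet reads and rewrites $_
    my $ok = eval $code;           # string eval compiles and runs the snippet
    die "sub failed: $@" if $@;    # surface errors hidden inside the eval
    return $ok ? $_ : undef;       # undef when the snippet found nothing
}

print run_sub($sub_code, $page), "\n";
```

Checking $@ after the eval is what takes some of the sting out of debugging code buried in a config file: a syntax error in the snippet shows up as a message instead of silently producing nothing.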

Note that I've put some HTML formatting on what is returned. This is probably very politically incorrect, as XML is supposed to be pure data without any of that presentation stuff in it. But it was awfully convenient to do it right there.
So convenient that I got downright wicked and defined the rest of the page as well.

  <options> debug=2 headers lurl</options> <!-- debug headers lurl -->
      $_= "<html\><head\><title\>The News of the Day</title\></head\>\n";
      $_="<H1>The News of the Day</H1>\n";
   <footer>$_= "</body></html>\n"; </footer>

WebSnatch has a little micro parser that looks at the XML config file and tucks the essentials into a list of hashes. The parser uses a DTD to know what to look for. In this case I feed it the page element, fiddle with the options as needed (options went into the config because it was easy), then print the head and (upper) body portions of the page. At this point I feed the parser the item element and send the hashes off to LWP::UserAgent to be retrieved and printed. Then I print the footer and it's done.

The parser is really the star of the show. It will pick up anything in the DTD and return the contents in a hash. This makes it quite easy to add, rename or delete fields. Adding the options took all of 10 minutes, including changing the code to look in the hash instead of a $variable.
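
To give a feel for how little a parser like this needs, here is a minimal sketch of the same idea. It is not the WebSnatch parser: a plain list of field names stands in for the DTD, the function name parse_elements is invented, and the config values (URL included) are made up. Like the real one, it does not handle nested items.

```perl
use strict;
use warnings;

# Invented config text; element names follow the article, values do not.
my $config = <<'XML';
<page>
  <options> debug=2 headers lurl</options>
</page>
<item>
  <url>http://example.com/news.html</url>
  <method>all</method>
  <sub><!-- $_ = "<h3>$1</h3>" if /Total: (\d+)/; --></sub>
</item>
<item>
  <method>nourl</method>
  <sub>$_ = "<p>static text</p>";</sub>
</item>
XML

# Hypothetical micro parser: one hash per <element>...</element> block.
sub parse_elements {
    my ($xml, $element, @fields) = @_;
    my @records;
    while ($xml =~ m{<$element>(.*?)</$element>}gs) {
        my $body = $1;
        my %record;
        for my $field (@fields) {
            next unless $body =~ m{<$field>\s*(.*?)\s*</$field>}s;
            my $value = $1;
            $value =~ s/^<!--\s*//;    # drop an optional comment wrapper
            $value =~ s/\s*-->$//;
            $record{$field} = $value;
        }
        push @records, \%record;
    }
    return @records;
}

my @items = parse_elements($config, 'item', qw(url lurl method sub));
for my $item (@items) {
    my $method = exists $item->{method} ? $item->{method} : 'chunk';
    my $target = exists $item->{url}    ? $item->{url}    : '(no url)';
    print "$method -> $target\n";
}
```

Because every field lands in a hash keyed by its element name, adding a field means adding one name to the list, which matches the ten-minute experience described above.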

<method>chunk</method> is the default if no method is specified. Make sure that sub returns false till you have what you want or it won't work right!
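
Here is one way to picture the chunk method, assuming (my reading, not something confirmed from the WebSnatch source) that the caller hands sub one chunk at a time and stops at the first true return. The chunks, the exchange rate, and the helper name first_hit are invented, and an array stands in for data arriving from LWP.

```perl
use strict;
use warnings;

# Invented chunks standing in for a page arriving piece by piece.
my @chunks = (
    "<html><head><title>Rates</title></head>",
    "<body><p>USD/THB: 40.25</p>",
    "<p>lots more page we never need</p></body></html>",
);

# Hypothetical sub text: false until the rate shows up, then true.
my $sub_code = q{
    my $hit = /USD\/THB:\s*([\d.]+)/;
    $_ = "<h3>Baht per dollar: $1</h3>" if $hit;
    $hit;
};

# Hypothetical driver: feed chunks to the snippet until it reports success.
sub first_hit {
    my ($code, @pieces) = @_;
    my $seen = 0;
    for my $piece (@pieces) {
        $seen++;
        local $_ = $piece;
        my $ok = eval $code;
        die "sub failed: $@" if $@;
        return ($_, $seen) if $ok;    # stop at the first true result
    }
    return (undef, $seen);            # never found what we wanted
}

my ($html, $used) = first_hit($sub_code, @chunks);
print "$html (after $used of ", scalar(@chunks), " chunks)\n";
```

If sub returned true on the first chunk, the loop would stop there with the wrong content, which is why it has to stay false until the match is in hand. A pattern that straddles two chunks would be missed by a driver this simple.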

<method> all </method> is available for shorter pages, and <method>file=/perl/somefile.htm</method> is also available. Note that this last one returns the filename only. The Perl sub must open the file to do something useful (unless you only want to store the file somewhere).
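
For the file method, a sub might look something like the sketch below. It assumes the filename arrives in $_, which is a guess on my part, and it uses a temporary file with invented contents in place of a page WebSnatch has saved.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the saved page; the contents are invented.
my ($fh, $filename) = tempfile(SUFFIX => '.htm', UNLINK => 1);
print {$fh} "<html><head><title>Saved Page</title></head></html>\n";
close $fh or die "close failed: $!";

# Hypothetical sub text: $_ holds only the file name, so open it ourselves.
my $sub_code = q{
    open(my $in, '<', $_) or die "cannot open $_: $!";
    local $/;                        # slurp the whole file in one read
    my $text = <$in>;
    close $in;
    my $hit = $text =~ m{<title>(.*?)</title>}s;
    $_ = "<h3>$1</h3>" if $hit;
    $hit;
};

my $result = do {
    local $_ = $filename;            # assumption: the name is passed in $_
    my $ok = eval $sub_code;
    die "sub failed: $@" if $@;
    $ok ? $_ : undef;
};
print "$result\n";
```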

<method>nourl</method> is good for socket handling; I've added an example with foreign exchange rates. It's also good for just outputting some HTML. The other methods call LWP, whereas this one just evals sub.

The WebSnatch code is quite small for what it does. Since most of the Perl code and config stuff is buried in the XML file and everything gets passed around in hashes, it's still only about 200 lines.
WebSnatch.pl   config.xml
package WebSnatch.zip 4k
The microparser doesn't handle nested items.
This was developed and tested using ActiveState's 509 release of Perl on Win32 only.
Changes: Timestamp: 5 May, 1998
nourl tag to run some code that doesn't need LWP.
Clean up the code a bit.

Comments and suggestions here: robert@bangkokwizard.com
Anything not explained will be explained later, especially if you complain about it.