
Snatch DTD

<!DOCTYPE config [
<!ELEMENT item (name, method?, sub?,
time?, interval?, when?,
url?, lurl?, dontwant?, options?, flush?)>

<!ELEMENT page (head?, footer?, outfile?, options?, dontwant? )>
<!ELEMENT name (#PCDATA)>
]>

Wednesday, February 23, 2000 20:25:06
No Nested Elements
No Attributes

Doctype

This is a name for this Document Descriptor. The Config DTD is intended to provide access to all methods and data types. It is also the test DTD.
I was also thinking of an alarm doctype as an example of a more restricted type.
<!DOCTYPE alarm [ <!ELEMENT alarm ( name, sub?, time?, interval?, when?)> ]>
Page may evolve to a doctype of its own, and the page specific code in Snatch might migrate to a library.

item

<!ELEMENT item (name, method?, sub?, time?, interval?, when?....
The item tag contains names that Snatch knows what to do with. item is itself a list of anonymous hashes, @item in the main namespace.
name is the only required element. (everything else has a question mark after it.)
method runs a vast and growing switch statement that dispatches to a variety of built-in services.
time, interval and when are by-name entrances into the world of @alarm and the alarm method.
sub is perl code or a filename. sub is not always needed.
url and lurl are fairly obvious. lurl is a local mirror file, good for testing.
dontwant is an array, @dontwant, that gets scanned before Snatch does anything with an item. @dontwant can be manipulated directly from sub code, for alarms that turn themselves off (or on) and for turning off later items in the same file.
options control things like debugging and using local rather than Internet-wide URLs for LWP. Statements such as <options>EnableCrashmode=true</options> result in a global like this:
$sys{'EnableCrashmode'}='true';
%sys is the major global variable and contains most everything interesting. Many of the elements of item are global Snatch variables or code portals.
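
As a rough illustration of how an options string could end up in %sys, here is a minimal sketch. The helper name set_options is hypothetical, and treating a bare word such as headers as a flag set to 1 is an assumption; Snatch itself may handle these differently.

```perl
use strict;
use warnings;

our %sys;    # stand-in for Snatch's global settings hash

# Hypothetical helper: fold one <options> string into %sys.
sub set_options {
    my ($text) = @_;
    for my $opt (split ' ', $text) {
        my ($key, $value) = split /=/, $opt, 2;
        # Assumption: a bare word such as "headers" acts as a flag set to 1.
        $sys{$key} = defined $value ? $value : 1;
    }
    return;
}

set_options('EnableCrashmode=true');
set_options('lurl headers debug=3');
print "$_=$sys{$_}\n" for sort keys %sys;
```

Keeping every setting in one hash means sub code in any item can read or change it by name.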

page

<!ELEMENT page (head?, footer?, outfile?, options?, dontwant? )>
The page element assumes (at this point) that you are going to be printing to an output file; if no outfile is given, output goes to STDOUT.
head and footer methods can be defined. These will be generic HTML or perl code spitting out HTML. If no head element is defined, Snatch generates minimal startup code. footer does the same at the end of the document, although it is declared at the beginning. page elements are not scanned together with the item elements because of the way the parser works, so they are position-insensitive in the XML document. This is taken care of inside Snatch, which handles page first, opens the file, saves footer for later, and then scans and runs the items.
NB: a page method is needed, also, to enable opening and closing output files from the same document.
NB: an include method would probably be generally useful.
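
The ordering described above (page first, then the items, footer last) might look something like the sketch below. run_page and its callback argument are hypothetical names, and the default head and footer strings are placeholders, not what Snatch actually emits.

```perl
use strict;
use warnings;

# Hypothetical driver: handle page first, run the items, print the footer last.
sub run_page {
    my ($page, $items, $run_item) = @_;

    # Open the outfile if one is given, otherwise fall back to STDOUT.
    my $out;
    if (defined $page->{outfile}) {
        open $out, '>', $page->{outfile}
            or die "cannot open $page->{outfile}: $!";
    }
    else {
        $out = \*STDOUT;
    }

    # Placeholder defaults standing in for Snatch's minimal startup code.
    my $head   = defined $page->{head}   ? $page->{head}   : "<html><body>\n";
    my $footer = defined $page->{footer} ? $page->{footer} : "</body></html>\n";

    print {$out} $head;
    print {$out} $run_item->($_) for @$items;
    print {$out} $footer;    # saved until every item has run
    close $out if defined $page->{outfile};
    return;
}

my @items = ({ name => 'population' }, { name => 'LWP Simpler' });
run_page({}, \@items, sub { "<p>$_[0]{name}</p>\n" });
```

Because the footer is only a saved string until the end, it does not matter where the page element sits in the XML file.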

The Small Parser

The built in parser takes a parameter, item is the default, and builds a list of regexes to match anything on the element list. So our DTD:
<!ELEMENT item (name, method?, sub?, url, outfile, when?, options?....
returns a hash that looks kinda like this.
%h=(
    name    => 'LWP Simpler',
    options => 'lurl headers debug=3',
    url     => 'http://slashdot.com/',
    outfile => 'slashdot.html',
    method  => 'LWP all',
    when    => 'every morning',
    sub     => 'update.pl',
);
Anything not on the DTD doesn't make it into the hash. Conversely, if you want to add a tag, just put it in the DTD and it will show up as a hash element. To put it another way:

<!ELEMENT item (name, EnableCrashmode? enables
<EnableCrashmode>true</EnableCrashmode> and its value can be found like this:
$mode=$item->{'EnableCrashmode'};
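
To make the idea concrete, here is a minimal sketch of a DTD-driven flat parser in the same spirit. It is not Snatch's actual code: small_parse is a hypothetical name, the DTD and the color tag are made up for the example, and silently dropping an item that lacks a name is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sketch of a DTD-driven flat parser; not Snatch's actual code.
sub small_parse {
    my ($dtd, $xml, $element) = @_;
    $element = 'item' unless defined $element;    # item is the default

    # Pull the child names out of <!ELEMENT item (...)>, dropping "?" marks.
    my ($list) = $dtd =~ /<!ELEMENT\s+\Q$element\E\s*\(([^)]*)\)/
        or return;
    my @names = grep { length }
                map  { (my $n = $_) =~ s/[?\s]//g; $n }
                split /,/, $list;

    my @records;
    while ($xml =~ m{<\Q$element\E>(.*?)</\Q$element\E>}sg) {
        my $body = $1;
        my %h;
        for my $name (@names) {
            # One regex per declared name; undeclared tags never match.
            if ($body =~ m{<\Q$name\E>\s*(.*?)\s*</\Q$name\E>}s) {
                $h{$name} = $1;
            }
        }
        # Assumption: an item without the required name is dropped silently.
        push @records, \%h if defined $h{name};
    }
    return @records;
}

my $dtd = '<!ELEMENT item (name, method?, url?, EnableCrashmode?)>';
my $xml = <<'XML';
<item>
  <name>population</name>
  <url>http://www.census.gov/cgi-bin/ipc/popclockw/</url>
  <EnableCrashmode>true</EnableCrashmode>
  <color>red</color>
</item>
XML

my ($item) = small_parse($dtd, $xml);
my $mode = $item->{'EnableCrashmode'};
print "$mode\n";
```

The color tag is not on the element list, so it never reaches the hash, while EnableCrashmode shows up simply because it was declared.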

Note that the default behavior for strange methods that are contained in braces, {}, is to run them as perl code, with the sub run last. So these are good for little one-item conditional code. For variables that cross items in a file or series of files, the options variation is available.

Snatch XML

With the Small Parser we don't get involved with nested tags and attributes. XML-Parser can be called if necessary (after I add a function to escape some stuff :-), but simple is doing just fine right now.
The criterion for judging the well-formedness of the XML file is whether it chokes the current version of IE5. The Small Parser itself is more forgiving and will merely eat an erroneous item (possibly silently), so if a method isn't working, do check the XML file.

<item>
     <name>population</name>
     <method>LWP chunk</method>
     <url>http://www.census.gov/cgi-bin/ipc/popclockw/</url>
     <sub>
            m!<h1>(.+?)</h1>!si;
            $_="<h3>$1 Estimated Population</h3>";
     </sub>
</item>
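
One plausible way for sub code like this to be applied is to put the fetched chunk in $_ and eval the code as a string. The sketch below assumes exactly that; run_sub is a hypothetical helper and the input page is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: run a <sub> body against a fetched chunk held in $_.
sub run_sub {
    my ($code, $chunk) = @_;
    local $_ = $chunk;
    no strict;    # the snippet is free-form user code
    eval $code;
    die "sub failed: $@" if $@;
    return $_;
}

my $code = q{
    m!<h1>(.+?)</h1>!si;
    $_="<h3>$1 Estimated Population</h3>";
};

# Made-up input standing in for the page that LWP would fetch.
print run_sub($code, "<html><h1>12345</h1></html>"), "\n";
```

The sub code never names its input or output; it just reads and rewrites $_, which keeps one-item snippets short.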

No Nested Elements
See the way the tags line up? No nesting.
If you need something more complicated you can:
1. write a little DTD for the items and reparse it, inserting into the original @item list by hand.
2. write a new 20 line Small Parser that handles them.
3. rethink your problem
(still trying to keep it simple :-).
No <tag/>
These kinds of single tags won't make it through the Small Parser either.
No Attributes
this is ok:
<name> population </name>
this is not ok:
<name att="some attribute"> population </name>
These limitations are not severe. Snatch is for processing small chunks of XML in a linear fashion, not as a general purpose device.
NB: reparsing, as in solution 1, above, works very well if you really need it.

notes on sub

The sub element has a few tricks to it.

<sub> <!--
--> </sub>

The HTML comments <!-- --> can be included. The above example doesn't need them, but sometimes perl code can break the XML, usually when there are comments in it. The comments are stripped out later.

<sub> /perl/snatch/DoSomethingBig.pl </sub>
If it looks like a filename, Snatch assumes it is a filename and runs it. See Xanslate.XML, the XML-with-embedded-perl-code to HTML translator, for an example of something not even comments can help with.
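
The two tricks can be sketched together: strip the HTML comment markers, then decide whether what is left is a filename or code. classify_sub is a hypothetical name, and the "single path-like token ending in .pl" test is an assumption about what "looks like a filename" means.

```perl
use strict;
use warnings;

# Hypothetical classifier for a <sub> body: comment markers are stripped first,
# then a body that is a single path-like token ending in .pl counts as a file.
sub classify_sub {
    my ($body) = @_;
    $body =~ s/<!--|-->//g;      # drop the HTML comment markers only
    $body =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
    return (file => $body) if $body =~ m{^[\w./\\:-]+\.pl$};
    return (code => $body);
}

my ($kind, $text) = classify_sub(' /perl/snatch/DoSomethingBig.pl ');
print "$kind: $text\n";

($kind, $text) = classify_sub('<!-- $_ = uc $_; -->');
print "$kind: $text\n";
```

A caller could then hand a file to perl's do and a code string to eval.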