XML::Filter::Digest(3pm)    User Contributed Perl Documentation

NAME
       XML::Filter::Digest

SYNOPSIS
           use strict;
           use XML::Filter::Digest;
           use XML::Handler::YAWriter;
           use IO::File;

           # $ARGV[0] is the digest script, $ARGV[1] the document to query.
           # The digest is written to standard output.
           my $digest = new XML::Filter::Digest(
               'Handler' => new XML::Handler::YAWriter(
                   'Output' => new IO::File( ">-" ),
                   'Pretty' => { 'AddHiddenNewLine' => 1 }
               ),
               'Script' => new XML::Script::Digest(
                   'Source' => { 'SystemId' => $ARGV[0] }
               )->parse(),
               'Source' => { 'SystemId' => $ARGV[1] }
           )->parse();

           0;

DESCRIPTION
       Most XML tools aim to parse some simple XML and to produce some
       formatted output. XML::Filter::Digest aims at the opposite.

       Many formats can now be parsed by a SAX Driver, and XPath offers a
       smart way to write queries on XML. XML::Filter::Digest is a PerlSAX
       Filter that queries XML and provides a simpler digest as its result.

       XML::Filter::Digest uses its own script language to formulate those
       digest queries. The scripts are parsed by XML::Script::Digest. To
       tell you straight, a digest script is well-formed XML.

       The following script defines that the result XML should have a root
       element called extract, containing several elements called section
       starting from the 4th HTML header. Those section elements contain
       id, title and intro elements, each containing the XPath string-value
       of its node as character data.

       The digest script parser silently ignores anything other than digest
       elements and collect elements. The digest element needs a name
       attribute defining the name of the root element, while a collect
       element additionally needs a node attribute defining the XPath query
       for the nested elements.

       Only a single digest element should exist within a script document,
       but the digest element does not need to be the root element of the
       document. Collect elements should be nested within the digest
       element. They may contain further collect elements recursively.

METHODS
       The XML::Filter::Digest object may act as a Filter to receive SAX
       events, or directly as a Driver if you provide a Source option to
       the parse method.
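
       The Filter role can be illustrated with a minimal sketch in plain
       Perl. It shows a buffering filter that receives PerlSAX-style events
       and replays them to its Handler only when end_document arrives,
       returning whatever the Handler's end_document returns. The packages
       My::RecordingHandler and My::BufferingFilter are hypothetical names
       invented for this illustration. This is not the implementation of
       XML::Filter::Digest, which also evaluates a digest script before it
       emits events.

```perl
use strict;
use warnings;

# Hypothetical downstream handler: records the element names it receives.
package My::RecordingHandler;
sub new            { my ($class) = @_; return bless { names => [] }, $class }
sub start_document { my ($self) = @_; $self->{names} = []; return }
sub start_element  { my ($self, $el) = @_; push @{ $self->{names} }, $el->{Name}; return }
sub end_element    { return }
sub characters     { return }
sub end_document   { my ($self) = @_; return join ',', @{ $self->{names} } }

# Hypothetical buffering filter: stores every event and replays the
# buffer to its Handler only when end_document is called.
package My::BufferingFilter;
sub new {
    my ($class, %opt) = @_;
    return bless { Handler => $opt{Handler}, events => [] }, $class;
}
# Reset the buffer so the object can be used for another document.
sub start_document { my ($self) = @_; $self->{events} = []; return }
sub start_element  { my ($self, $el) = @_; push @{ $self->{events} }, [ start_element => $el ]; return }
sub end_element    { my ($self, $el) = @_; push @{ $self->{events} }, [ end_element => $el ]; return }
sub characters     { my ($self, $ch) = @_; push @{ $self->{events} }, [ characters => $ch ]; return }
sub end_document {
    my ($self) = @_;
    my $handler = $self->{Handler};
    $handler->start_document({});
    for my $event (@{ $self->{events} }) {
        my ($method, $data) = @$event;
        $handler->$method($data);
    }
    # The filter's return value is the Handler's end_document return value.
    return $handler->end_document({});
}

package main;

my $handler = My::RecordingHandler->new;
my $filter  = My::BufferingFilter->new(Handler => $handler);

$filter->start_document({});
$filter->start_element({ Name => 'extract', Attributes => {} });
$filter->start_element({ Name => 'section', Attributes => {} });
$filter->characters({ Data => 'hello' });
$filter->end_element({ Name => 'section' });
$filter->end_element({ Name => 'extract' });

# Nothing has reached the handler yet: all events are still buffered.
my $seen_before = scalar @{ $handler->{names} };

my $result = $filter->end_document({});
print "seen before end_document: $seen_before\n";
print "result: $result\n";
```

       In this sketch the handler sees no events until end_document is
       called, so the whole event stream is held in memory first.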

       The filter is reusable for batching multiple documents, provided you
       arrange for the chain of Handlers to be reusable as well. The filter
       requires a Handler and a Script option before the start_document
       method is called.

       The XML::Script::Digest object may act as a Handler to receive SAX
       events, or be used directly if you provide a Source option to the
       parse method. The script object is reusable, and a single script
       object can be used for several filter objects.

       new
           Creates a new XML::Filter::Digest or XML::Script::Digest object.
           Default options for parsing, described below, are passed as
           key-value pairs or as a single hash. Options may be changed
           directly in the object.

       parse
           Parses a document by embedding XML::Parser::PerlSAX. This allows
           XML::Filter::Digest to be used directly as a Driver and
           simplifies generating a ready-to-use XML::Script::Digest object.
           Options, described below, are passed as key-value pairs or as a
           single hash. Options passed to parse() override the default
           options in the object for the duration of the parse.

       start_document
           Notifies the object about the start of a new document. The
           object will do its cleanup if it is reused.

       end_document
           Notifies the object about the end of the document.

           The return value of XML::Script::Digest is $self, which is used
           as the return value of the parse method.

           XML::Filter::Digest will walk through the script object to
           generate a stream of SAX events for its Handler. The return
           value of XML::Filter::Digest is the return value of the
           end_document method of the Handler object.

OPTIONS
       Script
           XML::Script::Digest objects can be used for several
           XML::Filter::Digest objects.

       Handler
           Default SAX Handler to receive events from XML::Filter::Digest
           objects.

       Source
           XML::Filter::Digest and XML::Script::Digest can be used on raw
           XML directly by calling the parse() method. To do this, the
           Source option is required for embedding the PerlSAX parser.

           The `Source' hash may contain the following parameters:

           ByteStream
               The raw byte stream (file handle) containing the document.

           String
               A string containing the document.

           SystemId
               The system identifier (URI) of the document.

           Encoding
               A string describing the character encoding.

           If more than one of `ByteStream', `String', or `SystemId' is
           given, preference is given first to `ByteStream', then `String',
           then `SystemId'.

NOTES
       XML::Filter::Digest is not a streaming filter but a buffering
       filter, as all processing is done in the end_document method. This
       could cause the Perl interpreter to run out of memory on large XML
       files.

       It is best to define a ulimit to prevent the system from going
       offline for several minutes until it detects that there is really no
       memory left to seize anywhere in the network. Adding network swap
       space ad infinitum only makes things worse, so I have the following
       line in my .bashrc. Other operating systems offer similar
       constraints.

           ulimit -v 98304 -d 98304 -m 98304

       This line is OK on a single-user machine with 32 MB RAM and 128 MB
       swap. I can raise this value if I know that I want to walk the dog.

BUGS
       not yet implemented:
           Reuse of XML::Filter::Digest objects.

       XML::XPath bugs:
           XML::Filter::Digest is tested with XML::XPath 0.20 and 0.24.
           Both versions worked after a patch to XML::XPath::Builder.
           Versions 0.25 and later are expected to work out of the box.

           *** ../XML-XPath-0.24-orig/XPath/Builder.pm Thu Feb 24 20:46:03 2000
           --- XPath/Builder.pm    Thu May 11 03:57:03 2000
           ***************
           *** 30,37 ****
                 #$node->[node_namespace] = $e->namespace($tag);
                 $node->[node_children] = [];
           !     while (@$attribs) {
           !         my ($key, $val) = (shift @$attribs, shift @$attribs);
                     my @newattr;
                     $newattr[node_parent] = $node;
                     $newattr[node_key] = $key;
           --- 30,37 ----
                 #$node->[node_namespace] = $e->namespace($tag);
                 $node->[node_children] = [];
           !     foreach (keys %$attribs) {
           !         my ($key, $val) = ($_, $attribs->{$_});
                     my @newattr;
                     $newattr[node_parent] = $node;
                     $newattr[node_key] = $key;

       other bugs:
           The NotSoFree License is incompatible with the GNU General
           Public License.
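
       The XML::XPath::Builder patch shown in this section replaces a loop
       that consumes attributes as a flat key/value list with one that
       iterates over a hash reference. The following sketch in plain Perl
       contrasts the two loop styles. The attribute data and the function
       names pairs_from_list and pairs_from_hash are hypothetical and exist
       only for this illustration; the sketch does not use XML::XPath.

```perl
use strict;
use warnings;

# Hypothetical attribute data in the two shapes the loops expect.
my @flat_list = ( id => 'sec1', class => 'intro' );   # flat key/value list
my %by_name   = ( id => 'sec1', class => 'intro' );   # hash of attributes

# Loop style before the patch: take two items at a time from an array ref.
sub pairs_from_list {
    my ($attribs) = @_;
    my @copy = @$attribs;    # work on a copy so the caller's list survives
    my @pairs;
    while (@copy) {
        my ($key, $val) = (shift @copy, shift @copy);
        push @pairs, "$key=$val";
    }
    my @sorted = sort @pairs;
    return @sorted;
}

# Loop style after the patch: iterate over the keys of a hash ref.
sub pairs_from_hash {
    my ($attribs) = @_;
    my @pairs;
    foreach (keys %$attribs) {
        my ($key, $val) = ($_, $attribs->{$_});
        push @pairs, "$key=$val";
    }
    my @sorted = sort @pairs;   # hash key order is not stable, so sort
    return @sorted;
}

my @from_list = pairs_from_list(\@flat_list);
my @from_hash = pairs_from_hash(\%by_name);

print join(' ', @from_list), "\n";
print join(' ', @from_hash), "\n";
```

       Both loops yield the same pairs here. They differ only in the data
       shape they accept, so each fails if handed the other shape.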

AUTHOR
       Michael Koehne, Kraehe@Copyleft.De

       (c) 2000 NotSoFree License

SEE ALSO
       the XML::Parser::PerlSAX manpage and the XML::XPath manpage

11/May/2000                 perl 5.005, patch 03