NAME
HTML::Data::Parser - parses data embedded in HTML
SYNOPSIS
Be like Google! Google Rich Snippets supports RDFa, Microdata and
Microformats, so why shouldn't you?
use RDF::Trine;
use HTML::Data::Parser;
# $base_uri (a URI string) and $markup (the HTML source) are supplied by
# the caller.
my $parser = HTML::Data::Parser->new(
    parse_rdfa         => 1,
    parse_grddl        => 0,
    parse_microformats => undef,
    parse_microdata    => undef,
    parse_n3           => undef,
);
my $model  = RDF::Trine::Model->temporary_model;
my $writer = RDF::Trine::Serializer->new('RDFXML');
$parser->parse_into_model($base_uri, $markup, $model);
print $writer->serialize_model_to_string($model);
DESCRIPTION
This module parses data embedded in HTML. It understands the following
standards and patterns for embedding data:
* RDFa
* Microformats
* GRDDL
* Microdata
* N3-in-HTML
This module is just a wrapper around RDF::RDFa::Parser,
HTML::Microformats, XML::GRDDL, HTML::HTML5::Microdata::Parser and
HTML::Embedded::Turtle. It is a subclass of RDF::Trine::Parser, so it
inherits that class's interface.
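Because the parser inherits the RDF::Trine::Parser interface, its
streaming "parse" method should accept a handler callback in the usual
Trine way. The following is a minimal sketch rather than a definitive
recipe: the base URI and the markup are made-up illustrative values, and
it assumes the module and its RDFa dependencies are installed.

```perl
use strict;
use warnings;
use RDF::Trine;
use HTML::Data::Parser;

# Illustrative inputs; both values are assumptions, not from the module docs.
my $base_uri = 'http://example.com/';
my $markup   = <<'HTML';
<html><body>
<p xmlns:dc="http://purl.org/dc/terms/" about="" property="dc:title">Hello</p>
</body></html>
HTML

# Only RDFa is requested explicitly; other options keep their defaults.
my $parser = HTML::Data::Parser->new(parse_rdfa => 1);

# Collect each statement as it is handed to the callback.
my @statements;
$parser->parse($base_uri, $markup, sub {
    my ($statement) = @_;    # an RDF::Trine::Statement object
    push @statements, $statement;
});

print $_->as_string, "\n" for @statements;
```

Collecting statements in a handler avoids building a model when the data
only needs to be inspected or filtered once.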
Constructor Options
The methods presented by an HTML::Data::Parser object are exactly the
same as any other common-or-garden RDF::Trine::Parser object. The
constructor options are where it gets interesting.
The options accepted are:
* dom_parser - set to 'xml' or 'html5' to determine which parser to
use on input strings. Defaults to 'html5'.
* named_graphs - boolean; whether to pass quads (rather than triples)
to the handler in the "parse" method. For advanced use only.
* on_error - what to do when an error occurs. (Currently only a subset
of all possible errors is covered by this option. Some errors
thrown by other modules won't be caught.) Set to 'warn', 'die',
'ignore' or a callback coderef (which is passed one parameter - the
error message as a string). Defaults to 'warn'.
* options_microdata - a hashref of options to pass through to
"HTML::HTML5::Microdata::Parser->new" if/when parsing microdata.
* options_rdfa - an RDF::RDFa::Parser::Config object to use if/when
parsing RDFa.
* parse_grddl - a "troolean" (yes/no/maybe-so). Set to true to
indicate that you want GRDDL to be parsed. Set to false to indicate
that you want it to be ignored. Set to undef if you don't really
care: this will parse GRDDL if the required module (XML::GRDDL) is
installed and working, but won't complain if it's not. Defaults to
false.
* parse_microdata - another troolean. Defaults to undef.
* parse_microformats - another troolean. Defaults to undef.
* parse_n3 - another troolean. Defaults to undef.
* parse_rdfa - another troolean. Defaults to 1.
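The three-way "troolean" logic described above can be sketched in plain
Perl. This is an illustration only, not the module's actual code:
"should_parse" is a hypothetical helper, and the choice to die when a
format is explicitly requested but its module is missing is an
assumption made for the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Data::Parser: decides whether a
# format should be parsed, given its troolean setting and backing module.
sub should_parse {
    my ($setting, $module) = @_;

    # Defined but false: the caller explicitly opted out.
    return 0 if defined $setting && !$setting;

    # Try to load the backing module without letting a failure escape.
    (my $file = "$module.pm") =~ s{::}{/}g;
    my $loaded = eval { require $file; 1 };
    return 1 if $loaded;

    # undef ("maybe-so"): quietly skip a format whose module is missing.
    return 0 unless defined $setting;

    # True: the caller insisted, so a missing module is treated as an
    # error (an assumption of this sketch).
    die "Cannot parse: $module is not available\n";
}

# List::Util ships with Perl, so it stands in for an installed module.
print should_parse(undef, 'List::Util') ? "parse\n" : "skip\n";
```

The point of the undef setting is that code can be written once and run
on systems with or without the optional parsers installed.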
SEE ALSO
This module is just a wrapper around: RDF::RDFa::Parser,
HTML::Microformats, XML::GRDDL, HTML::HTML5::Microdata::Parser,
HTML::Embedded::Turtle.
And around these DOM parsers: HTML::HTML5::Parser, XML::LibXML.
It sits on top of Trine: RDF::Trine, RDF::Trine::Parser,
RDF::TrineShortcuts.
For more information on processing web data in Perl:
.
AUTHOR
Toby Inkster.
COPYRIGHT
Copyright 2010 Toby Inkster
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.