NAME
HTML::HTML5::Parser - parse HTML reliably
SYNOPSIS
use HTML::HTML5::Parser;
my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_string(<<'EOT');
Foo bar.
BazQuux.
EOT
my $fdoc = $parser->parse_file( $html_file_name );
my $fhdoc = $parser->parse_fh( $html_file_handle );
DESCRIPTION
This library is substantially the same as the non-CPAN module
Whatpm::HTML. Changes include:
* Provides an XML::LibXML-like DOM interface. If you usually use
XML::LibXML's DOM parser, this should be a drop-in solution for
tag soup HTML.
* Constructs an XML::LibXML::Document as the result of parsing.
* Via bundling and modifications, removed external dependencies on
non-CPAN packages.
Constructor
"new"
$parser = HTML::HTML5::Parser->new;
The constructor does not do anything interesting.
XML::LibXML-Compatible Methods
"parse_file", "parse_html_file"
$doc = $parser->parse_file( $html_file_name [,\%opts] );
This method parses an HTML document from a file or over the network;
$html_file_name can be either a filename or a URL.
Options include 'encoding' to indicate file encoding (e.g. 'utf-8')
and 'user_agent' which should be a blessed "LWP::UserAgent" object
to be used when retrieving URLs.
If requesting a URL and the response Content-Type header indicates
an XML-based media type (such as XHTML), XML::LibXML::Parser will be
used automatically (instead of the tag soup parser). The XML parser
can be told to use a DTD catalogue by setting the option
'xml_catalogue' to the filename of the catalogue.
HTML (tag soup) parsing can be forced using the option 'force_html',
even when an XML media type is returned. If an options hashref was
passed, parse_file will set $options->{'parser_used'} to the name of
the class used to parse the URL, to allow the calling code to
double-check which parser was used afterwards.
If an options hashref was passed, parse_file will set
$options->{'response'} to the HTTP::Response object obtained by
retrieving the URI.
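As a rough illustration of the options hashref, the sketch below parses a local file and then reads back what parse_file stored in the hash. The temporary file and its contents are invented for the example. When fetching a URL, a 'user_agent' option holding an LWP::UserAgent object could be passed in the same hashref.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Write a small tag-soup document to a temporary file (example data only).
my ($tmp_fh, $html_file_name) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$tmp_fh} "<title>Demo</title><p>Hello, world";
close $tmp_fh or die "Cannot close $html_file_name: $!";

my $parser = HTML::HTML5::Parser->new;

# Keep the options in a named hash so the values parse_file sets
# can be read back afterwards.
my %opts = (encoding => 'utf-8');
my $doc  = $parser->parse_file($html_file_name, \%opts);

# Set by parse_file because an options hashref was passed.
print "Parser used: $opts{parser_used}\n";
print "Root element: ", $doc->documentElement->nodeName, "\n";
```

Passing a reference to a named hash, rather than an anonymous hashref, is what makes 'parser_used' and 'response' available to the caller after the call returns.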
"parse_fh", "parse_html_fh"
$doc = $parser->parse_fh( $io_fh [,\%opts] );
"parse_fh()" parses an IOREF or a subclass of "IO::Handle".
Options include 'encoding' to indicate file encoding (e.g. 'utf-8').
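A minimal sketch of parse_fh follows. It uses an in-memory filehandle so that the example is self-contained; a handle opened on a real file should behave the same way. The sample markup is made up for the illustration.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = "<title>Demo</title>\n<p>Read from a handle\n";

# Open a read-only, in-memory filehandle on the string above.
open my $html_file_handle, '<', \$html
    or die "Cannot open in-memory handle: $!";

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_fh($html_file_handle, { encoding => 'utf-8' });
close $html_file_handle;

# Serialise the resulting XML::LibXML::Document.
print $doc->toString, "\n";
```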
"parse_string", "parse_html_string"
$doc = $parser->parse_string( $html_string [,\%opts] );
This function is similar to "parse_fh()", but it parses an HTML
document that is available as a single string in memory.
Options include 'encoding' to indicate the encoding of the string
(e.g. 'utf-8').
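The following sketch shows parse_string on a short piece of tag soup, followed by ordinary XML::LibXML DOM calls on the result. The input string is invented for the example.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Neither <p> is closed; the HTML5 parsing algorithm closes them implicitly.
my $doc = $parser->parse_string('<title>Demo</title><p>First<p>Second');

# The result is an XML::LibXML::Document, so the usual DOM calls apply.
my @paragraphs = $doc->getElementsByTagName('p');
print scalar(@paragraphs), " paragraph(s)\n";
print $_->textContent, "\n" for @paragraphs;
```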
"load_xml", "load_html"
Wrappers for the parse_* functions. These should be roughly
compatible with the equivalently named functions in XML::LibXML.
Note that "load_xml" first attempts to parse as real XML, falling
back to HTML5 parsing; "load_html" just goes straight for HTML5.
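A short sketch of "load_html", assuming it accepts the same named arguments as the XML::LibXML function of that name ('string', 'location' or 'IO'); that argument list is an assumption drawn from XML::LibXML, not something stated above.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Assumption: 'string' is accepted as in XML::LibXML's load_html.
my $doc = $parser->load_html(string => '<p>Hello from load_html');

print $doc->documentElement->nodeName, "\n";
```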
The push parser and SAX-based parser are not supported. Trying to change
an option (such as recover_silently) will make HTML::HTML5::Parser carp
a warning. (But you can inspect the options.)
Additional Methods
The module provides a few further methods for obtaining extra,
non-DOM data from DOM nodes.
"error_handler"
Get/set an error handling function. Must be set to a coderef or
undef.
The error handling function will be called with a single parameter,
an HTML::HTML5::Parser::Error object.
"errors"
Returns a list of errors that occurred during the last parse.
See HTML::HTML5::Parser::Error.
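To illustrate "error_handler" and "errors" together, the sketch below collects errors through a handler during the parse and then compares them with the list returned afterwards. The input is deliberately sloppy and made up for the example. What an error object looks like when printed depends on HTML::HTML5::Parser::Error and is not assumed here.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Collect errors as they are reported; the handler receives one
# HTML::HTML5::Parser::Error object per call.
my @seen;
$parser->error_handler(sub { push @seen, $_[0] });

# No doctype, mis-nested inline tags and a stray </br>.
my $doc = $parser->parse_string('<p><b><i>Foo</b> bar</i>.<p>Baz</br>Quux.');

# Errors from the last parse, fetched after the fact.
my @errors = $parser->errors;
printf "%d error(s) via handler, %d via errors()\n",
    scalar(@seen), scalar(@errors);

# Print each error object in string form.
print "$_\n" for @errors;
```

A handler suits cases where errors should be logged or acted upon as they occur; "errors" is simpler when a summary after parsing is enough.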
"compat_mode"
$mode = $parser->compat_mode( $doc );
Returns 'quirks', 'limited quirks' or undef (standards mode).
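As a small illustration, the sketch below parses one document with a doctype and one without, and reports the compatibility mode of each. The two sample documents and the hash key names are invented for the example.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Two sample inputs (hypothetical), differing only in the doctype.
my %html = (
    with_doctype    => "<!DOCTYPE html>\n<title>A</title><p>Text",
    without_doctype => "<title>B</title><p>Text",
);

my %mode;
for my $name (sort keys %html) {
    my $doc = $parser->parse_string($html{$name});
    $mode{$name} = $parser->compat_mode($doc);

    # undef means standards mode, so substitute a label for printing.
    printf "%-16s %s\n", $name, $mode{$name} // 'standards mode';
}
```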
"dtd_public_id"
$pubid = $parser->dtd_public_id( $doc );
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the Public
Identifier of the DTD used (if any).
"dtd_system_id"
$sysid = $parser->dtd_system_id( $doc );
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the System
Identifier of the DTD used (if any).
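The two DTD methods can be sketched together. The example below parses a document carrying an XHTML 1.0 Strict doctype, chosen only because it has both identifiers, and prints whatever the parser recorded.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = <<'EOT';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<title>Demo</title>
<p>Hello
EOT

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string($html);

my $pubid = $parser->dtd_public_id($doc);
my $sysid = $parser->dtd_system_id($doc);

# Either value may be undef if the doctype omitted it.
print "Public ID: ", $pubid // '(none)', "\n";
print "System ID: ", $sysid // '(none)', "\n";
```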
"source_line"
($line, $col) = $parser->source_line( $node );
$line = $parser->source_line( $node );
In scalar context, "source_line" returns the line number of the
source code that started a particular node (element, attribute or
comment).
In list context, returns a line/column pair. (Tab characters count
as one column, not eight.)
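A minimal sketch of "source_line" in both contexts follows. The input document is made up for the example, and the @positions array exists only to keep the results for inspection.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = <<'EOT';
<!DOCTYPE html>
<title>Demo</title>
<p>First paragraph
<p>Second paragraph
EOT

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string($html);

my @positions;
for my $node ($doc->getElementsByTagName('p')) {
    my ($line, $col) = $parser->source_line($node);   # list context
    my $line_only    = $parser->source_line($node);   # scalar context
    push @positions, [ $line, $col, $line_only ];

    printf "<%s> starts at line %s, column %s\n",
        $node->nodeName, $line // '?', $col // '?';
}
```

Positions like these are mainly useful for error reporting, for example pointing a user at the place in their source where a problematic element begins.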
SEE ALSO