NAME
    HTML::HTML5::Parser - parse HTML reliably

SYNOPSIS
        use HTML::HTML5::Parser;

        my $parser = HTML::HTML5::Parser->new;
        my $doc    = $parser->parse_string(<<'EOT');
        Foo

        Foo bar.

        Baz
        Quux.
        EOT

        my $fdoc   = $parser->parse_file( $html_file_name );
        my $fhdoc  = $parser->parse_fh( $html_file_handle );

DESCRIPTION
    This library is substantially the same as the non-CPAN module
    Whatpm::HTML. Changes include:

    *   Provides an XML::LibXML-like DOM interface. If you usually use
        XML::LibXML's DOM parser, this should be a drop-in solution for
        tag soup HTML.

    *   Constructs an XML::LibXML::Document as the result of parsing.

    *   Via bundling and modifications, removed external dependencies on
        non-CPAN packages.

  Constructor
    "new"
            $parser = HTML::HTML5::Parser->new;

        The constructor does not do anything interesting.

  XML::LibXML-Compatible Methods
    "parse_file", "parse_html_file"
            $doc = $parser->parse_file( $html_file_name [,\%opts] );

        This function parses an HTML document from a file or from the
        network; $html_file_name can be either a filename or a URL.

        Options include 'encoding' to indicate the file encoding (e.g.
        'utf-8') and 'user_agent', which should be a blessed
        "LWP::UserAgent" object to be used when retrieving URLs.

        If requesting a URL and the response Content-Type header
        indicates an XML-based media type (such as XHTML),
        XML::LibXML::Parser will be used automatically (instead of the
        tag soup parser). The XML parser can be told to use a DTD
        catalogue by setting the option 'xml_catalogue' to the filename
        of the catalogue.

        HTML (tag soup) parsing can be forced using the option
        'force_html', even when an XML media type is returned.

        If an options hashref was passed, parse_file will set
        $options->{'parser_used'} to the name of the class used to parse
        the URL, so that the calling code can check afterwards which
        parser was used.

        If an options hashref was passed, parse_file will set
        $options->{'response'} to the HTTP::Response object obtained by
        retrieving the URI.

    "parse_fh", "parse_html_fh"
            $doc = $parser->parse_fh( $io_fh [,\%opts] );

        "parse_fh()" parses an IOREF or a subclass of "IO::Handle".

        Options include 'encoding' to indicate the file encoding (e.g.
        'utf-8').
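
        As a minimal sketch of the call described above, the following
        opens a read-only filehandle on an in-memory string and passes
        it to "parse_fh()" with the 'encoding' option. The sample markup
        and the variable names are illustrative assumptions, and
        "getElementsByLocalName" is a method of the returned
        XML::LibXML::Document rather than of this module.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

# Hypothetical sample markup, held in memory so the example needs no files.
my $html = "<!DOCTYPE html><title>Demo</title><p>First<p>Second";

# Open a read-only filehandle on the string (a core Perl feature).
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_fh( $fh, { encoding => 'utf-8' } );
close $fh;

# The result is an XML::LibXML::Document, so the usual DOM calls apply.
my @paragraphs = $doc->getElementsByLocalName('p');
printf "Parsed %d paragraph(s)\n", scalar @paragraphs;
```

        Any real filehandle, for example one returned by "open" on a
        file on disk, can be passed in the same way.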
"parse_string", "parse_html_string" $doc = $parser->parse_string( $html_string [,\%opts] ); This function is similar to "parse_fh()", but it parses an HTML document that is available as a single string in memory. Options include 'encoding' to indicate file encoding (e.g. 'utf-8'). The push parser and SAX-based parser are not supported. Trying to change an option (such as recover_silently) will make HTML::HTML5::Parser carp a warning. (But you can inspect the options.) Additional Methods The module provides a few additional methods to obtain additional, non-DOM data from DOM nodes. "compat_mode" $mode = $parser->compat_mode( $doc ); Returns 'quirks', 'limited quirks' or undef (standards mode). "dtd_public_id" $pubid = $parser->dtd_public_id( $doc ); For an XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this method will tell you the Public Identifier of the DTD used (if any). "dtd_system_id" $sysid = $parser->dtd_system_id( $doc ); For an XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this method will tell you the System Identifier of the DTD used (if any). "source_line" ($line, $col) = $parser->source_line( $node ); $line = $parser->source_line( $node ); In scalar context, "source_line" returns the line number of the source code that started a particular node (element, attribute or comment). In list context, returns a line/column pair. (Tab characters count as one column, not eight.) SEE ALSO AUTHOR Toby Inkster, COPYRIGHT AND LICENSE Copyright (C) 2007-2010 by Wakaba Copyright (C) 2009-2010 by Toby Inkster This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.1 or, at your option, any later version of Perl 5 you may have available.