NAME

Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions

SYNOPSIS

    use URI;
    use Web::Scraper;
    use Encode;

    # First, create your scraper block
    my $authors = scraper {
        # Parse all TDs inside 'table[width="100%"]', store them into
        # an array 'authors'. We embed other scrapers for each TD.
        process 'table[width="100%"] td', "authors[]" => scraper {
            # And, in each TD,
            # get the URI of the "a" element
            process "a", uri => '@href';
            # get the text inside the "small" element
            process "small", fullname => 'TEXT';
        };
    };

    my $res = $authors->scrape( URI->new("http://search.cpan.org/author/?A") );

    # iterate over the array 'authors'
    for my $author (@{$res->{authors}}) {
        # output is like:
        # Andy Adler          http://search.cpan.org/~aadler/
        # Aaron K Dancygier   http://search.cpan.org/~aakd/
        # Aamer Akhter        http://search.cpan.org/~aakhter/
        print Encode::encode("utf8", "$author->{fullname}\t$author->{uri}\n");
    }

The structure would resemble this (visually):

    {
        authors => [
            { fullname => $fullname, uri => $uri },
            { fullname => $fullname, uri => $uri },
        ]
    }

DESCRIPTION

Web::Scraper is a web scraping toolkit, inspired by Ruby's equivalent, Scrapi. It provides a DSL-ish interface for traversing HTML documents and returning a neatly arranged Perl data structure.

The scraper and process blocks provide a way to define which segments of a document to extract. They understand HTML and CSS selectors as well as XPath expressions.

METHODS

scraper

    $scraper = scraper { ... };

Creates a new Web::Scraper object by wrapping the DSL code that will be fired when the scrape method is called.

scrape

    $res = $scraper->scrape(URI->new($uri));
    $res = $scraper->scrape($html_content);
    $res = $scraper->scrape(\$html_content);
    $res = $scraper->scrape($http_response);
    $res = $scraper->scrape($html_element);

Retrieves the HTML from a URI, HTTP::Response, HTML::Tree or text string, creates a DOM object, then fires the callback scraper code to build the data structure.

If you pass a URI or HTTP::Response object, Web::Scraper automatically guesses the encoding of the content by looking at the Content-Type header and META tags. Otherwise you need to decode the HTML to Unicode before passing it to the scrape method.
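For instance, a minimal sketch of decoding the bytes yourself before scraping (this assumes the content is UTF-8; the file name and handle are illustrative):

    use Encode;

    open my $fh, '<:raw', 'page.html' or die $!;
    my $raw  = do { local $/; <$fh> };          # slurp the raw octets
    my $html = Encode::decode('utf-8', $raw);   # decode to Unicode first
    my $res  = $scraper->scrape($html);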
You can optionally pass the base URL when you pass the HTML content as a string instead of a URI or HTTP::Response.

    $res = $scraper->scrape($html_content, "http://example.com/foo");

This way Web::Scraper can resolve the relative links found in the document.
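Since scrape also accepts an HTTP::Response object, you can hand it a response fetched with LWP::UserAgent directly; a sketch, with error handling omitted:

    use LWP::UserAgent;

    my $ua   = LWP::UserAgent->new;
    my $res  = $ua->get("http://example.com/");  # returns an HTTP::Response
    my $data = $scraper->scrape($res);           # encoding guessed from headers/META tags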
process

    scraper {
        process "tag.class", key => 'TEXT';
        process '//tag[contains(@foo, "bar")]', key2 => '@attr';
        process '//comment()', 'comments[]' => 'TEXT';
    };

process finds the elements matching a CSS selector or XPath expression in the HTML, then extracts text or attributes into the result stash.

If the first argument begins with "//" or "id(" it is treated as an XPath expression; otherwise it is treated as a CSS selector.
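For example, the following two calls extract the same links, one phrased as a CSS selector and one as a roughly equivalent XPath expression (the "div.entry" markup and the "links" key are made up for illustration):

    # CSS selector ...
    process 'div.entry a', 'links[]' => '@href';

    # ... and a roughly equivalent XPath expression
    process '//div[contains(@class, "entry")]//a', 'links[]' => '@href';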
    # <span class="date">2008/12/21</span>
    # date => "2008/12/21"
    process ".date", date => 'TEXT';

    # <div class="body"><a href="http://example.com/">foo</a></div>
    # link => URI->new("http://example.com/")
    process ".body > a", link => '@href';

    # <div class="body"><!-- HTML Comment here --><a href="http://example.com/">foo</a></div>
    # comment => " HTML Comment here "
    #
    # NOTE: comment nodes are accessible when HTML::TreeBuilder::XPath
    # (version >= 0.14) and/or HTML::TreeBuilder::LibXML (version >= 0.13)
    # is installed
    process "//div[contains(@class, 'body')]/comment()", comment => 'TEXT';

    # <div class="body"><a href="http://example.com/">foo</a></div>
    # link => URI->new("http://example.com/"), text => "foo"
    process ".body > a", link => '@href', text => 'TEXT';

    # <ul><li>foo</li><li>bar</li></ul>
    # list => [ "foo", "bar" ]
    process "li", "list[]" => "TEXT";

    # <ul><li id="1">foo</li><li id="2">bar</li></ul>
    # list => [ { id => "1", text => "foo" }, { id => "2", text => "bar" } ]
    process "li", "list[]" => { id => '@id', text => "TEXT" };

process_first

process_first is the same as process but stops after the first matching element.

    # <span class="date">2008/12/21</span>
    # <span class="date">2008/12/22</span>
    # date => "2008/12/21"
    process_first ".date", date => 'TEXT';
result

result lets you return, instead of the whole result stash, either a single value specified by a key or a hash reference built from several keys.

    process 'a', 'want[]' => 'TEXT';
    result 'want';
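To get the hash-reference form, pass several keys to result; a sketch (the "title" and "links" keys are illustrative):

    process 'title', title     => 'TEXT';
    process 'a',     'links[]' => '@href';

    # returns { title => ..., links => [ ... ] } instead of the whole stash
    result 'title', 'links';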
EXAMPLES

There are many examples in the eg/ directory packaged in this distribution. It is recommended to look through them.

NESTED SCRAPERS

Scrapers can be nested, which lets you scrape within data you have already captured.

    # <ul>
    #   <li><a href="foo1">bar1</a></li>
    #   <li><a href="foo2">bar2</a></li>
    # </ul>
    # friends => [ { href => 'foo1' }, { href => 'foo2' } ]
    process 'li', 'friends[]' => scraper {
        process 'a', href => '@href';
    };

FILTERS

Filters are applied to the result after processing. They can be declared as anonymous subroutines or as class names.

    process $exp, $key => [ 'TEXT', sub { s/foo/bar/ } ];
    process $exp, $key => [ 'TEXT', 'Something' ];
    process $exp, $key => [ 'TEXT', '+MyApp::Filter::Foo' ];

Filters can be stacked:

    process $exp, $key => [ '@href', 'Foo', '+MyApp::Filter::Bar', \&baz ];

You can find more about filters in the Web::Scraper::Filter documentation.
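As a sketch of what a class-name filter looks like, a custom filter subclasses Web::Scraper::Filter and implements a filter method (MyApp::Filter::Foo is a hypothetical name; see Web::Scraper::Filter for the authoritative interface):

    package MyApp::Filter::Foo;
    use base qw( Web::Scraper::Filter );

    # receives the extracted value, returns the filtered one
    sub filter {
        my($self, $value) = @_;
        $value =~ s/foo/bar/g;
        return $value;
    }

    1;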
XML BACKENDS

By default HTML::TreeBuilder::XPath is used; it can be replaced with an XML::LibXML backend by using the Web::Scraper::LibXML module.

    use Web::Scraper::LibXML;

    # same as Web::Scraper
    my $scraper = scraper { ... };

AUTHOR

Tatsuhiko Miyagawa

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

HTML::TreeBuilder::XPath