NAME HTML::Untemplate - web scraping assistant VERSION version 0.004 DESCRIPTION Suppose you have a set of HTML documents generated by populating the same template with the data from some kind of database. HTML::Untemplate is a set of command-line tools ("xpathify", "untemplate") and modules (HTML::Linear and it's dependencies) which assist in original data retrieval. To achieve this goal, HTML tree nodes are presented as XPath/content pairs. HTML documents linearized this way can be easily inspected manually or with a diff tool. Please refer to "EXAMPLES". Despite being named similarly to HTML::Template, this distribution is not directly related to it. Instead, it attempts to reverse the templating action, whatever the template agent used. Why? Suppose you have a CMS. Typical CMS works roughly as this (data flows bottom-down): RDBMS scripting language HTML HTTP server (...) HTTP agent layout engine screen user Consider the first 3 steps: "RDBMS => scripting language => HTML" This is "applying template". Now, consider this: "HTML => scripting language => RDBMS" I would call that "un-applying template", or "untemplate" ":)" The practical application of this set of tools to assist in creation of web scrappers. EXAMPLES xpathify The xpathify tool flatterns the HTML tree into key/value list:
This is a sample HTMLBeware!
HTML is not XML!Have a nice day. Becomes: *(HTML block)* The keys are in XPath format, while the values are respective content from the HTML tree. Theoretically, it could be possible to reassemble the HTML tree from the flat key/value list this tool generates. untemplate The untemplate tool flatterns a set of HTML documents using the algorithm from xpathify. Then, it strips the shared key/value pairs. The "rest" is composed of original values fed into the template engine. And this is how the result actually looks like with some simple real-world examples (quotes 1839