To complete my Master of Science in Artificial Intelligence, I wrote the thesis 'An analysis of the tree-edit-distance for wrapper induction of HTML-trees' (pdf). The thesis is written in Dutch, but an English paper is forthcoming. The source code will be released when the issuescraper is out of beta.

Abstract:

This paper discusses how the tree-edit-distance may be used for the problem of wrapper induction. The tree-edit-distance is used to find a minimal-cost mapping between the tree representations of HTML pages. From this mapping a template can be constructed in which only the elements common to the trees are kept; the parts specific to each tree are represented as wildcards. The template is thus the most specific generalization of the pages and may be used to recognize other pages of the same semantic type. Applying the template to similar pages retrieves the instance-specific information.

This paper shows that the domain of automatically generated HTML pages has a number of characteristics that allow the tree-edit-distance to be approximated and calculated faster. A number of post-processing steps are considered to make the templates more compact and to protect them from overfitting. It is found that pruning always gives good results.
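
The template construction described in the abstract can be illustrated with a small sketch. It is not the thesis implementation: it assumes a restricted top-down variant of the tree-edit-distance (roots map to roots, whole subtrees are inserted or deleted, unit cost per node), and the names `Node`, `distance`, `align`, `template` and `show` are hypothetical. Nodes that the mapping leaves unmatched become wildcards, and adjacent wildcards are collapsed into one.

```python
from dataclasses import dataclass
from functools import lru_cache

WILDCARD = "*"  # assumption: a single symbol stands for any page-specific part


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


@lru_cache(maxsize=None)
def distance(a, b):
    """Restricted top-down edit distance with unit cost per node."""
    relabel = 0 if a.label == b.label else 1
    return relabel + align(a.children, b.children)[0]


@lru_cache(maxsize=None)
def align(xs, ys):
    """Cheapest order-preserving alignment of two child sequences.

    Returns (cost, pairs); a pair with None on one side is an
    inserted or deleted subtree.
    """
    n, m = len(xs), len(ys)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + size(xs[i - 1])
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + size(ys[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + size(xs[i - 1]),                    # delete subtree
                cost[i][j - 1] + size(ys[j - 1]),                    # insert subtree
                cost[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),  # map children
            )
    # Walk back through the table to recover the mapping itself.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]):
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + size(xs[i - 1]):
            pairs.append((xs[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ys[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], tuple(pairs)


def template(a, b):
    """Keep nodes common to both trees; replace everything else by a wildcard."""
    if a.label != b.label:
        return Node(WILDCARD)
    kids = []
    for x, y in align(a.children, b.children)[1]:
        t = Node(WILDCARD) if x is None or y is None else template(x, y)
        if t.label == WILDCARD and kids and kids[-1].label == WILDCARD:
            continue  # collapse adjacent wildcards
        kids.append(t)
    return Node(a.label, tuple(kids))


def show(t):
    """Render a tree as label(child, child, ...)."""
    if not t.children:
        return t.label
    return f"{t.label}({', '.join(show(c) for c in t.children)})"


# Two made-up pages that share a heading and a list.
page_a = Node("body", (Node("h1"), Node("p"), Node("ul", (Node("li"), Node("li")))))
page_b = Node("body", (Node("h1"), Node("ul", (Node("li"),)), Node("div")))
print(distance(page_a, page_b))
print(show(template(page_a, page_b)))
```

For these two made-up pages the sketch reports a distance of 3 and the template `body(h1, *, ul(*, li), *)`: the heading and the list survive, while the paragraph, the extra list item and the trailing `div` turn into wildcards. The general tree-edit-distance allows more mappings than this restricted variant, so the templates it yields may differ.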
