For the completion of my Master of Science in Artificial Intelligence, I wrote the thesis 'An analysis of the tree-edit-distance for wrapperinduction of HTML-trees' (pdf). The thesis is written in Dutch, but an English paper is forthcoming. The source code will be released when the issuescraper is out of beta.
Abstract:
This paper discusses how the tree-edit-distance may be used for the problem of wrapper induction. The tree-edit-distance is used to find a minimal-cost mapping between the tree representations of HTML pages. From this mapping a template may be constructed in which only the elements common to the trees are kept, while the parts specific to each tree are represented as wildcards. The template is thus the most specific generalization of the pages and may be used to recognize other pages of the same semantic type. By using the template for extraction on similar pages, the instance-specific information may be retrieved.
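To make the idea of a template with wildcards concrete, here is a minimal sketch in Python. It is not the algorithm from the thesis: instead of a full tree-edit-distance it uses a simple top-down alignment, in which the children of two matching nodes are aligned by a longest common subsequence on their labels. The names `Node`, `generalize`, `align_children` and `to_string` are illustrative assumptions, and text content is simplified to a leaf node whose label is the text itself.

```python
from dataclasses import dataclass, field

WILDCARD = "*"  # stands for "anything specific to a single page"


@dataclass
class Node:
    """An ordered, labelled tree node (tag name, or text for a leaf)."""
    label: str
    children: list = field(default_factory=list)


def add_wildcard(out):
    """Append a wildcard unless the previous sibling already is one."""
    if not out or out[-1].label != WILDCARD:
        out.append(Node(WILDCARD))


def align_children(xs, ys):
    """Align two child lists by LCS on labels; unmatched runs become wildcards."""
    n, m = len(xs), len(ys)
    # lcs[i][j] = length of the longest common subsequence of xs[i:] and ys[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xs[i].label == ys[j].label:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if xs[i].label == ys[j].label:
            out.append(generalize(xs[i], ys[j]))  # common element: keep it
            i, j = i + 1, j + 1
        else:
            add_wildcard(out)  # page-specific element: generalize away
            if lcs[i + 1][j] >= lcs[i][j + 1]:
                i += 1
            else:
                j += 1
    if i < n or j < m:  # leftover children on either side
        add_wildcard(out)
    return out


def generalize(a, b):
    """Return a template that covers both trees."""
    if a.label != b.label:
        return Node(WILDCARD)
    return Node(a.label, align_children(a.children, b.children))


def to_string(node):
    """Render a tree as label(child, child, ...)."""
    if not node.children:
        return node.label
    return f"{node.label}({', '.join(to_string(c) for c in node.children)})"


# Two hypothetical pages generated from the same layout.
page_a = Node("html", [Node("body", [
    Node("h1", [Node("Apple")]),
    Node("p", [Node("Price: 1")]),
    Node("div", [Node("footer")]),
])])
page_b = Node("html", [Node("body", [
    Node("h1", [Node("Pear")]),
    Node("div", [Node("footer")]),
])])

template = generalize(page_a, page_b)
print(to_string(template))  # prints: html(body(h1(*), *, div(footer)))
```

The shared structure (`html`, `body`, `h1`, `div`, `footer`) survives in the template, while the differing heading text and the extra paragraph collapse into wildcards. Those wildcard positions are where instance-specific information would be extracted from a new page of the same type.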
This paper shows that the domain of automatically generated HTML pages has a number of characteristics that allow the tree-edit-distance to be approximated and calculated faster. A number of post-processing steps are considered to make the templates more condensed and to protect them from overfitting. It is found that pruning always gives good results.