Unsupervised Approach For Semi-Structured Data Record Extraction From Multiple Pages Using Tag Tree Similarities
In this paper we present a novel unsupervised approach for extracting data records from multiple similar web
pages using tag tree similarities. Extracting data records from multiple web pages consists of the following steps. We
first identify the related web pages from the web source. Next, we construct the DOM tree for each related web page using an HTML
parser. We then compare two or more web pages to eliminate unwanted regions such as headers, menu bars, navigation bars, and
advertisements, and to find the region containing the data records, also referred to as the data region. We then traverse the subtrees of
the data region to extract individual data records and store them in a required format such as XML. The main contribution of this
paper is a fully unsupervised algorithm for extracting both structured and semi-structured data records
from multiple related web pages. Our proposed system can precisely extract valuable data records from many commercial web
sources. It can therefore serve as a tool for integrating information from various commercial websites. This integrated
information can then be used to provide various value-added services such as comparative shopping, market intelligence,
meta-querying and search.
Keywords - Data Record Detection, Information Extraction, Semi-Structured Data, Wrapper Generation.
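
The steps summarized in the abstract (build a tag tree per page, compare related pages to discard shared regions, then walk the data region and emit XML) can be illustrated with a minimal Python sketch. It is not the paper's algorithm. The names `Node`, `TreeBuilder`, `find_data_region`, `extract_records`, and `to_xml`, the rule that a region identical on both pages is template content, and the similarity threshold of 0.6 are all assumptions made for illustration. Only the Python standard library is used.

```python
from collections import Counter
from difflib import SequenceMatcher
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    """One tag-tree node (hypothetical helper, not from the paper)."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

    def content(self):
        # Tags plus text: equal strings mean the subtree is identical on both pages.
        inner = "".join(c.content() for c in self.children)
        return "%s[%s](%s)" % (self.tag, self.text, inner)

    def tags(self):
        # Preorder tag sequence: the "shape" of the subtree, ignoring text.
        seq = [self.tag]
        for c in self.children:
            seq.extend(c.tags())
        return seq

    def leaf_texts(self):
        out = [self.text] if self.text else []
        for c in self.children:
            out.extend(c.leaf_texts())
        return out

class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:  # ignore stray end tags
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.text += data.strip()

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def find_data_region(a, b):
    """Assumed heuristic: descend while exactly one child differs between pages."""
    while True:
        in_b = {c.content() for c in b.children}
        in_a = {c.content() for c in a.children}
        diff_a = [c for c in a.children if c.content() not in in_b]
        if len(diff_a) != 1:
            return a
        diff_b = [c for c in b.children
                  if c.content() not in in_a and c.tag == diff_a[0].tag]
        if len(diff_b) != 1:
            return a
        a, b = diff_a[0], diff_b[0]

def extract_records(region, threshold=0.6):
    """Keep children whose tag sequence is similar to the most common shape."""
    shapes = [tuple(c.tags()) for c in region.children]
    if not shapes:
        return []
    reference = Counter(shapes).most_common(1)[0][0]
    return [c.leaf_texts() for c, s in zip(region.children, shapes)
            if SequenceMatcher(None, reference, s).ratio() >= threshold]

def to_xml(records):
    root = ET.Element("records")
    for fields in records:
        rec = ET.SubElement(root, "record")
        for value in fields:
            ET.SubElement(rec, "field").text = value
    return ET.tostring(root, encoding="unicode")

# Two made-up pages sharing a menu and footer; the third record lacks a price.
PAGE_A = ("<html><body><div>Shop menu</div><ul>"
          "<li><b>Pen</b><i>$1</i></li><li><b>Ink</b><i>$3</i></li>"
          "<li><b>Cap</b></li></ul><div>Footer</div></body></html>")
PAGE_B = ("<html><body><div>Shop menu</div><ul>"
          "<li><b>Cup</b><i>$5</i></li><li><b>Mug</b><i>$7</i></li>"
          "</ul><div>Footer</div></body></html>")

region = find_data_region(parse(PAGE_A), parse(PAGE_B))
print(to_xml(extract_records(region)))
```

In this toy input the menu and footer are identical on both pages, so the sketch treats the `ul` element as the data region. Comparing tag sequences with a similarity ratio rather than strict equality is what lets the record without a price still count as a record, which is one simple way to accommodate semi-structured data.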