Web Information Extraction & Integration


With the widespread adoption of the WWW by the general public, more and more
companies are managing their business and advertising their products and
services on the Web. New product types, updated configurations and
features, the latest price lists, and so on are frequently published in
(semi-)structured web pages. Given their sheer number, it is difficult,
if not impossible, to explore such dynamic Web documents manually.
There is therefore a need to track and re-organize the pages
efficiently.
The ability to automatically extract complex information (such as
products) from the WWW could therefore help to tackle the urgent business
problem of collating, comparing and analyzing business information on
the Web. Following the intense interest in the Semantic Web, research on
extracting and integrating information from the Web has become
increasingly important.
Popular semi-structured documents include product categories and
listings, member collections, financial statements, travel information, and
so on. An example of a typical semi-structured document, a product
description page, is shown in the following figure. The dotted box
divides the content into two clusters: inside the dotted box is the
object region containing product descriptions, while the area outside
contains various irrelevant components such as the advertisement bar,
search and filtering panel, and navigation bar. Most applications
want to extract only the descriptions of the desired products, without the
other details. The required information about individual products (which we
call object data or object instances) usually clusters
within a contiguous object region in the web page. Typically, a
semi-structured document contains only one object region surrounded by
some irrelevant components, and each region lists one or more pieces of
object data.
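As a rough illustration of the object region idea, the sketch below treats the element with the largest number of same-tag children as the candidate object region, and those repeated children as the object instances. This is only an assumed heuristic for demonstration, not the method used in this project; the names `Node`, `TreeBuilder`, `extract_object_data` and the sample page `PAGE` are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM-like node (hypothetical helper for this sketch)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []  # mixed list of str and Node, in document order

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]

    def full_text(self):
        parts = [c if isinstance(c, str) else c.full_text() for c in self.content]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.content.append(text)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def extract_object_data(html):
    """Return the text of each repeated child of the candidate object region."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_node, best_tag, best_repeats = None, None, 1
    for node in iter_nodes(builder.root):
        counts = Counter(child.tag for child in node.children)
        if not counts:
            continue
        tag, repeats = counts.most_common(1)[0]
        if repeats > best_repeats:  # assumed heuristic: most repeated siblings wins
            best_node, best_tag, best_repeats = node, tag, repeats
    if best_node is None:
        return []
    return [c.full_text() for c in best_node.children if c.tag == best_tag]


# Hypothetical page: an ad bar and a navigation bar around one object region.
PAGE = """
<html><body>
  <div id="ads"><img src="banner.png"><a href="/sale">Sale</a></div>
  <ul id="nav"><li>Home</li><li>Search</li></ul>
  <table id="products">
    <tr><td>Laptop A</td><td>$999</td></tr>
    <tr><td>Laptop B</td><td>$1,199</td></tr>
    <tr><td>Laptop C</td><td>$1,399</td></tr>
  </table>
</body></html>
"""

for record in extract_object_data(PAGE):
    print(record)
```

On this sample page the three table rows are selected and the advertisement and navigation components are ignored. A plain repeat count is easily fooled on real pages (for example, by a long navigation menu), so a practical system would need stronger structural or content cues.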
To support various value-added applications, we need to extract and
collate such object data from multiple sites. Our study will support the
following important application scenario: automatically
extracting structured objects (i.e. hierarchies) from complicated pages
without requiring any pre-defined template or labeled training
corpus.
In summary, our objectives are to:
· Extract key object data, such as products and services, from semi-structured web pages
· Induce the site-specific model from the data
· Integrate the set of analogous models into a common ontology

Figure: Layout of a typical semi-structured web document