Web Information Extraction & Integration

With the widespread adoption of the WWW by the general public, more and more companies are managing their business and advertising their products and services on the Web. New product types, updated configurations and features, the latest price lists, and so on are frequently published in (semi-)structured web pages. Given the sheer number of such pages, exploring dynamic Web documents manually is difficult or even impossible. There is therefore a need to track and reorganize these pages efficiently.

The ability to automatically extract complex information (such as products) from the WWW could therefore help tackle the urgent business problem of collating, comparing and analyzing business information on the Web. Following the intense interest in the Semantic Web, research on extracting and integrating information from the Web has become increasingly important.

Popular kinds of semi-structured documents include product categories and listings, member collections, financial statements, travel information, and so on. The figure below shows a typical semi-structured document, a product description page. The dotted box divides the content into two clusters: inside the dotted box is the object region containing product descriptions, while outside are various irrelevant components such as the advertisement bar, the search and filtering panel, and the navigation bar. Most applications want to extract only the descriptions of the desired products, without the other details. The required information about individual products (which we call object data or object instances) usually clusters within a contiguous object region of the web page. Typically, a semi-structured document contains only one object region surrounded by irrelevant components, and each region lists one or more pieces of object data.
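To make the notion of an object region concrete, here is a minimal Python sketch. It is not the method used in this project; it only illustrates one simple heuristic, assuming that the object region is the element whose direct children repeat the same tag structure most often. The sample page and all names (Node, TreeBuilder, find_object_region, PAGE) are hypothetical and chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch)
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM node: tag name, parent, children and own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def signature(node):
    """Shape of a subtree: its tag plus the tags of its direct children."""
    return (node.tag, tuple(child.tag for child in node.children))


def find_object_region(root):
    """Return the node whose children share one signature most often.

    Assumed heuristic: object instances are siblings with the same shape.
    Returns None when no node has at least two children of the same shape.
    """
    best, best_count = None, 1
    for node in walk(root):
        if not node.children:
            continue
        counts = Counter(signature(child) for child in node.children)
        count = counts.most_common(1)[0][1]
        if count > best_count:
            best, best_count = node, count
    return best


def all_text(node):
    """Concatenate the text of a node and its descendants."""
    return " ".join(t for n in walk(node) for t in n.text)


# Hypothetical page: an ad bar and a navigator bar around one product list
PAGE = """
<html><body>
<div id="ad"><img src="banner.gif"></div>
<div id="nav"><a href="/">Home</a></div>
<ul id="products">
  <li><b>Laptop A</b><span>$999</span></li>
  <li><b>Laptop B</b><span>$1199</span></li>
  <li><b>Laptop C</b><span>$1399</span></li>
</ul>
</body></html>
"""

builder = TreeBuilder()
builder.feed(PAGE)
region = find_object_region(builder.root)
objects = [all_text(item) for item in region.children]
print(region.tag, objects)
```

On this toy page the three list items share one shape, so the list is picked as the object region, while the advertisement and navigation blocks are ignored. Real pages are far noisier, which is why a simple repetition count like this one is only a starting point.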

To support various value-added applications, we need to extract and collate such object data from multiple sites. Our study will support the following interesting and important application scenario: automatically extracting structured (i.e. hierarchical) objects from complicated pages, without requiring any predefined template or labeled training corpus.

In summary, our objectives are to:

· Extract key object data, such as products and services, from semi-structured web pages

· Induce a site-specific model from the data

· Integrate the set of analogous models into a common ontology

Figure: Layout of a typical semi-structured web document