Digital Libraries Research


Record Matching in Digital Library Metadata


When data stores grow large, data quality, cleaning and integrity become issues. In digital libraries, these problems manifest urgently in the metadata that describe the library's holdings. Without metadata cleaning, libraries may end up listing multiple records for the same item, causing circulation problems and skewing the distribution of their holdings. In addition, when different authors share the same name (e.g. Wei Wang, J. Brown), author disambiguation must be performed so that each author is linked to his or her own monographs and articles and not to those of a namesake. While the de-duplication and disambiguation problems differ in specifics, they share a key operation: determining whether two data records match. Research in this direction has led to the implementation of toolkits for query analysis and user interfaces, one example of which is a public access catalogue.
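The matching operation described above can be illustrated with a minimal sketch. It is not the toolkit's implementation: the field names (`title`, `author`), the token-set Jaccard measure, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re

def normalize(text):
    """Lowercase the text and replace runs of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta and not tb:
        return 1.0  # two empty fields are treated as identical
    return len(ta & tb) / len(ta | tb)

def records_match(r1, r2, threshold=0.8):
    """Hypothetical rule: the mean similarity over title and author
    must reach the (assumed) threshold for the records to match."""
    fields = ("title", "author")
    score = sum(jaccard(r1.get(f, ""), r2.get(f, "")) for f in fields) / len(fields)
    return score >= threshold

a = {"title": "Introduction to Information Retrieval", "author": "Manning, Christopher"}
b = {"title": "Introduction to information retrieval.", "author": "Christopher Manning"}
c = {"title": "Modern Information Retrieval", "author": "Baeza-Yates, Ricardo"}

print(records_match(a, b))  # True: same item, different cataloguing style
print(records_match(a, c))  # False: different item
```

A rule of this kind addresses de-duplication only. Author disambiguation needs further evidence such as co-authors, venues or affiliations, because two distinct people named Wei Wang have identical name fields.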


Retrieval of Scanned Documents from Digital Libraries


To meet the needs of fast-evolving digital libraries, an increasing number of documents are being digitized for easy archival and dissemination. As a result, Document Image Retrieval (DIR), a part of the Information Retrieval (IR) paradigm, has received growing attention in the IR community in recent years. We have developed methods for retrieving document images based on two kinds of queries: (1) input keywords and (2) a query document image. The underlying technique is a word shape coding scheme that captures features of word images. A number of coding schemes have been proposed and tested. These schemes perform matching at the word level, thus overcoming the problem of connected characters caused by poor image quality. Using the DIR method, we have developed a web-based prototype system that retrieves document images online from digital libraries.
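The idea of keyword retrieval through word shape codes can be sketched as follows. This is a simplified stand-in and not one of the coding schemes developed here: it assumes just three shape classes (ascender `A`, descender `D`, x-height `x`) and derives the codes from text, whereas a real system would compute them from features of the word images.

```python
from collections import defaultdict

# Assumed shape classes for lowercase letters (a toy scheme).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def shape_code(word):
    """Map each character to a coarse shape class; skip non-alphanumerics."""
    code = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            code.append("A")
        elif ch in DESCENDERS:
            code.append("D")
        elif ch.isalpha():
            code.append("x")
    return "".join(code)

def build_index(docs):
    """Index document ids by the shape codes of their words.
    In a real system the codes would come from word images, not text."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[shape_code(word)].add(doc_id)
    return index

def search(index, keyword):
    """Encode the query keyword and look up documents with the same code."""
    return sorted(index.get(shape_code(keyword), set()))

docs = {"doc1": ["digital", "library"], "doc2": ["patent", "figure"]}
index = build_index(docs)

print(shape_code("digital"))     # AxDxAxA
print(search(index, "library"))  # ['doc1']
```

Because the code is computed for the word as a whole, no character segmentation is needed, which is why touching characters do less harm. The price is ambiguity: under this toy scheme "cat" and "oat" both encode to `xxA`, so a practical scheme needs richer features.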


Text and Graphics Retrieval in U.S. Patent Document Database


The U.S. patent database, run by the United States Patent and Trademark Office, maintains both patent text and patent images. The Web Patent Full-Text Database (PatFT) contains the full text of over 3,000,000 patents, while the Web Patent Full-Page Images Database (PatImg) contains over 70,000,000 images. Retrieval demand on this database is heavy: the Web Patent Database now serves over 25,000,000 pages of text (over 150,000,000 hits) per month to over 350,000 customers. Although the database provides a convenient means of access, a patent document is hard to read because of its layout, which contains five distinct sections: the abstract, figure, description (text), claim and reference sections. As all figures appear together before the description section, the reader often has to scroll back and forth to cross-reference text and figures. Using techniques in document image analysis and graphics recognition, we have developed a user interface for instant visualization of the relevant figures corresponding to their respective references in the text description.
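The text side of this cross-referencing can be illustrated with a minimal sketch that locates figure references in the description. It is not the interface's actual method: it assumes references are written as "FIG. 1" or "FIG. 2A" and that paragraphs are separated by blank lines, and it omits the image-analysis step that locates the figures themselves.

```python
import re

# Assumed reference style: "FIG. 1", "FIG. 2A", "FIGS. 3" (first number only).
FIG_REF = re.compile(r"\bFIGS?\.\s*(\d+[A-Za-z]?)", re.IGNORECASE)

def figure_references(description):
    """Map each paragraph index to the figure labels it mentions, in order."""
    refs = {}
    for i, paragraph in enumerate(description.split("\n\n")):
        labels = FIG_REF.findall(paragraph)
        if labels:
            refs[i] = labels
    return refs

text = (
    "FIG. 1 shows the housing.\n\n"
    "The lever in FIG. 2A engages the gear of FIG. 1.\n\n"
    "No drawings are cited here."
)

print(figure_references(text))  # {0: ['1'], 1: ['2A', '1']}
```

With such a mapping, an interface could display the cited figures beside the paragraph being read. Ranges such as "FIGS. 1 and 2" would need a more elaborate pattern, since this one captures only the first number.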


 
     