Document Analysis Research


Document Restoration using Inpainting and Shape-from-Shading


We present a restoration framework to reduce undesirable distortions in imaged documents. Our framework is based on two components: (1) an image inpainting procedure that can separate non-uniform illumination (and other) artifacts from the printed content; and (2) a Shape-from-Shading (SfS) formulation that can reconstruct the 3D shape of the document's surface. Used either piecewise or in its entirety, this framework can correct a variety of distortions including shading, shadow, ink-bleed, show-through, perspective and geometric distortions, for both camera-imaged and flatbed-imaged documents. In addition, our SfS formulation can be easily modified to target various illumination conditions to suit different real-world applications.
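As a rough illustration of the first component, the sketch below estimates a shading layer by filling text pixels from their surroundings and then divides it out. It is only a minimal sketch under stated assumptions: the simple neighbour-averaging fill stands in for the inpainting procedure (whose details are not given here), the text mask is assumed to be known, and the names `inpaint_background` and `remove_shading` are hypothetical.

```python
def inpaint_background(image, text_mask, iterations=200):
    """Estimate the shading layer by filling masked (text) pixels.

    Assumption: a plain diffusion fill, where each masked pixel is repeatedly
    replaced by the mean of its 4-neighbours. Unmasked pixels are kept as-is.
    """
    height, width = len(image), len(image[0])
    background = [row[:] for row in image]
    for _ in range(iterations):
        updated = [row[:] for row in background]
        for y in range(height):
            for x in range(width):
                if not text_mask[y][x]:
                    continue
                neighbours = [
                    background[ny][nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < height and 0 <= nx < width
                ]
                updated[y][x] = sum(neighbours) / len(neighbours)
        background = updated
    return background


def remove_shading(image, background):
    """Divide the image by the estimated shading, clamping results to 1.0."""
    return [
        [min(1.0, pixel / shade) if shade > 0 else 1.0
         for pixel, shade in zip(image_row, shade_row)]
        for image_row, shade_row in zip(image, background)
    ]


# Toy page: brightness rises from left to right, with one dark ink pixel.
illumination = [0.5, 0.6, 0.7, 0.8, 0.9]
page = [illumination[:] for _ in range(3)]
page[1][2] = 0.14  # ink pixel observed under shading 0.7
mask = [[False] * 5 for _ in range(3)]
mask[1][2] = True

flat = remove_shading(page, inpaint_background(page, mask))
```

In this toy case the paper pixels all normalise to 1.0, while the ink pixel keeps its own darkness independent of the shading gradient.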

Document Restoration through Grid Modeling and Regularization


For document images captured by a digital camera, perspective and geometric distortions make it hard to recognize the document content properly. We propose an integrated document restoration technique that removes perspective and geometric distortions and produces a flattened, fronto-parallel text image suitable for generic OCR systems. The restoration is accomplished through grid modeling, which divides the camera image into multiple quadrilateral grid cells using the vertical text direction together with the x-lines and baselines of the text. The global distortions are then removed through grid regularization, which transforms the quadrilateral grid cells, together with their pixel contents, into regular square cells. Experimental results show that the proposed method is fast and easy to implement.
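The regularization step for a single grid cell can be sketched as follows. This is a minimal sketch, not the method's actual implementation: it assumes the four cell corners are already known, uses bilinear interpolation of the corners as the quadrilateral-to-square mapping (the text above does not specify the transform), and samples with nearest-neighbour lookup. The names `quad_point` and `regularize_cell` are hypothetical.

```python
def quad_point(corners, u, v):
    """Map normalized square coordinates (u, v) in [0, 1] into a quadrilateral.

    corners: (top_left, top_right, bottom_right, bottom_left), each (x, y).
    Assumption: bilinear interpolation between the four corners.
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    top = (x0 + u * (x1 - x0), y0 + u * (y1 - y0))
    bottom = (x3 + u * (x2 - x3), y3 + u * (y2 - y3))
    return (top[0] + v * (bottom[0] - top[0]),
            top[1] + v * (bottom[1] - top[1]))


def regularize_cell(image, corners, size):
    """Resample one quadrilateral cell of `image` into a size x size square."""
    height, width = len(image), len(image[0])
    square = []
    for row in range(size):
        v = row / (size - 1) if size > 1 else 0.0
        out_row = []
        for col in range(size):
            u = col / (size - 1) if size > 1 else 0.0
            x, y = quad_point(corners, u, v)
            # Nearest-neighbour sampling, clamped to the image bounds.
            xi = min(max(int(round(x)), 0), width - 1)
            yi = min(max(int(round(y)), 0), height - 1)
            out_row.append(image[yi][xi])
        square.append(out_row)
    return square


# Toy image whose pixel value encodes its position: 10 * y + x.
image = [[10 * y + x for x in range(4)] for y in range(4)]
skewed_cell = ((1, 0), (3, 0), (2, 3), (0, 3))
flattened = regularize_cell(image, skewed_cell, 2)
```

Iterating over the target square and looking up the source pixel (inverse mapping) avoids holes in the output, which is why the sketch maps from the square into the quadrilateral rather than the other way round.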


Script and Language Identification in Noisy and Degraded Document Images


We propose an identification technique that differentiates scripts and languages in noisy and degraded document images. The method transforms each document image into an electronic document vector that characterizes the shape and frequency of the contained character or word images. The document vectorization is accomplished using character extremum points, vertical component cuts, and the number of horizontal word cuts, all of which are tolerant to text fonts and styles, noise, and various types of document degradation. For each script or language under study, a script or language template is first constructed through a learning process. The script and language of the query document are then determined according to the distances between the query document vector and the constructed script and language templates.
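The final matching step can be sketched as a nearest-template classifier. This is a minimal sketch under assumptions: the feature values are made up for illustration, each template is taken to be the mean of its training vectors (the learning process is not detailed above), and Euclidean distance is assumed as the distance measure. The names `build_template` and `identify` are hypothetical.

```python
import math


def build_template(training_vectors):
    """Average the training document vectors of one script or language."""
    count = len(training_vectors)
    return [sum(column) / count for column in zip(*training_vectors)]


def identify(query_vector, templates):
    """Return the label whose template is closest to the query vector."""
    return min(
        templates,
        key=lambda label: math.dist(query_vector, templates[label]),
    )


# Hypothetical two-dimensional document vectors for two scripts.
templates = {
    "roman": build_template([[0.8, 0.2], [0.6, 0.4]]),
    "arabic": build_template([[0.1, 0.9], [0.3, 0.7]]),
}
label = identify([0.65, 0.35], templates)
```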


Enhanced Document Understanding with Scientific Chart Recognition


We have developed a system for recognizing scientific charts contained in document images. Recognition works by associating the textual and graphical information recognized in each chart. Text components are first located in the input image and then recognized using OCR, while the graphical objects are segmented and grouped into high-level symbols. Both logical and semantic correspondences between the text and the graphical symbols are then identified. This association of text and graphics allows us to capture the semantic meaning carried by scientific chart images more completely. The semantics acquired from the charts are then added to the OCR text of the original document. This yields a better understanding of the document than ordinary OCR engines can provide, since they cannot handle graphical images. We demonstrate this with improved Q&A performance on documents containing scientific charts, which the machine can now understand.


Ground Truthed Dataset for Scientific Chart Recognition


We have developed ground-truthing tools that generate scientific chart images with ground truth, to facilitate the evaluation of scientific chart recognition research. We adopt semi-automatic and automatic approaches, resulting in two independent subsets with ground truth. The dataset is available for public access.

Restoration of Ink-Bled Historical Handwritten Document Images


This project is a collaboration with the National Archives of Singapore, which keeps a large number of historical handwritten documents. Because ink has bled through the paper over long periods of storage, many of these documents are hard to read: ink strokes from the back side of the page interfere with the writing on the front. Various techniques have been explored in our research and reported in the literature. One method uses a wavelet technique to enhance the front-page strokes and weaken the back-page strokes by mapping the images of the two sides to each other. Another method uses stroke direction as a cue to differentiate front and back text, since most of the writing styles in these documents are heavily slanted. A recent approach uses machine learning with human intervention, teaching the machine to identify the origin of each stroke on sample images. This method is based on a novel dual-layer Markov Random Field formulation.


 
     