Word Searching in Imaged Documents of PDF Files

 Background

Over the past decades, millions of digital documents have been generated. The most widespread format for these documents is text, in which characters are represented by machine-readable codes. On the other hand, modern technology has made it possible to produce, process, store, and transmit document images (i.e., imaged documents) efficiently. Many of the documents stored in digital libraries or on the Internet are simply scanned and archived in image form.

PDF has become a popular file format for both text documents and imaged documents, especially on the web. Searching these PDF documents for a word of interest to the user therefore has practical value. For this purpose, Adobe Acrobat provides a "Search" tool for finding user-specified words in text documents. However, it does not work on imaged documents.


 Plug-in Tool

We have developed a tool, built with the Acrobat SDK and based on document image analysis techniques, for searching for words in PDF files that contain imaged documents. When a PDF file is opened in Adobe Acrobat, the plug-in detects and locates the user-specified words in the imaged documents, much as Acrobat's own "Search" tool does for text-format documents. The tool works on PDF files opened in Acrobat from a local PC or from a website.


 How to Use it

(1) Create a subdirectory "AcrobatSDK" under C:\Program Files\Adobe\Acrobat X.X\Acrobat\Plug_ins\
       where X.X is 5.0 if you have Acrobat version 5.0, or 6.0 if you have Acrobat version 6.0.
(2) Download the plug-in from here (right-click the link to save the file).
(3) Copy the downloaded file NUSFind.api to
       C:\Program Files\Adobe\Acrobat X.X\Acrobat\Plug_ins\AcrobatSDK
(4) A new icon "NUS" will appear in Acrobat.
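
Steps (1) and (3) above can also be scripted. The following is only a minimal sketch in Python, not part of the plug-in itself: the function names plugin_dir and install_plugin are hypothetical, and the default "C:\Program Files" root is an assumption taken from the paths shown in the steps. It refuses to run if the Acrobat Plug_ins folder is not found, so that it never creates a fake Acrobat directory tree.

```python
import shutil
from pathlib import Path


def plugin_dir(acrobat_version: str, program_files: str = r"C:\Program Files") -> Path:
    """Return the AcrobatSDK plug-in folder for the given Acrobat version (e.g. "5.0" or "6.0")."""
    return (Path(program_files) / "Adobe" / f"Acrobat {acrobat_version}"
            / "Acrobat" / "Plug_ins" / "AcrobatSDK")


def install_plugin(downloaded_file: str, acrobat_version: str,
                   program_files: str = r"C:\Program Files") -> Path:
    """Create the AcrobatSDK subdirectory (step 1) and copy the plug-in into it (step 3)."""
    target = plugin_dir(acrobat_version, program_files)

    # The parent Plug_ins folder must already exist; otherwise Acrobat is
    # probably installed somewhere else and we should not guess.
    if not target.parent.is_dir():
        raise FileNotFoundError(f"Acrobat plug-in folder not found: {target.parent}")

    target.mkdir(exist_ok=True)                          # step (1)
    return Path(shutil.copy2(downloaded_file, target))   # step (3)
```

For example, install_plugin(r"C:\Downloads\NUSFind.api", "6.0") would copy the file for Acrobat 6.0, where the download location is again only an assumed example. Writing under Program Files may require administrator rights.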


 Announcement

All copyrights related to the plug-in tool are reserved. If you have any suggestions or encounter any problems, please contact us:

       Prof. TAN Chew Lim
              Email: tancl@comp.nus.edu.sg
              Tel: (65) 6874 2900
       Miss Zhang Li
              Email: zhangli@comp.nus.edu.sg
              Tel: (65) 6874 2784


 Related Papers

[1] Lu Y, Tan C L. A Nearest-neighbor-chain Based Approach to Skew Estimation in Document Images. Pattern Recognition Letters, 2003, 24(14): 2315-2323.
[2] Lu Y, Tan C L. Document Retrieval From Compressed Images. Pattern Recognition, 2003, 36(4): 987-996.
[3] Lu Y, Tan C L. Improved Nearest Neighbor Based Approach to Accurate Document Skew Estimation. The 7th International Conference on Document Analysis and Recognition (ICDAR'03), pp.503-507, Edinburgh, August 3-6, 2003.
[4] Lu Y, Tan C L, Lin L. An Approach to Matching Partial Word Image and Its Application to Document Image Retrieval. Proc. of SPIE Vol. 4929 Optical Information Processing Technology, pp.379-387, Shanghai, China, October 14-18, 2002.
[5] Lu Y, Tan C L. Word Searching in Document Images using Word Portion Matching. Fifth IAPR International Workshop on Document Analysis Systems, Princeton, New Jersey, USA, August 19-21, 2002. D. Lopresti, J. Hu, and R. Kashi (Eds.) Lecture Notes in Computer Science, Vol. 2423, pp.319-328, Springer-Verlag.
[6] Lu Y, Tan C L, Huang W, Fan L. An Approach to Word Image Matching Based on Weighted Hausdorff Distance. Proc. of the 6th International Conf. on Document Analysis and Recognition, 2001, Seattle, USA, pp.921-925.
[7] Zhang L, Lu Y, Tan C L. A Web-based System for Retrieving Document Images from Digital Library. International Workshop on Document Image Analysis and Retrieval (in conjunction with IEEE CVPR 2003), Wisconsin, 2003.
[8] Lu Y, Zhang L, Tan C L. Retrieving Imaged Documents in Digital Libraries Based on Word Image Coding. International Workshop on Document Image Analysis for Libraries, CA, USA, 2004.
[9] Lu Y, Zhang L, Tan C L. A Search Engine for Imaged Documents in PDF Files. 27th Annual International ACM SIGIR Conference, Sheffield, UK, 2004.
[10] Zhang L, Lu Y, Tan C L. Italic Font Recognition Using Stroke Pattern Analysis on Wavelet Decomposed Word Images. International Conference on Pattern Recognition, Cambridge, UK, 2004.