Textual images
Module 4 Min-Yen KAN
*Portions of this lecture are based on the Managing Gigabytes textbook

Cost basis for archives

Digitization
Scanning
Binding
Planetary scanner
Resolution of scan
300 dpi for access
600 or higher for archival copy
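To see what the two resolutions imply for storage, the sketch below computes the uncompressed size of one scanned page. The US Letter page size (8.5 x 11 in), the depth of 1 bit per pixel, and the function name raw_scan_bytes are assumptions chosen for illustration, not figures from the lecture.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page: US Letter, bi-level (1 bit per pixel).
access_copy = raw_scan_bytes(8.5, 11, 300)
archival_copy = raw_scan_bytes(8.5, 11, 600)
print(access_copy, archival_copy)
```

Doubling the resolution quadruples the pixel count, so under these assumptions the 600 dpi archival copy is four times the size of the 300 dpi access copy before any compression.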

Digitization
Purpose:
________
Quality
Stability in the long term
_______
Delivery
Editing
Annotation
Initiate the digitization project
Establish start-up costs and secure funding
Prepare a detailed project plan, including milestones and deliverables
Assess and select materials for digitization
Digitize materials (prepare source materials, digitize, check quality)
Post-process digital materials: edit, OCR, store, catalog and index
Deliver and make materials accessible
Support and maintenance of materials
-- From Chowdhury and Chowdhury (03)

Document capture costs in USD

Images of text
You’ve scanned in an image like this…
What to do with it?
How would we like to store and access this information?

Storing a textual image
Mostly bi-level (two-tone)
CCITT Fax III and IV
Bi-level transmission and storage standard
Optimized for Roman alphabet
Textual image compression
Codebook of marks
A level for access and one for preservation
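Group III one-dimensional coding describes each scan line as alternating white and black run lengths, which are then replaced by short codewords from fixed tables. The sketch below shows the run-length step only, assuming pixels are given as 0 (white) and 1 (black). The codeword tables and the two-dimensional coding used by Group IV are not shown, and run_lengths is a hypothetical helper name.

```python
def run_lengths(row):
    """Alternating run lengths for a row of 0/1 pixels.

    The first run is always white (0); if the row starts with a black
    pixel, the first run has length 0.
    """
    runs = []
    current = 0  # colour of the run being counted, starting with white
    count = 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current = pixel
            count = 1
    runs.append(count)
    return runs


print(run_lengths([0, 0, 0, 1, 1, 0, 0, 0, 0, 1]))  # [3, 2, 4, 1]
```

Text pages are mostly long white runs broken by short black ones, which is why a run-length description is far more compact than the raw bitmap.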

CCITT Fax IV


CCITT Fax Group IV

Textual image compression
Find and isolate marks (connected group of black pixels)
Construct library of symbols
Identify the symbol closest to each mark and get coordinates
Store information
*Store additional information to reconstruct original image
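The steps above can be sketched as follows. This is a simplified illustration, not the textbook's implementation: it assumes the page is a list of rows of 0/1 pixels, treats marks as 8-connected groups, and matches a mark only to an identical bitmap in the library. Finding the closest symbol would need a similarity measure plus the additional information for exact reconstruction, neither of which is shown. The names find_marks, compress and decompress are hypothetical.

```python
from collections import deque


def find_marks(image):
    """Return (top, left, bitmap) for each 8-connected group of black pixels."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    marks = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            pixels = set()
            while queue:  # breadth-first flood fill of one mark
                cy, cx = queue.popleft()
                pixels.add((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            top = min(r for r, _ in pixels)
            bottom = max(r for r, _ in pixels)
            left = min(c for _, c in pixels)
            right = max(c for _, c in pixels)
            # Crop the mark to its bounding box.
            bitmap = tuple(
                tuple(1 if (r, c) in pixels else 0 for c in range(left, right + 1))
                for r in range(top, bottom + 1)
            )
            marks.append((top, left, bitmap))
    return marks


def compress(image):
    """Build a symbol library and a list of (symbol index, top, left)."""
    library = []
    placements = []
    for top, left, bitmap in find_marks(image):
        if bitmap not in library:  # assumption: exact match only
            library.append(bitmap)
        placements.append((library.index(bitmap), top, left))
    return library, placements


def decompress(library, placements, height, width):
    """Redraw every symbol at its stored coordinates."""
    image = [[0] * width for _ in range(height)]
    for index, top, left in placements:
        for r, row in enumerate(library[index]):
            for c, pixel in enumerate(row):
                if pixel:
                    image[top + r][left + c] = 1
    return image


page = [
    [1, 1, 0, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1],
]
library, placements = compress(page)
```

The toy page holds three marks but only two distinct shapes, so the library stores two bitmaps and the repeated shape costs just an index and a pair of coordinates.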

Text image outline
Storage √
CCITT Fax Group III and IV √
Textual image compression √
Access
De-skew
Segmentation
Media detection

De-Skew
Projection profile
Accumulate Y-axis pixel histogram
Rotate to find most crisp histogram
One of three common algorithms
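A minimal sketch of the projection-profile idea, under stated assumptions: the input is a list of (x, y) coordinates of black pixels, crispness is measured as the sum of squared row counts, and candidate angles are searched by brute force. The crispness measure, the search range, the step size, and the function names are illustrative choices, not taken from the lecture.

```python
import math


def profile_sharpness(points, angle_deg):
    """Sum of squared row counts after rotating the points by angle_deg."""
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    histogram = {}
    for x, y in points:
        row = round(x * sin_t + y * cos_t)  # y coordinate after rotation
        histogram[row] = histogram.get(row, 0) + 1
    # Peaked histograms (text lines aligned with rows) score highest.
    return sum(count * count for count in histogram.values())


def estimate_correction(points, max_angle=5.0, step=0.5):
    """Rotation in degrees that gives the crispest row histogram."""
    steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(steps + 1)]
    return max(candidates, key=lambda a: profile_sharpness(points, a))


# Synthetic test page: three text baselines skewed by 2 degrees.
skew = math.tan(math.radians(2.0))
points = [(x, x * skew + c) for c in (10, 30, 50) for x in range(200)]
correction = estimate_correction(points)
```

Rotating the page by the returned angle should bring the text lines back to horizontal; a finer step or a coarse-to-fine search would trade running time for precision.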

Segmentation
Top-down
(e.g., _______)
Bottom-up
(e.g., _________)

Classification
Separate:
Images
Text
Line art
Equations
Tables
One technique:
Slope Histogram (Hough transform)