1
Textual images
  • Module 4 Min-Yen KAN
  • *Portions of this lecture based on Managing Gigabytes textbook
2
Cost basis for archives
3
Digitization
  • Scanning
    • Binding
    • Planetary scanner


  • Resolution of scan
    • 300 dpi for access
    • 600 or higher for archival copy
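
To get a feel for what these resolutions mean for storage, the arithmetic below estimates the uncompressed size of one scanned page. The page size (US Letter, 8.5 x 11 inches), the bit depths, and the helper name `raw_page_bytes` are assumptions chosen for illustration, not figures from the lecture.

```python
def raw_page_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=1):
    """Uncompressed size in bytes of one scanned page (assumed US Letter)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for dpi in (300, 600):
        bilevel = raw_page_bytes(dpi)                     # 1 bit per pixel
        gray = raw_page_bytes(dpi, bits_per_pixel=8)      # 8-bit grayscale
        print(f"{dpi} dpi: bi-level {bilevel / 1e6:.2f} MB, "
              f"gray {gray / 1e6:.2f} MB")
```

Doubling the resolution quadruples the pixel count, which is one reason an archival copy is typically kept separately from the lighter access copy.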


4
Digitization
  • Purpose:
    • ________
      • Quality
      • Stability in the long term


    • _______
      • Delivery
      • Editing
      • Annotation


  • Initiate the digitization project
  • Establish start-up costs and secure funding
  • Prepare a detailed project plan, including milestones and deliverables
  • Assess and select materials for digitization
  • Digitize materials (prepare source materials, digitize, check quality)
  • Post-process digital materials: edit, OCR, store, catalog and index
  • Deliver and make materials accessible
  • Support and maintenance of materials


  • -- From Chowdhury and Chowdhury (03)
5
Document capture costs in USD
6
Images of text
  • You’ve scanned in an image like this…


  • What to do with it?


  • How would we like to store and access this information?
7
Storing a textual image
  • Mostly bi-level (two-tone)


  • CCITT Fax Group III and IV
    • Bi-level transmission and storage standard
    • Optimized for Roman alphabet

  • Textual image compression
    • Codebook of marks
    • A level for access and one for preservation
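
Bi-level fax coding builds on run lengths: a scan line is described as alternating runs of white and black pixels, which are then given variable-length codes. The sketch below shows only the run-length step on a toy row. The function names `run_lengths` and `expand` and the convention 0 = white, 1 = black are assumptions for illustration, and the actual CCITT code tables are not reproduced here.

```python
def run_lengths(row):
    """Return alternating run lengths for a bi-level row, starting with white.

    Assumes 0 = white and 1 = black. If the row starts with black, the first
    (white) run has length 0, so a decoder always knows the starting colour.
    """
    runs = []
    current = 0   # colour of the run being counted; every line starts white
    length = 0
    for pixel in row:
        if pixel == current:
            length += 1
        else:
            runs.append(length)
            current = pixel
            length = 1
    runs.append(length)
    return runs


def expand(runs):
    """Inverse of run_lengths: rebuild the pixel row from its runs."""
    row = []
    colour = 0
    for n in runs:
        row.extend([colour] * n)
        colour = 1 - colour
    return row
```

Because text pages are mostly long white runs broken by short black ones, the list of runs is usually far shorter than the row itself, which is what the standard's code tables then exploit.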
8
CCITT Fax Group IV
9
 
10
CCITT Fax Group IV
11
Textual image compression
  • Find and isolate marks (connected group of black pixels)
  • Construct library of symbols
  • Identify the symbol closest to each mark and record the mark's coordinates
  • Store information
  • *Store additional information to reconstruct original image
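
The steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's or the textbook's implementation: it assumes the image is a list of rows with 1 = black, uses 8-connectivity to define a mark, and matches marks to symbols by counting differing pixels between equal-sized bitmaps. All function names and the `max_mismatch` parameter are hypothetical.

```python
from collections import deque


def find_marks(image):
    """Return marks as (top, left, bitmap), found by 8-connected flood fill."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    marks = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            queue, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            members = set(pixels)
            top, bottom = min(p[0] for p in pixels), max(p[0] for p in pixels)
            left, right = min(p[1] for p in pixels), max(p[1] for p in pixels)
            bitmap = tuple(
                tuple(1 if (yy, xx) in members else 0
                      for xx in range(left, right + 1))
                for yy in range(top, bottom + 1))
            marks.append((top, left, bitmap))
    return marks


def mismatch(a, b):
    """Number of differing pixels, or None if the bitmaps differ in size."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return None
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(image, max_mismatch=0):
    """Build a symbol library and a list of (symbol index, top, left)."""
    library, placements = [], []
    for top, left, bitmap in find_marks(image):
        best, best_d = None, None
        for i, symbol in enumerate(library):
            d = mismatch(bitmap, symbol)
            if d is not None and d <= max_mismatch and (best_d is None or d < best_d):
                best, best_d = i, d
        if best is None:              # no close symbol yet: add a new one
            library.append(bitmap)
            best = len(library) - 1
        placements.append((best, top, left))
    return library, placements


def reconstruct(library, placements, height, width):
    """Paint each placed symbol back onto a blank page."""
    page = [[0] * width for _ in range(height)]
    for index, top, left in placements:
        for dy, row in enumerate(library[index]):
            for dx, pixel in enumerate(row):
                if pixel:
                    page[top + dy][left + dx] = 1
    return page
```

With `max_mismatch=0` only identical marks share a symbol, so reconstruction is exact. Allowing a nonzero tolerance shrinks the library but makes the result approximate, which is where the additional stored information in the last step comes in; that residue is not implemented in this sketch.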
12
Text image outline
  • Storage √
    • CCITT Fax Group III and IV √
    • Textual image compression √


  • Access
    • De-skew
    • Segmentation
    • Media detection
13
De-Skew
  • Projection profile
    • Accumulate Y-axis pixel histogram
    • Rotate to find most crisp histogram

  • One of three common algorithms
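
A minimal sketch of the projection-profile idea follows. It assumes the black pixels are given as (x, y) coordinates, searches a small range of candidate angles, and measures "crispness" as the sum of squared row counts; that scoring choice, the angle range, and the function names are assumptions for illustration rather than details from the lecture.

```python
import math


def profile_score(points, angle_deg):
    """Crispness of the row histogram after rotating points by angle_deg."""
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    hist = {}
    for x, y in points:
        row = round(-x * sin_t + y * cos_t)   # y coordinate after rotation
        hist[row] = hist.get(row, 0) + 1
    # Pixels concentrated in few rows give a large sum of squares.
    return sum(count * count for count in hist.values())


def estimate_skew(points, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) with the crispest profile."""
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        score = profile_score(points, angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

When the candidate angle matches the skew, each text line collapses into one or two histogram rows and the score peaks; at other angles the same pixels smear across several rows. A coarse search like this is often followed by a finer search around the best angle.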
14
Segmentation
  • Top-down
  • (e.g., _______)
  • Bottom-up
  • (e.g., _________)


15
Classification
  • Separate:
    • Images
    • Text
    • Line art
    • Equations
    • Tables

  • One technique:
    • Slope Histogram (Hough transform)
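
As a rough illustration of the Hough transform behind a slope histogram, the sketch below lets every black pixel vote for the lines that could pass through it, then summarizes the votes per angle. The slide does not say how the histogram is derived or used for classification, so the per-angle summary (the strongest single line at each angle) and both function names are assumptions.

```python
import math


def hough_accumulator(points, n_angles=180):
    """Vote in (theta, rho) space, where rho = x*cos(theta) + y*sin(theta).

    theta is the angle of the line's normal, sampled in n_angles steps
    over [0, pi). Returns one {rho: votes} dict per angle.
    """
    accumulator = []
    for t in range(n_angles):
        theta = math.pi * t / n_angles
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        votes = {}
        for x, y in points:
            rho = round(x * cos_t + y * sin_t)
            votes[rho] = votes.get(rho, 0) + 1
        accumulator.append(votes)
    return accumulator


def slope_histogram(points, n_angles=180):
    """Per-angle peak strength: the largest vote count at each theta."""
    accumulator = hough_accumulator(points, n_angles)
    return [max(votes.values()) if votes else 0 for votes in accumulator]
```

With the default 180 steps, index t corresponds to a normal angle of t degrees, so a horizontal row of pixels peaks at index 90. A region of text lines would be expected to show one dominant peak, whereas a photograph spreads its votes more evenly; treating that contrast as a classification feature is the assumption made here.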