Representation and digitization of multimedia*
Week 2  Min-Yen KAN
*Heavily scaled down from original lecture outline :-(

Media types in the DL

Distribution of media types in the library
LoC NUS U Toronto
Library Type Gov’t Acad Acad
Books and manuscripts 19 M 2.2M 9.1 M
Maps 4 M 278 K
Photographs 12 M 22.1 K 622 K
Music 2.7M 186 K
Motion pictures .9 M 21 K
CD-ROM Databases 1.4K 2.1 K
Question: is the distribution of what we’d like in the digital library the same as in the automated library?

Outline
Representation / Digitization
Textual images
Images
Audio
Coordinated multimedia

Textual images

Cost basis for archives

Digitization
Scanning
Binding
Planetary scanner
Resolution of scan
300 dpi* for access
600 or higher for archival copy
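To see why the access copy and the archival copy are scanned at different resolutions, here is a rough size estimate. It assumes a US letter page (8.5 × 11 in) and a bi-level scan at 1 bit per pixel; both the page size and the function name are illustrative assumptions, not figures from the lecture.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Assumed page size: US letter, bi-level (1 bit per pixel).
for dpi in (300, 600):
    size = raw_scan_bytes(8.5, 11, dpi)
    print(f"{dpi} dpi: {size / 1e6:.1f} MB uncompressed")
```

Doubling the resolution quadruples the raw size, which is one reason to keep a lighter copy for access.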

Digitization
Purpose:
Archival
Quality
Stability in the long term
Accessibility
Delivery
Editing
Annotation
Initiate the digitization project
Establish start-up costs and secure funding
Prepare a detailed project plan, including milestones and deliverables
Assess and select materials for digitization
Digitize materials (prepare source materials, digitize, check quality)
Post-process digital materials: edit, OCR, store, catalog and index
Deliver and make materials accessible
Support and maintenance of materials
-- From Chowdhury and Chowdhury (03)

Document capture costs in USD (ca. 1999)

Images of text
You’ve scanned in an image like this…
What to do with it?
How would we like to store and access this information?

Storing a textual image
Mostly bi-level (two-tone) until recently
CCITT Fax III and IV
Bi-level transmission and storage standard
Optimized for Roman alphabet
Textual image compression
Codebook of marks
A level for access and one for preservation
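The fax standards build on run-length coding of each bi-level scan line. The sketch below shows only that first step, under the usual fax convention that a line starts with a white run (of length zero if the first pixel is black). The real Group III and IV formats then encode these runs with fixed code tables, and Group IV codes each line relative to the line above it; neither step is shown here. Function names are illustrative.

```python
from itertools import groupby

def run_lengths(row):
    """Run lengths of a scan line of 0 (white) / 1 (black) pixels.

    The list always starts with a white run, which may have length 0.
    """
    runs = [(color, len(list(group))) for color, group in groupby(row)]
    lengths = [n for _, n in runs]
    if runs and runs[0][0] == 1:
        lengths.insert(0, 0)  # line begins with black: emit an empty white run
    return lengths

def decode_runs(lengths):
    """Rebuild the scan line by alternating white and black runs."""
    row, color = [], 0
    for n in lengths:
        row.extend([color] * n)
        color ^= 1
    return row

line = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
encoded = run_lengths(line)
assert decode_runs(encoded) == line
```

Text pages are mostly long white runs with short black runs, which is why this representation compresses them well.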

CCITT Fax IV


CCITT fax group IV

Textual image compression
Find and isolate marks (connected group of black pixels)
Construct library of symbols
Identify the symbol closest to each mark and get coordinates
Store information
*Store additional information to reconstruct original image
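The steps above can be sketched on a toy page. This is a simplified illustration, not the method of any particular system: marks are found by flood fill with 8-connectivity, and a mark matches a library symbol only if it is pixel-identical. A practical system would match approximately and keep the differences as the residue; with exact matching the residue is empty. All names are hypothetical.

```python
def find_marks(image):
    """Return (left, top, bitmap) for each 8-connected group of '#' pixels."""
    height, width = len(image), len(image[0])
    seen, marks = set(), []
    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != "#" or (x0, y0) in seen:
                continue
            stack, pixels = [(x0, y0)], []
            seen.add((x0, y0))
            while stack:  # flood fill one mark
                x, y = stack.pop()
                pixels.append((x, y))
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == "#" and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            stack.append((nx, ny))
            left = min(x for x, _ in pixels)
            top = min(y for _, y in pixels)
            bitmap = frozenset((x - left, y - top) for x, y in pixels)
            marks.append((left, top, bitmap))
    return marks

def compress(image):
    """Build the symbol library and the list of (symbol index, x, y)."""
    library, placements = [], []
    for left, top, bitmap in find_marks(image):
        if bitmap not in library:  # exact match only (simplifying assumption)
            library.append(bitmap)
        placements.append((library.index(bitmap), left, top))
    return library, placements

def reconstruct(library, placements, width, height):
    """Redraw the page from the library and the placements."""
    grid = [["."] * width for _ in range(height)]
    for index, left, top in placements:
        for dx, dy in library[index]:
            grid[top + dy][left + dx] = "#"
    return ["".join(row) for row in grid]

page = [
    "#.#..#.#..##",
    "###..###..#.",
    "#.#..#.#..##",
]
library, placements = compress(page)
assert reconstruct(library, placements, 12, 3) == page
```

The two identical marks share one library entry, so the page is stored as two symbols plus three placements.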

Library

Residue

Text image outline
Storage √
CCITT Fax Group III and IV √
Textual image compression √
Access
De-skew
Segmentation
Media detection

De-Skew
Projection profile
Accumulate Y-axis pixel histogram
Rotate to find most crisp histogram
One of three common algorithms
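A minimal sketch of the projection-profile idea, with two simplifications that are assumptions of this example: instead of rotating the image, it rotates the coordinates of the black pixels, and it scores "crispness" as the sum of squared histogram counts, which is large when pixels pile up in a few rows. The test page is synthetic: three text lines drawn at a 3 degree skew.

```python
import math

def profile_sharpness(points, angle_deg):
    """Score the Y-axis histogram after rotating black pixels by angle_deg."""
    t = math.radians(angle_deg)
    sin_t, cos_t = math.sin(t), math.cos(t)
    histogram = {}
    for x, y in points:
        row = round(-x * sin_t + y * cos_t)
        histogram[row] = histogram.get(row, 0) + 1
    return sum(count * count for count in histogram.values())

def estimate_skew(points, candidate_angles):
    """Return the candidate angle that gives the crispest histogram."""
    return max(candidate_angles, key=lambda a: profile_sharpness(points, a))

# Synthetic page: three 200-pixel-wide text lines skewed by 3 degrees.
slope = math.tan(math.radians(3))
points = [(x, round(y0 + x * slope)) for y0 in (10, 30, 50) for x in range(200)]
candidates = [a / 2 for a in range(-10, 11)]  # -5.0 to 5.0 in 0.5 degree steps
estimated = estimate_skew(points, candidates)
print("estimated skew (degrees):", estimated)
```

Rotating the page by the negative of the estimated angle would then straighten the text lines.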

Segmentation
Top-down (e.g., X-Y cut)
Bottom-up (e.g., smearing)
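As an illustration of the top-down approach, here is a small recursive X-Y cut. It is a sketch under stated assumptions: the page is a list of strings with '#' for black, a region is split at its widest run of empty rows (then columns) if that run is at least `min_gap` wide, and regions are (top, left, bottom, right) with exclusive lower-right bounds. The threshold and names are illustrative.

```python
def xy_cut(image, min_gap=2):
    """Return leaf blocks as (top, left, bottom, right), bounds exclusive."""

    def occupied(region, axis):
        top, left, bottom, right = region
        if axis == "y":  # one flag per row: does the row contain black?
            return [any(image[y][x] == "#" for x in range(left, right))
                    for y in range(top, bottom)]
        return [any(image[y][x] == "#" for y in range(top, bottom))
                for x in range(left, right)]

    def trim(region):
        """Shrink a region to the bounding box of its black pixels."""
        top, left, _, _ = region
        rows, cols = occupied(region, "y"), occupied(region, "x")
        if not any(rows):
            return None
        last_row = len(rows) - 1 - rows[::-1].index(True)
        last_col = len(cols) - 1 - cols[::-1].index(True)
        return (top + rows.index(True), left + cols.index(True),
                top + last_row + 1, left + last_col + 1)

    def widest_gap(flags):
        """Return (start, length) of the longest interior run of False."""
        best, start = (0, 0), None
        for i, flag in enumerate(flags):
            if not flag:
                if start is None:
                    start = i
            elif start is not None:
                if i - start > best[1]:
                    best = (start, i - start)
                start = None
        return best

    def split(region):
        region = trim(region)
        if region is None:
            return []
        top, left, bottom, right = region
        for axis in ("y", "x"):
            start, length = widest_gap(occupied(region, axis))
            if length >= min_gap:
                if axis == "y":
                    first = (top, left, top + start, right)
                    second = (top + start + length, left, bottom, right)
                else:
                    first = (top, left, bottom, left + start)
                    second = (top, left + start + length, bottom, right)
                return split(first) + split(second)
        return [region]

    return split((0, 0, len(image), len(image[0])))

page = [
    "##########",  # a title line
    "..........",
    "..........",
    "####..####",  # two columns of text
    "####..####",
    "####..####",
]
blocks = xy_cut(page)
```

On this toy page the first cut separates the title from the body, and the second cut separates the two columns.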

Classification
Separate:
Images
Text
Line art
Equations
Tables
One technique:
Slope Histogram (Hough transform)

Hough Transform
A line-to-point transform
In practice, used to find lines in an image (e.g., set of pixels on a line)

Hough Transform
Create virtual lines for each point
Accumulate counts for each bin in Hough space
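The voting procedure can be sketched as follows. The example assumes the common normal-form parameterization rho = x cos(theta) + y sin(theta), with theta in whole degrees and rho rounded to whole pixels; the slide does not say which parameterization the lecture used, and the bin sizes are arbitrary choices.

```python
import math
from collections import Counter

def hough_accumulate(points, theta_step_deg=1):
    """Vote in (theta, rho) space for every line through every point."""
    accumulator = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            t = math.radians(theta_deg)
            rho = x * math.cos(t) + y * math.sin(t)
            accumulator[(theta_deg, round(rho))] += 1
    return accumulator

# 100 pixels on the horizontal line y = 5, plus three stray pixels.
points = [(x, 5) for x in range(100)] + [(3, 40), (70, 12), (50, 80)]
accumulator = hough_accumulate(points)
(theta, rho), votes = accumulator.most_common(1)[0]
print(f"strongest line: theta={theta} degrees, rho={rho}, votes={votes}")
```

Each point votes for all lines that could pass through it, so a bin with many votes corresponds to a line supported by many pixels. A histogram of the winning theta values over a region is one way to get a slope histogram.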

Robust Document Understanding
OCR and document understanding are (currently) fragile technologies
Full scan → OCR → store pipeline makes many assumptions
What are some?
________________
________________
________________
________________
________________

A solution (one of many)
Courtesy Henry Baird’s ICDAR 03 slides.
http://www.cse.lehigh.edu/~baird/Talks/icdar03.ppt#21

Image data
Raster graphics
As an array of pixels
Vector graphics
As a collection of vectors
Which format is appropriate for which images?
Maps
Photographs
Line art
For which use?
Fidelity?
Re-scaling?
Compression?

GIF / PNG
GIF (‘jiff’, Graphics Interchange Format)
Stable, lossless color format
Compression achieved by:
8-bit format (256 colors)
LZW encoding (Unisys patent)
__________________________________.
Interlacing options for low-bandwidth accessibility
PNG (‘ping’, Portable Network Graphics)
Uses ____________________________
Up to 48 bits of color (compared to 8 in GIF)
Support for alpha channels (transparency) and gamma correction (consistent brightness across displays)
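For reference, here is a textbook LZW encoder and decoder over bytes, the dictionary scheme named in the GIF bullet above. It is a sketch of the core idea only: GIF's actual variant also uses variable-width codes, clear and end-of-information codes, and a limited table size, none of which are modeled here.

```python
def lzw_encode(data):
    """Encode bytes as a list of integer codes (codes 0-255 are literals)."""
    table = {bytes([i]): i for i in range(256)}
    current, codes = b"", []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate  # keep extending the match
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # learn the new string
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes

def lzw_decode(codes):
    """Rebuild the bytes; the decoder relearns the same table on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:  # code refers to the string being defined right now
            entry = previous + previous[:1]
        output.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(output)

sample = b"ABABABABABAB"
codes = lzw_encode(sample)
assert lzw_decode(codes) == sample
print(len(sample), "bytes ->", len(codes), "codes")
```

Repetitive input, such as flat areas of color in an image, is replaced by ever longer dictionary entries, and no dictionary needs to be transmitted.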

Joint Photographic Experts Group (JPEG)
Breaks image into 8×8 pixel blocks, each pixel 24 bits (3 YUV channels × 8 bits each)
Compresses each block separately, __________________

JPEG, continued
Transform yields coefficients
Ordered from low frequency (gradual change) to high frequency
Gradual changes well represented
Good for scenery, natural images
JPEG 2000 incorporates wavelet compression
  Better for sharp edges
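The transform JPEG applies to each 8×8 block is the discrete cosine transform (DCT). The sketch below computes a plain 2-D DCT of one block to show why gradual changes are represented well: for a smooth ramp, nearly all of the energy lands in the lowest-frequency coefficients. It leaves out the rest of the JPEG pipeline (level shift, quantization, zig-zag ordering, entropy coding), and the ramp block is made-up test data.

```python
import math

def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block (list of rows)."""
    n = len(block)

    def scale(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

# A gradual left-to-right ramp: every row is 0, 16, 32, ..., 112.
ramp = [[16 * y for y in range(8)] for _ in range(8)]
coeffs = dct_2d(ramp)

total_energy = sum(c * c for row in coeffs for c in row)
low_energy = coeffs[0][0] ** 2 + coeffs[0][1] ** 2
print(f"share of energy in the two lowest coefficients: {low_energy / total_energy:.3f}")
```

Because the high-frequency coefficients of such a block are close to zero, they can be stored coarsely or dropped with little visible loss. A block containing a sharp edge spreads its energy across many more coefficients.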

PostScript
A programming language whose operators draw graphics on the page.
Text is deemed a type of graphic
To “draw” a page, you construct paths that are used to create the image.
A stack-based, usually interpreted language
Uses reverse Polish notation

A simple PostScript example
A method to place some text down the left margin of a page.
You can use this after the marker for the beginning of a page.
gsave % save graphics state on stack
90 rotate % rotate 90 degrees
100 .55 -72 mul moveto % go to coords 100, (.55*-72)
/Times-Roman findfont % Get the font (set of operators) Times-Roman
10 scalefont % set the font size
setfont % Use the specified font
0.3 setgray % Change the color to gray
(PUT NOTE HERE) show % call the individual operators P,U,T …
% to draw letters
grestore % restore the graphics state
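To make the stack discipline concrete, here is a tiny Python evaluator for the arithmetic part of the example. It is only an illustration of reverse Polish notation: it handles numbers and the four arithmetic operators `add`, `sub`, `mul` and `div`, and nothing else from the language.

```python
def eval_rpn(program):
    """Evaluate space-separated numbers and arithmetic operators on a stack."""
    operators = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: a / b,
    }
    stack = []
    for token in program.split():
        if token in operators:
            b = stack.pop()  # operands come off the stack in reverse order
            a = stack.pop()
            stack.append(operators[token](a, b))
        else:
            stack.append(float(token))
    return stack

# The operands that `moveto` finds on the stack in the example above.
stack = eval_rpn("100 .55 -72 mul")
print(stack)
```

After `mul` runs, the stack holds 100 and roughly -39.6, which are the two coordinates `moveto` consumes.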

Portable Document Format
An object database
Based on a subset of PostScript (the page-drawing operators without the general programming constructs), which makes it faster to process
Can use several different compression techniques (e.g., LZW and Huffman)
Proprietary
Has capabilities for hyperlinks

Geospatial Datasets
Which image format is best for maps?
Hmm, let’s think about it.  What goes into a map?
___________________,
which provides the position and shapes of specific geographic features.
____________________,
which provides additional non-graphic information about each feature.
_________________,
which describes how the features will appear on the screen.

Audio
Limit representation to what people can hear
Humans: ~ ____________ kHz
Highest frequency (pitch) determines storage size.
Speech: limited range: up to 3 kHz
Music: full audible range, 20 kHz
Can be referred to as its bandwidth

Sampling
Take continuous signal and discretize
Higher sampling rate = better fidelity
Nyquist and Shannon show minimum sampling rate = 2 × bandwidth
Music: full audible range: ~22 kHz × 2 = 44 kHz
Speech: 4 kHz × 2 = 8 kHz
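A short sketch of what goes wrong below the minimum rate. At the 8 kHz speech rate, a 5 kHz tone lies above the 4 kHz limit, and its samples come out identical to those of a 3 kHz tone, so the two cannot be told apart after sampling (aliasing). The tone frequencies are chosen for illustration.

```python
import math

def min_sampling_rate(bandwidth_hz):
    """Minimum sampling rate for a signal of the given bandwidth."""
    return 2 * bandwidth_hz

def sample_tone(freq_hz, rate_hz, n_samples):
    """Sample a cosine of the given frequency at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]

rate = min_sampling_rate(4000)         # 8000 samples/sec for speech
tone_3k = sample_tone(3000, rate, 16)  # below the 4 kHz limit: captured
tone_5k = sample_tone(5000, rate, 16)  # above the limit: aliases to 3 kHz
aliased = all(abs(a - b) < 1e-9 for a, b in zip(tone_3k, tone_5k))
print("5 kHz samples identical to 3 kHz samples:", aliased)
```

This is why audio is filtered to the target bandwidth before it is sampled.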

Amplitude and Channels
Sampling at these time intervals to get amplitude of signal
a total of ~30-60 dB in loudness
Human ear more sensitive to soft sounds
Compand amplitude (________________________________________________________)
1 or 2 bytes
For each time interval, may have to sample one or more channels
Differential coding (joint stereo)
Dolby AC 3 = ____ channels
Stereo = 2 channels
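One widely used companding curve is mu-law with mu = 255, the curve used in North American and Japanese digital telephony. The sketch below uses the continuous formula, not the segmented 8-bit table a real codec uses, and assumes amplitudes normalized to the range -1 to 1.

```python
import math

MU = 255  # standard value for 8-bit telephony

def mu_law_compress(x):
    """Map amplitude x in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

for amplitude in (0.01, 0.1, 0.5, 1.0):
    companded = mu_law_compress(amplitude)
    print(f"{amplitude:5.2f} -> {companded:.3f}")
```

A soft sound at 1% of full scale is mapped to more than 20% of the output range, so after quantizing to one byte the soft sounds get many more levels than they would under linear coding.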

Storage Requirements (bitrate)
Digital Music:
44 K samples/sec × 16 bits/sample × 2 channels = ~1.4 M bits/sec
Digital Voice:
8 K samples/sec × 8 bits/sample × 1 channel = ~64 K bits/sec
Analog
FM stereo: 40 K samples/sec × 8 bits/sample × 3 channels = ~900 K bits/sec
Telephony: ~6 K samples/sec × 2 bits/sample × 1 channel = ~12 K bits/sec
Formats
AAC: ______________________
MP3: ______________________
GSM: ______________________
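The uncompressed bitrates on this slide are plain products, which the helper below reproduces. It uses the exact CD sampling rate of 44,100 samples/sec in place of the rounded 44 K; the one-hour storage figure is an added illustration.

```python
def bitrate(samples_per_sec, bits_per_sample, channels):
    """Uncompressed bitrate in bits per second."""
    return samples_per_sec * bits_per_sample * channels

music = bitrate(44_100, 16, 2)  # CD-quality stereo
voice = bitrate(8_000, 8, 1)    # digital telephone-quality speech

print(f"music: {music / 1e6:.2f} M bits/sec")
print(f"voice: {voice / 1e3:.0f} K bits/sec")
print(f"one hour of music: {music * 3600 / 8 / 1e6:.0f} MB")
```

The compressed formats listed above exist to bring these figures down.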

Putting media together
Have multimedia, will travel…

XML
A basis for many other technologies
No semantics (eXtensible, not rigid), just allows for hierarchical containment
A meta markup language

XML, continued
Features:
Separation of content from presentation
Content: Document Type Definition (DTD), optional
Presentation: _____________________,
__________________
Enhanced hyperlinking capabilities
Bidirectional linking
Finer grained linking (XPointer)
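A small illustration of "hierarchical containment with no built-in semantics", using Python's standard `xml.etree.ElementTree` parser. The element and attribute names (`lecture`, `topic`, `point`, `week`, `name`) are invented for this example; XML itself gives them no meaning.

```python
import xml.etree.ElementTree as ET

document = """<lecture week="2">
  <title>Representation and digitization of multimedia</title>
  <topic name="Audio">
    <point>Sampling</point>
    <point>Amplitude and channels</point>
  </topic>
</lecture>"""

def outline(element, depth=0):
    """Return one indented line per element, showing the containment tree."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(document)
print("\n".join(outline(root)))
```

The parser recovers the tree without knowing what a "topic" is. Any meaning, and any presentation, has to be supplied by the application or by a separate stylesheet.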

Text Encoding Initiative
To encode knowledge “of literary and linguistic texts for online research and teaching”
better interchange and integration of scholarly data
support for all texts, in all languages, from all periods
guidance for the perplexed: ___ to encode --- hence, a user-driven codification of existing best practice
assistance for the specialist: ___ to encode --- hence, a loose framework into which unpredictable extensions can be fitted
The “beef” in XML.  All the semantics and none of the filling.  It’s quite filling, weighing in at 600 K words! (Think 8 kg of books)

Synchronized Multimedia Integration Language (SMIL) :-)
A script for orchestrating a presentation
Think TV news
Basics:
Define a root window
Layers
Timing
<par> parallel playback
<seq> sequential playback
Media clips have begin and end attributes
To think about: what’s the alternative format to SMIL?  How does it enhance presentation?
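To show how the `<par>` and `<seq>` containers combine, the sketch below computes the total running time of a small SMIL-like fragment. It is a deliberately simplified timing model, not the full SMIL rules: times are plain seconds, each clip's `end` is measured from the moment its container would start it, a `<seq>` lasts the sum of its children, and a `<par>` lasts as long as its longest child. The file names are placeholders.

```python
import xml.etree.ElementTree as ET

presentation = """<seq>
  <par>
    <video src="anchor.mpg" begin="0s" end="10s"/>
    <audio src="theme.mp3" begin="2s" end="8s"/>
  </par>
  <video src="report.mpg" begin="1s" end="21s"/>
</seq>"""

def seconds(value):
    """Parse a clock value such as '10s' (simplified: seconds only)."""
    return float(value.rstrip("s"))

def running_time(node):
    """Time from when a node is started until it has finished."""
    if node.tag == "seq":  # children play one after another
        return sum(running_time(child) for child in node)
    if node.tag == "par":  # children play at the same time
        return max((running_time(child) for child in node), default=0.0)
    return seconds(node.get("end"))  # a media clip

total = running_time(ET.fromstring(presentation))
print("total running time (s):", total)
```

The anchor shot and the theme music play together for 10 seconds, and the report follows, much like the TV news example.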

Summary
Representation of knowledge
The more you know about the media, the faster you can transmit it and the smaller you can store it
Different formats serve different purposes; the differences aren’t superficial
Multimedia representation
Trend toward accessibility, not compressibility
Separation of compression from format

References
More on SMIL: http://www.bu.edu/webcentral/learning/smil1/
SMIL demos: http://www.ludicrum.org/demos/SMILTimingForTheWeb-Demos.html
http://www.geocomm.com/ and http://www.usgs.gov are good spots for GIS information.
Genomic DL indexing and retrieval: http://goanna.cs.rmit.edu.au/~jz/fulltext/ieeekade02.pdf
JPEG: Pennebaker and Mitchell (93), The JPEG Still Image Data Compression Standard
TEI Pizza talk: http://www.tei-c.org/Talks/