Representation and digitization of multimedia*

Week 2 Min-Yen KAN

*Heavily scaled down from original lecture outline :-(

Distribution of media types in the library

LoC NUS U Toronto

Library Type Gov’t Acad Acad

Books and manuscripts 19 M 2.2M 9.1 M

Maps 4 M 278 K

Photographs 12 M 22.1 K 622 K

Music 2.7M 186 K

Motion pictures .9 M 21 K

CD-ROM Databases 1.4K 2.1 K

Question: is the distribution of what we’d like in the digital library the same as in the automated library?

Outline

Representation / Digitization

Textual images

Images

Audio

Coordinated multimedia

Textual images

Cost basis for archives

Digitization

Scanning

Binding

Planetary scanner

Resolution of scan

300 dpi* for access

600 or higher for archival copy
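
To see what the two resolutions mean for storage, here is a minimal sketch of the uncompressed size of one scanned page. The page size (US Letter, 8.5 x 11 in) and the 1 bit per pixel depth are assumptions for illustration, not figures from the lecture.

```python
def scan_size_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a page scanned at the given resolution."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Hypothetical page: US Letter, bi-level (1 bit per pixel).
access = scan_size_bytes(8.5, 11, 300)    # access copy
archival = scan_size_bytes(8.5, 11, 600)  # archival copy
print(access, archival, archival / access)
```

Doubling the resolution quadruples the pixel count, which is one reason to keep separate access and archival copies.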

Digitization

Purpose:

Archival

Quality

Stability in the long term

Accessibility

Delivery

Editing

Annotation

Initiate the digitization project

Establish start-up costs and secure funding

Prepare a detailed project plan, including milestones and deliverables

Assess and select materials for digitization

Digitize materials (prepare source materials, digitize, check quality)

Post-process digital materials: edit, OCR, store, catalog and index

Deliver and make materials accessible

Support and maintenance of materials

-- From Chowdhury and Chowdhury (03)

Document capture costs in USD (ca. 1999)

Images of text

You’ve scanned in an image like this…

What to do with it?

How would we like to store and access this information?

Storing a textual image

Mostly bi-level (two-tone) until recently

CCITT Fax Group III and IV

Bi-level transmission and storage standard

Optimized for Roman alphabet

Textual image compression

Codebook of marks

A level for access and one for preservation
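
The fax standards start from the observation that a bi-level scanline is mostly long runs of one color. A minimal sketch of that first step, assuming a row is a list of 0 (white) and 1 (black) pixels: it extracts alternating run lengths, starting with a white run as the fax convention does. The variable-length codes that the standards then assign to each run are omitted here.

```python
def run_lengths(row):
    """Alternating white/black run lengths; the first run is always white."""
    runs = []
    current = 0   # 0 = white, 1 = black
    count = 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)   # may be 0 if the row starts with black
            current = pixel
            count = 1
    runs.append(count)
    return runs

def expand(runs):
    """Inverse of run_lengths: rebuild the pixel row."""
    row = []
    color = 0
    for n in runs:
        row.extend([color] * n)
        color = 1 - color
    return row

row = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(run_lengths(row))
```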

CCITT Fax IV


Textual image compression

Find and isolate marks (connected group of black pixels)

Construct library of symbols

Identify the symbol closest to each mark and get coordinates

Store information

*Store additional information to reconstruct original image

Library

Residue
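
The steps above can be sketched as follows. This is a toy version under stated assumptions: the image is a list of rows of 0/1 pixels, marks are 8-connected groups, and two marks match when their shapes differ in at most max_diff pixels (a hypothetical, very crude similarity measure). With max_diff=0 the reconstruction is exact and the residue is empty.

```python
def find_marks(image):
    """Return (top, left, shape) for each 8-connected group of black pixels."""
    height, width = len(image), len(image[0])
    seen, marks = set(), []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or (y, x) in seen:
                continue
            stack, pixels = [(y, x)], []
            seen.add((y, x))
            while stack:                      # flood fill one mark
                cy, cx = stack.pop()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            top = min(p[0] for p in pixels)
            left = min(p[1] for p in pixels)
            shape = frozenset((py - top, px - left) for py, px in pixels)
            marks.append((top, left, shape))
    return marks

def compress(image, max_diff=0):
    """Build the symbol library and the list of (symbol, top, left) placements."""
    library, placements = [], []
    for top, left, shape in find_marks(image):
        for index, symbol in enumerate(library):
            if len(shape ^ symbol) <= max_diff:
                break                         # close enough: reuse this symbol
        else:
            index = len(library)
            library.append(shape)             # new symbol
        placements.append((index, top, left))
    return library, placements

def reconstruct(library, placements, height, width):
    """Paint each placed symbol back onto a blank page."""
    image = [[0] * width for _ in range(height)]
    for index, top, left in placements:
        for dy, dx in library[index]:
            if 0 <= top + dy < height and 0 <= left + dx < width:
                image[top + dy][left + dx] = 1
    return image

page = [
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
]
library, placements = compress(page)
print(len(library), placements)
```

When max_diff is greater than zero, the pixels where the reconstruction and the original disagree form the residue that has to be stored for a lossless copy.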

Text image outline

Storage √

CCITT Fax Group III and IV √

Textual image compression √

Access

De-skew

Segmentation

Media detection

De-Skew

Projection profile

Accumulate Y-axis pixel histogram

Rotate to find the crispest (most sharply peaked) histogram

One of three common algorithms
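
A minimal sketch of the projection-profile idea, assuming the page is given as a list of (x, y) black-pixel coordinates. Instead of rotating the image, it rotates the coordinates, builds the row histogram, and scores crispness as the sum of squared bin counts (one possible measure; others are used in practice). The synthetic page and the candidate angles are assumptions for illustration, and the sign of the angle depends on the coordinate convention.

```python
import math
from collections import Counter

def profile_score(points, angle_deg):
    """Crispness of the row histogram after rotating by angle_deg."""
    a = math.radians(angle_deg)
    rows = Counter(round(y * math.cos(a) - x * math.sin(a)) for x, y in points)
    return sum(n * n for n in rows.values())

def estimate_skew(points, candidates):
    """Return the candidate angle whose histogram is most sharply peaked."""
    return max(candidates, key=lambda angle: profile_score(points, angle))

# Hypothetical page: four text baselines, each skewed by 3 degrees.
skew = math.radians(3)
page = [(x, round(y0 + x * math.tan(skew)))
        for y0 in (0, 20, 40, 60) for x in range(200)]
angle = estimate_skew(page, range(-5, 6))
print(angle)
```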

Segmentation

Top-down

(e.g., X-Y cut)

Bottom-up

(e.g. smearing)
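
As an illustration of the bottom-up approach, here is a minimal sketch of horizontal smearing on one pixel row: short white gaps between black pixels are filled so that characters merge into words and words into lines. The threshold value is an assumption; real systems tune it to the scan resolution and usually smear vertically as well.

```python
def smear_row(row, threshold):
    """Fill white gaps of at most `threshold` pixels that lie between black pixels."""
    out = list(row)
    last_black = None
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None and 0 < i - last_black - 1 <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out

def black_runs(row):
    """Return (start, end) index pairs for each run of black pixels."""
    runs, start = [], None
    for i, pixel in enumerate(row):
        if pixel == 1 and start is None:
            start = i
        elif pixel == 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs

row = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
smeared = smear_row(row, threshold=2)
print(smeared, black_runs(smeared))
```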

Classification

Separate:

Images

Text

Line art

Equations

Tables

One technique:

Slope Histogram (Hough transform)

Hough Transform

A line-to-point transform

In practice, used to find lines in an image (i.e., sets of pixels that lie on a common line)

Hough Transform

Create virtual lines for each point

Accumulate counts for each bin in Hough space
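
A minimal sketch of the voting step, assuming the common (theta, rho) parameterization in which a line is rho = x cos(theta) + y sin(theta). Each point votes for every line that could pass through it, and the fullest bin identifies the dominant line. The bin sizes and the test points are assumptions for illustration.

```python
import math
from collections import Counter

def hough_votes(points, theta_step_deg=1, rho_step=1.0):
    """Accumulate votes keyed by (theta in degrees, rho bin)."""
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(theta_deg, round(rho / rho_step))] += 1
    return votes

# Hypothetical input: a horizontal line y = 5 plus a few noise pixels.
points = [(x, 5) for x in range(100)] + [(3, 17), (40, 2), (77, 30)]
peak, count = hough_votes(points).most_common(1)[0]
print(peak, count)
```

A horizontal line has its normal at 90 degrees, so the peak sits at theta = 90 with rho equal to the line's y coordinate.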

Robust Document Understanding

OCR and document understanding are (currently) fragile technologies

Full scan → OCR → store pipeline makes many assumptions

What are some?

________________

________________

________________

________________

________________

A solution (one of many)

Courtesy Henry Baird’s ICDAR 03 slides.

http://www.cse.lehigh.edu/~baird/Talks/icdar03.ppt#21

Image data

Raster graphics

As an array of pixels

Vector graphics

As a collection of vectors

Which format appropriate for which images?

Maps

Photographs

Line art

For which use?

Fidelity?

Re-scaling?

Compression?

GIF / PNG

GIF (‘jiff’, Graphics Interchange Format)

Stable, lossless color format

Compression achieved by:

8-bit format (256 colors)

LZW encoding (Unisys patent)

__________________________________.

Interlacing options for low-bandwidth accessibility
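
To make the LZW step concrete, here is a minimal sketch of the dictionary-building compressor on a text string. It is the textbook algorithm only: GIF's actual variant works on palette indices and adds variable-width codes and clear codes, which are left out here.

```python
def lzw_compress(text):
    """Textbook LZW: emit a code for the longest known prefix, then learn a new one."""
    dictionary = {chr(i): i for i in range(256)}
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                      # keep extending the match
        else:
            codes.append(dictionary[current])        # emit longest known prefix
            dictionary[candidate] = len(dictionary)  # learn the new string
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

codes = lzw_compress("TOBEORNOTTOBEORTOBEORNOT")
print(len(codes), codes)
```

The 24 input characters come out as 16 codes because repeated substrings such as "TO" and "BE" are replaced by single dictionary entries.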

PNG (‘ping’, Portable Network Graphics)

Uses ____________________________

Up to 48 bits of color (compared to 8 in GIF)

Support for alpha channels (transparency) and gamma correction (consistent brightness across displays)

Joint Photographic Experts Group (JPEG)

Breaks image into 8×8 pixel blocks, each pixel 24 bits (3 YUV channels × 8 bits each)

Compresses each block separately, __________________

JPEG, continued

Transform yields coefficients

Ordered from low frequency (gradual change) to high frequency

Gradual changes well represented

Good for scenery, natural images

JPEG 2000 incorporates wavelet compression

Better for sharp edges
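
To illustrate why gradual changes are well represented, here is a minimal sketch of a block transform, assuming the discrete cosine transform (DCT) that baseline JPEG applies to each 8×8 block. Quantization and entropy coding, which do the actual compressing, are omitted. The two test blocks are assumptions for illustration.

```python
import math

def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block (list of rows)."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs

flat = [[100] * 8 for _ in range(8)]                  # uniform gray block
ramp = [[10 * y for y in range(8)] for _ in range(8)]  # smooth left-to-right gradient

flat_c = dct_2d(flat)
ramp_c = dct_2d(ramp)
print(round(flat_c[0][0], 3), round(ramp_c[0][1], 3), round(ramp_c[0][7], 3))
```

A uniform block needs only the first (lowest-frequency) coefficient, and a smooth gradient concentrates its energy in the first few, so the high-frequency coefficients can be stored coarsely or dropped.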

PostScript

A programming language whose operators draw graphics on the page.

Text is deemed a type of graphic

To “draw” a page, you construct paths that are used to create the image.

A stack-based, usually interpreted language

Uses Reverse Polish notation
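
To show what stack-based evaluation in Reverse Polish notation looks like, here is a minimal sketch of an evaluator for number tokens and the four arithmetic operators add, sub, mul and div. It is an illustration in Python, not a PostScript interpreter: graphics operators, names and procedures are not handled.

```python
def eval_rpn(program):
    """Evaluate whitespace-separated RPN tokens; return the final stack."""
    binary = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: a / b,
    }
    stack = []
    for token in program.split():
        if token in binary:
            b = stack.pop()          # operands come off the stack, top first
            a = stack.pop()
            stack.append(binary[token](a, b))
        else:
            stack.append(float(token))   # numbers are simply pushed
    return stack

stack = eval_rpn("100 .55 -72 mul")
print(stack)
```

After evaluation the stack holds two numbers, an x coordinate and a computed y coordinate, which is exactly what an operator such as moveto would then consume.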

A simple PostScript example

A method to place some text down the left margin of a page.

You can use this after the marker for the beginning of a page.

gsave                     % save graphics state on stack
90 rotate                 % rotate 90 degrees
100 .55 -72 mul moveto    % go to coords 100, (.55 * -72)
/Times-Roman findfont     % get the font (set of operators) Times-Roman
10 scalefont              % set the font size
setfont                   % use the specified font
0.3 setgray               % change the color to gray
(PUT NOTE HERE) show      % call the individual operators P, U, T … to draw letters
grestore                  % restore the graphics state

Portable Document Format

An object database

Subset of PostScript, which makes it faster to process

Can use several different compression techniques (e.g., LZW and Huffman)

Proprietary

Has capabilities for hyperlinks

Geospatial Datasets

Which image format is best for maps?

Hmm, let’s think about it. What goes into a map?

___________________, which provides the position and shapes of specific geographic features.

____________________, which provides additional non-graphic information about each feature.

_________________, which describes how the features will appear on the screen.

Audio

Limit representation to what people can hear

Humans: ~ ____________ kHz

Highest frequency (pitch) determines the sampling rate, and hence the storage size.

Speech: limited range, up to 3 kHz

Music: full audible range, up to 20 kHz

Can be referred to as its bandwidth

Sampling

Take continuous signal and discretize

Higher sampling rate = better fidelity

Nyquist and Shannon show minimum sampling rate = 2 × bandwidth

Music: full audible range plus a margin: ~22 kHz × 2 = 44 kHz

Speech: ~4 kHz (3 kHz plus a margin) × 2 = 8 kHz
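
A minimal sketch of the sampling rule and of what goes wrong below it. The helper names are hypothetical. The second function gives the apparent frequency of a pure tone after sampling: a tone above half the sampling rate folds back (aliases) onto a lower frequency.

```python
def min_sampling_rate(bandwidth_hz):
    """Minimum sampling rate for a signal of the given bandwidth."""
    return 2 * bandwidth_hz

def alias_frequency(tone_hz, sampling_rate_hz):
    """Apparent frequency of a pure tone after sampling at sampling_rate_hz."""
    folded = tone_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)

print(min_sampling_rate(22_000), min_sampling_rate(4_000))
print(alias_frequency(3_000, 8_000), alias_frequency(5_000, 8_000))
```

Sampled at 8 kHz, a 3 kHz tone is captured faithfully, but a 5 kHz tone is indistinguishable from a 3 kHz one, which is why the input is band-limited before sampling.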

Amplitude and Channels

Sampling at these time intervals to get amplitude of signal

a total of ~30-60 dB in loudness

Human ear more sensitive to soft sounds

Compand amplitude (________________________________________________________)

1 or 2 bytes

For each time interval, may have to sample one or more channels

Differential coding (joint stereo)

Dolby AC-3 = ____ channels

Stereo = 2 channels
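
As one example of companding, here is a minimal sketch of the mu-law curve with mu = 255, the value used in North American digital telephony. The slide does not say which scheme it has in mind, so this choice is an assumption. Amplitudes are normalized to the range -1 to 1, and the function names are hypothetical.

```python
import math

MU = 255  # assumed companding parameter (mu-law telephony value)

def mu_compress(x):
    """Map a linear amplitude in [-1, 1] onto the companded scale."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

def quantize(value, bits=8):
    """Round a value in [-1, 1] to a signed integer level."""
    levels = 2 ** (bits - 1) - 1
    return round(value * levels)

soft = 0.01  # a quiet sample, 1% of full scale
print(quantize(soft), quantize(mu_compress(soft)))
```

With 8 bits, linear quantization gives the quiet sample only one of the lowest levels, while the companded version spreads soft sounds over many more levels, matching the ear's greater sensitivity there.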

Storage Requirements (bitrate)

Digital Music:

44 K samples/sec × 16 bits/sample × 2 channels = ~1.4 M bits/sec

Digital Voice:

8 K samples/sec × 8 bits/sample × 1 channel = ~64 K bits/sec

Analog

FM stereo: 40 K samples/sec × 8 bits/sample × 3 channels = ~900 K bits/sec

Telephony: ~6 K samples/sec × 2 bits/sample × 1 channel = ~12 K bits/sec
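
The bitrates above are simple products, and the same arithmetic gives storage per minute. A minimal sketch, using the rounded figures from the slide:

```python
def bitrate(samples_per_sec, bits_per_sample, channels):
    """Uncompressed bitrate in bits per second."""
    return samples_per_sec * bits_per_sample * channels

music = bitrate(44_000, 16, 2)
voice = bitrate(8_000, 8, 1)
fm_stereo = bitrate(40_000, 8, 3)
telephony = bitrate(6_000, 2, 1)

# Bytes needed to store one minute of uncompressed digital music.
music_bytes_per_minute = music * 60 // 8
print(music, voice, fm_stereo, telephony, music_bytes_per_minute)
```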

Formats

AAC: ______________________

MP3: ______________________

GSM: ______________________

Putting media together

Have multimedia, will travel…

XML

A basis for many other technologies

No semantics (eXtensible, not rigid), just allows for hierarchical containment

A meta markup language
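
A minimal sketch of hierarchical containment, using Python's standard xml.etree.ElementTree parser. The element and attribute names (lecture, section, topic, week, name) are invented for this example; XML itself attaches no meaning to them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names carry no built-in semantics.
DOC = """<lecture week="2">
  <title>Representation and digitization of multimedia</title>
  <section name="Audio">
    <topic>Sampling</topic>
    <topic>Amplitude and Channels</topic>
  </section>
</lecture>"""

def outline(element, depth=0):
    """List (depth, tag) pairs in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
for depth, tag in outline(root):
    print("  " * depth + tag)
```

The parser recovers the nesting without knowing what any tag means; the meaning comes from whatever application or schema sits on top.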

XML, continued

Features:

Separation of content from presentation

Content: Document Type Definition (DTD), optional

Presentation: _____________________,

__________________

Enhanced hyperlinking capabilities

Bidirectional linking

Finer grained linking (XPointer)

Text Encoding Initiative

To encode knowledge “of literary and linguistic texts for online research and teaching”

better interchange and integration of scholarly data

support for all texts, in all languages, from all periods

guidance for the perplexed: ___ to encode --- hence, a user-driven codification of existing best practice

assistance for the specialist: ___ to encode --- hence, a loose framework into which unpredictable extensions can be fitted

The “beef” in XML. All the semantics and none of the filling. It’s quite filling, weighing in at 600 K words! (Think 8 kg of books)

Synchronized Multimedia Integration Language :-)

A script for orchestrating a presentation

Think TV news

Basics:

Define a root window

Layers

Timing

<par> parallel playback

<seq> sequential playback

Media clips have begin and end attributes
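
A minimal sketch of how <par> and <seq> combine, under a deliberately simplified timing model: each clip's end attribute is read as its end time in seconds relative to the start of its slot, a <par> lasts as long as its longest child, and a <seq> lasts the sum of its children. The file names are hypothetical, and real SMIL timing has many more rules than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical presentation: an anchor shot with music, then a field report.
PRESENTATION = """<seq>
  <par>
    <video src="anchor.mpg" begin="0s" end="10s"/>
    <audio src="theme.mp3" begin="2s" end="8s"/>
  </par>
  <video src="report.mpg" begin="1s" end="21s"/>
</seq>"""

def seconds(value):
    """Parse a value such as '10s' into a float number of seconds."""
    return float(value.rstrip("s"))

def extent(node):
    """Time a node occupies within its container, in seconds."""
    if node.tag == "seq":
        return sum(extent(child) for child in node)          # one after another
    if node.tag == "par":
        return max((extent(child) for child in node), default=0.0)  # side by side
    return seconds(node.get("end", "0"))                      # a media clip

total = extent(ET.fromstring(PRESENTATION))
print(total)
```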

To think about: what’s the alternative format to SMIL? How does it enhance presentation?

Summary

Representation of knowledge

The more you know about the media, the faster you can transmit it and the smaller you can store it

Different formats for different purposes; the differences aren’t superficial

Multimedia representation

Trend toward accessibility, not compressibility

Separation of compression from format

References

More on SMIL: http://www.bu.edu/webcentral/learning/smil1/

SMIL demos: http://www.ludicrum.org/demos/SMILTimingForTheWeb-Demos.html

http://www.geocomm.com/ and http://www.usgs.gov are good spots for GIS information.

Genomic DL indexing and retrieval: http://goanna.cs.rmit.edu.au/~jz/fulltext/ieeekade02.pdf

JPEG: Pennebaker and Mitchell (93), The JPEG Still Image Data Compression Standard

TEI Pizza talk: http://www.tei-c.org/Talks/

