Audio in Digital Library
Huang Wendong
Wang Gang

Motivation
Availability of text data, music data and speech data in the digital library.
How to conveniently satisfy users' information-seeking needs in the digital library, especially for music data.

Introduction of problem
Two convenient query methods in the digital library:
   music retrieval by humming and natural language speech query
Audio processing in the digital library usually involves three levels:

Definition of Problem
For level 1 to level 2, suppose that human beings can perceive a sequence of meaningful units from the raw audio data:
                      {                 }
   The problem is how to detect the same sequence from the raw data automatically.
For level 2 to level 3, suppose the sequence obtained from low-level signal processing is: {                 }
    The problem is how to use high-level knowledge to correct the errors introduced by low-level signal processing and recover the sequence:
                         {                  }

System architecture for query by humming

Music Query in Digital Library
Feature selection:
Melody:  a rhythmic succession of single tones (pitch) organized as an aesthetic whole.
Melody as the feature:
Advantages: robust to errors and distinctive across different songs/music
Tracking pitch in hummed queries:
   the melody is represented as a string over a three-letter alphabet (U, D, S),
   where U, D, and S mean that a note is higher than, lower than, or the same as the previous note, respectively.

Symbol representation of melody
The beginning part can be described as:
{ UUUUDDDSUDSS…}
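The U/D/S encoding can be illustrated with a minimal Python sketch. The function name `melody_contour`, the `tolerance` parameter, and the MIDI note numbers are assumptions made for illustration; the notes are hypothetical and were chosen only so that their contour reproduces the string shown above.

```python
def melody_contour(pitches, tolerance=0.0):
    """Encode a pitch sequence as a U/D/S string relative to the previous note."""
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur - prev > tolerance:
            symbols.append("U")   # note is higher than the previous note
        elif prev - cur > tolerance:
            symbols.append("D")   # note is lower than the previous note
        else:
            symbols.append("S")   # note is the same (within tolerance)
    return "".join(symbols)


# Hypothetical MIDI note numbers, not taken from any real score.
notes = [60, 62, 64, 65, 67, 65, 64, 62, 62, 64, 62, 62, 62]
print(melody_contour(notes))
```

A nonzero `tolerance` (in the same units as the pitches) would let small humming inaccuracies map to S instead of U or D.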

Feature extraction:
Derive the formant parameters from a physical model of the human vocal organs, using mean values.
Synthesize the pitch with these parameters.
Use the autocorrelation method to track the pitch positions of the synthesized pitch.
Use these pitch positions to extract the pitch of the input hummed data.
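The autocorrelation step alone can be sketched as follows. This is a generic single-frame estimator run on a synthetic sine wave, not the formant-based procedure described in the steps above; the function name `autocorr_pitch`, the search range `fmin`/`fmax`, and the 8000 Hz sample rate are all assumptions.

```python
import math


def autocorr_pitch(frame, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation."""
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation; fewer terms at long lags, which
        # mildly favors the shortest period over its multiples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# Synthetic stand-in for a hummed note: a 200 Hz sine sampled at 8000 Hz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(1024)]
f0 = autocorr_pitch(frame, sample_rate)
print(f0)  # expected to be close to 200 Hz
```

Applied frame by frame, such estimates give a pitch track from which note-to-note transitions can be derived.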

Natural language speech query processing (Wang 02 and Wang 03)

Query Model
The query model (QM) is used to analyze the query and extract the core semantic string (CSS), which contains the main semantics of the query.

CSS Extraction by Query Model

Multi-tier query term mapping
To further eliminate speech recognition errors, a multi-tier approach maps the basic components of the CSS to known phrases by using a combination of matching techniques.
To account for possible errors in the CSS components, we perform similarity matching, rather than exact matching, at the three levels. Given a basic CSS component qi and a phrase cj in the dictionary, we compute:
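The similarity measure itself is not preserved here, so the sketch below only illustrates the general idea of similarity matching against a dictionary. It substitutes the character-level ratio from Python's standard `difflib.SequenceMatcher` for the original measure; the function name `best_phrase`, the threshold of 0.6, and the example dictionary are all assumptions.

```python
from difflib import SequenceMatcher


def best_phrase(component, dictionary, threshold=0.6):
    """Return (phrase, score) for the dictionary phrase most similar to component.

    The phrase is None when no candidate reaches the threshold.
    """
    best, best_score = None, 0.0
    for phrase in dictionary:
        # Stand-in similarity: ratio of matching characters, in [0, 1].
        score = SequenceMatcher(None, component, phrase).ratio()
        if score > best_score:
            best, best_score = phrase, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Hypothetical dictionary and a misrecognized component.
dictionary = ["digital library", "music retrieval", "speech query"]
print(best_phrase("digitel librery", dictionary))
```

Similarity matching tolerates a few wrong characters in qi, whereas exact matching would reject the component outright.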

Evaluation
For music retrieval:
For a test set of 183 songs, 10 to 12 pitch transitions are sufficient to discriminate 90% of the songs.
For speech query:
Method a: Here we assume that the natural language query is a bag of words with stop words removed (Ricardo, 1999). Currently, most search engines are based on this approach.
Method b: We applied our query model to extract CSS and employed the multi-tier mapping approach to extract and correct the errors in the CSS components.
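The baseline in Method a can be sketched in a few lines of Python. The stop-word list and the example query are hypothetical and far smaller than anything a real system would use, and whitespace tokenization is assumed for simplicity.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is"}


def bag_of_words(query):
    """Represent a query as a multiset of lowercased terms without stop words."""
    bag = Counter()
    for token in query.lower().split():
        term = token.strip(".,?!")   # drop surrounding punctuation
        if term and term not in STOP_WORDS:
            bag[term] += 1
    return bag


print(bag_of_words("The history of music in the digital library."))
```

Because this representation ignores word order and has no error correction, a misrecognized word in a speech query passes straight into the search terms, which is the weakness Method b targets.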

Conclusion
Audio content applications in the digital library need not only low-level processing to represent the content but also high-level knowledge to correct the errors.

References:
A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by humming - musical information retrieval in an audio database. In ACM Multimedia 95, 1995.
Gang Wang, Tat-Seng Chua, and Yong-Cheng Wang. Extracting Key Semantic Terms from Chinese Speech Query for Web Searches. In 41st Annual Meeting of the Association for Computational Linguistics (ACL’03), Sapporo, Japan, July 7-12, 2003, pp. 248-255.

Questions and Answers
Thanks!