Audio in Digital Library
Motivation
Text, music, and speech data are all available in the digital library.

How can a digital library conveniently satisfy users' information-seeking needs, especially for music data?
Introduction of problem
Two convenient query methods in a digital library: music retrieval by humming and natural language speech query.

Audio processing in a digital library usually involves three levels.
Definition of Problem
For level 1 to level 2, suppose human beings can perceive a sequence of meaningful units from the raw audio data as:

{ }

The problem is how to detect the same sequence from the raw data automatically.

For level 2 to level 3, suppose the sequence obtained from low-level signal processing is:

{ }

The problem is how to use high-level knowledge to correct the errors introduced by low-level signal processing and recover the sequence:

{ }
System architecture for query by humming
Music Query in Digital Library
Feature selection:

Melody: a rhythmic succession of single tones (pitches) organized as an aesthetic whole.

Melody as the feature:

Advantages: robust to errors and distinctive across different songs/music.

Tracking pitch in hummed queries:

The melody is represented as a string over a three-letter alphabet (U, D, S), where U, D, and S indicate that a note is higher than, lower than, or the same as the previous note, respectively.
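
The U/D/S encoding can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the function name `pitch_contour`, the `tolerance` parameter, and the use of MIDI note numbers as input are all assumptions made for the example.

```python
def pitch_contour(pitches, tolerance=0.0):
    """Encode a note sequence as a U/D/S string relative to the previous note.

    `pitches` is any sequence of comparable pitch values (assumed here to be
    MIDI note numbers). Differences within `tolerance` count as "same".
    """
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur - prev > tolerance:
            symbols.append("U")   # current note is higher
        elif prev - cur > tolerance:
            symbols.append("D")   # current note is lower
        else:
            symbols.append("S")   # current note is the same
    return "".join(symbols)


# Hypothetical example phrase, given as MIDI note numbers.
notes = [60, 62, 64, 64, 62, 60, 60]
contour = pitch_contour(notes)
print(contour)  # UUSDDS
```

A nonzero `tolerance` is one way to keep small pitch wobbles in a hummed note from being read as U or D; whether the original system does this is not stated.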
Symbol representation of melody

The beginning part can be described as:

{ UUUUDDDSUDSS…}
Feature extraction:

Derive the formant parameters from a physical model of the human vocal organs, using mean values.

Synthesize the pitch with the above parameters.

Use the autocorrelation method to track the pitch positions of the synthesized pitch.

Use these pitch positions to extract the pitch of the input hummed data.
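
The autocorrelation step can be illustrated with a generic fundamental-frequency estimator. The sketch below is an assumption-laden stand-in: it runs plain autocorrelation on a single frame of a synthetic sine wave, and does not reproduce the formant-based synthesis described above. The function name, the search range `fmin`/`fmax`, and the test signal are all hypothetical.

```python
import math


def autocorrelation_pitch(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Searches lags corresponding to frequencies between fmin and fmax and
    returns the frequency of the lag with the largest autocorrelation,
    or 0.0 if no positive peak is found.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic test signal (assumption): a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(400)]
estimate = autocorrelation_pitch(frame, sample_rate)
print(round(estimate, 1))  # 200.0
```

Real hummed input is noisier than a sine wave, so a practical tracker would typically add framing, windowing, and a voicing check on top of this.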
Natural language speech query processing (Wang 02 and Wang 03)
Query Model
The query model (QM) is used to analyze the query and extract the core semantic string (CSS), which contains the main semantics of the query.
CSS Extraction by Query Model
Multi-tier query term mapping
To further eliminate speech recognition errors, a multi-tier approach maps the basic components of the CSS onto known phrases using a combination of matching techniques.

To account for possible errors in CSS components, we perform similarity matching, rather than exact matching, at the three levels. Given a basic CSS component qi and a phrase cj in the dictionary, we compute:
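
The similarity measure itself does not survive in this text. Purely as an illustration of similarity-based (rather than exact) matching of a component qi against dictionary phrases cj, the sketch below uses the character-level ratio from Python's standard `difflib.SequenceMatcher` as a stand-in. The function name `best_match`, the threshold of 0.6, and the example phrases are assumptions, and this is not the measure used in the cited work.

```python
from difflib import SequenceMatcher


def best_match(component, dictionary, threshold=0.6):
    """Return (phrase, score) for the most similar dictionary phrase.

    Similarity is difflib's character-level ratio in [0, 1] (a stand-in
    measure). If no phrase reaches `threshold`, the phrase is None.
    """
    best_phrase, best_score = None, 0.0
    for phrase in dictionary:
        score = SequenceMatcher(None, component, phrase).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_score < threshold:
        return None, best_score
    return best_phrase, best_score


# Hypothetical dictionary and a misrecognized component.
phrases = ["digital library", "music retrieval", "speech query"]
match, score = best_match("music retreival", phrases)
print(match)
```

A multi-tier scheme would apply a measure like this at each level in turn and fall through to the next level only when no phrase passes the threshold.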
Evaluation
For music retrieval:

For a test set of 183 songs, 10 to 12 pitch transitions are sufficient to discriminate 90% of the songs.

For speech query:

Method a: We assume that the natural language query is a bag of words with stop words removed (Ricardo, 1999). Currently, most search engines are based on this approach.

Method b: We applied our query model to extract the CSS and employed the multi-tier mapping approach to extract and correct the errors in the CSS components.
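
The bag-of-words baseline (Method a) can be sketched as follows. This is a minimal English-language illustration only: the tiny stop-word list, the regular-expression tokenizer, and the example query are assumptions, and a Chinese speech query would need a word segmenter instead.

```python
import re
from collections import Counter

# Toy stop-word list (assumption); real systems use much larger lists.
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "to", "me", "please", "i", "is", "about"}


def bag_of_words(query, stop_words=STOP_WORDS):
    """Lowercase the query, split it into word tokens, and drop stop words.

    Returns a Counter mapping each remaining term to its frequency;
    word order is discarded, which is what makes it a bag of words.
    """
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return Counter(t for t in tokens if t not in stop_words)


# Hypothetical recognized query.
bag = bag_of_words("Please find me the news about the digital library")
print(sorted(bag))  # ['digital', 'find', 'library', 'news']
```

Because every surviving token is treated as an equally valid query term, a misrecognized word goes straight into the search, which is the weakness Method b is meant to address.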
Conclusion
Audio content applications in digital libraries need not only low-level processing to represent the content but also high-level knowledge to correct the errors.
References:

A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by humming - musical information retrieval in an audio database. In ACM Multimedia 95, 1995.

Gang Wang, Tat-Seng Chua, and Yong-Cheng Wang. Extracting Key Semantic Terms from Chinese Speech Query for Web Searches. 41st Annual Meeting of the Association for Computational Linguistics (ACL’03), Sapporo, Japan, July 7-12, 2003, pp. 248-255.
Questions and answers