|
1
|
|
|
2
|
- Availability of text data, music data and speech data in the digital
library.
- How to conveniently satisfy users' information-seeking needs in a
digital library, especially for music data.
|
|
3
|
- Two convenient query methods in a digital library:
- music retrieval by humming, and
- natural language speech query
- Audio processing in a digital library usually involves three levels:
|
|
4
|
- From level 1 to level 2, suppose a human listener perceives the raw
audio data as a sequence of meaningful units:
- { }
- The problem is how to detect the same sequence from the raw data
automatically.
- From level 2 to level 3, suppose the sequence obtained from low-level
signal processing is:
- { }
- The problem is how to use high-level knowledge to correct the errors
introduced by low-level signal processing and recover the sequence:
- { }
|
|
5
|
|
|
6
|
- Feature selection:
- Melody: a rhythmic succession of single tones (pitches) organized as
an aesthetic whole.
- Melody as the feature:
- Advantages: robust to errors and distinctive across different
songs/music
- Tracking pitch in hummed queries:
- The melody is represented as a string over a three-letter alphabet
(U, D, S),
- where U, D and S mean that a note is higher than, lower than, or the
same as the previous note, respectively.
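The U/D/S encoding can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the pitch tracker has already produced one pitch value per note (MIDI note numbers here), and the `tolerance` parameter for treating two notes as "the same" is a hypothetical addition.

```python
def contour(pitches, tolerance=0.5):
    """Encode a note sequence as a U/D/S string.

    pitches: one pitch value per note (assumed here to be MIDI note numbers).
    tolerance: hypothetical threshold below which two notes count as the same.
    """
    letters = []
    for prev, cur in zip(pitches, pitches[1:]):
        diff = cur - prev
        if diff > tolerance:
            letters.append("U")   # note is higher than the previous one
        elif diff < -tolerance:
            letters.append("D")   # note is lower than the previous one
        else:
            letters.append("S")   # note is the same as the previous one
    return "".join(letters)


notes = [60, 62, 64, 64, 62, 60]
print(contour(notes))  # UUSDD
```

A sequence of n notes yields n - 1 letters, since each letter describes a transition between two consecutive notes.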
|
|
7
|
- The beginning part can be described as:
- { UUUUDDDSUDSS…}
|
|
8
|
- Derive the formant parameters, using mean values, from a physical model
of the human vocal organ.
- Synthesize the pitch with the above parameters.
- Use the autocorrelation method to track the pitch positions of the
synthesized pitch.
- Use these pitch positions to extract the pitch of the input hummed
data.
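The autocorrelation step can be illustrated on its own. The sketch below is an assumption-laden simplification: it estimates the fundamental frequency of a single frame by picking the lag with the strongest autocorrelation, and it skips the formant-based synthesis described above. The function name `estimate_pitch` and the search range `fmin`/`fmax` are hypothetical.

```python
import math


def estimate_pitch(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    fmin and fmax bound the search range; they are illustrative defaults,
    not values taken from the slides.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # No positive correlation peak found: treat the frame as unvoiced.
    return sample_rate / best_lag if best_lag else None


# Synthetic test tone: 0.1 s of a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
pitch = estimate_pitch(tone, sample_rate)
print(pitch)  # expected to be close to 200 Hz
```

Real hummed input is noisier than a pure sine, so a practical tracker would also need framing, voicing detection and smoothing across frames.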
|
|
9
|
|
|
10
|
- A query model (QM) is used to analyze the query and extract the core
semantic string (CSS), which carries the main semantics of the query.
|
|
11
|
|
|
12
|
- To further reduce speech recognition errors, a multi-tier approach maps
the basic components of the CSS to known phrases using a combination of
matching techniques.
- To account for possible errors in the CSS components, we perform
similarity matching, rather than exact matching, at the three levels.
Given a basic CSS component qi and a phrase cj in the dictionary, we
compute:
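As a rough illustration of similarity matching between a component qi and dictionary phrases cj, the sketch below uses the character-level ratio from Python's `difflib.SequenceMatcher`. This is a stand-in measure chosen for the example, not the measure used in this work, and it covers a single matching level only. The function name `best_match`, the threshold and the sample phrases are all hypothetical.

```python
from difflib import SequenceMatcher


def best_match(component, dictionary, threshold=0.6):
    """Return (phrase, score) for the most similar dictionary phrase.

    Returns (None, score) when no phrase reaches the threshold.
    The threshold value is an illustrative assumption.
    """
    best_phrase, best_score = None, 0.0
    for phrase in dictionary:
        # ratio() is 1.0 for identical strings and 0.0 for no overlap.
        score = SequenceMatcher(None, component, phrase).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_score < threshold:
        return None, best_score
    return best_phrase, best_score


phrases = ["digital library", "music retrieval", "speech query"]
match, score = best_match("musik retreival", phrases)
print(match)  # music retrieval
```

Accepting the closest phrase above a threshold, instead of requiring an exact hit, is what lets a misrecognized component still map to a known phrase.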
|
|
13
|
- For music retrieval:
- For a test set of 183 songs, 10 to 12 pitch transitions are sufficient
to discriminate 90% of the songs.
- For speech query:
- Method a: Here we assume that the natural language query is a bag of
words with stop words removed (Ricardo, 1999). Currently, most search
engines are based on this approach.
- Method b: We applied our query model to extract the CSS and employed
the multi-tier mapping approach to extract and correct the errors in the
CSS components.
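Method a can be sketched as follows. The stop word list, the tokenizer and the English sample query are assumptions made for the example; a real system would use a full stop word list for its target language.

```python
import re
from collections import Counter

# Small illustrative stop word list (an assumption, not from the slides).
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "for", "me", "i",
              "please", "some", "about"}


def bag_of_words(query, stop_words=STOP_WORDS):
    """Turn a query into a bag (multiset) of its non-stop-word tokens."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return Counter(t for t in tokens if t not in stop_words)


bag = bag_of_words("Please find me some songs about the sea")
print(sorted(bag))  # ['find', 'sea', 'songs']
```

Because a bag ignores word order and structure, a single misrecognized word passes straight into the query terms, which is the weakness Method b tries to address.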
|
|
14
|
- Audio content applications in digital libraries need not only low-level
processing to represent the content but also high-level knowledge to
correct the errors.
|
|
15
|
- A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by humming:
musical information retrieval in an audio database. In ACM Multimedia
95, 1995.
- G. Wang, T.-S. Chua, and Y.-C. Wang. Extracting key semantic terms from
Chinese speech query for Web searches. In 41st Annual Meeting of the
Association for Computational Linguistics (ACL'03), Sapporo, Japan,
July 7-12, 2003, pp. 248-255.
|
|
16
|
|