1	Audio in Digital Library. Huang Wendong, Wang Gang
2	Motivation: Digital libraries hold text data, music data, and speech data. How can a digital library conveniently satisfy users' information-seeking needs, especially for music data?
3	Introduction to the problem: Two convenient query methods in a digital library are music retrieval by humming and natural language speech queries. Audio processing in a digital library usually involves three levels:
4	Definition of the problem: For level 1 to level 2, suppose a human listener perceives a sequence of meaningful units in the raw audio data as: { } The problem is how to detect the same sequence from the raw data automatically. For level 2 to level 3, suppose the sequence obtained from low-level signal processing is: { } The problem is how to use high-level knowledge to correct the errors introduced by low-level signal processing and recover the sequence { }
5	System architecture for query by humming
6	Music Query in Digital Library. Feature selection: Melody: a rhythmic succession of single tones (pitches) organized as an aesthetic whole. Melody as the feature: Advantages: robust to errors and distinctive across different songs and pieces of music. Tracking pitch in hummed queries: the melody is represented as a string over a three-letter alphabet (U, D, S), where U, D, and S mean that a note is higher than, lower than, or the same as the previous note, respectively.
7	Symbol representation of melody The beginning part can be described as: { UUUUDDDSUDSS…}
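The U/D/S encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the function name `melody_contour`, the `tolerance` parameter, and the use of MIDI-style note numbers as input are all assumptions made for the example.

```python
def melody_contour(pitches, tolerance=0.0):
    """Encode a pitch sequence as a string over the alphabet U, D, S.

    pitches   -- note pitches in order (assumed here to be MIDI-style numbers)
    tolerance -- hypothetical threshold: differences no larger than this
                 count as "same", which absorbs small humming inaccuracies
    """
    symbols = []
    for previous, current in zip(pitches, pitches[1:]):
        if current - previous > tolerance:
            symbols.append("U")   # note is higher than the previous note
        elif previous - current > tolerance:
            symbols.append("D")   # note is lower than the previous note
        else:
            symbols.append("S")   # note is the same as the previous note
    return "".join(symbols)


# Five notes give four transitions.
print(melody_contour([60, 62, 64, 64, 62]))  # UUSD
```

A sequence of n notes always yields n - 1 symbols, since the first note has no predecessor to compare against.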
8	Feature extraction: (1) Derive the formant parameters from a physical model of the human vocal organ, using mean values. (2) Synthesize the pitch with these parameters. (3) Use the autocorrelation method to track the pitch positions of the synthesized pitch. (4) Use these pitch positions to extract the pitch of the input hummed data.
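The autocorrelation step can be illustrated on its own. The sketch below is a generic autocorrelation pitch estimator for a single frame, not the formant-based pipeline from the slide; the function names, the 80 to 500 Hz search range, and the 8 kHz sample rate are assumptions chosen for the example.

```python
import math


def autocorrelation_pitch(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one audio frame.

    The estimate is the lag, within the assumed [fmin, fmax] search range,
    at which the frame correlates most strongly with a shifted copy of itself.
    Returns None when no lag has positive correlation (e.g. silence).
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None


def sine_frame(freq, sample_rate=8000, length=400):
    """Synthetic test tone standing in for a hummed note (hypothetical helper)."""
    return [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(length)]


# A 200 Hz tone sampled at 8 kHz should be estimated at roughly 200 Hz.
print(autocorrelation_pitch(sine_frame(200.0), 8000))
```

Restricting the lag range to plausible humming frequencies keeps the search cheap and avoids picking the trivial peak at lag zero.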
9	Natural language speech query processing (Wang 02 and Wang 03)
10	Query Model: The query model (QM) is used to analyze the query and extract the core semantic string (CSS), which carries the main semantics of the query.
11	CSS Extraction by Query Model
12	Multi-tier query term mapping: To further reduce speech recognition errors, a multi-tier approach maps the basic components of the CSS to known phrases using a combination of matching techniques. To account for possible errors in the CSS components, we perform similarity matching, instead of exact matching, at the three levels. Given a basic CSS component qi and a phrase cj in the dictionary, we compute:
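The similarity formula itself is not preserved in this transcript, so the following is not the authors' measure. As a stand-in, the sketch shows the general idea of similarity (rather than exact) matching between a component qi and dictionary phrases cj, using normalized edit distance at a single level. The function names, the 0.6 threshold, and the example phrases are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(q, c):
    """Stand-in similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(q), len(c))
    return 1.0 if longest == 0 else 1.0 - edit_distance(q, c) / longest


def map_component(q, dictionary, threshold=0.6):
    """Map a CSS component to its most similar known phrase.

    Falls back to the component itself when no phrase reaches the
    (assumed) threshold, so unknown terms pass through unchanged.
    """
    best = max(dictionary, key=lambda c: similarity(q, c), default=None)
    if best is not None and similarity(q, best) >= threshold:
        return best
    return q


phrases = ["digital library", "music retrieval"]
print(map_component("digitel library", phrases))  # digital library
```

A multi-tier version would apply a function like `map_component` at each level in turn, with a similarity measure suited to that level.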
13	Evaluation: For music retrieval: on a test set of 183 songs, 10 to 12 pitch transitions are sufficient to discriminate 90% of the songs. For speech query: Method a: we assume that the natural language query is a bag of words with stop words removed (Ricardo, 1999); most current search engines are based on this approach. Method b: we apply our query model to extract the CSS and use the multi-tier mapping approach to detect and correct errors in the CSS components.
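For reference, the baseline in Method a can be sketched as follows. This is only an illustration of the bag-of-words idea: the stop word list, the whitespace tokenization, and the example query are assumptions, not the evaluation setup used by the authors.

```python
from collections import Counter

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "to", "me", "please", "find"}


def bag_of_words(query, stop_words=STOP_WORDS):
    """Reduce a query to an unordered multiset of its non-stop-word tokens."""
    tokens = query.lower().split()
    return Counter(token for token in tokens if token not in stop_words)


print(bag_of_words("Please find the music of Beethoven"))
```

Because word order and structure are discarded, a misrecognized word simply becomes a wrong query term, which is the weakness that Method b's CSS extraction and mapping aim to address.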
14	Conclusion: Audio content applications in digital libraries need not only low-level processing to represent the content but also high-level knowledge to correct the errors that processing introduces.
15	References: A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by humming - musical information retrieval in an audio database. In ACM Multimedia 95, 1995. Gang Wang, Tat-Seng Chua, and Yong-Cheng Wang. Extracting Key Semantic Terms from Chinese Speech Query for Web Searches. In 41st Annual Meeting of the Association for Computational Linguistics (ACL'03), Sapporo, Japan, July 7-12, 2003, pp. 248-255.
16	Questions and answers. Thanks!