Audio in Digital Library

Huang Wendong

Wang Gang

Motivation

Availability of text data, music data and speech data in the digital library.

How can a digital library conveniently satisfy users' information-seeking needs, especially for music data?

Introduction to the problem

Two convenient query methods in a digital library:

music retrieval by humming and natural-language speech query

Audio processing in a digital library usually involves three levels:

Definition of Problem

For level 1 to level 2, suppose a human listener perceives the following sequence of meaningful units in the raw audio data:

{ }

The problem is how to detect the same sequence from the raw data automatically.

For level 2 to level 3, suppose the sequence obtained from low-level signal processing is:

{ }

The problem is how to use high-level knowledge to correct the errors introduced by low-level signal processing and recover the sequence

{ }

System architecture for query by humming

Music Query in Digital Library

Feature selection:

Melody: a rhythmic succession of single tones (pitch) organized as an aesthetic whole.

Melody as the feature:

Advantages: robust to errors and distinctive across different songs

Tracking pitch in hummed queries:

The melody is represented as a string over a three-letter alphabet (U, D, S),

where U, D, and S mean that a note is higher than, lower than, or the same as the previous note, respectively.

Symbol representation of melody

The beginning part can be described as:

{ UUUUDDDSUDSS…}
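The U/D/S encoding described above can be sketched in a few lines. This is a minimal illustration, not the system's implementation: the function name `melody_contour`, the `tolerance` parameter, and the note values (MIDI-style numbers chosen only so that the result matches the example string on this slide) are all assumptions.

```python
def melody_contour(pitches, tolerance=0.0):
    """Encode each note relative to the previous one as U, D, or S.

    `pitches` is a sequence of numeric pitch values (hypothetical
    MIDI-style note numbers here). `tolerance` is an assumed parameter:
    differences no larger than it count as "same".
    """
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        diff = cur - prev
        if diff > tolerance:
            symbols.append("U")   # note is higher than the previous note
        elif diff < -tolerance:
            symbols.append("D")   # note is lower than the previous note
        else:
            symbols.append("S")   # note is the same as the previous note
    return "".join(symbols)


# Hypothetical note sequence, chosen to reproduce the slide's example.
notes = [60, 62, 64, 65, 67, 65, 64, 62, 62, 64, 62, 62, 62]
print(melody_contour(notes))  # UUUUDDDSUDSS
```

A sequence of n notes yields n - 1 symbols, since the first note has no predecessor to compare against.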

Feature extraction:

Derive the formant parameters, using mean values, from a physical model of the human vocal organs.

Synthesize the pitch with the above parameters.

Use the autocorrelation method to track the pitch positions of the synthesized pitch.

Use these pitch positions to extract the pitch of the input hummed data.
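The autocorrelation step can be illustrated on its own. The sketch below estimates the fundamental frequency of a single frame by picking the lag with the largest autocorrelation; it does not cover the formant-model synthesis steps. The function name `estimate_pitch`, the 8000 Hz sample rate, the 80 to 500 Hz search range, and the synthetic sine test signal are all assumptions made for illustration.

```python
import math


def estimate_pitch(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Estimate the pitch (Hz) of one frame by autocorrelation.

    The search range fmin..fmax is an assumed range for hummed input.
    Returns None if the frame is too short for the lowest pitch.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    if len(frame) <= max_lag:
        return None

    best_lag, best_value = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag: sum of products of the signal
        # with a copy of itself shifted by `lag` samples.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Synthetic stand-in for a hummed note: a 220 Hz sine wave.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 220.0 * n / sample_rate) for n in range(1024)]
pitch = estimate_pitch(frame, sample_rate)
print(round(pitch, 1))
```

Because the lag is an integer number of samples, the estimate is quantized; at this sample rate the result lands within a few hertz of the true 220 Hz rather than exactly on it.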

Natural language speech query processing (Wang 02 and Wang 03)

Query Model

The query model (QM) is used to analyze the query and extract the core semantic string (CSS), which carries the main meaning of the query.

CSS Extraction by Query Model

Multi-tier query term mapping

To further reduce speech recognition errors, a multi-tier approach maps the basic components of the CSS to known phrases using a combination of matching techniques.

To account for possible errors in the CSS components, we perform similarity matching, rather than exact matching, at the three levels. Given a basic CSS component q_i and a phrase c_j in the dictionary, we compute:
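The similarity measure itself is not given in this text, so the sketch below is only a stand-in that illustrates similarity matching in place of exact matching. It uses the generic character-level ratio from Python's standard `difflib` module on a single tier, which is not the measure or the three-tier scheme of the original work. The function name `best_match`, the 0.6 threshold, and the sample dictionary are assumptions.

```python
from difflib import SequenceMatcher


def best_match(component, dictionary, threshold=0.6):
    """Return (phrase, score) for the dictionary phrase most similar
    to `component`, or None if no phrase reaches `threshold`.

    The score is difflib's character-level similarity ratio in [0, 1],
    used here as an assumed substitute for the slide's measure.
    """
    best_phrase, best_score = None, 0.0
    for phrase in dictionary:
        score = SequenceMatcher(None, component, phrase).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_phrase is None or best_score < threshold:
        return None
    return best_phrase, best_score


# Hypothetical dictionary of known phrases.
phrases = ["digital library", "music retrieval", "speech query"]

# A component with a recognition-style error still maps to its phrase.
print(best_match("digtal library", phrases))
# A component unlike any known phrase is rejected.
print(best_match("xyz", phrases))
```

The threshold keeps a badly misrecognized component from being forced onto an unrelated phrase; where to set it would depend on the recognizer's error profile.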

Evaluation

For music retrieval:

For a test set of 183 songs, 10 to 12 pitch transitions are sufficient to discriminate 90% of the songs.

For speech query:

Method a: We treat the natural-language query as a bag of words with stop words removed (Ricardo, 1999). Currently, most search engines are based on this approach.

Method b: We apply our query model to extract the CSS and use the multi-tier mapping approach to correct the errors in the CSS components.

Conclusion

Audio content applications in digital libraries need not only low-level processing to represent the content but also high-level knowledge to correct the errors.

References:

A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by humming - musical information retrieval in an audio database. In ACM Multimedia 95, 1995.

Gang Wang, Tat-Seng Chua and Yong-Cheng Wang. Extracting Key Semantic Terms from Chinese Speech Query for Web Searches. 41st Annual Meeting of the Association for Computational Linguistics (ACL’03), Sapporo, Japan. July 7-12, 2003. 248-255.

Questions and answers

Thanks!

