1
Survey on Metadata Extraction and Indexing
  • Chen Xi
  • Wang Xiaohang
2
Agenda
  • Extracting metadata


  • Evaluating the metadata extracted as index terms


3
Automatic Document Metadata Extraction using Support Vector Machine
  • Hui Han, Lee Giles, et al.


4
Metadata elements
  • Taggable Metadata
    • Elements: title, author, affiliation, etc.
    • Method: directly identified
  • Non-taggable Metadata
    • Elements: subject, coverage, genre, etc.
    • Method: classified into given topics via text categorization (TC)
5
Single-class vs. Multi-class lines
  • Single-class line
    • Each line with a single element
    • e.g. <title> Automatic Document Metadata Extraction using Support Vector Machine </title>
  • Multi-class line
    • Each line with several elements
    • e.g. <author> Hui Han, C. Lee Giles, Eren Manavoglu </author>
6
Two-Step Extraction
  • Line classification
    • Line → single-class / multi-class


  • Chunk identification
    • Multi-class line → multiple elements

7
Line Classification – step 1
  • Independent classification
    • SVM
    • Feature selection:
      • Word-specific
        • “X. Wang” → “X.” is a single capital; “Wang” is a capitalized non-dictionary word
      • Line-specific
        • no. of words, line number, percentage of non-dictionary words etc.
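The word- and line-specific features above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the toy `DICTIONARY` and the exact feature set are assumptions for demonstration.

```python
import re

# Toy dictionary for illustration only; the paper would use a real word list.
DICTIONARY = {"automatic", "document", "metadata", "extraction", "using",
              "support", "vector", "machine"}

def word_features(word):
    """Word-specific features, e.g. 'X.' is a single capital and
    'Wang' is a capitalized non-dictionary word."""
    return {
        "single_cap": bool(re.fullmatch(r"[A-Z]\.?", word)),
        "cap_non_dict": word[0].isupper() and word.lower() not in DICTIONARY,
    }

def line_features(line, line_number):
    """Line-specific features: word count, position of the line in the
    header, and percentage of non-dictionary words."""
    words = line.split()
    non_dict = [w for w in words if w.lower().strip(".,") not in DICTIONARY]
    return {
        "num_words": len(words),
        "line_number": line_number,
        "pct_non_dict": len(non_dict) / len(words) if words else 0.0,
    }
```

Vectors like these would then be fed to the SVM line classifier.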



8
Line Classification – step 2
  • Contextual classification
    • Predictions from step 1 provide context
      • e.g. title lines, author line, affiliation line…
    • Context among lines can be used as features
      • e.g. title lines are often followed by an author line
    • Contextual features are combined with the other features, and the classifier is trained iteratively
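The iterative contextual step can be sketched as a fixed-point loop: each line's features are extended with its neighbours' predicted labels, and classification is repeated until the labels stabilize. The `classify` rule below is a toy stand-in for the trained SVM, and the feature names are assumptions for illustration.

```python
def classify(features, context):
    # Toy stand-in for an SVM decision: a line with names right after
    # a "title" line is labeled "author"; otherwise keep the
    # independent (step-1) guess.
    if context.get("prev") == "title" and features["has_names"]:
        return "author"
    return features["independent_guess"]

def contextual_classification(lines, max_iters=5):
    """Iteratively reclassify lines using neighbours' labels as context."""
    labels = [f["independent_guess"] for f in lines]  # step-1 output
    for _ in range(max_iters):
        new_labels = []
        for i, feats in enumerate(lines):
            context = {"prev": labels[i - 1] if i > 0 else None}
            new_labels.append(classify(feats, context))
        if new_labels == labels:  # fixed point reached, stop early
            break
        labels = new_labels
    return labels
```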




9
Chunk identification
  • Chunking an N-class line = finding N-1 chunk boundaries
  • Boundary candidates
    • Space
      • authors - “Hui Han C. Lee Giles”
    • Punctuation
      • affiliation & address – “NUS, 10 Kent Ridge Crescent, Singapore 119260”





10
Summary
  • Classification-based method using SVM
  • Independent and contextual features


11
Automatic Identification and Organization of Index Terms for Interactive Browsing
  • Nina Wacholder, David K. Evans, Judith L. Klavans
12
Problems discussed
  • Automatically identify phrases as index terms in a dynamic text browser (DTB)



  • What constitutes useful index terms?
    • Standard IR metrics do not apply to this task.

13
Index Identification
                    ---- Head sorting method
  • Head: the syntactically and semantically most important word of a phrase.
    • “Filter” is the head of “coffee filter”, “oil filter”, etc.
  • Simplex NPs (SNPs) are identified instead of Complex NPs (CNPs)
    • “A form” and “cancer causing asbestos” are two SNPs
    • “A form of cancer causing asbestos” is a CNP.
  • NPs are sorted by head according to their significance (frequency)
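Head sorting can be sketched as follows. Taking the last word of an SNP as its head is a simplifying assumption (the paper uses real syntactic analysis); NPs are grouped by head and heads are ranked by how many NPs they govern.

```python
from collections import defaultdict

def head_sort(simplex_nps):
    """Group simplex NPs by head and rank heads by descending frequency."""
    by_head = defaultdict(list)
    for np in simplex_nps:
        head = np.split()[-1].lower()  # crude head = last word (assumption)
        by_head[head].append(np)
    # Heads governing more NPs are treated as more significant.
    return sorted(by_head.items(), key=lambda kv: -len(kv[1]))

# e.g. "filter" heads three NPs and ranks first:
ranking = head_sort(["coffee filter", "oil filter", "air filter", "a form"])
```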
14
What constitutes a useful index
                   ---- how to evaluate
  • Precision and Recall do not apply to index terms.
  • Three basic criteria:
    • Coherence
    • Usefulness of index terms
    • Thoroughness of coverage of document content



15
Coherence - 3 ratings
  • Coherent: coherent and a well-formed NP
  • Incoherent: neither coherent nor an NP
  • Intermediate: coherent, but not a well-formed NP
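The three-way rating can be sketched as a small decision function. The two boolean inputs would come from human judges; treating "intermediate" as coherent-but-not-an-NP is an interpretation of the slide, not a claim from the paper.

```python
def coherence_rating(is_coherent, is_np):
    """Map judge decisions to one of the three coherence ratings."""
    if is_coherent and is_np:
        return "coherent"
    if not is_coherent and not is_np:
        return "incoherent"
    return "intermediate"  # coherent but not a well-formed NP (assumption)
```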


16
Usefulness of index terms
  • Qualitative judgment
    • On a scale from 1–5 (1 = high quality, 5 = junk)


  • Terms identified by 3 domain-independent techniques:
    • Key words
    • Technical terms
    • Head sorted NPs
17
Usefulness of index terms
18
Thoroughness of coverage of document
  • Evaluate the number of terms identified relative to the size of the text.
    • (# unique NPs) / (# words in a document) is around 0.27
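The thoroughness measure above is a simple ratio; a sketch, assuming NP extraction has already been done elsewhere:

```python
def thoroughness(unique_nps, document_text):
    """Ratio of unique NPs identified to words in the document.
    The paper reports values around 0.27 for its method."""
    words = document_text.split()
    return len(set(unique_nps)) / len(words) if words else 0.0
```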
19
Summary
  • A new way to evaluate index terms besides precision and recall.
    • Qualitative + Quantitative metrics


    • Open question: what is a good thoroughness ratio?


20
Issues
  • Extraction
    • Taggable Metadata – Information Extraction
    • Non-taggable Metadata – Text Categorization / Summarization
  • Evaluation
    • How to make use of modifiers.
21
Thank You!!   =)
  • Q&A time