|
1
|
|
|
2
|
- Extracting metadata
- Evaluating the metadata extracted as index terms
|
|
3
|
- Hui Han, Lee Giles, et al.
|
|
4
|
- Taggable Metadata
- Elements: title, author, affiliation, etc.
- Method: directly identified
- Non-taggable Metadata
- Elements: subject, coverage, genre, etc.
- Method: classified into given topics via text categorization (TC)
|
|
5
|
- Single-class line
- Each line with a single element
- e.g. <title> Automatic Document Metadata Extraction using Support Vector Machine </title>
- Multi-class line
- Each line with several elements
- e.g. <author> Hui Han, C. Lee Giles, Ere Manavoglu</author>
|
|
6
|
- Line classification
- Line → single class / multi-class
- Chunk identification
- Multi-class line → multiple elements
|
|
7
|
- Independent classification
- SVM
- Feature selection:
- Word-specific
- “X. Wang” → single capital :: capitalized non-dictionary word
- Line-specific
- no. of words, line number, percentage of non-dictionary words etc.
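The two feature families above can be illustrated with a minimal sketch. The token names (`:single_cap:`, `:cap_non_dict:`, `:non_dict:`), the tiny `DICTIONARY` set, and the function names are assumptions made for illustration, not the authors' actual feature set; a real system would load a full word list and feed the resulting features to an SVM.

```python
import re

# Hypothetical mini-dictionary; a real system would load a full word list.
DICTIONARY = {"automatic", "document", "metadata", "extraction",
              "using", "support", "vector", "machine"}

NON_DICT_TOKENS = (":single_cap:", ":cap_non_dict:", ":non_dict:")


def word_token(word):
    """Map one word to a coarse word-specific feature token."""
    core = word.strip(".,;:()")
    if re.fullmatch(r"[A-Z]", core):
        return ":single_cap:"          # e.g. the "X." in "X. Wang"
    if core.lower() in DICTIONARY:
        return core.lower()            # dictionary words are kept as-is
    if core[:1].isupper():
        return ":cap_non_dict:"        # e.g. "Wang"
    return ":non_dict:"


def line_features(line, line_number):
    """Combine word-specific tokens with line-specific statistics."""
    words = line.split()
    tokens = [word_token(w) for w in words]
    # Assumption: single capitals count as non-dictionary words here.
    non_dict = sum(t in NON_DICT_TOKENS for t in tokens)
    return {
        "tokens": tokens,
        "num_words": len(words),
        "line_number": line_number,
        "pct_non_dict": non_dict / len(words) if words else 0.0,
    }
```

For example, `line_features("X. Wang", 3)` yields the tokens `:single_cap:` and `:cap_non_dict:`, which is the kind of pattern that would suggest an author line.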
|
|
8
|
- Contextual classification
- Predictions from step 1 provide context
- e.g. title lines, author line, affiliation line…
- Context among lines can be used as features
- e.g. title lines followed by author line
- Contextual features are combined with the other features, and the classifier is trained iteratively
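The iterative loop described above might look roughly like the following sketch. Here `classify` is a hypothetical callable standing in for the trained classifier (an SVM in the source); the feature key format `label[-1]`, the `window` parameter, and the stopping rule are assumptions for illustration only.

```python
def add_context(features, predicted_labels, window=1):
    """Return copies of the per-line feature dicts, extended with the
    labels currently predicted for neighbouring lines."""
    n = len(features)
    extended_all = []
    for i, feats in enumerate(features):
        extended = dict(feats)
        for offset in range(-window, window + 1):
            if offset == 0:
                continue
            j = i + offset
            # Lines before the first or after the last get a placeholder.
            label = predicted_labels[j] if 0 <= j < n else "none"
            extended[f"label[{offset:+d}]"] = label
        extended_all.append(extended)
    return extended_all


def contextual_classification(features, classify, rounds=3):
    """Step 1: classify each line independently.
    Step 2: re-classify with neighbour labels as extra features,
    repeating until the labels stop changing or `rounds` is reached."""
    labels = classify(features)
    for _ in range(rounds):
        new_labels = classify(add_context(features, labels))
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

With this arrangement, a line that could not be labelled on its own can pick up a label such as author once the line before it has been predicted as a title line.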
|
|
9
|
- Chunking an N-class line = finding N-1 chunk boundaries
- Boundary candidates
- Space
- Authors: “Hui Han C. Lee Giles”
- Punctuation
- Affiliation & address: “NUS, 10 Kent Ridge Crescent, Singapore 119260”
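As a rough sketch of the search space this implies, the snippet below enumerates every way of picking N-1 boundaries from the candidates. It only generates candidate chunkings; choosing the best one would be left to a classifier. The function names, the `kind` parameter, and the choice of `,` and `;` as punctuation are assumptions, not details from the source.

```python
import re
from itertools import combinations


def boundary_candidates(line, kind):
    """Character offsets at which a chunk may end:
    after whitespace ("space") or after a comma/semicolon (otherwise)."""
    pattern = r"\s+" if kind == "space" else r"[,;]\s*"
    return [m.end() for m in re.finditer(pattern, line)]


def split_at(line, boundaries):
    """Cut the line at the given offsets and trim separators."""
    pieces, start = [], 0
    for b in boundaries:
        pieces.append(line[start:b].strip(" ,;"))
        start = b
    pieces.append(line[start:].strip(" ,;"))
    return pieces


def candidate_chunkings(line, n_classes, kind):
    """Yield every split of an N-class line into N chunks,
    i.e. every choice of N-1 boundaries among the candidates."""
    for chosen in combinations(boundary_candidates(line, kind), n_classes - 1):
        yield split_at(line, chosen)
```

For the two-class line “NUS, 10 Kent Ridge Crescent, Singapore 119260” with punctuation boundaries, this produces two candidates, one of which separates “NUS” from the address.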
|
|
10
|
- Classification-based method using SVM
- Independent and contextual features
|
|
11
|
- Nina Wacholder, David K. Evans, Judith L. Klavans
|
|
12
|
- Automatically identify phrases as index terms in a dynamic text browser (DTB)
- What constitutes a useful index term?
- Standard IR metrics do not apply to this task.
|
|
13
|
- Head: most important word syntactically and semantically.
- “Filter” is the head of “coffee filter”, “oil filter”, etc.
- Simplex NPs (SNPs) are identified instead of complex NPs (CNPs)
- “A form” and “cancer causing asbestos” are two SNPs.
- “A form of cancer causing asbestos” is a CNP.
- NPs are sorted by head and ranked by significance (frequency)
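Head sorting can be sketched in a few lines. This version assumes the simplex NPs have already been extracted, that each is a non-empty string, and that the head is simply the last word of the phrase; that last-word rule is a simplifying heuristic for English SNPs, not necessarily the authors' procedure. The function names are hypothetical.

```python
from collections import defaultdict


def head_of(simplex_np):
    """Assumption: the head of an English simplex NP is its last word."""
    return simplex_np.lower().split()[-1]


def head_sorted(noun_phrases):
    """Group NPs by head, then rank heads by how many NP occurrences
    they cover (most frequent first, ties broken alphabetically)."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        groups[head_of(phrase)].append(phrase)
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```

Given `["coffee filter", "oil filter", "a form", "filter"]`, the head `filter` comes first with its three phrases, followed by `form`.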
|
|
14
|
- Precision and Recall do not apply to index terms.
- Three basic criteria:
- Coherence
- Usefulness of index terms
- Thoroughness of coverage of document content
|
|
15
|
- Coherent: coherent && NP
- Incoherent: not coherent && not NP
- Intermediate: coherent
|
|
16
|
- Qualitative judgment
- On a scale from 1 (high quality) to 5 (junk)
- Terms identified by 3 domain-independent techniques:
- Key words
- Technical terms
- Head-sorted NPs
|
|
17
|
|
|
18
|
- Evaluate the number of terms identified relative to the size of the text.
- (number of unique NPs) / (number of words in a document) is around 0.27
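The ratio above is straightforward to compute once the NPs are available. The sketch below assumes whitespace tokenization for the word count and case-insensitive matching when deduplicating NPs; both are illustrative choices, and the function name is hypothetical.

```python
def thoroughness_ratio(noun_phrases, text):
    """(number of unique NPs) / (number of words in the document)."""
    unique_nps = {phrase.lower().strip() for phrase in noun_phrases}
    n_words = len(text.split())
    if n_words == 0:
        raise ValueError("text contains no words")
    return len(unique_nps) / n_words
```

Applied to a whole document, a value near 0.27 would correspond to roughly one unique NP for every four words of text.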
|
|
19
|
- A new way to evaluate index terms besides precision and recall.
- Qualitative + Quantitative metrics
- What is a good ratio for thoroughness?
|
|
20
|
- Extraction
- Taggable Metadata: Information Extraction
- Non-taggable Metadata: Text Categorization/Summarization
- Evaluation
- How to make use of modifiers.
|
|
21
|
|