1
Fundamentals of
Information Retrieval
  • Module 3                      Min-Yen KAN
  • Implementation of
    (Textual) Information Retrieval
2
What is information retrieval?
  • Part of the information seeking process
  • Matches a query with most relevant documents
  • View a query as a mini-document


3
Searching in books
  • ___
  • ___
  • ___


  • Procedure:
    • Look up topic
    • Find the page
    • Skim page to find topic
  • …
  • Index, 11, 103-151, 443
    • Audio, 476
    • Comparison of methods, 143-145
    • Granularity, 105, 112
    • N-gram, 170-172
    • Of integer sequences, 11
    • Of musical themes, 11
    • Of this book, 103, 507ff
    • Within inverted file entry, see skipping
  • Index compression, 114-129, 198-201, 235-237
    • Batched, 125, 128
    • Bernoulli, 119-122, 128, 150, 247, 421
    • Context-sensitive, 125-126
    • Global, 115-121
    • Hyperbolic model, 123-124, 150
    • In MG, 421-423
    • Interpolative coding, 126-128
    • Local, 115, 121-122, 247
    • Nonparameterized, 115-119
    • Observed frequency, 121, 124-125, 128, 247
    • Parameterized, 115
    • Performance of, 128-129, 421
    • Skewed Bernoulli, 122-123, 138, 150
    • Within-document frequencies, 198-201
  • Index Construction, 223-261 (see also inversion)
    • bitmaps, 255-256
  • …
4
Information retrieval
  • Algorithm
    • (Permute query to fit index)
    • Search index
    • Go to resource
    • (Permute query to fit item)
    • (Search for item)
5
What to index?
  • Book indices have key words and phrases
  • Search engines index (all) words


  • Why the disparity?
  • What do people really search for?



6
Trading precision for size
  • Can save up to 32% without too much loss:


  • Stemming
    • Usually just word inflection
    • Information → Inform, the same stem as Informal and Informed

  • Case folding
    • N.B.: keep odd variants (e.g., NeXT, LaTeX)

  • Stop words
    • Don’t index common words; people won’t search on them anyway

  • Pop Quiz: Which of these techniques is most effective?
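The three size-reduction steps on this slide can be sketched as one small pipeline. This is a minimal illustration only: the stop list, the suffix rules, and the names `STOP_WORDS`, `SUFFIXES`, `KEEP_CASE`, `crude_stem`, and `index_terms` are all assumptions made for the example, and the suffix stripper is a crude stand-in for a real stemmer.

```python
import re

# Hypothetical stop list and suffix rules, chosen only for illustration.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}
SUFFIXES = ("ation", "ing", "ed", "s")  # crude stand-in for a real stemmer
KEEP_CASE = {"NeXT", "LaTeX"}           # odd variants that are not case-folded


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, case-fold, drop stop words, and stem, in that order."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in KEEP_CASE:
            terms.append(token)  # keep the odd variant exactly as written
            continue
        token = token.lower()    # case folding
        if token in STOP_WORDS:
            continue             # stop word: never reaches the index
        terms.append(crude_stem(token))
    return terms


print(index_terms("The Information is Informed by LaTeX"))
```

Under these toy rules, "Information" and "Informed" collapse to one lexicon entry, while the stop words produce no index terms at all.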
7
Indexing output
  • Output = L_W, D_D, I_{W×D}


  • Inverted File (Index)
    • Postings, e.g., w_t → (d_1, f_{d_1,t}), (d_2, f_{d_2,t}), …, (d_n, f_{d_n,t})
    • Variable length records

  • Lexicon:
    • String w_t
    • Document frequency f_t
    • Address within inverted file I_t
    • Sorted, fixed length records
  • ×       D1 D2 D3 D4 D5 D6 … Dm


  • W1           1        1
  • W2       2            1
  • W3        1
  • W4                         1           1
  • W5        1           1
  • W6            1       1   1
  • …
  • Wn
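A small sketch of the two output structures may help. It assumes documents arrive as lists of already-normalized terms and uses a flat Python list in place of the on-disk inverted file, so an "address" is just a list offset. The names `build_index` and `lookup` are hypothetical.

```python
import bisect
from collections import Counter


def build_index(docs):
    """docs: dict of doc id -> list of terms. Returns (lexicon, inverted_file)."""
    postings = {}
    for d in sorted(docs):
        for t, f_dt in Counter(docs[d]).items():
            postings.setdefault(t, []).append((d, f_dt))

    inverted_file = []  # flat list standing in for the variable-length records
    lexicon = []        # sorted records: (term, f_t, address in inverted_file)
    for t in sorted(postings):
        lexicon.append((t, len(postings[t]), len(inverted_file)))
        inverted_file.extend(postings[t])
    return lexicon, inverted_file


def lookup(lexicon, inverted_file, term):
    """Binary-search the sorted lexicon, then read f_t postings at the address."""
    i = bisect.bisect_left(lexicon, (term,))
    if i == len(lexicon) or lexicon[i][0] != term:
        return []
    _, f_t, address = lexicon[i]
    return inverted_file[address : address + f_t]


lexicon, inverted_file = build_index({1: ["a", "a", "b"], 2: ["b", "c"]})
print(lexicon)
print(lookup(lexicon, inverted_file, "b"))
```

Because the lexicon is sorted, a term is found by binary search; its record then says how many postings to read and where they start.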





8
Trading precision for size, redux
  • Pop Quiz: Which of these techniques is most effective?


  • Typical:
  • Lexicon: 30 MB; Inverted File: 400 MB
  • Stemming
    • Affects Lexicon

  • Case folding
    • Affects Lexicon


  • Stop words
    • Affects Inverted File
9
Is fine-grained indexing worthwhile?
  • Problem: still have to scan document to find the term.





  • Cons:
    • Need access methods to take advantage
    • Extra storage space overhead (variable sized)
  • Alternative methods:
    • Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset size
    • Split long documents into n shorter ones.
10
Inverted file compression
  • Clue: Encode gap length instead of offset
  • Use small number of bits to encode more common gap lengths
    • (e.g., Huffman encoding)
  • Better: Use a distribution of expected gap length (e.g., Bernoulli process)
    • If p = probability that any word x appears in doc y,
    • then Pr[gap size = z] = (1 − p)^z p, where z counts the documents skipped before the next occurrence.  This is a geometric distribution.

  • Works for intra- and inter-document index compression
    • _________________________________
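The gap idea can be sketched in a few lines. The code below is an illustration under stated assumptions: it converts document numbers to gaps, then encodes each gap with an Elias gamma code, which is swapped in here only because it is short to write; it is not the Huffman or Bernoulli-based code named on the slide. The function names are hypothetical, and `geometric` simply evaluates the (1 − p)^z p formula.

```python
def to_gaps(doc_ids):
    """Replace sorted document numbers by the differences between neighbours."""
    gaps, previous = [], 0
    for d in doc_ids:
        gaps.append(d - previous)
        previous = d
    return gaps


def gamma_encode(n):
    """Elias gamma code for n >= 1: (length - 1) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i : i + zeros + 1], 2))
        i += zeros + 1
    return values


def geometric(z, p):
    """Probability of z documents without the word, then one with it."""
    return (1 - p) ** z * p


gaps = to_gaps([3, 5, 20, 21, 23, 76])
encoded = "".join(gamma_encode(g) for g in gaps)
print(gaps, len(encoded), "bits")
```

Small gaps get short codes, so frequent words (many small gaps) compress well; a code tuned to the geometric distribution would assign lengths according to `geometric(z, p)` instead.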
11
Building the index – Memory based inversion
  • Takes lots of main memory, ugh!
  • Can we reduce the memory requirement?
12
Sort-based inversion
  • Idea: try to make random access of disk (memory) sequential


  • // Phase I – collection of term appearances on disk
  • For each document D_d in collection, 1 ≤ d ≤ N
    • Read D_d, parsing it into index terms
    • For each index term t in D_d
      • Calculate f_{d,t}
      • Dump to file a tuple (t, d, f_{d,t})
  • // Phase II – sort tuples
  • Sort all the tuples (t, d, f_{d,t}) using external mergesort


  • // Phase III – write output file
  • Read the tuples in sorted order and create inverted file
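The three phases can be sketched in memory as follows. This is a minimal sketch, not a disk-based implementation: Python's `sorted()` stands in for the external mergesort, a list stands in for the tuple file, and the three-document toy collection and the name `sort_based_inversion` are assumptions made for the example.

```python
from collections import Counter


def sort_based_inversion(docs):
    """docs: list of term lists; document numbers start at 1."""
    # Phase I: one (t, d, f_dt) tuple per distinct term per document.
    tuples = []
    for d, terms in enumerate(docs, start=1):
        for t, f_dt in Counter(terms).items():
            tuples.append((t, d, f_dt))

    # Phase II: sort by term, then by document number. A real system would
    # run an external mergesort over the tuple file; sorted() stands in here.
    tuples_sorted = sorted(tuples)

    # Phase III: one sequential scan of the sorted tuples builds each entry.
    inverted = {}
    for t, d, f_dt in tuples_sorted:
        inverted.setdefault(t, []).append((d, f_dt))
    return tuples, inverted


docs = [["a", "a", "b", "b", "c"], ["a", "d", "a", "b"], ["b", "d"]]
tuples, inverted = sort_based_inversion(docs)
print(tuples)
print(inverted)
```

After the sort, all tuples for one term are adjacent and in document order, which is why Phase III needs only a single sequential pass.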
13
Sort based inversion: example
  • <a,1,2>
  • <b,1,2>
  • <c,1,1>
  • <a,2,2>
  • <d,2,1>
  • <b,2,1>
  • <b,3,1>
  • <d,3,1>


14
Using a first pass for the lexicon
  • Gets us f_t and N
    • Savings: for any t we know f_t, so its postings can go in a fixed-size array instead of a linked list (shrinks record by 40%!)
15
Lexicon-based inversion
  • Partition inversion as |I|/|M| = k smaller problems
    • build 1/k of inverted index on each pass
    • (e.g., a-b, b-c, …, y-z)
    • Tuned to fit amount of main memory in machine
    • Just remember boundary words

  • Can pair with disk strategy
    • Create k temporary files and write tuples (t, d, f_{d,t}) for each partition on first pass
    • Each second pass builds index from temporary file
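A rough sketch of the partitioning idea, without the temporary-file variant: a first pass collects f_t, the sorted vocabulary is cut into runs that fit a memory budget, and each later pass re-reads the collection and inverts only the terms between two boundary words. The budget is counted in postings, and `partition_lexicon`, `lexicon_based_inversion`, and `budget` are hypothetical names for this illustration.

```python
from collections import Counter


def partition_lexicon(f_t, budget):
    """Split the sorted vocabulary into runs whose postings fit in `budget`.

    A single term with more than `budget` postings still gets its own run.
    """
    partitions, current, used = [], [], 0
    for t in sorted(f_t):
        if current and used + f_t[t] > budget:
            partitions.append(current)
            current, used = [], 0
        current.append(t)
        used += f_t[t]
    if current:
        partitions.append(current)
    return partitions


def lexicon_based_inversion(docs, budget):
    # First pass: document frequency f_t of every term.
    f_t = Counter(t for terms in docs for t in set(terms))

    inverted = {}
    for part in partition_lexicon(f_t, budget):
        low, high = part[0], part[-1]  # only the boundary words are kept
        chunk = {}                     # this pass's share of the index
        for d, terms in enumerate(docs, start=1):
            for t, f_dt in Counter(terms).items():
                if low <= t <= high:
                    chunk.setdefault(t, []).append((d, f_dt))
        for t in sorted(chunk):
            inverted[t] = chunk[t]     # stands in for appending to the file
    return inverted


docs = [["a", "a", "b", "b", "c"], ["a", "d", "a", "b"], ["b", "d"]]
print(lexicon_based_inversion(docs, budget=4))
```

The trade-off is visible in the loop structure: memory stays bounded by the budget, but the whole collection is parsed once per partition.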


16
Inversion – Summary of Techniques
  • How do these techniques stack up?
  • Assume a 5 GB corpus and 40 MB main memory machine


  • Technique                 Memory (MB)   Disk (GB)   Time (hours)
  • *Linked lists (memory)    4000          0           6
  • Linked lists (disk)       30            4           1100
  • Sort-based                40            8           20
  • Lexicon-based             40            0           79
  • Lexicon w/ disk           40            4           12
17
Query Matching
  • Now that we have an index, how do we answer queries?
18
Query Matching
  • Assuming a simple word matching engine:


  • For each query term t
    • Stem t
    • Search lexicon
    • Record f_t and its inverted entry address, I_t
  • Select a query term t
    • Set list of candidates, C = I_t
  • For each remaining term t
    • Read its I_t
    • For each d in C, if d not in I_t, set C = C – {d}


    • X and Y and Z – high precision
    • X or Y or Z – high recall
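The candidate-pruning loop above can be sketched as below. Two assumptions are made for the example: query terms are already stemmed, and the first term selected is the one with the shortest inverted entry (the pseudocode only says "select a query term"). The index is a plain dict and the function names are hypothetical.

```python
def and_query(index, terms):
    """index: dict term -> sorted list of doc ids. Docs containing all terms."""
    if not terms:
        return []
    entries = []
    for t in terms:
        if t not in index:
            return []          # a missing term empties a conjunctive query
        entries.append(index[t])

    entries.sort(key=len)      # assumption: start from the rarest term
    candidates = list(entries[0])
    for entry in entries[1:]:
        present = set(entry)
        candidates = [d for d in candidates if d in present]  # C = C - {d}
        if not candidates:
            break
    return candidates


def or_query(index, terms):
    """Docs containing at least one of the terms."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))


index = {"x": [1, 2, 3, 5], "y": [2, 3, 4], "z": [3, 5]}
print(and_query(index, ["x", "y", "z"]))
print(or_query(index, ["x", "y", "z"]))
```

The AND query returns few, tightly matching documents (high precision); the OR query returns every document that mentions any term (high recall).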



19
Boolean Model
  • Query processing strategy:
    • ______________________
    • Even in ORs, as merging takes longer than lookup


  • Problems with Boolean model:
    • _______________________________
    • Longer documents tend to match more often because they have a larger vocabulary
    • Need ranked retrieval to help out
20
Deciding ranking
  • Boolean assigns same importance to all terms in a query


  • F4 concert dates in Singapore
    • Problem: “F4” has same weight as “date”

  • One way:
    • Assign weights to the words, make more important words worth more
    • Process results in q and d vectors: (word, weight), (word, weight) … (word, weight)
21
Term Frequency
  • Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx xxxxxxxxxx xxxxxxxx Apple.  Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx.  Xxxxxxxxxx xxxxxxxx Compaq.  Xxxxxxxxx xxxxxxx IBM.


  • (Relative) term frequency can indicate importance.
  • r_{d,t} = f_{d,t}
  • r_{d,t} = 1 + ln f_{d,t}
  • r_{d,t} = K + (1 − K) · f_{d,t} / max_i f_{d,i}
22
Inverse Document Frequency
  • Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, "memex" will do.




  • Words with higher f_t are less discriminative.
  • Use inverse to measure importance:
    • w_t = 1/f_t
    • w_t = ln(1 + N/f_t) ← this one is most common
    • w_t = ln(1 + f_m/f_t), where f_m is the maximum observed frequency



24
This is TF*IDF
  • Many variants, but all capture:
    • Term frequency:
      r_{d,t} as _____________________


    • Inverse Document Frequency:
      w_t as ________________________

  • Standard formulation is:
    w_{d,t} = r_{d,t} × w_t
    = (1 + ln f_{d,t}) × ln(1 + N/f_t)


  • Problem:
    • r_{d,t} grows as the document grows, so we need to normalize; otherwise biased towards _______________
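The standard formulation above translates directly into code. The sketch below is illustrative: the function names are hypothetical, and the collection size and document frequencies in the usage lines are invented numbers, not measurements.

```python
import math


def term_frequency(f_dt):
    """r_{d,t} = 1 + ln f_{d,t}, for f_{d,t} >= 1."""
    return 1 + math.log(f_dt)


def inverse_document_frequency(N, f_t):
    """w_t = ln(1 + N / f_t)."""
    return math.log(1 + N / f_t)


def tfidf_vector(term_counts, N, doc_freq):
    """w_{d,t} = r_{d,t} * w_t for every term that occurs in the document."""
    return {
        t: term_frequency(f_dt) * inverse_document_frequency(N, doc_freq[t])
        for t, f_dt in term_counts.items()
    }


# Invented numbers: a 1000-document collection in which each of these
# terms appears in 100 documents.
counts = {"ibm": 4, "apple": 1, "compaq": 1}
doc_freq = {"ibm": 100, "apple": 100, "compaq": 100}
print(tfidf_vector(counts, 1000, doc_freq))
```

With equal document frequencies, the term that occurs four times gets the largest weight, but only by a factor of 1 + ln 4, not by a factor of 4.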
25
Calculating Similarity
  • Euclidean Distance - bad
    • M(Q, D_d) = sqrt(Σ_t |w_{q,t} – w_{d,t}|^2)
    • Dissimilarity Measure; use reciprocal
    • Has problem with long documents, why?

  • Actually don’t care about vector length, just their direction
    • Want to measure difference in direction
26
Cosine Similarity
  • If X and Y are two n-dimensional vectors:
    • X · Y = |X| |Y| cos θ
    • cos θ = (X · Y) / (|X| |Y|)


    • = 1 when the vectors point in the same direction (e.g., identical)
    • = 0 when orthogonal
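A short sketch contrasting the two measures, assuming sparse vectors stored as dicts from term to weight (the representation and the function names are choices made for this example):

```python
import math


def cosine(x, y):
    """cos θ between two sparse vectors given as dicts term -> weight."""
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_x * norm_y)


def euclidean(x, y):
    """Euclidean distance over the union of the two term sets."""
    terms = set(x) | set(y)
    return math.sqrt(sum((x.get(t, 0.0) - y.get(t, 0.0)) ** 2 for t in terms))


short_doc = {"a": 1.0, "b": 2.0}
long_doc = {"a": 2.0, "b": 4.0}  # same direction, twice the length
print(cosine(short_doc, long_doc), euclidean(short_doc, long_doc))
```

The two vectors point the same way, so the cosine is 1 (up to rounding), yet their Euclidean distance is nonzero simply because one is longer.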
27
Calculating the ranked list
  • To get the ranked list, we use document accumulators:
  • For each query term t, in order of increasing f_t
    • Read its inverted file entry I_t
    • Update the accumulator for each doc d in I_t: A_d += ln(1 + f_{d,t}) × w_t
  • For each A_d in A
    • A_d /= W_d // W_d = length of document d’s weight vector; the result is cos θ up to the constant query length W_q, which does not change the ranking
  • Report top r of A
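The accumulator loop can be sketched as follows, under a few assumptions: the index is a dict from term to a list of (d, f_{d,t}) pairs, the document lengths W_d are precomputed and passed in, w_t uses the ln(1 + N/f_t) form, and the number of accumulators is unbounded. The name `rank` and the toy data are hypothetical.

```python
import heapq
import math


def rank(index, doc_lengths, query_terms, N, r):
    """Return the top r (doc id, score) pairs for the query.

    index: term -> list of (d, f_dt); doc_lengths: d -> W_d.
    """
    accumulators = {}
    present = [t for t in query_terms if t in index]
    # Rarest terms first, i.e. in order of increasing f_t.
    for t in sorted(present, key=lambda term: len(index[term])):
        w_t = math.log(1 + N / len(index[t]))
        for d, f_dt in index[t]:
            accumulators[d] = accumulators.get(d, 0.0) + math.log(1 + f_dt) * w_t

    # Normalize by document length; the query length is the same for every
    # document, so leaving it out does not change the order.
    for d in accumulators:
        accumulators[d] /= doc_lengths[d]

    return heapq.nlargest(r, accumulators.items(), key=lambda item: item[1])


index = {"a": [(1, 2), (2, 1)], "b": [(2, 3)]}
doc_lengths = {1: 1.0, 2: 2.0}  # invented W_d values
print(rank(index, doc_lengths, ["a", "b"], N=2, r=2))
```

Only documents that contain at least one query term ever receive an accumulator, which is what makes the approach cheaper than scoring every document.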


28
Accumulator Storage
  • Holding all possible accumulators is expensive
    • Could need one for each document if query is broad


  • In practice, fix the number of accumulators |A| according to available main memory.  What to do when all are used?
    • Quit: _____________________
    • Continue ____________________
29
Selecting r entries from accumulators
  • Want to return documents with largest cos values.


  • How? Use a ____________
    • Load r A values into the _____
    • Process remaining |A| – r values
      • If A_d > min{H} then
        • Delete min{H}, add A_d, and sift
    • // H now contains the top r exact cosine values
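One structure that fits the min{H} and sift steps above is a min-heap of size r, sketched here with Python's `heapq` module. Treat the choice of structure as this example's assumption; `top_r` is a hypothetical name. Ties on score are broken by document number because the heap stores (score, doc id) tuples.

```python
import heapq


def top_r(accumulators, r):
    """accumulators: dict doc id -> score. Returns the r largest, best first."""
    if r <= 0:
        return []
    items = [(score, d) for d, score in accumulators.items()]

    heap = items[:r]
    heapq.heapify(heap)        # heap[0] is always the smallest kept item

    for item in items[r:]:
        if item > heap[0]:
            # Delete the current minimum, add the new item, and sift.
            heapq.heapreplace(heap, item)

    return sorted(heap, reverse=True)


print(top_r({1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}, 2))
```

Each of the remaining values costs at most one comparison plus an O(log r) sift, so the whole selection avoids sorting all accumulators.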


30
To think about
  • How do you deal with a dynamic collection?
  • How do you support phrasal searching?
  • What about wildcard searching?
    • What types of wildcard searching are common?