1	Fundamentals of Information Retrieval Module 3 Min-Yen KAN Implementation of (Textual) Information Retrieval
2	What is information retrieval? Part of the information seeking process Matches a query with most relevant documents View a query as a mini-document
3	Searching in books ___ ___ ___
	Procedure:
		Look up topic
		Find the page
		Skim page to find topic
	…
	Index, 11, 103-151, 443
		Audio, 476
		Comparison of methods, 143-145
		Granularity, 105, 112
		N-gram, 170-172
		Of integer sequences, 11
		Of musical themes, 11
		Of this book, 103, 507ff
		Within inverted file entry, see skipping
	Index compression, 114-129, 198-201, 235-237
		Batched, 125, 128
		Bernoulli, 119-122, 128, 150, 247, 421
		Context-sensitive, 125-126
		Global, 115-121
		Hyperbolic model, 123-124, 150
		In MG, 421-423
		Interpolative coding, 126-128
		Local, 115, 121-122, 247
		Nonparameterized, 115-119
		Observed frequency, 121, 124-125, 128, 247
		Parameterized, 115
		Performance of, 128-129, 421
		Skewed Bernoulli, 122-123, 138, 150
		Within-document frequencies, 198-201
	Index construction, 223-261 (see also inversion)
		Bitmaps, 255-256
	…
4	Information retrieval Algorithm (Permute query to fit index) Search index Go to resource (Permute query to fit item) (Search for item)
5	What to index? Book indices have key words and phrases Search engines index (all) words Why the disparity? What do people really search for?
6	Trading precision for size
	Can save up to 32% without too much loss:
	Stemming
		Usually just word inflection
		Information → Inform (same stem as Informal, Informed)
	Case folding
		N.B.: keep odd variants (e.g., NeXT, LaTeX)
	Stop words
		Don’t index common words; people won’t search on them anyway
	Pop Quiz: Which of these techniques is most effective?
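The three techniques above can be sketched in a few lines of Python. This is only an illustration: the stop list, the `KEEP_CASE` set, and the suffix rules in `crude_stem` are made-up stand-ins, not a real stemmer such as Porter's.

```python
# Illustrative assumptions: a tiny stop list, a set of case-sensitive
# variants to keep, and toy suffix rules (not a real stemming algorithm).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}
KEEP_CASE = {"NeXT", "LaTeX"}
SUFFIXES = ("ation", "ing", "ed", "al", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(tokens):
    """Apply case folding, stop-word removal, and stemming to a token list."""
    terms = []
    for tok in tokens:
        if tok in KEEP_CASE:          # odd variants are indexed verbatim
            terms.append(tok)
            continue
        tok = tok.lower()             # case folding
        if tok in STOP_WORDS:         # stop words are not indexed
            continue
        terms.append(crude_stem(tok))
    return terms


terms = normalize(["The", "Information", "is", "informed", "LaTeX"])
```

With these toy rules, "Information", "Informal", and "Informed" all collapse to the single lexicon entry "inform".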
7	Indexing output
	Output = L_W, D_D, I_{W×D}
	Inverted file (index):
		Postings, e.g., w_t → (d₁, f_{wt,d₁}), (d₂, f_{wt,d₂}), …, (d_n, f_{wt,d_n})
		Variable-length records
	Lexicon:
		String w_t
		Document frequency f_t
		Address within inverted file I_t
		Sorted, fixed-length records
	[Figure: term × document matrix with rows W₁ … W_n and columns D₁ … D_m; filled cells hold within-document frequencies]
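A minimal sketch of these two structures in Python, assuming documents arrive as lists of already-normalized terms. The function name `build_index` is hypothetical, and the "address" is simply a position in a Python list, standing in for a byte offset in a real inverted file.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build (lexicon, inverted_file) from docs: doc id -> list of terms.

    lexicon is a sorted list of fixed-shape records (term, f_t, address);
    inverted_file[address] is the variable-length postings list
    [(d, f_dt), ...] for that term.
    """
    postings = defaultdict(list)
    for d in sorted(docs):
        for t, f_dt in Counter(docs[d]).items():
            postings[t].append((d, f_dt))

    lexicon, inverted_file = [], []
    for t in sorted(postings):
        address = len(inverted_file)  # stand-in for a disk offset
        lexicon.append((t, len(postings[t]), address))
        inverted_file.append(postings[t])
    return lexicon, inverted_file


lexicon, inverted_file = build_index({1: ["a", "b", "a"], 2: ["b"]})
```

Here the lexicon record for "b" carries f_t = 2 and points at the postings list [(1, 1), (2, 1)].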
8	Trading precision for size, redux
	Pop Quiz: Which of these techniques is most effective?
	Typical: Lexicon = 30 MB, Inverted File = 400 MB
	Stemming: affects lexicon
	Case folding: affects lexicon
	Stop words: affects inverted file
9	Is fine-grained indexing worthwhile? Problem: still have to scan document to find the term. Cons: Need access methods to take advantage Extra storage space overhead (variable sized) Alternative methods: Hierarchical encoding (doc #, para #, sent #, word #) to shrink offset size Split long documents into n shorter ones.
10	Inverted file compression
	Clue: Encode gap length instead of offset
	Use a small number of bits to encode more common gap lengths (e.g., Huffman encoding)
	Better: Use a distribution of expected gap length (e.g., Bernoulli process)
	If p = probability that any word x appears in doc y, then P(gap size = z) = (1 − p)^z p.
	This is a geometric distribution.
	Works for intra- and inter-document index compression
	_________________________________
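The gap idea can be sketched as follows. Two assumptions to note: `gap_probability` takes the slide's formula exactly as written (z = 0, 1, 2, ...), and the Elias gamma code is swapped in as a simple stand-in for "fewer bits for smaller gaps"; it is not the Huffman or Bernoulli-based coding the slide names.

```python
def to_gaps(doc_ids):
    """Replace ascending document numbers by the gap from the previous one."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def gap_probability(z, p):
    """Geometric model from the slide: (1 - p)^z * p for z = 0, 1, 2, ..."""
    return (1 - p) ** z * p


def elias_gamma(x):
    """Elias gamma code of an integer x >= 1, as a string of '0'/'1'."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


gaps = to_gaps([3, 5, 20, 21])               # [3, 2, 15, 1]
bits = "".join(elias_gamma(g) for g in gaps)
```

Small gaps get short codes (a gap of 1 costs one bit), which is why frequent terms, whose postings are dense, compress well.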
11	Building the index – Memory based inversion Takes lots of main memory, ugh! Can we reduce the memory requirement?
12	Sort-based inversion
	Idea: try to make random access of disk (memory) sequential
	// Phase I – collection of term appearances on disk
	For each document D_d in collection, 1 ≤ d ≤ N
		Read D_d, parsing it into index terms
		For each index term t in D_d
			Calculate f_{d,t}
			Dump to file a tuple (t, d, f_{d,t})
	// Phase II – sort tuples
	Sort all the tuples (t, d, f) using external mergesort
	// Phase III – write output file
	Read the tuples in sorted order and create inverted file
13	Sort based inversion: example <a,1,2> <b,1,2> <c,1,1> <a,2,2> <d,2,1> <b,2,1> <b,3,1> <d,3,1>
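A sketch of the three phases applied to the example tuples above. The big simplification, stated plainly: Python's in-memory `sorted` stands in for the external mergesort, so this shows the data flow rather than the disk behaviour. The function names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def emit_tuples(docs):
    """Phase I: yield (t, d, f_dt) for every term t in every document d."""
    for d in sorted(docs):
        for t, f_dt in Counter(docs[d]).items():
            yield (t, d, f_dt)


def invert(tuples):
    """Phases II and III: sort by (t, d), then group into postings lists."""
    ordered = sorted(tuples)  # assumption: stands in for external mergesort
    return {
        t: [(d, f) for _, d, f in group]
        for t, group in groupby(ordered, key=lambda tup: tup[0])
    }


# The tuples from the example slide, in the order Phase I dumped them.
example = [
    ("a", 1, 2), ("b", 1, 2), ("c", 1, 1), ("a", 2, 2),
    ("d", 2, 1), ("b", 2, 1), ("b", 3, 1), ("d", 3, 1),
]
index = invert(example)
```

After sorting, all tuples for a term are adjacent and in document order, so each postings list can be written out in one sequential pass; for instance, the entry for "b" becomes [(1, 2), (2, 1), (3, 1)].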
14	Using a first pass for the lexicon Gets us f_t and N Savings: For any t, we know f_t, so can use an array instead of a linked list (shrinks record by 40%!)
15	Lexicon-based inversion
	Partition inversion into |I|/|M| = k smaller problems
		Build 1/k of the inverted index on each pass (e.g., a-b, b-c, …, y-z)
		Tuned to fit the amount of main memory in the machine
		Just remember boundary words
	Can pair with disk strategy
		Create k temporary files and write tuples (t, d, f_{d,t}) for each partition on first pass
		Each second pass builds index from temporary file
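A minimal sketch of the multi-pass idea, assuming the boundary words are already known from a first pass over the lexicon. The function name `partitioned_inversion` and the in-memory `docs` dictionary are illustrative; a real system would re-read the collection from disk on each pass.

```python
from collections import Counter, defaultdict


def partitioned_inversion(docs, boundaries):
    """Build the index in k = len(boundaries) + 1 passes over docs.

    Pass i only keeps terms t with lo <= t < hi, so memory holds just
    1/k (roughly) of the inverted index at any time.
    """
    edges = [""] + sorted(boundaries) + [None]  # None means "no upper bound"
    index = {}
    for lo, hi in zip(edges, edges[1:]):
        partial = defaultdict(list)             # this pass's share of the index
        for d in sorted(docs):
            for t, f_dt in Counter(docs[d]).items():
                if t >= lo and (hi is None or t < hi):
                    partial[t].append((d, f_dt))
        for t in sorted(partial):               # "write out" this partition
            index[t] = partial[t]
    return index


index = partitioned_inversion({1: ["a", "b", "a"], 2: ["c", "b"]}, ["b"])
```

Because partitions are processed in lexicon order, the partial outputs concatenate into a correctly ordered inverted file without any merge step.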
16	Inversion – Summary of Techniques
	How do these techniques stack up? Assume a 5 GB corpus and a machine with 40 MB of main memory.
	Technique                 Memory (MB)   Disk (GB)   Time (hours)
	*Linked lists (memory)    4000          0           6
	Linked lists (disk)       30            4           1100
	Sort-based                40            8           20
	Lexicon-based             40            0           79
	Lexicon w/ disk           40            4           12
17	Query Matching Now that we have an index, how do we answer queries?
18	Query Matching
	Assuming a simple word matching engine:
	For each query term t
		Stem t
		Search lexicon
		Record f_t and its inverted entry address, I_t
	Select a query term t
	Set list of candidates, C = I_t
	For each remaining term t
		Read its I_t
		For each d in C, if d not in I_t, set C = C – {d}
	X and Y and Z – high precision
	X or Y or Z – high recall
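The conjunctive (AND) case of the procedure above might look like this in Python. Assumptions: query terms are already stemmed, the lexicon is a dictionary from term to (f_t, address), and the initial term is chosen as the rarest one, which is one common choice rather than something the slide specifies.

```python
def conjunctive_query(terms, lexicon, inverted_file):
    """Return documents containing every query term.

    lexicon: term -> (f_t, address); inverted_file: address -> doc id list.
    """
    entries = []
    for t in terms:
        if t not in lexicon:
            return []                  # a missing term makes the AND empty
        entries.append(lexicon[t])

    entries.sort()                     # assumption: start from the rarest term
    candidates = list(inverted_file[entries[0][1]])
    for _, address in entries[1:]:
        present = set(inverted_file[address])
        # C = C - {d} for every candidate d not in this term's entry
        candidates = [d for d in candidates if d in present]
    return candidates


lexicon = {"x": (3, 0), "y": (2, 1), "z": (4, 2)}
inverted_file = {0: [1, 2, 5], 1: [2, 5], 2: [1, 2, 3, 5]}
hits = conjunctive_query(["x", "y", "z"], lexicon, inverted_file)
```

Starting from the shortest list keeps the candidate set small from the first step, since it can only shrink afterwards.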
19	Boolean Model Query processing strategy: ______________________ Even in ORs, as merging takes longer than lookup Problems with Boolean model: _______________________________ Longer documents tend to match more often because they have a larger vocabulary Need ranked retrieval to help out
20	Deciding ranking Boolean assigns same importance to all terms in a query F4 concert dates in Singapore Problem: “F4” has same weight as “date” One way: Assign weights to the words, make more important words worth more Process results in q and d vectors: (word, weight), (word, weight) … (word, weight)
21	Term Frequency
	Xxxxxxxxxxxxxx IBM xxxxxxxxxxx xxxxxxxx xxxxxxxxxxx IBM xxxxxxx xxxxxxxxxx xxxxxxxx Apple. Xxxxxxxxxx xxxxxxxxxx IBM xxxxxxxx. Xxxxxxxxxx xxxxxxxx Compaq. Xxxxxxxxx xxxxxxx IBM.
	(Relative) term frequency can indicate importance.
	r_{d,t} = f_{d,t}
	r_{d,t} = 1 + ln f_{d,t}
	r_{d,t} = K + (1 − K) · f_{d,t} / max_i f_{d,i}
23	Inverse Document Frequency
	Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, "memex" will do.
	Words with higher f_t are less discriminative. Use the inverse to measure importance:
	w_t = 1/f_t
	w_t = ln(1 + N/f_t)  ← this one is most common
	w_t = ln(1 + f^m/f_t), where f^m is the maximum observed frequency
24	This is TF*IDF
	Many variants, but all capture:
	Term frequency: r_{d,t} as _____________________
	Inverse document frequency: w_t as ________________________
	Standard formulation: w_{d,t} = r_{d,t} × w_t = (1 + ln f_{d,t}) × ln(1 + N/f_t)
	Problem: r_{d,t} grows as the document grows, so we need to normalize; otherwise ranking is biased towards _______________
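The standard formulation translates directly into Python. The function names are illustrative; the formulas are the ones on the slide.

```python
import math


def term_frequency_weight(f_dt):
    """r_{d,t} = 1 + ln f_{d,t}, for a term that occurs f_dt >= 1 times."""
    return 1 + math.log(f_dt)


def inverse_document_frequency(N, f_t):
    """w_t = ln(1 + N / f_t), with N documents and f_t of them containing t."""
    return math.log(1 + N / f_t)


def tf_idf(f_dt, N, f_t):
    """w_{d,t} = r_{d,t} * w_t."""
    return term_frequency_weight(f_dt) * inverse_document_frequency(N, f_t)


# Same within-document count, but one term is rare and the other is common.
rare = tf_idf(f_dt=3, N=1000, f_t=5)
common = tf_idf(f_dt=3, N=1000, f_t=900)
```

With equal term frequency, the rare term receives the larger weight, which is the behaviour the "F4" versus "date" example asks for.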
25	Calculating Similarity
	Euclidean distance - bad
	M(Q, D_d) = sqrt(Σ_t |w_{q,t} – w_{d,t}|²)
	Dissimilarity measure; use reciprocal
	Has problem with long documents, why?
	Actually don’t care about vector length, just their direction
	Want to measure difference in direction
26	Cosine Similarity
	If X and Y are two n-dimensional vectors:
	X · Y = |X| |Y| cos θ
	cos θ = (X · Y) / (|X| |Y|)
		= 1 when identical
		= 0 when orthogonal
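A small sketch contrasting the two measures, with vectors stored as sparse dictionaries from term to weight (a representation chosen here for illustration). Doubling every weight, as a document twice as long might, changes the Euclidean distance but leaves the cosine untouched.

```python
import math


def euclidean(x, y):
    """Euclidean distance between sparse vectors (dict: term -> weight)."""
    terms = set(x) | set(y)
    return math.sqrt(sum((x.get(t, 0.0) - y.get(t, 0.0)) ** 2 for t in terms))


def cosine(x, y):
    """cos(theta) = (X . Y) / (|X| |Y|); returns 0.0 for a zero vector."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


query = {"a": 1.0, "b": 2.0}
long_doc = {"a": 2.0, "b": 4.0}   # same direction, twice the length
unrelated = {"c": 1.0}            # shares no terms with the query
```

Here `euclidean(query, long_doc)` is greater than zero even though the two vectors point the same way, while `cosine(query, long_doc)` is 1 up to rounding and `cosine(query, unrelated)` is 0.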
27	Calculating the ranked list
	To get the ranked list, we use document accumulators:
	For each query term t, in order of increasing f_t
		Read its inverted file entry I_t
		Update the accumulator for each doc d in I_t: A_d += ln(1 + f_{d,t}) × w_t
	For each A_d in A
		A_d /= W_d   // that’s basically cos θ; don’t use w_q
	Report top r of A
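The accumulator loop above can be sketched as follows. Assumptions: the index is a dictionary from term to postings, `doc_weight` holds a precomputed W_d per document, w_t is taken as ln(1 + N/f_t), and a plain sort picks the top r (a heap is an alternative for large accumulator sets). The function name `rank` is hypothetical.

```python
import math


def rank(query_terms, index, doc_weight, N, r):
    """Return the top r (doc, score) pairs for the query.

    index: term -> [(d, f_dt), ...]; doc_weight: d -> W_d (document length).
    """
    accumulators = {}
    known = [t for t in query_terms if t in index]
    for t in sorted(known, key=lambda term: len(index[term])):  # increasing f_t
        w_t = math.log(1 + N / len(index[t]))
        for d, f_dt in index[t]:
            accumulators[d] = accumulators.get(d, 0.0) + math.log(1 + f_dt) * w_t

    for d in accumulators:
        accumulators[d] /= doc_weight[d]      # normalize by document length

    ranked = sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:r]


index = {"x": [(1, 1), (2, 3)], "y": [(2, 1)]}
doc_weight = {1: 1.0, 2: 1.0}
top = rank(["x", "y"], index, doc_weight, N=2, r=1)
```

The query weight w_q is the same for every document, so leaving it out changes the scores but not the ordering.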
28	Accumulator Storage Holding all possible accumulators is expensive Could need one for each document if query is broad In practice, fix |A| according to available main memory. What to do when all used? Quit: _____________________ Continue ____________________
29	Selecting r entries from accumulators
	Want to return documents with largest cos values. How?
	Use a ____________
	Load r A values into the _____
	Process remaining |A| − r values:
		If A_d > min{H} then
			Delete min{H}, add A_d, and sift
	// H now contains the top r exact cosine values
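One way to realize this selection is a min-heap of size r, which matches the min{H} and "sift" wording in the pseudocode. The sketch below uses Python's standard `heapq` module; the function name `top_r` is illustrative.

```python
import heapq


def top_r(accumulators, r):
    """Return the r largest (score, doc) pairs, best first.

    accumulators: dict mapping doc id -> accumulated score A_d.
    """
    if r <= 0:
        return []
    heap = []                                   # min-heap H of (A_d, d)
    for d, a_d in accumulators.items():
        if len(heap) < r:
            heapq.heappush(heap, (a_d, d))      # load the first r values
        elif a_d > heap[0][0]:                  # A_d > min{H}
            heapq.heapreplace(heap, (a_d, d))   # delete min, add A_d, sift
    return sorted(heap, reverse=True)


best = top_r({1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}, 2)
```

Each of the remaining values costs at most one O(log r) heap operation, so the total is O(|A| log r) instead of the O(|A| log |A|) of a full sort.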
30	To think about How do you deal with a dynamic collection? How do you support phrasal searching? What about wildcard searching? What types of wildcard searching are common?