1. Not for sale or private use
2. Building, structure, and for the public
3. Preservation and accessibility
4. Subscription and privatized versions
5. Although not necessarily for knowledge, primarily so
6. Learning, to pass on knowledge to a new person
7. Sharing to conserve resources
8. Sum total of a species
Taken from L. Tang’s course slides. Used with permission.
Here a page is a document and the corpus is the book
TOC and index are secondary sources, access methods for the primary material
Note already that skimming the page implies that work is done after the indexing, too.
Translating from how we do it with books, we get the following algorithm:
We’ll focus on the construction of an index first
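As a minimal sketch of that construction step (assuming the corpus is a list of document strings and whitespace tokenization, neither of which comes from the slides):

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to a sorted list of (doc_id, term_frequency) postings."""
        index = defaultdict(dict)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():  # naive tokenization plus case folding
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
        # Emit postings in increasing doc_id order, matching the monotonic numbering noted later.
        return {term: sorted(postings.items()) for term, postings in index.items()}

    docs = ["the cat sat", "the dog sat on the cat"]
    print(build_inverted_index(docs))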
Fixed-size records to save space in the data structure
4 numeric character limit to
Stemming can be helpful as well as harmful
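A quick illustration of both effects, using NLTK’s Porter stemmer (the library and the example words are my assumptions, not from the slides):

    from nltk.stem import PorterStemmer

    stem = PorterStemmer().stem
    # Helpful: inflections of the same word conflate to a single lexicon entry.
    print([stem(w) for w in ["run", "runs", "running"]])
    # Harmful: words with different meanings collapse to the same stem.
    print([stem(w) for w in ["universe", "universal", "university"]])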
Stemming and case folding save entries in the lexicon, but the number of pointers stays the same, so they are less effective. Stop-word removal’s effectiveness is a function of the threshold: by Zipf’s law, eliminating the 30 most frequently occurring terms can save 30% in space.
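A rough back-of-the-envelope check of that claim, assuming an idealized Zipf distribution and a vocabulary size picked only for illustration:

    # Under Zipf's law the rank-r term accounts for a share proportional to 1/r
    # of all term occurrences, and hence of all pointers in the index.
    V = 1_000_000                                   # assumed vocabulary size
    total = sum(1.0 / r for r in range(1, V + 1))
    top30 = sum(1.0 / r for r in range(1, 31))
    print(f"Top 30 terms carry about {top30 / total:.0%} of all occurrences")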
Because most of these collections are not built at random; think of TREC or the Bible.
4 GB of main memory for 5 GB of text, assuming that a record takes 10 bytes (4 for the document number, 4 for the next pointer, 2 for the frequency count) and that there are 400 million distinct records to encode. This process, if run without enough main memory, can swap a lot (because disk access is not sequential).
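The arithmetic behind those numbers, as a sanity check:

    records = 400_000_000             # distinct records to encode
    bytes_per_record = 4 + 4 + 2      # document number + next pointer + frequency count
    print(f"{records * bytes_per_record / 10**9:.0f} GB of main memory needed")  # 4 GB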
Note that the sentence numbers are monotonically increasing.
Which algorithm is the above: AND or OR?
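For comparison, a sketch of both merges over sorted postings lists of document numbers (AND is an intersection, OR is a union); the example lists are made up:

    def merge_and(p1, p2):
        """Documents containing both terms: walk the two sorted lists in step."""
        i, j, out = 0, 0, []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                out.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return out

    def merge_or(p1, p2):
        """Documents containing either term."""
        return sorted(set(p1) | set(p2))

    a, b = [1, 3, 5, 8], [2, 3, 8, 9]
    print(merge_and(a, b))   # [3, 8]
    print(merge_or(a, b))    # [1, 2, 3, 5, 8, 9]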
Again, to normalize. We don’t want a hapax legomenon to be more than double in importance.
In this scheme, long documents are long vectors and short documents are short vectors.
We normalize versus the
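A small sketch of length (cosine) normalization, so a long document’s vector does not outweigh a short one purely by size; the toy weight vectors are assumptions:

    import math

    def normalize(vec):
        """Divide a term-weight vector by its Euclidean length."""
        length = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / length for t, w in vec.items()} if length else vec

    long_doc  = {"cat": 10, "dog": 8, "sat": 6}
    short_doc = {"cat": 2, "dog": 1}
    print(normalize(long_doc))    # both come out with unit length
    print(normalize(short_doc))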