1. Not for sale or private use
2. Building, structure, and for the public
3. Preservation and accessibility
4. Subscription and privatized versions
5. Although not necessarily for knowledge, primarily so
6. Learning, to pass on knowledge to a new person
7. Sharing to conserve resources
8. Sum total of a species
Taken from L. Tang’s course slides. Used with permission.
Here a page is a document and the corpus is the book
TOC and index are secondary sources, access methods for the primary material
Note already that skimming the page implies that work is done after the indexing, too.
Translating from how we do it with books, we get the following algorithm:
We’ll focus on the construction of an index first
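As a minimal sketch of that construction step (assuming the corpus is a list of document strings and whitespace tokenization, neither of which comes from the slides):

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to a sorted list of (doc_id, term_frequency) postings."""
        index = defaultdict(dict)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():  # naive tokenization plus case folding
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
        # Emit postings in increasing doc_id order, matching the monotonic numbering noted later.
        return {term: sorted(postings.items()) for term, postings in index.items()}

    docs = ["the cat sat", "the dog sat on the cat"]
    print(build_inverted_index(docs))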
Fixed-size records to save space in the data structure
4 numeric character limit to
Stemming can be helpful as well as harmful
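A quick illustration of both effects, using NLTK’s Porter stemmer (the library and the example words are my assumptions, not from the slides):

    from nltk.stem import PorterStemmer

    stem = PorterStemmer().stem
    # Helpful: inflections of the same word conflate to a single lexicon entry.
    print([stem(w) for w in ["run", "runs", "running"]])
    # Harmful: words with different meanings collapse to the same stem.
    print([stem(w) for w in ["universe", "universal", "university"]])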
Stemming and case folding save entries in the lexicon, but the number of pointers stays the same, so they are less effective. Stop-word removal’s effectiveness is a function of the threshold: by Zipf’s law, eliminating the 30 most frequently occurring terms can save 30% in space.
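A rough back-of-the-envelope check of that claim, assuming an idealized Zipf distribution and a vocabulary size picked only for illustration:

    # Under Zipf's law the rank-r term accounts for a share proportional to 1/r
    # of all term occurrences, and hence of all pointers in the index.
    V = 1_000_000                                   # assumed vocabulary size
    total = sum(1.0 / r for r in range(1, V + 1))
    top30 = sum(1.0 / r for r in range(1, 31))
    print(f"Top 30 terms carry about {top30 / total:.0%} of all occurrences")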
Because most of these collections are not built at random; think of TREC or the Bible.
4 GB of main memory for 5 GB of text, assuming that a record takes 10 bytes (4 for the document number, 4 for the next pointer, 2 for the frequency count) and that there are 400 million distinct records to encode. This process, if run without enough main memory, can swap a lot (because disk access is not sequential).
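The arithmetic behind those numbers, as a sanity check:

    records = 400_000_000             # distinct records to encode
    bytes_per_record = 4 + 4 + 2      # document number + next pointer + frequency count
    print(f"{records * bytes_per_record / 10**9:.0f} GB of main memory needed")  # 4 GB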
Note that the sentence numbers are monotonically increasing.
Which algorithm is the above: AND or OR?
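For comparison, a sketch of both merges over sorted postings lists of document numbers (AND is an intersection, OR is a union); the example lists are made up:

    def merge_and(p1, p2):
        """Documents containing both terms: walk the two sorted lists in step."""
        i, j, out = 0, 0, []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                out.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return out

    def merge_or(p1, p2):
        """Documents containing either term."""
        return sorted(set(p1) | set(p2))

    a, b = [1, 3, 5, 8], [2, 3, 8, 9]
    print(merge_and(a, b))   # [3, 8]
    print(merge_or(a, b))    # [1, 2, 3, 5, 8, 9]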
Again, to normalize. We don’t want a hapax legomenon to be more than double in importance.
In this scheme, long documents are long vectors and short documents are short vectors.
We normalize versus the
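A small sketch of length (cosine) normalization, so a long document’s vector does not outweigh a short one purely by size; the toy weight vectors are assumptions:

    import math

    def normalize(vec):
        """Divide a term-weight vector by its Euclidean length."""
        length = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / length for t, w in vec.items()} if length else vec

    long_doc  = {"cat": 10, "dog": 8, "sat": 6}
    short_doc = {"cat": 2, "dog": 1}
    print(normalize(long_doc))    # both come out with unit length
    print(normalize(short_doc))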