Menu

[ IVLE ]

[ Overview ]
[ Syllabus ]
[ Grading ]
[ Homework
  HW 1
  HW 2
  >HW 3
  HW 4
  HW 5 ]
[ Misc. ]

Last updated: Mon Feb 21 10:45:19 SGT 2011 - First release

In Homework 3, you will implement a searcher that supports positional indexing and postings compression. To do this, you can choose to further modify either your solution to Homework 2 or use Ziheng's sample solution to Homework 2.

Commonalities with Homework 2

The indexing and query commands will use an identical input format to Homework 2, so that you need not modify any of your code to deal with command line processing. To recap:

Indexing: $ python index.py -i directory-of-documents -d dictionary-file -p postings-file

Searching: $ python search.py -d dictionary-file -p postings-file -q file-of-queries -o output-file-of-results

We will also again use the Reuters training data set provided by NLTK. Depending on where you specified the directory for NLTK data when you first installed the NLTK data (recall that the installation is triggered by nltk.download()), this data set is located under a path like: .../nltk_data/corpora/reuters/training/.

As in Homework 2, your search process must not keep the postings file in memory in answering queries, but instead use a file cursor to randomly access parts of the postings file. However, there is no such restriction for the indexer (you are not asked to implement BSBI or SPIMI).

Positional Indexing

Your searching code now needs to additionally support phrasal queries, such as:

"New York" OR Boston

To do this, you need to change your data structure for postings to include information about where the words occur. You only have to implement correct phrasal query handling (no proximity queries). You will need to modify your indexing scheme to change the data structure used for postings. Slightly more tricky is the modifications you need to make to your retrieval code (We can regard a phrasal query as an AND query but with specific positional constraints).

Your implementation needs to make proper use of skip pointers at least at the document level. Skip pointers are also helpful in storing and accessing positional information, but you do not have to implement skip pointers for the positional information (but you may if you want, no bonus will be given to this, although it may help your search engine's efficiency).

Index Compression

Your code should implement postings compression as described in Chapter 5 of IIR. Implement gap-based compression for postings (both for document postings as well as positional postings). Use variable byte encoding (VBE) as described in the textbook (this requires bit manipulation in Python; note you can only read bytes in Python and subsequently manipulate bits using bit operations). Make sure to implement this functionality over several steps so that you may answer the essay question properly, see below.

What to turn in?

You are required to submit index.py, search.py, dictionary.txt, and postings.txt. Please do not include the Reuters data in your submission.

Essay questions

You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these are open-ended questions and do not have gold standard answers. A paragraph or two are usually sufficient for each question.

  1. Benchmark the efficiency of your compression, similar in style to the Table 5.1 in IIR. Measure the size of your postings files, (a) originally after Homework #2, (b) after adding in the capability of positional postings, (c) after adding in compression for positional postings, (d) after adding in compression for document postings, and (e) after adding in both compression for document and positional postings.
  2. For the Reuters 21578 collection, do you think using different variable sized encoding (4, 16 or 32 bits) would help? For encoding document postings? For encoding positional postings?
  3. Imagine that we disallow searching for stopwords and numbers by themselves, but only allow searching for them as part of phrasal queries. Furthermore, assume such phrasal queries must start with a non-stopword. In this case searches like he and "the who" are disallowed but "bmw 325" and "national unversity of singapore" are allowed. Could we make any cost savings in indexing at either indexing or query time? ("Cost savings" here covers both space and/or speed).

Submission Formatting

The instructions below are repeated for clarity sake. Only parts highlighted in red are different from previous assignments.

You are allowed to do this assignment individually or as a team of two. There will be no difference in grading criteria if you do the assignment as a team or individually. Only one student in a team of two need to submit the assignment. For the submission information below, simply replace any mention of a matric number with the two matric numbers concatenated with a separating dash (e.g., U000000X-U000001Y).

For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:

These files will need to be suitably zipped in a single file called submission-<matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called submission-<matric number>. Upload the resulting zip file to the IVLE workbin by the due date: Thursday, 17 Mar 11:59:59 pm SGT. There absolutely will be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.

Grading Guidelines

The grading criteria for the assignment is tentatively:

Disclaimer: percentage weights may vary without prior notice.

Hints


Min-Yen Kan <kanmy@comp.nus.edu.sg> Mon Feb 21 11:55:32 2011 | Version: 1.0 | Last modified: Thu Mar 31 23:09:40 2011