Menu

[ IVLE ]

[ Overview ]
[ Syllabus ]
[ Grading ]
[ Homework
  HW 1
  HW 2
  >HW 3
  HW 4
  HW 5 ]
[ Misc. ]

Last updated: Mon Feb 27 09:11:40 SGT 2012 - First release

In Homework 3, you will implement a searcher that supports positional indexing and dictionary compression. To do this, you can choose to further modify either your solution to Homework 2 or Hadi's sample solution to Homework 2.

Commonalities with Homework 2

The indexing and query commands will use an identical input format to Homework 2, so that you need not modify any of your code to deal with command line processing. To recap:

Indexing: $ python index.py -i directory-of-documents -d dictionary-file -p postings-file

Searching: $ python search.py -d dictionary-file -p postings-file -q file-of-queries -o output-file-of-results

We will also again use the Reuters training data set provided by NLTK. Depending on where you specified the directory for NLTK data when you first installed the NLTK data (recall that the installation is triggered by nltk.download()), this data set is located under a path like: .../nltk_data/corpora/reuters/training/.

As in Homework 2, your search process must not keep the postings file in memory in answering queries, but instead use a file cursor to randomly access parts of the postings file. However, there is no such restriction for the indexer (you are not asked to implement BSBI or SPIMI).

Positional Indexing

Your searching code now needs to additionally support phrasal queries, such as:

"New York" OR Boston

To do this, you need to change your data structure for postings to include information about where the words occur. You only have to implement correct phrasal query handling (no proximity queries). You will need to modify your indexing scheme to change the data structure used for postings. Slightly more tricky is the modifications you need to make to your retrieval code (We can regard a phrasal query as an AND query but with specific positional constraints).

Dictionary Compression

Your code should implement dictionary as a string compression as described in Chapter 5 of IIR (Slide 15 in Week 6's lecture notes). Use a tree (binary or other tree structure of your choice) to implement the dictionary search structure. The version of the dictionary-as-a-string you should use uses a single string to store the dictionary, without the use of blocking (please do not implement the technique shown in lecture slides 17-20). Both the term and postings "pointers" should be byte offsets into the appropriate disk file or string.

What to turn in?

You are required to submit index.py, search.py, dictionary.txt, and postings.txt. Please do not include the Reuters data in your submission.

Essay questions

You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these are open-ended questions and do not have gold standard answers. A paragraph or two are usually sufficient for each question, but some of the questions will require some calculation support on your part, and for such questions, may be a bit longer. You should recap the design of your tree data structure if necessary in your essay section, if necessary, to justify your computations or explanation.

  1. Imagine that we disallow searching for stopwords and numbers by themselves, but only allow searching for them as part of phrasal queries. Furthermore, assume such phrasal queries must start with a non-stopword. In this case searches like he and "the who" are disallowed but "bmw 325" and "national unversity of singapore" are allowed. Could we make any cost savings in indexing at either indexing or query time? ("Cost savings" here covers both space and/or speed).
  2. Give statistics of your dictionary compression, similar in style to the Table 5.1 in IIR. Measure or compute (after making any necessary assumptions) the size of your dictionary (a) originally after Homework #2, (b) after using dictionary as a string compression, and (c) after using blocking with k=4 (this step is not implemented in your assignment, but just for you to think about. Your calculations need not be exact.
  3. Imagine that our query processor also needs to handle single keyword suffix wildcards, where the stem part of the query term must be of length 5 or greater (e.g., "automat*"). How would you implement support for this feature, given your system?
  4. How would you implement essay question #3 above if you had to use a hash table data structure for the dictionary? Would you recommend this data structure over your tree implementation?

Submission Formatting

The instructions below are repeated for clarity sake. Only parts highlighted in red are different from previous assignments.

You are allowed to do this assignment individually or as a team of two. There will be no difference in grading criteria if you do the assignment as a team or individually. Only one student in a team of two need to submit the assignment. For the submission information below, simply replace any mention of a matric number with the two matric numbers concatenated with a separating dash (e.g., U000000X-U000001Y).

For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:

These files will need to be suitably zipped in a single file called submission-<matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called submission-<matric number>. Upload the resulting zip file to the IVLE workbin by the due date: Monday, 12 Mar 11:59:59 pm SGT. There absolutely will be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.

Grading Guidelines

The grading criteria for the assignment is tentatively:

Disclaimer: percentage weights may vary without prior notice.

Hints


Min-Yen Kan <kanmy@comp.nus.edu.sg> Mon Feb 27 09:24:54 SGT 2012 | Version: 1.0 | Last modified: Sat Mar 24 12:30:02 2012