CS 3245 - Information Retrieval

Last updated: Mon Feb 27 09:11:40 SGT 2012 - First release

In Homework 3, you will implement a searcher that supports positional indexing and dictionary compression. To do this, you can choose to further modify either your solution to Homework 2 or Hadi's sample solution to Homework 2.

Commonalities with Homework 2

The indexing and query commands will use an identical input format to Homework 2, so that you need not modify any of your code to deal with command line processing. To recap:

Indexing: $ python index.py -i directory-of-documents -d dictionary-file -p postings-file

Searching: $ python search.py -d dictionary-file -p postings-file -q file-of-queries -o output-file-of-results

We will also again use the Reuters training data set provided by NLTK. Depending on where you specified the directory for NLTK data when you first installed the NLTK data (recall that the installation is triggered by nltk.download()), this data set is located under a path like: .../nltk_data/corpora/reuters/training/.

As in Homework 2, your search process must not keep the postings file in memory in answering queries, but instead use a file cursor to randomly access parts of the postings file. However, there is no such restriction for the indexer (you are not asked to implement BSBI or SPIMI).

Positional Indexing

Your searching code now needs to additionally support phrasal queries, such as:

"New York" OR Boston

To do this, you need to change your data structure for postings to include information about where the words occur. You only have to implement correct phrasal query handling (no proximity queries). You will need to modify your indexing scheme to change the data structure used for postings. Slightly more tricky is the modifications you need to make to your retrieval code (We can regard a phrasal query as an AND query but with specific positional constraints).

Dictionary Compression

Your code should implement dictionary as a string compression as described in Chapter 5 of IIR (Slide 15 in Week 6's lecture notes). Use a tree (binary or other tree structure of your choice) to implement the dictionary search structure. The version of the dictionary-as-a-string you should use uses a single string to store the dictionary, without the use of blocking (please do not implement the technique shown in lecture slides 17-20). Both the term and postings "pointers" should be byte offsets into the appropriate disk file or string.

What to turn in?

You are required to submit index.py, search.py, dictionary.txt, and postings.txt. Please do not include the Reuters data in your submission.

Essay questions

You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these are open-ended questions and do not have gold standard answers. A paragraph or two are usually sufficient for each question, but some of the questions will require some calculation support on your part, and for such questions, may be a bit longer. You should recap the design of your tree data structure if necessary in your essay section, if necessary, to justify your computations or explanation.

Imagine that we disallow searching for stopwords and numbers by themselves, but only allow searching for them as part of phrasal queries. Furthermore, assume such phrasal queries must start with a non-stopword. In this case searches like he and "the who" are disallowed but "bmw 325" and "national unversity of singapore" are allowed. Could we make any cost savings in indexing at either indexing or query time? ("Cost savings" here covers both space and/or speed).
Give statistics of your dictionary compression, similar in style to the Table 5.1 in IIR. Measure or compute (after making any necessary assumptions) the size of your dictionary (a) originally after Homework #2, (b) after using dictionary as a string compression, and (c) after using blocking with k=4 (this step is not implemented in your assignment, but just for you to think about. Your calculations need not be exact.
Imagine that our query processor also needs to handle single keyword suffix wildcards, where the stem part of the query term must be of length 5 or greater (e.g., "automat*"). How would you implement support for this feature, given your system?
How would you implement essay question #3 above if you had to use a hash table data structure for the dictionary? Would you recommend this data structure over your tree implementation?

Submission Formatting

The instructions below are repeated for clarity sake. Only parts highlighted in red are different from previous assignments.

You are allowed to do this assignment individually or as a team of two. There will be no difference in grading criteria if you do the assignment as a team or individually. Only one student in a team of two need to submit the assignment. For the submission information below, simply replace any mention of a matric number with the two matric numbers concatenated with a separating dash (e.g., U000000X-U000001Y).

For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:

A plain text documentation file README-<matric number>.txt (e.g., README-U000000X.txt): this is a text only file that describes any information you want me to know about your submission. You should not include any identifiable information about your assignment (your name, phone number, etc.) except your matric number and email (we need the email to contact you about your grade, please use your u*******@nus.edu.sg address, not your email alias). This is to help you get an objective grade in your assignment, as we won't associate matric numbers with student names. Please indicate in your README whether you modified the sample solution to Homework #2, modified your own Homework #2 submission or wrote your code from scratch.
All source code. We will be reading your code, so please do us a favor and format it nicely.
A plain text file essay-<matric number>.txt that contains your answers to the essay questions.

These files will need to be suitably zipped in a single file called submission-<matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called submission-<matric number>. Upload the resulting zip file to the IVLE workbin by the due date: Monday, 12 Mar 11:59:59 pm SGT. There absolutely will be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.

Grading Guidelines

The grading criteria for the assignment is tentatively:

30% Positional indexing handling
30% Dictionary compression
10% Search results correctness
10% Documentation
20% Essay questions

Disclaimer: percentage weights may vary without prior notice.

Hints

Indexing a large number of documents may take a while and a lot of space. Again, we suggest that you just test and develop your system on a subset that will make it quick to do experimentation.
Make sure you have a working solution to one part before modifying it to do another part. Save your incremental, working progress in another (directory/file/source control system) before starting the next part. In particular, if you work with just a few documents first, you may want to do the tasks in this order:
1. Carefully design how you are going to do the positional indexing and the compression. You need to think through the implications of the steps on modifications downstream. Cut up the task into reasonable sized parts.
2. You may want to make changes to your query parser first, so that you can see the correct parse of the queries and check to see whether your query parser chooses the right intersections to do in which order.
3. Implement the positional indexer, before the query processor.
4. Once you have a working positional system, you can think about how to modify it to achieve the dictionary compression.
Working in a group can be helpful but can be counter-productive too. I have observed that group work tends to make homework submissions grades slightly higher than single submissions but also more average (it's harder to have an outstanding grade with a group). If you think you work better in a group of two, please do. Make sure to clearly partition the work and demarcate the API between your modules.

Min-Yen Kan <kanmy@comp.nus.edu.sg> Mon Feb 27 09:24:54 SGT 2012 | Version: 1.0 | Last modified: Sat Mar 24 12:30:02 2012

CS 3245 » Homework 3 - Indexing

Menu