Homework 5

In Homework 5, you will implement two simple programs to evaluate classification and ranked retrieval systems. Specifically, the software you create can also be used to evaluate your own code from earlier assignments (see the Essay questions section below).

Classification Evaluation

In Homework 1, you completed a URL classifier, which assigns one of three categories to a given URL. We can calculate precision, recall and F1 for performance on these categories individually, and also compile an average precision, recall and F1 over all n categories (in the case of Homework 1, n = 3). You will write code to compute these values given input from three separate files:

  1. A gold standard file (e.g., urls.correct.txt), usually created manually using human judgments. This file should contain one line per classification task, giving the correct class (e.g., Arts, Sports, News) followed by the text of the classification task (e.g., the URL); see the illustration after this list.
  2. A prediction file (e.g., urls.predict.txt), usually created by an automated system. This file is in exactly the same format as the gold standard file, but will likely contain different classes on some lines, where the system has predicted a class other than the gold standard one.
  3. A classes file (e.g., classes.txt), which gives the possible prediction classes. The file should have one class per line (for our URL prediction task, this file should have three lines, each containing one of "Arts", "Sports", and "News"; order doesn't matter).
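For illustration only (the URLs below are made up, not actual homework data), a gold standard file for the URL task might contain lines such as:

Arts http://example.com/gallery/monet-exhibit
News http://example.com/2012/03/election-results
Sports http://example.com/football/match-report

The prediction file would list the same URLs in the same order, but with the classes your classifier assigned, and classes.txt would simply contain the three lines Arts, News and Sports.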

Your code eval-c.py should be invoked in the following way:

eval-c.py -g urls.correct.txt -p urls.predict.txt -c classes.txt -o output-statistics.txt

where -g is for the gold standard answers, -p for the predicted answers, -c for the available classes, and -o for the output file. As output, calculate each of the three metrics -- precision, recall, and F1 -- for each individual class, one metric per line. Output one such set of three lines per class, in ascending lexicographical order of class name, followed by the final average over all n classes. Report figures to 2 decimal places. For our URL classification task, your output file should resemble the following:

Precision of Arts: WW.WW
Recall of Arts: XX.XX
F1 of Arts: XX.XX
Precision of News: YY.YY
Recall of News: XX.XX
F1 of News: XX.XX
Precision of Sports: ZZ.ZZ
Recall of Sports: XX.XX
F1 of Sports: XX.XX
Average Precision: AA.AA
Average Recall: XX.XX
Average F1: XX.XX

Note that there are two possible ways to calculate the averages. For this assignment, calculate the averages by averaging the individual per-class metric values (e.g., to calculate Average Precision in the above, add together WW.WW + YY.YY + ZZ.ZZ, and divide by three).
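The following is a minimal Python sketch of one way eval-c.py could compute these figures. It assumes the whitespace-separated file formats described above; the helper names are illustrative, values are reported as fractions in [0, 1] (scale them if your figures are expected as percentages), and error handling is omitted.

# eval-c.py -- a minimal sketch; flag handling and edge cases are simplified
import getopt
import sys
from collections import defaultdict

def read_first_tokens(path):
    # the first whitespace-separated token on each line is the class label
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]

def evaluate(gold_file, pred_file, class_file, out_file):
    gold = read_first_tokens(gold_file)
    pred = read_first_tokens(pred_file)
    classes = sorted(read_first_tokens(class_file))  # ascending lexicographical order

    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    lines = []
    sum_p = sum_r = sum_f = 0.0
    for c in classes:
        p = tp[c] / float(tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        r = tp[c] / float(tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        sum_p += p; sum_r += r; sum_f += f
        # values are fractions in [0, 1]; multiply by 100 if percentages are expected
        lines.append("Precision of %s: %.2f" % (c, p))
        lines.append("Recall of %s: %.2f" % (c, r))
        lines.append("F1 of %s: %.2f" % (c, f))

    n = len(classes)
    lines.append("Average Precision: %.2f" % (sum_p / n))
    lines.append("Average Recall: %.2f" % (sum_r / n))
    lines.append("Average F1: %.2f" % (sum_f / n))

    with open(out_file, "w") as out:
        out.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    opts = dict(getopt.getopt(sys.argv[1:], "g:p:c:o:")[0])
    evaluate(opts["-g"], opts["-p"], opts["-c"], opts["-o"])

Note that the averages here are computed exactly as described above: the per-class values are calculated first and then averaged.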

Interpolated Precision and Recall

In a ranked retrieval system, documents are returned in descending order of relevance, whether that relevance is computed by a probabilistic, vector space, or other model. You will also compute the interpolated precision/recall curve for the documents returned by a retrieval system, given input from two files:

  1. A gold standard file (e.g., correct-file-of-results), usually created manually using human judgments. This file contains one line per retrieval task (e.g., a query), listing the document IDs that are relevant, separated by spaces; there is no extra space at the end of the line. Irrelevant document IDs are not listed: every remaining document that is not listed as relevant is irrelevant.
  2. A prediction file (e.g., output-file-of-results from HW4), usually created by an automated system. This file is in exactly the same format as the gold standard file, but will likely contain a different set of documents on lines (i.e., queries) where the system has returned some false positives or false negatives; see the illustration after this list.
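For illustration only (these document IDs are made up), a line of the gold standard file might read:

3 14 27 52

while the corresponding line of the prediction file might read:

14 3 99 52 61

with the predicted document IDs assumed to be listed in descending order of relevance (most relevant first).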

Your code eval-ir.py should be invoked in the following way:

eval-ir.py -l 1 -g correct-file-of-results -p output-file-of-results -o output-statistics.txt

where -g is for the gold standard answers, -p for the predicted answers, -l for the line number of the retrieval task to evaluate, and -o for the output file. As output, calculate each of the three metrics -- interpolated precision, recall, and F1 -- at each rank, one metric per line, for the retrieval task on line l. Output one such set of three lines per rank, starting from Rank 1 (the top-ranked document) and working downwards, as in the example below. Report figures to 2 decimal places.

Suppose that for the first query, our vector space model retrieval system returns five documents. Then your output file should resemble the following (one set of precision, recall, and F1 lines for each of the five ranks):

Precision at Rank 1: XX.XX
Recall at Rank 1: XX.XX
F1 at Rank 1: XX.XX
Precision at Rank 2: XX.XX
Recall at Rank 2: XX.XX
F1 at Rank 2: XX.XX
Precision at Rank 3: XX.XX
Recall at Rank 3: XX.XX
F1 at Rank 3: XX.XX
Precision at Rank 4: XX.XX
Recall at Rank 4: XX.XX
F1 at Rank 4: XX.XX
Precision at Rank 5: XX.XX
Recall at Rank 5: XX.XX
F1 at Rank 5: XX.XX

Please note that your eval-ir.py code is used to evaluate an individual query, not all of the queries in output-file-of-results.
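A minimal Python sketch of one way eval-ir.py could be structured is shown below. It assumes the file formats described above, that the prediction line lists document IDs in rank order, and that interpolated precision at rank k is taken as the maximum precision at rank k or any later rank; computing F1 from the interpolated (rather than actual) precision is an assumption of this sketch, values are reported as fractions in [0, 1], and error handling is omitted.

# eval-ir.py -- a minimal sketch; flag handling and edge cases are simplified
import getopt
import sys

def read_ids(path, line_no):
    # return the space-separated document IDs on the given 1-based line
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if i == line_no:
                return line.split()
    return []

def evaluate_query(gold_file, pred_file, line_no, out_file):
    relevant = set(read_ids(gold_file, line_no))
    retrieved = read_ids(pred_file, line_no)  # assumed to be listed in rank order

    # actual precision and recall after each rank k = 1 .. len(retrieved)
    precisions, recalls = [], []
    hits = 0
    for k, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
        precisions.append(hits / float(k))
        recalls.append(hits / float(len(relevant)) if relevant else 0.0)

    lines = []
    for k in range(1, len(retrieved) + 1):
        # interpolated precision at rank k: the highest precision at rank k or beyond
        interp_p = max(precisions[k - 1:])
        r = recalls[k - 1]
        # F1 computed from the interpolated precision (an assumption of this sketch)
        f = 2 * interp_p * r / (interp_p + r) if (interp_p + r) else 0.0
        lines.append("Precision at Rank %d: %.2f" % (k, interp_p))
        lines.append("Recall at Rank %d: %.2f" % (k, r))
        lines.append("F1 at Rank %d: %.2f" % (k, f))

    with open(out_file, "w") as out:
        out.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    opts = dict(getopt.getopt(sys.argv[1:], "l:g:p:o:")[0])
    evaluate_query(opts["-g"], opts["-p"], int(opts["-l"]), opts["-o"])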

What to turn in?

You are required to submit eval-c.py and eval-ir.py.

Essay questions

You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these questions may not have gold standard answers. A short paragraph or two is usually sufficient for each question.

  1. What is the other possible method for calculating the average precision, recall or F1? Describe how this other average could be different in value from the one that you calculated in eval-c.py.
  2. In the eval-c task, we implicitly assume that each class is equally important in calculating the average metrics. If we knew that certain classes are more important than others (e.g., News URLs are very important to not miss), suggest how that could be best reflected in the averaging.
  3. In the eval-ir task, we asked you to calculate interpolated precision, as opposed to actual precision. If we used average actual precision, explain whether your results would change.

Submission Formatting

The instructions below are repeated for clarity's sake. Instructions that differ from those for the previous Homework 4 are highlighted in red.

You must do this assignment individually. For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines; they will help us grade the assignment in an appropriate manner, and you will be penalized if you do not follow them. Your matric number in all of the following statements should not have any spaces, and any letters should be in CAPITALS. You are to turn in the following files:

These files will need to be suitably zipped in a single file called submission-<matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called submission-<matric number>. Upload the resulting zip file to the IVLE workbin by the due date: Monday 9 April 11:59:59 pm SGT. There absolutely will be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.

Grading Guidelines

The grading criteria for the assignment are tentatively:

Disclaimer: percentage weights may vary without prior notice.



Min-Yen Kan <kanmy@comp.nus.edu.sg> Thu Mar 31 13:37:49 2011 | Version: 1.0 | Last modified: Sat Mar 24 12:39:42 2012