Homework 1 - Approximate LINC Evaluation


As per our lecture materials on the evaluation of libraries, our first homework examines how well the coverage of the NUS Libraries matches the queries given by users. To do this, we will take a medium-sized sample of queries and attempt to categorize them. We will then compare the resulting categories with the distribution of materials present in the Libraries. Please read this assignment carefully. Some planning is necessary to complete the entire assignment (e.g., the first step may take a couple of days because of Google's 1000-query limit), but partial marks will be given for any completed milestones.

Your submission should be a single .zip file containing a single file as output for each step. Label these files as Step1.txt, Step2.txt, Step3.txt, Essay.txt (or with another appropriate file extension). The files should not be marked with any personal identification (i.e., no name, email address, or affiliation). Your zip file should be named with your Matric number (e.g., AB123456X).

In doing this assignment, some steps in reaching your conclusion could be done in different ways. If you need to make a (logical) assumption or transform the data to make it manageable, do so, but report your assumption/transformation and provide some justification for your action.

Since this homework is cascaded (i.e., the later parts rely on the earlier ones), it's crucial that you finish the early steps on time. If you cannot manage a particular step in the beginning, you may use a fellow classmate's result for that step, as long as you acknowledge that you have done so. Note that you will not receive any credit for the part you've copied.

Details

Our School has access to over 150,000 past queries asked by library patrons to the LINC system. We will utilize these records in answering this question. In the IVLE workbin, you will find a 1000-query slice of this data that has been preprocessed for your assignment. Please see the note at the end of this assignment for an important legal notice regarding this LINC data.

Step 1. Send old queries to Google (25% of grade, should take at most a couple of hours). For pages relevant to a query that have been categorized by the Open Directory Project (ODP), Google will report the ODP category of the website. We can utilize this to get an automatic classification of the query, albeit a quite noisy one.

Google permits mass querying of its web database for research (and in this case, homework). To keep track of this usage and separate it from normal usage of its search engine, Google provides a standard API that allows research users to get results from the search engine without having to perform web scraping (i.e., extraction of results from saved .html pages from Google). All students are required to sign up for a Google API key (if you are reluctant to disclose any information to get a key, please see me for alternative methods). You will have to program a suitable interface that will take each query in the LINC queries slice and send it to Google to get back the first page of results (up to 10 hits).

Hint: a lot of software is available for querying the Google API and processing its results, including software in the rpnlpir account hosted on sf3. You're welcome to use these resources, but you must cite who/what you have used to assist you and make clear what your contribution is.

Output: a file consisting of queries, categories and ranks. You can use an initial pound sign ("#") to comment your file as necessary. Each content line should contain the original query, the category returned by Google, and its rank, with the fields separated by a single tab. You can omit documents that aren't categorized in the ODP from this file. In this format, each query can produce up to ten lines of results.
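
To make the expected processing concrete, here is a minimal Python sketch of Step 1. It is not the required implementation: google_search() is a hypothetical stand-in for whatever Google API client you build or reuse, assumed to return (rank, url, category) tuples for the first page of results, with category set to None when Google reports no ODP category; the file names are placeholders.

    # Sketch only. google_search() is a hypothetical wrapper around your
    # Google API client; it is assumed to return up to 10 first-page hits
    # as (rank, url, category) tuples, with category == None when Google
    # reports no ODP category for that hit.
    def google_search(query):
        raise NotImplementedError("wrap your Google API client here")

    def run_step1(query_file, output_file):
        with open(query_file) as qf, open(output_file, "w") as out:
            out.write("# query<TAB>category<TAB>rank\n")
            for line in qf:
                query = line.strip()
                if not query:
                    continue
                for rank, url, category in google_search(query):
                    # Omit hits that Google does not categorize in the ODP.
                    if category is None:
                        continue
                    out.write("%s\t%s\t%d\n" % (query, category, rank))

    # Example usage (file names are placeholders):
    # run_step1("linc_queries.txt", "Step1.txt")

Since the 1000-query limit may force you to harvest results in batches over several days, you may also want to cache each query's raw results locally before producing the final output file.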

Step 2. Assign weights to pruned ODP categories (25%). The Open Directory Project is a large, public-domain, publicly-accessible hierarchical organization of websites, similar to the Yahoo! directory. A full copy of the ODP dated 19 Sep 2002 can be found under the rpnlpir filesystem. I've uploaded the first three levels of the ODP directory structure to IVLE with some useless categories omitted.

You will take the output from Step 1 and prune the ODP categories that have more than three levels of classification down to three levels. The remaining categories should all be in the three-level format (e.g., Top/X/Y).

Then you will assign a fractional weight to each ODP category to represent how well the category covers each query. Assign a simple fractional weight of 1/N to each category present, where N is the number of results returned on the first page (1-10). Thus, summing the total distribution of weights across categories should equal the number of queries processed by Google that returned at least one category hit.
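
As a rough illustration of one reasonable reading of this step, the Python sketch below prunes categories from a Step1.txt file in the format above and accumulates the 1/N weights per query. It assumes N can be approximated by the number of result lines recorded for the query; if you also logged the true first-page hit count in Step 1, substitute that instead.

    from collections import defaultdict

    def prune(category, depth=3):
        # Keep only the first `depth` levels, e.g.
        # Top/Science/Biology/Biochemistry -> Top/Science/Biology
        return "/".join(category.split("/")[:depth])

    def assign_weights(step1_file):
        # per_query[query] is the list of pruned categories, one per hit
        per_query = defaultdict(list)
        with open(step1_file) as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue
                query, category, rank = line.rstrip("\n").split("\t")
                per_query[query].append(prune(category))

        # weights[(query, pruned_category)] accumulates 1/N per hit.
        # Assumption: N is approximated by the number of recorded hits
        # for the query.
        weights = defaultdict(float)
        for query, categories in per_query.items():
            n = len(categories)
            for cat in categories:
                weights[(query, cat)] += 1.0 / n
        return weights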

Output: The input file with a fractional weight assigned to the category. The format should be the same as in Step #1, with a single tab separating the pruned category and its weight.

Step 3. Goodness of fit to the LINC category (20%). In the workbin, you'll also find an approximation of the number of books present in LINC that correspond to a category in the ODP.

Our LINC system uses the Library of Congress Subject Headings (LCSH) categorization scheme as its basis, which is well-suited for academic library materials. However, web data is not very similar to the library's storehouse of knowledge; for such data, categorization schemes such as the ODP and Yahoo! are suitable replacements. As you know, the LCSH is very detailed, and the classification scheme must be pruned to arrive at a smaller, more manageable set of subjects. Details about how this was done are found within the workbin file. You'll want to note any flaws in this method for the Essay part below.

Calculate the chi-squared similarity between the two distributions. If you don't know what the chi-squared measure is, make sure you learn about it first.
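
If you work in Python, the goodness-of-fit computation might look like the sketch below. The inputs are hypothetical: observed maps each pruned category to its total query weight from Step 2, and linc_counts maps each category to the approximate book count from the workbin file; the expected counts are obtained by scaling the LINC proportions to the total observed weight, which is one reasonable way to set up the test.

    from scipy.stats import chi2

    def chi_squared_fit(observed, linc_counts):
        # observed:    {category: total query weight from Step 2}
        # linc_counts: {category: approximate number of LINC books}
        total_obs = sum(observed.values())
        total_linc = sum(linc_counts.values())
        stat = 0.0
        for cat, obs in observed.items():
            expected = (linc_counts.get(cat, 0) / total_linc) * total_obs
            if expected > 0:
                # Chi-squared statistic: sum of (O - E)^2 / E
                stat += (obs - expected) ** 2 / expected
        df = len(observed) - 1            # degrees of freedom
        p_value = chi2.sf(stat, df)       # 1 - CDF of the chi-squared dist.
        return stat, p_value

You may instead compare the statistic against a chi-squared table at your chosen significance level rather than computing a p-value; either way, report how you handled categories that appear in only one of the two distributions.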

Output: In a text or spreadsheet file, conclude whether your chi-squared score shows that the similarity between the two distributions is significant or not. If it is, cite the significance level. If it isn't, briefly (in no more than 500 words) identify which categories contribute the largest amount of dissimilarity and discuss why.

Essay Questions (30%)

You should write no more than 500 words on each of these topics. Essays that obviously exceed this length will be cut. Output: a text file containing your answers. No formatting is necessary.

  1. In this homework we approximate the distribution of library materials using a second classification provided by the ODP, to avoid hitting LINC with multiple, automated requests. Briefly discuss which factors would be improved or worsened if we performed the classification of queries directly using LINC. You may want to try some of these searches on LINC to get some ideas.
  2. If the chi-squared calculation comes out as significant, does this allow you to conclude that the library satisfies student queries well, or would this conclusion be a fallacy? What can you safely conclude from the chi-squared test? What might be a better way to decide whether LINC satisfies queries well?

Other issues

Legal issues concerning this data. This data is being provided to us by NUS Libraries for research purposes, so you must destroy the data files that you use for this assignment after you complete the homework.

Late Policy

There is no late policy for this homework. Submissions are due by 11:59:59 pm on the due date. Any late submission will be worth 0 marks.




Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Wed Aug 27 15:29:34 2003 | Version: 1.0 | Last modified: Mon Sep 22 20:58:41 2003