CS 3245 - Information Retrieval

Last updated: Thu Jan 20 12:13:08 SGT 2011 - First release

In this homework, you are asked to implement a URL classification system using the ngram knowledge you learned in Week 1. Given a URL, your program should predict whether it belongs to a web page about Arts, News, or Sports. For instance, given the following three URLs:

http://www.cnn.com/US/9806/22/hot.hot.hot/video.html

http://artists.iuma.com/IUMA/Bands/FATE/

http://cbs.sportsline.com/u/football/nfl/teams/pages/NE/locker.htm

... an ideal program should output/predict the following labels for the URLs:

News	`http://www.cnn.com/US/9806/22/hot.hot.hot/video.html`
Arts	`http://artists.iuma.com/IUMA/Bands/FATE/`
Sports	`http://cbs.sportsline.com/u/football/nfl/teams/pages/NE/locker.htm`

Build and test your language models (LMs)

You should run your program to build and test your LMs in this format:

$ python build_test_LM.py -b input-file-for-building-LM -t input-file-for-testing-LM -o output-file

where input-file-for-building-LM is a file given to you that contains a list of URLs with their labels for you to build your ngram language models, input-file-for-testing-LM is a file containing a list of URLs for you to test your language models, and output-file is a file where you store your predictions.
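
As an illustration of the command-line format above, here is a minimal sketch of how the three options could be read. It assumes Python 3 and the standard argparse module; the function name parse_args and the attribute names build_file, test_file, and out_file are hypothetical, and the provided skeleton may already handle the options in a different way.

```python
import argparse


def parse_args(argv=None):
    """Parse the -b, -t and -o options (hypothetical helper, not from the skeleton)."""
    parser = argparse.ArgumentParser(
        description="Build and test ngram language models for URL classification")
    parser.add_argument("-b", dest="build_file", required=True,
                        help="file of labelled URLs used to build the LMs")
    parser.add_argument("-t", dest="test_file", required=True,
                        help="file of unlabelled URLs to classify")
    parser.add_argument("-o", dest="out_file", required=True,
                        help="file in which to store the predictions")
    # argv=None makes argparse read sys.argv[1:], as in a normal run
    return parser.parse_args(argv)
```

Calling parse_args(["-b", "urls.build.txt", "-t", "urls.test.txt", "-o", "urls.predict.txt"]) returns an object whose build_file, test_file, and out_file attributes hold the three file names.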

Evaluate your predictions

To evaluate the accuracy of your predictions, run the provided evaluation script eval.py:

$ python eval.py file-containing-your-results file-containing-correct-results

where file-containing-your-results is the output-file from the build and test step, and file-containing-correct-results is a file containing the correct URL labels.

For example, in the homework package, you are given several files, including a skeleton build_test_LM.py, eval.py, urls.build.txt, urls.test.txt, and urls.correct.txt. To build and test your LMs, run:

$ python build_test_LM.py -b urls.build.txt -t urls.test.txt -o urls.predict.txt

which will store your predictions in urls.predict.txt. To evaluate your predictions, run:

$ python eval.py urls.predict.txt urls.correct.txt

which prints the accuracy of your predictions.
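
To make "accuracy" concrete, the sketch below computes the fraction of lines whose predicted label matches the correct one. This is not the provided eval.py, which may compute or report its result differently; the function name accuracy is hypothetical, and the code assumes Python 3 and that every line starts with the label followed by a tab.

```python
def accuracy(predicted_lines, correct_lines):
    """Return the fraction of line pairs whose labels agree.

    Assumption: each line is "label<TAB>URL"; only the label is compared.
    """
    pairs = list(zip(predicted_lines, correct_lines))
    if not pairs:
        return 0.0
    hits = sum(
        pred.split("\t")[0].strip() == gold.split("\t")[0].strip()
        for pred, gold in pairs
    )
    return hits / len(pairs)
```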

What is expected in build_test_LM.py?

The Python program build_test_LM.py is given to you as a skeleton script. You are required to complete this script by implementing the build_LM() and test_LM() functions.

You first need to strip the http:// from each URL, and then collect the five-grams from the rest of the URL, where the gram units are characters. For example, for the URL http://www.nba.com/, the following character five-grams are collected. (Note that you can choose to pad the beginning and end of the URL as shown in slide 13 of lecture 1.)

[('w', 'w', 'w', '.', 'n'), ('w', 'w', '.', 'n', 'b'), ('w', '.', 'n', 'b', 'a'), ('.', 'n', 'b', 'a', '.'), ('n', 'b', 'a', '.', 'c'), ('b', 'a', '.', 'c', 'o'), ('a', '.', 'c', 'o', 'm'), ('.', 'c', 'o', 'm', '/')]
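
The extraction step can be sketched as follows. This is one possible approach rather than the required implementation: the function name char_ngrams is hypothetical, the code assumes Python 3, and it does not pad the beginning or end of the URL.

```python
def char_ngrams(url, n=5):
    """Return the character n-grams of a URL as a list of tuples.

    Assumption: only a leading "http://" is stripped; no padding is added.
    """
    prefix = "http://"
    if url.startswith(prefix):
        url = url[len(prefix):]
    # Slide a window of n characters over the remaining string;
    # a string shorter than n yields an empty list.
    return [tuple(url[i:i + n]) for i in range(len(url) - n + 1)]
```

For http://www.nba.com/ this yields the eight five-grams listed above, starting with ('w', 'w', 'w', '.', 'n') and ending with ('.', 'c', 'o', 'm', '/').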

For each of the Arts, News, and Sports labels, you then build a language model with add-one smoothing, similar to the ones shown in slide 20 of lecture 1, which smooths over all observed ngrams. The differences are that you are required to use probabilities instead of counts, and five-grams instead of unigrams. Your language models for the three labels should look similar to the following table, where rows 3 to 5 are the language models for Arts, News, and Sports, respectively. Note that each row should sum to 1, and that other entries in the table have been omitted for clarity.

Labels	Five-grams
	...	('.', 'a', 's', 't', 'b')	('w', '.', 'n', 'b', 'a')	('.', 'c', 'o', 'm', '/')	...
Arts	...	3.11e-07	1.03e-07	2.07e-07	...
News	...	5.17e-07	2.07e-07	1.52e-07	...
Sports	...	2.89e-06	5.17e-07	9.65e-07	...
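
The sketch below shows one way to turn ngram counts into add-one smoothed probabilities. It rests on an assumption about the intended scheme: the vocabulary is taken to be every ngram observed under any label, each label's count for every vocabulary entry is incremented by one, and the result is divided by that label's smoothed total so that each row sums to 1. The function name build_lm and its input format are hypothetical, and the code assumes Python 3.

```python
from collections import Counter


def build_lm(labelled_ngrams):
    """Build add-one smoothed models from (label, list_of_ngrams) pairs.

    Returns {label: {ngram: probability}} over a vocabulary shared by all labels.
    """
    counts = {}
    vocab = set()
    for label, grams in labelled_ngrams:
        counts.setdefault(label, Counter()).update(grams)
        vocab.update(grams)

    lm = {}
    for label, counter in counts.items():
        # Smoothed total: raw count for this label plus one per vocabulary entry
        total = sum(counter.values()) + len(vocab)
        lm[label] = {g: (counter[g] + 1) / total for g in vocab}
    return lm
```

With this scheme, an ngram that was seen only under Sports still receives a small non-zero probability under Arts and News.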

To test a new URL, you should multiply the probabilities of the five-grams for this URL, and return the label (i.e., Arts, News, or Sports) that gives the highest product. Ignore any five-gram that is not found in the LMs.
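
The testing step can be sketched as below. Note one deliberate substitution: instead of multiplying many small probabilities directly, the sketch sums their logarithms, which gives the same ranking of labels while avoiding floating-point underflow on long URLs. The function name classify and the model format {label: {ngram: probability}} are assumptions, and the code assumes Python 3.

```python
import math


def classify(grams, lm):
    """Return the label whose model gives the highest log-probability sum.

    Ngrams missing from a label's model are ignored. If none of the ngrams
    is known, every score is 0 and the first label in lm is returned.
    """
    best_label, best_score = None, -math.inf
    for label, probs in lm.items():
        score = sum(math.log(probs[g]) for g in grams if g in probs)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because the logarithm is monotonic, the label with the highest sum of log-probabilities is also the label with the highest product.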

Formats of the input/output files

urls.build.txt: the input file for you to build your LMs. It contains 10,000 lines, and each line is a label/URL pair separated by a tab. There are three labels in total: Arts, News, and Sports.
urls.test.txt: the input file for you to test your LMs. It contains 20 URLs, one per line.
urls.correct.txt: contains the correct labels for the 20 URLs in urls.test.txt, in the same format as urls.build.txt (i.e., each label/URL pair is separated by a tab).
Your output file from build_test_LM.py (e.g., urls.predict.txt): contains your predictions in the same format as urls.build.txt.
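
The label/URL line format described above can be read and written as in the following sketch. The helper names parse_labelled_line and format_prediction are hypothetical, and the code assumes Python 3 and exactly one tab between the label and the URL.

```python
def parse_labelled_line(line):
    """Split a "label<TAB>URL" line into a (label, url) tuple."""
    label, url = line.rstrip("\n").split("\t", 1)
    return label, url


def format_prediction(label, url):
    """Format one prediction in the same tab-separated layout."""
    return f"{label}\t{url}\n"
```

A line written by format_prediction can be read back by parse_labelled_line, so the output file stays in the same format as urls.build.txt.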

Essay questions:

You are also asked to answer the following essay questions. These are to test your understanding of the lecture materials. Note that these are open-ended questions and do not have gold standard answers. A paragraph or two is usually sufficient for each question. You may receive a small amount of extra credit if you can support your answers with experimental results.

In the homework assignment, we are using character-based ngrams, i.e., the gram units are characters. Do you expect token-based ngram models to perform better? If you were to implement a token-based ngram system, how would you tokenize a URL such as http://www.cnn.com/US/9806/22/hot.hot.hot/video.html?
What do you think will happen if we provided more data for each category for you to build the language models? What if we only provided more data for Arts?
What do you think will happen if you strip out punctuation and/or numbers? What about converting upper case characters to lower case?
We use five-gram models in this homework assignment. What do you think will happen if we varied the ngram size, such as using unigrams, bigrams and trigrams?

Submission Formatting

For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:

A plain text documentation file README-<matric number>.txt (e.g., README-U000000X.txt): this is a text only file that describes any information you want me to know about your submission. You should not include any identifiable information about your assignment (your name, phone number, etc.) except your matric number and email (we need the email to contact you about your grade, please use your u*******@nus.edu.sg address, not your email alias). This is to help you get an objective grade in your assignment, as we won't associate matric numbers with student names.
Any source code (in particular, build_test_LM.py for this homework assignment): We will be reading your code, so please do us a favor and format it nicely.
A plain text file essay-<matric number>.txt that contains your answers to the essay questions.

These files will need to be suitably zipped in a single file called submission-<matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called submission-<matric number>. Upload the resulting zip file to the IVLE workbin by the due date: 2011 Feb 7, 11:59:59 pm SGT. There absolutely will be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.

Grading Guidelines

The grading criteria for the assignment are tentatively:

40% Correctness of your code
15% Documentation
25% Evaluation results: I will test your program against 1000 new URLs.
20% Essay questions

Disclaimer: percentage weights may vary without prior notice.

Miscellaneous Links:

MeURLin: URL-based classification of web pages http://wing.comp.nus.edu.sg/meurlin/
Papers:

Baykan et al. (2008), Web Page Language Identification Based on URLs
Kan and Nguyen Thi (2005), Fast webpage classification using URL features

Ziheng Lin <linzihen@comp.nus.edu.sg> Thu Jan 20 12:18:08 2011 | Version: 1.0 | Last modified: Thu Mar 31 23:09:58 2011

CS 3245 » Homework 1 - URL classification