Homework 2 - Authorship Detection


As per our lecture materials on authorship detection, you will be creating machine learning features to be used with a standard machine learner to try to guess the identity of an author. We will be using book reviews from Amazon.com as the data for attribution.

Details

You are to come up with as many features as you can think of for computing the authorship of the files. "Classic" text categorization usually works to compute the subject category of a work. Here, because many of the book reviewers examine similar subjects and write reviews for a wide range of books, standard techniques will not fare as well. You should use features that you come up with, as well as any additional features you can think of. You can code any feature you like, but you will be limited to 30 real-valued features in total. You do not have to use all 30 features if you do not wish to.

In the workbin, you will find a training file containing reviews of books and other materials from 2 of Amazon's top customer reviewers (for the .com website; Amazon keeps different reviews in different countries, e.g., UK). It is organized as 1 review per line, so the lines are very long. Each line gives the review followed by a tab (\t) and then the author ID (either +1 or -1). There are 100 examples per author in the training section.
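The line layout described above can be read with a few lines of Python. This is only a sketch: the function names and the UTF-8 encoding are assumptions, not part of the assignment, and the path you pass in is whatever the training file is called in the workbin.

```python
def parse_training_line(line):
    """Split one training line into (review_text, author_label)."""
    # The author ID follows the last tab, so split from the right in case
    # the review text itself happens to contain tab characters.
    text, label = line.rstrip("\r\n").rsplit("\t", 1)
    return text, int(label)  # int() accepts both "+1" and "-1"


def load_training_file(path):
    """Read every non-empty line of a training file into a list of pairs."""
    # The encoding is an assumption; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [parse_training_line(line) for line in handle if line.strip()]
```

With the data described above, the returned list should hold 200 pairs, 100 per author.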

In the test.txt file you will find a list of reviews, again, one per line, but without the author ID given.

We are going to use the SVM light package authored by Thorsten Joachims as the machine learning framework. You should familiarize yourself with how to apply SVM to your dataset. SVM has a very simple format for vectors used to train and test its hypothesis. I should have demonstrated this during class. Be aware that training an SVM can be quite time-intensive, even on a small training set.
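As a reminder of the vector format, here is a minimal sketch that turns one feature list into one line of a training file. It assumes the usual SVM light layout of a target value followed by space-separated index:value pairs, with indices starting at 1 and listed in increasing order; please check the SVM light web site for the authoritative description. The function name is hypothetical, and the sketch covers labeled training lines only.

```python
def format_svmlight_line(label, features):
    """Render one labeled example as '<label> 1:<v1> 2:<v2> ...'."""
    if len(features) > 30:
        raise ValueError("the assignment allows at most 30 features")
    # Feature indices are 1-based and emitted in increasing order.
    pairs = [f"{index}:{value:.6g}" for index, value in enumerate(features, start=1)]
    # The label is written with an explicit sign, e.g. +1 or -1.
    return " ".join([f"{label:+d}"] + pairs)
```

For example, `format_svmlight_line(1, [120.0, 0.25])` gives `+1 1:120 2:0.25`.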

What to turn in

You will upload a file X.zip (where X is your matric ID) by the due date, consisting of the following four files:

A summary file in plain text (filename: X.sum, where X is your matric ID), giving your matric number (as the only form of ID) and the percentage precision and recall for each class, under leave-one-out validation. You should inventory the features used in your assignment submission and briefly explain (1 sentence to 1 paragraph) each feature, if non-obvious. You can also include any notes about the development of your submission if you'd like.
Your training file (X.train), consisting of the features for each review, followed by the author ID. This should be in a format that can be passed directly to the machine learner for training. Remember, you may only use up to a maximum of 30 features.
Your testing file (X.test), consisting of the same features as above, but without the class labels (as you don't know them).
A model file (X.model) that was induced by running svm_learn (the SVM light training program) on your training file with defaults (i.e., no optional arguments).
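The per-class precision and recall asked for in the summary file can be computed as in the sketch below. It assumes you already have two parallel lists: the true author IDs and the IDs predicted for each review while that review was held out of training. How you obtain the held-out predictions is up to you; the function and argument names are only illustrative.

```python
def precision_recall(gold, predicted, target):
    """Return (precision, recall) in percent for the class `target`."""
    # Reviews of class `target` that were also predicted as `target`.
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == target and p == target)
    predicted_pos = sum(1 for p in predicted if p == target)
    actual_pos = sum(1 for g in gold if g == target)
    # Report 0.0 rather than dividing by zero when a class is never predicted or absent.
    precision = 100.0 * true_pos / predicted_pos if predicted_pos else 0.0
    recall = 100.0 * true_pos / actual_pos if actual_pos else 0.0
    return precision, recall
```

Call it once with `target=1` and once with `target=-1` to get the figures for both authors.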

Updated on Wed Oct 15 15:25:12 GMT-8 2003. Please use a ZIP (not RAR or TAR) utility to construct your submission. Do not include a directory in the submission to extract to (e.g., unzipping X.zip should give files X.sum, not X/X.sum or submission/X.sum). Please use all capital letters when writing your matric number (matric numbers should start with HT or HD for all students in this class). Your cooperation with the submission format will allow me to grade the assignment in a timely manner.
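If you prefer to build the archive from a script, the sketch below is one way to get a flat ZIP with no enclosing directory, using Python's standard zipfile module. The function name and the `source_dir` argument are assumptions for illustration; any ZIP utility that produces the same layout is fine.

```python
import os
import zipfile


def build_submission(matric_id, source_dir="."):
    """Write <MATRIC>.zip holding the four files at the archive root."""
    matric_id = matric_id.upper()  # matric numbers are written in capitals
    archive_path = os.path.join(source_dir, f"{matric_id}.zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for extension in ("sum", "train", "test", "model"):
            name = f"{matric_id}.{extension}"
            # arcname drops the directory prefix so the files unzip flat.
            archive.write(os.path.join(source_dir, name), arcname=name)
    return archive_path
```

Listing the resulting archive should show exactly X.sum, X.train, X.test and X.model, with no directory component in front of any name.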

Grading scheme

Your grade will be 75% determined by performance as judged by accuracy, and 25% determined by the summary file you turn in and the smoothness of the evaluation of your submission. For example, if your files cause problems with the machine learner (incorrect format, etc.), this will result in penalties within that 25%.

Of the remaining 75%, 45% of the grade will be determined by your training set performance, and 30% by the testing performance. For both parts, the grade given will be largely determined by how your system performs against your peers' systems. I will also have a reference baseline system to compare your systems against; this will constitute the remaining portion of the grade.

Please note that it is quite trivial to get a working assignment that will get you most of the grade. Even a single-feature classifier using review-length-in-words as its only feature would be awarded at least 40 marks (the 25% plus 15% for worse-than-baseline performance). I recommend that you make an effort to complete a baseline working system within the first week of the assignment.
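The review-length-in-words feature mentioned above could be as simple as the sketch below. Counting whitespace-separated tokens is one assumption about what a "word" is; you may prefer a different tokenization.

```python
def review_length_in_words(review_text):
    """Count the whitespace-separated tokens in one review."""
    # str.split() with no arguments splits on any run of whitespace,
    # so punctuation stays attached to the neighbouring word.
    return len(review_text.split())
```

For example, `review_length_in_words("A fine, fine book.")` returns 4.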

Due date and late policy

There is no late policy for this homework. Submissions must be uploaded by 11:59:59 pm on the due date. Any lateness will result in 0 marks being awarded for the whole assignment.

References

SVM light web site.