School of Computing

Information Retrieval

NUS SoC, 2013/2014, Semester II Video Conferencing Room (COM1 02 VCRm) / Fridays 11:00-13:00

Last updated: Wednesday, April 23, 2014 06:25:14 PM SGT Corrected grading for code part.

Homework #4 » Patent Retrieval Mini Project


In our final Homework 4, we will hold an information retrieval contest with real-world documents and queries: the problem of patent retrieval. As described in lecture, patent retrieval is a case where recall is particularly important, since a search must not miss any relevant documents (a requirement common to search engines working in the area of law).


Commonalities with Homeworks #2 and #3

The indexing and query commands will use an (almost) identical input format to Homeworks #2 and #3, so that you need not modify any of your code to deal with command line processing. To recap:

Indexing: $ python -i directory-of-documents -d dictionary-file -p postings-file

Searching: $ python -d dictionary-file -p postings-file -q query-file -o output-file-of-results

The difference from Homeworks #2 and #3 is that the query-file specifies a single query, and not a list of queries.
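
As a reminder of how the flags above might be handled, here is a minimal sketch using Python's standard argparse module. The function names and the attribute names (input_dir, query_file, and so on) are my own illustrative choices, not part of the assignment; your existing command line code from Homeworks #2 and #3 should work just as well.

```python
import argparse


def parse_index_args(argv):
    """Parse the indexing flags: -i, -d and -p."""
    parser = argparse.ArgumentParser(description="Build the index (sketch).")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="directory of documents")
    parser.add_argument("-d", dest="dictionary_file", required=True,
                        help="dictionary file to write")
    parser.add_argument("-p", dest="postings_file", required=True,
                        help="postings file to write")
    return parser.parse_args(argv)


def parse_search_args(argv):
    """Parse the searching flags: -d, -p, -q and -o."""
    parser = argparse.ArgumentParser(description="Run one query (sketch).")
    parser.add_argument("-d", dest="dictionary_file", required=True,
                        help="dictionary file to read")
    parser.add_argument("-p", dest="postings_file", required=True,
                        help="postings file to read")
    parser.add_argument("-q", dest="query_file", required=True,
                        help="file holding a single query")
    parser.add_argument("-o", dest="output_file", required=True,
                        help="file to write the results to")
    return parser.parse_args(argv)


# Example invocations; the file and directory names are placeholders.
index_args = parse_index_args(
    ["-i", "patents", "-d", "dictionary.txt", "-p", "postings.txt"])
search_args = parse_search_args(
    ["-d", "dictionary.txt", "-p", "postings.txt",
     "-q", "query.xml", "-o", "results.txt"])
```

In a real script you would pass sys.argv[1:] instead of a hard-coded list.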

Unlike the previous homeworks, however, we will be using a patent corpus provided by PatSnap, a company with origins partially from NUS. (Disclaimer: I have no interest in or affiliation with PatSnap, although one alumnus from my group is working there.)

Problem Statement: Given 1) a patent corpus (to be posted to IVLE) as the candidate document collection to retrieve from, and 2) a set of free text information needs, return the list of the IDs of the relevant documents for each need, in sorted order of relevance. Your search engine should return the entire set of relevant documents (don't threshold to the top K relevant documents; as described, recall is important in patent search).
Your system should return the results for the query query-file on a single line. Separate the IDs of different documents by a single space ' '. Return an empty line if no patents are relevant.
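
The output format is simple enough that a few lines cover it. Here is a minimal sketch, assuming your ranking code produces a Python list of patent IDs that is already sorted by relevance; the function names are hypothetical.

```python
def format_results(ranked_ids):
    """Join the ranked patent IDs with single spaces on one line.

    An empty list yields an empty line, as required when no patents
    are relevant.
    """
    return " ".join(ranked_ids) + "\n"


def write_results(output_path, ranked_ids):
    """Write the single result line to the output file."""
    with open(output_path, "w") as handle:
        handle.write(format_results(ranked_ids))


# Example with two IDs taken from the sample patent's family members.
line = format_results(["EP2067524A1", "US20090201761A1"])
empty = format_results([])
```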

For this assignment, no holds are barred. You may use any type of preprocessing, post-processing, indexing and querying process you wish. You may wish to incorporate or use other python libraries or external resources; however, for python libraries, you'll have to include them with your submission properly -- I will not install new libraries to grade your submissions.

PatSnap, the company we are working with for this contest, is particularly interested in good IR systems for this problem and thus is cooperating with us for this homework assignment. They have provided the corpus (the patents are in the public domain, as is most government information) and relevance judgments for a small number of queries. Teams that do well may be approached by PatSnap to see whether you'd like to work further on your project to help them for pay. Note: Your README will be read by both Min and the PatSnap team, but your code will not be given to their team to use; if they are interested in what you have done, you may opt to license your work to them.

More detail on the inputs: Information Needs and Patents

The patents and the information needs have a particular structure in this task. Let's start with the information needs.

Information Need: We call the inputs information needs, as they describe the relevant documents at a semantic level, and not (necessarily) at the shallow, language level that the queries given to the search engine will have to execute. The needs will be given in a format similar to TREC queries. They will have a title field, which is a short noun phrase or sentence describing the information need. A description field will give more detail on what the relevant documents may or may not contain (it will always start with "relevant documents will describe"). Here is an example information need (also provided in the workbin):

<?xml version="1.0" ?>
  Washers that clean laundry with bubbles
  Relevant documents will describe washing technologies that clean or
  induce using bubbles, foam, by means of vacuuming, swirling, inducing
  flow or other mechanisms.

In PatSnap's own system, searchers need to transform these needs into actual search queries. For the above need, a patent engineer transformed it into the following Boolean query:

((bubble AND fine) OR microbubble)
This requires some human knowledge from the patent engineer, as "fine" and "microbubble" don't appear anywhere in the description or title of the query. This is shown for illustrative purposes only; please don't interpret it as an actual step you'll need to do for your assignment. Note that this transformation 1) was done to deal with the Boolean nature of their search engine, and 2) may not reflect the best method to transform the need into a query.
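
To turn an information need file into something your engine can use, you might read the title and description fields as below. This is only a sketch under loud assumptions: the element names `title` and `description` and the `query` root element are my guesses at the file's markup, so check the actual file in the workbin before relying on them. Stripping the fixed "relevant documents will describe" prefix is likewise just one possible design choice.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the element names below are hypothetical; verify them
# against the real information need files.
SAMPLE_NEED = """<?xml version="1.0" ?>
<query>
  <title>Washers that clean laundry with bubbles</title>
  <description>
  Relevant documents will describe washing technologies that clean or
  induce using bubbles, foam, by means of vacuuming, swirling, inducing
  flow or other mechanisms.
  </description>
</query>"""

FIXED_PREFIX = "relevant documents will describe"


def parse_need(xml_text):
    """Return (title, description) with whitespace normalised."""
    root = ET.fromstring(xml_text)
    title = " ".join(root.findtext("title", default="").split())
    description = " ".join(root.findtext("description", default="").split())
    # The description always starts with the same phrase, which carries
    # no information about the need itself, so drop it.
    if description.lower().startswith(FIXED_PREFIX):
        description = description[len(FIXED_PREFIX):].strip()
    return title, description


need_title, need_description = parse_need(SAMPLE_NEED)
```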


Patents are structured documents. For the purposes of our assignment, we are going to use an XML representation of a patent. Below is a document, ID EP2067524A1, which is relevant to the above query:

<?xml version="1.0" ?>
  <str name="Patent Number">EP2067524A1</str>
  <str name="Application Number">EP2007828700</str>
  <str name="Kind Code">A1</str>
  <str name="Abstract">
    There are provided a fluid injection nozzle, a fluid mixer, a microbubble generating apparatus, a vapor phase generating apparatus, a method of producing swirling flow, and a swirling flow producing apparatus that can be applied to any kind of fluid and can efficiently generate a swirling flow at high speed.
    The swirling flow producing apparatus includes a housing and a cylindrical member. The housing includes a cylindrical portion of which at least one end is opened, and a fluid introducing passage that is opened on an inner peripheral surface of the cylindrical portion. The cylindrical member is provided in the cylindrical portion of the housing. The cylindrical member includes a cylindrical portion of which at least one end in a direction corresponding to an opening direction of the cylindrical portion is opened, and holes formed in a peripheral wall of the cylindrical portion. A fluid introduced from the fluid introducing passage flows into the cylindrical portion of the cylindrical member through the holes so as to generate a swirling flow, and flows out from the housing and the cylindrical member.
  </str>
  <str name="Document Types">EP | EPA | DOCDB</str>
  <str name="Application Date">2007-09-28</str>
  <str name="Application Year">2007</str>
  <str name="Application(Year/Month)">2007-09</str>
  <str name="Publication Date">2009-06-10</str>
  <str name="Publication Year">2009</str>
  <str name="Publication(Year/Month)">2009-06</str>
  <str name="All IPC">B05B1/34 | B01F5/00</str>
  <str name="IPC Primary">B05B1/34</str>
  <str name="IPC Section">B</str>
  <str name="IPC Class">B05</str>
  <str name="IPC Subclass">B05B</str>
  <str name="IPC Group">B05B1</str>
  <str name="Family Members">KR1020090028835A | WO2008038763A1 | CN101505859A | US20090201761A1 | EP2067524A1</str>
  <str name="Family Member Count">5</str>
  <str name="Family Members Cited By Count">1</str>
  <str name="Other References">See references of WO  2008038763A1</str>
  <str name="Other References Count">1</str>
  <str name="Cited By Count">0</str>
  <str name="Priority Country">JP</str>
  <str name="Priority Number">2006264652</str>
  <str name="Priority Date">2006-09-28</str>
  <str name="Assignee(s)">NAKATA COATING CO., LTD.</str>
  <str name="1st Assignee">NAKATA COATING CO., LTD.</str>
  <str name="Number of Assignees">1</str>
  <str name="1st Assignee Address">82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa 240-0041, JP</str>
  <str name="Assignee(s) Address">82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa 240-0041, JP</str>
  <str name="Inventor(s)">MATSUNO, TAKEMI | NAKATA, AKIO</str>
  <str name="1st Inventor">MATSUNO, TAKEMI</str>
  <str name="Number of Inventors">2</str>
  <str name="1st Inventor Address">NAKATA, COATING, CO., LTD., 82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa, 240-0041, JP</str>
  <str name="Inventor(s) Address">NAKATA, COATING, CO., LTD., 82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa, 240-0041, JP | NAKATA, COATING, CO., LTD., 82, Higashikawashima-cho, Hodogaya-ku, Yokohama-shi, Kanagawa, 240-0041, JP</str>
  <str name="Agent/Attorney">HOFFMANN, ECKART</str>
  <str name="cited by within 3 years">0</str>
  <str name="cited by within 5 years">0</str>
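
Since every field is a `str` element with a `name` attribute, a patent can be loaded into a dictionary with the standard library's ElementTree. The sketch below assumes that the `str` elements sit inside a single root element (written here as `doc`, which is my assumption; the code works for any root name) and that values separated by " | " are lists. The sample is a shortened version of the patent above.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <doc> wrapper is an ASSUMPTION about the file.
SAMPLE_PATENT = """<?xml version="1.0" ?>
<doc>
  <str name="Patent Number">EP2067524A1</str>
  <str name="Abstract">
    The swirling flow producing apparatus includes a housing and a
    cylindrical member.
  </str>
  <str name="All IPC">B05B1/34 | B01F5/00</str>
  <str name="Inventor(s)">MATSUNO, TAKEMI | NAKATA, AKIO</str>
</doc>"""

# Fields treated as lists; this choice is illustrative, extend as needed.
MULTI_VALUED = {"All IPC", "Family Members", "Inventor(s)", "Assignee(s)"}


def parse_patent(xml_text):
    """Map each field name to its text (or to a list for multi-valued fields)."""
    root = ET.fromstring(xml_text)
    fields = {}
    for node in root.iter("str"):
        name = node.get("name")
        text = " ".join((node.text or "").split())  # collapse whitespace
        if name in MULTI_VALUED:
            fields[name] = [part.strip() for part in text.split("|")]
        else:
            fields[name] = text
    return fields


patent = parse_patent(SAMPLE_PATENT)
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.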

You will notice that there are a lot of fields in the patent. Not all of these fields are relevant to assessing a patent's relevance to the query (and thus you may not want to index them); they are included for the sake of completeness.

In particular, the IPC (International Patent Classification) is a useful piece of data that you may want to use to assess the relevance of a document. It is a hierarchical classification of the patent into an ontology. However, you may need to parse this information in some way to make it useful to your system.
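
One way to parse an IPC code is to split it into its hierarchy levels, mirroring the IPC Section, Class, Subclass and Group fields in the sample patent. The sketch below is an illustration only: the regular expression covers codes shaped like the ones in the sample (for example B05B1/34), and `shared_depth` is a hypothetical helper showing how two codes could be compared, not a prescribed similarity measure.

```python
import re

# Section letter, two-digit class, subclass letter, group, "/", subgroup.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])(\d+)/(\d+)$")


def ipc_levels(code):
    """Return the code's ancestors from section down to the full code."""
    code = code.strip()
    match = IPC_PATTERN.match(code)
    if match is None:
        return []  # unrecognised format: let the caller decide what to do
    section, cls, subclass, group, _subgroup = match.groups()
    return [
        section,                           # e.g. "B"
        section + cls,                     # e.g. "B05"
        section + cls + subclass,          # e.g. "B05B"
        section + cls + subclass + group,  # e.g. "B05B1"
        code,                              # e.g. "B05B1/34"
    ]


def shared_depth(code_a, code_b):
    """Count how many hierarchy levels two IPC codes have in common."""
    depth = 0
    for level_a, level_b in zip(ipc_levels(code_a), ipc_levels(code_b)):
        if level_a != level_b:
            break
        depth += 1
    return depth


levels = ipc_levels("B05B1/34")
```

Indexing every level as its own term would let a query match patents in the same subclass even when the full codes differ.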

Zones and Fields

As introduced in Week 8, Zones are free text areas, usually within a document, that hold some special significance. Fields are more akin to database columns (in a database, we would actually make them columns), in that they take on a specific value from some (possibly infinite) enumerated set of values.

Along with the standard notion of a document as an ordered set of words, handling either / both zones and fields is important for certain aspects of patent retrieval.
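
To make the distinction concrete, here is a minimal sketch that keeps two in-memory indexes: zone terms are tokenised and stored per (zone, term), while field values are stored whole per (field, value) for exact matching. Treating "Abstract" as a zone and "IPC Subclass" as a field is my illustrative choice, and the second document (DOC2) is invented for the example.

```python
from collections import defaultdict


def build_indexes(docs, zones, fields):
    """Build a zone index and a field index over parsed documents."""
    zone_index = defaultdict(set)   # (zone, term) -> set of doc IDs
    field_index = defaultdict(set)  # (field, value) -> set of doc IDs
    for doc_id, doc in docs.items():
        for zone in zones:
            # Zones are free text: tokenise and index each term.
            for term in doc.get(zone, "").lower().split():
                zone_index[(zone, term)].add(doc_id)
        for field in fields:
            # Fields hold one value from an enumerated set: index it whole.
            if field in doc:
                field_index[(field, doc[field])].add(doc_id)
    return zone_index, field_index


# Toy collection; DOC2 is a made-up document.
docs = {
    "EP2067524A1": {
        "Abstract": "a microbubble generating apparatus and a swirling flow",
        "IPC Subclass": "B05B",
    },
    "DOC2": {
        "Abstract": "a washing machine drum with a swirling flow",
        "IPC Subclass": "D06F",
    },
}

zone_index, field_index = build_indexes(docs, ["Abstract"], ["IPC Subclass"])

# Combine a zone match with a field restriction.
swirling = zone_index[("Abstract", "swirling")]
in_b05b = field_index[("IPC Subclass", "B05B")]
candidates = swirling & in_b05b
```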

Query Expansion

You might notice that many of the terms used in the patents themselves do not overlap with the query terms used. This is known as the anomalous state of knowledge (ASK) problem or vocabulary mismatch, in which the searcher may use terminology that doesn't fit the documents' expression of the same semantics. A simple way that you can deal with the problem is to utilize query expansion.

In this technique, we use a first round of retrieval on the query terms used by a searcher to find some sample documents. Assuming that these documents are relevant, we can extract some terms found in these documents, or use the entire documents themselves as queries, in a second round of retrieval. The idea is that the sample documents have terminology that matches the document corpus, overcoming the problem of vocabulary mismatch.
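
The two rounds can be sketched as follows. This is a toy illustration, not a recommended design: the scoring is plain summed term frequency, the number of sample documents and added terms are arbitrary parameters, and the three documents are invented. A real system would plug in its own ranking function and stop word handling.

```python
from collections import Counter


def rank(query_terms, docs):
    """Rank documents by the summed term frequency of the query terms."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def expand_query(query_terms, docs, top_docs=1, extra_terms=2):
    """Add the most frequent new terms from the top first-round documents."""
    first_round = rank(query_terms, docs)[:top_docs]
    pool = Counter()
    for doc_id in first_round:
        pool.update(term for term in docs[doc_id].lower().split()
                    if term not in query_terms)
    # Most frequent first; ties broken alphabetically.
    best = sorted(pool, key=lambda term: (-pool[term], term))[:extra_terms]
    return list(query_terms) + best


# Invented toy documents.
docs = {
    "d1": "bubbles foam washing microbubble microbubble generator",
    "d2": "microbubble generator nozzle swirling flow",
    "d3": "engine piston cylinder",
}

query = ["bubbles", "washing"]
first_results = rank(query, docs)        # only d1 matches the raw query
expanded = expand_query(query, docs)     # adds terms drawn from d1
second_results = rank(expanded, docs)    # d2 is now retrieved as well
```

Here the second round retrieves d2, which shares no terms with the original query, which is exactly the recall gain that query expansion aims for.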

What to turn in?

You are required to submit,, dictionary.txt, and postings.txt. Please do not include the patent corpus.

Submission Formatting

You are allowed to do this assignment individually or as a team of up to 4 students. There will be no difference in grading criteria if you do the assignment as a large team or individually. For the submission information below, simply replace any mention of a matric number with the matric numbers concatenated with a separating dash (e.g., A000000X-A000001Y-A000002Z). Please ensure you use the same identifier (matric numbers in the same order) in all places that require a matric number.

For us to grade this assignment in a timely manner, we need you to adhere strictly to the following submission guidelines. They will help me grade the assignment in an appropriate manner. You will be penalized if you do not follow these instructions. Your matric number in all of the following statements should not have any spaces and any letters should be in CAPITALS. You are to turn in the following files:

These files will need to be suitably zipped in a single file called <matric number>.zip. Please use a zip archive and not tar.gz, bzip, rar or cab files. Make sure when the archive unzips that all of the necessary files are found in a directory called <matric number>. Upload the resulting zip file to the IVLE workbin by the due date: Updated 11 April 2014, 11:59:59 pm SGT. There will absolutely be no extensions to the deadline of this assignment. Read the late policy if you're not sure about grade penalties for lateness.

Grading Guidelines

The grading criteria for the assignment are below. You should note that there are no essay questions for this assignment.