Index of /~qiul/NUSScenarioCorpus

[ICO]  Name                      Last modified      Size  Description

[DIR]  Parent Directory                             -
[   ]  JavaAPI.tar               16-Sep-2008 12:55  210K
[DIR]  JavaAPI/                  16-Sep-2008 22:09  -
[   ]  NUSCorpus.snapshot        16-Sep-2008 10:47  6.3K
[   ]  PredicateArgumentTup..>   16-Sep-2008 22:09  5.1M
[DIR]  PredicateArgumentTup..>   16-Sep-2008 21:42  -
[DIR]  data/                     16-Sep-2008 13:58  -
[   ]  lib.tar                   16-Sep-2008 22:09  1.1M
[DIR]  lib/                      06-Feb-2007 17:23  -
[DIR]  sample/                   08-Feb-2007 14:17  -

NUS Scenario Corpus by Wing@NUS

NUS Scenario Corpus


This is a brief description of the NUS Scenario Corpus, which is free for scientific research only. To use this corpus for commercial purposes, please first contact me for licensing details.

      In Topic Detection and Tracking (TDT) research, an event is defined as something that occurs at a specific place and time and is associated with some specific actions (TDT, 1999), and a set of similar events can be regarded as instances of the same scenario. For example, the Air France Flight 4590 (a Concorde jet) accident in 2000 is an event as well as an instance of the scenario "aviation disaster".

     The corpus contains news articles for more than 15 scenarios. The goal was to collect 10 events for each scenario, with each event represented by at least 5 articles. The articles were taken from a controlled list of websites that 1) carry true online versions of articles provided by reputable news agencies, and 2) provide free news article archives extending back at least 5 years. A snapshot of the current corpus can be found in NUSCorpus.snapshot.

     Unfortunately, not all scenarios have the targeted number of articles. Data from two other sources was collected to supplement the collection: articles from websites that were not in the controlled list (indicated by a *), and articles from the commercial service LexisNexis. These news articles are not included in this downloadable corpus due to copyright issues. If you subscribe to the LexisNexis service, you may contact me (with some proof of current subscription) to receive these articles as well.
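     The collection target described above (10 events per scenario, at least 5 articles per event) can be checked mechanically. The following is only a minimal sketch, not part of the distributed JavaAPI: it assumes each article is identified by a relative path of the form <scenario>/<event>/<article file>, which is a hypothetical layout chosen for illustration, and it is written for a current JDK. The class and method names are likewise made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CoverageCheck {
    // Targets taken from the corpus description.
    static final int TARGET_EVENTS = 10;
    static final int MIN_ARTICLES_PER_EVENT = 5;

    // ASSUMPTION: each path looks like "<scenario>/<event>/<article file>".
    // Returns scenario -> (event -> number of articles).
    static Map<String, Map<String, Integer>> countArticles(List<String> relativePaths) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String path : relativePaths) {
            String[] parts = path.split("/");
            if (parts.length < 3) {
                continue; // not an article below a scenario/event pair
            }
            counts.computeIfAbsent(parts[0], k -> new TreeMap<>())
                  .merge(parts[1], 1, Integer::sum);
        }
        return counts;
    }

    // Lists the scenarios that have fewer than TARGET_EVENTS events
    // with at least MIN_ARTICLES_PER_EVENT articles each.
    static List<String> scenariosBelowTarget(Map<String, Map<String, Integer>> counts) {
        List<String> below = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> scenario : counts.entrySet()) {
            int wellCoveredEvents = 0;
            for (int articles : scenario.getValue().values()) {
                if (articles >= MIN_ARTICLES_PER_EVENT) {
                    wellCoveredEvents++;
                }
            }
            if (wellCoveredEvents < TARGET_EVENTS) {
                below.add(scenario.getKey());
            }
        }
        return below;
    }
}
```

     Scenarios reported by scenariosBelowTarget are the ones for which the downloadable corpus alone falls short of the target, for example because some of their articles came from the supplementary sources.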


Download: To download everything in one shot, click here (~88M).

There are five directories in this distribution:

  1. data:
  2. PredicateArgumentTupleAnnotation:
  3. JavaAPI: Java code that helps to extract different parts of an article. (Note that it doesn't work well with JDK1.5 and has a problem when processing a few html files even with JDK1.4.)
  4. lib: The Java classes needed to preprocess NUSScenarioCorpusFiles.
  5. sample: A single, sample article in three formats: with manual labels, plain text, and xml.

     Note: The scenario name and event name of each news article are determined by the path of the file, so please keep the directory tree structure unchanged.
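
     To illustrate why the directory tree matters, here is a minimal sketch of reading the two labels off a file path. It is not the bundled JavaAPI code. It assumes a hypothetical layout of <corpus root>/<scenario>/<event>/<article file>; the actual nesting in this distribution may differ. The names ArticleLabels and scenarioAndEvent are invented for this example, and the code targets a current JDK.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ArticleLabels {
    // ASSUMPTION: articles live at <corpusRoot>/<scenario>/<event>/<article file>.
    // Returns a two-element array: { scenario name, event name }.
    static String[] scenarioAndEvent(Path corpusRoot, Path articleFile) {
        Path root = corpusRoot.toAbsolutePath().normalize();
        Path file = articleFile.toAbsolutePath().normalize();
        if (!file.startsWith(root)) {
            throw new IllegalArgumentException(file + " is not under " + root);
        }
        Path relative = root.relativize(file);
        if (relative.getNameCount() < 3) {
            throw new IllegalArgumentException(
                "expected <scenario>/<event>/<article file>, got " + relative);
        }
        // The first two path components carry the labels.
        return new String[] {
            relative.getName(0).toString(),
            relative.getName(1).toString()
        };
    }

    public static void main(String[] args) {
        // Placeholder directory and file names, for illustration only.
        String[] labels = scenarioAndEvent(
            Paths.get("data"),
            Paths.get("data/some_scenario/some_event/article.html"));
        System.out.println("scenario=" + labels[0] + ", event=" + labels[1]);
    }
}
```

     Under this assumption, moving or renaming a directory silently changes the labels of every article beneath it, which is why the tree should be left as distributed.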