Data

WSJ Preposition Senses

This data set contains preposition word senses for prepositional phrases in the Wall Street Journal (WSJ) section of the Penn Treebank. The data was used in the experiments of Dahlmeier et al. (2009).

Data available for download.

NUS Corpus of Learner English (NUCLE)

The NUS Corpus of Learner English (NUCLE) was collected in a collaborative project between the National University of Singapore (NUS) Natural Language Processing (NLP) Group, led by Prof. Hwee Tou Ng, and the NUS Centre for English Language Communication (CELC), led by Prof. Siew Mei Wu. The work was carried out as part of the PhD thesis research of Daniel Dahlmeier at the NUS NLP Group.

The corpus consists of about 1,400 essays written by university students at the National University of Singapore on a wide range of topics, such as environmental pollution and healthcare. It contains over one million words, fully annotated with error tags and corrections. All annotations were performed by professional English instructors at the NUS CELC.

The corpus is distributed under the standard NUS licensing agreement and can be obtained through the NUCLE (Release 3.3) Data License Agreement Form.

NUS Social Media Text Normalization and Translation Corpus

The corpus was created for social media text normalization and translation. It was built by randomly selecting 2,000 messages from the NUS English SMS corpus. The messages were first normalized into formal English and then translated into formal Chinese.

Corpus available for download.

Ten Sets of Multiply Annotated Essays for Grammatical Error Correction

Ten native speakers were each asked to correct 50 essays (~600 words per essay) written by non-native English speakers for grammatical correctness. Each edit was also classified according to the error classification scheme used in the CoNLL-2014 shared task. This corpus was used in the ACL 2015 paper by Christopher Bryant and Hwee Tou Ng, titled "How Far are We from Fully Automatic High Quality Grammatical Error Correction?"

Corpus available for download.

One Million Sense-Tagged Instances for Word Sense Disambiguation and Induction

Corpora and models available for download (based on different versions of the WordNet sense inventory):

WordNet 3.0 [data] [models]
WordNet 2.1 [data] [models]
WordNet 1.7.1 [data] [models]

To replicate the results of Zhong and Ng (2010), download these models instead.

NIST Chinese-to-English Translation Results

Published results for NIST Chinese-to-English translation from major publication venues can be viewed here.

NUS Natural Language Processing Group