Release 2.3.1
24 May 2013

This README file describes the test data and scoring procedure for the
CoNLL-2013 Shared Task: Grammatical Error Correction.

The package is distributed freely with the following copyright
Copyright (C) 2013 Hwee Tou Ng, Joel Tetreault, Siew Mei Wu,
                   Yuanbin Wu, Christian Hadiwinoto

Any questions regarding the test data should be directed to
Hwee Tou Ng at: nght@comp.nus.edu.sg


1. Directory Structure and Contents
===================================

The top-level directory has four subdirectories, namely

- original/ : the test data with the original official annotations
              before adding alternatives contributed by the participants
- revised/  : the moderated participants' alternative annotations and
              the revised official annotation for the test data
- scripts/  : the scripts used to preprocess the test data inside the
              original/ and revised/ subdirectories
- m2scorer/ : the official scoring software for the shared task

Each of the original/ and revised/ subdirectories contains data/
subdirectory, which includes annotations for all the error types, and
data_5types/ subdirectory, which includes annotations for only the 5
types concerned in the shared task.


2. Data Format 
==============

The corpus is distributed in a simple SGML format.  All annotations
come in a "stand-off" format. The start position and end position of
an annotation are given by paragraph and character offsets.
Paragraphs are enclosed in <P>...</P> tags. Paragraphs and characters
are counted starting from zero. Each annotation includes the following
fields: the error category, the correction, and optionally a
comment. If the correction replaces the original text at the given
location, it should fix the grammatical error.

Example:

<DOC nid="2">
<TEXT>
<P>
In modern digital world, ...
</P>
<P>
Surveillance technology such as ...
</P>
...
</TEXT>
<ANNOTATION teacher_id="6">
<MISTAKE start_par="0" start_off="3" end_par="0" end_off="9">
<TYPE>ArtOrDet</TYPE>
<CORRECTION>the modern</CORRECTION>
</MISTAKE>
...
</ANNOTATION>
</DOC>

<DOC nid="3">
...

Below is a complete list of the error categories in the data/
subdirectory under the original/ and revised/ subdirectories:

ERROR TAG    ERROR CATEGORY
---------------------------
Vt	     Verb tense
Vm	     Verb modal
V0	     Missing verb
Vform	     Verb form
SVA	     Subject-verb-agreement
ArtOrDet     Article or Determiner
Nn	     Noun number
Npos	     Noun possesive
Pform	     Pronoun form
Pref	     Pronoun reference
Prep         Preposition
Wci	     Wrong collocation/idiom
Wa	     Acronyms
Wform	     Word form
Wtone	     Tone
Srun	     Runons, comma splice
Smod	     Dangling modifier
Spar	     Parallelism
Sfrag	     Fragment
Ssub	     Subordinate clause
WOinc	     Incorrect sentence form
WOadv	     Adverb/adjective position
Trans	     Link word/phrases
Mec	     Punctuation, capitalization, spelling, typos
Rloc-	     Local redundancy
Cit	     Citation
Others	     Other errors
Um	     Unclear meaning (cannot be corrected)

Below is a list of the error categories in the data_5types/
subdirectory under the original/ and revised/ subdirectories:

ERROR TAG    ERROR CATEGORY
---------------------------
Vform	     Verb form
SVA	     Subject-verb-agreement
ArtOrDet     Article or Determiner
Nn	     Noun number
Prep         Preposition

The official annotation file contains all the default annotations to
make the whole text correct. Meanwhile, each of the alternative
annotation files contains only annotations for sentences that can be
corrected in a different way, i.e. sentences that have alternative
annotations. If according to an alternative, a sentence can remain
unchanged, a special tag "noop" is used for that particular sentence.


3. Updates included in version 2.1
==================================

The major change made in version 2.1 is to map the past error
categories Wcip and Rloc to Prep, Wci, ArtOrDet, and Rloc-.

In the original data, there is no explicit preposition error
category. Instead, preposition errors are part of the Wcip (Wrong
collocation/idiom/preposition) and Rloc (local redundancy) error
categories. In addition, redundant article or determiner errors are
part of the Rloc error category.


4. Updates included in version 2.2
==================================

- Fixed the bug on expanding an error annotation involving part of a
  token to the full token.

- Other miscellaneous corrections were made.


5. Updates included in version 2.3
==================================

- Fixed the bug involving tokenization of punctuation symbols in the
  correction string.

- Fixed the tokenization example in the README file of the M^2 scorer
  to reflect the real tokenization to be used and removed irrelevant
  codes from the scorer package.


6. Updates included in version 2.3.1
====================================

- Enhanced the capability of the M^2 scorer to be able to handle
  multiple alternative sets of gold edits.

- Fixed the preprocess.py script to keep the annotation span minimal,
  i.e. by excluding the beginning and/or end tokens that co-exist in
  the original and correction string.


