Task #11: English Lexical Sample Task via English-Chinese Parallel Text
Updated on Feb 25, 2007

Important notes to participating teams in Task #11 of SemEval-2007:
1. If you are participating in the track that uses the LDC data, you must
first contact the task organizers by sending an email to chanys@comp.nus.edu.sg
with proof that you hold a valid licence to access the corpus LDC2005T10
(Chinese English News Magazine Parallel Text). You will then be granted a
password to access this data.
2. If you are participating only in the web data track, then there is no
need for you to contact the task organizers.
3. Regardless of whether you are participating in one or both tracks, the
time allowed for submitting your answers after downloading the data is the
same. If you are participating in the track that uses the LDC data, you
should therefore complete step 1 above and obtain the password BEFORE
clicking the download data button of this task: the password provided by the
task organizers is needed to access the LDC portion of the training and test
data, and the clock starts ticking as soon as the download data button is
clicked.
Get the Trial-data for this task.
Organizers
Hwee Tou Ng and Yee Seng Chan
National University of Singapore
Summary
We propose an English lexical sample task for word sense
disambiguation (WSD), where the sense-annotated examples are
(semi)-automatically gathered from word-aligned English-Chinese parallel
texts. After appropriate Chinese translations are assigned to each sense of an
English word, the English side of the parallel texts can serve as the
training data, since each occurrence of that word is considered to have been
disambiguated and "sense-tagged" by its Chinese translation.
For more details, please refer to the full description
for this task and the references given.
Full Description
First, English-Chinese parallel texts are automatically
word-aligned. Then the correct Chinese translations corresponding to the
different WordNet 1.7.1 senses of an English word are manually selected.
Finally, the English half of the parallel texts (each occurrence of the
ambiguous English word together with its 3-sentence context) is used as the
training and test material to set up an English lexical sample task.
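The three steps above can be illustrated with a minimal sketch. It assumes the word alignment has already been computed and shows only how the manually selected translations turn aligned occurrences into sense-tagged examples. The names AlignedOccurrence, TRANSLATION_TO_SENSE, and the sense labels are hypothetical, chosen for illustration; they are not the task's data format and not actual WordNet 1.7.1 sense keys.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class AlignedOccurrence(NamedTuple):
    """One occurrence of a target English word in word-aligned parallel text."""
    english_word: str
    chinese_translation: str  # Chinese word aligned to the English word
    context: str              # English context around the occurrence


# Hypothetical, manually selected mapping for illustration only:
# English word -> {Chinese translation: sense label}.
TRANSLATION_TO_SENSE: Dict[str, Dict[str, str]] = {
    "channel": {
        "频道": "channel/television",
        "海峡": "channel/waterway",
    },
}


def tag_occurrence(occ: AlignedOccurrence) -> Optional[str]:
    """Return the sense implied by the aligned translation, or None if the
    translation was not assigned to any sense of the word."""
    return TRANSLATION_TO_SENSE.get(occ.english_word, {}).get(occ.chinese_translation)


def build_examples(occurrences: List[AlignedOccurrence]) -> List[Tuple[str, str]]:
    """Keep only occurrences whose translation maps to a sense, and pair the
    English context with that sense."""
    examples = []
    for occ in occurrences:
        sense = tag_occurrence(occ)
        if sense is not None:
            examples.append((occ.context, sense))
    return examples
```

Occurrences whose aligned translation was not assigned to any sense are simply dropped in this sketch; how the task data treats such cases is not stated here.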
Since more than one English word sense may be translated
by the same Chinese word, two or more English senses s1, s2, ..., sk may be
collapsed into one sense in such cases. This gives rise to a lumped sense
(coarser-grained) evaluation.
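A small sketch of how such lumping might be computed is given below. It assumes that senses sharing any Chinese translation are merged, and that merging is transitive (if s1 shares a translation with s2, and s2 with s3, all three are lumped). The transitivity and the function name lump_senses are assumptions of this sketch, not something specified by the task.

```python
from collections import defaultdict
from typing import Dict, List, Set


def lump_senses(sense_translations: Dict[str, Set[str]]) -> List[List[str]]:
    """Group senses that share at least one translation.

    sense_translations maps each sense label to the set of Chinese
    translations assigned to it. Returns a sorted list of lumped senses,
    each given as a sorted list of the original sense labels.
    """
    # Union-find structure: every sense starts as its own group.
    parent = {sense: sense for sense in sense_translations}

    def find(sense: str) -> str:
        # Follow parent links to the group representative, shortening
        # the path along the way.
        while parent[sense] != sense:
            parent[sense] = parent[parent[sense]]
            sense = parent[sense]
        return sense

    # Remember the first sense seen for each translation; any later sense
    # with the same translation is merged into that sense's group.
    first_sense_for: Dict[str, str] = {}
    for sense, translations in sense_translations.items():
        for translation in translations:
            if translation in first_sense_for:
                parent[find(sense)] = find(first_sense_for[translation])
            else:
                first_sense_for[translation] = sense

    groups = defaultdict(set)
    for sense in sense_translations:
        groups[find(sense)].add(sense)
    return sorted(sorted(group) for group in groups.values())
```

For example, with senses s1 and s2 both assigned the same translation and s3 assigned a different one, the result contains two lumped senses: one covering s1 and s2, and one covering s3 alone.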
Our past work found that acquiring training examples in this way can yield
sense-tagged data of high quality (at least as good as the sense-tagged data
for nouns collected in the Senseval3 English lexical sample task).
This proposed task is thus similar to the multilingual
lexical sample task in Senseval3, except that the training and test examples
are collected without manually annotating each individual ambiguous word
occurrence.
Datasets and Formats
We have two tracks for this task, each track using a
different corpus. The first corpus is the following English-Chinese
parallel corpus available from the Linguistic Data Consortium (LDC):
LDC2005T10 Chinese English News Magazine Parallel Text
It will be used for the evaluation of 50 English words
(25 nouns and 25 adjectives). Participants taking part in this track will
need to have access to the above LDC corpus in order to access the training
and test material in this track. Institutions that are LDC members can
obtain the corpus by paying US$150. Institutions that are non-LDC members
can obtain the corpus by paying US$2,000.
Since not all interested participants may have access to
the above LDC corpus, the second track of this task will make use of
English-Chinese documents gathered from the URL pairs given by the
STRAND Bilingual Databases. STRAND is a system that acquires
document pairs in parallel translation automatically from the Web. We will
be using this corpus for the evaluation of 40 English words (20 nouns and 20
adjectives).
Participants in this task can choose to participate in
one or both tracks.
Evaluation
The scorer will be the standard Senseval scorer.
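For orientation, the sketch below computes precision and recall in the way they are usually defined for lexical sample scoring: precision is the fraction of attempted instances answered correctly, and recall is the fraction of all test instances answered correctly. It is a simplified illustration, not the Senseval scorer itself; it assumes exactly one gold sense and at most one predicted sense per instance, and the function name and the dictionary-based input are hypothetical.

```python
from typing import Dict, Tuple


def precision_recall(gold: Dict[str, str], predicted: Dict[str, str]) -> Tuple[float, float]:
    """Score single-answer predictions against single-answer gold labels.

    gold maps every test instance id to its correct sense.
    predicted maps instance ids to the system's answer; instances the
    system skipped are simply absent.
    """
    # Only instances that exist in the gold standard count as attempted.
    attempted = [inst for inst in predicted if inst in gold]
    correct = sum(1 for inst in attempted if predicted[inst] == gold[inst])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

A system that answers every test instance has equal precision and recall under this definition.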
Download area
This section will contain evaluation software, useful scripts,
complementary materials, baseline systems, etc. but not the datasets
proper. The datasets will be available at the main site for download.
Systems and Results
This section will be completed after the competition.
References
Chan, Yee Seng & Ng, Hwee Tou (2005). Scaling Up Word Sense Disambiguation via Parallel Texts. Proceedings of the 20th National Conference on Artificial Intelligence (AAAI 2005) (pp. 1037-1042). Pittsburgh, Pennsylvania, USA.
Ng, Hwee Tou, Wang, Bin & Chan, Yee Seng (2003). Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03) (pp. 455-462). Sapporo, Japan.
Resnik, Philip & Smith, Noah A. (2003). The Web as a Parallel Corpus. Computational Linguistics, Volume 29, Issue 3 (pp. 349-380).