IMS (It Makes Sense) version 0.9.2 ---- July 9th, 2010.

=========================
CONTENTS
1. introduction
2. files and directories
3. install
4. requirement
5. quick start
   a. train model
   b. test lexical-sample task file
   c. test fine-grained all-words task file
   d. test coarse-grained all-words task file
   e. test plain file
6. more details
7. license
8. versions and changes

========================
1. INTRODUCTION

This software is a supervised learning based Word Sense Disambiguation (WSD) system.
This release version was prepared by Zhong Zhi.

========================
2. FILES & DIRECTORIES

a. IMS_v0.9.2.tar.gz - source code package, containing source code, binary files,
   java documents, license, and some other basic files
     *.dtd            - xml Document Type Definition (DTD) files for SensEval/SemEval tasks
     amend_se3.bash   - convert SensEval-3 fine-grained all-words task test data to SensEval-2 format
     train_one.bash   - bash script to train a model
     test_one.bash    - bash script to test a lexical task file
     testCoarse.bash  - bash script to test a SensEval/SemEval coarse-grained task file
     testFine.bash    - bash script to test a SensEval/SemEval fine-grained task file
     testPlain.bash   - bash script to test a plain file
     scorer.bash      - bash script to evaluate result
     README.txt       - this file
     Makefile         - make file
     build.xml        - build.xml for ant
     ims.jar          - a jar package with binary files
     javadoc.xml      - file used to build javadoc
     bin              - java class directory
     doc              - javadoc directory
     src              - source code directory

b. lib.tar.gz - package containing all the necessary jar files and their related model files
     commons-logging.jar          - commons-logging package
     jdom.jar                     - jdom package
     jwnl.jar                     - JWNL package
     liblinear-1.33-with-deps.jar - liblinear package
     libsvm.jar                   - libsvm package
     maxent-2.4.0.jar             - opennlp maxent package
     opennlp-tools-1.3.0.jar      - opennlp toolkit package
     prop.xml                     - prop.xml file for JWNL
     tag.bin.gz                   - model for opennlp POS tagger
     tagdict.txt                  - tagdict for opennlp POS tagger
     trove.jar                    - trove package
     weka-3.2.3.jar               - weka package
     EnglishSD.bin.gz             - model for opennlp sentence boundary detector
     EnglishTok.bin.gz            - model for opennlp tokenizer
     dict                         - wordnet-1.7.1 dict directory

c. models.tar.gz - model and statistic files for all word types, with WordNet-1.7.1
   as the sense inventory
     The classifier is Liblinear.
     The training data of these models includes examples from parallel texts, SemCor
     and the DSO corpus.
     For more details about the training data set, please refer to
     http://acl.ldc.upenn.edu/W/W07/W07-2054.pdf

========================
3. INSTALL

a. Get the packages listed above.
b. Uncompress IMS_v0.9.2.tar.gz to one directory (take "ims" for example).
c. Uncompress lib.tar.gz into "ims". Now you have the files described in 2.a and
   the directory "lib" in the "ims" directory.
d. Uncompress models.tar.gz to some directory.

========================
4. REQUIREMENT

a. This software requires java 6 (JRE 1.6) or a higher version.
   Check the java version with the command "java -version".
b. Since we use JWNL to access WordNet, some settings in lib/prop.xml may need to
   be changed. In lib/prop.xml, there is a line that sets the dictionary path to
   "lib/dict". This value specifies the path of the WordNet dictionary. If you
   change or move the WordNet dictionary somewhere else, you should modify this
   parameter accordingly.
   (** we suggest you use the WordNet dictionary version we provide, because all
   the experiments we did were based on this version. **)

========================
5. QUICK START

(Assume models.tar.gz is uncompressed to directory "ims".)

a. To train one model:
   In a shell opened in this directory, type:
     ./train_one.bash train.xml train.key outputDir
       train.xml   SensEval-2 lexical sample format xml file.
                   (** the answer tags in the xml file are ignored; the senses of
                   instances should be specified in train.key. **)
       train.key   SensEval-2 lexical sample format key file.
       outputDir   directory to save the model file and statistic file.

b. To test one lexical sample file:
   In a shell opened in this directory, type:
     ./test_one.bash modelDir test.xml outputDir
       modelDir    directory containing the statistic file and model file.
       test.xml    SensEval-2 lexical sample format xml file.
       outputDir   directory to save the result file.

c. To test fine-grained all-words task:
   In a shell opened in this directory, type:
     ./testFine.bash modelDir test.xml outputFile lexeltFile index.sense
       modelDir    directory containing the statistic and model files.
       test.xml    test SensEval-2 fine-grained all-words format xml file.
       outputFile  the output result file.
       lexeltFile  file containing the lexelt information of the testing instances.
                   Each line represents one instance, e.g.:
                     inst_id lexelt_id(s)
       index.sense sense definition file of the sense inventory for the current task.
                   (For example, for the SemEval 2007 fine-grained all-words task,
                   please use the index.sense from WordNet 2.1.)

d. To test coarse-grained all-words task:
   In a shell opened in this directory, type:
     ./testCoarse.bash modelDir test.xml outputFile index.sense
       modelDir    directory containing the statistic and model files.
       test.xml    test coarse-grained all-words xml file.
       outputFile  the output result file.
       index.sense sense definition file of the current task.

e. To test a plain text:
   In a shell opened in this directory, type:
     ./testPlain.bash modelDir test.txt outputFile index.sense split(0/1) tokenized(0/1) pos(0/1) lemmatize(0/1) delimiter(default "/")
       modelDir    directory containing the statistic and model files
       test.txt    test file
       outputFile  the output result file
       index.sense the index.sense file in WordNet
       split       whether the document has been sentence split (1 for true)
       tokenized   whether the document has been tokenized (1 for true)
       pos         whether the document contains pos tag information (1 for true)
       lemmatize   whether the document contains lemmatization information (1 for true)
       delimiter   delimiter between token, pos, and lemma

========================
6. MORE DETAILS

Let us start from the sg.edu.nus.comp.nlp.ims.implement package.
Assume that you are now in the working directory "ims", and that the variable
CLASSPATH=lib/weka-3.2.3.jar:lib/jwnl.jar:lib/commons-logging.jar:lib/mxpost.jar:lib/trove.jar:lib/opennlp-tools-1.3.0.jar:lib/maxent-2.4.0.jar:lib/jdom.jar:lib/liblinear-1.33-with-deps.jar:ims.jar

In this package, you will find two classes:

a. sg.edu.nus.comp.nlp.ims.implement.CTrainModel -- class for training
   Usage: java -cp $CLASSPATH sg.edu.nus.comp.nlp.ims.implement.CTrainModel [options] train.xml train.key saveDir
   options:
     -i       class name of Instance Extractor (default sg.edu.nus.comp.nlp.ims.instance.CInstanceExtractor)
     -f       class name of Feature Extractor (default sg.edu.nus.comp.nlp.ims.feature.CFeatureExtractorCombination)
     -c       class name of Corpus (default sg.edu.nus.comp.nlp.ims.corpus.CLexicalCorpus)
     -t       class name of Trainer (default sg.edu.nus.comp.nlp.ims.classifiers.CLibLinearTrainer)
     -m       class name of Model Writer (default sg.edu.nus.comp.nlp.ims.io.CModelWriter)
     -s2      cut off for surrounding word (default 0)
     -c2      cut off for collocation (default 0)
     -p2      cut off for pos (default 0)
     -split   1/0 whether the corpus is sentence split (default 0)
     -ssm     path of sentence splitter model
     -token   1/0 whether the corpus is tokenized (default 0)
     -pos     1/0 whether the pos tag is provided in corpus (default 0)
     -ptm     path of pos tagger model
     -dict    path of dictionary for opennlp POS tagger (optional)
     -tagdict path of tagdict for opennlp POS tagger (optional)
     -lemma   1/0 whether the lemma is provided in the corpus (default 0)
     -prop    path of prop.xml for JWNL
     -type    type of train.xml
                directory: train all xml files under directory trainPath
                list: train all xml files listed in file trainPath
                file (default): train file trainPath

b. sg.edu.nus.comp.nlp.ims.implement.CTester -- class for testing
   Usage: java -cp $CLASSPATH sg.edu.nus.comp.nlp.ims.implement.CTester [options] testPath modelDir statisticDir saveDir
   options:
     -i         class name of Instance Extractor (default sg.edu.nus.comp.nlp.ims.instance.CInstanceExtractor)
     -f         class name of Feature Extractor (default sg.edu.nus.comp.nlp.ims.feature.CFeatureExtractorCombination)
     -c         class name of Corpus (default sg.edu.nus.comp.nlp.ims.corpus.CLexicalCorpus)
     -e         class name of Evaluator (default sg.edu.nus.comp.nlp.ims.classifiers.CLibLinearEvaluator)
     -r         class name of Result Writer (default sg.edu.nus.comp.nlp.ims.io.CResultWriter)
     -lexelt    path of lexelt file
     -is        path of index.sense
     -prop      path of prop.xml for JWNL
     -split     1/0 whether the corpus is sentence split (default 0)
     -ssm       path of sentence splitter model
     -token     1/0 whether the corpus is tokenized (default 0)
     -pos       1/0 whether the pos tag is provided in corpus (default 0)
     -ptm       path of POS tagger model
     -dict      path of dictionary for opennlp POS tagger (optional)
     -tagdict   path of tagdict for POS tagger (optional)
     -lemma     1/0 whether the lemma is provided in the corpus (default 0)
     -delimiter the delimiter to separate tokens, lemmas and POS tags (default "/")
     -type      type of testPath
                  directory: test all xml files under directory testPath
                  list: test all files listed in file testPath
                  file (default): test file testPath

You can refer to the bash scripts for the use of these two classes.
Furthermore, these two classes are just examples of how to make use of the
API/classes of IMS. For more API information about this software, please refer
to the javadoc files or the source code.

========================
7. LICENSE

IMS (It Makes Sense) -- NUS WSD system
Copyright (c) 2010 National University of Singapore. All Rights Reserved.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

If you use this system, please cite the following paper:
  Zhong, Zhi and Ng, Hwee Tou. 2010. It Makes Sense: A Wide-Coverage Word Sense
  Disambiguation System for Free Text. In Proceedings of the ACL 2010 System
  Demonstrations, pages 78--83, Uppsala, Sweden.

For more information, bug reports, or fixes, please contact:
  Zhong Zhi
  Department of Computer Science
  Computing 1, 13 Computing Drive
  National University of Singapore
  Singapore 117417
  zhongzhi[at]comp[dot]nus[dot]edu[dot]sg
  http://nlp.comp.nus.edu.sg/~zhongzhi

========================
8. VERSIONS & CHANGES

Version 0.9   --- 2009-08-01
Version 0.9.1 --- 2009-11-20
Version 0.9.2 --- 2010-07-09
    Change default feature value from "!" to "!DEF!"
    Add "n't, 've, 'd, 'm, 's, 're, 'll, -lrb-, -rrb-, -lsb-, -rsb-, -lcb-, -rcb-" to stop word list
    Fix some bugs
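
========================
APPENDIX: EXAMPLE COMMAND LINE (SKETCH)

The following is a minimal sketch, not part of the distribution, showing how the
CLASSPATH and the CTester usage line from section 6 could be assembled in a shell
script. It only prints the command; it does not run java. The names IMS_HOME,
IMS_JARS, build_classpath and ctester_command are hypothetical helpers introduced
here for illustration. The jar list is copied from the CLASSPATH in section 6, and
the options used (-prop, -type) are taken from the CTester option list. The
bundled test_one.bash may pass additional options (for example -ptm or -ssm), so
treat this as a starting point rather than a replacement for that script.

```shell
#!/usr/bin/env bash
# Sketch only: IMS_HOME and the helper names below are hypothetical.
# IMS_HOME is the directory where IMS_v0.9.2.tar.gz and lib.tar.gz were unpacked.
IMS_HOME="${IMS_HOME:-.}"

# Jar names copied from the CLASSPATH shown in section 6.
IMS_JARS="weka-3.2.3.jar jwnl.jar commons-logging.jar mxpost.jar trove.jar opennlp-tools-1.3.0.jar maxent-2.4.0.jar jdom.jar liblinear-1.33-with-deps.jar"

# Print a colon-separated classpath: every jar under lib/, then ims.jar.
# The body runs in a subshell so the loop variables stay local.
build_classpath() (
  cp=""
  for jar in $IMS_JARS; do
    cp="${cp}${IMS_HOME}/lib/${jar}:"
  done
  printf '%s%s\n' "$cp" "${IMS_HOME}/ims.jar"
)

# Print (do not execute) a CTester command for a single test file.
# Arguments follow the usage line: testPath modelDir statisticDir saveDir.
ctester_command() {
  printf 'java -cp %s sg.edu.nus.comp.nlp.ims.implement.CTester -prop %s/lib/prop.xml -type file %s %s %s %s\n' \
    "$(build_classpath)" "$IMS_HOME" "$1" "$2" "$3" "$4"
}
```

For example, after sourcing the script with IMS_HOME=ims, calling
"ctester_command test.xml models models out" prints one java command line.
Passing the same directory as modelDir and statisticDir is an assumption based on
section 5, where a single modelDir holds both the model and the statistic files.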