sg.edu.nus.comp.nlp.ims.corpus
Class CAllWordsPlainCorpus

java.lang.Object
  extended by sg.edu.nus.comp.nlp.ims.corpus.ACorpus
      extended by sg.edu.nus.comp.nlp.ims.corpus.CAllWordsPlainCorpus
All Implemented Interfaces:
ICorpus

public class CAllWordsPlainCorpus
extends ACorpus

Corpus for plain text input. Extracts all content words according to the POS tagging result.
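
As a rough illustration of selecting content words from POS tagging output, the sketch below keeps the tokens whose tags mark nouns, verbs, adjectives, or adverbs. It is not taken from the IMS source: it assumes Penn Treebank style tags, and `ContentWordSketch`, `isContentTag`, and `contentWords` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of IMS: filters tokens by Penn Treebank style POS tags.
final class ContentWordSketch {

    // Assumption: content words are nouns (NN*), verbs (VB*), adjectives (JJ*), and adverbs (RB*).
    static boolean isContentTag(String tag) {
        return tag.startsWith("NN") || tag.startsWith("VB")
                || tag.startsWith("JJ") || tag.startsWith("RB");
    }

    // Returns the tokens whose tag at the same index is a content-word tag.
    static List<String> contentWords(List<String> tokens, List<String> tags) {
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (isContentTag(tags.get(i))) {
                result.add(tokens.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "bank", "quickly", "raised", "rates", ".");
        List<String> tags = Arrays.asList("DT", "NN", "RB", "VBD", "NNS", ".");
        System.out.println(contentWords(tokens, tags));
    }
}
```

The tag prefixes are a design choice for the sketch; the actual criteria used by this class may differ.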

Author:
zhongzhi

Field Summary
 
Fields inherited from class sg.edu.nus.comp.nlp.ims.corpus.ACorpus
g_LIDX, g_PIDX, g_TIDX, m_Boundaries, m_DefaultDelimiter, m_Delimiter, m_DocIDs, m_IDs, m_Indice, m_InstanceLemmas, m_InstancePOSs, m_InstanceTokens, m_Lemmatized, m_Lemmatizer, m_Lengths, m_LexeltIDs, m_POSTagged, m_POSTagger, m_Ready, m_SatID2Index, m_SatIDs, m_SatIndice, m_SatSentenceIDs, m_SentenceIDs, m_Sentences, m_SentenceSplitter, m_Split, m_Tags, m_Tokenized, m_Tokenizer
 
Constructor Summary
CAllWordsPlainCorpus()
          default constructor
CAllWordsPlainCorpus(IPOSTagger p_POSTagger, ISentenceSplitter p_Splitter, ITokenizer p_Tokenizer, ILemmatizer p_Lemmatizer)
          constructor with the given preprocessing components
 
Method Summary
protected  void genInfo()
          collect some information
 boolean load(java.io.Reader p_Reader)
          load data into corpus
protected  java.util.ArrayList<java.util.ArrayList<java.lang.String>> split(java.util.ArrayList<java.util.ArrayList<java.lang.String>> p_Texts)
          split paragraph into sentences
protected  void tokenizeSentence(java.lang.String p_Sentence)
          tokenize a sentence
 
Methods inherited from class sg.edu.nus.comp.nlp.ims.corpus.ACorpus
clear, getIndexInSentence, getLength, getLowerBoundary, getSentence, getSentenceID, getTag, getUpperBoundary, getValue, isReady, isValidInstance, isValidSentence, lemmatize, numOfSentences, posTag, setDelimiter, setLemmatized, setPOSTagged, setSplit, setTokenized, size, tokenize, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

CAllWordsPlainCorpus

public CAllWordsPlainCorpus()
default constructor


CAllWordsPlainCorpus

public CAllWordsPlainCorpus(IPOSTagger p_POSTagger,
                            ISentenceSplitter p_Splitter,
                            ITokenizer p_Tokenizer,
                            ILemmatizer p_Lemmatizer)
constructor with the given preprocessing components

Parameters:
p_POSTagger - POS tagger
p_Splitter - Sentence splitter
p_Tokenizer - tokenizer
p_Lemmatizer - lemmatizer

Method Detail

load

public boolean load(java.io.Reader p_Reader)
             throws java.lang.Exception
Description copied from interface: ICorpus
load data into corpus

Specified by:
load in interface ICorpus
Specified by:
load in class ACorpus
Parameters:
p_Reader - reader of the input stream
Returns:
true if the corpus is ready, false otherwise
Throws:
java.lang.Exception - exception while loading file
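
The calling pattern implied by this signature (pass a `Reader`, check the boolean result, handle the checked exception) can be sketched without the IMS classes. In the example below, `Loadable` and `LineCorpus` are hypothetical stand-ins that mirror only the documented signature; what the real `load` does with the text is not shown here.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class LoadSketch {

    // Hypothetical stand-in mirroring only the documented signature of load(Reader).
    interface Loadable {
        boolean load(Reader reader) throws Exception;
    }

    // Toy implementation: stores non-empty lines and reports ready when at least one was read.
    static final class LineCorpus implements Loadable {
        final List<String> lines = new ArrayList<>();

        @Override
        public boolean load(Reader reader) throws Exception {
            BufferedReader buffered = new BufferedReader(reader);
            String line;
            while ((line = buffered.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    lines.add(trimmed);
                }
            }
            return !lines.isEmpty();
        }
    }

    public static void main(String[] args) throws Exception {
        LineCorpus corpus = new LineCorpus();
        // Any Reader works here; a FileReader would be used the same way.
        try (Reader reader = new StringReader("The bank raised rates.\n\nInvestors reacted.")) {
            boolean ready = corpus.load(reader);
            System.out.println("ready=" + ready + ", lines=" + corpus.lines.size());
        }
    }
}
```

Because the method accepts a plain `java.io.Reader`, callers can load from a file, a string, or a network stream without changing the corpus code.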

split

protected java.util.ArrayList<java.util.ArrayList<java.lang.String>> split(java.util.ArrayList<java.util.ArrayList<java.lang.String>> p_Texts)
split paragraph into sentences

Parameters:
p_Texts - paragraph
Returns:
sentences
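
A minimal sketch of sentence splitting over the same nested-list shape follows. It uses the JDK's `java.text.BreakIterator` in place of the `ISentenceSplitter` component, so it is not the IMS implementation. Treating each inner list as a group of paragraphs is an assumption, and `SplitSketch` is a hypothetical name.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Locale;

// Hypothetical helper, not part of IMS: splits paragraphs into sentences with BreakIterator.
final class SplitSketch {

    // Splits every string of every inner list into sentences, preserving the outer grouping.
    static ArrayList<ArrayList<String>> split(ArrayList<ArrayList<String>> texts) {
        ArrayList<ArrayList<String>> result = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        for (ArrayList<String> group : texts) {
            ArrayList<String> sentences = new ArrayList<>();
            for (String paragraph : group) {
                iterator.setText(paragraph);
                int start = iterator.first();
                for (int end = iterator.next(); end != BreakIterator.DONE;
                        start = end, end = iterator.next()) {
                    String sentence = paragraph.substring(start, end).trim();
                    if (!sentence.isEmpty()) {
                        sentences.add(sentence);
                    }
                }
            }
            result.add(sentences);
        }
        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> group = new ArrayList<>();
        group.add("The bank raised rates. Investors reacted quickly.");
        ArrayList<ArrayList<String>> texts = new ArrayList<>();
        texts.add(group);
        System.out.println(split(texts));
    }
}
```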

tokenizeSentence

protected void tokenizeSentence(java.lang.String p_Sentence)
Description copied from class: ACorpus
tokenize a sentence

Specified by:
tokenizeSentence in class ACorpus
Parameters:
p_Sentence - input sentence

genInfo

protected void genInfo()
Description copied from class: ACorpus
collect some information

Overrides:
genInfo in class ACorpus