This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. Currently (April 2004), the corpus consists of about 10,000 SMS messages collected by students. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.
The corpus is stored as an XML document. Currently only the message and source fields are used, but future corpus collection may include other fields in the DTD below.
<!DOCTYPE smsCorpus [
  <!ELEMENT smsCorpus (message+)>
  <!ATTLIST smsCorpus
      date    CDATA #REQUIRED
      version CDATA #REQUIRED>

  <!ELEMENT message (text, source, destination?, messageProfile?, collectionMethod)>
  <!ATTLIST message
      id CDATA #REQUIRED>

  <!ELEMENT text (#PCDATA)>

  <!ELEMENT source ((srcID|srcNumber), phoneModel?, userProfile?)>
  <!ATTLIST source
      country CDATA #REQUIRED>
  <!ELEMENT srcID (#PCDATA)>
  <!ELEMENT srcNumber (#PCDATA)>

  <!ELEMENT destination ((destID|destNumber), phoneModel?, userProfile?)>
  <!ATTLIST destination
      country CDATA #REQUIRED>
  <!ELEMENT destID (#PCDATA)>
  <!ELEMENT destNumber (#PCDATA)>

  <!ELEMENT phoneModel EMPTY>
  <!ATTLIST phoneModel
      manufacturer CDATA #IMPLIED
      modelNumber  CDATA #IMPLIED>

  <!ELEMENT userProfile (userID|userName)>
  <!ATTLIST userProfile
      experienceLevel CDATA "unknown"
      inputMethod     CDATA "unknown">
  <!ELEMENT userID (#PCDATA)>
  <!ELEMENT userName (#PCDATA)>

  <!-- Characteristics about the message itself -->
  <!-- Default to (EN) English as language of transmission -->
  <!-- Is message forwarded from someone? -->
  <!-- Is message a service message / mass message? -->
  <!-- Is message a part of a linked message (e.g., 1 of 2)? -->
  <!-- Is message a reply? -->
  <!ELEMENT messageProfile EMPTY>
  <!ATTLIST messageProfile
      time         CDATA "unknown"
      date         CDATA "unknown"
      language     CDATA "EN"
      forwarded    (unknown | false | true) "unknown"
      massMessage  (unknown | false | true) "unknown"
      linkMessage  (unknown | false | true) "unknown"
      replyMessage (unknown | false | true) "unknown">

  <!ELEMENT collectionMethod EMPTY>
  <!ATTLIST collectionMethod
      collector CDATA #REQUIRED
      method    (web | friends | family | other | unknown) "unknown"
      year      CDATA #REQUIRED
      date      CDATA #IMPLIED>
]>
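As a rough illustration of how a document following this DTD might be read, the sketch below parses a small inline sample with Python's standard xml.etree.ElementTree module. The sample messages, IDs, collector names, and attribute values are invented for the example and are not taken from the corpus; the helper name read_messages is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-message sample that follows the DTD above; the text,
# IDs, and attribute values are invented for illustration only.
SAMPLE = """<smsCorpus date="2004.04.01" version="1.0">
  <message id="1">
    <text>Sample message one</text>
    <source country="SG"><srcNumber>0001</srcNumber></source>
    <collectionMethod collector="alice" method="web" year="2004"/>
  </message>
  <message id="2">
    <text>Sample message two</text>
    <source country="SG"><srcID>user-2</srcID></source>
    <messageProfile language="EN" forwarded="false"/>
    <collectionMethod collector="bob" year="2004"/>
  </message>
</smsCorpus>"""


def read_messages(root):
    """Return one dict per <message>: id, text, source country, collection method."""
    records = []
    for msg in root.iter("message"):
        source = msg.find("source")
        method = msg.find("collectionMethod")
        records.append({
            "id": msg.get("id"),
            "text": msg.findtext("text", default=""),
            "country": source.get("country") if source is not None else None,
            # The sample carries no DOCTYPE, so the DTD default ("unknown")
            # for the method attribute is applied by hand here.
            "method": method.get("method", "unknown") if method is not None else "unknown",
        })
    return records


root = ET.fromstring(SAMPLE)
messages = read_messages(root)
for m in messages:
    print(m["id"], m["country"], m["method"], m["text"])
```

For the downloaded corpus, the same function could be applied to the root returned by ET.parse(path).getroot(), where path points at the XML file inside the smsCorpus directory; the file name is not given here, so it would need to be filled in.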
Download the corpus here (will create a separate smsCorpus directory in the current working directory).
Please do us a favor and send a quick message to rpnlpir@comp.nus.edu.sg if you download this corpus and plan to use it. It will only take a minute of your time and will give us a better idea of what such a corpus might be used for.
The corpus is one of the deliverables of a final year project done by Yijue How, with continuing work done by Ming Fung Lee. If you would like to cite this corpus, please cite our paper in HCI International, below. Yijue's final year thesis is also available here:
This corpus is distributed under a license derived from the Open Directory Project. The license is distributed with the corpus. The collection of this corpus was generously funded by the Academic Research Fund of the National University of Singapore, R 252-000-155-101/112.