README for the NUS SMS Corpus

This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. Currently (April 2004), the corpus consists of about 10,000 SMS messages collected by students. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.

The corpus is stored as an XML document. Currently only the message and source fields are used, but future corpus collection may include other fields in the DTD below.

<!DOCTYPE smsCorpus [
  <!ELEMENT smsCorpus	(message+)>
  <!ATTLIST smsCorpus	date	CDATA	#REQUIRED
			version	CDATA	#REQUIRED>
  <!ELEMENT message	(text, source, destination?, messageProfile?, collectionMethod)>
  <!ATTLIST message	id 	CDATA	#REQUIRED>

  <!ELEMENT text	(#PCDATA)>

  <!ELEMENT source	((srcID|srcNumber), phoneModel?, userProfile?)>
  <!ATTLIST source	country	CDATA	#REQUIRED>
  <!ELEMENT srcID	(#PCDATA)>
  <!ELEMENT srcNumber	(#PCDATA)>

  <!ELEMENT destination	((destID|destNumber), phoneModel?, userProfile?)>
  <!ATTLIST destination	country	CDATA	#REQUIRED>
  <!ELEMENT destID	(#PCDATA)>
  <!ELEMENT destNumber	(#PCDATA)>

  <!ELEMENT phoneModel 	EMPTY>
  <!ATTLIST phoneModel	manufacturer	CDATA	#IMPLIED
			modelNumber	CDATA	#IMPLIED>

  <!ELEMENT userProfile (userID|userName)>
  <!ATTLIST userProfile	experienceLevel	CDATA	"unknown"
  			inputMethod	CDATA	"unknown">
  <!ELEMENT userID	(#PCDATA)>
  <!ELEMENT userName	(#PCDATA)>

  <!-- Characteristics about the message itself -->
  <!-- Default to (EN) English as language of transmission -->
  <!-- Is message forwarded from someone? -->
  <!-- Is message a service message / mass message? -->
  <!-- Is message a part of a linked message (e.g., 1 of 2)? -->
  <!-- Is message a reply? -->
  <!ELEMENT messageProfile	EMPTY>
  <!ATTLIST messageProfile	time	CDATA	"unknown"
				date	CDATA	"unknown"
				language	CDATA	"EN"
				forwarded	(unknown | false | true)	"unknown"
				massMessage	(unknown | false | true)	"unknown"
				linkMessage	(unknown | false | true)	"unknown"
				replyMessage	(unknown | false | true)	"unknown">

  <!ELEMENT collectionMethod	EMPTY>
  <!ATTLIST collectionMethod	collector	CDATA	#REQUIRED
				method	(web | friends | family | other | unknown)	"unknown"
				year	CDATA	#REQUIRED
				date	CDATA	#IMPLIED>
]>
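
As an illustration of how a document following this DTD might be read, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample XML, its date format, the identifiers in it, and the read_messages helper are all hypothetical and made up for this example; they are not taken from the corpus. ElementTree does not validate against the DTD, and since the sample has no DOCTYPE, the declared default for the method attribute is filled in by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-message sample that follows the DTD above.
# None of this is real corpus data.
SAMPLE = """<smsCorpus date="2004.04.30" version="1.0">
  <message id="1">
    <text>See you at the canteen at 12</text>
    <source country="SG"><srcID>user-a</srcID></source>
    <collectionMethod collector="collector-1" method="friends" year="2004"/>
  </message>
  <message id="2">
    <text>Running late, start without me</text>
    <source country="SG"><srcNumber>0000</srcNumber></source>
    <collectionMethod collector="collector-1" year="2004"/>
  </message>
</smsCorpus>"""


def read_messages(xml_text):
    """Return one dict per <message>, using only fields the DTD requires."""
    root = ET.fromstring(xml_text)
    records = []
    for msg in root.iter("message"):
        source = msg.find("source")
        # The DTD allows either srcID or srcNumber inside <source>.
        sender = source.find("srcID")
        if sender is None:
            sender = source.find("srcNumber")
        method = msg.find("collectionMethod")
        records.append({
            "id": msg.get("id"),
            "text": msg.findtext("text", default=""),
            "country": source.get("country"),
            "sender": sender.text,
            # The DTD declares "unknown" as the default; no validation
            # happens here, so the default is applied manually.
            "method": method.get("method", "unknown"),
        })
    return records


if __name__ == "__main__":
    for record in read_messages(SAMPLE):
        print(record["id"], record["country"], record["text"])
```

To run this against the actual corpus, the XML file's contents would be passed to read_messages in place of SAMPLE; the file name depends on the download and is not assumed here.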

Download the corpus here (this will create a separate smsCorpus directory in the current working directory).

Please do us a favor and send a quick message to rpnlpir@comp.nus.edu.sg if you download this corpus and plan on using it. It will only take a minute of your time and will help us get a better idea of what such a corpus might be used for.

The corpus is one of the deliverables of a final year project done by Yijue How, with continuing work done by Ming Fung Lee. If you would like to cite this corpus, please cite our paper in HCI International, below. Yijue's final year thesis is also available here:

This corpus is distributed under a license derived from the Open Directory Project. The license is distributed with the corpus. The collection of this corpus was generously funded by the Academic Research Fund of the National University of Singapore, R 252-000-155-101/112.


Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Sat Aug 2 14:39:33 2003 | Version: 1.0 | Last modified: Tue Jun 15 11:30:15 2004