Areas of research in Computer Science:
With increasing integration of computing, broadcasting, networking
and the Web, multimedia information has become pervasive, permeating
almost every aspect of our life. Increasingly, we find information
coming in different media forms, multiple external sources and encoded
in various knowledge representations. Typical generic media types
include text, image, video and audio. Major parallel sources of
information may come from the media contents in different forms, the
Web, and social network sites such as Wikipedia. Also, we can normally
find good coded knowledge sources such as ontologies and user usage
patterns. To process such information effectively and efficiently, the
ability to analyse and fuse the myriad related sources of information
has become critically important in any information processing task.
To leverage such synergy, the media group is organised to include
researchers in the related fields of multimedia, computer vision,
computer graphics, natural language processing, multimedia systems, and
machine learning. Overall, the group conducts research related to the
generation, processing, understanding, display, interaction,
transmission and storage of multimedia information. The group is active
in both basic and systems research, including work with industrial
impact. Members are active in international professional activities,
including chairing major international conferences, sitting on
editorial boards and serving in technical programme committees.
Locally, several group members serve on national-level technical
committees, with one chairing a national-level research funding board.
Multimedia Information Processing
Analysis of multimedia contents has in the past been carried out on the
basis of a single medium. Only
recently have we begun to use multimodal features routinely to analyse
multimedia contents. The use of only intra-content features, however,
is still inadequate. To progress further, we need to utilise the
available external information sources, such as the abundance of Web,
ontology, and human language resources (dictionaries, encyclopaedia),
and various encoded knowledge, to supplement media content analysis.
Along this theme, we have carried out research on concept extraction,
retrieval and question answering in video, working on both news and
informational videos. The long-term goal is to develop automated
techniques to index an input video stream to facilitate the retrieval,
summarisation and personalisation of video. In concept annotation,
which involves assigning one or more pre-defined concepts to input
video clips, we are focusing on employing hierarchical concept
structures and developing visual vocabulary to perform multimodal
concept annotation. The use of a visual dictionary along with text
vocabulary has been found to be effective on public image corpora.
In retrieval, our system exploits domain models for news, together with
speech (in terms of Automatic Speech Recognition or ASR output) and
various audiovisual (AV) features inherent in news video streams. Our
modelling incorporates query analysis, a topic-dependent information
fusion model, and integration of text and visual concepts to identify
precise answers. In addition, we explore the modelling and detection of
events of human interest through their relations to people, time and
space, leveraging vast external information sources such as news and
blogger sites. We have participated in large-scale public news video
retrieval evaluations organised by TRECVID, and achieved top positions
in auto-news video retrieval tasks in 2005-06. We are developing an
interactive system that incorporates
active learning and facilitates fast user feedback.
We apply a similar multimodal, multi-source and multi-resolution
framework to personal media and painting domains. In personal media, we
focus on auto annotation of "who" and "exact where" by exploiting
visual and social contexts. We explore the annotation of paintings with
high level artistic concepts using arts ontologies and Web resources.
The key research issues explored in these projects are: (a)
ontology-based learning; and (b) transductive learning, as training data
is scarce.

Multimedia Analysis and Synthesis
Analysis: A diversity of media types such as text, audio, video and
novel sensory forms is proliferating in a variety of applications. This
is true of traditional media sources such as television, new media
sources such as the Internet, consumer applications such as personal
media management, and emerging niche areas such as surveillance. This
calls for media analysis, which is the precursor of other forms of
processing such as archiving, querying, retrieval and transcoding. For
example, the ability to analyse and fuse information from different
sources and in multiple media types in order to parse the semantic
events occurring has become critically important in many media
management tasks. Handling live data (be it symbolic text feed or
signal sensor feed) is particularly challenging in this environment. We
have developed an experiential sampling technique as well as
information assimilation techniques for this purpose. Most of the
analysis work has been done in the multimedia surveillance context.
There are two basic thrusts:
Active Multimedia Sensing:
This thrust aims to develop the theoretical foundation, algorithms,
architectures and prototypes for active sensing for a wide variety of
applications such as surveillance, monitoring and the Web. The idea
here is to harness the power of diverse sensors in an orchestrated
manner to optimally perform the given tasks. For example, we have
developed a "coopetitive" interaction approach, which combines the
salient features of cooperation and competition with an aim to optimise
the cooperation among sensors to achieve best results at the system
level rather than redundantly implementing cooperation at each stage.
For this, we employ a forward state estimation method which is based on
Model Predictive Control to counteract various delays in multi-sensor
environments. Our results from two different visual surveillance
adaptations with different numbers of cameras and different surveillance
goals provide clear evidence of improvements achieved with the
coopetitive strategy.
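The idea of counteracting sensor delays by estimating the state forward in time can be illustrated with a very small sketch. It is not the group's Model Predictive Control formulation: the `TrackState` type, the `predict_forward` function and the constant-velocity motion model are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrackState:
    """Target state in the ground plane: position (m) and velocity (m/s)."""
    x: float
    y: float
    vx: float
    vy: float


def predict_forward(state: TrackState, delay_s: float) -> TrackState:
    """Extrapolate a delayed observation to the present time.

    Assumes constant velocity over the delay interval (an illustrative
    motion model, not one taken from the text).
    """
    return TrackState(
        x=state.x + state.vx * delay_s,
        y=state.y + state.vy * delay_s,
        vx=state.vx,
        vy=state.vy,
    )


# One camera reports a target, but the report is 0.4 s old by the time
# a second camera can act on it, so the second camera uses the estimate.
stale = TrackState(x=2.0, y=1.0, vx=1.0, vy=0.5)
current = predict_forward(stale, delay_s=0.4)
print(current)
```

In a multi-sensor setting, each sensor would apply such a prediction with its own measured delay before handing a target over to another sensor.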
Multimedia Event Detection and Capture through Sensors and Context:
Here, we operate on the recognition that humans tend to organise their
lives around events. Our work centres on developing event detection
techniques with the use of all appropriate sensors. So far, we have
considered specific event detection using multimodal information with
human assistance as may be suitable in specific contexts. Most of the
work has been done in the context of building personal media
collections in the form of family e-chronicles.

Synthesis:
A well-produced video makes a strong impression on the viewer. However,
amateur home video makers are unaware of the principles of cinema
grammar. The videos they shoot are meant to convey some specific
intent, although often, their inexperience and lack of editing skills,
or the limitations of the means of video capture they have at their
disposal belie their intentions. Intent delivery techniques, which rely
on principles of cinema grammar, aesthetics and video analysis, may
remedy the situation by conveying a range of evident intents. We have
developed a general approach for video intent delivery by means of a
brief catalogue of the intentions of the cinematographer and the
editor. It allows for the delivery of four basic types of intents:
cheer, serenity, gloom and excitement. Essentially, we have used the
theoretical support from video grammar and cinematic rules in order to
model computable features for repurposing home videos. We have also
developed a transcoding technique called multimedia simplification
which is based on experiential sampling. Multimedia simplification
helps optimise the synthesis of Multimedia Messaging Service (MMS)
messages for mobile phones. Transcoding is useful in overcoming the
limitations of compact devices. The proposed approach aims at reducing
the redundancy in multimedia data captured by multiple types of media
sensors. The simplified data is first stored into a gallery for further
usage. Once a request for MMS is received, the MMS server makes use of
the simplified media from the gallery. The multimedia data is aligned
to the time-line for MMS message synthesis. Our technique is targeted
at users who are interested in obtaining salient multimedia information
via mobile devices.

Computer Vision
Our research in computer vision spans many areas, including: biometrics
and face recognition, human motion analysis, medical image analysis,
and digital photography. In biometrics, we have developed techniques to
make face recognition more robust against changes in lighting and head
orientation. These are the two most prevalent variations in face images
that make recognition difficult. We can also deal with low-quality
images affected by blur, low contrast, and partial occlusion. We have
recently demonstrated the usefulness of combining multiple modalities,
e.g., face, fingerprint, and keystroke dynamics, to create a system
that continuously authenticates a user after initial login. Such a
system is useful for high security environments.

In digital photography, we are developing the next generation of
"picture perfect" cameras that will eliminate common problems such as
red eye, blurring, uneven exposure, and insufficient ambient light. The
goal is to create a consumer camera that produces sharp, detailed and
pleasing photographs directly from hardware, without the need to touch
up the images. To do this, we are exploiting near infrared light which
is already captured by normal CCD sensors, as well as by detecting the
presence of faces and known objects in the scene. In human motion
analysis, we have developed a sophisticated offline system for
comparing the motion of a sports novice captured in a single video with
the 3D reference motion of an expert. The system performs temporal
alignment of novice motion in the video and the 3D reference motion,
and then computes the difference between the novice posture and the
corresponding 3D reference posture. The posture differences are
visualised and fed back to the novice, who can then correct the motion
as though a human coach were present. We have applied this system
to Taichi and golf swing coaching. At present, we are collaborating
with an IT entrepreneur to build a prototype system for remote golf
swing coaching. This prototyping effort is supported by the Economic
Development Board (EDB) in Singapore and is showcased in EDB's
publicity brochure. We are in the process of filing patents for this
technology.
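The two steps of the motion analysis described above, temporal alignment followed by per-frame posture comparison, can be sketched with dynamic time warping (DTW). The text does not say which alignment method the system uses, so DTW is an assumption here, as are the function names and the representation of each posture as a short vector of joint angles.

```python
import math


def posture_distance(a, b):
    """Euclidean distance between two posture vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw_align(novice, expert):
    """Align two posture sequences with DTW.

    Returns (total_cost, path), where path is a list of
    (novice_index, expert_index) frame pairs.
    """
    n, m = len(novice), len(expert)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = posture_distance(novice[i - 1], expert[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the last frames to recover the correspondence.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Hypothetical joint angles (degrees): the novice lingers on the first pose.
novice = [[10.0, 80.0], [12.0, 82.0], [45.0, 60.0], [85.0, 25.0]]
expert = [[10.0, 80.0], [50.0, 55.0], [90.0, 20.0]]
total, path = dtw_align(novice, expert)
for ni, ei in path:
    diff = [round(a - b, 1) for a, b in zip(novice[ni], expert[ei])]
    print(f"novice frame {ni} vs expert frame {ei}: difference {diff}")
```

The per-pair differences printed at the end correspond to the posture differences that would be visualised for the novice.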

In medical image analysis, we have collaborated with medical doctors on
a wide range of applications including detection of bone fractures in
x-ray images (with Singapore General Hospital); segmentation of liver
in abdominal CT images for liver transplant (with National University
Hospital); detection, classification and quantification of acne and
pigmentation in face images (with National Skin Centre); early
detection of infarcts in brain CT (with Singapore General Hospital);
simulation of cardiac surgery for surgical planning (with National
Taiwan University Hospital), etc. An international patent for the fracture
detection system has been published under PCT and a U.S. patent has
been filed. We are in the process of filing patents for the skin acne
analysis system. Most of the above research work is still in progress.

Interactive 3D Computer Graphics and Computational Geometry
The group pursues research mainly in the areas of interactive 3D
graphics and computational geometry. Research in the first area
includes: interactive control in graphics applications such as
morphing; real-time modelling and rendering; large scale data
management to support visual simulation & animation; GPU
computation; and adaptive gaming agents with personality
modelling.
Research in the area of computational geometry includes: (a) quality and
homeomorphic meshes for smooth surfaces used in molecular science,
engineering simulation and general deformation in computer graphics;
(b) parametric deforming mesh scheduling for automatic deformation with
arbitrary topology changes; and (c) deformation and meshing of R3
surfaces, shape alignments, approximation of convex shapes, and
relationships between toric surfaces and Dixon sparse resultants.
This work is relevant to building design, architectural visualisation,
visual simulation and gaming. The group collaborates with building
professionals on computational problems in building design. Its work on
real-time generation of shadows, Trapezoidal Shadow Maps (TSM), has been
licensed to game companies (such as Big Huge Games Inc., USA). The
algorithm has also been bundled in the SDK of nVidia (a market giant in
video display chips) and used in plant rendering software (USA). Its
work on real-time Voronoi diagrams has been used in software developed
at the Harvard-MIT Division of Health Sciences and Technology.

3D Digital Reconstruction of Real-world Objects & Environments
Using active range sensing to reconstruct high-quality 3D digital
models of real-world objects and environments has many practical
applications in areas ranging from manufacturing to the cultural. Yet
even now, acquisition and reconstruction processes are very tedious and
laborious, let alone the huge amount of data that needs to be
processed. The ultimate goal of many research projects in this area is
to automate the whole 3D reconstruction process, or to reduce human
involvement. To create visually realistic models, colour information is
acquired using digital still cameras and video cameras.
Our current research focuses on: (a) View planning -- to determine a
sequence of positions to place acquisition devices to optimise
acquisition and obtain a more complete reconstructed model. (b)
View-dependent colour acquisition and automatic registration -- to
create a visually realistic 3D digital model with colour information.
(c) Real-time hand-held range scanning -- to register, reconstruct and
display range data in real time to provide effective visual feedback to
the human operator.
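View planning, as described in (a), can be illustrated as a coverage problem. The sketch below is a simple greedy heuristic, not the group's planner: it assumes that for each candidate scanner position we already know which surface patches it would see, and the names `plan_views` and `visible` are hypothetical.

```python
def plan_views(visible, max_views):
    """Greedily order viewpoints by how many unseen patches each adds.

    visible maps a viewpoint name to the set of surface patch ids it sees.
    Returns the chosen viewpoints in scanning order and the covered set.
    """
    covered, order = set(), []
    remaining = dict(visible)
    while remaining and len(order) < max_views:
        # Ties go to the alphabetically first name so results are repeatable.
        best = max(sorted(remaining), key=lambda v: len(remaining[v] - covered))
        if not remaining[best] - covered:
            break  # no candidate adds anything new
        covered |= remaining.pop(best)
        order.append(best)
    return order, covered


# Hypothetical visibility sets for four scanner positions.
views = {
    "front": {1, 2, 3, 4},
    "left": {3, 4, 5},
    "top": {5, 6},
    "back": {6, 7, 8},
}
order, covered = plan_views(views, max_views=4)
print(order, sorted(covered))
```

A greedy choice does not guarantee the shortest possible sequence, but it shows how a planner can trade the number of scans against the completeness of the model.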
Digital Audio Processing and Media Integration
We focus on applied audio research in the context of real life
applications. We have produced the Interactive Digital Violin Tutor
(iDVT), which mainly focuses on music transcription. Exploiting synergy
between individuals in the School and collaborating with colleagues at
local research institutes, we are developing practical multimedia
applications which integrate audio into other media types such as text,
video and animation. We have developed LyricAlly, which is designed as
a useful tool for mobile entertainment in the form of portable karaoke.

Multimedia Systems
Our research aims to improve resource efficiency and playback quality
in a multimedia system. As these are often conflicting goals, one
general theme of our research is to develop new techniques that operate
at the right trade-off point between resource and quality. In this
direction, we are currently investigating the trade-off between
bandwidth and correctness of a distributed video surveillance system,
and between power consumption and frame-rate rendering in a
first-person shooting game. Another general research theme deals with
the Internet's lack of service guarantee during media
streaming. Our interest covers fundamental issues, such as error
control and congestion control, as well as new approaches to streaming,
such as multi-source streaming and peer-to-peer streaming. We are also
investigating the streaming of 3D objects, a new media type that is
becoming popular. Our team is collaborating with IRIT-ENSEEIHT, France,
on this topic.

Media Streaming
Low-latency, Interactive Peer-to-peer Streaming
Peer-to-peer (P2P) streaming is emerging as a viable communications
paradigm. Research in this field has traditionally aimed at building
efficient and optimal overlay multicast trees at the application level.
However, much of the existing work focuses on store-and-forward or
one-way stream delivery. In these scenarios, end-to-end latency is not
very critical. In our work, we focus on peer-to-peer streaming support
for live, interactive applications. While some applications exist in
this space (e.g., Skype), most are targeted at relatively small groups
of participants per session (say two to eight).
The aim of our Adaptive Cluster Technology for Interactive Virtual
Environments (ACTIVE) protocol is to enable interactive communications
for large participant groups as can be found in a number of
applications such as e-learning and Massively Multiplayer Online Games
(MMOG). Some of the novel concepts introduced by ACTIVE are the
distinction between active and passive participants and a dynamically
adaptive clustering mechanism based on this classification. In virtual
environments, latency optimisation is performed based on the location
proximity of avatars within the virtual space. By leveraging the
location information, the ACTIVE platform can also be used to deliver
positional audio to create a more realistic aural landscape.
ACTIVE has been the foundation of a number of experimental prototype
systems such as AudioPeer, a voice chat application for large
participant groups (see Figure 11(a)). More recently, ACTIVE has been
used to add peer-to-peer voice services to the Torque game engine and a
sample game called PartyPeer has been created (Figure 11(b)).

Wireless Ad Hoc Media Streaming
With the widespread availability of handheld devices that are both
media-capable and wirelessly networked, streaming audio or video
content between such units is feasible. Many recent mobile devices can
operate via wireless 802.11 networks, which provide broadband-level
bandwidth (usually free of charge) and a communication range of
hundreds of metres. This allows a user to move freely when she is
streaming multimedia content from others within her communication
radius (see Figure 12(a)). One challenge in streaming multimedia
content among mobile ad hoc peers is to deliver the content, usually
large in size, over a wireless link whose quality is constantly
changing. For example, the wireless bandwidth may drop or the link may
even break as the distance between two mobile ad hoc peers increases.
Our research in this area focuses on link availability prediction to
improve the quality of peer streaming.
Predicting future link availability under different movement patterns
of the devices is a challenging task (Figure 12(b)). Our work
mathematically models the link status given certain mobility models
(e.g., random walk and random waypoint models), location and speed
information (e.g., obtained via GPS), and realistic network bandwidth
as determined by the Auto-Rate Fallback (ARF) scheme of 802.11-based
wireless equipment. Additionally, our technique takes advantage of the
multi-layer structure of Scalable Video Coding (SVC) or Multiple
Description Coding (MDC) to increase the success of media delivery
under varying conditions.
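To make the idea of link availability prediction concrete, the sketch below computes how long two peers stay within radio range from their positions and velocities. It is a deliberately simplified stand-in for the models named above: it assumes both peers keep a constant velocity (rather than following a random walk or random waypoint model) and a fixed circular radio range, and the function name `link_lifetime` is hypothetical.

```python
import math


def link_lifetime(pos_a, vel_a, pos_b, vel_b, radio_range):
    """Seconds until two peers drift out of radio range.

    Assumes both peers keep their current velocity (a simplification).
    Returns 0.0 if they are already out of range and math.inf if there
    is no relative motion.
    """
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    wx, wy = vel_b[0] - vel_a[0], vel_b[1] - vel_a[1]
    c = dx * dx + dy * dy - radio_range ** 2
    if c > 0:
        return 0.0
    a = wx * wx + wy * wy
    if a == 0:
        return math.inf
    b = 2 * (dx * wx + dy * wy)
    # Larger root of a*t^2 + b*t + c = 0, i.e. |d + w*t| = radio_range.
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


# Peer A stands still; peer B is 30 m away and walks off at 1.5 m/s.
seconds = link_lifetime((0.0, 0.0), (0.0, 0.0), (30.0, 0.0), (1.5, 0.0), 100.0)
print(f"link expected to last about {seconds:.1f} s")
```

A streaming peer could compare such an estimate with the time needed to transfer the next segment, for example to decide how many SVC layers to request before the link degrades.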
Some of our techniques have been implemented in a prototype called
MStream (Figure 12(c)), which was demonstrated as a US Finalist project
of the ImagineCup 2006 competition at Microsoft in Redmond, Washington.

Natural Language Processing
Research on natural language processing includes the areas of semantic
processing, discourse processing, and Chinese language processing. In
semantic processing, we exploit parallel texts and semi-supervised
learning to scale up word sense disambiguation, which is the task of
determining the correct meaning or sense of a word in context. We also
employ automated techniques to estimate sense priors for adapting a
word sense disambiguation program to different domains. We have also
built state-of-the-art semantic role labelling programs for PropBank and
NomBank, which identify the semantic role of each constituent in a
sentence.
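One common way to use estimated sense priors when moving a word sense disambiguation classifier to a new domain is to rescale its output probabilities by the ratio of target-domain to training-domain priors and renormalise. The sketch below shows only that rescaling step. Whether the group's system applies the priors in exactly this way is an assumption, and the sense labels and numbers are invented for illustration.

```python
def adapt_posteriors(posteriors, train_priors, target_priors):
    """Rescale classifier sense probabilities for a new domain.

    Each probability is multiplied by target_prior / train_prior for its
    sense, and the results are renormalised to sum to 1.
    """
    scaled = {
        sense: p * target_priors[sense] / train_priors[sense]
        for sense, p in posteriors.items()
    }
    total = sum(scaled.values())
    return {sense: value / total for sense, value in scaled.items()}


# Hypothetical example for the word "bank": the classifier was trained on
# text where the financial sense dominates, but the target domain is
# mostly about rivers.
posteriors = {"finance": 0.6, "river": 0.4}
train_priors = {"finance": 0.8, "river": 0.2}
target_priors = {"finance": 0.3, "river": 0.7}
adapted = adapt_posteriors(posteriors, train_priors, target_priors)
print(max(adapted, key=adapted.get), adapted)
```

With these invented numbers the preferred sense switches from "finance" to "river", which is the kind of correction that better sense priors are meant to provide.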

In discourse processing, we work on the task of English one-anaphora
resolution, using a machine learning approach. In Chinese language
processing, we have built a state-of-the-art Chinese word segmenter. In
2005, we participated in the open track of the Second International
Chinese Word Segmentation Bakeoff, an international evaluation exercise
that compares competing Chinese word segmenters. Our Chinese word
segmenter achieved the highest accuracy on three of the four test
corpora, and the second highest accuracy on the fourth test corpus. A
total of 18 teams participated in the event.
In addition to these areas, we have done peripheral work in other key
areas of natural language processing, including machine translation,
information extraction and verb analysis. Our work in machine
translation has improved the quality of word ordering in
Chinese-to-English translation. We have developed methods for building
better textual similarity measures for NLP applications such as
information extraction, summarisation and question answering. Finally,
in lexical analysis, we have continued to seek automated methods to
treat compound verb phrases, in particular light verb phrases (e.g.,
make a call), where the verbs play only a licensing role for their
arguments.
Precise Information Retrieval and Question Answering
Question answering (QA) aims to find exact answers to users' natural
language queries, instead of ranked lists of documents as is done in
current search engines. It is a major step towards information
retrieval instead of document retrieval.
Our QA system employs a pipeline structure that consists of several
modules to get short and precise answers to users' questions. It
searches for answers at increasingly finer-grained units: (1)
locating the relevant documents, (2) retrieving passages that may
contain the answer, and (3) pinpointing the exact answer from candidate
passages. The research focus of our work is three-fold. First, we
search the Web for relevant context information to supplement the often
inexact query. In particular, we perform semantic clustering of
information retrieved from the Web to identify different sub-events and
induce different facets of queries in supporting event-based QA.
Second, in addition to density-based word matching, we employ
discourse, semantic and dependency relations to perform passage
retrieval at sentence level. This gives rise to a multi-resolution
framework for relation-based precise information retrieval. Third, we
develop a document concept lattice model together with definitional
patterns and a human interest model to perform task-oriented
summarisation.
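The three-stage pipeline described above (documents, then passages, then the exact answer) can be sketched in a few lines. This toy version is not the group's system: it ranks by simple term overlap instead of discourse, semantic and dependency relations, it only answers "when" questions by looking for a four-digit year, and all names and example texts are made up.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "was", "is", "when", "did", "to", "by"}


def terms(text):
    """Lower-cased content words of a text."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def rank(query, units, top_k):
    """Rank text units by the number of query terms they share."""
    q = terms(query)
    return sorted(units, key=lambda u: len(q & terms(u)), reverse=True)[:top_k]


def answer_when(question, documents):
    """Return a year answering a 'when' question, or None if none is found."""
    # Stage 1: locate the relevant documents.
    docs = rank(question, documents, top_k=2)
    # Stage 2: retrieve candidate passages (here, single sentences).
    sentences = [s.strip() for d in docs for s in d.split(".") if s.strip()]
    passages = rank(question, sentences, top_k=3)
    # Stage 3: pinpoint the exact answer inside the best passages.
    for passage in passages:
        match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", passage)
        if match:
            return match.group(1)
    return None


documents = [
    "The lighthouse on Pine Island was built in 1887. It guided ships to the harbour.",
    "Pine Island has a ferry service. The ferry started in 1952.",
]
print(answer_when("When was the lighthouse built?", documents))
```

Each stage narrows the search unit, so the more expensive answer extraction step only runs on a handful of passages.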
Our studies on the large-scale TREC-QA corpus demonstrate that our
approaches are effective in performing factoid, list and definitional
QA. Our system has consistently ranked second over three years
(2003-2005) in the public TREC-QA evaluations organised by
NIST, USA. Our summarisation system also achieved top position in the
DUC forum in 2005. Our technology has been licensed by industry to
perform precise legal search. Currently, we employ our multi-resolution
relation-based framework for information extraction. We are also
exploring the use of web knowledge and ontologies to perform
interactive QA.
Machine Learning for Media Applications
Our research applies machine learning to the areas of text processing,
natural language processing, and signal and video processing for
activity recognition. For activity recognition, we have been working
with physiological signals from wearable sensors as well as video from
fixed cameras. In text classification, our focus has been on developing
kernels and features that would perform well. In natural language
processing, we have been working on word sense disambiguation,
utilising unlabeled data in unsupervised and semi-supervised learning.
We focus both on developing machine learning techniques to address the
issues important in these applications and on performing well on the
applications themselves. We participated in the SemEval 2007 evaluation
for the word sense disambiguation tasks, and our system ranked first in
the lexical sample task and third in the coarse-grained all-words task.
The faculty members involved in media research are:
- CHANG Ee Chien
- CHENG Holun
- CHUA Tat Seng
- CHIONH Eng Wee
- FANG Chee Hung, Anthony
- GOLAM Ashraf
- KAN Min Yen
- KANKANHALLI Mohan
- LEE Wee Sun
- LEOW Wee Kheng
- LOW Kok Lim
- NG Hwee Tou
- OOI Wei Tsang
- SIM Mong Cheng, Terence
- TAN Tiow Seng
- WANG Ye