   
  Media
 

With the increasing integration of computing, broadcasting, networking and the Web, multimedia information has become pervasive, permeating almost every aspect of our lives. Increasingly, we find information coming in different media forms, from multiple external sources, and encoded in various knowledge representations. Typical generic media types include text, image, video and audio. Major parallel sources of information include media contents in different forms, the Web, and social network sites such as Wikipedia. We can also usually find good sources of coded knowledge, such as ontologies and user usage patterns. To process such information effectively and efficiently, the ability to analyse and fuse the myriad related sources of information has become critically important in any information processing task.


To leverage such synergy, the media group is organised to include researchers in the related fields of multimedia, computer vision, computer graphics, natural language processing, multimedia systems, and machine learning. Overall, the group conducts research related to the generation, processing, understanding, display, interaction, transmission and storage of multimedia information. The group is active in both basic and systems research, including work with industrial impact. Members are active in international professional activities, including chairing major international conferences, sitting on editorial boards and serving on technical programme committees. Locally, group members participate in various national-level technical committees, with one chairing a national-level research funding board.


Multimedia Information Processing

Analysis of multimedia contents has in the past been carried out on the basis of a single medium. Only recently have we begun to use multimodal features routinely to analyse multimedia contents. The use of only intra-content features, however, is still inadequate. To progress further, we need to utilise the available external information sources, such as the abundance of Web, ontology, and human language resources (dictionaries, encyclopaedia), and various encoded knowledge, to supplement media content analysis.

Along this theme, we have carried out research on concept extraction, retrieval and question answering in video, working on both news and informational videos. The long-term goal is to develop automated techniques to index an input video stream to facilitate the retrieval, summarisation and personalisation of video. In concept annotation, which involves assigning one or more pre-defined concepts to input video clips, we are focusing on employing hierarchical concept structures and developing a visual vocabulary to perform multimodal concept annotation. The use of a visual dictionary along with a text vocabulary has been found to be effective on public image corpora.

In retrieval, our system exploits domain models for news, together with speech (in terms of Automatic Speech Recognition or ASR output) and various audiovisual (AV) features inherent in news video streams. Our modelling incorporates query analysis, a topic-dependent information fusion model, and the integration of text and visual concepts to identify precise answers. In addition, we explore the modelling and detection of events of human interest through their relations to people, time and space, leveraging vast external information sources such as news and blogger sites. We have participated in the large-scale public news video retrieval evaluations organised by TRECVID, and achieved top positions in the auto-news video retrieval tasks in 2005-06. We are developing an interactive system that incorporates active learning and facilitates fast user feedback.
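As a rough illustration of what a topic-dependent fusion of text and visual evidence can look like, the sketch below combines per-modality relevance scores with weights that depend on the query topic. It is a minimal example under stated assumptions, not the group's retrieval model: the topic names, modality names, weight values and the functions `fuse_scores` and `rank_shots` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical per-topic modality weights; the values are illustrative only.
TOPIC_WEIGHTS: Dict[str, Dict[str, float]] = {
    "person": {"asr_text": 0.5, "visual_concept": 0.2, "face": 0.3},
    "sports": {"asr_text": 0.3, "visual_concept": 0.6, "face": 0.1},
}


def fuse_scores(topic: str, modality_scores: Dict[str, float]) -> float:
    """Weighted linear fusion of per-modality relevance scores in [0, 1]."""
    weights = TOPIC_WEIGHTS[topic]
    total = sum(weights.values())
    # A modality with no score for this shot contributes zero.
    fused = sum(w * modality_scores.get(m, 0.0) for m, w in weights.items())
    return fused / total


def rank_shots(topic: str, shots: Dict[str, Dict[str, float]]) -> List[str]:
    """Return shot ids sorted by fused score, highest first."""
    return sorted(shots, key=lambda sid: fuse_scores(topic, shots[sid]), reverse=True)


if __name__ == "__main__":
    shots = {
        "s1": {"asr_text": 0.9, "visual_concept": 0.1, "face": 0.0},
        "s2": {"asr_text": 0.2, "visual_concept": 0.9, "face": 0.0},
    }
    print(rank_shots("person", shots))
    print(rank_shots("sports", shots))
```

The same two shots are ranked differently for the two topics, which is the point of making the fusion weights depend on the query analysis rather than fixing them globally.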

We apply a similar multimodal, multi-source and multiresolution framework to the personal media and painting domains. In personal media, we focus on auto annotation of "who" and "exact where" by exploiting visual and social contexts. We explore the annotation of paintings with high-level artistic concepts using arts ontologies and Web resources. The key research issues explored in these projects are: (a) ontology-based learning; and (b) transductive learning, as training data is scarce.



Multimedia Analysis and Synthesis

Analysis: A diversity of media types such as text, audio, video and novel sensory forms is proliferating in a variety of applications. This is true of traditional media sources such as television, new media sources such as the Internet, consumer applications such as personal media management, and emerging niche areas such as surveillance. This calls for media analysis, which is the precursor of other forms of processing such as archiving, querying, retrieval and transcoding. For example, the ability to analyse and fuse information from different sources and in multiple media types in order to parse the semantic events occurring has become critically important in many media management tasks. Handling live data (be it symbolic text feed or signal sensor feed) is particularly challenging in this environment. We have developed an experiential sampling technique as well as information assimilation techniques for this purpose. Most of the analysis work has been done in the multimedia surveillance context. There are two basic thrusts:

Active Multimedia Sensing: This thrust aims to develop the theoretical foundation, algorithms, architectures and prototypes for active sensing for a wide variety of applications such as surveillance, monitoring and the Web. The idea here is to harness the power of diverse sensors in an orchestrated manner to optimally perform the given tasks. For example, we have developed a "coopetitive" interaction approach, which combines the salient features of cooperation and competition with the aim of optimising the cooperation among sensors to achieve the best results at the system level, rather than redundantly implementing cooperation at each stage. For this, we employ a forward state estimation method based on Model Predictive Control to counteract various delays in multi-sensor environments. Our results from two different visual surveillance adaptations, with different numbers of cameras and different surveillance goals, provide clear evidence of the improvements achieved with the coopetitive strategy.

Multimedia Event Detection and Capture through Sensors and Context: Here, we operate on the recognition that humans tend to organise their lives around events. Our work centres on developing event detection techniques with the use of all appropriate sensors. So far, we have considered specific event detection using multimodal information with human assistance as may be suitable in specific contexts. Most of the work has been done in the context of building personal media collections in the form of family e-chronicles.



Synthesis: A well-produced video makes a strong impression on the viewer. However, amateur home video makers are unaware of the principles of cinema grammar. The videos they shoot are meant to convey some specific intent, although often, their inexperience and lack of editing skills, or the limitations of the means of video capture they have at their disposal belie their intentions. Intent delivery techniques, which rely on principles of cinema grammar, aesthetics and video analysis, may remedy the situation by conveying a range of evident intents. We have developed a general approach for video intent delivery by means of a brief catalogue of the intentions of the cinematographer and the editor. It allows for the delivery of four basic types of intents: cheer, serenity, gloom and excitement. Essentially, we have used the theoretical support from video grammar and cinematic rules in order to model computable features for repurposing home videos.

We have also developed a transcoding technique called multimedia simplification which is based on experiential sampling. Multimedia simplification helps optimise the synthesis of Multimedia Messaging Service (MMS) messages for mobile phones. Transcoding is useful in overcoming the limitations of compact devices. The proposed approach aims at reducing the redundancy in multimedia data captured by multiple types of media sensors. The simplified data is first stored into a gallery for further usage. Once a request for MMS is received, the MMS server makes use of the simplified media from the gallery. The multimedia data is aligned to the time-line for MMS message synthesis. Our technique is targeted at users who are interested in obtaining salient multimedia information via mobile devices.


Computer Vision

Our research in computer vision spans many areas, including biometrics and face recognition, human motion analysis, medical image analysis, and digital photography. In biometrics, we have developed techniques to make face recognition more robust against changes in lighting and head orientation. These are the two most prevalent variations in face images that make recognition difficult. We can also deal with low-quality images affected by blur, low contrast, or partial occlusion. We have recently demonstrated the usefulness of combining multiple modalities, e.g., face, fingerprint, and keystroke dynamics, to create a system that continuously authenticates a user after initial login. Such a system is useful for high security environments.



In digital photography, we are developing the next generation of "picture perfect" cameras that will eliminate common problems such as red eye, blurring, uneven exposure, and insufficient ambient light. The goal is to create a consumer camera that produces sharp, detailed and pleasing photographs directly from hardware, without the need to touch up the images. To do this, we exploit near-infrared light, which is already captured by normal CCD sensors, and detect the presence of faces and known objects in the scene.

In human motion analysis, we have developed a sophisticated offline system for comparing the motion of a sports novice, captured in a single video, against the 3D reference motion of an expert. The system performs temporal alignment of the novice motion in the video and the 3D reference motion, and then computes the difference between the novice posture and the corresponding 3D reference posture. The posture differences are visualised and fed back to the novice so that the novice can correct the motion as though a human coach were present. We have applied this system to Taichi and golf swing coaching. At present, we are collaborating with an IT entrepreneur to build a prototype system for remote golf swing coaching. This prototyping effort is supported by the Economic Development Board (EDB) in Singapore and is showcased in EDB's publicity brochure. We are in the process of filing patents for this technology.
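The temporal alignment and posture comparison steps described above can be pictured with dynamic time warping (DTW), a standard alignment technique. The text does not say which alignment method the system uses, so DTW is an assumption here, as is the simplification that both the novice and the expert motions have already been reduced to comparable per-frame feature vectors. The function names `dtw_align` and `posture_differences` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one posture, e.g. a flattened vector of joint coordinates


def dtw_align(novice: Sequence[Frame], expert: Sequence[Frame]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two posture sequences; return (total cost, list of matched frame pairs)."""
    n, m = len(novice), len(expert)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(novice[i - 1], expert[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover which novice frame matches which expert frame.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


def posture_differences(novice: Sequence[Frame], expert: Sequence[Frame],
                        path: List[Tuple[int, int]]) -> List[float]:
    """Per-pair posture distance along the alignment path."""
    return [math.dist(novice[i], expert[j]) for i, j in path]


if __name__ == "__main__":
    # The novice lingers on the first posture; the expert does not.
    novice = [[0.0], [0.0], [1.0], [2.0]]
    expert = [[0.0], [1.0], [2.0]]
    total, path = dtw_align(novice, expert)
    print(total, path, posture_differences(novice, expert, path))
```

In a coaching setting, the per-pair differences along the path are what would be visualised, since they compare each novice posture with the expert posture at the corresponding phase of the movement rather than at the same timestamp.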



In medical image analysis, we have collaborated with medical doctors on a wide range of applications, including detection of bone fractures in x-ray images (with Singapore General Hospital); segmentation of the liver in abdominal CT images for liver transplant (with National University Hospital); detection, classification and quantification of acne and pigmentation in face images (with National Skin Centre); early detection of infarcts in brain CT (with Singapore General Hospital); and simulation of cardiac surgery for surgical planning (with National Taiwan University Hospital). An international patent for the fracture detection system has been published under PCT and a U.S. patent has been filed. We are in the process of filing patents for the skin acne analysis system. Most of the above research work is still in progress.



Interactive 3D Computer Graphics and Computational Geometry

The group pursues research mainly in the areas of interactive 3D graphics and computational geometry. Research in the first area includes: interactive control in graphics applications such as morphing; real-time modelling and rendering; large scale data management to support visual simulation & animation; GPU computation; and adaptive gaming agents with personality modelling. 

Research in the area of computational geometry includes: (a) quality and homeomorphic meshes for smooth surfaces used in molecular science, engineering simulation and general deformation in computer graphics; (b) parametric deforming mesh scheduling for automatic deformation with arbitrary topology changes; and (c) deformation and meshing of R3 surfaces, shape alignments, approximation of convex shapes, and relationships between toric surfaces and Dixon sparse resultants.

This work is relevant to building design, architectural visualisation, visual simulation and gaming. The group is collaborating with building professionals to investigate computational problems in building design. Its work on real-time generation of shadows, Trapezoidal Shadow Maps (TSM), has been licensed to game companies (such as Big Huge Games Inc., USA). The algorithm has also been bundled in the SDK of nVidia (a market leader in video display chips), and used in a plant rendering software package (USA). Its work on real-time Voronoi diagrams has been used in software developed at the Harvard-MIT Division of Health Sciences and Technology.



3D Digital Reconstruction of Real-world Objects & Environments

Using active range sensing to reconstruct high-quality 3D digital models of real-world objects and environments has many practical applications in areas ranging from manufacturing to the cultural sector. Yet even now, the acquisition and reconstruction processes are very tedious and laborious, not to mention the huge amount of data that needs to be processed. The ultimate goal of many research projects in this area is to automate the whole 3D reconstruction process, or to reduce human involvement. To create visually realistic models, colour information is acquired using digital still cameras and video cameras.

Our current research focuses on: (a) View planning -- to determine a sequence of positions to place acquisition devices to optimise acquisition and obtain a more complete reconstructed model. (b) View-dependent colour acquisition and automatic registration -- to create a visually realistic 3D digital model with colour information. (c) Real-time hand-held range scanning -- to register, reconstruct and display range data in real time to provide effective visual feedback to the human operator.


Digital Audio Processing and Media Integration

We focus on applied audio research in the context of real life applications. We have produced the Interactive Digital Violin Tutor (iDVT), which mainly focuses on music transcription. Exploiting synergy between individuals in the School and collaborating with colleagues at local research institutes, we are developing practical multimedia applications which integrate audio into other media types such as text, video and animation. We have developed LyricAlly, which is designed as a useful tool for mobile entertainment in the form of portable karaoke.


Multimedia Systems

Our research aims to improve resource efficiency and playback quality in a multimedia system. As these are often conflicting goals, one general theme of our research is to develop new techniques that operate at the right trade-off point between resource and quality. In this direction, we are currently investigating the trade-off between bandwidth and correctness of a distributed video surveillance system, and between power consumption and frame-rate rendering in a first-person shooting game. Another general research theme deals with the Internet's lack of service guarantee during media streaming. Our interest covers fundamental issues, such as error control and congestion control, as well as new approaches to streaming, such as multi-source streaming and peer-to-peer streaming. We are also investigating the streaming of 3D objects, a new media type that is becoming popular. Our team is collaborating with IRIT-ENSEEIHT, France, on this topic.



Media Streaming

Low-latency, Interactive Peer-to-peer Streaming

Peer-to-peer (P2P) streaming is emerging as a viable communications paradigm. Research in this field has traditionally aimed at building efficient and optimal overlay multicast trees at the application level. However, much of the existing work focuses on store-and-forward or one-way stream delivery. In these scenarios, end-to-end latency is not very critical. In our work, we focus on peer-to-peer streaming support for live, interactive applications. While some applications exist in this space (e.g., Skype), most are targeted at relatively small groups of participants per session (say two to eight).

The aim of our Adaptive Cluster Technology for Interactive Virtual Environments (ACTIVE) protocol is to enable interactive communications for large participant groups as can be found in a number of applications such as e-learning and Massively Multiplayer Online Games (MMOG). Some of the novel concepts introduced by ACTIVE are the distinction between active and passive participants and a dynamically adaptive clustering mechanism based on this classification. In virtual environments, latency optimisation is performed based on the location proximity of avatars within the virtual space. By leveraging the location information, the ACTIVE platform can also be used to deliver positional audio to create a more realistic aural landscape.

ACTIVE has been the foundation of a number of experimental prototype systems such as AudioPeer, a voice chat application for large participant groups (see Figure 11(a)). More recently, ACTIVE has been used to add peer-to-peer voice services to the Torque game engine and a sample game called PartyPeer has been created (Figure 11(b)).



Wireless Ad Hoc Media Streaming

With the widespread availability of handheld devices that are both media-capable and wirelessly networked, streaming audio or video content between such units is feasible. Many recent mobile devices can operate via wireless 802.11 networks, which provide broadband-level bandwidth (usually free of charge) and a communication range of hundreds of metres. This allows a user to move freely while she is streaming multimedia content from others within her communication radius (see Figure 12(a)). One challenge in streaming multimedia content among mobile ad hoc peers is to deliver the content, usually large in size, over a wireless link whose quality is constantly changing. For example, the wireless bandwidth may drop or the link may even break as the distance between two mobile ad hoc peers increases. Our research in this area focuses on link availability prediction to improve the quality of peer streaming.

Predicting future link availability under different movement patterns of the devices is a challenging task (Figure 12(b)). Our work mathematically models the link status given certain mobility models (e.g., random walk and random waypoint models), location and speed information (e.g., obtained via GPS), and realistic network bandwidth as determined by the Auto-Rate Fallback (ARF) scheme of 802.11-based wireless equipment. Additionally, our technique takes advantage of the multi-layer structure of Scalable Video Coding (SVC) or Multiple Description Coding (MDC) to increase the success of media delivery under varying conditions.
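To give a feel for the kind of calculation involved in link availability prediction, the sketch below computes how long two peers stay within radio range under a deliberately simplified assumption: both move in a plane with constant, known velocities (for example, from GPS readings). This is not the group's model, which considers random walk and random waypoint mobility and ARF-determined bandwidth; the function name `link_expiration_time` and its parameters are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def link_expiration_time(p1: Vec, v1: Vec, p2: Vec, v2: Vec, radio_range: float) -> float:
    """Seconds until the distance between two peers exceeds radio_range.

    Positions are in metres and velocities in metres per second.
    Returns 0.0 if the peers are already out of range and math.inf if they
    never separate (identical velocities while in range).
    """
    # Relative position and relative velocity of peer 2 with respect to peer 1.
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    wx, wy = v2[0] - v1[0], v2[1] - v1[1]

    # Solve |d + w t|^2 = r^2, i.e. a t^2 + b t + c = 0, for the larger root.
    a = wx * wx + wy * wy
    b = 2.0 * (dx * wx + dy * wy)
    c = dx * dx + dy * dy - radio_range ** 2

    if c > 0:
        return 0.0
    if a == 0:
        return math.inf
    disc = b * b - 4.0 * a * c  # non-negative because a > 0 and c <= 0
    return (-b + math.sqrt(disc)) / (2.0 * a)


if __name__ == "__main__":
    # Peer 2 starts 50 m away and moves off at 10 m/s; the range is 100 m.
    print(link_expiration_time((0, 0), (0, 0), (50, 0), (10, 0), 100))
```

A streaming peer could compare such an estimate with the remaining transfer time of a media segment to decide whether to start the transfer, lower the number of layers requested, or look for another source.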

Some of our techniques have been implemented in a prototype called MStream (Figure 12(c)), which was demonstrated as a US Finalist project of the ImagineCup 2006 competition at Microsoft in Redmond, Washington.


Natural Language Processing

Research on natural language processing includes the areas of semantic processing, discourse processing, and Chinese language processing. In semantic processing, we exploit parallel texts and semi-supervised learning to scale up word sense disambiguation, which is the task of determining the correct meaning or sense of a word in context. We also employ automated techniques to estimate sense priors for adapting a word sense disambiguation program to different domains. We have also built state-of-the-art semantic role labelling programs for PropBank and NomBank, which identify the semantic role of each constituent in a sentence.
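To make the idea of adapting a classifier through sense priors concrete, the sketch below shows one well-known way to do it: rescale each posterior by the ratio of the new-domain prior to the training prior, and estimate the new-domain prior from unlabelled instances with a simple EM loop. The text does not state which estimation technique is used, so this is an assumption; the function names and the example senses are hypothetical.

```python
from typing import Dict, List

Dist = Dict[str, float]  # sense label -> probability


def adapt_posteriors(posterior: Dist, source_prior: Dist, target_prior: Dist) -> Dist:
    """Rescale a classifier posterior from the source prior to the target prior."""
    scaled = {s: posterior[s] * target_prior[s] / source_prior[s] for s in posterior}
    z = sum(scaled.values())
    return {s: v / z for s, v in scaled.items()}


def estimate_priors_em(posteriors: List[Dist], source_prior: Dist, iterations: int = 20) -> Dist:
    """Estimate target-domain sense priors from unlabelled classifier outputs."""
    prior = dict(source_prior)
    for _ in range(iterations):
        # E-step: adapt every posterior using the current prior estimate.
        adapted = [adapt_posteriors(p, source_prior, prior) for p in posteriors]
        # M-step: the new prior is the average adapted posterior.
        prior = {s: sum(a[s] for a in adapted) / len(adapted) for s in source_prior}
    return prior


if __name__ == "__main__":
    source = {"finance": 0.5, "river": 0.5}
    # Outputs of a hypothetical classifier for the word "bank" on new-domain text.
    outputs = [{"finance": 0.2, "river": 0.8}] * 9 + [{"finance": 0.8, "river": 0.2}]
    target = estimate_priors_em(outputs, source)
    print(target)
    print(adapt_posteriors({"finance": 0.6, "river": 0.4}, source, target))
```

With a prior that strongly favours one sense in the new domain, a borderline posterior from the original classifier can flip to the other sense after rescaling, which is how the adaptation changes disambiguation decisions without retraining.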



In discourse processing, we work on the task of English one-anaphora resolution, using a machine learning approach. In Chinese language processing, we have built a state-of-the-art Chinese word segmenter. In 2005, we participated in the open track of the Second International Chinese Word Segmentation Bakeoff, an international evaluation exercise that compares competing Chinese word segmenters. Our Chinese word segmenter achieved the highest accuracy on three of the four test corpora, and the second highest accuracy on the fourth test corpus. A total of 18 teams participated in the event.

In addition to these areas, we have done peripheral work in other key areas of natural language processing, including machine translation, information extraction and verb analysis. Our work in machine translation has improved the quality of word ordering in Chinese-to-English translation. We have developed methods for building better textual similarity measures for application in NLP processes such as information extraction, summarisation and question answering. Finally, in lexical analysis, we have continued to examine and find automated methods to treat compound verb phrases, in particular, light verb phrases (e.g., make a call) where the verbs play only a licensing role for their arguments.


Precise Information Retrieval and Question Answering

Question answering (QA) aims to find exact answers to users' natural language queries, instead of ranked lists of documents as is done in current search engines. It is a major step towards information retrieval instead of document retrieval.

Our QA system employs a pipeline structure that consists of several modules to get short and precise answers to users' questions. It narrows the search for answers to increasingly fine-grained units by: (1) locating the relevant documents, (2) retrieving passages that may contain the answer, and (3) pinpointing the exact answer from candidate passages. The research focus of our work is three-fold. First, we search the Web for relevant context information to supplement the often inexact query. In particular, we perform semantic clustering of information retrieved from the Web to identify different sub-events and induce different facets of queries in supporting event-based QA. Second, in addition to density-based word matching, we employ discourse, semantic and dependency relations to perform passage retrieval at sentence level. This gives rise to a multi-resolution framework for relation-based precise information retrieval. Third, we develop a document concept lattice model together with definitional patterns and a human interest model to perform task-oriented summarisation.
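The three-stage, coarse-to-fine structure of such a pipeline can be sketched in a few lines. The example below is a toy under stated assumptions and does not reflect the group's system: it ranks documents and sentences by plain term overlap (no discourse, semantic or dependency relations), handles only "when" questions by extracting a four-digit year, and uses made-up documents. All function names and the stopword list are hypothetical.

```python
import re
from typing import List, Optional, Set

# A tiny, hypothetical stopword list for the toy example.
STOPWORDS = {"the", "a", "an", "of", "in", "was", "is", "when", "who", "did", "what", "by", "to"}


def terms(text: str) -> Set[str]:
    """Lower-cased content words of a text."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def overlap(query: str, text: str) -> int:
    """Number of content words shared by the query and the text."""
    return len(terms(query) & terms(text))


def retrieve_documents(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Stage 1: keep the k documents sharing the most terms with the query."""
    return sorted(docs, key=lambda d: overlap(query, d), reverse=True)[:k]


def retrieve_passages(query: str, docs: List[str], k: int = 3) -> List[str]:
    """Stage 2: split the documents into sentences and keep the k best matches."""
    sentences = [s.strip() for d in docs for s in re.split(r"(?<=[.!?])\s+", d) if s.strip()]
    return sorted(sentences, key=lambda s: overlap(query, s), reverse=True)[:k]


def extract_year(passages: List[str]) -> Optional[str]:
    """Stage 3: return the first four-digit year found in the ranked passages."""
    for passage in passages:
        match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", passage)
        if match:
            return match.group(1)
    return None


def answer(query: str, docs: List[str]) -> Optional[str]:
    return extract_year(retrieve_passages(query, retrieve_documents(query, docs)))


if __name__ == "__main__":
    docs = [
        "The Acme Observatory opened to visitors in 1950. "
        "The Acme Observatory was founded in 1923 by volunteers.",
        "Golf swing coaching needs patience. Practice daily.",
        "An observatory houses telescopes.",
    ]
    print(answer("When was the Acme Observatory founded?", docs))
```

Both sentences of the first document contain a year, so the answer depends on the passage ranking in stage 2; this is where richer evidence than word overlap, such as the relations mentioned above, would matter most.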

Our studies on the large-scale TREC-QA corpus demonstrate that our approaches are effective in performing factoid, list and definitional QA. Our system has been ranked consistently at second position over three years (2003-2005) in the public TREC-QA evaluations organised by NIST, USA. Our summarisation system also achieved top position in the DUC forum in 2005. Our technology has been licensed by industry to perform precise legal search. Currently, we employ our multi-resolution relation-based framework for information extraction. We are also exploring the use of web knowledge and ontologies to perform interactive QA.

Machine Learning for Media Applications

Our research applies machine learning to the areas of text processing, natural language processing, and signal and video processing for activity recognition. For activity recognition, we have been working with physiological signals from wearable sensors as well as video from fixed cameras. In text classification, our focus has been on developing kernels and features that perform well. In natural language processing, we have been working on word sense disambiguation, utilising unlabelled data in unsupervised and semi-supervised learning. We focus both on developing machine learning techniques that address the issues important in these applications and on doing well on the applications themselves. We participated in the SemEval 2007 evaluation for the word sense disambiguation tasks, and our system ranked first in the lexical sample task and third in the coarse-grained all-words task.


The faculty members involved in media research are:

  • CHANG Ee Chien
  • CHENG Holun
  • CHUA Tat Seng
  • CHIONH Eng Wee
  • FANG Chee Hung, Anthony
  • GOLAM Ashraf
  • KAN Min Yen
  • KANKANHALLI Mohan
  • LEE Wee Sun
  • LEOW Wee Kheng
  • LOW Kok Lim
  • NG Hwee Tou
  • OOI Wei Tsang
  • SIM Mong Cheng, Terence
  • TAN Tiow Seng
  • WANG Ye


© Copyright 2001-08 National University of Singapore. All Rights Reserved