Experience based Sampling Technique for Multimedia Analysis

         

Abstract

Voluminous spatio-temporal data make multimedia analysis tasks extremely inefficient and hard to adapt. We present a novel experience based sampling technique that focuses on the analysis task by making use of contextual information and past experiences. On this basis, a sampling based dynamic attention model is built by sensing the experiential environment. Sensor samples gather information about the current environment, and attention samples represent the current state of attention. In our framework, the task-attended samples are inferred from experiences and maintained by a sampling based dynamical system. The multimedia analysis task can then focus on the attention samples only. Moreover, past experiences and the current environment can be used to adaptively correct and tune the attention. As a prototypical multimedia analysis task, we tackle the face detection problem in videos. Face detection is performed only on the attended samples to achieve robust real-time processing. Experimental results are presented to demonstrate the efficacy of our technique. This experience based sampling method appears to be a promising technique for general multimedia analysis problems. Its generality stems from the power of the sampling method, which makes no assumptions about the form of the distribution of attention, which is usually multimodal in nature.

The full paper can be downloaded here (1.56 MB PDF file).

Introduction

 

 

Multimedia processing often deals with spatio-temporal data which have the following attributes:

1. They possess a tremendous amount of redundancy.
2. The data are dynamic, with temporal variations that build up a history.
3. Some data can be live, with real-time processing requirements.
4. They do not exist in isolation: they exist in context with other data. For instance, visual data comes along with audio, music, text, etc.

However, many current multimedia analysis approaches do not fully consider the above attributes, which leads to two main drawbacks: inefficiency and lack of adaptability.

On the other hand, we have solid evidence that humans are superb at dealing with large volumes of disparate data using their senses. Human visual perception in particular is quite successful at understanding the surrounding environment efficiently and with appropriate accuracy.

Therefore, we would like to articulate the following goal for multimedia analysis:

"In an experiential environment, analysis is based on sensing the data from the environment. Based on the observations and experiences, collate the relevant data and information of interest related to the task of the analysis. Thus, the analysis process interacts naturally with the data based on its interests in light of the past experiences of that analysis."

In order to achieve this, we introduce a novel technique called experience based sampling, i.e., sampling multimedia data according to experiences. As shown in Fig. 1, by sensing the contextual information in the experiential environment [1], a sampling based dynamic visual attention model is built to keep the focus on the interest of the current analysis task. Only the relevant samples survive for the final task. These samples capture the most important data. Notably, past samples influence future sampling via feedback. This mechanism ensures that the analysis task benefits from past experience.

 

Fig.1. Experience based sampling technique for multimedia analysis

Experience Based Sampling: The current environment is first sensed by uniform random sensor samples. Based on the experiences so far, the samples of interest are then computed and the irrelevant data are discarded. Spatially, more highly attended samples are given more weight; temporally, attention is controlled by the total number of samples.

There are two types of samples, sensor samples and attention samples, used for sensing the experiential environment and for maintaining the visual attention, respectively.

Sensor samples: They are used to sense the visual environment. Sensor samples are created and scattered randomly in both the x and y directions.

We define S(t) as a set of NS(t) sensor samples at time t, which estimates the state of the multimedia environment. Here s(t) depends on the type of multimedia data: for spatial data, s(t) is the set of spatial coordinates of the sensor samples at time t. These coordinates are generated randomly and uniformly at every time instant. Each sensor sample also carries an associated weight, which expresses its importance.
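The sensor sample set described above can be sketched in a few lines of Python. This is only an illustration: the function name, the 320 x 240 frame size and the uniform starting weight of 1/NS are assumptions, not details taken from the paper. The count of 200 matches the NS value reported in the experiments.

```python
import random

def make_sensor_samples(width, height, n_samples=200, rng=None):
    """Scatter n_samples sensor samples uniformly over a width x height frame.

    Returns a list of (x, y, weight) tuples. Every weight starts at
    1 / n_samples, which is an assumed initial value.
    """
    rng = rng or random.Random()
    weight = 1.0 / n_samples
    return [(rng.randrange(width), rng.randrange(height), weight)
            for _ in range(n_samples)]

# New coordinates would be drawn like this at every time instant.
samples = make_sensor_samples(320, 240, n_samples=200, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the sketch reproducible; a live system would simply omit it.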

Attention samples: They are used to dynamically maintain the visual attention in both spatial and temporal directions.

We represent the dynamically varying set of NA(t) attention samples as A(t). Here a(t) again depends on the type of multimedia data: for spatial data, a(t) is the set of spatial coordinates of the attention samples. Each attention sample also carries an associated weight, which expresses its importance. The spatial visual attention is represented by the spatial coordinates of the attention samples and their weights, while the temporal visual attention is described by the total number of attention samples NA(t).

     

Proposed method

 

             

From the above face detection example, it is clear that our key technique for representing the analysis task's knowledge about the experiential environment is a sampling based visual attention model.

Previous saliency map based visual attention models are image based and do not provide a mechanism to evolve and adapt attention dynamically. In contrast, our sampling framework naturally expresses the dynamics of attention (focus of consciousness) of a system. What is particularly appealing is that the attention states as well as the state transitions are captured as a closed-loop feedback system.

Sampling based visual attention
This figure shows the difference between our sampling based method and the saliency map approach when modeling motion attention.

  

(a) Original frame   (b) Motion represented by the saliency map

(c) Motion represented by the samples   (d) Motion attention by the sampling method

Fig. 2. Sampling based method vs. saliency map based method for motion attention

Notation: Sensor samples are marked as blue points and attention samples as yellow points. The size of a point indicates the confidence (weight) of that sample. The numbers of sensor samples and attention samples are shown at the top of each frame. The red bar shows the visual attention in the x direction; it is calculated from the weights of the attention samples.

Explanation: (a) The original frame shows two persons moving. (b) A motion saliency map calculated from the difference of two neighboring frames. This saliency map based approach needs to compute all the pixels in the frame and is static (it cannot be maintained dynamically). (c)-(d) Motion attention is maintained by the attention samples. Only 232 pixels are computed to represent this motion attention. Visual attention can be maintained dynamically since each attention sample evolves according to its own dynamics. The red bar in (d) shows that only 232 attention samples are required to represent the two motion attention regions quite well in the x direction (our approach does not need to maintain the saliency map to calculate the samples).
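To make the contrast with a full saliency map concrete, the sketch below evaluates the frame difference only at the sample coordinates instead of at every pixel. It is an illustration under assumptions: frames are plain lists of grey-level rows, and the helper name and the normalisation of the weights are hypothetical choices rather than the paper's formulation.

```python
def sense_motion(prev_frame, curr_frame, sample_points):
    """Absolute frame difference evaluated only at the sample points.

    Frames are lists of rows of grey levels; sample_points is a list of
    (x, y) pairs. Returns (x, y, weight) tuples whose weights sum to 1,
    or all-zero weights when nothing changed at the sampled pixels.
    """
    diffs = [abs(curr_frame[y][x] - prev_frame[y][x]) for x, y in sample_points]
    total = sum(diffs)
    if total == 0:
        return [(x, y, 0.0) for x, y in sample_points]
    return [(x, y, d / total) for (x, y), d in zip(sample_points, diffs)]

# Tiny 4 x 4 example: two pixels change between the frames.
prev = [[0] * 4 for _ in range(4)]
curr = [[0] * 4 for _ in range(4)]
curr[2][1] = 10   # pixel at x=1, y=2
curr[0][3] = 30   # pixel at x=3, y=0
sensed = sense_motion(prev, curr, [(1, 2), (3, 0), (0, 0)])
```

The cost grows with the number of samples rather than with the frame size, which is the point made in the explanation of Fig. 2.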

   

Experiments for motion attention

 

    


1. Spatial Motion Attention

Traffic monitoring sequence.

  

Frame 3: NA = 0, NS = 200
Frame 191: NA = 272, NS = 200
Frame 193: NA = 147, NS = 200
Frame 203: NA = 345, NS = 200
Frame 212: NA = 238, NS = 200
Frame 243: NA = 0, NS = 200
Frame 256: NA = 185, NS = 200
Frame 342: NA = 0, NS = 200

Fig. 3. Spatial motion attention in the traffic monitor sequence

Notation: Sensor samples are marked as blue points and attention samples as yellow points. The size of a point indicates the confidence (weight) of that sample. The numbers of sensor samples and attention samples are shown at the top of each frame. The red bar shows the visual attention in the x direction; it is calculated from the weights of the attention samples. NS is the number of sensor samples and NA is the number of attention samples.

Explanation: This figure illustrates the spatial visual attention inferred from motion experience. The attention samples represent the motion attention quite robustly. As the red bar at the bottom of each frame shows, they track the evolution of the spatial motion attention accurately.
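The red bar is described only as being calculated from the weights of the attention samples. One plausible way to obtain such an x-direction profile is to accumulate the weights into column bins, as sketched here. The binning scheme, the bin count and the function name are assumptions made for illustration.

```python
def x_attention_profile(attention_samples, width, n_bins=16):
    """Accumulate attention-sample weights into n_bins columns across the frame.

    attention_samples is a list of (x, y, weight) tuples; the y coordinate
    is ignored because only the x direction is summarised.
    """
    profile = [0.0] * n_bins
    for x, _y, weight in attention_samples:
        bin_index = min(int(x * n_bins / width), n_bins - 1)
        profile[bin_index] += weight
    return profile

# Two samples on the left of a 320-pixel-wide frame, one on the right.
profile = x_attention_profile(
    [(10, 5, 0.5), (12, 40, 0.25), (300, 7, 0.25)], width=320, n_bins=4)
# profile == [0.75, 0.0, 0.0, 0.25]
```

Two separate peaks in such a profile would correspond to two attention regions, as in panel (d) of Fig. 2.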

You can download the test clip: MPEG format or AVI format (needs the DivX codec).

The original video clip: MPEG format

 

2. Temporal Motion Attention


This graph shows the temporal motion attention.

Fig. 4. Temporal motion attention

Explanation: This figure illustrates the temporal motion attention maintained by our proposed method. Temporal visual attention describes the attention at a particular time (in contrast to spatial attention). It can be inferred from the experiential environment. In our proposed method, we use the total number of attention samples at a particular time to model this attention. The higher the temporal attention at a given time, the larger the total number of attention samples, i.e., the more analysis is performed.
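The relation between temporal attention and the number of attention samples can be illustrated with a simple mapping. The page does not give the actual rule, so the linear mapping and the threshold below are assumptions; the cap of 1000 is borrowed from the largest NA value reported in Fig. 8.

```python
def num_attention_samples(attention_level, max_samples=1000, threshold=0.05):
    """Map a temporal attention level in [0, 1] to a sample count NA.

    Levels below the (assumed) threshold produce no attention samples,
    so no further analysis is triggered for that frame.
    """
    level = min(max(attention_level, 0.0), 1.0)  # clamp to [0, 1]
    if level < threshold:
        return 0
    return round(level * max_samples)

counts = [num_attention_samples(a) for a in (0.0, 0.02, 0.5, 1.7)]
# counts == [0, 0, 500, 1000]
```

With a rule of this kind, a static scene costs only the sensing step, while a busy scene receives proportionally more analysis.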

You can download the test clip: MPEG format or AVI format (needs the DivX codec).

The original video clip: MPEG format

 

Experiments for real time face detection

 

 

Fig. 5 shows the face detection procedure when using our proposed method. First, 200 sensor samples are created randomly (blue points). As shown in the figure, they are scattered uniformly in both the x and y directions. They are employed to sense both motion and skin color attention. Based on those sensor samples, the attention samples are created by importance sampling, so that more attentive regions are given more attention samples. As shown in Fig. 5(a), most of the attention samples are localized on the human face. The final face detector (here we use the AdaBoost face detector proposed by Paul Viola and Michael Jones, 2001) is applied only at the attention samples. As shown in Fig. 5(b), a face is detected by performing face detection analysis only 759 times and feature extraction (motion and skin color) only 200 times. This greatly reduces the computational complexity compared with traditional analysis, which performs pervasive, non-focused computation.
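The importance sampling step mentioned above can be sketched as weighted resampling of the sensor samples followed by a small random displacement. This is a generic textbook form offered as an assumption; the Gaussian jitter, its spread and the equal output weights are illustrative choices, not the paper's exact procedure.

```python
import random

def importance_sample(sensor_samples, n_attention, spread=5.0, rng=None):
    """Draw n_attention attention samples from weighted sensor samples.

    sensor_samples is a list of (x, y, weight). Positions are picked with
    probability proportional to weight and then jittered with Gaussian
    noise. The sketch does not clip positions to the frame borders.
    """
    rng = rng or random.Random()
    weights = [w for _x, _y, w in sensor_samples]
    if n_attention <= 0 or sum(weights) <= 0:
        return []  # nothing attended: no attention samples are created
    picks = rng.choices(sensor_samples, weights=weights, k=n_attention)
    new_weight = 1.0 / n_attention
    return [(x + rng.gauss(0.0, spread), y + rng.gauss(0.0, spread), new_weight)
            for x, y, _w in picks]

# Only the first sensor sample saw anything, so all attention lands there.
sensors = [(100, 50, 1.0), (10, 10, 0.0), (300, 200, 0.0)]
attention = importance_sample(sensors, n_attention=20, spread=0.0,
                              rng=random.Random(1))
```

Regions with heavier sensor weights receive proportionally more attention samples, which is the behaviour the paragraph above describes.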

Face detection procedure

This figure shows the procedure of face detection using our experience based sampling technique.

          

(a) Proposed experience based sampling technique   (b) Face detection results

 Fig. 5. Face detection procedure

Notation: Sensor samples are marked as blue points and attention samples as yellow points. The size of a point indicates the confidence (weight) of that sample. The numbers of sensor samples and attention samples are shown at the bottom of each frame. The red bar shows the visual attention in the x direction; it is calculated from the weights of the attention samples.

Explanation: Fig. 5(a): Visual attention is maintained by the 795 attention samples. Experiences (motion and skin color in the visual environment) are sensed and fused by 200 sensor samples. Attention samples are obtained by importance sampling from the sensor samples. The result indicates that the current attention regions are localized on the human face. Fig. 5(b): The final fine analysis (face detector) is performed only on the attention samples. This greatly reduces the computational complexity compared with traditional analysis, which performs pervasive, non-focused computation.

Temporal and spatial visual attention for face detection

This figure shows how spatial and temporal visual attention is modeled by the experience based sampling technique, oriented by the face detection task.

 

(a) Frame 76: no attention. Sensor samples: 200, attention samples: 0
(b) Frame 81: a moving chair (some attention). Sensor samples: 200, attention samples: 414
(c) Frame 106: chair stopped (no attention). Sensor samples: 200, attention samples: 0
(d) Frame 120: a person comes (high attention). Sensor samples: 200, attention samples: 791
(e) Frame 147: one person (high attention). Sensor samples: 200, attention samples: 791
(f) Frame 268: static frame (less attention). Sensor samples: 200, attention samples: 2

Fig.6. Proposed experience based sampling technique for face detection

Notation: Sensor samples are marked as blue points and attention samples as yellow points. The size of a point indicates the confidence (weight) of that sample. The numbers of sensor samples and attention samples are shown at the bottom of each frame. The red bar shows the visual attention in the x direction; it is calculated from the weights of the attention samples.

Explanation: The number of sensor samples, NS, is set to 200. The number and spatial distribution of the attention samples change dynamically according to the face attention. In (a), there is no motion in the frame, so NA, the number of attention samples, is zero and no face detection is performed. In (b), when a chair enters, it alerts the motion sensor and attention is aroused. NA increases to 414, and face detection is performed on the 414 attention samples, but the face detector verifies that there is no face there. In (c), as the chair stops, there is no motion and so the attention samples vanish. In (d)-(h), attention samples follow the face until the face vanishes.

You can download the captured test sequence: MPEG format or AVI format (needs the DivX codec).

 

 

1. Experiences of motion and skin color

This video shows face detection using our experience based sampling technique.         

   

(a) Frame 6: no motion attention, NA = 0
(b) Frame 20: one person comes, NA = 110
(c) Frame 59: two persons, NA = 743
(d) Frame 73: one person, NA = 479
(e) Frame 102: one person, NA = 315
(f) Frame 149: no motion attention, NA = 0

Fig. 7. Face Detection using experience of skin color and motion

Notation: Sensor samples are marked as blue points and attention samples as yellow points. The size of a point indicates the confidence (weight) of that sample. The numbers of sensor samples and attention samples are shown at the bottom of each frame. The red bar shows the visual attention in the x direction; it is calculated from the weights of the attention samples.

Explanation: (1) Sensor samples are used to obtain the visual attention from the motion and skin color experiences. (2) Attention samples are used to maintain and propagate the spatial visual attention. The temporal visual attention is maintained by the number of attention samples: more attention samples in the current frame mean more temporal attention in this frame. (3) The final analysis (face detector) is performed on the attention samples. It also gives feedback to the samples: good samples survive and bad samples vanish.
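The feedback in step (3), where good samples survive and bad samples vanish, can be illustrated by reweighting each attention sample with a detector response and dropping the weak ones. The sketch is an assumption about one way to realise this: `detector_score` is a hypothetical callable standing in for the face detector, and the survival threshold is arbitrary.

```python
def apply_feedback(attention_samples, detector_score, min_weight=1e-3):
    """Reweight attention samples by detector feedback and drop weak ones.

    attention_samples is a list of (x, y, weight); detector_score(x, y) is
    a hypothetical callable returning a value in [0, 1]. Weights are
    normalised before the threshold is applied, and the survivors are
    not renormalised again.
    """
    scored = [(x, y, w * detector_score(x, y)) for x, y, w in attention_samples]
    total = sum(w for _x, _y, w in scored)
    if total == 0:
        return []  # the detector rejected everything: all samples vanish
    normalised = [(x, y, w / total) for x, y, w in scored]
    return [s for s in normalised if s[2] >= min_weight]

# Stand-in detector that responds only on the left part of the frame.
left_only = lambda x, y: 1.0 if x < 100 else 0.0
survivors = apply_feedback([(10, 10, 0.5), (200, 10, 0.5)], left_only)
# survivors == [(10, 10, 1.0)]
```

The surviving samples would then seed the attention in the next frame, which is how past experience carries forward.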

Fig. 7 shows that our experience based sampling technique lets face detection interact naturally with the data based on its interests, in light of the past experiences of that analysis and of the visual environment. In (a) and (f), since the 200 random sensor samples do not find any interesting data (according to the current task), no attention samples are aroused, and consequently no face detection is performed. In (b)-(e), the sensor samples find some data of interest to the current task, and attention samples are aroused. Note that the number of attention samples depends on how much attention there is. For instance, the number of attention samples in (c) is 743, more than in (b), (d) and (e), since (c) has two attention areas whereas (b), (d) and (e) have only one. (c) also shows that our sampling technique can maintain more than one attention region.

You can download the captured test sequence: MPEG format or AVI format (needs the DivX codec).

             

2. Experiences of speech. 

   

(a) Frame 2               (b) Frame 10                    (c) Frame 42                    (d) Frame 100

Fig. 8. Speech experience. (a) Speech off: NA = 0. (b) Speech on: NA becomes 1000; face detected. (c) Speech off: NA becomes 711 (feedback from the previous face detection); face detected. (d) Speech on: face detected; NA becomes 1000.

Notation: Sensor samples are marked as blue points and attention samples as yellow points. The size of a point indicates the confidence (weight) of that sample. The numbers of sensor samples and attention samples are shown at the bottom of each frame. The red bar shows the visual attention in the x direction; it is calculated from the weights of the attention samples.

Explanation: Speech in the accompanying audio data is another experience that can be used to understand the visual environment. When speech is on, it indicates the presence of a face: the attention samples are aroused and the face is detected. The speech cue can help to create attention samples when there is no motion attention initially. In addition, as shown in (c), even when speech is off, relevant attention samples still survive because they are given higher weights by the feedback from the previous face detection.

You can download the captured test sequence: MPEG format or AVI format (needs the DivX codec).

3. Previous experience

   

(a)                              (b)                                 (c)                       (d)

Fig.9. Previous experience for updating skin color model

Explanation: Fig. 9(a) is a face under normal light. Fig. 9(b) shows its skin color saliency map calculated by equation (13). Fig. 9(c) and (d) are a shadowed face and its skin color saliency map. Fig. 9(b) and (d) show that the feedback of the face detector can update the skin color model Ht and make it more adaptive to the visual environment. (Note that the skin color saliency map shown in Fig. 9(b) and (d) does not need to be maintained in our method.)
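The update rule for the skin color model is not reproduced on this page, so the following is only a generic sketch of how detector feedback could adapt a color model. It assumes the model is a normalised hue histogram and blends in the pixels of a detected face with a fixed learning rate; the histogram form, the bin count and the rate are all assumptions.

```python
def update_skin_histogram(hist, face_pixels, n_bins=8, rate=0.2):
    """Blend a hue histogram of detected-face pixels into the running model.

    hist is a list of n_bins probabilities; face_pixels holds hue values
    in [0, 1) taken from a region the face detector accepted.
    """
    if not face_pixels:
        return list(hist)  # no detection in this frame: keep the model as is
    observed = [0.0] * n_bins
    for hue in face_pixels:
        observed[min(int(hue * n_bins), n_bins - 1)] += 1.0 / len(face_pixels)
    return [(1.0 - rate) * h + rate * o for h, o in zip(hist, observed)]

# Two-bin toy model pulled toward the low-hue bin by the detected face.
updated = update_skin_histogram([0.5, 0.5], [0.1, 0.2, 0.3, 0.9],
                                n_bins=2, rate=0.5)
# updated is approximately [0.625, 0.375]
```

A running blend of this kind lets the model drift toward the current lighting, for example a shadowed face, without discarding what was learned earlier.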

  

4. Computation load

Fig. 10. Comparison of computation speed

Explanation: We use a USB web camera to perform real-time face detection on a Pentium III 600 MHz laptop. The computation load in this real-time scenario is shown in Fig. 10. In this experiment, curve 1 shows the computation load of AdaBoost face detection alone, while curve 2 shows the computation load of our experience based sampling combined with the AdaBoost face detector. The figure shows that our experience based sampling technique reduces the computation significantly. In addition, the computation load varies over time. When there is no face attention (see frames (a) and (d)), the only process is sensing with the sensor samples. When a face appears (see frames (b) and (c)), the process also includes the attention samples, and consequently the load goes up.

Maintained by Mohan S. Kankanhalli (mohan@comp.nus.edu.sg) and Jun Wang. Copyright © 2003.