Experiential Sampling in Multimedia Systems

Problems in multimedia systems

We represent this situation in the following definition of the problem. Assume that we are given n synchronized data streams S1, S2, …, Sn that belong to the space of multimedia data streams. These streams carry K types of data, such as image sequences, audio streams, motion detector output, annotations, symbolic streams, and any other type that is relevant and available. Metadata MD1, MD2, …, MDn for each of the streams is also available in the context of the environment. This metadata may include the location and type of the sensor, viewpoint, angles, camera calibration parameters, or any other similar parameters relevant to the data stream. Because a raw data stream is usually not directly useful, feature detectors must be applied to each data stream to obtain the features that are relevant in the current environment. A feature that is computed over a time interval is considered detected at the end of that interval.
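The data environment described above can be sketched as a small data model. This is only an illustration under stated assumptions: the names DataStream, Feature, and mean_over_interval are hypothetical and do not come from the original work, and the averaging detector stands in for whatever feature detector a real system would use. The one rule taken directly from the text is that an interval-based feature is stamped with the end of its interval.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Feature:
    name: str
    value: float
    timestamp: float  # time at which the feature counts as detected


@dataclass
class DataStream:
    kind: str                    # one of the K data types, e.g. "image", "audio"
    metadata: Dict[str, object]  # MD_i: sensor location, type, calibration, ...
    samples: List[Tuple[float, float]] = field(default_factory=list)  # (time, value)


def mean_over_interval(stream: DataStream, start: float, end: float,
                       name: str = "mean") -> Optional[Feature]:
    """Hypothetical interval-based detector: average of the samples in [start, end]."""
    values = [v for t, v in stream.samples if start <= t <= end]
    if not values:
        return None  # nothing observed in this interval
    # An interval-based feature is considered detected at the end of the interval.
    return Feature(name=name, value=sum(values) / len(values), timestamp=end)


audio = DataStream("audio", {"location": "lobby"},
                   [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
feature = mean_over_interval(audio, 0.0, 1.0)
```

Here the feature is computed from the samples at times 0.0 and 1.0 and carries the timestamp 1.0, the end of the interval.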

Given the above data environment, one faces many interesting problems, including the following, which are directly relevant to the main theme that we wish to address in this paper:

How to focus on the most relevant data in a particular data stream?
How to focus on the most relevant data in multiple correlated data streams?

For the given task, what is the minimum number of data streams required?
How does one sample the data streams? How can one minimize sampling to maximize efficiency?
Can one use alternative data streams to perform the same task at different costs?
Given that M streams are necessary for a given task, how does one combine the information from the data streams?

We believe that determining which data streams are relevant, and which of those streams provide the most relevant information at any given moment, is a very important problem that needs to be addressed and has been ignored in the current literature. Current multimedia systems usually start with the assumption that there is a given set of N data streams and that one must deduce or extract all information from them to build the schema representing the environment. Unfortunately, in most cases N = 1, which makes the task a signal analysis problem rather than a multimedia problem. There are other issues related to semantics and indexing that we do not wish to address here.

Perceptual cycle

Our ideas build on concepts that Neisser introduced in 1976 in his work on the perceptual cycle, a model of how people perceive the world. He presented the idea that a perceiver builds a model of the world by acquiring specific signals and information to accomplish certain tasks in the natural environment. The perceiver continuously builds a schema based on the signals that he has received so far. This schema represents the world as the perceiver sees it at that instant. The perceiver then decides to get more information to refine the schema for accomplishing the task that he has in mind. This sets up the cycle as shown in the following figure. The perceiver gets signals from the environment, interprets them using the current schema, uses the results to modify the schema, uses the schema to decide what further information to get, and continues the cycle until the task is done.
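The loop described above can be written as a short control skeleton. This is a minimal sketch, not Neisser's formulation or the authors' implementation: the function perceptual_cycle and its callback parameters (sense, interpret, update, choose_next, task_done) are hypothetical names, and the toy task (locating a hidden value by halving a range) is chosen only because it makes each step of the cycle visible.

```python
def perceptual_cycle(schema, sense, interpret, update, choose_next, task_done,
                     max_steps=100):
    """Run the sense -> interpret -> update -> decide loop until the task is done."""
    for _ in range(max_steps):
        if task_done(schema):
            break
        query = choose_next(schema)         # the schema directs what to look at next
        signal = sense(query)               # get a signal from the environment
        result = interpret(signal, schema)  # interpret it using the current schema
        schema = update(schema, result)     # modify the schema with the result
    return schema


# Toy environment (assumption): a hidden integer in [0, 100).
# The schema is the range (low, high) still believed to contain it.
HIDDEN = 37


def sense(query):
    return query, HIDDEN >= query


def interpret(signal, schema):
    query, at_least = signal
    return ("low", query) if at_least else ("high", query)


def update(schema, result):
    low, high = schema
    bound, value = result
    return (value, high) if bound == "low" else (low, value)


def choose_next(schema):
    return (schema[0] + schema[1]) // 2


def task_done(schema):
    return schema[1] - schema[0] == 1


final_schema = perceptual_cycle((0, 100), sense, interpret, update,
                                choose_next, task_done)
```

Each pass narrows the schema, and the schema in turn decides the next query, which is the feedback the cycle relies on.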

Experiential Sampling Definition

Experience is defined as the accumulation of knowledge or skill that results from direct participation in events or activities. Direct participation implies having access to the environment of the event and observing it using all sensory mechanisms available to the perceiver or experiencer. In such an environment, the experiencer aims to build the schema as effectively as possible with minimal effort, that is, to accumulate knowledge in the most efficient way. This goal translates into selecting, at any given time and based on the current schema, the appropriate data streams to pay attention to.

We define
Experiential sampling: the process of identifying the most relevant data stream among the available streams at a given instant to utilize for interpretation to refine the current model of the environment.
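As an illustration of this definition, the selection step can be sketched as picking the stream with the highest relevance score under the current schema. The source does not specify how relevance is measured, so the scoring used here (how far a stream's latest feature value departs from what the schema expects) is purely an assumption, as are the names experiential_sample and surprise.

```python
def experiential_sample(streams, schema, relevance):
    """Return the name of the stream judged most relevant at this instant."""
    return max(streams, key=lambda name: relevance(name, streams[name], schema))


def surprise(name, value, schema):
    # Assumed relevance measure: deviation from the value the schema expects.
    # Streams the schema knows nothing about are scored against 0.0.
    return abs(value - schema.get(name, 0.0))


# Latest feature value per stream (toy numbers) and the schema's expectations.
latest = {"camera": 0.9, "microphone": 0.2, "motion": 0.5}
expected = {"camera": 0.85, "microphone": 0.2, "motion": 0.1}

chosen = experiential_sample(latest, expected, surprise)
```

With these toy numbers the motion stream deviates most from the schema, so it is the one selected for interpretation. Any other relevance function with the same signature could be substituted.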

Experience in multimedia analysis: any information that needs to be specified to characterize the current state of the multimedia system. It includes the current environment, a priori knowledge of the system domain, current goals, and past states.

Although experience and experiential environments are domain dependent and their components are not clear in general, we define three main components as follows:

Current contextual information: the information currently available about the environment that needs to be specified to characterize the current state of the multimedia system with respect to the current goal.
Past experience: the accumulated experience of the multimedia analysis tasks performed in the past.
Goal: the purpose of the current analysis task. It is used to define what the related experiences are and what analysis technique should be employed to accomplish the task.
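The three components can be grouped into a single record, sketched below. The class name Experience, its fields, and the observe method are hypothetical, and representing contextual information as a dictionary of feature values is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experience:
    goal: str                                                  # purpose of the current task
    context: Dict[str, float] = field(default_factory=dict)    # current contextual information
    past: List[Dict[str, float]] = field(default_factory=list)  # past experience

    def observe(self, features: Dict[str, float]) -> None:
        """Archive the current context as past experience, then replace it."""
        if self.context:
            self.past.append(self.context)
        self.context = dict(features)


experience = Experience(goal="track faces")
experience.observe({"motion": 0.1})
experience.observe({"motion": 0.7})
```

After the two observations, the latest features form the current context and the earlier ones have moved into past experience.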

These components are related. The current contextual information can be characterized by features extracted from the visual scene and other accompanying multimedia data (audio, speech, text, etc.). The current goal and prior knowledge provide a top-down approach to analysis. They also determine which features of the visual scene and other accompanying data types should be used to represent the environment. The past experiences encapsulate the experiences up to the current state. These relationships can help us define the experiential environment when we perform multimedia analysis. More importantly, when we consider the experiential environment, the analysis process systematically integrates the top-down and bottom-up approaches by employing the context and history.