Question #7 - Representing Time
7) How should time be handled? Again, drawing on my own work, I have considered manipulations ranging from simple speed variation to dynamic time warping to temporal interval reasoning. Which variations are important to be able to handle?
There is already a wealth of visual representations that will be useful for vision-of-action: deformable templates, motion blobs, higher-order moments, flow, temporal textures, texture tracking, eigen-parameter representations, etc. Many of the immediate challenges now lie on the side of interpretation. Better representations of time and different time-scales [are part of what we need].
It depends on the application. If the properties of mobile objects are true or false at a single time point and a single location, then we can use sophisticated temporal reasoning. For example, in some work robots are monitored as they execute a plan: optical barriers serve as sensors, so the robots' locations are known at each time point. The same holds for applications such as the safety monitoring of planes in airports.
If you try to interpret human activities, you never know when a property starts or whether it is really true. You have to deal with uncertainty and with transitions between actions. At present, the recognition of actions is not robust enough to support complex manipulation of time. As far as I am concerned, I use only a very simple notion of time (e.g., How long has the action lasted? Action 1 starts after action 2 finishes.).
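The two simple temporal notions mentioned here, duration and ordering, are easy to make concrete. A minimal sketch, where the `ActionInterval` class and its values are hypothetical illustrations rather than part of any particular system:

```python
from dataclasses import dataclass

@dataclass
class ActionInterval:
    """A hypothetical labelled time interval for a recognized action."""
    label: str
    start: float  # seconds
    end: float

    def duration(self) -> float:
        """How long has the action lasted?"""
        return self.end - self.start

    def starts_after(self, other: "ActionInterval") -> bool:
        """True if this action begins only after `other` has finished."""
        return self.start >= other.end

# Toy example: a grasp that begins after a reach has finished.
reach = ActionInterval("reach", 0.0, 1.2)
grasp = ActionInterval("grasp", 1.3, 2.0)
assert grasp.starts_after(reach)
```

Even this much suffices to express ordering constraints such as "action 1 starts after action 2 finishes" without committing to any richer interval calculus.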
Our own work, here, uses temporal logic programming to represent and recognize people's actions and interactions, with those logic programs operating on (and controlling the construction of) a database of facts concerning the positions and trajectories of body parts in image sequences.
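To illustrate the flavor of this approach (a toy sketch under assumed representations, not the authors' actual logic programs), a fact database of tracked positions and one temporal rule over it might look like:

```python
# Hypothetical fact database: (frame, predicate, arguments) triples,
# as might be extracted from tracked body parts in an image sequence.
# Positions are 1-D for simplicity.
facts = [
    (0, "position", ("hand", 10.0)),
    (1, "position", ("hand", 6.0)),
    (2, "position", ("hand", 2.0)),
    (0, "position", ("cup", 0.0)),
    (1, "position", ("cup", 0.0)),
    (2, "position", ("cup", 0.0)),
]

def positions(db, obj):
    """Time-ordered positions of `obj` drawn from the fact database."""
    return [x for t, pred, (o, x) in sorted(db)
            if pred == "position" and o == obj]

def approaches(db, a, b):
    """A toy temporal rule: `a` approaches `b` if their distance
    strictly decreases over every recorded frame."""
    pa, pb = positions(db, a), positions(db, b)
    dists = [abs(p - q) for p, q in zip(pa, pb)]
    return all(d2 < d1 for d1, d2 in zip(dists, dists[1:]))

assert approaches(facts, "hand", "cup")
```

The point of the sketch is only the architecture: rules are evaluated against a growing database of time-stamped facts, so recognition becomes querying, and new derived facts can in turn extend the database.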
Such linguistic models also liberate us from the "tyranny of time" that plagues appearance-based methods, which typically employ computational models such as HMMs, dynamic time warping, or autoregressive models to match appearance models to image sequences. It is by now almost dogmatic within the computer vision community that spatial vision is context dependent and opportunistic. Knowing with confidence (either from a priori information or from image analysis) the location of some object in an image provides a spatial context for searching for and recognizing related objects; we would never consider a fixed spatial pattern of image analysis independent of available or acquired context. I believe that the same situation should obtain in temporal vision, and especially in the analysis of human actions and interactions. Our recognition models should predict, from prior and acquired context, actions both forward and backward in time. This idea is classic and commonplace in AI planning systems, which reason both forward and backward and force the plan to pass through certain states known (or determined) to be part of the solution sought. Such approaches help to control the combinatorics of recognition and naturally allow us to integrate prior and acquired knowledge into our analysis.
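For concreteness, dynamic time warping, one of the sequence-matching models the paragraph refers to, can be sketched in a few lines (a textbook version on 1-D feature sequences, not any particular system's implementation):

```python
def dtw(a, b):
    """Classic dynamic time warping distance between two 1-D sequences:
    the minimum total cost of aligning them while preserving order."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # skip a frame of a
                                 d[i][j - 1],      # skip a frame of b
                                 d[i - 1][j - 1])  # advance both
    return d[n][m]

# The same motion performed at two speeds aligns with zero cost:
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
assert dtw(slow, fast) == 0.0
```

The example also makes the criticism concrete: the alignment is computed bottom-up from appearance alone, with no way to inject acquired context about what should happen next.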
Overall, our goal is to extract sensory information that can be used to label certain events in the scene. Capturing and labeling an event may not be exactly what we need for perception of action, since an action could be a series of related events or sometimes just a single event (e.g., walking vs. pointing). This poses a major problem for how to represent actions (or how to define an action) for recognition, and it underlines the importance of time in the representation of actions. For this reason, some sort of spatio-temporal reasoning needs to be incorporated into our representations. It can be incorporated through constructs that change with time, with expectations and probabilities assigned to these constructs so that we can predict the changes and estimate the actions. There are several ways we can attempt to do this.
A possible representation of some stereotyped actions (say, Johannesma-type motions) is to extend to time the morphable models of Jones and Poggio (see also Taylor, Cootes, and Blake). An action then consists of a t-field, defined as the sequence of correspondences between the images in the sequence, that is, the sequence of optical flows computed between successive, overlapping pairs of images. Correspondence between two actions is given by correspondence between their t-fields, which is in principle provided by a single optical flow between the two images of the two sequences at the "same" time t.
A morphable model of a given action is then a linear combination of the t-fields of a few prototypical examples of the action. As in the static case, a morphable model can be used for both analysis and synthesis.
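A minimal sketch of this linear-combination idea, with t-fields reduced to toy lists of per-pixel (dx, dy) flows; the data layout and values are illustrative assumptions, not the cited models:

```python
def combine(tfields, weights):
    """Morphable model of an action as a weighted linear combination of
    the t-fields of a few prototypical examples. Each t-field is a list
    of flows; each flow is a list of (dx, dy) per-pixel displacements."""
    n_flows = len(tfields[0])
    n_pix = len(tfields[0][0])
    out = []
    for f in range(n_flows):
        flow = []
        for p in range(n_pix):
            dx = sum(w * tf[f][p][0] for w, tf in zip(weights, tfields))
            dy = sum(w * tf[f][p][1] for w, tf in zip(weights, tfields))
            flow.append((dx, dy))
        out.append(flow)
    return out

# Two one-pixel, two-flow prototypes (made-up numbers):
proto1 = [[(1.0, 0.0)], [(1.0, 0.0)]]   # e.g., a rightward motion
proto2 = [[(0.0, 1.0)], [(0.0, 1.0)]]   # e.g., an upward motion
blend = combine([proto1, proto2], [0.5, 0.5])
assert blend == [[(0.5, 0.5)], [(0.5, 0.5)]]
```

Analysis would fit the weights to an observed t-field; synthesis, as here, evaluates the combination at chosen weights to produce an intermediate motion.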
There are several open issues in an approach of this type. The first concerns the primitives used for matching: pixels may be used but seem too expensive a representation. Coarser-scale image-based representations may be one solution; another is to compress, say, an initial pixel representation so that it contains only points that move in distinct ways on the image plane. A further issue is related to time warping: it should be possible to establish correspondence between two actions even if their durations differ. This calls for extensions of the usual optical flow algorithms to space-time, which in turn may change our definition of t-fields and the primitives on which they are based.
The basic representation of action patterns is labelled space-time volumes. There are at least three levels in the representation hierarchy, which can be illustrated for the case of the hands. At the first level is simply the statement that the hands are moving. At the second level is the identification of a specific hand gesture, and at the third is the estimation of some parameters of the gesture, such as its velocity, throughout its duration. In each case it should be possible to localize the action in image space, but 3D locations are likely to be application specific (such as the upstage/downstage notation used for dance steps). A notion of spatio-temporal scale space could be useful for action patterns. Time plays a role similar to that of space at this level; the goal is to remain invariant to it.
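The scale-space idea along the temporal axis can be illustrated by Gaussian smoothing of a 1-D signal at a chosen temporal scale (a toy sketch; the signal and scale are made-up values):

```python
import math

def gaussian_kernel(sigma):
    """Normalized discrete Gaussian kernel, the smoothing filter
    underlying a scale-space representation."""
    radius = max(1, int(3 * sigma))
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth(signal, sigma):
    """Smooth a 1-D temporal signal at scale `sigma` (edges replicated):
    a toy version of the temporal axis of a spatio-temporal scale space."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, w in enumerate(k):
            idx = min(max(t + i - r, 0), len(signal) - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

speed = [0, 0, 5, 0, 0]          # an abrupt event in a velocity trace
coarse = smooth(speed, sigma=1.0)
assert max(coarse) < 5           # a coarser temporal scale spreads the event
```

At coarse scales only the slow structure of an action survives; choosing the scale is exactly the question of which temporal detail the representation should be invariant to.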
Action concepts are in turn likely to be composed of a series of action patterns or other lower-level events. Time should play more of an ordering role here, regulating the sequencing of events. It may not be easy to confine high-level actions to a localized time and place. For example, building a house comprises a large collection of activities taking place at different places and times. However, the observer's reference frame obviously plays a role here, as detecting the action "building a house" from satellite imagery is obviously different from doing so from 100 yards away.
While a spatial top-down strategy could draw from earlier work in object recognition, a temporally driven top-down strategy raises issues that have not been encountered previously. The most prominent of these issues are:
(1) What is a temporal window and how are spatial scale-space ideas extended to spatio-temporal analysis?
(2) What are the temporal invariants of activities?
(3) What spatio-temporal representations are effective?
(4) When does time begin? And how important is it?
It is necessary to acknowledge and enquire into the differences between time and space, since time cannot be simply treated as an additional dimension to space.