Question #2 - Action vs Static Images
2) In what ways is perceiving action (i.e. labeling sequences as imagery of a particular action taking place) fundamentally different from dealing with static images? One view is that action labeling is like "old style" object recognition ("that's a chair") as opposed to the now predominant CAD-based view (this is some exactly specified chair-shaped object).
In our categorization, labeling sequences can be considered one kind of visual behavior. What is more important is what meaning the labeling process has in the context of the agent's goal; that is, how the agent can arrive at its own labeling.
In case of "old-fashioned" object recognition such as "that
is chair," we should make clear what kind of meaning the object (human
interpretation might be "a chair.") has for the agent to achieve
its goal rather than seeking for geometrical matching or similar methods.
For example, for four-legged robots like a dog or cat, does the concept
"chair" or similar object exist? It could be, but it might quite
different from ours.
A useful result from a vision system would be an answer to the question, "What is happening?" If we ask a computer to interpret a video signal, we are obliged to develop computational models of what is interesting (events), and of how sets of interesting signals map to interpretations (contexts).
Let me argue briefly that the prospects for vision-of-action are much more
promising than for traditional high-level vision, because for vision-of-action
there is a good model of both events and contexts: causality.
All of the pressing issues in object recognition are present (or will be present) in action understanding as well: issues of view-centered vs. object-centered representation, of what the desired modes of generalization are, of whether to model the statistics of pixels or to capture higher structure.
The classic issues of syntax vs. semantics, functionality, and even knowledge representation are all there. I would hope practitioners in each subfield would be well versed in each other's techniques; the relative bulk and history of the object literature would place the larger burden in this regard on the action recognition/understanding researcher. Perceived lack of prior progress is, of course, no excuse to be ignorant of any relevant literature. Too many of us, myself included, have proffered a 'novel' method or technique which really amounts to a change of domain.
But wait, the spatio-temporal domain does matter! There are differences in technique, and in the phenomenology, of the perception of dynamic and static objects. Few of the current computational approaches to the perception of action exploit what I consider the salient differences, though I hope this will change as the field matures.
At the risk of excessive symmetry in nomenclature, I would say the clearest difference lies in the "action of perception of action". (At least as it relates to the action of perception, or active perception, of static signals). It is only when we consider active perception, with its implicit perceptual limitations and foveal dynamics, that it is easy to see clear differences between static object perception and dynamic action perception. Simply put: with objects, under active perception conditions, we have the luxury of extended viewing conditions.
In a sense, in active object perception we have the potential for full access to the state of the object, since given sufficient viewing time we can scan the perceptual apparatus to an arbitrary set of locations and obtain the highest resolution information everywhere on the object. If we are unsure about what we saw the moment before (was that the letter I or the number 1?), we can, and do, rapidly scan back to the relevant portion of the object. As many of the pioneers of active perception have pointed out, when the world is available to represent itself the brain will likely take advantage of this fact.
But this is impossible with dynamic objects, by definition! If an active perceptual apparatus is attending to a particular spatial portion of the spatio-temporal signal flowing past, some information at other spatial regions will inevitably be missed, and the perceptual system will not in general have any way of recovering that information. Perception of action is therefore a more challenging task, or at least more unforgiving to the errant active observer. It also makes explicit the need for internal representations, insofar as aspects of action can influence interpretation (i.e., context) across relatively long time scales. Whether there is internal representation in static perception may be open for debate, but not so in the perception of action.
There are many interesting perception-of-action tasks that do not require
active perception. Probably most of the short-term progress in new application
areas will be of this character. But I do not see how this endeavor is different
from object recognition in any principled way. If we have full access to
the spatio-temporal signal, there is no difference between understanding
the spatial signal and the spatio-temporal signal. (Of course, for real-time
systems, there is also the additional difference that temporal signals are
causal, and processing must not depend on future observations; but if off-line
processing is allowed this is not an issue, and thus does not seem to be
a central distinction between the spatial and spatio-temporal domains.)
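To make that parenthetical point concrete, here is a small illustrative contrast (my own toy example in Python, not tied to any particular system) between a causal temporal filter, which never looks at future samples and so could run in real time, and a centered filter, which requires them:

    import numpy as np

    def causal_smooth(signal, window=5):
        # Uses only current and past samples, so it could run in real time.
        x = np.asarray(signal, dtype=np.float32)
        out = np.empty_like(x)
        for t in range(len(x)):
            out[t] = x[max(0, t - window + 1): t + 1].mean()
        return out

    def centered_smooth(signal, window=5):
        # Uses past AND future samples, so it needs off-line access.
        x = np.asarray(signal, dtype=np.float32)
        half = window // 2
        out = np.empty_like(x)
        for t in range(len(x)):
            out[t] = x[max(0, t - half): t + half + 1].mean()
        return out

The two produce similar smoothing off-line, but only the first is available to a system that must act as the signal arrives.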
Whatever the case may be, it is important to keep in mind that the challenges
to appearance-based techniques that have limited their application to 3-D
object recognition (including large memory requirements for practical object
databases to handle arbitrary viewpoints, the need for useful indexing methods, recognition
in clutter, susceptibility to occlusion, sensitivity to illumination conditions,
difficulty in formulating a useful model for rejecting the null hypothesis,
...) are only worsened when one considers highly articulated objects such
as people, who wear a variety of clothing styles, colors and textures and
perform their actions and activities with very large within-class variation.
Several important distinctions between action perception and the range of tasks associated with static images stem from the relative ease with which the key semantically important entities can be segmented from a temporal sequence of images. This is principally by virtue of the motion of the actors participating in an action with respect to the surrounding environment. This feature of action perception has been exploited from the earliest work in machine vision through simple methods such as background subtraction, and more recently in a range of more sophisticated segmentation techniques. Of course the problems of analysing static imagery are still encountered when actors interact with stationary objects, although in some cases the motion of the actor alone may be sufficient to cue the presence of a given object. For example, the act of sitting suggests the presence of a chair or level surface.
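As an illustration of the simplest of those methods, a minimal background-subtraction sketch might look as follows (Python/NumPy; the median-background estimate and the threshold value are illustrative assumptions, not a description of any particular system):

    import numpy as np

    def background_subtract(frames, threshold=25.0):
        # frames: array of shape (T, H, W) holding grayscale intensities.
        # Estimate a static background as the per-pixel median over time,
        # then mark pixels that differ from it by more than the threshold.
        frames = np.asarray(frames, dtype=np.float32)
        background = np.median(frames, axis=0)   # (H, W) static-scene estimate
        diff = np.abs(frames - background)       # per-pixel change, (T, H, W)
        return diff > threshold                  # boolean foreground mask

Anything beyond this simple case (adaptive backgrounds, shadows, camera motion) is where the more sophisticated segmentation techniques mentioned above become necessary.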
This ease of segmentation has two important implications - one obvious and the other more subtle. The first is that the objects extracted are at the level of abstraction required for symbolic approaches to analysis (e.g. qualitative reasoning, spatial and temporal logics). As a result, we've seen systems that can produce relevant natural language descriptions that would be hard to match in tasks associated with static images.
The second implication is the possibility for automated learning, not
only of the actions themselves but also of the shapes of classes of constituent
actors. This is important not only for action perception, but may also drive
our acquisition of models for subsequent use in the analysis of static images.
The most fundamental notion is the qualitative nature of action perception. In this aspect, action perception is somewhat similar to recognizing natural objects (trees, animals, etc.) in static images. Consider a set of possible instantiations of a given "action" in the real world. If you compare the motion trajectories, except for a high-precision programmed motion machine, you will never find an exact match between them. Moreover, even if a pair of performances exhibit extremely close trajectories, they are not necessarily regarded as the same action if their results are different. For example, in a typical traditional robotic problem called peg insertion (with a small clearance), it is well known that if you control the robot very rigidly you have little chance of success, and the resulting behavior changes unpredictably over repetitions. In contrast, if you apply appropriate compliant control, you gain a great increase in success rate, but now your robot will never trace exactly the same trajectory. Yet you would call the latter case a repetition of the same action.
In general contexts, action perception must inevitably choose appropriate aspects of the data so that they capture the invariances over different instances of the same action. Therefore, feature selection and event selection - in other words, spatial (feature and location) and temporal attention control - play a vital role in action perception [Kuniyoshi93].
Actually, action perception has nothing to do with image-by-image static analysis of a given image sequence. Such a scheme never solves the problem; it just transforms the image sequence into another format. What is necessary is to keep track of the movement of the subject of the action, extract invariance over time, and detect the important time points on which the system temporally focuses, thereby detecting meaningful events in the world.
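One very simple way to make "detect important time points" concrete is to treat the total inter-frame change as a motion-energy signal over time and flag its prominent peaks. The sketch below (Python/NumPy, with illustrative parameters; it is not the method of [Kuniyoshi93] or any other particular system) is meant only to show the flavor of such temporal focusing:

    import numpy as np

    def candidate_event_times(frames, smooth=3, min_prominence=2.0):
        # frames: array of shape (T, H, W). Total inter-frame change serves
        # as a crude "motion energy" signal over time; prominent local maxima
        # are returned as candidate time points to focus on.
        frames = np.asarray(frames, dtype=np.float32)
        energy = np.abs(np.diff(frames, axis=0)).sum(axis=(1, 2))  # length T-1
        kernel = np.ones(smooth) / smooth
        energy = np.convolve(energy, kernel, mode="same")          # light smoothing
        mu, sigma = energy.mean(), energy.std() + 1e-8
        peaks = [t for t in range(1, len(energy) - 1)
                 if energy[t] > energy[t - 1] and energy[t] >= energy[t + 1]
                 and (energy[t] - mu) / sigma > min_prominence]
        return peaks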
It appears that human processing of static and moving images utilizes
very different strategies. In particular, special-purpose real-time detectors
that may involve complex internal motor control models seem to be employed
for extracting high-level descriptions from moving images, as described
below.
The obvious difference between perceiving action and dealing with static imagery is that we have time to work with. In one sense, this increases the potential complexity of the problem, but as is the case with color, the additional structure can cut both ways. On the down side, new data structures and representations may be needed (well, for researchers, this is an up side), and resource requirements (storage and computation) may increase. On the up side, the additional structure may make some problems easier. For example, motion is obviously a powerful segmentation cue; there is also evidence that movement provides some powerful recognition cues that make solving certain problems much easier than with static images (e.g. finding walking people).
Mathematically, time can be seen as just another dimension; however, even at the lowest levels of processing, we tend to handle the time dimension in a fundamentally different way from the spatial dimensions (e.g. note the asymmetry of the time dimension in the basic motion equations).
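To make the asymmetry concrete, one standard example (my choice of illustration; the text does not specify which motion equations are meant) is the brightness-constancy constraint of optical flow. Requiring the image brightness I(x, y, t) to remain constant along a motion trajectory gives

\[ I_x\,u + I_y\,v + I_t = 0 , \]

where (u, v) is the unknown image velocity. The spatial derivatives I_x and I_y multiply the unknowns while the temporal derivative I_t enters alone, and the constraint can only be resolved by aggregating over space (the aperture problem), so time is not handled interchangeably with x and y.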
Physically there is the whole notion of causality, and unidirectionality in time, but beyond that, the important structures are just fundamentally different.
We are interested in how objects move/change in time, what the causal connections between events are, keeping track of individual objects over time, describing dynamic interactions, predicting future interactions, etc. None of these have close correlates in the world of static imagery.
I don't think the issue is "like old-style recognition". There are plenty of useful action-oriented measures that have a continuous as opposed to a discrete feel to them (e.g. trajectory prediction, and object/part tracking), or are quite specific (e.g. notice when THIS door swings open). There are also plenty of issues, as mentioned above, that don't seem to have a close correlate in conventional image processing.
The main point, however, is that considering action simply opens up a
huge number of essentially independent sources of information about the
world and what is going on in it.
Object recognition in human vision seems in some cases to use a view-based strategy and to be unable to use view-invariant 3D information. It seems that the perception of some actions is no different. Sinha, Buelthoff and Buelthoff have shown that the perception of sequences of biological motions such as the ones developed by Johansson is strongly view-based. Motion sequences acquired from familiar viewpoints are recognized much more readily than those from unfamiliar viewpoints. Such viewpoint dependency is more
consistent with a view-based representation than with an object-centered
3D representation. Furthermore, experimental data suggests that recognition
of biological motion sequences is largely unaffected by large distortions
in the 3D depth structures of the moving figure, so long as its 2D appearance
is not overly distorted. In fact, subjects seem to be perceptually unaware
of the depth distortions - an interesting instance of top-down, recognition-based influences overwhelming the information being provided by the bottom-up
depth extraction processes such as binocular stereo. It would be interesting
to conceive of sequences where depth-structure might in fact be critical
for action recognition.
I speculate that there are fewer interesting classes of actions than classes of 3D shapes, and that there is less between-class similarity, simply because motion is harder to generate and costs energy. It must be coordinated, controlled, and efficient in order to be useful. As a result, I suspect there are fewer categories of motion than of shape, which suggests that the problem may be easier. Furthermore, small functionally irrelevant variations in the shape of a 3D object can confound object recognition, but patterns of motion may tend to be devoid of unnecessary flourish (ballet and funny car races notwithstanding.)
In addition, the segmentation problem for motion may be somewhat easier
than for static imagery. The simplest case of a single motion taking place
against a static background probably occurs more frequently in practice
than a single object silhouetted against a uniform background. But motion
segmentation can also be quite hard in general.
In a typical approach to action recognition, each frame of a sequence is analyzed individually to extract some feature vector (see "how to represent action?" for examples of features). The feature vector as a function of time is used to recognize an action. We need to emphasize here that each static frame is not really recognized in the most general sense of recognition. (Although some work has been reported on identifying individual parts of the human body in each frame, which can then be used for action recognition.)
So, the major difference between dealing with static images and action recognition is that in action recognition we are more interested in how the feature vector changes with respect to time, which conveys the information about the action. One can treat all the frames in a sequence as one long vector, essentially removing the time factor. (This will require time warping to convert different sequences to the same length, which will bring all such vectors to the same dimensionality.) One simple possibility is to use all gray levels in all frames as a feature vector. This is a spatiotemporal feature vector.
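As an illustration of the "one long vector" idea (the function and parameter names below are my own, and the uniform resampling merely stands in for a proper time-warping step), a sequence can be warped to a fixed number of frames and then flattened:

    import numpy as np

    def spatiotemporal_vector(frames, target_len=16):
        # frames: array of shape (T, H, W) of gray levels. The sequence is
        # resampled to target_len frames (a crude, uniform "time warp", not
        # dynamic time warping) so that every sequence yields a vector of the
        # same length, and then flattened into one long feature vector.
        frames = np.asarray(frames, dtype=np.float32)
        T = frames.shape[0]
        idx = np.round(np.linspace(0, T - 1, target_len)).astype(int)
        return frames[idx].reshape(-1)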
We have used eigen decomposition of this spatiotemporal vector in the
lipreading problem, and have observed that it has more discriminating power
as compared to treating each static frame individually. However, it is computationally
expensive, for it requires time warping, and each vector is very long.
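For completeness, here is a minimal sketch of the eigen-decomposition step, i.e. standard principal component analysis over a set of such spatiotemporal vectors (nothing in it is specific to the lipreading system described above):

    import numpy as np

    def pca_project(vectors, n_components=10):
        # vectors: array of shape (N, D), one flattened spatiotemporal vector
        # per sequence. Projects onto the leading eigenvectors (principal
        # components) of the data; an economy-size SVD avoids forming the
        # very large D x D covariance matrix.
        X = np.asarray(vectors, dtype=np.float32)
        Xc = X - X.mean(axis=0)
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:n_components]        # leading eigenvectors, (k, D)
        return Xc @ components.T, components  # projections (N, k) and basis

Nearest-neighbour matching in the resulting low-dimensional space is one simple way to compare sequences; the expense noted above shows up in the very large dimensionality D of each vector.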
Are instantaneous measurements sufficient for recognition?
I will propose that a temporal framework is far more appropriate and economical, since the ambiguity of instantaneous measurements calls into question the feasibility of effective recognition.