Question #2 - Action vs Static Images
2) In what ways is perceiving action (i.e. labeling sequences as imagery of a particular action taking place) fundamentally different from dealing with static images? One view is that action labeling is like "old style" object recognition ("that's a chair") as opposed to the now predominant CAD-based view (this is some exactly specified chair-shaped object).
In our categorization, labeling sequences can be considered one kind of visual behavior. What is more important is what meaning the labeling process has in the context of the agent's goal; that is, how the agent can arrive at its own labeling.
In case of "old-fashioned" object recognition such as "that
is chair," we should make clear what kind of meaning the object (human
interpretation might be "a chair.") has for the agent to achieve
its goal rather than seeking for geometrical matching or similar methods.
For example, for four-legged robots like a dog or cat, does the concept
"chair" or similar object exist? It could be, but it might quite
different from ours.
A useful result from a vision system would be an answer to the question, "What is happening?" If we ask a computer to interpret a video signal, we are obliged to develop computational models of what is interesting (events), and of how sets of interesting signals map to interpretations (contexts).
Let me argue briefly that the prospects for vision-of-action are much more
promising than for traditional high-level vision, because for vision-of-action
there is a good model of both events and contexts: causality.
All of the pressing issues in object recognition are present (or will be present) in action understanding as well: issues of view-centered vs. object-centered representation, of what the desired modes of generalization are, of whether to model the statistics of pixels or to capture higher structure.
The classic issues of syntax vs. semantics, functionality, and even knowledge representation are all there. I would hope practitioners in each subfield would be well versed in each other's techniques; the relative bulk and history of the object literature would place the larger burden in this regard on the action recognition/understanding researcher. Perceived lack of prior progress is, of course, no excuse to be ignorant of any relevant literature. Too many of us, myself included, have proffered a 'novel' method or technique which really amounts to a change of domain.
But wait, the spatio-temporal domain does matter! There are differences in technique, and in the phenomenology, of the perception of dynamic and static objects. Few of the current computational approaches to the perception of action exploit what I consider the salient differences, though I hope this will change as the field matures.
At the risk of excessive symmetry in nomenclature, I would say the clearest difference lies in the "action of perception of action". (At least as it relates to the action of perception, or active perception, of static signals). It is only when we consider active perception, with its implicit perceptual limitations and foveal dynamics, that it is easy to see clear differences between static object perception and dynamic action perception. Simply put: with objects, under active perception conditions, we have the luxury of extended viewing conditions.
In a sense, in active object perception we have the potential for full access to the state of the object, since given sufficient viewing time we can scan the perceptual apparatus to an arbitrary set of locations and obtain the highest resolution information everywhere on the object. If we are unsure about what we saw the moment before (was that the letter I or the number 1?), we can, and do, rapidly scan back to the relevant portion of the object. As many of the pioneers of active perception have pointed out, when the world is available to represent itself the brain will likely take advantage of this fact.
But this is impossible with dynamic objects, by definition! If an active perceptual apparatus is attending to a particular spatial portion of the spatio-temporal signal flowing past, some information at other spatial regions will inevitably be missed, and the perceptual system will not in general have any way of recovering that information. Perception of action is therefore a more challenging task, or at least more unforgiving to the errant active observer. It also makes explicit the need for internal representations, insofar as aspects of action can influence interpretation (i.e., context) across relatively long time scales. Whether there is internal representation in static perception may be open for debate, but not so in the perception of action.
There are many interesting perception-of-action tasks that do not require
active perception. Probably most of the short-term progress in new application
areas will be of this character. But I do not see how this endeavor is different
from object recognition in any principled way. If we have full access to
the spatio-temporal signal, there is no difference between understanding
the spatial signal and the spatio-temporal signal. (Of course, for real-time
systems, there is also the additional difference that temporal signals are
causal, and processing must not depend on future observations; but if off-line
processing is allowed this is not an issue, and thus does not seem to be
a central distinction between the spatial and spatio-temporal domains.)
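To make that parenthetical point concrete, here is a small illustrative contrast (my own toy example in Python, not tied to any particular system) between a causal temporal filter, which never looks at future samples and so could run in real time, and a centered filter, which requires them:

    import numpy as np

    def causal_smooth(signal, window=5):
        # Uses only current and past samples, so it could run in real time.
        x = np.asarray(signal, dtype=np.float32)
        out = np.empty_like(x)
        for t in range(len(x)):
            out[t] = x[max(0, t - window + 1): t + 1].mean()
        return out

    def centered_smooth(signal, window=5):
        # Uses past AND future samples, so it needs off-line access.
        x = np.asarray(signal, dtype=np.float32)
        half = window // 2
        out = np.empty_like(x)
        for t in range(len(x)):
            out[t] = x[max(0, t - half): t + half + 1].mean()
        return out

The two produce similar smoothing off-line, but only the first is available to a system that must act as the signal arrives.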
Whatever the case may be, it is important to keep in mind that the challenges
to appearance-based techniques that have limited their application to 3-D
object recognition (including large memory requirements for practical object
databases to handle arbitrary viewpoints, the need for useful indexing methods, recognition
in clutter, susceptibility to occlusion, sensitivity to illumination conditions,
difficulty in formulating a useful model for rejecting the null hypothesis,
...) are only worsened when one considers highly articulated objects such
as people, who wear a variety of clothing styles, colors and textures and
perform their actions and activities with very large within-class variation.
Several important distinctions between action perception and the range of tasks associated with static images stem from the relative ease with which the key semantically important entities can be segmented from a temporal sequence of images. This is principally by virtue of the motion of the actors participating in an action with respect to the surrounding environment. This feature of action perception has been exploited from the earliest work in machine vision through simple methods such as background subtraction, and more recently in a range of more sophisticated segmentation techniques. Of course the problems of analysing static imagery are still encountered when actors interact with stationary objects, although in some cases the motion of the actor alone may be sufficient to cue the presence of a given object. For example, the act of sitting suggests the presence of a chair or level surface.
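As an illustration of the simplest of those methods, a minimal background-subtraction sketch might look as follows (Python/NumPy; the median-background estimate and the threshold value are illustrative assumptions, not a description of any particular system):

    import numpy as np

    def background_subtract(frames, threshold=25.0):
        # frames: array of shape (T, H, W) holding grayscale intensities.
        # Estimate a static background as the per-pixel median over time,
        # then mark pixels that differ from it by more than the threshold.
        frames = np.asarray(frames, dtype=np.float32)
        background = np.median(frames, axis=0)   # (H, W) static-scene estimate
        diff = np.abs(frames - background)       # per-pixel change, (T, H, W)
        return diff > threshold                  # boolean foreground mask

Anything beyond this simple case (adaptive backgrounds, shadows, camera motion) is where the more sophisticated segmentation techniques mentioned above become necessary.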
This ease of segmentation has two important implications - one obvious and the other more subtle. The first is that the objects extracted are at the level of abstraction required for symbolic approaches to analysis (e.g. qualitative reasoning, spatial and temporal logics). As a result, we've seen systems that can produce relevant natural language descriptions that would be hard to match in tasks associated with static images.
The second implication is the possibility for automated learning, not
only of the actions themselves but also of the shapes of classes of constituent
actors. This is important not only for action perception, but may also drive
our acquisition of models for subsequent use in the analysis of static images.
The most fundamental notion is the qualitative nature of action perception. In this aspect, action perception is somewhat similar to recognizing natural objects (trees, animals, etc.) in static images. Consider a set of possible instantiations of a given "action" in the real world. If you compare the motion trajectories, except for a high-precision programmed motion machine, you will never find an exact match between them. Moreover, even if a pair of performances exhibit extremely close trajectories, they are not necessarily regarded as the same action if their results are different. For example, in a typical traditional robotic problem called peg insertion (with a small clearance), it is well known that if you control the robot very rigidly you have little chance of success, and the resulting behavior changes unpredictably over repetitions. In contrast, if you apply appropriate compliant control, you gain a great increase in success rate, but now your robot will never trace exactly the same trajectory. Yet you would call the latter case a repetition of the same action.
In general contexts, action perception must inevitably choose appropriate aspects of the data so that they capture the invariances over different instances of the same action. Therefore, feature selection and event selection - in other words, spatial (feature and location) and temporal attention control - play a vital role in action perception [Kuniyoshi93].
Actually, action perception has nothing to do with image-by-image static analysis of a given image sequence. Such a scheme never solves the problem; it just transforms the image sequence into another format. What is necessary is to keep track of the movement of the subject of the action, extract invariance over time, and detect the important time points on which the system temporally focuses, thereby detecting meaningful events in the world.
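One very simple way to make "detect important time points" concrete is to treat the total inter-frame change as a motion-energy signal over time and flag its prominent peaks. The sketch below (Python/NumPy, with illustrative parameters; it is not the method of [Kuniyoshi93] or any other particular system) is meant only to show the flavor of such temporal focusing:

    import numpy as np

    def candidate_event_times(frames, smooth=3, min_prominence=2.0):
        # frames: array of shape (T, H, W). Total inter-frame change serves
        # as a crude "motion energy" signal over time; prominent local maxima
        # are returned as candidate time points to focus on.
        frames = np.asarray(frames, dtype=np.float32)
        energy = np.abs(np.diff(frames, axis=0)).sum(axis=(1, 2))  # length T-1
        kernel = np.ones(smooth) / smooth
        energy = np.convolve(energy, kernel, mode="same")          # light smoothing
        mu, sigma = energy.mean(), energy.std() + 1e-8
        peaks = [t for t in range(1, len(energy) - 1)
                 if energy[t] > energy[t - 1] and energy[t] >= energy[t + 1]
                 and (energy[t] - mu) / sigma > min_prominence]
        return peaks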
It appears that human processing of static and moving images utilizes
very different strategies. In particular, special-purpose real-time detectors
that may involve complex internal motor control models seem to be employed
for extracting high-level descriptions from moving images, as described
below.
The obvious difference between perceiving action and dealing with static imagery is that we have time to work with. In one sense, this increases the potential complexity of the problem, but as is the case with color, the additional structure can cut both ways. On the down side, new data structures and representations may be needed (well, for researchers, this is an up side), and resource requirements (storage and computation) may increase. On the up side, the additional structure may make some problems easier. For example, motion is obviously a powerful segmentation cue; there is also evidence that movement provides some powerful recognition cues that make solving certain problems much easier than with static images (e.g. finding walking people).
Mathematically, time can be seen as just another dimension; however, even at the lowest levels of processing, we tend to handle the time dimension in a fundamentally different way from the spatial dimensions (e.g. note the asymmetry of the time dimension in the basic motion equations).
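To make the asymmetry concrete, one standard example (my choice of illustration; the text does not specify which motion equations are meant) is the brightness-constancy constraint of optical flow. Requiring the image brightness I(x, y, t) to remain constant along a motion trajectory gives

\[ I_x\,u + I_y\,v + I_t = 0 , \]

where (u, v) is the unknown image velocity. The spatial derivatives I_x and I_y multiply the unknowns while the temporal derivative I_t enters alone, and the constraint can only be resolved by aggregating over space (the aperture problem), so time is not handled interchangeably with x and y.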
Physically there is the whole notion of causality, and unidirectionality in time, but beyond that, the important structures are just fundamentally different.
We are interested in how objects move/change in time, what the causal connections between events are, keeping track of individual objects over time, describing dynamic interactions, predicting future interactions, etc. None of these have close correlates in the world of static imagery.
I don't think the issue is "like old-style recognition". There are plenty of useful action-oriented measures that have a continuous as opposed to a discrete feel to them (e.g. trajectory prediction, and object/part tracking), or are quite specific (e.g. notice when THIS door swings open). There are also plenty of issues, as mentioned above, that don't seem to have a close correlate in conventional image processing.
The main point, however, is that considering action simply opens up a
huge number of essentially independent sources of information about the
world and what is going on in it.
Object recognition in human vision seems in some cases to use a view-based strategy and to be unable to use view-invariant 3D information. It seems that the perception of some actions is no different. Sinha, Buelthoff and Buelthoff have shown that the perception of sequences of biological motions such as the ones developed by Johansson is strongly view-based. Motion sequences acquired from familiar viewpoints are recognized much more readily than those from unfamiliar viewpoints. Such viewpoint dependency is more
consistent with a view-based representation than with an object-centered
3D representation. Furthermore, experimental data suggests that recognition
of biological motion sequences is largely unaffected by large distortions
in the 3D depth structures of the moving figure, so long as its 2D appearance
is not overly distorted. In fact, subjects seem to be perceptually unaware
of the depth distortions - an interesting instance of top-down, recognition-based influences overwhelming the information being provided by the bottom-up
depth extraction processes such as binocular stereo. It would be interesting
to conceive of sequences where depth-structure might in fact be critical
for action recognition.
I speculate that there are fewer interesting classes of actions than classes of 3D shapes, and that there is less between-class similarity, simply because motion is harder to generate and costs energy. It must be coordinated, controlled, and efficient in order to be useful. As a result, I suspect there are fewer categories of motion than of shape, which suggests that the problem may be easier. Furthermore, small functionally irrelevant variations in the shape of a 3D object can confound object recognition, but patterns of motion may tend to be devoid of unnecessary flourish (ballet and funny car races notwithstanding.)
In addition, the segmentation problem for motion may be somewhat easier
than for static imagery. The simplest case of a single motion taking place
against a static background probably occurs more frequently in practice
than a single object silhouetted against a uniform background. But motion
segmentation can also be quite hard in general.
In a typical approach to action recognition, each frame of a sequence is analyzed individually to extract some feature vector (see "how to represent action?" for examples of features). The feature vector as a function of time is used to recognize an action. We need to emphasize here that each static frame is not really recognized in the most general sense of recognition. (Although some work has been reported on identifying individual parts of the human body in each frame, which can then be used for action recognition.)
So, the major difference between dealing with static images and action recognition is that in action recognition we are more interested in how the feature vector changes with respect to time, which conveys the information about the action. One can treat all the frames in a sequence as one long vector, essentially removing the time factor. (This will require time warping to convert different sequences to the same length, which will bring all such vectors to the same dimensionality.) One simple possibility is to use all gray levels in all frames as a feature vector. This is a spatiotemporal feature vector.
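As an illustration of the "one long vector" idea (the function and parameter names below are my own, and the uniform resampling merely stands in for a proper time-warping step), a sequence can be warped to a fixed number of frames and then flattened:

    import numpy as np

    def spatiotemporal_vector(frames, target_len=16):
        # frames: array of shape (T, H, W) of gray levels. The sequence is
        # resampled to target_len frames (a crude, uniform "time warp", not
        # dynamic time warping) so that every sequence yields a vector of the
        # same length, and then flattened into one long feature vector.
        frames = np.asarray(frames, dtype=np.float32)
        T = frames.shape[0]
        idx = np.round(np.linspace(0, T - 1, target_len)).astype(int)
        return frames[idx].reshape(-1)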
We have used eigen decomposition of this spatiotemporal vector in the
lipreading problem, and have observed that it has more discriminating power
as compared to treating each static frame individually. However, it is computationally
expensive, for it requires time warping, and each vector is very long.
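For completeness, here is a minimal sketch of the eigen-decomposition step, i.e. standard principal component analysis over a set of such spatiotemporal vectors (nothing in it is specific to the lipreading system described above):

    import numpy as np

    def pca_project(vectors, n_components=10):
        # vectors: array of shape (N, D), one flattened spatiotemporal vector
        # per sequence. Projects onto the leading eigenvectors (principal
        # components) of the data; an economy-size SVD avoids forming the
        # very large D x D covariance matrix.
        X = np.asarray(vectors, dtype=np.float32)
        Xc = X - X.mean(axis=0)
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:n_components]        # leading eigenvectors, (k, D)
        return Xc @ components.T, components  # projections (N, k) and basis

Nearest-neighbour matching in the resulting low-dimensional space is one simple way to compare sequences; the expense noted above shows up in the very large dimensionality D of each vector.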
Are instantaneous measurements sufficient for recognition?
I will propose that a temporal framework is far more appropriate and economical, since the ambiguity of instantaneous measurements calls into question the feasibility of effective recognition.