Question #6 - Role of Causal Reasoning
6) What is the role of "causal" or physics-based descriptions in understanding action? Is it more important for action than for statics? Are appearance/kinematic descriptions more or less appropriate than in the static case? [I am trying to get at the difference between just describing things as they "are" versus "why" they are that way. While explanation-based approaches have not become prominent in the interpretation of static imagery, they seem to be more easily invoked for the understanding of video. Then again, the HMM approach to action description is purely based on what tends to happen, with no mention of why.]
I think that the issue is not so much about whether our models are physical or not but rather whether they provide the necessary constraints. Whether physical or image-based, we need models of action that provide strong constraints on the interpretation of image events.
Such strong constraints are necessary for estimating image properties such as motion in complex scenes. They also provide concise descriptions of image events that can be used for recognition. An example of this is my work with Yaser Yacoob on the recognition of facial expressions from image motion. By formulating simple parameterized models of facial feature motions we were able to measure these motions reliably and exploit the changing parameter values for recognition.
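To make the parameterized-model idea concrete, here is a minimal sketch (assuming a plain six-parameter affine model; the models used in the work with Yacoob also included curvature terms, and the function name here is hypothetical) that fits affine motion parameters to sparse flow vectors by least squares. The time course of the recovered parameters is then the feature used for recognition.

```python
import numpy as np

def fit_affine_motion(xs, ys, us, vs):
    """Least-squares fit of a six-parameter affine motion model
        u = a0 + a1*x + a2*y,   v = a3 + a4*x + a5*y
    to sparse flow vectors (us, vs) measured at points (xs, ys).
    Illustrative sketch only; richer parameterized models (e.g. with
    curvature terms) follow the same pattern."""
    n = len(xs)
    A = np.zeros((2 * n, 6))
    b = np.empty(2 * n)
    # even rows: u equations; odd rows: v equations
    A[0::2, 0] = 1; A[0::2, 1] = xs; A[0::2, 2] = ys
    A[1::2, 3] = 1; A[1::2, 4] = xs; A[1::2, 5] = ys
    b[0::2] = us
    b[1::2] = vs
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params  # [a0 .. a5]; e.g. a1 + a5 behaves like divergence
```

Tracking how these six numbers change over time gives a compact, motion-based signature of the facial event.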
The use of such strong models in general scenes raises the important, and little-addressed, issue of indexing. Given a set of models that may explain an image event, how do we efficiently choose which model to use? Moreover, if the model is scale-dependent, how do we select the appropriate spatial and temporal scales?
Vision sciences traditionally take high-level vision to be concerned with static properties of objects, typically their identities, categories, and shapes. The relationships between these properties and visual features are correlational (not necessary or sufficient), leading to many proposals for how brains and computers may compute optimal discriminators for various sets of images. The success of these methods depends of course on coming close to the prior probability of all possible "natural" images, a distribution which nobody knows how to approximate.
Arguably, causal dynamic properties of objects and scenes are more informative, more universal, and more easily computed. These properties (substantiality, solidity, contiguity, contact, and conservation of momentum) are governed by simple physical laws at human scales and are thus consistent across most of visual experience. They also have very simple signatures in the visuo-temporal stream. Most importantly, since these properties are causal, we may find that a small number of qualitative rules provides satisfactory psychological and computational accounts of much of visual understanding.
For example, there is a growing body of psychological evidence showing that infants are fluent perceivers of lawful causality and violations thereof. Spelke and Van de Walle found that nine-month-old infants can make sophisticated judgments about the causality of action, sometimes before they can reliably segment still objects from backgrounds! Yet only a handful of rules suffices to describe all the causal distinctions made by the infants.
Following this lead, I am exploring the hypothesis that causal perception rests on inference about the motions and collisions of surfaces (and proceeds independently of processes such as recognition, reconstruction, and static segmentation). I am building systems that use causal landmarks to segment video into actions, and use higher-level causal constraints to ensure that actions are consistent over time. Each system takes a video sequence of manipulative action as input, and outputs a plan-of-action and selected frames showing key events (the "gist" of the video), useful for summary, indexing, reasoning, and automated editing. Gisting may be thought of as the "inverse Hollywood problem": begin with a movie, end with a script and storyboard.
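As an illustration of causal landmarks, the minimal sketch below segments a video at frames where two tracked surfaces come into or out of contact. It assumes axis-aligned bounding-box tracks for the two surfaces; the actual systems use richer cues than box overlap, and the names here are illustrative.

```python
def boxes_touch(a, b, eps=0.0):
    """True if axis-aligned boxes (x0, y0, x1, y1) overlap or abut
    within tolerance eps."""
    return (a[0] <= b[2] + eps and b[0] <= a[2] + eps and
            a[1] <= b[3] + eps and b[1] <= a[3] + eps)

def causal_landmarks(track_a, track_b, eps=2.0):
    """Candidate causal landmarks from two bounding-box tracks:
    frames where the surfaces come into or out of contact.
    Each track is a per-frame list of (x0, y0, x1, y1) boxes."""
    events = []
    prev = boxes_touch(track_a[0], track_b[0], eps)
    for t in range(1, len(track_a)):
        cur = boxes_touch(track_a[t], track_b[t], eps)
        if cur and not prev:
            events.append((t, "contact"))
        elif prev and not cur:
            events.append((t, "separation"))
        prev = cur
    return events
```

The resulting contact/separation frames are exactly the kind of key events a gisting system would keep for its storyboard.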
These systems track patches of similarly behaving pixels (coherent in motion or color) and interpret the changing spatial relations between them. The strategy is to detect potential events in varying shapes, velocities, and spatial relations among the tracked surfaces. This becomes a problem in dynamical modeling of multiple interacting processes. (Up to three "objects" and their pairwise spatial relations are needed to capture the full range of actions denoted by natural language verbs.) For this I have developed a generalization of hidden Markov models (HMMs) in which HMMs are coupled with across-model probabilities to represent the interactions between processes, e.g., between sets of evolving spatial relations. These models are trained to recognize events and then assembled into a finite state machine whose transitions accord with the infant perception results obtained by Spelke et al. A modified Viterbi algorithm is then used to parse video sequences of continuous action, integrating information over time to find the most probable sequence of actions given the evidence in the video.
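A minimal sketch of the decoding step for two coupled chains follows. The factored transition score and the matrix names here are illustrative assumptions, not the exact formulation above: Viterbi is run on the product state space, with each chain's next state scored against both chains' previous states through the across-model probabilities.

```python
import numpy as np

def coupled_viterbi(obs1, obs2, pi1, pi2, A11, A12, A21, A22, B1, B2):
    """Most-probable joint state sequence for two coupled HMM chains.
      A11[i1, j1], A22[i2, j2] : within-chain transitions
      A12[i1, j2], A21[i2, j1] : across-model (coupling) probabilities
      B1[j, o], B2[j, o]       : per-chain emission probabilities
    Decoding runs on the product state space; each chain's next state
    is scored against both chains' previous states."""
    n1, n2, T = len(pi1), len(pi2), len(obs1)
    lg = np.log
    # initial log-score over product states (s1, s2)
    delta = (lg(pi1)[:, None] + lg(pi2)[None, :]
             + lg(B1[:, obs1[0]])[:, None] + lg(B2[:, obs2[0]])[None, :])
    back = np.zeros((T, n1, n2, 2), dtype=int)
    for t in range(1, T):
        new = np.empty((n1, n2))
        for j1 in range(n1):
            for j2 in range(n2):
                # factored transition score into product state (j1, j2)
                trans = (lg(A11[:, j1])[:, None] + lg(A12[:, j2])[:, None]
                         + lg(A21[:, j1])[None, :] + lg(A22[:, j2])[None, :])
                score = delta + trans
                i1, i2 = np.unravel_index(np.argmax(score), score.shape)
                back[t, j1, j2] = (i1, i2)
                new[j1, j2] = (score[i1, i2]
                               + lg(B1[j1, obs1[t]]) + lg(B2[j2, obs2[t]]))
        delta = new
    # backtrack from the best final product state
    j1, j2 = np.unravel_index(np.argmax(delta), delta.shape)
    path = [(int(j1), int(j2))]
    for t in range(T - 1, 0, -1):
        j1, j2 = back[t, j1, j2]
        path.append((int(j1), int(j2)))
    return path[::-1]
```

The product-space formulation is exact but costs O(T n1^2 n2^2); approximate decoders keep the chains factored when the state spaces grow.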
These systems are aimed at dexterous manipulation tasks, like repairing a machine. In such tasks, causal events (grasping a screwdriver, touching it to a screw, pulling out the screw, etc.) contain most of the information one needs to reconstruct the actor's plan and intentions. But there are many actions described by natural language that are not causal. Some are gestural, such as waving and dancing. In these cases, there may still be frameworks for interpretation that provide computational leverage; e.g., dances are rhythmically structured, and deviations from that structure are often expressive. This framework is not as compact, productive, and reliable as physical causality, but it has obvious utility. A similar argument can be made for games and athletic events: there is a loose logic to the patterns of motion; both deviations from and completions of clichéd motion sequences are interesting. Conversational gesture is substantially harder; the individual and cultural variation in gesture events is enormous, and the context is even less tractable. However, I have argued elsewhere that there is a small, oft-used subset of gestures that have metaphorical causality, for example, "opening" one's arms to indicate receptivity and "pushing away" an unappealing proposition.
Even if all these tasks are considered purely as pattern-recognition
problems, there are physical logics that will improve one's computational
prospects. Most of these derive from the kinematics and dynamics of the
human body, and this can lead us to a choice of pattern recognition algorithms
and signal representations. For example, to make a system that recognizes
martial arts moves, we noted that the positions of the end-effectors (hands)
carry enough information to support gesture discrimination. Moreover,
since each hand is doing something different, the difference between them
carries information as well. Consequently, we used coupled HMMs with each
model only seeing data from one hand. This substantially outperformed conventional
HMMs applied to data from both hands, principally because coupled HMMs can
explicitly model the interaction.
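A sketch of the observation streams this suggests follows; the specific features are an assumption for illustration. Each chain sees one hand's position and velocity, and the inter-hand difference is appended to both streams since it carries information in its own right.

```python
import numpy as np

def hand_features(left, right):
    """Build per-hand observation streams for a two-chain (coupled-HMM)
    recognizer. left, right: (T, 2) arrays of hand positions.
    Each stream holds one hand's position and velocity plus the
    inter-hand difference (sign-flipped for the right-hand stream).
    Feature choice here is illustrative, not the published design."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    diff = right - left
    # per-frame velocities; first frame's velocity is zero by construction
    vel_l = np.diff(left, axis=0, prepend=left[:1])
    vel_r = np.diff(right, axis=0, prepend=right[:1])
    obs_left = np.hstack([left, vel_l, diff])
    obs_right = np.hstack([right, vel_r, -diff])
    return obs_left, obs_right
```

Each stream would then be fed to one chain of the coupled model, so that no single chain ever sees both hands' raw data.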
The recognition of complex static objects usually consists of a loop of two operations: (1) computing properties of the static object, then (2) classifying the object based on these properties and on a set of predefined object models. We start with generic properties and end with specific ones. At each step, we hope to reduce the number of possible classes for the current object. Thus we can plan the computation of the properties used to recognize static objects.
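The loop described above can be sketched schematically (the function names and data layout are illustrative assumptions):

```python
def recognize(obj, models, property_fns):
    """Generic-to-specific recognition loop. `property_fns` is an
    ordered list of (name, compute) pairs, from cheap/generic to
    expensive/specific; each computed property prunes the candidate
    model classes. Models without a stated value for a property are
    not pruned by it."""
    candidates = list(models)
    for name, compute in property_fns:
        value = compute(obj)
        candidates = [m for m in candidates
                      if m["properties"].get(name, value) == value]
        if len(candidates) <= 1:
            break  # recognition (or failure) is already decided
    return candidates
```

Because the property order is fixed in advance, the computation can be planned: cheap discriminative properties run first and expensive ones only when still needed.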
The recognition of actions differs in that not all properties can be computed at the outset. We have to wait for further frames to compute new properties and refine the recognition. Instead of using classification or deduction, we have to use abductive diagnosis for action recognition. If we compute a property of an action, it merely provides a clue from which to generate a hypothesis explaining the property. We then have to maintain a set of hypotheses and wait for additional properties to decide which one is true.
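This hypothesize-and-wait strategy can be sketched as follows (the names and structure are illustrative assumptions): each observed property does not classify by itself, it only constrains which hypotheses could still explain the sequence so far.

```python
def abductive_recognition(observed, hypotheses):
    """Abductive action recognition sketch. `observed` is the sequence
    of properties computed frame by frame; `hypotheses` maps an action
    name to a predicate pred(t, prop) that is True when the hypothesis
    can explain property `prop` at time t. Hypotheses inconsistent
    with new evidence are discarded as frames arrive."""
    alive = dict(hypotheses)
    for t, prop in enumerate(observed):
        alive = {name: pred for name, pred in alive.items()
                 if pred(t, prop)}
        if len(alive) <= 1:
            break  # a unique explanation (or none) remains
    return sorted(alive)
```

Note that with partial evidence the output is a set of live hypotheses, not a single class; the decision is deferred until the evidence disambiguates.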
Another possibility is to work towards representations that allow for causal reasoning, i.e., representations of what is happening and of how we can use it to interpret present and future actions. In many ways this method is aimed at extracting scripts of actions from series of snapshots (both static and dynamic). I believe this methodology could prove very useful in limited domains where we can define the context and also bootstrap some sort of representational model of action within the context.
Causality is essential in action understanding. The reasons have already been given. [See Kuniyoshi's previous answers.]
In this work we have described a specific image sequence understanding system based on the Newtonian mechanics of a simplified scene model. While the actual domain theory is not critical, it is interesting to contrast the general physics-based approach to other approaches to motion understanding.
Physics-based approaches vs. appearance-based approaches. Appearance-based approaches, such as hidden Markov models (Siskind and Morris, 1996), describe events based on the time course of features such as the positions and motions of the participating objects. Unlike the physics-based approach, however, there is no underlying representation for object properties (such as active vs. passive objects, or attachment) that are not directly observable from the sensory input. While one could argue that a hidden Markov model could distinguish these cases, more observations of the events would be required to train the system, since the model does not exploit domain-specific knowledge about Newtonian mechanics.
Physics-based approaches vs. rule-based approaches. Several rule-based systems have been presented that attempt to describe the underlying "causal" structure of the domain, usually in the form of linguistic or conceptual descriptions. Sample domains include the analysis of support relations in static scenes (Brand et al., 1993) and the analysis of changing kinematic relations in sequences (Borchardt, 1996). While such approaches are appealing, we feel they are inadequate since they do not provide an explicit representation, and the validity of the rules therefore cannot be evaluated.
Our psychophysical results suggest that people rely on internal models
of kinematics and dynamics not only to generate but also to recognize movement.
This means recognition systems akin to human movement perception would have
a stock of "recognizable" physical models to draw on when observing
and interpreting action and movement. This is perfectly reasonable for a
constrained set of interactions; it is quite likely that humans evolved
the ability to learn and maintain such models since social interaction,
i.e., watching human action/behavior, is an important part of human visual
processing. Aside from social interaction, humans certainly appear to use
internal physical models of objects when performing reaching, grasping,
and manipulation, even in the absence of visual feedback. This notion of
a functional physical model, as distinct from a static model or view typically
used in vision, may have a great deal of potential not only in sensori-motor
interaction and perception, but in higher level cognition as well.
While I'm uncertain about the role of causal analysis per se, it's certainly very useful from the standpoint of efficiency to decouple the total state space into separate systems with constrained interactions between them. The complete set of human motions can be characterized, for example, and then mapped onto an arbitrary set of objects. One clear role for detailed knowledge about physics and process is as an aid to the perceptual system. In fact, most previous work addressing causal analysis has used it to compensate for very crude vision techniques. I am skeptical about the ability of this line of attack to scale up to interesting problems, and am unsure of the need for physics reasoning if the sensing process is reliable. Note that I would answer differently in a robotics context where it is necessary to build prior models for the effects of one's actions that could be used, for example, for planning.
Perhaps the most promising role for causal knowledge is the ability to enlarge an action-concept category to include a new instance after observing it for the first time. For illustration, consider the previous example of the electric mixer. An action recognition system that knows about previous instances of mixing might have a causal model of the effect of moving utensils on ingredients. If a causal connection between the beaters and the ingredients could be established for the electric mixer, it may be possible to add electric mixers to the action-concept class.