Extended Abstracts

Extended abstracts - some folks responded with full position papers or "extended abstracts." I extracted the answers to the provided questions and put them in the appropriate pages, but here are the full texts that give the individual coherent positions.

Minoru Asada

Vision-based Behavior Learning and Development for Emergence of Robot Intelligence

We discuss how so-called "intelligence" can emerge as a cognitive process, that is, how an agent can develop its internal representation according to the complexity of the interactions with its environment through its capabilities of sensing and acting. The complexity might be increased by the existence of other active agents, and the development depends on how the agent can find a new axis in its internal representation while trying to accomplish a given task in an environment that includes other agents. As an example of such a development, we show a case study of vision-based mobile robots whose task is to play soccer (shooting and passing a ball, or avoiding an opponent), along with preliminary experiments with real robots.

The ultimate goal of our research is to design the fundamental internal structure inside physical entities having their bodies (robots) from which complex behaviors can emerge through interactions with their environments. For intelligent behaviors to emerge, physical bodies have the important role of bringing the system into meaningful interaction with the physical environment, which is complex and uncertain but has an automatically consistent set of natural constraints. This facilitates correct agent design, learning from the environment, and rich, meaningful agent interaction. The meanings of "having a physical body" can be summarized as follows:

1) Sensing and acting capabilities are not separable, but tightly coupled.

2) In order to accomplish the given tasks, the sensor and actuator spaces should be abstracted under the resource bounded conditions (memory, processing power, controller etc.).

3) The abstraction depends on both the fundamental embodiments inside the agents and the experiences (interactions with their environments).

4) The consequences of the abstraction are the agent-based subjective representation of the environment, and its evaluation can be done by the consequences of behaviors.

5) In the real world, both inter-agent and agent-environment interactions are asynchronous, parallel and arbitrarily complex. There is no justification for adopting a particular top-down abstraction level for simulation such as a global clock 'tick', observable information about other agents, modes of interaction among agents, or even physical phenomena like slippage, as any seemingly insignificant parameter can sometimes take over and affect the global multi-agent behavior.

6) The natural complexity of physical interaction automatically generates reliable sample distributions of input data for learning, whereas the a priori Gaussian distributions used in simulations do not always capture the correct distribution.

The design principles are

1) the design of the internal structure of the agent which has a physical body able to interact with its environment, and

2) the policy for providing the agent with tasks, situations, and environments so as to develop the internal structure.

Here, we focus on the former and attempt to define the complexity of the environment based on the relationship between visual motion cues and self motor commands using our soccer playing robots.

Self body and Static Environment: The self body and the static environment can be defined as the observable parts whose changes in the image plane can be directly correlated with the self motor commands (e.g., looking at your hand showing voluntary motion, or observing the optical flow of the environment when changing your gaze). Theoretically, discrimination between "self body" and "static environment" is a hard problem because the definition of "static" is relative and depends on the selection of the base coordinate system, which in turn depends on the context of the given task. Usually, we assume the natural orientation of gravity, which provides the ground coordinate system.

Passive agents: As a result of actions of the self or other agents, passive agents can be moving or stopped. A ball is a typical one. As long as they are stationary, they can be categorized into the static environment. But when they are in motion, their correlation with the motor commands is not as simple as that of the self body or the static environment.

Active (other) agents: Active other agents do not have a simple and straightforward relationship with the self motions. In the early stage, they are treated as noise or disturbance because they have no direct visual correlation with the self motor commands. Later, they can be found to have a more complicated, higher-level correlation (coordination, competition, and others). The complexity is drastically increased.
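As a rough illustration of the categories above, the following sketch labels a tracked image region by how strongly its image-plane motion correlates with the robot's own motor commands. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the one-dimensional motion signals, and the threshold value are all hypothetical. It also only separates "self body or static environment" from "everything else"; telling passive from active agents would need a richer model, as the text notes.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def categorize(motor, image_motion, corr_threshold=0.9):
    """Label a region from its motion history (corr_threshold is a hypothetical value).

    motor        -- self motor commands over time, e.g. commanded turning rate
    image_motion -- the region's observed image-plane motion over the same steps
    """
    if abs(pearson(motor, image_motion)) >= corr_threshold:
        # Motion is explained by the robot's own commands.
        return "self_or_static"
    # Motion is not explained by self motion: a moving passive or active agent.
    return "moving_agent"


# Hypothetical data: background flow mirrors the commands, the ball does not.
motor = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0]
background_flow = [-2.0 * m for m in motor]
ball_motion = [0.5, 0.5, -0.3, 0.9, -0.7, 0.1]

background_label = categorize(motor, background_flow)
ball_label = categorize(motor, ball_motion)
```

A stationary ball would produce the same flow pattern as the background and so fall into the first category, which matches the description of passive agents at rest.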

As the complexity of the environment grows, the internal structure of the robot should become higher-level and more complex for various intelligent behaviors to emerge. We show one such structure that copes with the complexity of agent-environment interactions in real robot experiments, and discuss future issues.

Matthew Brand

Action perception

A useful result from a vision system would be an answer to the question, "What is happening?" If we ask a computer to interpret a video signal, we are obliged to develop computational models of what is interesting (events), and how sets of interesting signals map to interpretations (contexts). Let me argue briefly that the prospects for vision-of-action are much more promising than for traditional high-level vision, because for vision-of-action there is a good model of both events and contexts: causality.

Why causality?

Vision sciences traditionally take high-level vision to be concerned with static properties of objects, typically their identities, categories, and shapes. The relationships between these properties and visual features are correlational (not necessary or sufficient), leading to many proposals for how brains and computers may compute optimal discriminators for various sets of images. The success of these methods depends of course on coming close to the prior probability of all possible "natural" images, a distribution which nobody knows how to approximate.

Arguably, causal dynamic properties of objects and scenes are more informative, more universal and more easily computed. These properties (substantiality, solidity, contiguity, contact, and conservation of momentum) are governed by simple physical laws at human scales and are thus consistent across most of visual experience. They also have very simple signatures in the visuo-temporal stream. But most importantly, since these properties are causal, we may find that a small number of qualitative rules will provide satisfactory psychological and computational accounts of much of visual understanding.

For example: There is a growing body of psychological evidence showing that infants are fluent perceivers of lawful causality and violations thereof. Spelke and Van de Walle found that nine-month-old infants can make sophisticated judgments about the causality of action, sometimes before they can reliably segment still objects from backgrounds! Yet, only a handful of rules suffices to describe all the causal distinctions made by the infants.

From pixel correlations to physical causes

Following this lead, I am exploring the hypothesis that causal perception rests on inference about the motions and collisions of surfaces (and proceeds independently of processes such as recognition, reconstruction, and static segmentation). I am building systems that use causal landmarks to segment video into actions, and use higher-level causal constraints to ensure that actions are consistent over time. Each system takes a video sequence of manipulative action as input, and outputs a plan-of-action and selected frames showing key events --- the "gist" of the video --- useful for summary, indexing, reasoning, and automated editing. Gisting may be thought of as the "inverse Hollywood problem" --- begin with a movie, end with a script and storyboard.

These systems track patches of similarly behaving pixels (coherent in motion or color) and interpret the changing spatial relations between them. The strategy is to detect potential events in varying shapes, velocities, and spatial relations among the tracked surfaces. This becomes a problem in dynamical modeling of multiple interacting processes. (Up to three "objects" and their pairwise spatial relations are needed to capture the full range of actions denoted by natural language verbs.) For this I have developed a generalization of hidden Markov models (HMMs) in which HMMs are coupled with across-model probabilities to represent the interactions between processes, e.g., between sets of evolving spatial relations. These models are trained to recognize events and then assembled into a finite state machine whose transitions accord with the infant perception results obtained by Spelke et al. A modified Viterbi algorithm is then used to parse video sequences of continuous action, integrating information over time to find the most probable sequence of actions given the evidence in the video.
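To make the parsing step concrete, here is a minimal sketch of the standard Viterbi algorithm on a single discrete HMM. It is not the modified, coupled variant described above; it only shows how evidence is integrated over time to recover the most probable state sequence. The state names, observation symbols, and all probabilities are hypothetical values chosen for illustration.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for one discrete HMM, in log space."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    back = []  # one dict of back-pointers per time step after the first
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][obs]
            pointers[s] = p
        back.append(pointers)
    # Trace the best final state back to the start.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical manipulation model: hidden actions and observed spatial relations.
states = ["reach", "grasp", "lift"]
start = logs({"reach": 0.8, "grasp": 0.1, "lift": 0.1})
trans = logs({
    "reach": {"reach": 0.6, "grasp": 0.3, "lift": 0.1},
    "grasp": {"reach": 0.1, "grasp": 0.5, "lift": 0.4},
    "lift": {"reach": 0.1, "grasp": 0.1, "lift": 0.8},
})
emit = logs({
    "reach": {"apart": 0.8, "touching": 0.15, "together": 0.05},
    "grasp": {"apart": 0.1, "touching": 0.8, "together": 0.1},
    "lift": {"apart": 0.05, "touching": 0.15, "together": 0.8},
})

video = ["apart", "apart", "touching", "together", "together"]
log_prob, actions = viterbi(video, states, start, trans, emit)
```

With these made-up numbers, the hand and object are first apart, then touching, then moving together, and the recovered sequence of hidden actions follows that progression.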

There's more to action

These systems are aimed at dexterous manipulation tasks, like repairing a machine. In such tasks, causal events (grasping a screwdriver, touching it to a screw, pulling out the screw, etc.) contain most of the information one needs to reconstruct the actor's plan and intentions. But there are many actions described by natural language that are not causal. Some are gestural, such as waving and dancing. In these cases, there may still be frameworks for interpretation that provide computational leverage, e.g., dances are rhythmically structured and deviations from that structure are often expressive. This framework is not as compact, productive, and reliable as physical causality, but it has obvious utility. A similar argument can be made for games and athletic events: there is a loose logic to the patterns of motion; both deviations and completions of cliched motion sequences are interesting. Conversational gesture is substantially harder; the individual and cultural variation in gesture events is enormous, and the context is even less tractable. However, I have argued elsewhere that there is a small, oft-used subset of gestures that have metaphorical causality, for example, "opening" one's arms to indicate receptivity and "pushing away" an unappealing proposition.

Even if all these tasks are considered purely as pattern-recognition problems, there are physical logics that will improve one's computational prospects. Most of these derive from the kinematics and dynamics of the human body, and this can lead us to a choice of pattern recognition algorithms and signal representations. For example, to make a system that recognizes martial arts moves, we noted that the positions of the end-effectors (hands) carry enough information to support gesture discrimination. Moreover, since each hand is doing something different, the difference between them carries information as well. Consequently, we used coupled HMMs with each model only seeing data from one hand. This substantially outperformed conventional HMMs applied to data from both hands, principally because coupled HMMs can explicitly model the interaction.
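The coupling idea can be sketched as follows: two chains (say, one per hand) each emit only their own observations, but each chain's next state depends on the current states of both chains. This is a minimal sketch that computes the likelihood by exact forward recursion over the joint state space; it is one simple way to evaluate a two-chain coupled HMM, not necessarily the inference procedure used in the work described. All parameter values and names are hypothetical.

```python
from itertools import product


def coupled_forward(obs_a, obs_b, start_a, start_b,
                    trans_a, trans_b, emit_a, emit_b):
    """Likelihood of two aligned observation streams under a two-chain coupled HMM.

    trans_a[a][b][a2] = P(next A state a2 | current states a, b)
    trans_b[a][b][b2] = P(next B state b2 | current states a, b)
    emit_a[a][o], emit_b[b][o] = per-chain emission probabilities
    """
    joint = list(product(range(len(start_a)), range(len(start_b))))
    # alpha[(a, b)] = P(observations so far, chain A in a, chain B in b)
    alpha = {(a, b): start_a[a] * emit_a[a][obs_a[0]]
                     * start_b[b] * emit_b[b][obs_b[0]]
             for a, b in joint}
    for oa, ob in zip(obs_a[1:], obs_b[1:]):
        alpha = {
            (a2, b2): emit_a[a2][oa] * emit_b[b2][ob] * sum(
                alpha[a, b] * trans_a[a][b][a2] * trans_b[a][b][b2]
                for a, b in joint)
            for a2, b2 in joint
        }
    return sum(alpha.values())


# Hypothetical two-state chains with binary observations (0 or 1).
start_a = [0.5, 0.5]
start_b = [0.5, 0.5]
# Chain A is more likely to enter state 1 when chain B is already in state 1.
trans_a = [[[0.9, 0.1], [0.3, 0.7]],
           [[0.4, 0.6], [0.1, 0.9]]]
trans_b = [[[0.8, 0.2], [0.5, 0.5]],
           [[0.2, 0.8], [0.1, 0.9]]]
emit_a = [[0.9, 0.1], [0.2, 0.8]]
emit_b = [[0.9, 0.1], [0.2, 0.8]]

likelihood = coupled_forward([0, 1, 1], [1, 1, 1],
                             start_a, start_b, trans_a, trans_b,
                             emit_a, emit_b)
```

For recognition, one such model would be trained per gesture and the model with the highest likelihood chosen. The joint-state recursion costs grow with the product of the chain sizes, so it only suits small models.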

Upcoming challenges

There is already a wealth of visual representations that will be useful for vision-of-action: deformable templates, motion blobs, higher-order moments, flow, temporal textures, texture tracking, eigen-parameter representations, etc. Many of the immediate challenges now lie on the side of interpretation. What do we need? More sophisticated frameworks for dynamical inference and learning over the output of these representations. Better representations of time and different time-scales. Computationally efficient methods for discriminating action from inaction. And, of course, opportunities to field useful systems and thus encounter real problems.

Trevor Darrell

On the action of perception of action

To me, actions are simply objects in the spatio-temporal domain. All of the pressing issues in object recognition are present (or will be present) in action understanding as well: issues of view-centered vs. object-centered representation, of what the desired modes of generalization are, of whether to model the statistics of pixels or to capture higher structure.

The classic issues of syntax vs. semantics, functionality, and even knowledge representation, all are there. I would hope practitioners in each subfield would be well versed in each other's techniques; the relative bulk and history of the object literature would place the larger burden in this regard on the action recognition/understanding researcher. Perceived lack of prior progress is, of course, no excuse to be ignorant of any relevant literature. Too many of us, myself included, have proffered a 'novel' method or technique which really amounts to a change of domain.

But wait, the spatio-temporal domain does matter! There are differences in technique, and in the phenomenology, of the perception of dynamic and static objects. Few of the current computational approaches to the perception of action exploit what I consider the salient differences, though I hope this will change as the field matures.

At the risk of excessive symmetry in nomenclature, I would say the clearest difference lies in the "action of perception of action". (At least as it relates to the action of perception, or active perception, of static signals). It is only when we consider active perception, with its implicit perceptual limitations and foveal dynamics, that it is easy to see clear differences between static object perception and dynamic action perception. Simply put: with objects, under active perception conditions, we have the luxury of extended viewing conditions.

In a sense, in active object perception we have the potential for full access to the state of the object, since given sufficient viewing time we can scan the perceptual apparatus to an arbitrary set of locations and obtain the highest resolution information everywhere on the object. If we are unsure about what we saw the moment before (was that the letter I or the number 1?), we can, and do, rapidly scan back to the relevant portion of the object. As many of the pioneers of active perception have pointed out, when the world is available to represent itself the brain will likely take advantage of this fact.

But this is impossible with dynamic objects, by definition! If an active perceptual apparatus is attending to a particular spatial portion of the spatio-temporal signal flowing past, some information at other spatial regions will inevitably be missed, and the perceptual system will not in general have any way of recovering that information. Perception of action is therefore a more challenging task, or at least more unforgiving to the errant active observer. It also makes explicit the need for internal representations, insofar as aspects of action can influence interpretation (i.e., context) across relatively long time scales. Whether there is internal representation in static perception may be open for debate, but not so in the perception of action.

There are many interesting perception of action tasks that do not require active perception. Probably most of the short term progress in new application areas will be of this character. But I do not see how this endeavor is different from object recognition in any principled way. If we have full access to the spatio-temporal signal, there is no difference between understanding the spatial signal and the spatio-temporal signal. (Of course, for real-time systems, there is also the additional difference that temporal signals are causal, and processing must not depend on future observations; but if off-line processing is allowed this is not an issue, and thus does not seem to be a central distinction between the spatial and spatio-temporal domains.)

If we do consider the influence of active perception on the perception of action (and we would need to find more asymmetric terminology!), I think it could have a significant impact on applications in the domain of action perception. In many domains the action of the observer is implicit in the semantic content of the signal: the signal is designed (or potentially co-evolves) to have a particular active observation pattern appropriate to detect/understand/enjoy it. Film and video editors have always understood this, and pay close attention to the dynamics of the active observer when constructing their own dynamic media. (To be honest, this has been true in static imagery as well, but it is usually less explicit in the mind of the static artist, and the observer is more able to control his or her experience with a static signal than with a dynamic signal.) The analysis of these types of signals within a machine perception system, either for automated manipulation or search of video footage, or for interpretation of interpersonal gesture in a human-computer interface context, would benefit considerably from an understanding of how the process of active observation operates on particular dynamic signals.

Action perception is thus interesting for many reasons: it at minimum raises considerable methodological challenges, it certainly suggests new applications that make clear the utility of intelligent visual processing methods, and it potentially forces us to address some issues in attention and the conscious processing of signals that are otherwise avoidable. Practical progress on any of these fronts would be a significant contribution.

Larry Davis

Looking at People in Action

For most of the past thirty years the computer vision community has focused its attention on a world without people, making substantial progress on problems such as recognition of rigid 3-D objects, estimation of egomotion through (mostly) rigid scenes and understanding the physical relationships between images, scenes, sensors and illumination. During the past five years people have entered the picture, both complicating our lives and bringing to our attention a new set of fundamental and applied research problems in perception and cognitive vision.

People's faces have attracted the most attention, and the greatest successes have involved the application of "appearance-models" to problems such as face detection, estimation of head orientation and face recognition. This has led many researchers, ourselves included, to be tempted to apply similar techniques to the representation and recognition of human "activities." So, for example, Black and Yacoob employ appearance models of facial feature deformation to recognize facial expressions from video; Pentland and Starner employ them for reading American Sign Language; and Bobick and Intille introduced motion energy templates for recognizing simple actions such as sitting, kneeling, etc. There is also, of course, a strong and vocal community in human perception research that argues in favor of appearance-based methods for human visual perception.

Whatever the case may be, it is important to keep in mind that the challenges to appearance-based techniques that have limited their application to 3-D object recognition (including large memory requirements for practical object databases to handle arbitrary viewpoint, useful indexing methods, recognition in clutter, susceptibility to occlusion, sensitivity to illumination conditions, difficulty in formulating a useful model for rejecting the null hypothesis, ...) are only worsened when one considers highly articulated objects such as people, who wear a variety of clothing styles, colors and textures and perform their actions and activities with very large within-class variation.

I believe that these problems cannot be solved solely within an appearance-based framework using any computer, man or machine, with reasonable size and power. Instead, these considerations argue strongly for the use of symbolic, or linguistic, methods at least at high levels of representation of human form and action. Our ability to reason about human movement or to recognize novel activities - or familiar activities viewed in novel ways - suggests that there is a constructive component to recognition of human action. I am reminded of Fodor's interesting arguments against neural networks as a foundation for thought because they do not admit of recursive composition. Appearance-based models mostly ignore the deep interface between perception and language (not necessarily natural language, but some formal structure for drawing spatial inferences and planning activities in space). Our own work, here, uses temporal logic programming to represent and recognize people's actions and interactions, with those logic programs operating on (and controlling the construction of) a database of facts concerning the positions and trajectories of body parts in image sequences.

Such linguistic models also liberate us from the "tyranny of time" that plagues appearance-based methods, which typically employ computational models such as HMMs, dynamic time warping, or autoregressive models to match appearance models to image sequences. It is by now almost dogmatic within the computer vision community that spatial vision is context dependent and opportunistic. Knowing with confidence (either from a priori information or from image analysis) the location of some object in an image provides a spatial context for searching for and recognizing related objects. We would never consider a fixed spatial pattern of image analysis independent of available or acquired context. I believe that the same situation should obtain in temporal vision, and especially in the analysis of human actions and interaction. Our recognition models should predict, from prior and acquired context, actions both forward and backward in time. This idea is both classic and commonplace in AI planning systems, which reason both forward and backward and force planning to move through certain states known (or determined) to be part of the solution sought. Such approaches help to control the combinatorics of recognition and naturally allow us to integrate prior and acquired knowledge into our analysis.

Irfan Essa

Present research efforts in the area of machine perception of action are not much different in their goals from earlier work aimed at extracting intentionality from the environment or extracting syntactic and semantic cues from the scene. What we do have going for ourselves at present is that computing power can support our needs to undertake robust searches through somewhat toy domains, or allow us to model limited scenarios so that we can just look for "stuff" and "things" that we understand and make rule-based or probabilistic inferences from them. No real attempts at undertaking machine perception of actions, and I specifically mean human actions, in real domains have been made to date, and perhaps none can be made for some time to come.

Maybe we really do not know what the real bottleneck against further progress in this area is. Is it still computing? Or is it that we still don't really have a good representation of action, a representation that is suitable for machine perception? Or is it that we really do not know how we (humans) represent actions and perceive them, and since our model of machine perception is to be based on us, maybe we need to understand that first?

Most of the methods for machine perception either take a completely data-driven approach or an approach based on a known structure or a model representation of action. The data-driven methods try to define robust methodologies for tracking actions so that these actions can be perceived and recognized. No prior assumptions about what is being perceived are made, and the hope is that the action has a unique, inherent and implicit representation that completely defines it and can then be used for recognition. As much as we would like to believe in the "pureness" of this methodology, we have to question the lack of a true representation that this methodology relies on for recognition.

One way of addressing the lack of an explicit representation is to build a structure or a model of what needs to be perceived. The limitation of this is how to deal with events and actions that we did not have the insight to incorporate in our a priori models. Additionally, building in a whole repertoire of actions and their explicit models is by no means a trivial problem. We might as well develop a system that does robust search over a very large space of possible solutions (why does it matter whether the solutions are detailed model-based representations or just data-specific estimates based on the probability of captured signals?).

Overall, our goal is to extract some sensory information so that we can use it to label certain events in the scene. Capturing and labeling an event may not be exactly what we need for perception of action, as an action could be a series of related events or sometimes just a single event (e.g., walking vs. pointing). This causes a major problem in how to represent actions (or how to define an action) for recognition. Additionally, it suggests the importance of time in the representation of actions. It is for this reason that some sort of spatio-temporal reasoning needs to be incorporated into our representations. Such spatio-temporal reasoning can be incorporated by using constructs that change with time, with expectations and probabilities assigned to these constructs so that we can predict the changes and estimate the actions. There are several ways we can attempt to do this.

We could accomplish this by building a language of actions, a task that, though very interesting and challenging, is in many ways not much different from some of the very hard AI problems. However, it seems that we do need to tackle some small parts of this problem to gain insight into the interpretation of actions. Perhaps we can build some sort of limited grammar for actions within certain domains. A prior script of a series of actions at various levels of granularity would be very useful; however, the domain would then be limited to the possible actions defined in the script.

Another possibility could be that we can work towards representations that allow for causal reasoning, i.e. what is happening and how can we use it to interpret present and future actions. In many ways this method is aimed at extracting scripts of actions from series of snapshots (both static and dynamic). I believe this methodology could prove to be very useful in limited domains where we can define the context and also bootstrap some sort of representational model of action within the context.

Perhaps the concepts of physics-based modeling can be employed for developing detailed representations of actions. The major benefit of physics-based methods would be that the variability of time can be easily incorporated into the analysis and allow for spatio-temporal representations. Animation and mimicking of actions is a very important by-product of this method. Again, the limitation is the domain and the context.

So it seems that we are forced to deal with limited domains and devise methodologies that, for each specific domain, at various levels of detail (in both space and time), allow for a somewhat deep exploration of action interpretations. The type of representation, be it completely data-driven (only look at signals and infer actions using probabilistic and/or rule-based classifiers) or model-based (look at signals that can be mapped onto known and modeled actions), also depends on the task at hand.

In my earlier work I have taken much more of a model-based approach. Defining a complete spatio-temporal structure of what to observe and then placing it within the observer-controller framework, I have succeeded in extracting very detailed actions. Detailed models have allowed me to experiment with probabilistic models in the space of the states that the model can handle. I have also experimented with just using the models for energy minimization and constraints (and not for interpretations) on the data for exploring probabilistic data-driven models. Both of these approaches, though quite limited in their scope and domain, provide very detailed interpretations.

At present, I am interested in pursuing two directions. One is aimed at extracting very detailed representations that can encompass a structured spatio-temporal model, a model that will allow analyses both in a scale space and in a granular representation space. The second direction we are pursuing is the building of Interactive Environments (see www.cc.gatech.edu/fce) that are context aware and are capable of various forms of information capture and integration. We are interested in using all forms of sensory information and we are trying to work on very specific domains like classroom, living-room, kitchen, car, etc.

I hope that during this workshop we will work towards identifying some core ideas for this community and focus on what the hard problems are and how we can solve them in the next few years without solving AI. I do believe we have a lot to learn from cognition.

Michael Isard

Perception of action shares a problem with many other areas of computer vision; although there are clear research problems to be addressed, the potential applications are somewhat nebulous. There are two approaches to this dilemma --- either continue along promising lines of research in the expectation that the solution of general problems is A Good Thing and will be ultimately useful, or, in the words of Dilbert's Boss, "Identify the problem, then solve it." Both approaches have in the past been carried out simultaneously, and there is no reason to think that things should be different for research into action. As well as an overall aim for the field, therefore, it is worth considering directions for each of these shorter term goals.

Applications for the perception of action

The most compelling applications are those which are closest to AI. An obvious example is a querying system for a video database. This contains most of the interesting foreseeable problems --- it is necessary to specify a language to describe the most general forms of action, and then match unknown test sequences to sentences in that language. This seems precisely the sort of problem, along with generalised object recognition and natural language understanding, which is waiting for a breakthrough in AI, and on which current methods seem unlikely to have much success. In the meantime, therefore, it is worth thinking about what useful tools might be built in the next few years, given moderate success in the field. The understanding of the full range of Sign Language cannot be far away; once tracking is sufficiently advanced to follow both hands occluding each other and their relation to the rest of the speaker, the understanding of the gestures should require little advance over current work. An interesting medium-term goal might be a robot helper. The task would be to watch a human carrying out some task, for example assembling an object, and flag mistakes in the assembly process given a description of the desired outcome. A continuation of the problem would be to learn the required sequence of actions by watching training examples. This is related to the teleoperations problem of designing a robot to intelligently mimic the human operator --- a screwing motion of the control device should cause the remote robot to accurately position a screwdriver and insert the desired screw. To begin such a project would require the design jointly of an alphabet and language for scene parts and their motions and a tracking system capable of following objects corresponding to letters from that alphabet. 
In a specialised domain such as human body-tracking, it should be possible to make rapid progress building on existing work, perhaps using a blob tracker for the articulated units of the body, along with higher resolution information taken from the (already localised) hand to identify intricate activities.

A general research topic for the perception of action

Existing methods of modelling motion and action fall broadly into two camps; discrete state-based and continuous-valued models. A useful line of research would be to investigate possible methods of combining the two methodologies. State-based descriptors are useful for capturing the long-term information in a sequence, but they rely heavily on being able to classify each image into one of the finite number of available states. As the problem domain becomes more general, it is not clear how well this will scale --- for example it may be possible to recognise "a person running" but much more difficult to generalise to "an animal running" where the differences in shape and natural timescale between, say, a mouse and an elephant, are rather large. It may also be difficult to represent such general notions as "an object moving to the right" without building enormously complex state sequences, so long as states are confined to be still snapshots of the motion. Discrete-state representations allow the choice of using the whole scene as input, for example to a neural network, or segmenting foreground objects from the sequence. For most applications it is probably necessary to allow unmodelled backgrounds, and thus to segment the object(s) of interest, but in complex scenes it may prove impossible to perform this segmentation using only low-level vision. Some information about the probable configuration of the object will be needed, and a purely state-based representation of action, with no detailed information about the size or position of scene objects, severely restricts the high-level information which can be fed down to aid a segmentation routine.

Continuous-valued motion models have so far been used mainly within trackers, as motion predictors. They tend to have rather short timescales, and so to be well suited to representing such broadly discriminable classes as coupled oscillators. When the useful information in a scene can be parameterised efficiently, using a tracker to provide time-varying estimates of those parameters is an attractive methodology. By constructing a mapping from the continuous parameter space to the discrete state space, a tracker can be used as the segmentation stage for a discrete-state model. More generally, if multiple continuous models are used, then the model in force can be used as a discrete label in its own right, and the discrete-state approach can be expanded to include motions rather than simply snapshots (for example the concept of "moving to the right" can be rather easily expressed by a continuous-valued model). The tracking paradigm has so far broken down when the scenes to be described become too general, as it becomes difficult to parameterise the shape of an unknown object in a robust tracker. It is also hard to see how it would be possible to parameterise an entire arbitrary scene, so the continuous-valued model approach may be restricted to problems where there is a collection of foreground objects moving against unmodelled background.
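The mapping from a continuous parameter space to discrete states can be sketched in a few lines. This is only an illustration under stated assumptions: the tracker is assumed to supply a per-frame horizontal velocity estimate, and the function names, label names and threshold are hypothetical rather than taken from any particular system.

```python
from typing import List


def label_motion(vx: float, threshold: float = 0.5) -> str:
    """Map a continuous horizontal velocity estimate to a discrete label.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    if vx > threshold:
        return "moving_right"
    if vx < -threshold:
        return "moving_left"
    return "stationary"


def discretise(track: List[float], threshold: float = 0.5) -> List[str]:
    """Label every frame, then merge runs of identical labels into one symbol."""
    symbols: List[str] = []
    for vx in track:
        label = label_motion(vx, threshold)
        if not symbols or symbols[-1] != label:
            symbols.append(label)
    return symbols


# Hypothetical tracker output: one velocity estimate per frame.
velocities = [0.0, 0.1, 1.2, 1.4, 1.1, 0.2, -0.9, -1.3]
print(discretise(velocities))
```

The resulting symbol sequence is what a discrete-state model would consume; a notion such as "moving to the right" becomes a single symbol rather than a long chain of snapshot states.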

The effective combination of continuous and discrete models raises interesting questions. Clearly some gestures are more amenable to this type of description than others, and it is not always apparent what granularity there should be --- when does a complex continuous model break down into a sequence of several distinct but simpler actions? Fusing discrete and continuous models into a single state-vector is also attractive, since it presents the opportunity to allow information from each type of model to feed towards the prediction and recognition of the other. While methods are known separately to learn Hidden Markov Models and parameterised continuous motion models from training sequences, discovering a learning algorithm to estimate both forms of model jointly is an open problem.


The range of work which could be labelled "perception of action" is rather daunting. The furthest goals, of intelligent classification of action, seem somewhat outside the remit of computer vision, however --- along with many other domains, we are waiting for AI to deliver. There is clearly much useful short to medium term research to be done in the meantime which ties in with other current vision research such as tracking, object recognition, and robotics, so the perception of action should develop into a healthy field even without the holy grail of true machine intelligence.

Richard Mann and Allan Jepson


The general area of our research is computational perception with emphasis on high-level vision.

Our approach involves the specification of three general components considered essential for any perceptual system. First, we require an ontology that specifies the representation for the domain. Second, we require a domain theory that specifies which interpretations are consistent with the system's world knowledge. Finally, since the perception problem is typically under-constrained (there are multiple interpretations consistent with the sensor data), we require some form of preferences to select plausible interpretations. Together these provide a computational definition of a perceiver, for which percepts are defined to be maximally preferred interpretations consistent with both the sense data and the system's world knowledge (Jepson and Richards, 1993).

In our current work we consider a specific inference problem: the perception of qualitative scene dynamics from motion sequences. An implemented perceptual system is described in detail in (Mann, Jepson, and Siskind, 1997; Mann, 1997).

For the first component of our system, we propose a representation based on the dynamic properties of objects and the generation and transfer of forces in the scene. Such a representation includes, for example: the presence of gravity, the presence of a ground plane, whether objects are 'active' or 'passive', whether objects are contacting and/or attached to other objects, and so on. Collectively, we refer to a set of hypotheses that completely specify the properties of interest in the scene as an interpretation.

For the second component of our system we use a domain theory based on the Newtonian mechanics of a simplified scene model. Specifically, we say that an interpretation is feasible if there exists a consistent set of forces and masses that explain the observed accelerations of the scene objects. If we consider scenes containing rigid bodies in continuous motion, such a feasibility test can be reduced to a linear programming problem.
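To see why such a test is linear, consider a deliberately simplified point-mass version that ignores rotation. This is an illustrative assumption, not the formulation of the implemented system, and all symbols below are introduced here. The accelerations a_i are observed, gravity g and the contact normals n_k are known, and s_ik is +1, -1 or 0 according to whether contact k pushes object i along n_k, against it, or not at all. The unknowns are the masses m_i, the contact force magnitudes lambda_k, and a self-generated force u_i that only active objects may exert:

```latex
\begin{aligned}
\text{find } \quad & m_i,\ \lambda_k,\ \mathbf{u}_i \\
\text{such that } \quad
  & m_i \mathbf{a}_i = m_i \mathbf{g}
    + \sum_{k} s_{ik}\, \lambda_k \mathbf{n}_k + \mathbf{u}_i
    && \text{for every object } i, \\
  & m_i \ge 1, \qquad \lambda_k \ge 0, \\
  & \mathbf{u}_i = \mathbf{0} && \text{if object } i \text{ is passive.}
\end{aligned}
```

Every constraint is linear in the unknowns, so feasibility can be checked with a standard linear programming solver. The equations are homogeneous, so the bound m_i >= 1 merely fixes the overall scale and stands in for m_i > 0. In this sketch an attachment could be modelled by dropping the sign constraint on the corresponding lambda_k.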

In general, given feasibility conditions alone, there will be multiple interpretations consistent with the observed data. For example, a trivial interpretation can always be obtained by making every scene object active. In order to find informative interpretations we would like to choose interpretations that make the fewest assumptions about the participant objects. To do this, we use a set of preferences to choose among interpretations. In the work described here we use preference rules that choose interpretations which contain the smallest set of active objects and the smallest set of attachments between objects.
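The preference for the smallest set of active objects can be sketched as a search over subsets. The code below is a minimal illustration, not the implemented system: `is_feasible` stands in for the feasibility test, the toy predicate is a hypothetical one modelled on a hand lifting a can, and preferences over attachments are omitted.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence


def minimal_active_sets(
    objects: Sequence[str],
    is_feasible: Callable[[FrozenSet[str]], bool],
) -> List[FrozenSet[str]]:
    """Return every feasible set of active objects with no feasible proper subset."""
    minimal: List[FrozenSet[str]] = []
    # Visit candidates by increasing size so that smaller sets are found first.
    for size in range(len(objects) + 1):
        for combo in combinations(objects, size):
            candidate = frozenset(combo)
            if any(m <= candidate for m in minimal):
                continue  # contains a smaller feasible set, so not preferred
            if is_feasible(candidate):
                minimal.append(candidate)
    return minimal


def toy_feasible(active: FrozenSet[str]) -> bool:
    """Hypothetical stand-in: something must be active to generate the lift."""
    return len(active) >= 1


print(minimal_active_sets(["hand", "can"], toy_feasible))
```

With this toy predicate the search returns two interpretations, one with only the hand active and one with only the can active, while the trivial interpretation in which both are active is rejected because it contains a smaller feasible set.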

It is important to note, however, that while our system produces plausible explanations for the motion, there will often be multiple interpretations for a given frame of the sequence. For example, when considering the instantaneous motion of two attached objects, such as the hand lifting a can, we cannot determine which of the two objects is generating the lifting force. While it is possible to reduce ambiguity by integrating inferences over time (such as noticing that the hand is active in earlier frames of the sequence), this is only a partial solution to the problem. In particular, as described in (Mann and Jepson, 1997), if a behavior (such as attachment) is observed only when the objects interact we will be left with uncertainty about which object (if any) is the "cause" of this interaction.

This research raises several points relevant to the issues raised at this workshop.

*What is the role of "causal" or physics-based descriptions in understanding action?

In this work we have described a specific image sequence understanding system based on the Newtonian mechanics of a simplified scene model. While the actual domain theory is not critical, it is interesting to contrast the general physics-based approach to other approaches to motion understanding.

Physics-based approaches vs. appearance-based approaches. Appearance-based approaches, such as hidden Markov models (Siskind and Morris, 1996), describe events based on the time course of features such as the positions and motions of the participant objects. Unlike the physics-based approach, however, there is no underlying representation for object properties (such as active vs. passive objects or attachment) that are not directly observable from the sensory input. While one could argue that a hidden Markov model could distinguish these cases, more observations of the events would be required to train the system since the model is not exploiting domain-specific knowledge about Newtonian mechanics.

Physics-based approaches vs. rule-based approaches. Several rule-based systems have been presented that attempt to describe the underlying "causal" structure of the domain, usually in the form of linguistic or conceptual descriptions. Sample domains include the analysis of support relations in static scenes (Brand et al., 1993) and the analysis of changing kinematic relations in sequences (Borchardt, 1996). While such approaches are appealing, we feel they are inadequate since they do not provide an explicit representation and the validity of the rules therefore cannot be evaluated.

*What do we mean by action?

So far we have described work that infers so-called "force-dynamic" descriptions from observations of image sequences. In order to describe action, however, we need a way to relate natural categories of events such as "lift", "drop", "throw", etc. to the underlying force-dynamic representation.

To address this problem we could describe such events as sequences of qualitative dynamic states involving the participant objects. For example, as described in (Siskind, 1992), we may describe an event such as "A drops B" as a concatenation of two adjacent time intervals: an interval where the object B is at rest and stably supported by object A, followed by an interval where object B is falling.
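The "A drops B" description can be sketched as a search for two adjacent intervals in a per-frame sequence of qualitative states. This is a minimal sketch: the state labels and the function name are hypothetical, and the per-frame labels are assumed to come from an earlier inference stage.

```python
from itertools import groupby
from typing import List, Optional, Tuple


def find_drop(states: List[str]) -> Optional[Tuple[int, int, int]]:
    """Find the first 'supported' interval that is immediately followed by 'falling'.

    Returns (start, switch, end) frame indices, with end exclusive, or None.
    """
    # Collapse frame labels into maximal intervals: (label, start, end_exclusive).
    intervals = []
    pos = 0
    for label, run in groupby(states):
        length = len(list(run))
        intervals.append((label, pos, pos + length))
        pos += length
    # Look for the two adjacent intervals that make up the event.
    for (a, a_start, a_end), (b, _, b_end) in zip(intervals, intervals[1:]):
        if a == "supported_by_A" and b == "falling":
            return (a_start, a_end, b_end)
    return None


# Hypothetical qualitative state of object B in each frame.
frames = ["supported_by_A"] * 4 + ["falling"] * 3 + ["resting_on_ground"] * 2
print(find_drop(frames))  # (0, 4, 7)
```

Other events could be written in the same style as different patterns over the interval sequence.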

A more subtle question, however, concerns how to define action itself. In particular, what differentiates "passive" or "natural" events such as a rolling or bouncing ball, from "willful" actions such as a person hitting, dropping or throwing an object? We believe that in order to address these issues a more elaborate ontology will be required, perhaps including a representation of objects as agents with internal state, goals, and intentions.

*What disciplines do we need to (re) consider?

The role of knowledge representation in vision. We believe that in order to build reliable perceivers, we need to reconsider the role of knowledge representation in vision. In particular, in order to evaluate perceivers, the three components described above (ontology, domain theory, preference policy) should be made explicit in the system. Only in this way can we ensure that the system is sound and complete (i.e., finds all and only the appropriate interpretations) with respect to a given representation.

The role of domain structure in the search for interpretations. In some cases the difficult search problems in AI may be avoided through the use of domain structure. In particular, in our dynamics domain, we discovered that we could partition the domain and search for minimal sets of active objects while freezing the possible contacts and attachments at their maximal settings. This partition works (i.e., finds all interpretations) because of a special structure of the domain which we call monotonicity (Mann, 1997). Although it is unconfirmed, we believe that this type of structure may be useful in other domains as well.

The role of world structure in knowledge representation. Regularities and structure in the world serve to constrain the forms of knowledge representation schemes we need to consider. In particular, as suggested in (Richards, Jepson, and Feldman, 1996), natural categories of world structure can be described by placing qualitative probabilities over a discrete set of "modal" processes operating at different spatiotemporal scales. For example, in the motion domain, we may describe various qualitative categories such as: resting (stably supported by a ground plane), rolling, falling, and projectile motion (see Jepson, Richards, and Knill, 1996). These categories, based on world regularities, provide a restricted set of possible concepts our system needs to consider.

The role of world structure in learning. Finally, throughout our work we have assumed that the categories used to describe the scene are given in advance. An important research question concerns how these categories are initially determined. In particular, in order to adapt to new domains, a system should be able to learn which regularities exist in the world. (See Mann and Jepson, 1993 for some preliminary attempts to address these issues for a motion domain.)

Allan Jepson and Whitman Richards, "What is a Percept?", Tech. Report RBCV-TR-93-43, Department of Computer Science, University of Toronto. 1993.

Richard Mann, Allan Jepson, and Jeff Siskind, "Computational Perception of Scene Dynamics", ECCV, Cambridge, UK, 1996.

Richard Mann, "Computational Perception of Scene Dynamics", PhD thesis, Department of Computer Science, University of Toronto. 1997.

Richard Mann and Allan Jepson, "Towards the Computational Perception of Action". In progress, 1997.

Whitman Richards, Allan Jepson, and Jacob Feldman, "Priors, Preferences and Categorical Percepts", Perception as Bayesian Inference, David Knill and Whitman Richards (Eds), Cambridge University Press, 1996.

Allan Jepson, Whitman Richards, and David Knill, "Modal Structure and Reliable Inference", Perception as Bayesian Inference, David Knill and Whitman Richards (Eds), Cambridge University Press, 1996.

Richard Mann and Allan Jepson, "Non-accidental Features in Learning", AAAI Fall Symposium on Machine Learning in Computer Vision, Raleigh, NC, 1993.

Jeff Siskind and Quaid Morris, "Maximum Likelihood Event Classification", ECCV, 1996.

Matthew Brand, Lawrence Birnbaum, and Paul Cooper, "Sensible Scenes: Visual Understanding of Complex Scenes through Causal Analysis", AAAI, 1993.

Gary Borchardt, "Thinking between the lines: computers and the comprehension of causal descriptions", MIT Press, 1994.

Jeffrey Mark Siskind, "Naive Physics, Event Perception, Lexical Semantics, and Language Acquisition", PhD thesis, MIT, 1992.

Alex (Sandy) Pentland

Modeling and Prediction of Human Behavior


My approach to modeling human behavior is to consider the human as a finite state device with a (possibly large) number of internal "mental" states, each with its own particular control behavior, and inter-state transition probabilities (e.g., when driving a car the states might be passing, following, turning, etc.). State transitions can be directly influenced by sensory events, or can form natural sequences as a series of actions "play themselves out."

A simple example of this type of human model would be a bank of standard quadratic controllers, each using different dynamics and measurements, together with a network of probabilistic transitions between them. [In fact, we have used this exact formulation to predict users' actions for telemanipulation tasks [5].] A very much more complex example would be the virtual dog Silas in our ALIVE project [4].

In this framework action recognition is identification of a person's current internal (intentional) state, which then allows prediction of the most-likely subsequent states via the state transition probabilities. The problem, of course, is that the internal states of the human are not directly observable, so they must be determined through an indirect inference/estimation process.

In the case that the states are configured into a Markov chain, there exist efficient and robust methods of accomplishing this using the expectation-maximization methods developed for use of Hidden Markov Models (HMM) in speech processing. More recently, a variety of researchers at M.I.T. have developed methods of handling coupled Markov chains, hierarchical Markov chains, and even methods of working with general networks [2]. By using these methods we have been able to identify facial expressions [3], read American Sign Language [8], and recognize T'ai Chi gestures [2].

An Example: Predicting Driver's Actions

We are now using this approach to identify automobile drivers' current internal (intentional) state, and their most-likely subsequent internal state. In the case of driving the macroscopic actions are events like turning left, stopping, or changing lanes [6, 7]. The internal states are the individual steps that make up the action, and the observed behaviors will be changes in heading and acceleration of the car.

The intuition is that even apparently simple driving actions can be broken down into a long chain of simpler subactions. A lane change, for instance, may consist of the following steps: (1) a preparatory centering of the car in the current lane, (2) looking around to make sure the adjacent lane is clear, (3) steering to initiate the lane change, (4) the change itself, (5) steering to terminate the lane change, and (6) a final recentering of the car in the new lane. In our current study we are statistically characterizing the sequence of steps within each action, and then using the first few preparatory steps to identify which action is being initiated.

To recognize which action is occurring one compares the observed pattern of driver behavior to hidden Markov dynamic models of each action, in order to determine which action is most likely given the observed pattern of steering and acceleration/braking. This matching can be done in real-time on current microprocessors, thus potentially allowing us to recognize a driver's intended action from their preparatory movements.
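The comparison of an observed pattern against one model per action can be sketched with the standard scaled forward algorithm. This is a minimal illustration under loud assumptions: the observations are taken to be discretised steering symbols (0 = straight, 1 = left, 2 = right) rather than continuous steering and acceleration measurements, and every probability below is invented for the example rather than estimated from data.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (start probabilities, transition matrix, emission matrix)
Model = Tuple[List[float], List[List[float]], List[List[float]]]


def log_likelihood(model: Model, obs: Sequence[int]) -> float:
    """Log P(obs | model) for a non-empty sequence, via the scaled forward algorithm."""
    start, trans, emit = model
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_prob = 0.0
    for symbol in obs[1:]:
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_prob += math.log(scale)  # accumulate the scale to avoid underflow
        alpha = [a / scale for a in alpha]
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    total = sum(alpha)
    return log_prob + math.log(total) if total > 0.0 else float("-inf")


def classify(models: Dict[str, Model], obs: Sequence[int]) -> str:
    """Return the name of the model under which the observations are most likely."""
    return max(models, key=lambda name: log_likelihood(models[name], obs))


# Hypothetical three-state, left-to-right models with made-up parameters.
LEFT_TO_RIGHT = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
MODELS: Dict[str, Model] = {
    "turn_left": ([1.0, 0.0, 0.0], LEFT_TO_RIGHT,
                  [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]),
    "drive_normal": ([1.0, 0.0, 0.0], LEFT_TO_RIGHT,
                     [[0.9, 0.05, 0.05]] * 3),
}

print(classify(MODELS, [0, 0, 1, 1, 1, 0]))
```

Classifying from the preparatory movements alone amounts to calling `classify` on a prefix of the observation sequence instead of the whole of it.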

If the pattern of steering and acceleration is monitored internally by the automobile, then the ability to recognize which action the driver is beginning to initiate can allow intelligent cooperation by the vehicle. If heading and acceleration are monitored externally via video cameras [1], then we can more intelligently control the traffic flow.

Experimental Design

The goal is to test the ability of our framework to characterize drivers' steering and acceleration/braking patterns in order to classify the driver's intended action. The experiment was conducted within the Nissan Cambridge Basic Research driving simulator, shown in Figure 1(a). The simulator consists of the front half of a Nissan 240SX convertible and a 60 deg (horizontal) by 40 deg (vertical) image projected onto the wall facing the driver. The 240SX is instrumented to record driver control input such as steering wheel angle, brake position, and accelerator position.

Eight adult male subjects were instructed to use this simulator to drive through an extensive computer graphics world, illustrated in Figure 1(b). This world contains a large number of buildings, many roads with standard markings, and other moving cars. Each subject drove through the simulated world for approximately 20 minutes, during which time the driver's control of steering angle and steering velocity, car velocity and car acceleration were recorded at 1/10th second intervals. At random intervals subjects were instructed by the experimenter to begin one of six driving actions.

Using the steering and acceleration data recorded while subjects carried out these commands, we built three-state models of each type of driver action (stopping, turn left, turn right, lane change, car passing, and drive-normal). Actions modeled include (1) stopping at an intersection, (2) turn left at an intersection, (3) turn right at an intersection, (4) change lanes, (5) pass the car in front of you, and (6) drive normally with no turns or lane changes. A total of 72 stop, 262 turn, 47 lane change, 24 passing, and 208 drive-normal episodes were recorded. The time needed to complete each action varied from approximately 5 to 10 seconds, depending upon the complexity of both the action and the surrounding situation.

At approximately 0.5 seconds after the beginning of each action (roughly 10% of the way through the action) the computer used the HMM models to classify which action was being executed. Mean recognition accuracy was 95.24% +/-3.1%. These results demonstrate that many types of driving behavior are sufficiently stereotyped that they are reliably recognizable from observation of the driver's preparatory movements.

Problems and Future Directions

I see several basic problems that must be overcome for this approach to be generally applicable. In order of difficulty, these are:

1) Where do we get models of human action? The only answer we currently have is to collect lots of observations, and then use trial-and-error and intuition to develop a model of appropriate complexity and adequacy. We need more sophisticated ways to automatically generate models...there is a great deal of work currently happening in this area, but there is much more work to be done.

2) How can we find the most-likely internal state when the state structure becomes very complex? Currently we only have good tools for simple Markov chains. When dealing with more complex structures (e.g., hierarchical Markov structures, or even general Markov nets) our methods of matching data to model become much more cumbersome and expensive.

3) How can we progress beyond simple first-order Markov structures to higher-order structures? Do we even need to? Currently we don't know what order of model is required. We know we can handle complex phenomena like speech, handwriting, ASL, and driving using simple first-order Markov chains. It is not clear whether first-order models with increased complexity will be sufficient to model all of human behavior, or whether higher-order relationships will be required. This question is very similar to the question of whether associational mechanisms (which are first-order) are sufficient to account for most of human cognition, or whether true logical reasoning mechanisms are required.

[1] Boer, E., Fernandez, M., Pentland, A., and Liu, A., (1996) "Method for Evaluating Human and Simulated Drivers in Real Traffic Situations," IEEE Vehicular Tech. Conf., Atlanta, GA.

[2] Brand, M., Oliver, N., and Pentland, A., (1997) "Coupled HMMs for Complex Action Recognition," CVPR '97, San Juan, PR, June 15-20.

[3] Oliver, N., Bernard, F., and Pentland, A., (1997) "LAFTER: Lips and face tracking system," CVPR, San Juan, PR, June 15-20.

[4] Maes, P., Blumberg, B., Darrell, T., and Pentland, A., (1995) "The ALIVE System: Full-body Interaction with Autonomous Agents." Proceedings of Computer Animation 95, IEEE Press, April 1995.

[5] M. Friedmann, T. Starner, and A. Pentland. "Device Synchronization using an Optimal Linear Filter," Proc. ACM 1992 Symposium on Interactive 3D Graphics, Boston, MA, May 1992.

[6] Pentland, A., and Liu, A., (1995) "Toward Augmented Control Systems," Proc. Intelligent Vehicles '95, Detroit, MI, Sept. 25-26, pp. 350-355

[7] Pentland, A., and Liu, A., (1997) "Modeling and Prediction of Human Behavior" Image Understanding Workshop, New Orleans, LA, May 12-14, 1997.

[8] T. Starner and A. Pentland, "Visual Recognition of American Sign Language Using Hidden Markov Models," Proc. Int'l Workshop on Automatic Face- and Gesture-Recognition, Zurich, Switzerland, June 26-28, 1995.

JianZhong Qian

Knowledge-based Action Recognition

I. Emphasizing the "intelligence" aspects of computer vision for action recognition

Prof. Bobick wrote: "while recent computer vision work has attempted to maintain as much distance as possible from AI/semantics/reasoning it seems difficult to maintain the separation if we are going to generate high level labels like 'chopping' or 'shoplifting.' Can we avoid complete reunification with AI? If not, are we doomed?" I would like to make some comments on this issue.

First, as industrial computer vision researchers we very often need a complete solution for a real-world problem. Therefore, we usually do not draw a clear line between computer vision and AI or "maintain as much distance as possible from AI/semantics/reasoning" in practice. For example, we take a "knowledge-based image analysis" approach, which incorporates reasoning about multiple pieces of visual evidence in time and space, to solve a variety of industrial image analysis tasks in which (a) domain-specific knowledge is available but (b) conventional or deterministic processing methods fail. By reasoning and systematically utilizing domain-specific knowledge, including physical knowledge about the imaging process, semantic knowledge about the target object and its behavior, and perceptual knowledge about viewing the object, this approach is able to provide better solutions for complex industrial image understanding problems that were previously considered impossible. Similar approaches may be needed for the machine perception of action.

We believe that "intelligence" should be one of the important characteristics of computer vision. I prefer, therefore, to use the phrase "enhancing the intelligence aspects of computer vision," rather than "reunification with AI," to describe the above-mentioned R&D activity. The reason for emphasizing the intelligence aspects of computer vision is the following: the eye is just a sensor; the visual cortex of the human brain is our primary organ of vision. It turns out that perception is not the simple result of analyzing a set of stimulus patterns, but rather a best interpretation of sensory data based upon prior knowledge. Many researchers believe that one of the important abilities of human vision is handling uncertain information through a process of perceptual grouping, evidence gathering, and reasoning based upon prior knowledge. It can make use of partial and locally ambiguous information to achieve reliable identifications. This is done by allowing interpolation through data gaps and extrapolation to new situations for which data are not available. In this way, humans may use knowledge to infer many aspects of visual scenes, including actions, that may not be directly supported by the visual data.

Thus, vision involves mobilizing knowledge and expectations about the environment and the actions of objects of interest in it. That relates research in computer vision to some research areas of AI in which large amounts of problem-specific knowledge are used to obtain constrained solutions. We also need to take the following factors into account for real-world industrial image analysis/action recognition applications. Because of missing data, occlusion, and many forms of image degradation, the amount of available information in the raw images may be limited. The viewpoints may be unknown. Due to distortion and dissimilar views, some objects and their actions do not always look the same. Since none of the currently existing low-level image processing operators is perfect, some important features are not extracted and erroneous features are detected. As a result, visual evidence from the observed data is often incomplete or may conflict, and the rules used in computer vision systems are often just intuitively accepted. In addition, uncertainty also arises from statistical information available from an inadequate training set. All of these facts require a computer system to have the capability to deal with such uncertainty and vagaries in action recognition.

I believe that by increasing the "intelligence" of a computer vision system, including utilizing domain-specific knowledge/semantics/reasoning, computer vision technology and research will thrive with more and more applications. Machine perception of action is a good test-bed for this idea, since it is really difficult to generate high level labels like 'chopping' or 'shoplifting' without using reasoning at different levels of abstraction.


II. Action is the key content of all other contents in the video

Action recognition is a new technology with many potential applications. One very important application is automatic video content extraction. It is well known that automatic video content extraction is a very important technology that will play an important role in visual telecommunication and internet access in the next decade. What we would like to argue is that action is the key content of all other contents in the video. Just try to imagine describing video content effectively without using a verb. A verb is just a description (or expression) of an action. Action recognition will provide new methods to generate video sketches in terms of high-level semantics. For each scene cut over a segment of time, there are hundreds, even thousands, of frames of video for the description of a complete action process. By recognizing the action, we can probably use only two frames (one for the beginning state and one for the end state of the action) plus a line of description of the recognized action for the same purpose. Thus, the action content can play a central role in linking other video contents in a very compact way. This technique will make automatic video sketching, skimming, clipping, logging, indexing, querying, and browsing possible. Such intelligent content-based video processing methods will pave the way for video-on-demand and digital video library access through the internet.

Some sports videos may be good application domains that provide interesting contexts to limit the possible actions to be recognized at any single segment of time. Tennis, volleyball, and diving are some examples. In these sports, domain-specific knowledge is available in the form of well-defined geometry for the court or the platform, well-defined game rules for player positions, and well-defined, limited possible actions. In addition to video processing, the action content is also crucial in real-time applications including key event (unusual action) detection in subway platform monitoring, security monitoring, home care monitoring, etc. I believe that a knowledge-based approach may be a promising way to achieve action recognition in these application areas. If we can make action recognition really work for a few important real-world problems, then it may attract more research funding from different sources.

Whitman Richards

Story Constraints on Interpreting Actions

Most actions do not appear alone in isolation, but rather as part of a sequence of events. These actions are then constrained in part by the nature of the sequence, because the observer's explanation for the sequence often takes the form of a story. The Heider and Simmel (1938) movie is a compelling example. Although very simple shapes are engaged in the actions, most people assign complex roles and even gender to the circle or polygons. These high level cognitive conclusions can be seen as having the same belief maximization structure as that which occurs at lower perceptual levels, such as when we interpret simple cartoon-like drawings, for example.

I distinguish three levels of interpretation of story-telling actions by simple objects: (1) the general form of stories that have a small number of players, (2) a taxonomy of behaviors and their associated action types, which are fewer than the set of behaviors, and (3) perceptual features that help label action types. The latter require some minimal shape and heading information.

At the story level, Campbell's "Hero's Journey" and Propp's (1968) "Morphology" offer classifications of stories for a limited number of selected types of players. The simplest classifications may be construed as positive and negative forces acting upon or between objects, where there is first a positive attracting force, then a repelling force, and finally a positive force. The "boy-meets-girl" story is a classical example. Formless objects can easily be assigned appropriate roles in such a story.

At the behavior and action level, a first cut at a repertoire of primitive behaviors can be taken from ethology (Blumberg, 1996). Hence play, fight, feed, flight, etc. are important elements of the taxonomy. Not every behavior has its own special action type, however. So, for example, dance, fight and play may all have the same underlying action type, with different behavior categories assigned in different contexts (i.e., sequences of actions).

At the level of action type, models have been proposed by Thibadeau (1986), Talmy (1992), and Jackendoff (1992) that are heavily influenced by natural language descriptions. I will stress basic causal and spatial factors and especially Jepson, Mann & Siskind's (1995) "non-accidental" configurations and coordinate frame inferences that are required as precursors to labelling action types with semantic content.

Yaser Yacoob

Recognize Measurements or Measure Recognizables?: The Case for DIRECT Activity Recognition

The choice between top-down and bottom-up strategies for object recognition has accompanied computer vision since early research. This dichotomy has been, however, largely ignored in research on recognition of activities. Almost invariably, activity recognition is posed as a problem where a set of measurements taken from the visual-field are used in a framework of temporal pattern matching. As a result, statistical pattern recognition methods such as Hidden Markov Models, Dynamic Time Warping, Neural Networks, etc. are employed.
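To make the conventional "temporal pattern matching" formulation concrete, here is a minimal sketch of Dynamic Time Warping, one of the methods named above. It illustrates the bottom-up style being contrasted, not the approach proposed in this abstract. The function names, the one-dimensional measurements, and the template dictionary are all assumptions made for the example.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping cost between two 1-D measurement sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: skip in a, skip in b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(observed, templates):
    """Return the name of the stored template closest to the observation."""
    return min(templates, key=lambda name: dtw_distance(observed, templates[name]))


# Hypothetical activity templates: a rising and a falling measurement.
templates = {"rise": [0, 1, 2], "fall": [2, 1, 0]}
label = classify([0, 1, 1, 2, 2], templates)
```

Note that the measurements are taken first and matched afterwards; nothing in the matching stage feeds back into how the measurements are made.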

A top-down approach to both the spatial and temporal aspects of modeling and recognition of activities may be critical to overcoming the complexity of the problem. The challenge is to devise spatio-temporal representations that integrate both the activity and measurement levels. Although it is possible to explicitly use information available in the time-progression of activities to control low-level measurements, this use serves more to prune the measurement search space and therefore only superficially integrates high level knowledge into estimation. It is more economical to have a universal representation that simultaneously embeds low and high level information.

While a spatial top-down strategy could draw from earlier work in object recognition, a temporally driven top-down strategy raises issues that have not been encountered previously. The most prominent of these issues are:

(1) What is a temporal window and how are spatial scale-space ideas extended to spatio-temporal analysis?

(2) What are the temporal invariants of activities?

(3) What spatio-temporal representations are effective?

(4) When does time begin? And how important is it?

It is necessary to acknowledge and enquire into the differences between time and space, since time cannot be simply treated as an additional dimension to space.

DIRECT activity recognition (after Gibson) is an example of the integration of bottom-up and top-down strategies. In this paradigm, measurements are intertwined with the recognition task so that delineation among visual processes is non-existent.

In my recent research on motion estimation, a framework for learning and estimation of temporal models of motion has proven effective in dealing with complex problems such as: leg and arm tracking under self-occlusion and variations in execution, performers and view-point of activities.

The framework consists of a learning stage in which appearance motion trajectories are computed and then converted into a representation that can be used for direct activity recognition in image sequences.

In the meeting I will discuss the following issues:

(1) Are instantaneous measurements sufficient for recognition?

I will propose that a temporal framework is far more appropriate and economical, since the ambiguity of instantaneous measurements calls into question the feasibility of effective recognition.

(2) What is DIRECT activity recognition? What are the pros and cons?

DIRECT activity recognition is an approach by which changes in image sequences are immediately interpreted using a priori learned activities.

(3) Activity invariants: what are they?

Defining activity invariants under spatio-temporal transformations is critical to recognition. What are the spatial and temporal "fingerprints" of activities?

(4) How do spatial and temporal context affect activity recognition?

Context, spatial and temporal, can be detrimental to the interpretation of activities. How can context be represented and used?

Emre Yilmaz

Digital Puppetry

(An HTML version of this paper can be found at: http://www.protozoa.com/~emre/action_gesture/action_gesture_B01.html)


I feel like a bit of an impostor at a conference on the machine recognition of action and gesture-I'm not a computational vision scientist. I'm a "digital puppeteer," from Protozoa, a small animation and technology company in San Francisco. However, I have a background in perception and action research, as well as puppetry and graphics. (I am an ex-student of Bill Warren at Brown, and Ken Nakayama at Harvard.) So my perspective on what I do is fairly analytical and perception-informed. I hope my observations on what I do will be of interest in thinking about the issues of this workshop.

At Protozoa, we are constantly working with issues of how to represent gesture. Protozoa, an offshoot of Colossal Pictures, is a technology and entertainment company that specializes in "real time character animation." To animate our characters, we wear a suit of sensors the computer can follow, allowing us to literally act out their movements (a technique also known as "motion capture.") It can be a lot of fun-imagine looking in a mirror and seeing yourself as you make different movements-except that in the "mirror" (computer screen), you're an orange dog, or a crafty monkey. We also do some characters that are far from human shape; for instance we've made a worm and a spider. (Few places besides Protozoa do this.) With a very human-shaped character, the technique feels almost like acting; with non-human characters it feels just like puppetry.

This is a very different technique from the one used in most computer animation (and, for that matter, drawn, sculpted, or painted animation). Predictably, there are both benefits and problems. Aesthetically, motion capture does not easily produce the kind of wild exaggeration and dramatic motions that are so important in normal animation-people just don't move like that! As such, motion capture doesn't suit some cartoon characters well-Daffy Duck done with motion capture couldn't be anywhere near as funny. However, it easily reproduces little tiny gestures, shrugs of the shoulder, subtle body language, and a sense of weight and mass-which take great talent to represent with keyframe animation or stop-motion. I think there are certain kinds of characters, and certain kinds of stories, that this more naturalistic motion suits well. We're really only at the beginning of exploring what motion capture animation is good for and what kinds of characters work well with it.


While I do not have any systematic analysis of puppeteering and motion capture, I have noticed many interesting phenomena in this field. Some of these things have to do with how you perform, some with how you watch the results. I hope that these will be individually interesting, and collectively give a sense of what's involved in doing this kind of work. I hope this will also give some sense of other interesting issues in other forms of animation and puppetry.

How many sensors do you need?

You'd be surprised how few sensors you need to get a good sense of the performer's motion. With Moxy, one of our early successes, we used only 7: head, hands, feet, hips, chest. (The sensors measure position and orientation.) We usually use 11 now (elbows and knees added.) More sensors can help on some characters, but the basic sense of a gesture comes across very well with just 11. (Just like with Johansson's point light walkers.)

That's assuming you want to capture the human body as such. With characters that are more puppet-like, sometimes you only need three or four sensors to be able to get expressiveness. These are not attempting to reproduce human movement, of course. But you certainly can read sadness, joy, and anger, not to mention weight and physical action, in three or four sensors.

Captured motion vs. caricatured motion

Motion capture draws gesture straight from the performer and puts it on the character. The motion you get this way is very realistic. From the point of view of people who do "real" animation with keyframe systems or pencils, this is part of its problem. In "real" animation (e.g., Disney's "Aladdin", Pixar's "Toy Story", etc.) movements are not necessarily realistic. They're better if they're more cartoony. (Exception taken when the characters are very realistic and humanoid.)

Good animation-like good puppeteering, or for that matter, good mime-is rarely about trying to duplicate realistic movement. It's more often about drawing from a real movement, developing the essence of it into something larger than the reality. It's also about distilling down all the movements you could be doing to the movements that are necessary, and presenting those essentials clearly. Realistic motion feels different to watch than artistically interpreted motion. Reality is a guideline, but usually only a starting point, not the final goal.

In Whitaker and Halas' excellent book, Timing for Animation, there's a great illustration of cartoony vs. realistic (and very flat) motion. It tells the story better than I can in words:

[Image: illustration from Whitaker and Halas, Timing for Animation.]

This image's caption reads, "Cartoon is a medium of caricature- naturalistic motion looks weak in animation. Look at what actually happens, simplify down to the essentials of the movement and exaggerate these to the extreme." (Whitaker and Halas, Timing for Animation, Focal Press, London, 1981, pp. 28-29. This excellent book is unfortunately out of print and hard to track down.)

While there are animators who hate motion capture, I think it has its place. Although it is faithfully copying a performer's movements, that performer doesn't have to be doing something particularly realistic. They can be doing something stylized, distilled down from what a person would naturally do. They can, in essence, be doing puppetry or mime. Furthermore, I think there are cases where naturalistic motion works well, even on cartoony looking characters. However, I don't think it is going to take over "real" animation either.

Using the same motion on different characters

You can fairly easily use the same performance on several different characters. We often use past performances from other characters to help us set up new ones. A tall skinny bug can do the same dance an orange gremlin did last week.

This generally works acceptably for some kinds of motion. Good dances, for instance, tend to look good whatever character we put them on. However, to my eye, many good performances aren't as compelling when they're moved to another character. The performer is using the character for feedback as they perform, so they're making their movements look right for that character. When the performance is moved to another character, it doesn't have the same visual effect.

Good acting vs. good puppeteering

It is pretty easy now to hook some sensors to a 3-D character and dance it around. The hardware and software easily capture a sense of weight and timing (though only *realistic,* not exaggerated, weight and timing.) However, that alone doesn't make good animation. The hardware knows nothing about staging, anticipation, economy of movement, or acting. The performer needs some awareness of all these. And what's more, the performer should have a good sense of the character and its body-in other words, they should be a good puppeteer.

There is a difference between good acting and good puppeteering. An actor might do a very expressive and vivid performance that looks great on them, but looks flat and lifeless when re-mapped onto a puppet, or rotoscoped into drawings. Puppeteers who work with costume puppets have the exact same issue to deal with-it's not their own body they have to focus on, but the costume or puppet that is seen by the outside world, and how that body is moving. To focus on what the character is doing, rather than what one's own body is doing, we use visors in which we can watch our characters as we perform them-an idea borrowed from the Muppets. This helps, but it still isn't something that can be taken for granted. In my experience with puppetry, I've seen some people take to this sensibility easily, and I've seen others never take to it even after a lot of practice.

Floops: Motion capture with exaggeration

Floops is an episodic animated cartoon for the web. So far, we've made 40 episodes of 25 seconds each. SGI webcasts these in VRML2, the 3-D extension of HTML. Working for this application means working within limitations: few polygons, few textures, no morphing, only one full character.

With so many limitations, I wanted to make the character animation as good as possible. The main issue I quickly found myself wrestling with is that Floops is supposed to be a peppy, hyperactive little pet. Motion capture doesn't do this kind of motion easily-it's realistic, and unexaggerated. How could I get the kind of stylized movements I wanted?

One part of exaggerating Floops' motion was a grab bag of techniques we are developing for characters. I devised a me-to-Floops mapping that exaggerated his limbs while retaining the overall sense of mass in his body. With this, even when I was walking casually across the stage, Floops would bound across it like a 5-year-old. These tricks only go so far, and aren't generically usable in all cases. They also have an unfortunate tendency of deadening the original motion a little. So this alone wasn't enough.
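The actual me-to-Floops mapping is not described here, so the following is only a minimal sketch of one way such a mapping could work: core sensors (hips, chest) pass through unchanged to preserve the sense of mass, while every other sensor's offset from the hips is scaled by a gain. The function name, sensor names, and gain value are all hypothetical.

```python
def exaggerate_limbs(pose, gain=1.5, core=("hips", "chest"), root="hips"):
    """Scale limb offsets from the root; leave core body sensors untouched.

    pose maps a sensor name to an (x, y, z) position.
    """
    rx, ry, rz = pose[root]
    out = {}
    for name, (x, y, z) in pose.items():
        if name in core:
            # Torso passes through 1:1 so the body keeps its weight.
            out[name] = (x, y, z)
        else:
            # Limbs swing wider: stretch their offset from the hips.
            out[name] = (rx + gain * (x - rx),
                         ry + gain * (y - ry),
                         rz + gain * (z - rz))
    return out


# Hypothetical captured frame (meters).
frame = {"hips": (0.0, 1.0, 0.0), "chest": (0.0, 1.5, 0.0),
         "right_hand": (0.5, 1.0, 0.0)}
floops_frame = exaggerate_limbs(frame, gain=2.0)
```

A uniform gain like this is the crudest possible choice; it exaggerates reach but does nothing about timing, which may be one reason tricks of this kind only go so far.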

Another part was to exaggerate my own movements as I performed. I found it was a big help to think like a puppeteer-that is, to focus on Floops' movements and ignore my own entirely. (We use a video visor to watch our characters as we perform.) It was not a matter of being casual or performing movements that felt natural, even though you'd think that would always be best with motion capture. What reads well on a person doesn't necessarily read well on a Floops. For that matter, given that I was using an unusual movement mapping, movements wouldn't be 1:1 translated on Floops anyhow. Even though I'm sure I looked silly, the moves looked right to me on Floops.

Another part was stylizing as I performed. I tried to simplify the movement down to its essence, avoiding all inessential movements. This would happen after rehearsing a scene a number of times-it would start to fall into a pattern, and I'd know where and when every major pose was going to be. I would almost call it "pose-to-pose" planning-I imagined it felt like dance choreography or keyframe animation. Settling into poses also made it more likely that people would be able to read the movements even at 6 fps.

I was excited to receive mail from a couple of keyframe animators who'd seen Floops, and liked it, but weren't sure whether it was motion capture or keyframe!

Using motion capture even more like puppetry

You can also do some even stranger mappings. One of our characters is a worm. Your right hand manipulates his head, your left hand his tail, and your feet his mid-segments. This requires real puppeteering skills to do well, but it also takes on a stylization that is fun and interesting, with a very different look than the serious realism of most motion capture. Another example is a series of walking letters that spell out words. To give these a less human, peppier look, I put sensors on a piece of foam rubber, and manipulated that as a puppet, walking it around our stage. Yet another example is a dinosaur character where the performer's right hand performs the puppet's head.
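The worm mapping described above can be sketched as a simple lookup from puppet parts to performer sensors. This is a minimal illustration, not Protozoa's software: the part names, sensor names, and the choice of which foot drives which mid-segment are assumptions.

```python
# Which performer sensor drives which part of the worm.
WORM_RIG = {
    "head": "right_hand",
    "tail": "left_hand",
    "mid_front": "right_foot",  # assumption: the text only says "your feet"
    "mid_back": "left_foot",    # assumption
}


def drive_puppet(rig, sensors):
    """Copy each assigned sensor reading onto its puppet part.

    Sensors that the rig does not mention are simply ignored.
    """
    return {part: sensors[source] for part, source in rig.items()}


# Hypothetical sensor positions for one frame.
sensors = {
    "right_hand": (0.4, 1.2, 0.0),
    "left_hand": (-0.4, 1.1, 0.0),
    "right_foot": (0.1, 0.0, 0.2),
    "left_foot": (-0.1, 0.0, -0.2),
    "head": (0.0, 1.7, 0.0),
}
worm_pose = drive_puppet(WORM_RIG, sensors)
```

Swapping in a different table is all it takes to try another body plan, such as a dinosaur whose head is driven by the right hand.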

A lot of motion capture people would never think to do this. Why would you want to perform its head with your hand, when you can get much better motion off a real head? Well, you can get more realistic motion off a real head, but that's not necessarily better motion. Non-realistic motion can sometimes be a lot more interesting; it has gone through an artistic interpretation. The other reason to do it is that there are some body types you can do with puppetry-such as the worm-that are so different from the human shape that there's no other way to do it live.

This is a messy, fuzzy, unscientific process, and not as all-purpose as keyframe animation. Yet, I think this whole grey area has some of the benefits of keyframe animation (exaggerated, stylized movement), while keeping the real-time benefits of motion capture. Maybe it won't be long before we see worms as well as monkeys hosting TV shows.

Animation and puppetry: contrasts in usage

Animation has tended to tell stories involving a lot of camera cuts, a lot of different scenes, and a lot of different characters. Puppetry has tended to tell more sedate stories, involving a small number of characters. Things tend to run at a slower pace. I'm not sure exactly why that is, but I believe it's simply an outgrowth of some very practical matters. If you have to draw every single frame in a sequence, you're going to pack as much into that sequence as possible.

With puppetry, as with motion capture, there is a big expense in developing and building a character, but not a lot of additional expense per minute of use. If you get every frame essentially for free, you can afford to stand there, heave a sigh, shuffle around for a minute, and so forth. It makes more sense to tell stories with slow paces.

Also, there tends to be less use of different backgrounds and scenes in puppetry (and motion capture) than drawn animation. If a prop or a scene is going to be there, someone is going to have to build it, and that takes time. With drawing, on the other hand, it isn't out of the question to draw 20 different backgrounds for a 3 minute cartoon.

While there's nothing wrong with using motion capture to try to replicate the look and feel of cel animation, ultimately I think it will be best off when we learn to use it for its native strengths.

Consistency of style: Realistic design, realistic movement; Caricatured design, caricatured movement

I remember seeing a documentary about the Henson film, The Dark Crystal. One of the characters in this was a very realistic, almost human puppet which Henson performed as a hand puppet. I remember he remarked that to make the walk look right, the only thing he could do was to walk himself, right under it, and have that motion carry over up to the puppet. Whereas, with Kermit the Frog, it was easy to make a walk look right-just by bopping his hand up and down. To a first approximation, a realistic character requires realistic motion, and an un-realistic character does better with cartoony motion.

However, you can break these rules, to interesting effect. We put realistic movements and gestures on cartoony characters all the time. When you watch it, you quickly get accustomed to that, and deal with it on that level. It doesn't look like usual cartoon movement, but that's OK. In some ways it's fascinating to see the subtleties of human gesture coming through on something cartoony. The other direction, on the other hand, just looks silly or wrong to my eye (ultra realistic humanoid characters moving in a cartoony way.)

Mixing the two looks in the same production is another matter. "The Muppet Movie" had puppets doing puppet movement, and people doing people movement-and it looked fine. "Roger Rabbit" had animated characters mixed with real people, and that worked too. But it's hard to do well. I think those both worked because the look and feel of the cartoony or puppety characters was so different from the people universe. But if you had an all-CG feature, with some characters motion captured and some keyframed with cartoony motion, I think it would be a real challenge to make it look good. One drawn animated feature used rotoscoping on some of the major semi-realistic characters, but traditional animation on the animals, birds, and so forth. To my eye it looked strange and didn't quite work.

Puppetry is easier with visual feedback-how is the character looking?

The central issue of puppet manipulation is that your movements affect something that isn't shaped like you. To get good at this you focus on the movements of the puppet, and how its gestures look, completely ignoring how your own gestures look and feel. It is not a matter of being casual or performing movements that feel natural-what reads well on a person doesn't necessarily read well on a puppet. It takes practice, but eventually one gets to a point where the mechanics are automatic and one can instead focus on the emotions and the acting. It's best to watch the character's movements-as if you're doing a dance to a mirror, but in the mirror you're an orange pet. We use a video visor to watch our performances as we go. This idea comes from the Muppets. Their puppeteers watch TV monitors as they perform, so they can see the puppet performance exactly as the audience will see it.

Too many degrees of freedom

There are many puppets made these days that are incredibly complicated. They have actuators to control "facial muscles", eyelid twitches, nose wiggles, etc. To my eye, these often look unconvincing. (Convincing on the surface, but they never really feel alive to me.) On the other hand, there are also many computer-animated characters that are equally or more complicated. These sometimes look really good (though they can also be done badly).

The major classical equivalent of this kind of puppet is the "Bunraku" style of puppetry of Japan. Bunraku puppets usually take around four puppeteers to perform-the master on the head and torso, and helpers on the arms, legs, and so forth. To do this well, though, they practice as an ensemble for years or decades. After that much time working together, they're able to work as one unit and give a fairly well integrated performance. There's probably no numerical limit on the number of performers, but it depends on how long they've been working together and how good they are at anticipating each other.

However, puppet crews working on films usually don't have decades to practice together! I think it can easily happen that there are simply too many things to control well. The chances of all the performers giving a synchronized performance become smaller and smaller the more performers you add. A single phrase has to be rehearsed over and over again, until it becomes like an orchestra playing a symphony, to get it right. That's why it's so hard to do well. Spontaneity and improvisation with a puppet this complicated are even more difficult.

In animation, the equivalent happens all the time. For some of the photorealistic dinosaurs and dragons of recent years, there is the equivalent of one actuator for every muscle in the animal. The advantage animators have over puppeteers here is that it's no problem to get all these mini-performances in sync, if you know what you're doing. Unlike puppetry, you don't have to get it all in one take! You can go back over it, as many times as you want, frame by frame.

Miss Piggy: The power of suggestion through underspecification

The Muppets apparently got many letters from puppet makers, asking how they made the mechanism to make Miss Piggy's eyes blink. This is fascinating, because in fact the puppet did not have any blinking eyelids. When the overall gesture of the performance is right, people will fill in the details. (Thanks to Frank Oz' phenomenal puppet performances, this happened a lot with Miss Piggy. Puppets with less gifted hands guiding them don't give this impression as easily. This is one of the things that makes a good puppet performance.)

It's the same filling-in process that happens when you're watching motion capture dots, and that is disrupted when you provide too much information. If you let the audience use their imagination to fill in the details, they will do it perfectly because they know what they want to see. If you fill in the details for them, you'd better do them right, or they'll be distracting. (By the same token, you have to provide a good base performance in the first place. If the highest level degrees of freedom are done well, the rest will fall out of it for the viewer.) It's fair to say that it's better not to perform a detail at all than to perform it badly (at least for a certain range of characters).

Underspecification (suggestiveness) vs. overspecification

Part of the problem with going photorealistic is that then you're competing with reality. When you're trying to compete with reality, the viewer will judge it against that standard. You're no longer allowing the audience to take the technique for granted. So your animation has to be correspondingly realistic, and every detail filled in, or you lose the illusion.

Another way of putting this is to point out some extremes. Imagine watching a puppet show where someone is manipulating a puppet made of scrap wood. Although the puppet is simple, the manipulation and story could be very compelling if it's a good show and a good puppeteer, suggesting a whole personality with the puppet. Now think about the dinosaurs of "Jurassic Park" (which were excellent, convincing, very scary, photorealistic dinosaurs, in the middle of a live action movie).

These two things work very differently. In the case of the scrap wood puppet, you are experiencing the story, the characters, and the emotions-and are probably amazed that you can be getting all this by watching scrap wood. You don't consciously believe the scrap wood is alive. Maybe some subconscious part of your head does, and maybe that's where the magic is... but the point is, you are accepting the technique, and experiencing the show through it. Similarly, with The Muppets, people know that's Jim Henson's hand in Kermit's head. In the case of the dinosaurs, on the other hand, the animators are trying to fool more of your head into believing. They are aiming for 100% total realism. That's not necessarily better or worse. But it's certainly different. (A friend of mine, after seeing Jurassic Park, said it got boring to him after a while because the dinosaurs were too believable. His head started to think, "Oh. I guess these are trained dinosaurs they had on the set." And for whatever reason, that wasn't as interesting to him.)

I've seen other interesting phenomena along these lines. For instance, pure motion capture data is fun to look at. Usually it's displayed just like Johansson figures-a lot of points of light at the key positions. It's interesting and compelling to watch, as anyone who's seen Johansson's displays can attest. Sometimes an interesting thing happens when you then apply a photorealistic character to that motion. As good as the character modeling and rendering are, it often loses what I can only describe as "magic." The actions become less compelling. Maybe that's because you are now on a different plane-competing with reality-and maybe there's something in the scene tipping you off to the illusion. Maybe it's simply because people's imaginations don't get to do as much of the work. I think this is why photorealism often looks sort of cheesy. If it's not really, really perfect, it won't look good. Whereas if you're working on an un-realistic plane, you can get people to accept the illusion. Personally, I think the latter is more fun, but that's just my bias.

What's so bad about non-exaggerated motion?

To be consistent with my overall set of opinions, I'd like to be able to say that non-exaggerated, naturalistic motion capture motion doesn't look good on cartoony characters. However, we put motion capture on cartoony characters all the time, and it often looks fine to me. Why this works as well as it does, I don't know. I guess it depends on the story and the characters. I'd imagine there are many cases where it doesn't work.

It's more or less standard to say that good animation inherently depends on exaggeration. I don't think this is true for all forms of animation. In particular, I think that with ultra-realistic 3D CG animation, especially of humanoid characters, you want the motion as realistic as you can get it. Still, it's working on a different plane than most animation of cartoony characters; it's not clear that this should be called "animation."

Trying hard vs. not trying hard

Sometimes a casual performance is better and more appropriate than an intense, rehearsed, well planned one. One of Protozoa's founders, Marc Scaparro, was telling me about the first time he saw motion capture in action-on an animation where a clown character was playing the guitar. The performer did a very theatrical performance-which somehow wasn't that compelling. Then, the performer ended the take, and shuffled around for five seconds, asking, "How was that? Was that good?" Somehow these five seconds were compelling and exciting.

I think what happens is, people sometimes don't know how to be economical when they're performing live. There's a temptation to go off and move a lot-to overdo it. When you get the hang of it, you get over that.

We do some live shows where we get an improvisational comedian to be in the suit and act out a character, interacting with passers-by. These can be a lot of fun to watch, if the performer is a good improviser.

Other capture devices

There are many other devices for capturing real human motion. We also use face-tracking devices and datagloves. The gloves are good for several purposes. We have tried them for puppeting characters' fingers, but we have also tried hooking them to other body movement. For instance, I've performed characters' mouths with my hand, just like with a hand puppet. We've also done things like hooking the first finger to eyebrows, the second to angry-sad, and so forth.

It was funny when we first received a face tracking system. Those of us who were used to puppeteering the facial movements were suspicious and mistrustful of the system. You could almost hear us say, "Real movements? They'll never look good. This is all about exaggeration. The interpretation, that's where the real art is." That's exactly the same argument people who do keyframe animation make against motion capture animation.

Level of performance

It's easier to perform "happy-sad" than "eyebrow-left-y" and "eyebrow-left-x." You might say that you want the performing to be happening at the same level as the reception of it. We generally perform facial animation with joysticks and sliders, not face trackers. We try to hook up our characters' facial animation with the highest level reasonable controls possible.
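As a rough illustration of performing at a high level, a single "happy-sad" slider can drive several low-level facial parameters at once by blending between two stored expressions. This is a minimal sketch under assumed names and values: only "eyebrow-left-y" and "eyebrow-left-x" appear in the text; the other parameters and all the numbers are hypothetical.

```python
# Two hand-authored extreme expressions (hypothetical values).
SAD = {"eyebrow_left_x": 0.25, "eyebrow_left_y": -0.5,
       "eyebrow_right_y": -0.5, "mouth_corner_y": -1.0}
HAPPY = {"eyebrow_left_x": 0.0, "eyebrow_left_y": 0.5,
         "eyebrow_right_y": 0.5, "mouth_corner_y": 1.0}


def face_from_mood(mood):
    """Map one slider in [-1, 1] (sad to happy) onto every low-level control."""
    mood = max(-1.0, min(1.0, mood))   # clamp the joystick or slider input
    t = (mood + 1.0) / 2.0             # 0.0 = fully sad, 1.0 = fully happy
    return {name: SAD[name] + (HAPPY[name] - SAD[name]) * t for name in SAD}


neutral_face = face_from_mood(0.0)
```

The performer moves one control and thinks about the emotion; the table decides what each eyebrow does.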

In the extreme, we've also dabbled in characters that are fully procedural. For instance, we have a fish that you don't animate at all. You just "play" with him by trying to feed him a piece of food. You move the food around, and he follows it, inspecting it, trying to decide whether or not to eat it.
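A fully procedural character of this kind can be approximated with a simple pursuit rule. The sketch below covers only the "follows the food" part, in two dimensions; the inspecting and deciding behavior is not modeled, and the function name, speed, and radius are assumptions rather than a description of the actual fish.

```python
import math


def fish_step(fish, food, speed=0.1, reach=0.05):
    """Advance the fish one frame toward the food.

    fish and food are (x, y) positions. Returns (new_position, reached),
    where reached is True once the fish is within `reach` of the food.
    """
    dx, dy = food[0] - fish[0], food[1] - fish[1]
    dist = math.hypot(dx, dy)
    if dist <= reach:
        return fish, True
    step = min(speed, dist)  # never overshoot the food
    return (fish[0] + step * dx / dist, fish[1] + step * dy / dist), False


# The user "plays" by moving the food; the fish is never animated directly.
fish, food = (0.0, 0.0), (1.0, 0.0)
reached = False
for _ in range(100):
    fish, reached = fish_step(fish, food, speed=0.25)
    if reached:
        break
```

Here the only input is the food position, which is the extreme case of a high-level control.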


This article is copyright © 1997 Emre Yilmaz and Protozoa
