Question #5 - Representation of Action


5) How do we represent action? Clearly <X(t), Y(t), Z(t)> is limited. What is the action equivalent of a hierarchical description? Is it just the extension of scale space techniques to space-time? Is there a "natural scale" for time? Does the domain or context change the granularity? Are there non-homogeneous hierarchies (representation space versus scale space)?


Minoru Asada

It depends on "the agent (capabilities in sensing, acting, and cognition)," "tasks," and "environments." Without this context, the discussion will not be fruitful.

Francois Bremond

To represent the perception of actions, I use the notion of scenario. I define a scenario recursively as a combination of sub-scenarios. At level zero, a scenario is defined directly from the properties of the mobile objects in the scenario. Mobile objects can play two roles in a scenario: source or target. A source mobile object performs the action associated with the scenario, and the target object is the reference of the action. A target object can be either static (e.g. an exit zone) or dynamic (e.g. a mobile object).

I also define two types of combination: non-temporal and temporal. A scenario can correspond to a non-temporal constraint on a set of sub-scenarios, or to a temporal sequence of sub-scenarios. For example, the scenario "the man is swaying" represents a non-temporal combination of two properties: the man's trajectory is not straight and his shape is changing. The scenario "the man drives up in the car, then he goes away" represents a temporal combination of two sub-scenarios. I try to recognize a non-temporal scenario through abductive diagnosis, and a temporal scenario with an automaton.
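
A minimal sketch of this recursive structure, with hypothetical Python class and field names (Scenario, source, target, predicate, sub_scenarios) rather than Bremond's actual implementation: level-zero scenarios are predicates over the properties of mobile objects in a frame, non-temporal combinations require all sub-scenarios to hold together, and temporal combinations are recognized by a simple automaton stepping through the sub-scenarios in order.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical sketch of the recursive scenario structure described above.
@dataclass
class Scenario:
    name: str
    source: Optional[str] = None                # mobile object performing the action
    target: Optional[str] = None                # static zone or mobile object referenced
    predicate: Optional[Callable[[dict], bool]] = None   # level-zero test on a frame
    combination: Optional[str] = None           # "non_temporal" or "temporal"
    sub_scenarios: List["Scenario"] = field(default_factory=list)

    def holds(self, frame: dict) -> bool:
        """Non-temporal recognition: the constraint holds in a single frame."""
        if self.predicate is not None:
            return self.predicate(frame)
        return all(s.holds(frame) for s in self.sub_scenarios)

    def recognize(self, frames: List[dict]) -> bool:
        """Temporal recognition: sub-scenarios must hold in sequence."""
        if self.combination != "temporal":
            return any(self.holds(f) for f in frames)
        state = 0                               # automaton state = next awaited sub-scenario
        for frame in frames:
            if state < len(self.sub_scenarios) and self.sub_scenarios[state].holds(frame):
                state += 1
        return state == len(self.sub_scenarios)

# "The man is swaying" = non-temporal combination of two level-zero properties.
swaying = Scenario(
    "man swaying", source="man", combination="non_temporal",
    sub_scenarios=[
        Scenario("trajectory not straight", predicate=lambda f: not f["straight"]),
        Scenario("shape changing", predicate=lambda f: f["shape_changing"]),
    ])
print(swaying.holds({"straight": False, "shape_changing": True}))   # True
```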

Several static hierarchies have already been defined for the recognition of car scenarios: elementary verbs at the base of the pyramid (e.g. "to change lanes") and complex ones at the top (e.g. "to overtake"). I prefer recursive hierarchies, because we can then adapt the scenario granularity to the application domain.

Irfan Essa

Maybe we really do not know what the real bottleneck against further progress in this area is. Is it still computing? Or is it that we still don't really have a good representation of action, one that is suitable for machine perception? Or is it that we really do not know how we (humans) represent actions and how we perceive them, and since our model of machine perception is to be based on ourselves, maybe we need to understand that first.

Most of the methods for machine perception take either a completely data-driven approach or an approach based on a known structure or model representation of action. The data-driven methods try to define robust methodologies for tracking actions so that these actions can be perceived and recognized. No prior assumptions about what is being perceived are made, and the hope is that the action has a unique, inherent and implicit representation that completely defines it and can then be used for recognition. As much as we would like to believe in the "pureness" of this methodology, we have to question the lack of a true representation that it relies on for recognition.

One way of addressing the lack of an explicit representation is to build a structure or a model of what needs to be perceived. The limitation of this is how to deal with events and actions that we did not have the insight to incorporate into our a priori models. Additionally, building in a whole repertoire of actions and their explicit models is by no means a trivial problem. We might as well develop a system that does robust search over a very large space of possible solutions (does it matter whether the solutions are detailed model-based representations or just data-specific estimates based on the probability of captured signals?).

Overall, our goal is to extract some sensory information so that we can use it to label certain events in the scene. Capturing and labeling an event may not be exactly what we need for perception of action, as an action could be a series of related events or sometimes just a single event (e.g. walking vs. pointing). This causes a major problem in how to represent actions (or how to define an action) for recognition. Additionally, it suggests the importance of time in the representation of actions. It is for this reason that some sort of spatio-temporal reasoning needs to be incorporated into our representations. Such spatio-temporal reasoning can be incorporated by using constructs that change with time, with expectations and probabilities assigned to these constructs so that we can predict the changes and estimate the actions. There are several ways we can attempt to do this.

One way to accomplish this is to build a language of actions, a task that, though very interesting and challenging, is in many ways as difficult as some of the very hard AI problems. However, it seems that we do need to tackle some small parts of this problem to gain insight into the interpretation of actions. Perhaps we can build some sort of limited grammar for actions within certain domains. A prior script of a series of actions at various levels of granularity would be very useful; however, the domain would then be limited to the possible actions defined in the script.

Perhaps the concepts of physics-based modeling can be employed to develop detailed representations of actions. The major benefit of physics-based methods would be that the variability of time can be easily incorporated into the analysis, allowing for spatio-temporal representations. Animation and mimicking of actions are a very important by-product of this method. Again, the limitation is the domain and the context.

So it seems that we are forced to deal with limited domains and devise methodologies that, for each specific domain, at various levels of detail (in both space and time), allow for a somewhat deep exploration of action interpretations. The type of representation, be it completely data-driven (only look at signals and infer actions using probabilistic and/or rule-based classifiers) or model-based (look at signals that can be mapped onto known and modeled actions), also depends on the task at hand.

Michael Isard

Existing methods of modelling motion and action fall broadly into two camps: discrete state-based and continuous-valued models. A useful line of research would be to investigate possible methods of combining the two methodologies. State-based descriptors are useful for capturing the long-term information in a sequence, but they rely heavily on being able to classify each image into one of the finite number of available states. As the problem domain becomes more general, it is not clear how well this will scale --- for example it may be possible to recognise "a person running" but much more difficult to generalise to "an animal running", where the differences in shape and natural timescale between, say, a mouse and an elephant, are rather large. It may also be difficult to represent such general notions as "an object moving to the right" without building enormously complex state sequences, so long as states are confined to be still snapshots of the motion.

Discrete-state representations allow the choice of using the whole scene as input, for example to a neural network, or segmenting foreground objects from the sequence. For most applications it is probably necessary to allow unmodelled backgrounds, and thus to segment the object(s) of interest, but in complex scenes it may prove impossible to perform this segmentation using only low-level vision. Some information about the probable configuration of the object will be needed, and a purely state-based representation of action, with no detailed information about the size or position of scene objects, severely restricts the high-level information which can be fed down to aid a segmentation routine.

Continuous-valued motion models have so far been used mainly within trackers, as motion predictors. They tend to have rather short timescales, and so to be well suited to representing such broadly discriminable classes as coupled oscillators. When the useful information in a scene can be parameterised efficiently, using a tracker to provide time-varying estimates of those parameters is an attractive methodology. By constructing a mapping from the continuous parameter space to the discrete state space, a tracker can be used as the segmentation stage for a discrete-state model. More generally, if multiple continuous models are used, then the model in force can be used as a discrete label in its own right, and the discrete-state approach can be expanded to include motions rather than simply snapshots (for example the concept of "moving to the right" can be rather easily expressed by a continuous-valued model). The tracking paradigm has so far broken down when the scenes to be described become too general, as it becomes difficult to parameterise the shape of an unknown object in a robust tracker. It is also hard to see how it would be possible to parameterise an entire arbitrary scene, so the continuous-valued model approach may be restricted to problems where there is a collection of foreground objects moving against unmodelled background.

The effective combination of continuous and discrete models raises interesting questions. Clearly some gestures are more amenable to this type of description than others, and it is not always apparent what granularity there should be --- when does a complex continuous model break down into a sequence of several distinct but simpler actions? Fusing discrete and continuous models into a single state-vector is also attractive, since it presents the opportunity to allow information from each type of model to feed towards the prediction and recognition of the other. While methods are known for separately learning Hidden Markov Models and parameterised continuous motion models from training sequences, discovering a learning algorithm to estimate both forms of model jointly is an open problem.
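
As one illustration of such a combination, the following toy sketch (not Isard's formulation; the labels, transition matrix, motion models and noise level are all invented) lets a discrete label index a small bank of continuous motion models, scores each model's prediction against the new observation, and propagates the label distribution through a Markov transition matrix, so that a motion such as "moving to the right" becomes a discrete state in its own right.

```python
import numpy as np

# Toy switching model: discrete label selects a continuous predictor,
# the label distribution is filtered alongside the continuous estimate.
labels = ["still", "moving_right", "moving_left"]
T = np.array([[0.90, 0.05, 0.05],          # P(next label | current label)
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
VEL = {"still": 0.0, "moving_right": 1.0, "moving_left": -1.0}

def step(label_probs, x, x_obs, dt=1.0, sigma=0.5):
    """One filtering step over the combined (discrete label, continuous x) state."""
    preds = np.array([x + VEL[lab] * dt for lab in labels])   # per-model prediction
    lik = np.exp(-0.5 * ((x_obs - preds) / sigma) ** 2)       # Gaussian observation score
    post = label_probs * lik
    post = post / post.sum()
    x_new = float(post @ preds)                               # mixed continuous estimate
    return post @ T, x_new

# A point drifting steadily to the right is soon labelled "moving_right",
# i.e. the motion itself, not a static snapshot, becomes the discrete state.
probs, x = np.ones(3) / 3, 0.0
for x_obs in np.arange(1.0, 8.0):
    probs, x = step(probs, x, x_obs)
print(labels[int(np.argmax(probs))])        # -> moving_right
```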

Yasuo Kuniyoshi

<X(t), Y(t), Z(t)> is almost useless, because raw trajectories mean nothing. This level of data should not be kept. Global spatio-temporal patterns of motion will be much more useful.

In [Kuniyoshi93] I used:

---- contents of an action unit ----

base = subject of action and primary attention.

targets = targets of action and target of temporary attention-shift.

initial-event = detected event marking the start of this action

final-event = detected event marking the end of this action

motion-feature = invariant feature of the motion during this time interval.

change/invariance = change or invariance observed regarding the base and targets.

---

The temporal extent of an action is defined as the time span within which one causality between the base and the targets holds invariant. A temporal segmentation point is detected as follows: when there is a change in the motion feature (which partially defines the context), an active visual search for new targets is triggered, driven by a set of alternative causalities anticipated for the next action interval. When the search finds and confirms a new combination of base, targets and the connecting causality, a new context is established and the system signals the segmentation point.
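
A hypothetical transcription of the action-unit fields above into a small data structure, together with a sketch of the segmentation rule just described (the field names follow the text; everything else is assumed):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical data structure mirroring the action-unit fields listed above.
@dataclass
class ActionUnit:
    base: str                        # subject of action and primary attention
    targets: List[str]               # targets of action / temporary attention shifts
    initial_event: str               # detected event marking the start of the action
    final_event: Optional[str]       # detected event marking the end of the action
    motion_feature: str              # invariant feature of the motion over the interval
    change_invariance: str           # change or invariance of the base and targets

def segmentation_point(current: ActionUnit,
                       new_motion_feature: str,
                       confirm_new_causality: Callable[[str], Optional[List[str]]]) -> bool:
    """Sketch of the segmentation rule: a change in the motion feature triggers
    an active search for a new base/target/causality combination; confirming
    one signals a temporal segmentation point."""
    if new_motion_feature == current.motion_feature:
        return False
    return confirm_new_causality(current.base) is not None
```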

Richard Mann and Allan Jepson

The general area of our research is computational perception with emphasis on high-level vision.

Our approach involves the specification of three general components considered essential for any perceptual system. First, we require an ontology that specifies the representation for the domain. Second, we require a domain theory that specifies which interpretations are consistent with the system's world knowledge. Finally, since the perception problem is typically under-constrained (there are multiple interpretations consistent with the sensor data), we require some form of preferences to select plausible interpretations. Together these provide a computational definition of a perceiver, for which percepts are defined to be maximally preferred interpretations consistent with both the sense data and the system's world knowledge (Jepson and Richards, 1993).

In our current work we consider a specific inference problem: the perception of qualitative scene dynamics from motion sequences. An implemented perceptual system is described in detail in (Mann, Jepson, and Siskind, 1997; Mann, 1997).

For the first component of our system, we propose a representation based on the dynamic properties of objects and the generation and transfer of forces in the scene. Such a representation includes, for example: the presence of gravity, the presence of a ground plane, whether objects are 'active' or 'passive', whether objects are contacting and/or attached to other objects, and so on. Collectively, we refer to a set of hypotheses that completely specify the properties of interest in the scene as an interpretation.

For the second component of our system we use a domain theory based on the Newtonian mechanics of a simplified scene model. Specifically, we say that an interpretation is feasible if there exists a consistent set of forces and masses that explain the observed accelerations of the scene objects. If we consider scenes containing rigid bodies in continuous motion, such a feasibility test can be reduced to a linear programming problem.
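
A toy sketch of such a feasibility test, under strong simplifying assumptions that are mine rather than the authors': two 2-D point objects (an active hand and a passive can), known accelerations, gravity, unknown positive masses, a free force for the active object, and a single unconstrained force along a declared attachment. The linear program only checks whether any consistent assignment of forces and masses exists.

```python
import numpy as np
from scipy.optimize import linprog

# Toy Newtonian feasibility test: do positive masses and forces exist that
# explain the observed accelerations?  Variables: [m_hand, m_can, fx, fy, Fx, Fy].
g = np.array([0.0, -9.8])

def feasible(a_can, attached):
    a_hand = np.array([0.0, 1.0])           # observed accelerations (assumed values)
    a_can = np.asarray(a_can, dtype=float)
    A_eq, b_eq = [], []
    for k in range(2):                       # hand (active): m_hand*(a_hand-g) = F - f
        row = np.zeros(6)
        row[0] = (a_hand - g)[k]
        row[2 + k] = 1.0 if attached else 0.0
        row[4 + k] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    for k in range(2):                       # can (passive): m_can*(a_can-g) = f (if attached)
        row = np.zeros(6)
        row[1] = (a_can - g)[k]
        row[2 + k] = -1.0 if attached else 0.0
        A_eq.append(row); b_eq.append(0.0)
    bounds = [(1.0, None), (1.0, None)] + [(None, None)] * 4   # masses bounded away from zero
    res = linprog(c=np.zeros(6), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds)
    return res.status == 0                   # 0 = a consistent assignment was found

print(feasible(a_can=[0.0, 1.0], attached=True))    # True: the hand can lift the can
print(feasible(a_can=[0.0, 1.0], attached=False))   # False: nothing explains the lift
```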

In general, given feasibility conditions alone, there will be multiple interpretations consistent with the observed data. For example, a trivial interpretation can always be obtained by making every scene object active. In order to find informative interpretations we would like to choose interpretations that make the fewest assumptions about the participant objects. To do this, we use a set of preferences to choose among interpretations. In the work described here we use preference rules that choose interpretations which contain the smallest set of active objects and the smallest set of attachments between objects.

It is important to note, however, that while our system produces plausible explanations for the motion, there will often be multiple interpretations for a given frame of the sequence. For example, when considering the instantaneous motion of two attached objects, such as the hand lifting a can, we cannot determine which of the two objects is generating the lifting force. While it is possible to reduce ambiguity by integrating inferences over time (such as noticing that the hand is active in earlier frames of the sequence), this is only a partial solution to the problem. In particular, as described in (Mann and Jepson, 1997), if a behavior (such as attachment) is observed only when the objects interact we will be left with uncertainty about which object (if any) is the "cause" of this interaction. This research raises several points relevant to the issues raised at this workshop.

Maja Mataric

Inspired by biological data about motor coding, we are interested in representing action (movement, in our case) in the form of motor primitives or basis behaviors. We assume that these are innate or learned models of particular movements which can be sequenced and overlapped (much like the basis behaviors described above) to generate a flexible motor repertoire. This hypothesis is supported by and consistent with existing literature in both neuroscience and psychology. We are particularly interested in the interaction between motor primitives and the perceptual system. Does the perceptual system use the same primitives, i.e., is the observed action/behavior mapped directly onto the intrinsic motor system, even without movement? Recent evidence from neuroscience suggests that watching movement results in neural activation that may correspond to unconscious early stages of movement preparation, as if seeing any movement is almost beginning to imitate it. This is the motivation behind our study of imitation, a natural domain of sensory-motor integration.

Randal Nelson

I think it is quite obvious there is no single "right" answer to this question. Just as in the case of static imagery, the representations that are appropriate depend strongly on what the task is. That said, a number of useful representations have been looked at, and it makes some sense to classify them. One observation is that the same (semi-)distinction between analog (continuous) and symbolic (discrete) representations that shows up in static image processing shows up in the perception of action, and with the same low-level/high-level, concrete/abstract connotations.

At the lowest analog level, we have time-dependent images of various sorts (2+1D, 2-1/2+1D, 3+1D) and their processed versions (normalized, smoothed, enhanced, motion fields and derived quantities, temporally warped sequences, etc.).

Aside from the fact that some of the processing operations are fundamentally and asymmetrically temporal in nature, these representations lend themselves to the same sort of pattern matching and analysis techniques that are useful for static imagery (basically various sorts of template matching operations, discretized, warped, inverted, transformed, etc.), and they suffer from the same limitations. It is possible to go a very long way with representations of this sort, and a lot of useful and impressive results have been obtained. However, there are certain limitations to the degree of abstraction that can be handled with analogical representations alone.

There is a lot less uniformity of style in the more discrete representations that have been employed, and weaker ties to conventional image processing. Some of these are listed below:

1) Spatio-temporal segmentation:

Probably the simplest symbolic level, this is analogous to region segmentation - separation is into regions of the same (moving) stuff and persists through time. Pretty close to a direct analog representation.

2) Labeled moving blobs:

The above segmentation forms a basis for this. Essentially, individual objects are tracked, and simple information (identity, size, speed, etc.) is carried along. This can be quite useful for some sorts of surveillance operations.

3) Annotated trajectory representations:

The labeled moving blob representation is modified to contain information about temporal events and interactions between objects, and the time line is made more symbolic (e.g. here's a walking man, here he sat down, opened a door, got in a car and drove off). At this point the basic data structure is generally some sort of labeled graph (a minimal sketch of such a graph appears after this list).

4) Articulated moving object:

The emphasis here is to fit a physical model more complex than a moving blob to the basic data, e.g. a human puppet. This is potentially useful for making certain sorts of fine gestural distinctions, or recognizing detailed activities like shaking hands or opening a box. Once the fit has been made, the same sort of labeled part and annotated trajectory descriptions can be put on top of the more complex model (e.g. here he picks up a book with his right hand, here she writes a check).

5) Coordinated activity representations:

The most abstract sort of representation that has been attempted to date. The idea is to represent activities that involve multiple, temporally related components, or multiple, coordinated actors. Examples might include football plays, a man assembling an engine, a construction team building a roadway, etc. Representations tend, again, to consist of some sort of labeled graph, but now with higher-order constraints on the global structure. This work is still quite preliminary.
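
As a concrete (and entirely hypothetical) illustration of level 3 above, the sketch below encodes an annotated trajectory as a labeled graph whose nodes are temporal events and whose edges carry object identity and coarse motion between events; the class and field names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical annotated-trajectory graph: nodes are events, edges are
# identity-labeled segments of motion between events.
@dataclass
class EventNode:
    time: float
    label: str                          # e.g. "sat down", "opened door", "drove off"
    location: Tuple[float, float]

@dataclass
class TrajectoryGraph:
    nodes: List[EventNode] = field(default_factory=list)
    edges: Dict[int, List[Tuple[int, str]]] = field(default_factory=dict)

    def add_event(self, node: EventNode) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1

    def link(self, src: int, dst: int, annotation: str) -> None:
        """Edge annotations carry object identity and coarse motion."""
        self.edges.setdefault(src, []).append((dst, annotation))

# "Here's a walking man, here he sat down, got in a car and drove off."
g = TrajectoryGraph()
a = g.add_event(EventNode(0.0, "appears walking", (10.0, 4.0)))
b = g.add_event(EventNode(7.5, "sits down", (22.0, 4.0)))
c = g.add_event(EventNode(30.0, "drives off", (40.0, 6.0)))
g.link(a, b, "man #3 walking")
g.link(b, c, "man #3 enters car")
```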

Tom Olson

What do we mean by action? AND/OR How do we represent action?

In classical planning research, an action is a state change in the world resulting (or potentially resulting) from a decision taken by an agent. It is characterized by a mapping from the world states in which it can occur to the set of world states that may result from it. This definition has many problems, but it remains useful both for planning and for action recognition.

As an example, consider the action "depositing an object". The classical STRIPS definition is something like:

BEFORE: there is an agent and an object, and the agent is holding the object.

AFTER: there is an agent and an object, and the agent is not holding the object.

A definition used for planning would probably add other predicates defining the effect of the action on the agent's manipulator and the location where the object is deposited. However, even the simple definition above can be used for recognition. The recognizer must be able to detect the presence of agents and objects, and to determine whether or not an agent is holding a particular object. In a static environment, change detection can be used to identify objects, and the 'not holding' predicate can be asserted whenever the agent is physically separated from the object by a sufficient distance.

The advantage of this type of state-based description is that it is extremely abstract: it makes no explicit reference to appearance, and specifies only those aspects of the world state that are important to the definition of the action. This makes the definition independent of viewing parameters, as well as of the particular set of subactions used to perform the top-level action. A hierarchy of actions can be induced by adding additional predicates to the definition. For example, "throwing an object" is an instance of "depositing an object" that adds a constraint on the velocity of the object at the moment of separation from the agent. "Jaywalking" is an instance of "crossing the street" that adds constraints on location of the agent and/or state of the crossing signals.
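
A minimal sketch of these state-based definitions and of the induced hierarchy, with hypothetical predicate names and an assumed velocity threshold; world states are simply sets of ground predicates.

```python
# World states are sets of ground predicates; an action is recognized when its
# BEFORE predicates hold in one observed state and its AFTER predicates hold later.

def holding(state, agent, obj):
    return ("holding", agent, obj) in state

def deposit_recognized(before_state, after_state, agent, obj):
    """BEFORE: agent holds the object.  AFTER: agent no longer holds it."""
    return holding(before_state, agent, obj) and not holding(after_state, agent, obj)

def throw_recognized(before_state, after_state, agent, obj,
                     obj_speed_at_release, speed_threshold=2.0):
    """'Throwing' as an instance of 'depositing' with an added velocity
    constraint at the moment of separation (the threshold is an assumption)."""
    return (deposit_recognized(before_state, after_state, agent, obj)
            and obj_speed_at_release > speed_threshold)

# Example: change detection asserts 'not holding' once agent and object separate.
s0 = {("holding", "agent1", "cup")}
s1 = set()                                               # agent and cup now far apart
print(deposit_recognized(s0, s1, "agent1", "cup"))       # True
print(throw_recognized(s0, s1, "agent1", "cup", 3.5))    # True
```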

Abstract action definitions do have important limitations. Limits on their representational power have been extensively discussed in the planning literature. Their use in action recognition raises other concerns. These include:

* observability of the predicates. Definitions for planning purposes rightly stress the effects of an action on the world state. These effects may not be the most salient and recognizable features of the action from a perceptual point of view. This is particularly true of actions whose primary purpose is communication; all of the critical state changes occur inside the heads of the agents involved.

* parsimony. The definitions of certain actions (e.g., human body actions) may depend on long sequences of relatively complex states. Fully specifying these states may be extremely onerous, and is unnecessary if the actions can be recognized by some sort of signature that can be extracted from the image sequence.

Alex Pentland

My approach to modeling human behavior is to consider the human as a finite state device with a (possibly large) number of internal "mental" states, each with its own particular control behavior, and inter-state transition probabilities (e.g., when driving a car the states might be passing, following, turning, etc.). State transitions can be directly influenced by sensory events, or can form natural sequences as a series of actions "play themselves out."

A simple example of this type of human model would be a bank of standard quadratic controllers, each using different dynamics and measurements, together with a network of probabilistic transitions between them. [In fact, we have used this exact formulation to predict users' actions for telemanipulation tasks [5].] A very much more complex example would be the virtual dog Silas in our ALIVE project [4].

In this framework action recognition is identification of a person's current internal (intentional) state, which then allows prediction of the most-likely subsequent states via the state transition probabilities. The problem, of course, is that the internal states of the human are not directly observable, so they must be determined through an indirect inference/estimation process.

In the case that the states are configured into a Markov chain, there exist efficient and robust methods of accomplishing this inference using the expectation-maximization methods developed for Hidden Markov Models (HMMs) in speech processing. More recently, a variety of researchers at M.I.T. have developed methods of handling coupled Markov chains, hierarchical Markov chains, and even methods of working with general networks [2]. By using these methods we have been able to identify facial expressions [3], read American Sign Language [8], and recognize T'ai Chi gestures [2].
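
For readers unfamiliar with the machinery, the following self-contained sketch shows the kind of inference involved: a minimal Viterbi decoder that recovers the most likely sequence of hidden "mental" states from discrete observations. The state names, observation symbols and probabilities are illustrative only and are not taken from the cited systems.

```python
import numpy as np

# Minimal Viterbi decoding: infer the most likely hidden state sequence
# (e.g. following / passing / turning) from discrete observations.
states = ["following", "passing", "turning"]
obs_symbols = ["steady", "accelerating", "steering"]

pi = np.log([0.6, 0.2, 0.2])                 # initial state probabilities
A = np.log([[0.8, 0.1, 0.1],                 # state transition probabilities
            [0.3, 0.6, 0.1],
            [0.3, 0.1, 0.6]])
B = np.log([[0.7, 0.2, 0.1],                 # P(observation | state)
            [0.2, 0.7, 0.1],
            [0.1, 0.2, 0.7]])

def viterbi(obs):
    o = [obs_symbols.index(x) for x in obs]
    delta = pi + B[:, o[0]]                  # best log-score ending in each state
    back = []
    for t in range(1, len(o)):
        scores = delta[:, None] + A          # scores[i, j] = delta[i] + log A[i, j]
        back.append(scores.argmax(axis=0))   # best predecessor for each state
        delta = scores.max(axis=0) + B[:, o[t]]
    path = [int(delta.argmax())]
    for ptr in reversed(back):               # backtrack through the pointers
        path.append(int(ptr[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["steady", "accelerating", "accelerating", "steering"]))
```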

Tomaso Poggio and Pawan Sinha

A possible representation of some stereotyped actions (say, Johannesma-type motions) is to extend to time the morphable models of Jones and Poggio (see also Taylor, Cootes, and Blake). An action then consists of a t-field, defined as the sequence of correspondences between the images in the sequence, that is, the sequence of optical flows computed between successive and overlapping pairs of images. Correspondence between two actions is provided by correspondence between their t-fields, which is in principle provided by one optical flow between two images in the two sequences at the "same" time t.

A morphable model of a given action is then a linear combination of the t-fields of a few prototypical examples of the action. A morphable model can be used for analysis and synthesis, as in the static case.
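
A sketch of what such a morphable action model might look like computationally, assuming the t-fields are already available as stacks of optical-flow fields in correspondence and of equal length (i.e. time warping has been done): analysis is a least-squares fit of the combination coefficients, and synthesis reuses the same linear combination.

```python
import numpy as np

# Morphable action model sketch: each prototype action is a "t-field",
# here a (T, H, W, 2) stack of optical-flow fields assumed to be in
# correspondence.  A novel action is analyzed as the least-squares linear
# combination of the prototypes; the flow computation itself is assumed given.

def fit_coefficients(prototypes, novel):
    """prototypes: list of (T, H, W, 2) arrays; novel: (T, H, W, 2) array."""
    P = np.stack([p.reshape(-1) for p in prototypes], axis=1)   # columns = prototypes
    coeffs, *_ = np.linalg.lstsq(P, novel.reshape(-1), rcond=None)
    return coeffs

def synthesize(prototypes, coeffs):
    """Synthesis: the same linear combination generates a new t-field."""
    return sum(c * p for c, p in zip(coeffs, prototypes))

# Toy usage with random stand-ins for real flow fields.
rng = np.random.default_rng(0)
protos = [rng.normal(size=(8, 32, 32, 2)) for _ in range(3)]
novel = 0.5 * protos[0] + 0.3 * protos[1]
print(fit_coefficients(protos, novel))    # approximately [0.5, 0.3, 0.0]
```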

There are several open issues in an approach of this type. The first one regards the primitives used for matching: pixels may be used but seem too expensive a representation. Coarser-scale image-based representations may be one solution. Another is to compress, say, an initial pixel representation so that it contains only points that move in different ways on the image plane. Another issue is related to time warping: it should be possible to establish correspondence between two actions even if their durations are different. This calls for extensions of the usual optical flow algorithms to space-time, which in turn may change our definition of t-fields and the primitives on which they are based.

James Rehg

The basic representation of action patterns is labelled space-time volumes. There are at least three levels in the representation hierarchy, which can be illustrated for the case of the hands. At the first level is a statement simply that the hands are moving. At the second level is the identification of a specific hand gesture, and at the third is the estimation of some parameters of the gesture, such as its velocity, throughout its duration. In each case it should be possible to localize the action in image space, but 3D locations are likely to be application specific (such as the upstage and downstage notation used for dance steps). A notion of spatio-temporal scale space could be useful for action patterns. Time plays a role similar to space at this level; the goal is to remain invariant to it.

Action concepts are in turn likely to be composed of a series of action patterns or other lower-level events. Time should play more of an ordering role here, regulating the sequencing of events. It may not be easy to confine high-level actions to a localized time and place. For example, building a house comprises a large collection of activities taking place at different places and times. However, the observer's reference frame obviously plays a role here, as detecting the action "building a house" from satellite imagery is obviously different from doing so from 100 yards away.

Whitman Richards

I distinguish three levels of interpretation of story-telling actions by simple objects: (1) the general form of stories that have a small number of players, (2) a taxonomy of behaviors and their associated action types, which are fewer than the set of behaviors, and (3) perceptual features that help label action types. The latter require some minimal shape and heading information.

At the story level, Campbell's "Hero's Journey" and Propp's (1968) "Morphology" offer classifications of stories for a limited number of selected types of players. The simplest classifications may be construed as positive and negative forces acting upon or between objects, where there is first a positive attracting force, then a repelling force, and finally a positive force. The "boy-meets-girl" story is a classical example. Formless objects can easily be assigned appropriate roles in such a story.

At the behavior and action level, a first cut at a repertoire of primitive behaviours can be taken from ethology (Blumberg, 1996). Hence play, fight, feed, flight, etc. are important elements of the taxonomy. Each behavior does not have its own special action type, however. So, for example, dance, fight and play may all have the same underlying action type, with different behavior categories assigned in different contexts (i.e. sequences of actions).

At the level of action type, models have been proposed by Thibadeau (1986), Talmy (1992), and Jackendoff (1992) that are heavily influenced by natural language descriptions. I will stress basic causal and spatial factors, and especially Jepson, Mann & Siskind's (1995) "non-accidental" configurations and coordinate-frame inferences that are required as precursors to labelling action types with semantic content.

Mubarak Shah

Action can be represented by a trajectory in some feature space. A trajectory is a sequence of feature vectors F_i, for i = 1, ..., n, where i denotes time or frame number. A feature vector can simply be the image location of a particular point on the object, the centroid of an image region, moments of an image region, gray levels in a region, optical flow in a region (used as the magnitude of optical flow, or concatenated into a vector), the sum of all changed pixels in each column (XT-trace), the 3-D locations (X_i, Y_i, Z_i) of a particular point on the object, joint angles (how the parts of the body move with respect to time), muscle actuations, properties of optical flow in a region such as curl and divergence, coefficients used in the eigen-decomposition of the above features, and so on.

A trajectory can thus be considered a vector-valued function, that is, at each time we have multiple values. However, sometimes a single-valued function is better suited for computations, and therefore parameterization of trajectories is necessary. A trajectory can be parameterized in several ways. For instance, an image trajectory (consisting of (x_i, y_i) locations with respect to time) can be parameterized using the \phi-S curve, speed and direction, velocities v_x and v_y, or spatiotemporal curvature. The first parameterization completely ignores time; two very different trajectories might have the same curve. The remaining parameterizations are time dependent.
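
As an example of the last parameterization, the sketch below computes the spatiotemporal curvature of an image trajectory by treating time as a third coordinate, so the trajectory becomes the space curve (x(t), y(t), t); derivatives are approximated by finite differences and unit frame spacing is assumed.

```python
import numpy as np

# Spatiotemporal curvature sketch: curvature of the space curve (x(t), y(t), t),
# k = |r' x r''| / |r'|^3 with r = (x, y, t), so r' = (x', y', 1), r'' = (x'', y'', 0).
def spatiotemporal_curvature(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = np.gradient(x), np.gradient(y)          # first derivatives (dt = 1)
    xdd, ydd = np.gradient(xd), np.gradient(yd)      # second derivatives
    num = np.sqrt(xdd**2 + ydd**2 + (xd * ydd - yd * xdd)**2)
    den = (xd**2 + yd**2 + 1.0) ** 1.5
    return num / den

# Toy trajectory: move right, then turn and move up; curvature peaks at the turn.
x = np.concatenate([np.arange(50.0), np.full(50, 49.0)])
y = np.concatenate([np.zeros(50), np.arange(50.0)])
k = spatiotemporal_curvature(x, y)
print(int(k.argmax()))        # near the turn (index ~50)
```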

An alternative way to represent action is to generate MHIs (Motion History Images) and MRIs (Motion Recency Images). The MRI is the union of all change-detected images; it represents all the pixels that have changed over the whole sequence. The MHI is a scalar-valued image in which more recently moving pixels are brighter. The MRI does not have any notion of time, while the MHI has some notion of time.
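
A sketch of both images computed from a sequence of binary change-detection masks; the linear brightness ramp used for the MHI is an assumption, since the text does not fix a particular decay scheme.

```python
import numpy as np

# MRI = union of all changed pixels (no notion of time);
# MHI = more recently changed pixels are brighter (linear ramp, assumed scheme).
def mri_and_mhi(masks):
    """masks: (T, H, W) boolean array of per-frame change detections."""
    masks = np.asarray(masks, bool)
    T = masks.shape[0]
    mri = masks.any(axis=0)
    frame_idx = np.arange(1, T + 1)[:, None, None]     # 1..T, later frames = brighter
    mhi = (masks * frame_idx).max(axis=0) / T          # scaled to [0, 1]
    return mri, mhi

masks = np.zeros((4, 5, 5), bool)
masks[0, 1, 1] = True           # old motion -> dim in the MHI
masks[3, 3, 3] = True           # recent motion -> bright in the MHI
mri, mhi = mri_and_mhi(masks)
print(mhi[1, 1], mhi[3, 3])     # 0.25 vs. 1.0; both pixels are set in the MRI
```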

Many actions are cyclic in nature. A cycle is a natural scale at which to represent an action. An action can be represented using units of cycles. This is one way to deal with variable-length feature trajectories: each cycle may contain a different number of frames, but the same qualitative information. It should be possible to detect cycles at different scales: coarser cycles would appear at a coarser scale, while finer cycles could be detected at a finer scale.
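
A toy sketch of cycle detection at two temporal scales, assuming a 1-D feature trajectory: the dominant cycle length is read off the spectrum, and smoothing the trajectory first plays the role of moving to a coarser scale, where the coarser cycle dominates. The signal and window sizes are invented for illustration.

```python
import numpy as np

# Estimate the dominant cycle length of a 1-D feature trajectory from its
# spectrum; pre-smoothing suppresses the fine cycle, exposing the coarse one.
def dominant_cycle(signal, smooth=1):
    s = np.asarray(signal, float)
    if smooth > 1:
        s = np.convolve(s, np.ones(smooth) / smooth, mode="same")
    s = s - s.mean()
    spectrum = np.abs(np.fft.rfft(s))
    freqs = np.fft.rfftfreq(len(s))
    k = spectrum[1:].argmax() + 1            # skip the zero-frequency term
    return 1.0 / freqs[k]                    # cycle length in frames

t = np.arange(400)
traj = np.sin(2 * np.pi * t / 5) + 0.6 * np.sin(2 * np.pi * t / 25)
print(dominant_cycle(traj))                  # ~5 frames: the fine cycle
print(dominant_cycle(traj, smooth=11))       # ~25 frames: the coarse cycle
```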

Yaser Yacoob

The choice between top-down and bottom-up strategies for object recognition has accompanied computer vision since early research. This dichotomy has, however, been largely ignored in research on the recognition of activities. Almost invariably, activity recognition is posed as a problem in which a set of measurements taken from the visual field is used in a framework of temporal pattern matching. As a result, statistical pattern recognition methods such as Hidden Markov Models, Dynamic Time Warping, Neural Networks, etc. are employed.

A top-down approach to both the spatial and temporal aspects of modeling and recognition of activities may be critical to overcoming the complexity of the problem. The challenge is to devise spatio-temporal representations that integrate both the activity and measurement levels. Although it is possible to explicitly use information available in the time-progression of activities to control low-level measurements, this use serves more to prune the measurement search space and therefore only superficially integrates high-level knowledge into estimation. It is more economical to have a universal representation that simultaneously embeds low- and high-level information.

While a spatial top-down strategy could draw from earlier work in object recognition, a temporally driven top-down strategy raises issues that have not been encountered previously. The most prominent of these issues are:

(1) What is a temporal window and how are spatial scale-space ideas extended to spatio-temporal analysis?

(2) What are the temporal invariants of activities?

(3) What spatio-temporal representations are effective?

(4) When does time begin? And how important is it?

It is necessary to acknowledge and enquire into the differences between time and space, since time cannot be simply treated as an additional dimension to space.

[Other Important Questions about Representation:]

* Activity invariants--what are they?

Defining activity invariants under spatio-temporal transformations is critical to recognition. What are the spatial and temporal "fingerprints" of activities?

* How does spatial and temporal context affect activity recognition?

Spatial and temporal context can be decisive for the interpretation of activities. How can context be represented and used?

Emre Yilmaz

In Whitaker and Halas' excellent book, Timing for Animation, there's a great illustration of cartoony vs. realistic (and very flat) motion. It tells the story better than I can in words:

[image can be found at:
http://www.protozoa.com/~emre/action_gesture/action_gesture_B01.html]

This image's caption reads, "Cartoon is a medium of caricature- naturalistic motion looks weak in animation. Look at what actually happens, simplify down to the essentials of the movement and exaggerate these to the extreme." (Whitaker and Halas, Timing for Animation, Focal Press, London, 1981, pp. 28-29. This excellent book is unfortunately out of print and hard to track down.)
