Question #1 - What is action?

1) What do we mean by action? Using some of my own work as a source of confusion, "action" has been used to refer to everything from a simple sitting movement to the action of mixing ingredients in a bowl. Clearly, the higher the level we address, the more diverse and complicated the type of information required to make the assertion. While recent computer vision work has attempted to maintain as much distance as possible from AI/semantics/reasoning, it seems difficult to maintain the separation if we are going to generate high level labels like "chopping" or "shoplifting."

Minoru Asada

Before defining action, we should consider why such a definition is necessary, in what context, and for what purpose it would be useful or effective.

**(Our Standing Point)**

An autonomous agent is regarded as a system that has a complex and ongoing interaction with a dynamic environment whose changes are difficult to predict. Our final goal, in designing and building an autonomous agent with vision-based learning capabilities, is to have it perform a variety of tasks adequately in a complex environment. In order to build such an agent, we have to make clear the interaction between the agent and its environment.

The ultimate goal of our research is to design the fundamental internal structure of physical entities with bodies (robots) from which complex behaviors can emerge through interactions with their environments. For intelligent behaviors to emerge, the physical body plays an important role in bringing the system into meaningful interaction with the physical environment, which is complex and uncertain but comes with an automatically consistent set of natural constraints. This facilitates correct agent design, learning from the environment, and rich, meaningful agent interaction. The meaning of "having a physical body" can be summarized as follows:

1) Sensing and acting capabilities are not separable, but tightly coupled.

2) In order to accomplish the given tasks, the sensor and actuator spaces should be abstracted under resource-bounded conditions (memory, processing power, controller, etc.).

3) The abstraction depends on both the fundamental embodiments inside the agents and the experiences (interactions with their environments).

4) The consequences of the abstraction are the agent-based subjective representation of the environment, and its evaluation can be done by the consequences of behaviors.

5) In the real world, both inter-agent and agent-environment interactions are asynchronous, parallel and arbitrarily complex. There is no justification for adopting a particular top-down abstraction level for simulation such as a global clock 'tick', observable information about other agents, modes of interaction among agents, or even physical phenomena like slippage, as any seemingly insignificant parameter can sometimes take over and affect the global multi-agent behavior.

6) The natural complexity of physical interaction automatically generates reliable sample distributions of input data for learning, unlike the a priori Gaussian distributions used in simulations, which do not always capture the correct distribution.

The design principles are:

1) the design of the internal structure of the agent, which has a physical body able to interact with its environment, and

2) the policy for providing the agent with tasks, situations, and environments so as to develop the internal structure.

Francois Bremond

An action is any process performed by (at least) one mobile object. A mobile object can be, for example, a human, a group of humans, a vehicle or a robot. At a higher level, actions are used to understand the behaviors of mobile objects. An action is perceived through the computation of properties (and property evolutions) relative to the moving regions corresponding to the involved mobile objects. "The man has stayed in the dangerous zone for more than three minutes", "the man's size shrinks" and "the group of men scatters" are examples of properties (or property evolutions). The property relative to an action is therefore not necessarily dynamic. Properties are computed by image processing routines, whereas actions are described in natural language. Work on describing actions and behaviors in natural language is therefore useful, and I have drawn on several works from the semantics community. The remaining issue is how to bridge the gap between numerical properties (computed by image processing routines) and symbolic information (used by the human operator).
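As a rough illustration of how a numerical property might be turned into a symbolic one, the sketch below checks the "stayed in the dangerous zone for more than three minutes" example against a tracked position sequence. The track format, the rectangular zone and the function name `dwell_exceeds` are assumptions made for this example only; they are not taken from the author's system.

```python
from typing import List, Tuple

# Hypothetical formats: a track is a list of (time_in_seconds, x, y) samples,
# and a zone is an axis-aligned rectangle (xmin, ymin, xmax, ymax).
Track = List[Tuple[float, float, float]]
Zone = Tuple[float, float, float, float]

def dwell_exceeds(track: Track, zone: Zone, limit_s: float) -> bool:
    """Return True if the object stays inside the zone, without leaving,
    for longer than limit_s seconds."""
    entered = None  # time at which the current uninterrupted stay began
    for t, x, y in track:
        inside = zone[0] <= x <= zone[2] and zone[1] <= y <= zone[3]
        if inside:
            if entered is None:
                entered = t
            if t - entered > limit_s:
                return True
        else:
            entered = None  # leaving the zone resets the stay
    return False

danger_zone = (0.0, 0.0, 2.0, 2.0)
track = [(0.0, 1.0, 1.0), (100.0, 1.0, 1.0), (200.0, 1.0, 1.0)]
# Symbolic property: "stayed in the dangerous zone for more than three minutes"
stayed_too_long = dwell_exceeds(track, danger_zone, limit_s=180.0)
```

The boolean result is the kind of symbolic statement an operator could read, while everything inside the function remains purely numerical.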

Trevor Darrell

To me, actions are simply objects in the spatio-temporal domain.

Larry Davis

People's faces have attracted the most attention, and the greatest successes have involved the application of "appearance-models" to problems such as face detection, estimation of head orientation and face recognition. This has led many researchers, ourselves included, to be tempted to apply similar techniques to the representation and recognition of human "activities." So, for example, Black and Yacoob employ appearance models of facial feature deformation to recognize facial expressions from video; Pentland and Starner employ them for reading American Sign Language; and Bobick and Intille introduced motion energy templates for recognizing simple actions such as sitting, kneeling, etc. There is also, of course, a strong and vocal community in human perception research that argues in favor of appearance-based methods for human visual perception.

Yasuo Kuniyoshi

An action is a causal process starting with the intention of the actor, mediated by the bodily motion, leading to an effect (change or invariance) in the environment [Dretske88]. The "observable portion" of an action is the portion of the above process starting from the bodily motion and ending at the effect in the environment [Kuniyoshi93]. If the effect is a purely physical phenomenon, the action is called physical. If the above process involves the mental process of another agent, the action is called communication. As long as the above structure holds, the constituent movements and environmental changes can take various forms; they can be continuous and parallel, repetitive or one-shot, the effect can be on the actor's own body, etc.

The distinction between actions and motions is important. Actions must involve the subject of motion, the target of action, and the causality which connects them. Motions involve the subject of motion and its movement only.

The deeper reason for adopting the above definition, in contrast with a more conventional, computer-vision oriented one focussing only on the motion features, is the following:

By 'action' we assign certain meanings to the observed phenomena. This meaning is the only source of discriminating different actions. Since actions are qualitative by nature, as discussed in the answer to the next question, there are no objective motion-parametric discrimination rules. In the definition above, the 'meaning' is captured by 'causality'. Actually the above discussion is not complete because there is nowhere to anchor the meaning. This will only be possible when the problem is placed in an inter-agent situation, e.g. Agent P assigns Meaning M to the action A of Agent Q, where M itself will be explained as a product of interactions between the agents (see later discussions in this paper).

Richard Mann and Allan Jepson

So far we have described work that infers so-called "force-dynamic" descriptions from observations of image sequences. In order to describe action, however, we need a way to relate natural categories of events such as "lift", "drop", "throw", etc. to the underlying force-dynamic representation.

To address this problem we could describe such events as sequences of qualitative dynamic states involving the participant objects. For example, as described in (Siskind, 1992), we may describe an event such as "A drops B" as a concatenation of two adjacent time intervals: an interval where the object B is at rest and stably supported by object A, followed by an interval where object B is falling.
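The "A drops B" description can be sketched as a search for two adjacent intervals in a sequence of per-frame qualitative labels. This is only an illustration: the label names and the helper functions `intervals` and `find_drops` are hypothetical, and the sketch assumes the qualitative state of B has already been inferred for each frame.

```python
from typing import List, Tuple

# Hypothetical qualitative state labels for object B, one per frame.
SUPPORTED = "supported_by_A"  # B at rest and stably supported by A
FALLING = "falling"           # B falling

def intervals(states: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse a per-frame label sequence into (label, start, end) runs."""
    runs: List[Tuple[str, int, int]] = []
    for i, label in enumerate(states):
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i)  # extend the current run
        else:
            runs.append((label, i, i))          # start a new run
    return runs

def find_drops(states: List[str]) -> List[int]:
    """Return the start frame of every 'falling' interval that directly
    follows a 'supported' interval, i.e. every 'A drops B' event."""
    runs = intervals(states)
    return [nxt[1] for cur, nxt in zip(runs, runs[1:])
            if cur[0] == SUPPORTED and nxt[0] == FALLING]

frames = [SUPPORTED] * 3 + [FALLING] * 2 + ["at_rest_on_floor"] * 2
drops = find_drops(frames)  # the falling interval starts at frame 3
```

Other events such as "lift" or "throw" could be described in the same way by changing which pair (or longer chain) of adjacent intervals is matched.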

A more subtle question, however, concerns how to define action itself. In particular, what differentiates "passive" or "natural" events such as a rolling or bouncing ball, from "willful" actions such as a person hitting, dropping or throwing an object? We believe that in order to address these issues a more elaborate ontology will be required, perhaps including a representation of objects as agents with internal state, goals, and intentions.

Maja Mataric

The granularity of "action" within a system is entirely dependent on the most suitable representation for the task, which means it is task- and domain-specific, and as open-ended as the entire representation problem in AI. The problem of choosing the right level of action representation has been endemic in robot control as well as in learning; both fields tend to use low-granularity representations for tighter control but slower processing and less robust performance. My work is based on the behavior-based paradigm, and uses behaviors as the basic unit of action. Rather than relying on an ad hoc notion of behaviors, we define them to be control systems that achieve and maintain a particular goal. For example, "wall following" is a behavior, as is "avoidance." It is important to distinguish externally observable behaviors from internally controllable ones. The goal of our work is to design a small set of what we call "basis behaviors", which are internally controllable by the agent, and use those to generate the desired, much larger, set of externally observable behaviors. Behaviors in our system can be temporally combined (executed in parallel) or sequenced (mutually exclusive). It is also important to note that behaviors are dually constrained: from the top down by the goals of the system (e.g., to be as high granularity as possible as long as the tasks are achievable) and from the bottom up by the sensors, effectors, and most strongly the dynamics (e.g., to be as low granularity as possible to make the system sensitive and reactive). Consequently, our physical systems never have behaviors of the type "turn left by 90 degrees" since those make little sense in systems with dynamical actuators.
Furthermore, our behaviors, in particular the basis behaviors, are geared toward spatial interactions and social interactions of multi-agent and multi-robot systems, so we have behaviors such as "avoidance", "following", "aggregation", "dispersion", and "homing", and higher-level resulting behaviors such as "flocking", "hoarding", "foraging", "herding", and more complex social structures such as dominance hierarchies, territorial division, competition, and cooperation. The key point here is that the action representation is high-enough level to hide the details of control; a "following behavior" is a control system that encapsulates the details and can be effectively used as a building block of higher-level behaviors. Similarly, we have used this notion to implement effective real-time learning on multiple physical robots. Because the systems did not rely on low-granularity action-space, they could be more resistant to noise and errors, and make more progress toward their different goals, which generated more frequent reinforcement and more continuous estimation of progress, both of which accelerate learning even in dynamic multi-robot domains.
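A minimal sketch of the idea of behaviors as building blocks, under strong simplifying assumptions: each behavior is modeled as a function from sensor readings to a desired 2D velocity, parallel combination is modeled as a vector sum, and sequencing as a mutually exclusive switch. The sensor keys, the summation rule and the names `combine` and `sequence` are hypothetical and are not the control systems used on the actual robots.

```python
from typing import Callable, Dict, List, Tuple

Vec = Tuple[float, float]
Sensors = Dict[str, Vec]             # hypothetical readings, as 2D vectors
Behavior = Callable[[Sensors], Vec]  # readings -> desired velocity

def avoidance(s: Sensors) -> Vec:
    """Move directly away from the nearest obstacle."""
    ox, oy = s["nearest_obstacle"]
    return (-ox, -oy)

def homing(s: Sensors) -> Vec:
    """Move toward the home location."""
    return s["home"]

def combine(behaviors: List[Behavior]) -> Behavior:
    """Parallel combination: run all behaviors and sum their outputs."""
    def combined(s: Sensors) -> Vec:
        outs = [b(s) for b in behaviors]
        return (sum(o[0] for o in outs), sum(o[1] for o in outs))
    return combined

def sequence(condition: Callable[[Sensors], bool],
             first: Behavior, second: Behavior) -> Behavior:
    """Mutually exclusive sequencing: 'first' runs while the condition
    holds, otherwise 'second' runs."""
    return lambda s: first(s) if condition(s) else second(s)

readings = {"nearest_obstacle": (1.0, 0.0), "home": (3.0, 4.0)}
safe_homing = combine([avoidance, homing])
velocity = safe_homing(readings)  # (2.0, 4.0)
```

The point of the sketch is only that `safe_homing` can be used by higher levels without knowing how `avoidance` or `homing` compute their outputs.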

Randal Nelson

In the broadest sense, when we talk about action, we imply that we are interested not only in what IS in the world, but in what is HAPPENING there. Some notion of time is consequently implicit. Most of the action addressed by robot and machine perception researchers has involved physical movement or change (as opposed, e.g., to geopolitical events or stock market activities, though these certainly might be profitably studied).

A lot of work also involves some notion of actors, purpose, and intention, though this is not universal.

Otherwise, the word action has come up in a lot of different contexts, and rather than attempt to exclude some areas by definition, it would seem more profitable to attempt some rough classification of the sorts of issues that have fallen under the rubric.

So here goes, more or less in order of complexity:

1) Happenings:

Sort of the lowest level; simple events that do not (apparently) involve actors, purpose, or intention, but produce perceptible movement/change in the world.

Examples might be trees blowing in the wind, snowfall, lightning, etc. There are also complex events with structure but no assignable purpose or actors (e.g. thunderstorms, earthquakes). These seem naturally to fall under more structured semantic categories even if they can't be assigned a purpose.

2) Primitive actions/activities:

Basically simple patterns executed by some actor that stand a fair chance of being recognizable independent of context. Examples might include walking, running (people or animals), sitting down, standing up, (sitting still??) etc.

3) Simple actions in context:

These are actions that need some sort of local contextual knowledge to make sense, but which are fairly stereotyped within the context; for example, opening a door (of a building or car), throwing a ball, picking up/leaving an object, entering or exiting a building, meeting another person, walking a dog, crossing a street etc.

4) Purposive action in complex context:

These are actions that involve considerable complexity, and are defined by a more sophisticated context; examples might include shoplifting (or shopping), yardwork, doing dishes, stalking, attacking, retreating, convoying, playing baseball etc.

5) Complex, multi-part:

At the end of this scale there seems to be a set of very complex, long-term, organized activities, such as building airfields or roads, which involve a structured combination of lots of components. There are people who are interested in identifying such goings on.

6) Communicative actions:

Finally, along another dimension, there are what might be termed communicative actions - those designed deliberately to actively convey information (not just passively so). Examples include talking, gestures, sign-language, (lip reading?) etc.

Tom Olson

What do we mean by action? AND/OR How do we represent action?

(I have a little trouble keeping these two questions separate. Clearly they're not the same, but what you think about the first surely affects what you think about the second. Here I'm basically making the case for abstract logical representations, while also pointing out some of their limitations. You could put it in either section, or split it.)

In classical planning research, an action is a state change in the world resulting (or potentially resulting) from a decision taken by an agent. It is characterized by a mapping from the world states in which it can occur to the set of world states that may result from it. This definition has many problems, but it remains useful both for planning and for action recognition.

As an example, consider the action "depositing an object". The classical STRIPS definition is something like

BEFORE: there is an agent and an object, and the agent is holding the object

AFTER: there is an agent and an object, and the agent is not holding the object.

A definition used for planning would probably add other predicates defining the effect of the action on the agent's manipulator and the location where the object is deposited. However, even the simple definition above can be used for recognition. The recognizer must be able to detect the presence of agents and objects, and to determine whether or not an agent is holding a particular object. In a static environment, change detection can be used to identify objects, and the 'not holding' predicate can be asserted whenever the agent is physically separated from the object by a sufficient distance.

The advantage of this type of state-based description is that it is extremely abstract: it makes no explicit reference to appearance, and specifies only those aspects of the world state that are important to the definition of the action. This makes the definition independent of viewing parameters, as well as of the particular set of subactions used to perform the top-level action. A hierarchy of actions can be induced by adding additional predicates to the definition. For example, "throwing an object" is an instance of "depositing an object" that adds a constraint on the velocity of the object at the moment of separation from the agent. "Jaywalking" is an instance of "crossing the street" that adds constraints on location of the agent and/or state of the crossing signals.
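The state-based definitions above can be sketched as predicates over a before state and an after state. This is an illustrative sketch only: the `WorldState` fields, the distance and speed thresholds, and the function names are assumptions, and a real recognizer would have to estimate these quantities from images.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldState:
    agent_object_distance: float  # separation between agent and object
    object_speed: float           # object speed (hypothetical units)

def holding(state: WorldState, max_distance: float = 0.1) -> bool:
    """Assumed proxy: the agent holds the object unless they are
    physically separated by more than max_distance."""
    return state.agent_object_distance <= max_distance

def is_deposit(before: WorldState, after: WorldState) -> bool:
    """BEFORE: the agent is holding the object. AFTER: it is not."""
    return holding(before) and not holding(after)

def is_throw(before: WorldState, after: WorldState,
             min_speed: float = 1.0) -> bool:
    """'Throwing' as 'depositing' plus a constraint on object speed at
    separation (min_speed is an arbitrary illustrative threshold)."""
    return is_deposit(before, after) and after.object_speed >= min_speed

held = WorldState(agent_object_distance=0.0, object_speed=0.0)
set_down = WorldState(agent_object_distance=0.5, object_speed=0.0)
released_fast = WorldState(agent_object_distance=0.5, object_speed=3.0)

deposit_only = is_deposit(held, set_down) and not is_throw(held, set_down)
thrown = is_throw(held, released_fast)
```

Because `is_throw` is defined by adding one predicate to `is_deposit`, every throw is also recognized as a deposit, which mirrors the hierarchy described above.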

Abstract action definitions do have important limitations. Limits on their representational power have been extensively discussed in the planning literature. Their use in action recognition raises other concerns. These include:

1) Observability of the predicates. Definitions for planning purposes rightly stress the effects of an action on the world state. These effects may not be the most salient and recognizable features of the action from a perceptual point of view. This is particularly true of actions whose primary purpose is communication; all of the critical state changes occur inside the heads of the agents involved.

2) Parsimony. The definitions of certain actions (e.g., human body actions) may depend on long sequences of relatively complex states. Fully specifying these states may be extremely onerous, and is unnecessary if the actions can be recognized by some sort of signature that can be extracted from the image sequence.

James Rehg

It's useful to divide the action recognition problem into two levels, pattern and concept (for want of better terms). By pattern I mean the low-level activity of recognizing a specific motion pattern. The common notion of gesture recognition is an example, in which a specific set of hand motion patterns is modeled (either explicitly or by observation) and later recognized. This parallels the current object recognition paradigm of discriminating between specific instances of objects described by CAD- or appearance-type models. Action recognition at the concept level gets at higher level concepts, detecting "muggings" for example. Concept actions are analogous to the object recognition notion of a chair as a class of object defined by some functional (or other) criteria.

While concept actions pose the most challenging problems, it seems clear that action patterns will play an important role, and will probably be the first to have a commercial impact. For example, working systems in which simple gesture recognition plays an important role already exist. Action pattern recognition may be an easier problem than the specific 3D object recognition problem. Sets of interesting action patterns tend to be smaller and more distinct than sets of 3D objects. Consider two classes of action patterns: hand gestures and tank formations. It's highly unlikely that a single recognition system would be required to discriminate between them. Moreover, within each class the number of distinct patterns that one cares about is certainly smaller than the tens of thousands of objects in an average American supermarket.

There seem to be at least three different factors involved in recognizing concept actions. The first is context, which acts to limit the range of possible interpretations. For example, a person standing in a kitchen and moving their hands over a bowl is quite likely to be mixing something. The second factor is pattern action recognition. Specific hand motions, such as rotating the hand in a circle about the wrist, are associated with the concept of mixing. Observing these patterns provides evidence for a particular concept. Note however that action patterns are likely to be ambiguous in isolation. For example, the same rotating hand motion is also used in spinning a lariat. The third factor is process, the notion of causality or qualitative physics. If I had never seen a mixing machine before, I could still deduce that it produces a mixing action from the combination of the kitchen context and the observation of the action of the beaters on the ingredients. I will return to the notion of process below.

Whitman Richards

Most actions do not appear alone in isolation, but rather as part of a sequence of events. These actions are then constrained in part by the nature of the sequence, because the observer's explanation for the sequence often takes the form of a story. The Heider and Simmel (1944) movie is a compelling example. Although very simple shapes are engaged in the actions, most people assign complex roles and even gender to the circle or polygons. These high level cognitive conclusions can be seen as having the same belief maximization structure as that which occurs at lower perceptual levels, such as when we interpret simple cartoon-like drawings.

Mubarak Shah

Webster's dictionary defines action: the doing of something; state of being in motion; the way of moving organs of the body; the moving of parts: guns, piano; military combat; appearance of animation in a painting, sculpture, etc. More or less, hand gestures, sign language, facial expressions, lip movement during speech, human activities like walking, running, jumping, jogging, etc., and aerobic exercises are all actions. Consider a typical office scene: at a given time a person can be performing any one of the following actions: reading, writing, talking to other people, working on a computer, talking on a phone, leaving or entering the office.

Actions can be classified into three categories: events, temporal textures, and activities. Motion events do not exhibit temporal or spatial repetition. Events can be low level descriptions like a sudden change of direction, a stop, or a pause, which can provide important clues to the type of object and its motion. Or they can be high level descriptions like opening a door, starting a car, throwing a ball, or more abstractly pick up, put down, push, pull, drop, throw, etc. Motion verbs can also be associated with motion events. For example, motion verbs can be used to characterize trajectories of moving vehicles, or normal or abnormal behavior of the heart's left ventricular motion.

Temporal textures exhibit statistical regularity but are of indeterminate spatial and temporal extent. Examples include ripples on water, the wind in the leaves of trees, or a cloth waving in the wind. Activities consist of motion patterns that are temporally periodic and possess compact spatial structure. Examples include walking, running, jumping, etc.
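The three categories can be read as a simple decision rule over a few motion properties. The sketch below assumes those properties have already been measured and are given as booleans; the function name and the order of the tests are choices made for illustration, not part of the classification itself.

```python
def classify_motion(temporally_periodic: bool,
                    spatially_compact: bool,
                    statistically_regular: bool) -> str:
    """Assign a motion pattern to one of the three categories, given
    boolean properties assumed to be measured elsewhere."""
    if temporally_periodic and spatially_compact:
        return "activity"          # e.g. walking, running, jumping
    if statistically_regular:
        return "temporal texture"  # e.g. ripples on water
    return "event"                 # e.g. a sudden stop, opening a door

walking = classify_motion(True, True, True)
ripples = classify_motion(False, False, True)
sudden_stop = classify_motion(False, True, False)
```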

Actions like "chopping" or "leaving a suspicious package at the airport" have been recognized using standard computer vision methodologies. I feel we can recognize many actions without AI/reasoning/semantics. Therefore, I suggest that we maintain the separation from AI for the time being.
