Question #1 - What is action?
1) What do we mean by action? Using some of my own
work as a source of confusion, "action" has been used to refer
to everything from a simple sitting movement to the act of mixing ingredients
in a bowl. Clearly, the higher the level we address, the more diverse and
complicated the information required to make the assertion. While
recent computer vision work has attempted to maintain as much distance as
possible from AI/semantics/reasoning, it seems difficult to maintain that
separation if we are going to generate high-level labels like "chopping".
Before defining action, we should consider why such a definition is necessary, in what context it will be used, and for what purposes it must be useful or effective.
**(Our Standing Point)**
An autonomous agent is regarded as a system that has a complex, ongoing interaction with a dynamic environment whose changes are difficult to predict. Our final goal, in designing and building an autonomous agent with vision-based learning capabilities, is to have it perform a variety of tasks adequately in a complex environment. In order to build such an agent, we have to make clear the interaction between the agent and its environment.
The ultimate goal of our research is to design the fundamental internal structure of physical entities with bodies (robots), from which complex behaviors can emerge through interaction with their environments. For intelligent behaviors to emerge, physical bodies play an important role in bringing the system into meaningful interaction with the physical environment: complex and uncertain, but with an automatically consistent set of natural constraints. This facilitates correct agent design, learning from the environment, and rich, meaningful agent interaction. The meaning of "having a physical body" can be summarized as follows:
1) Sensing and acting capabilities are not separable, but tightly coupled.
2) In order to accomplish the given tasks, the sensor and actuator spaces should be abstracted under resource-bounded conditions (memory, processing power, controllers, etc.).
3) The abstraction depends on both the fundamental embodiments inside the agents and the experiences (interactions with their environments).
4) The consequence of the abstraction is an agent-based, subjective representation of the environment, and it can be evaluated only through the consequences of behaviors.
5) In the real world, both inter-agent and agent-environment interactions are asynchronous, parallel and arbitrarily complex. There is no justification for adopting a particular top-down abstraction level for simulation such as a global clock 'tick', observable information about other agents, modes of interaction among agents, or even physical phenomena like slippage, as any seemingly insignificant parameter can sometimes take over and affect the global multi-agent behavior.
6) The natural complexity of physical interaction automatically generates reliable sample distributions of input data for learning, unlike the a priori Gaussian distributions used in simulations, which do not always capture the correct distribution.
Our design principles are:
1) the design of the internal structure of an agent whose physical body enables it to interact with its environment, and
2) a policy for providing the agent with tasks, situations, and environments
that develop that internal structure.
An action is any process performed by (at least) one mobile object. A
mobile object can be, for example, a human, a group of humans, a vehicle
or a robot. At a higher level, actions are used to understand the behaviors
of mobile objects. An action is perceived through the computation of properties
(and property evolutions) relative to the moving regions corresponding to
the involved mobile objects. "The man has stayed in the dangerous
zone for more than three minutes", "the man size shrinks"
and "the group of men scatters" are examples of properties
(or property evolutions). The property relative to an action is thus not
necessarily dynamic. Properties are computed by image-processing routines,
whereas actions are described in natural language. Therefore, work on describing
actions and behaviors in natural language is useful; I have drawn on several
works from the semantics community. The next issue
is how to bridge the gap between numerical properties (computed by image-processing
routines) and symbolic information (used by the human operator).
To me, actions are simply objects in the spatio-temporal domain.
People's faces have attracted the most attention, and the greatest successes
have involved the application of "appearance-models" to problems
such as face detection, estimation of head orientation and face recognition.
This has led many researchers, ourselves included, to be tempted to apply
similar techniques to the representation and recognition of human "activities."
So, for example, Black and Yacoob employ appearance models of facial feature
deformation to recognize facial expressions from video; Pentland and Starner
employ them for reading American Sign Language; and Bobick and Intille introduced
motion energy templates for recognizing simple actions such as sitting,
kneeling, etc. There is also, of course, a strong and vocal community in
human perception research that argues in favor of appearance-based methods
for human visual perception.
An action is a causal process starting with the intention of the actor, mediated by bodily motion, and leading to an effect (change or invariance) in the environment [Dretske88]. The "observable portion" of an action is the portion of the above process starting from the bodily motion and ending at the effect in the environment [Kuniyoshi93]. If the effect is a purely physical phenomenon, the action is called physical. If the process involves the mental process of another agent, the action is called communication. As long as the above structure holds, the constituent movements and environmental changes can take various forms: they can be continuous and parallel, repetitive or one-shot, and the effect can be on the agent's own body.
The distinction between actions and motions is important. Actions must involve the subject of motion, the target of the action, and the causality that connects them. Motions involve only the subject of motion and its movement.
The deeper reason for adopting the above definition, in contrast with a more conventional, computer-vision-oriented one focusing only on motion features, is the following:
By 'action' we assign certain meanings to the observed phenomena. This meaning is the only source of discriminating different actions. Since actions are qualitative by nature, as discussed in the answer to the next question, there are no objective motion-parametric discrimination rules. In the definition above, the 'meaning' is captured by 'causality'. Actually the above discussion is not complete because there is nowhere to anchor the meaning. This will only be possible when the problem is placed in an inter-agent situation, e.g. Agent P assigns Meaning M to the action A of Agent Q, where M itself will be explained as a product of interactions between the agents (see later discussions in this paper).
So far we have described work that infers so called "force-dynamic'' descriptions from observations of image sequences. In order to describe action, however, we need a way to relate natural categories of events such as "lift'', "drop'', "throw'', etc. to the underlying force-dynamic representation.
To address this problem we could describe such events as sequences of qualitative dynamic states involving the participant objects. For example, as described in (Siskind, 1992), we may describe an event such as "A drops B'' as a concatenation of two adjacent time intervals: an interval where the object B is at rest and stably supported by object A, followed by an interval where object B is falling.
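This interval-concatenation idea can be sketched in code. The predicate names and their encoding below are illustrative placeholders, not Siskind's actual notation; a "drop" is matched whenever a supported, at-rest interval is immediately followed by an unsupported, falling one:

```python
# Hypothetical sketch: recognizing "A drops B" as two adjacent qualitative
# intervals -- (supported by A, at rest) followed by (unsupported, falling).
# The predicate names are invented placeholders for illustration.

def is_drop(intervals):
    """intervals: chronological list of dicts of qualitative predicates."""
    for first, second in zip(intervals, intervals[1:]):
        if (first["supported_by_A"] and first["at_rest"]
                and not second["supported_by_A"] and second["falling"]):
            return True
    return False

observed = [
    {"supported_by_A": True,  "at_rest": True,  "falling": False},
    {"supported_by_A": False, "at_rest": False, "falling": True},
]
```

The same two-interval template generalizes to other force-dynamic event definitions by swapping in different predicate combinations.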
A more subtle question, however, concerns how to define action itself.
In particular, what differentiates "passive'' or "natural'' events
such as a rolling or bouncing ball, from "willful'' actions such as
a person hitting, dropping or throwing an object? We believe that in order
to address these issues a more elaborate ontology will be required, perhaps
including a representation of objects as agents with internal state,
goals, and intentions.
The granularity of "action" within a system is entirely dependent
on the most suitable representation for the task, which means it is task- and
domain-specific, and as open-ended as the entire representation problem
in AI. The problem of choosing the right level of action representation
has been endemic in robot control as well as in learning; both fields tend
to use low-granularity representations for tighter control but slower processing
and less robust performance. My work is based on the behavior-based paradigm,
and uses behaviors as the basic unit of action. Rather than relying on an
ad hoc notion of behaviors, we define them to be control systems that achieve
and maintain a particular goal. For example, "wall following"
is a behavior, as is "avoidance." It is important to distinguish
externally observable behaviors from internally controllable ones. The goal
of our work is to design a small set of what we call "basis behaviors,"
which are internally controllable by the agent, and use those to generate
the desired, much larger, set of externally observable behaviors. Behaviors
in our system can be temporally combined (executed in parallel) or sequenced
(mutually exclusive). It is also important to note that behaviors are dually
constrained: from the top down by the goals of the system (e.g., to be as
high granularity as possible as long as the tasks are achievable) and from
the bottom up by the sensors, effectors, and most strongly the dynamics
(e.g., to be as low granularity as possible to make the system sensitive
and reactive). Consequently, our physical systems never have behaviors of
the type "turn left by 90 degrees" since those make little sense
in systems with dynamical actuators. Furthermore, our behaviors, in particular
the basis behaviors, are geared toward spatial interactions and social interactions
of multi-agent and multi-robot systems, so we have behaviors such as "avoidance",
"following", "aggregation", "dispersion",
and "homing", and higher-level resulting behaviors such as "flocking",
"hoarding", "foraging", "herding", and more
complex social structures such as dominance hierarchies, territorial division,
competition, and cooperation. The key point here is that the action representation
is high-enough level to hide the details of control; a "following behavior"
is a control system that encapsulates the details and can be effectively
used as a building block of higher-level behaviors. Similarly, we have used
this notion to implement effective real-time learning on multiple physical
robots. Because the systems did not rely on low-granularity action-space,
they could be more resistant to noise and errors, and make more progress
toward their different goals, which generated more frequent reinforcement
and more continuous estimation of progress, both of which accelerate learning
even in dynamic multi-robot domains.
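The basis-behavior idea above can be sketched as follows. The behavior names echo those in the text, but the vector control outputs, coordinates, and the summation rule for parallel combination are invented for illustration:

```python
# Sketch of basis behaviors as control systems whose outputs can be
# temporally combined (executed in parallel, here summed) to yield
# higher-level observable behaviors. Vectors are invented placeholders.

def avoidance(pos, obstacle):
    # steer directly away from the obstacle
    return (pos[0] - obstacle[0], pos[1] - obstacle[1])

def homing(pos, home):
    # steer directly toward the home location
    return (home[0] - pos[0], home[1] - pos[1])

def combine(*velocities):
    # parallel combination: sum the component control outputs
    return (sum(v[0] for v in velocities), sum(v[1] for v in velocities))

# a "safe homing" behavior built from the two basis behaviors
pos, home, obstacle = (0, 0), (10, 0), (2, 2)
v = combine(homing(pos, home), avoidance(pos, obstacle))
```

Sequencing (mutually exclusive execution) would instead select one behavior's output at a time based on the current goal or sensory condition.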
In the broadest sense, when we talk about action, we imply that we are interested not only in what IS in the world, but in what is HAPPENING there. Some notion of time is consequently implicit. Most of the action addressed by robot and machine perception researchers has involved physical movement or change (as opposed, e.g., to geopolitical events or stock market activities, though these certainly might be profitably studied).
A lot of work also involves some notion of actors, purpose, and intention, though this is not universal.
Otherwise, the word action has come up in a lot of different contexts, and rather than attempt to exclude some areas by definition, it would seem more profitable to attempt some rough classification of the sorts of issues that have fallen under the rubric.
So here goes, more or less in order of complexity.
1) Simple events:
These are roughly the lowest level: simple events that do not (apparently) involve actors, purpose, or intention, but produce perceptible movement/change in the world. Examples might be trees blowing in the wind, snowfall, lightning, etc. There are also complex events with structure but no assignable purpose or actors (e.g. thunderstorms, earthquakes). These seem naturally to fall under more structured semantic categories even if they can't be assigned a purpose.
2) Primitive actions/activities:
Basically simple patterns executed by some actor that stand a fair chance of being recognizable independent of context. Examples might include walking, running (people or animals), sitting down, standing up, (sitting still??) etc.
3) Simple actions in context:
These are actions that need some sort of local contextual knowledge to make sense, but which are fairly stereotyped within the context; for example, opening a door (of a building or car), throwing a ball, picking up/leaving an object, entering or exiting a building, meeting another person, walking a dog, crossing a street etc.
4) Purposive action in complex context:
These are actions that involve considerable complexity, and are defined by a more sophisticated context; examples might include shoplifting (or shopping), yardwork, doing dishes, stalking, attacking, retreating, convoying, playing baseball etc.
5) Complex, multi-part:
At the end of this scale there seems to be a set of very complex, long-term, organized activities, such as building airfields or roads, which involve a structured combination of lots of components. There are people who are interested in identifying such goings on.
6) Communicative actions:
Finally, along another dimension, there are what might be termed communicative
actions: those designed deliberately to actively convey information (not
just passively so). Examples include talking, gestures, sign language, lip movements, etc.
What do we mean by action? AND/OR How do we represent action?
(I have a little trouble keeping these two questions separate. Clearly they're not the same, but what you think about the first surely affects what you think about the second. Here I'm basically making the case for abstract logical representations, while also pointing out some of their limitations. You could put it in either section, or split it.)
In classical planning research, an action is a state change in the world resulting (or potentially resulting) from a decision taken by an agent. It is characterized by a mapping from the world states in which it can occur to the set of world states that may result from it. This definition has many problems, but it remains useful both for planning and for action recognition.
As an example, consider the action "depositing an object". The classical STRIPS definition is something like
BEFORE: there is an agent and an object, and the agent is holding the object
AFTER: there is an agent and an object, and the agent is not holding the object.
A definition used for planning would probably add other predicates defining the effect of the action on the agent's manipulator and the location where the object is deposited. However, even the simple definition above can be used for recognition. The recognizer must be able to detect the presence of agents and objects, and to determine whether or not an agent is holding a particular object. In a static environment, change detection can be used to identify objects, and the 'not holding' predicate can be asserted whenever the agent is physically separated from the object by a sufficient distance.
The advantage of this type of state-based description is that it is extremely abstract: it makes no explicit reference to appearance, and specifies only those aspects of the world state that are important to the definition of the action. This makes the definition independent of viewing parameters, as well as of the particular set of subactions used to perform the top-level action. A hierarchy of actions can be induced by adding additional predicates to the definition. For example, "throwing an object" is an instance of "depositing an object" that adds a constraint on the velocity of the object at the moment of separation from the agent. "Jaywalking" is an instance of "crossing the street" that adds constraints on location of the agent and/or state of the crossing signals.
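A minimal sketch of this state-based recognition scheme, assuming observed world states arrive as a chronological list of sets of ground predicates; the predicate tuples below are illustrative encodings, not a standard STRIPS syntax:

```python
# Sketch: an action is "recognized" when its before-predicates hold in one
# observed state and its after-predicates hold in some later state.
# Predicate encodings are invented placeholders for illustration.

DEPOSIT = {
    "before": {("holding", "agent", "object")},
    "after":  {("not_holding", "agent", "object")},
}

def recognize(action, states):
    """states: chronological list of sets of ground predicate tuples."""
    for i, state in enumerate(states):
        if action["before"] <= state:          # before-conditions satisfied
            for later in states[i + 1:]:
                if action["after"] <= later:   # after-conditions satisfied later
                    return True
    return False

observed = [
    {("holding", "agent", "object")},
    {("not_holding", "agent", "object")},
]
```

A specialization such as "throwing" would be expressed by adding predicates (e.g. a velocity constraint) to the same `before`/`after` sets, mirroring the hierarchy described above.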
Abstract action definitions do have important limitations. Limits on their representational power have been extensively discussed in the planning literature, and their use in action recognition raises additional concerns.
It's useful to divide the action recognition problem into two levels, pattern and concept (for want of better terms). By pattern I mean the low-level activity of recognizing a specific motion pattern. The common notion of gesture recognition is an example, in which a specific set of hand motion patterns is modeled (either explicitly or by observation) and later recognized. This parallels the current object recognition paradigm of discriminating between specific instances of objects described by CAD- or appearance-type models. Action recognition at the concept level gets at higher-level concepts, detecting "muggings" for example. Concept actions are analogous to the object recognition notion of a chair as a class of object defined by some functional (or other) criteria.
While concept actions pose the most challenging problems, it seems clear that action patterns will play an important role, and will probably be the first to have a commercial impact. For example, working systems in which simple gesture recognition plays an important role already exist. Action pattern recognition may be an easier problem than the specific 3D object recognition problem. Sets of interesting action patterns tend to be smaller and more distinct than sets of 3D objects. Consider two classes of action patterns: hand gestures and tank formations. It's highly unlikely that a single recognition system would be required to discriminate between them. Moreover, within each class the number of distinct patterns that one cares about is certainly smaller than the tens of thousands of objects in an average American supermarket.
There seem to be at least three different factors involved in recognizing
concept actions. The first is context, which acts to limit the range of
possible interpretations. For example, a person standing in a kitchen and
moving their hands over a bowl is quite likely to be mixing something. The
second factor is pattern action recognition. Specific hand motions, such
as rotating the hand in a circle about the wrist, are associated with the
concept of mixing. Observing these patterns provides evidence for a particular
concept. Note however that action patterns are likely to be ambiguous in
isolation. For example, the same rotating hand motion is also used in spinning
a lariat. The third factor is process, the notion of causality or qualitative
physics. If I had never seen a mixing machine before, I could still deduce
that it produces a mixing action from the combination of the kitchen context
and the observation of the action of the beaters on the ingredients. I will
return to the notion of process below.
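One hedged way to picture combining these three evidence sources is a simple weighted sum; the weights and scores below are invented placeholders, not a published model:

```python
# Sketch: combining the three evidence sources named above (context,
# motion pattern, causal process) into a score for a concept action.
# All numbers are illustrative, not derived from any real system.

def concept_score(context, pattern, process, weights=(0.4, 0.3, 0.3)):
    evidence = (context, pattern, process)
    return sum(w * e for w, e in zip(weights, evidence))

# "mixing": kitchen context + circular hand motion + beaters acting on food
mixing = concept_score(context=0.9, pattern=0.8, process=0.7)
# the same rotating hand motion out of context (e.g. spinning a lariat)
lariat = concept_score(context=0.1, pattern=0.8, process=0.0)
```

The point of the sketch is only that an ambiguous motion pattern (identical `pattern` scores) is disambiguated by the other two factors.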
Most actions do not appear alone in isolation, but rather as part of
a sequence of events. These actions are then constrained in part by the
nature of the sequence, because the observer's explanation for the sequence
often takes the form of a story. The Heider and Simmel (1944) movie is a
compelling example. Although very simple shapes are engaged in the actions,
most people assign complex roles and even gender to the circle or polygons.
These high level cognitive conclusions can be seen as having the same belief
maximization structure as that which occurs at lower perceptual levels,
such as when we interpret simple cartoon-like drawings, for example.
Webster's dictionary defines action as: the doing of something; the state of
being in motion; the way of moving organs of the body; the mechanism of moving
parts (of guns or a piano); military combat; the appearance of animation in a painting, sculpture,
etc. More or less, hand gestures, sign language, facial expressions, lip
movement during speech, human activities like walking, running, jumping,
and jogging, and aerobic exercises are all actions. Consider a typical
office scene: at a given time, a person may be performing any one of the
following actions: reading, writing, talking to other people, working on
a computer, talking on a phone, leaving or entering the office.
Actions can be classified into three categories: events, temporal textures, and activities. Motion events do not exhibit temporal or spatial repetition. Events can be low-level descriptions, like a sudden change of direction, a stop, or a pause, which can provide important clues to the type of object and its motion. Or they can be high-level descriptions, like opening a door, starting a car,
throwing a ball, or, more abstractly, pick up, put down, push, pull, drop,
throw, etc. Motion verbs can also be associated with motion events. For
example, motion verbs can be used to characterize trajectories of moving
vehicles, or normal or abnormal motion of the heart's left ventricle.
Temporal textures exhibit statistical regularity but are of indeterminate
spatial and temporal extent. Examples include ripples on water, wind
in the leaves of trees, or a cloth waving in the wind. Activities consist
of motion patterns that are temporally periodic and possess compact spatial
structure. Examples include walking, running, jumping, etc.
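As a crude illustration of separating temporally periodic activities from one-shot events, one could apply an autocorrelation peak test to a 1-D motion signal; the signals, threshold, and lag range below are invented for illustration:

```python
# Sketch: classify a 1-D motion signal as a periodic "activity" versus an
# aperiodic "event" by looking for a strong autocorrelation peak at some
# nonzero lag. Thresholds and example signals are invented placeholders.

def autocorr(signal, lag):
    n = len(signal) - lag
    mean = sum(signal) / len(signal)
    num = sum((signal[i] - mean) * (signal[i + lag] - mean) for i in range(n))
    den = sum((x - mean) ** 2 for x in signal)
    return num / den if den else 0.0

def looks_periodic(signal, min_lag=2, threshold=0.5):
    # any strong correlation peak at a nonzero lag suggests periodicity
    return any(autocorr(signal, lag) > threshold
               for lag in range(min_lag, len(signal) // 2))

walking = [0, 1, 0, -1] * 8                   # repeating gait-like pattern
single_event = [0] * 14 + [5, 5] + [0] * 16   # one sudden, isolated change
```

Temporal textures would fall between these cases: statistically regular, but without a single dominant period.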
Actions like "chopping" or "leaving a suspicious package at the airport" have been recognized using standard computer vision methodologies. I feel we can recognize many actions without AI/reasoning/semantics. Therefore, I suggest that we keep a separation from AI for the present time.