Other Positions

Other Positions - other points that people wanted to make, extracted from their responses.

Michael Black

There has been a lot of work on precisely measuring the motion of human limbs as a precursor to recognizing action. Introspection might suggest that this is a reasonable strategy but I am beginning to suspect that this level of analysis is unnecessary and possibly misguided.

I propose a move away from "estimation" to "explanation" in which the spatio-temporal variation in an image region is characterized by a simple model such as a probability, texture, or eigenspace model. The goal becomes one of explaining the image data; for example, "that region of the image has the spatio-temporal variation characteristic of walking." I see this explanation being based on approximate (or textural) descriptions rather than detailed ones based on motion. Of course, a finer estimation process could be built on top of this explanation stage.


Francois Bremond

How can the global process of image sequence interpretation be divided?

There are several hard tasks in image sequence interpretation: context acquisition, movement detection, tracking of mobile objects, computing symbolic information from numerical properties, and symbolic recognition of actions. One way to handle the related problems is to have all these tasks cooperate. So one issue is how to divide the image sequence interpretation process into several sub-tasks and how to define the connections between these sub-tasks. For example, how can an action recognition module work with a non-robust tracking module? How can the action recognition module improve the tracking module?

What are the basic properties for action recognition?

In my opinion, the basic properties used to recognize an action are:

These properties do not replace realistic interpretations; they just allow us to generate hypotheses about the action. They can be seen as clues for building the interpretation. The main point is not really what the basic properties are but rather how we can use them.


David Hogg

The role of learning

Current computer vision systems, including those concerned with action perception, are insufficiently robust and general purpose. A promising way to overcome these limitations is to improve the kinds of object (and action) models used within systems. In particular, we need to:

Unfortunately, hand-crafted models are often imprecise and are very time-consuming to produce. The answer lies in automating their acquisition. Within the context of action perception, this automation is readily achievable. An open question is whether or not it is possible to readily segment and internalise individual actions in a general domain (e.g. traffic scenes) given only a passive camera with no other form of input. Perhaps it is necessary to inject data from a different perceptual modality to provide a critical mass of raw material from which to infer the actions. For example, could a computer system infer the rules and characteristic actions of cricket/baseball from watching lots of games? If not, would it make a difference if the commentary were available?

Some of the most interesting behaving systems currently around have been produced within Alife simulations (e.g. the blockies of Karl Sims, the new game Creatures from Millennium). This approach provides an alternative way to construct intelligent entities which may well have their own emergent vision systems. However, the approach may also give some insight into ways of discovering the models underlying modes of behaviour from simply observing scenes over long periods. Such automatically acquired models could then be used within our hand-crafted vision systems.


Yasuo Kuniyoshi

Fusing perception and action --- Imitation as the fundamental mode.

The above scheme was nice in an AI context. However, I later realized that this (at least superficially) symbolic representation is no good. This becomes clear when you try to build a system which responds to your action in real time based on action perception, and also learns how to respond.

Action understanding for a real world agent is not merely a labelling problem. The sole purpose of action understanding for a real world agent is to decide what to do, and how, based on the observation of other agents' actions.

Here an agent need not 'label' an action. The only important thing is that it can make use of action perception in its own actions. If the designer of the agent classifies all the actions, labels them and writes the perception and corresponding action generation programs, the agent will be quite rigid in its ability to respond. A more interesting and flexible way would be to let the agent figure out, through learning, the correspondence between each observed action and the appropriate response to it. This is very important because only in this way can the agent ground the 'meaning' of each observed action. (Contrast this with the previous hard-coded symbolizer and action generator.)

Now, in the course of learning, not only the perception-action mapping (in classical terms, labelling) but also segmentation and attention control must be learned. It gives the system too many unknown variables.

I have a hypothesis here, that imitation is the key to solving this problem. Imitation is doing the same thing through observing the other's actions. Note that 'same' here is undefined. But due to the nature of imitation, an agent can close a loop by 'mutual imitation' (or alternatively by feedback from the other), and find a fixed point purely in behavioral terms without resorting to explicit similarity analysis. The agent can learn by trial and error to imitate the other better and better; the other agent also imitates the first, and the agents can find the common meaning there.

Since the new action 'representation' (which is not really a representation in the computational sense) is actually a perception-action unit, it works in accordance with any other 'ordinary' perception-action units within behavior contexts, and the overall action pattern of the agent will impose a certain bias on the perception part, providing a basis for context based attention control. Interestingly, in this scheme, action selection will result in attention control, and action learning can improve perception also.

Btw, a perception-action unit defines the extent/granularity of each action in this new framework. Each range is a temporal interval (or pattern) + attention (feature & space).


Maja Mataric


One might argue that vision is hard enough, so why add another hard problem (namely learning) to it and make it worse? However, thinking of the perceptual system as an inherently adaptive rather than a rigid, optimized process often leads to more flexible and possibly more efficient solutions.


Randal Nelson

One issue not mentioned in the questions is the question of sensor modalities. The presumption seems to be that the input is (more or less) video image sequences or the equivalent, but there are other modalities that have been used and might be important. Some examples are listed below.

Video image sequences:
Grayscale or color visible-light imagery taken at 30 frames/sec (more or less). At any rate, taken at a rate where motion extraction algorithms can be applied. Slight generalization includes IR, LADAR, and depth-map images.
Periodic image sequences:
Images taken at intervals where direct measurement of motion is difficult - seconds, minutes, or days apart. There are some important problems involving change detection, and interpretation (inference?) of activity in such sequences. Do we want to include them under the aegis of action perception?
Artificially enhanced sequences/ tailored environments:
For example, moving light displays, and more generally, situations where the environment has been tailored to make detection easy, e.g. by having actors wear black clothing except for faces and hands. A lot of work has been done under such conditions, and particularly in the case of deliberate communication actions, it seems reasonable to allow it.
Non-visual position sensors:
For example 6-DOF position sensors, datagloves, bodysuits etc. A lot of useful information can be acquired through such means, again, particularly in the deliberate communication regime. Do we want to include such sensors?
Audio:
Even in non-tailored environments, audio cues can be immensely helpful in determining what is HAPPENING. Car engine noise, doors slamming, sticks being stepped on, gunshots etc. Not to mention spoken language. If the general problem is to figure out what is going on, this particular sensory integration area (vision/audio) would seem to be of paramount importance.


Tom Olson

Many human actions function as non-verbal communications, performed in order to transmit information to another human. The information is usually simple; we may use posture or facial expression to indicate an affective state, or use head, eye and hand movements to draw someone's attention (perhaps covertly) to some aspect of the environment. This raises two questions. First, is there any hope of recognizing some of these subtle physical cues? (My guess is that there is little we can do in the near term). Second, if we *can't* handle these cues, are there important applications that we won't be able to handle? It seems as though there are some distinctions that will be very difficult to make: between children waving toy guns and adults waving real ones, between a rugby game and a street brawl, or between a father picking up a crying child and a kidnapper picking up a frightened one.


Mubarak Shah

Action recognition using a single camera vs multiple cameras

Most approaches to action recognition employ a single camera. Because of self-occlusion, some parts of an actor may not be visible in the image, which may leave too little information for action recognition. Another possible problem is that, because of the limited field of view of one camera, the actor may quickly move out of view.

In order to deal with these difficulties, some researchers have advocated the use of multiple cameras. However, with the introduction of additional cameras there is additional overhead, and we need to answer the following questions: How many views should be employed? At a given time, should information from all cameras be used or from only one? How to associate the image primitives among images obtained by multiple cameras? etc.

What might be a good compromise is to use a camera with automatic control for pan, tilt and zoom, such as the Sony EVI-D30. In a typical system, a condition is detected when the tracking window gets very close to the borders of the image, which may happen because of the camera's limited field of view. When this condition is detected, the camera is rotated in the appropriate direction using pan and tilt control.

Rotating the camera will not solve the occlusion problem, and it may happen that sufficient scene information for action recognition is not available. In that case, we would need to move the camera appropriately in order to bring the actor into a fronto-parallel view with respect to the camera. This transformation (rotation and translation of the camera) could easily be computed in a system which explicitly maintains the pose of the actor with respect to the camera. Moving the camera is feasible in situations where the camera is mounted on some moving platform, like an airplane, motor vehicle, or mobile robot.

This approach to camera control is essentially equivalent to a system with multiple cameras at different locations, in which, given a pose of the actor, a particular camera is polled for a corresponding view. However, in the former approach, one camera is used, and it is moved to the location where one of the cameras would have been if we were using multiple cameras. Note that it is assumed here that there are no obstacles when the camera is moved. Also, it is assumed that there is only self-occlusion.
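The border test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the pixel margin, the fixed step size and the sign convention for pan and tilt are all hypothetical, and nothing here corresponds to the actual control interface of the Sony EVI-D30.

```python
def pan_tilt_command(window, image_size, margin=20, step_deg=5.0):
    """Return a (pan, tilt) correction in degrees for a tracking window.

    window is (left, top, right, bottom) in pixels and image_size is
    (width, height). margin and step_deg are illustrative values only.
    Assumed convention: positive pan turns right, positive tilt turns up.
    """
    left, top, right, bottom = window
    width, height = image_size
    pan = 0.0
    tilt = 0.0
    if left < margin:
        pan = -step_deg            # window close to the left border
    elif right > width - margin:
        pan = step_deg             # window close to the right border
    if top < margin:
        tilt = step_deg            # window close to the top border
    elif bottom > height - margin:
        tilt = -step_deg           # window close to the bottom border
    return pan, tilt
```

A window well inside the image yields (0.0, 0.0), so the camera stays still until the actor approaches a border.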

Action recognition using 2-D vs 3-D

In the simplest case of action recognition, 2-D models and 2-D input are used. In this case, the computations are simple, but 2-D information can be ambiguous. In the most general case of action recognition, 3-D models and 3-D input are used; however, this approach is computationally expensive. Typically a 3-D body model and/or a 3-D motion model (joint angles) is computed from sequences of images. This results in a non-linear problem in a multi-dimensional space (e.g., in one case 22 dimensions). The work of the last three decades on the so-called structure from motion problem in computer vision has demonstrated the difficulty of obtaining a solution to the non-linear problem using real sequences.

In between is the approach in which 3-D models and 2-D input are used. In this case, the pose of the model (3-D rotation and translation) needs to be computed such that when the model is projected onto the image plane it matches the input. Here the 3-D body model and 3-D motion model of an actor are assumed to be known, which makes action recognition simpler.
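As a rough illustration of fitting a known 3-D model to 2-D input, the sketch below searches for the rotation that minimises the reprojection error. It is deliberately simplified and rests on assumptions not in the text: the model is a handful of 3-D points, only rotation about the vertical axis is searched (by brute force), the translation and focal length are taken as known, and all function names are hypothetical.

```python
import math

def project(points, yaw, translation, focal=500.0):
    """Rotate 3-D points about the vertical axis, translate them into the
    camera frame, and apply a pinhole projection (focal is an assumed value)."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    image = []
    for x, y, z in points:
        xc = c * x + s * z + tx
        yc = y + ty
        zc = -s * x + c * z + tz
        image.append((focal * xc / zc, focal * yc / zc))
    return image

def reprojection_error(points, observed, yaw, translation):
    """Sum of squared image distances between projected model and observation."""
    projected = project(points, yaw, translation)
    return sum((u - ou) ** 2 + (v - ov) ** 2
               for (u, v), (ou, ov) in zip(projected, observed))

def best_yaw(points, observed, translation, steps=360):
    """Brute-force search over candidate yaw angles (one per degree by default)."""
    candidates = [2 * math.pi * k / steps for k in range(steps)]
    return min(candidates,
               key=lambda yaw: reprojection_error(points, observed, yaw, translation))
```

A full system would search over all six pose parameters and the joint angles, typically with a non-linear optimiser rather than an exhaustive scan.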

How do we recognize action?

Hidden Markov Models (HMMs) can be used to recognize action. HMMs can be employed to build a stochastic model of a time-varying observation sequence by removing the time dependency. An HMM consists of a set of states, a set of output symbols, state transition probabilities, output symbol probabilities, and initial state probabilities. The model works as follows. Sequences are used to train the HMMs, one per action. An unknown sequence is matched against a model by calculating the probability that the HMM could generate that sequence. The HMM giving the highest probability is the one that most likely generated the sequence. HMMs are able to deal with variable length trajectories.
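The matching step can be sketched with the standard forward algorithm for a discrete HMM. Training is omitted: the two models below, their labels "walk" and "run", and all probabilities are hypothetical values chosen only to make the example self-contained.

```python
import math

def log_likelihood(obs, init, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    init[s] is the initial probability of state s, trans[r][s] the
    probability of moving from r to s, emit[s][o] the probability of
    emitting symbol o in state s.
    """
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")   # the model cannot generate this sequence
        alpha = [a / scale for a in alpha]   # rescale to avoid underflow
        log_p += math.log(scale)
    return log_p

def classify(obs, models):
    """Return the label of the model most likely to have generated obs."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Hypothetical two-state, two-symbol models (init, trans, emit) per action.
MODELS = {
    "walk": ([1.0, 0.0], [[0.5, 0.5], [0.5, 0.5]], [[0.9, 0.1], [0.8, 0.2]]),
    "run":  ([1.0, 0.0], [[0.5, 0.5], [0.5, 0.5]], [[0.1, 0.9], [0.2, 0.8]]),
}
```

Because the likelihood is accumulated one symbol at a time, sequences of different lengths can be scored against the same models, e.g. `classify([0, 0, 0, 1, 0], MODELS)` and `classify([1, 1, 0, 1], MODELS)`.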

In a model-based approach to activity recognition, a Kalman filter can be used to recognize actions (e.g., walking, running, jumping). In our approach, one Kalman filter is used for each activity. Each filter, using joint curves, predicts what the model (a 3-D body model built from cylinders) should look like at a particular time for that activity. Based upon how well this prediction matches the scene, we can determine whether a filter is allowed to continue or not. In the end, only one filter will remain, and that filter gives the correct activity. In order to deal with variable length sequences, each action is represented using joint angles as a function of cycle.
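The idea of running one filter per activity and dropping those whose predictions do not match can be sketched as follows. This is a heavily reduced illustration, not the system described above: each filter is a scalar Kalman filter tracking a single joint angle instead of a full cylinder body model, the joint curves are hypothetical sinusoids, and the noise variances and gate threshold are assumed values.

```python
import math

class ActivityFilter:
    """Scalar Kalman filter tracking one joint angle under one activity hypothesis."""

    def __init__(self, name, curve, q=1.0, r=4.0):
        self.name = name
        self.curve = curve      # hypothetical joint curve: frame index -> angle (degrees)
        self.q = q              # assumed process noise variance
        self.r = r              # assumed measurement noise variance
        self.x = curve(0)       # state estimate
        self.p = r              # variance of the estimate
        self.t = 0

    def step(self, z, gate=9.0):
        """Predict from the joint curve, test measurement z, then update.

        Returns False when the normalised squared innovation exceeds the
        gate, i.e. when the prediction does not match the scene.
        """
        x_pred = self.x + self.curve(self.t + 1) - self.curve(self.t)
        p_pred = self.p + self.q
        innovation = z - x_pred
        s = p_pred + self.r
        if innovation * innovation / s > gate:
            return False
        k = p_pred / s                        # Kalman gain
        self.x = x_pred + k * innovation
        self.p = (1.0 - k) * p_pred
        self.t += 1
        return True

def walk_curve(t):
    return 30.0 * math.sin(2 * math.pi * t / 20)   # hypothetical walking curve

def run_curve(t):
    return 60.0 * math.sin(2 * math.pi * t / 10)   # hypothetical running curve

def surviving_activities(measurements, filters):
    """Feed measurements to every filter and keep only those that stay in the gate."""
    alive = list(filters)
    for z in measurements:
        alive = [f for f in alive if f.step(z)]
    return [f.name for f in alive]
```

Feeding angles sampled from `walk_curve` to a bank containing both filters leaves only the walking filter, since the running filter's first prediction already falls far outside the gate.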

Clustering followed by classification can be used to recognize action. This is a traditional pattern recognition approach. During the training phase, the data are clustered into various classes based on the features. During the recognition phase, given a feature vector of unknown class, its class is determined by computing the distance of the feature vector from the cluster centers. This assumes that all sequences or trajectories representing a particular action are of the same length. Otherwise, they need to be warped to the same length using some kind of dynamic programming method.
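The recognition phase, combined with a dynamic programming comparison for trajectories of unequal length, might look like the sketch below. It assumes the cluster centers are already available as one reference trajectory per class, and it uses dynamic time warping as the dynamic programming method, which is one possible choice rather than the one the text prescribes. The 1-D trajectories and function names are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences,
    using the absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: stretch a, stretch b, advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def nearest_centre(trajectory, centres):
    """Return the label of the reference trajectory closest to the input."""
    return min(centres, key=lambda label: dtw_distance(trajectory, centres[label]))
```

With this distance, a slowed-down version of a reference trajectory has zero cost against it, so no explicit resampling to a common length is needed before the nearest-center test.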
