Question #3 - Relation to Human Perception
3) What can we learn from human perception of action? Is the machinery for detecting and processing motion in the periphery just for cueing foveation? When do humans recognize action without first seeing "all the pieces"? What types of action recognition tasks seem context free, which seem to create their own contexts, and which require external provision of context?
A definition of "context free" seems difficult; it depends on how large a class of tasks we consider, so "more or less context free" might make better sense. I suppose that context-dependent behaviors are acquired first, and that these behaviors are then gradually generalized in accordance with task variations. We can then regard these behaviors as "more context free," while noting that they may still be context dependent from the viewpoint of a wider class of tasks. What is more important is to specify the class of tasks that the agent can achieve, although that is not an easy "task" for the designers of the agent.
Arguably, the causal dynamic properties of objects and scenes are more informative, more universal, and more easily computed. These properties (substantiality, solidity, contiguity, contact, and conservation of momentum) are governed by simple physical laws at human scales and are thus consistent across most of visual experience. They also have very simple signatures in the visuo-temporal stream. Most importantly, since these properties are causal, we may find that a small number of qualitative rules will provide satisfactory psychological and computational accounts of much of visual understanding.
For example, there is a growing body of psychological evidence showing that infants are fluent perceivers of lawful causality and violations thereof. Spelke and Van de Walle found that nine-month-old infants can make sophisticated judgments about the causality of action, sometimes before they can reliably segment still objects from backgrounds! Yet only a handful of rules suffices to describe all the causal distinctions made by the infants.
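To make the "handful of rules" idea concrete, here is a minimal sketch of how two such qualitative rules, contiguity and substantiality, might be checked against per-frame object positions. The representations and thresholds are my own illustrative assumptions, not taken from any cited study.

```python
# Hypothetical sketch: checking two qualitative causal rules against
# per-frame 2D object positions. Thresholds are invented for illustration.

def contiguity_violations(track, max_step=5.0):
    """Flag frames where an object 'teleports': contiguity says objects
    move on connected paths, so a huge inter-frame jump is anomalous."""
    return [t for t in range(1, len(track))
            if abs(track[t][0] - track[t-1][0]) + abs(track[t][1] - track[t-1][1]) > max_step]

def substantiality_violations(track_a, track_b, min_sep=1.0):
    """Flag frames where two solid objects occupy the same place:
    substantiality says two objects cannot interpenetrate."""
    return [t for t, (a, b) in enumerate(zip(track_a, track_b))
            if abs(a[0] - b[0]) + abs(a[1] - b[1]) < min_sep]

# A smooth path raises no contiguity flags; a sudden jump does.
smooth = [(0, 0), (1, 0), (2, 0), (3, 0)]
jumpy  = [(0, 0), (1, 0), (9, 9), (3, 0)]
print(contiguity_violations(smooth))  # []
print(contiguity_violations(jumpy))   # [2, 3]
```

The point of the sketch is only that each rule is a few lines of qualitative logic over the visuo-temporal stream, yet each captures a causal distinction that infants evidently perceive.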
Humans can usually recognize an action from very little information (or from only global information about the action). This is possible only because humans possess the context of the action.
I believe this is also the case for a computer system, for any type of action; context can help in many ways. The main problem with context is its acquisition.
In a sense, in active object perception we have the potential for full access to the state of the object: given sufficient viewing time, we can scan the perceptual apparatus over an arbitrary set of locations and obtain the highest-resolution information everywhere on the object. If we are unsure about what we saw a moment before (was that the letter I or the number 1?), we can, and do, rapidly scan back to the relevant portion of the object. As many of the pioneers of active perception have pointed out, when the world is available to represent itself, the brain will likely take advantage of this fact.
But this is impossible with dynamic objects, by definition! If an active perceptual apparatus is attending to one spatial portion of the spatio-temporal signal flowing past, some information at other spatial regions will inevitably be missed, and the perceptual system will in general have no way of recovering it. Perception of action is therefore a more challenging task, or at least one more unforgiving of the errant active observer. It also makes explicit the need for internal representations, insofar as aspects of action can influence interpretation (i.e., context) across relatively long time scales. Whether there is internal representation in static perception may be open for debate, but not so in the perception of action.
I believe that these problems cannot be solved solely within an appearance-based framework by any computer, man or machine, of reasonable size and power. Instead, these considerations argue strongly for the use of symbolic, or linguistic, methods, at least at high levels of representation of human form and action. Our ability to reason about human movement, or to recognize novel activities or familiar activities viewed in novel ways, suggests that there is a constructive component to the recognition of human action. I am reminded of Fodor's interesting arguments against neural networks as a foundation for thought, on the grounds that they do not admit of recursive composition. Appearance-based models mostly ignore the deep interface between perception and language (not necessarily natural language, but some formal structure for drawing spatial inferences and planning activities in space). Our own work here uses temporal logic programming to represent and recognize people's actions and interactions, with those logic programs operating on (and controlling the construction of) a database of facts concerning the positions and trajectories of body parts in image sequences.
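To convey the flavor of this style of system (this is a minimal sketch, not the actual logic programs just described), one can imagine actions defined as temporal rules over a database of interval facts about body parts. The predicate names and the "pick-up" rule below are invented for illustration.

```python
# Hypothetical sketch: rule-based action recognition over a database of
# temporal facts (predicate, object, start_frame, end_frame). The facts
# and the 'picked_up' rule are illustrative, not from the cited system.

facts = [
    ("hand-near",   "cup", 10, 20),
    ("grasping",    "cup", 18, 40),
    ("hand-rising", "cup", 25, 40),
]

def holds(pred, obj):
    """Return the intervals during which a fact holds for an object."""
    return [(s, e) for p, o, s, e in facts if p == pred and o == obj]

def overlaps(i1, i2):
    """Allen-style test: do two intervals share any frames?"""
    return i1[0] <= i2[1] and i2[0] <= i1[1]

def picked_up(obj):
    """Rule: 'obj was picked up' if the hand approached it, a grasp
    overlapped the approach, and an upward hand motion (starting no
    earlier than the approach) overlapped the grasp."""
    return any(
        overlaps(near, grasp) and overlaps(grasp, rise) and rise[0] >= near[0]
        for near in holds("hand-near", obj)
        for grasp in holds("grasping", obj)
        for rise in holds("hand-rising", obj)
    )

print(picked_up("cup"))  # True
```

The key design point is that recognition becomes inference over symbolic, compositional structure: the same interval predicates recombine to define new activities without any new appearance model.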
D. Newtson et al. [Newtson77] carried out a series of psychological experiments on "(meaning) attribution" to human actions by human subjects. The study established several important notions: temporal segmentation exists in human cognition of actions, and humans extract important information at these time points; the unit of temporal segmentation is affected by high-level beliefs about the global context; and the segmentation point is quite stable, keeping its position in the event stream (always at the same frame in the stimulus film) even in the presence of temporal fluctuations (varied film speed, etc.). My system [Kuniyoshi93] exploited these findings: it detects temporal segmentation points using a very clear and stable event condition, and extracts key information from the images at those segmentation points.
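The event-based segmentation idea can be sketched as follows. Here a segmentation point is detected as a clear event in a stream of per-frame motion magnitudes; the specific criterion (a local minimum below a threshold, i.e., the motion nearly stopping) is my own illustrative assumption, not necessarily the condition used in [Kuniyoshi93].

```python
# Hypothetical sketch: temporal segmentation of an action stream.
# Per-frame motion magnitude is assumed given (e.g. from optic flow).
# Segmentation is tied to an event in the stream (motion nearly stopping)
# rather than to elapsed time, echoing Newtson's stability finding.

def segmentation_points(motion, threshold=0.5):
    """Return frame indices that are local minima of motion magnitude
    below `threshold`. Key information would be extracted at these frames."""
    points = []
    for t in range(1, len(motion) - 1):
        if motion[t] < threshold and motion[t-1] >= motion[t] <= motion[t+1]:
            points.append(t)
    return points

# Reach, pause, move, pause: the pauses are the segmentation points.
stream = [2.0, 1.5, 0.1, 0.2, 1.8, 2.2, 0.3, 0.4]
print(segmentation_points(stream))  # [2, 6]
```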
A hierarchical conception can be formed for actions, or rather for "motions" in my terminology [Kuniyoshi93]. Motions form a natural hierarchy defined by their temporal extents and swept spatial volumes, and this hierarchy changes over time, providing a clue to the action context. (For example, in a pick-and-place task, small finger movements usually make little sense during a large hand motion, but when the hand stays in roughly the same position, the finger movements become important.)
The fovea-periphery structure can then be used to keep track of the dynamic action context and to perceive it directly: for example, when there is no motion in the periphery but there is motion in the fovea, suspect a fine finger manipulation.
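A minimal decision rule of this kind might look like the following sketch; the categories and threshold are my own illustrative assumptions.

```python
# Hypothetical sketch: using coarse motion measurements in the fovea and
# periphery to hypothesize the current action context.

def action_hypothesis(fovea_motion, periphery_motion, eps=0.1):
    fovea_moving = fovea_motion > eps
    periphery_moving = periphery_motion > eps
    if fovea_moving and not periphery_moving:
        # Hand roughly stationary, fingers moving: fine manipulation.
        return "fine manipulation"
    if periphery_moving:
        # Large-scale motion dominates; small finger motion matters little.
        return "gross hand/arm motion"
    return "no action"

print(action_hypothesis(0.8, 0.02))  # fine manipulation
print(action_hypothesis(0.1, 1.5))   # gross hand/arm motion
print(action_hypothesis(0.0, 0.0))   # no action
```

Note how the rule encodes exactly the hierarchy observation above: the same foveal finger motion is interpreted differently depending on whether large-scale peripheral motion is present.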
Besides the multi-robot control and learning work, what brings me to this meeting is my work on modeling learning by imitation. Toward this end I have conducted three different psychophysical studies in which subjects watched short videos of human finger, hand, and arm movements with and without the use of a pointing device, and with or without other nearby objects, and attempted to imitate what they saw immediately after the completion of the short videos. In all experiments, eye-trackers were used to gather perceptual fixation data. The second and third studies also gathered subjects' movement data by placing markers on the arm they used for imitation; the second experiment used optical recording equipment (Optotrak) while the third used a 6DOF magnetic field tracker (Fastrak).
The first study conclusively showed that subjects did not uniformly visually sample the presented movements. Instead, their gaze was largely restricted to the end-point (i.e., the finger tip or the end of a pointer) throughout the presentation of the stimulus video. Although fixation behavior appeared to ignore much of the detail of the presented stimuli, the subjects were able to imitate with reasonable accuracy, indicating that movement imitation may rely on internal models of movement (possibly internal movement primitives or basis behaviors) that fill in the details of posture and movement. The second pilot study introduced equipment for tracking the subjects' imitation in order to evaluate it analytically. As in the first study, subjects were shown short videos of a moving arm; in this case some videos also featured a pointing stick and objects being pointed to. The latter were introduced to test fixation behavior relative to task specifications involving objects not connected to the arm generating the movement.
The first study also revealed an interesting and consistent interference effect: subjects told to perform simultaneous imitation, i.e., to rehearse during stimulus presentation, were unable to remember and imitate the stimulus after a delay. This effect was addressed in the third and most recent pilot study, in which subjects were in some cases asked to rehearse while watching and then imitate, and in others to just watch and then imitate. Several other controls were performed, and both eye-tracking and imitation data were collected. Furthermore, this experiment addressed the issue of coding and interference: in some trials the subjects were asked to perform deferred imitation and were given either a distractor during the delay (counting aloud backwards by two) or a memory facilitator (visualizing the behavior to be imitated). We are interested in any performance differences between the two conditions, since preliminary previous work indicates that interference with, or facilitation of, different types of memory representations (motor, visual, semantic) results in great variation in imitation performance. The results of this study should contribute to issues of the role of context and semantic bias in the perception of action.
We are also studying the role of context on the subjects' performance, in particular the effects of different instructions and types of stimulus presentation. These effects clearly make all the difference in learning by demonstration, and are effectively manipulated by good teachers. The perceptual system, in a sense, has evolved to extract some of those features automatically, and our role is to figure out what the features are so that we can better design both the stimuli and the perceptual systems.
Object recognition in human vision seems in some cases to use a view-based strategy and to be unable to use view-invariant 3D information. The perception of some actions seems no different. Sinha, Buelthoff, and Buelthoff have shown that the perception of sequences of biological motion, such as the ones developed by Johansson, is strongly view-based: motion sequences acquired from familiar viewpoints are recognized much more readily than ones from unfamiliar viewpoints. Such viewpoint dependency is more consistent with a view-based representation than with an object-centered 3D representation. Furthermore, experimental data suggest that recognition of biological motion sequences is largely unaffected by large distortions in the 3D depth structure of the moving figure, so long as its 2D appearance is not overly distorted. In fact, subjects seem to be perceptually unaware of the depth distortions, an interesting instance of top-down, recognition-based influences overwhelming the information provided by bottom-up depth extraction processes such as binocular stereo. It would be interesting to conceive of sequences where depth structure might in fact be critical for action recognition.
Several approaches to understanding sequences of humans in motion have
been based upon the idea of maximizing rigidity (motivated by the observation
that limbs are piece-wise rigid). We have developed an illusion that suggests
that such approaches may be only part of the story. The illusion consists
of a completely rigid stick figure that is recognizable as a human when
seen from the side. Upon rocking this figure around a vertical axis,
observers perceive not a rigid structure but a human walking (a non-rigid
interpretation). We interpret this illusion as demonstrating the strong
influence of recognition of, and experience with, human figures upon the
perception of motion sequences. Recognition of the figure as a human
through figural cues seems to overwhelm any bias toward rigidity.
Rigidity-based approaches, however,
might be useful when the figural cues in a sequence are greatly impoverished
and there is no alternative but to use the motion information to bootstrap
the recognition process. What relative weights govern the combination of
motion and figural cues in a sequence is currently an open question.
You'd be surprised how few sensors you need to get a good sense of the performer's motion. With Moxy, one of our early successes, we used only seven: head, hands, feet, hips, and chest. (The sensors measure position and orientation.) We usually use 11 now (elbows and knees added). More sensors can help on some characters, but the basic sense of a gesture comes across very well with just 11, much as with Johansson's point-light walkers.
That's assuming you want to capture the human body as such. With characters that are more puppet-like, sometimes you need only three or four sensors to get expressiveness. These are not attempting to reproduce human movement, of course. But you certainly can read sadness, joy, and anger, not to mention weight and physical action, in three or four sensors.