We discuss how so-called "intelligence" can emerge as a cognitive process, that is, how an agent can develop its internal representation according to the complexity of its interactions with its environment through its capabilities of sensing and acting. The complexity may be increased by the presence of other active agents, and development depends on how the agent can find a new axis in its internal representation while trying to accomplish a given task in an environment that includes other agents. As an example of such development, we present a case study of vision-based mobile robots whose task is to play a soccer game, performing subtasks such as shooting and passing a ball or avoiding an opponent, along with preliminary experiments with real robots.
For most of the past thirty years the computer vision community has focused
its attention on a world without people, making substantial progress on
problems such as recognition of rigid 3-D objects, estimation of egomotion
through (mostly) rigid scenes and understanding the physical relationships
between images, scenes, sensors and illumination. During the past five years
people have entered the picture, both complicating our lives and bringing
to our attention a new set of fundamental and applied research problems
in perception and cognitive vision.
Present research efforts in the area of machine perception of action
are not much different in their goals from the earlier works aimed at extracting
intentionality from the environment or extracting syntactic and semantic
cues from the scene. What we do have going for ourselves at present is that
computing power can support our needs to undertake robust searches through
somewhat toy domains, or allow us to model limited scenarios so that we
can just look for "stuff" and "things" that we understand
and make rule-based or probabilistic inferences from them. No real attempts
at machine perception of actions (and I specifically mean human
actions) in real domains have been made to date, and perhaps none can be made
for some time to come.
In my earlier work I have taken much more of a model-based approach.
Defining a complete spatio-temporal structure of what to observe and then
placing it within the observer-controller framework, I have succeeded in
extracting very detailed actions. Detailed models have allowed me to experiment
with probabilistic models in the space of the states that the model can
handle. I have also experimented with just using the models for energy minimization
and constraints (and not for interpretations) on the data for exploring
probabilistic data-driven models. Both these approaches, though quite limited
in their scope and domain, provide very detailed interpretations.
Perception of action shares a problem with many other areas of computer
vision; although there are clear research problems to be addressed, the
potential applications are somewhat nebulous. There are two approaches to
this dilemma --- either continue along promising lines of research in the
expectation that the solution of general problems is A Good Thing and will
be ultimately useful, or, in the words of Dilbert's Boss, "Identify
the problem, then solve it." Both approaches have in the past been carried
out simultaneously, and there is no reason to think that things should be
different for research into action. As well as an overall aim for the field,
therefore, it is worth considering directions for each of these shorter-term approaches.
I feel like a bit of an impostor at a conference on the machine recognition
of action and gesture; I'm not a computational vision scientist. I'm a "digital
puppeteer" from Protozoa, a small animation and technology company
in San Francisco. However, I have a background in perception and action
research, as well as puppetry and graphics. (I am an ex-student of Bill
Warren at Brown, and Ken Nakayama at Harvard.) So my perspective on what
I do is fairly analytical and perception-informed. I hope my observations
on what I do will be of interest in thinking about the issues of this workshop.
At Protozoa, we are constantly working with issues of how to represent gesture. Protozoa, an offshoot of Colossal Pictures, is a technology and entertainment company that specializes in "real-time character animation." To animate our characters, we wear a suit of sensors the computer can follow, allowing us to literally act out their movements (a technique also known as "motion capture"). It can be a lot of fun: imagine looking in a mirror and seeing yourself as you make different movements, except that in the "mirror" (the computer screen), you're an orange dog or a crafty monkey. We also do some characters that are far from human in shape; for instance, we've made a worm and a spider. (Few places besides Protozoa do this.) With a very human-shaped character, the technique feels almost like acting; with non-human characters it feels just like puppetry.