Luxomatic: Performance Animation with a Sock Puppet

Andy Wilson

Introduction

The Luxo Jr. graphic short is a fine demonstration of a principle that puppeteers know well: all you need is a sock puppet to give a convincing impression of human motion. For a previous class project (Natural Modes of Human Form and Motion), I considered the perceptual phenomenon of seeing human motion in the lamps in Luxo Jr. as well as the magic carpet in Disney's Aladdin. Essentially the point there was that our ability to anthropomorphize objects is living proof that we don't need a particularly complete representation of the human body to see human motion. That discussion closed with the suggestion that such an analysis might guide research in computational vision.

In this, the latest stab at my continuing deconstruction of Luxo Jr., I consider the following:

The abstract quality of human motion in the Luxo Jr. lamps is the same as that used by a puppeteer using a sock puppet to depict an animate character. The fundamentals of the animation are the position and gaze of the character's head over time.

To show this, I developed a computer vision application that enables the animation of a graphical Luxo lamp by hand movements.

Visually guided animation with hands

The use of the unadorned hand is the departure point of this work from most work in visually guided animation. I chose to use the hand for two reasons. First, much of my own research has involved the machine perception of hand gestures. Because it is highly nonrigid and has many degrees of freedom, the hand is a particularly difficult object to track and to estimate the pose of. Unless the goal is to animate a graphical hand, however, we need not compute all of the hand's joint angles. We can do interesting work without computing complete representations; this is a continuing research theme in the Vision and Modeling Group.

Second, the hand is more interesting than the human body itself as a demonstration that an abstract representation of human body motion suffices. That is, a representation of the full body could have been used, but the argument that you can produce a convincing impression of human motion from such a representation seems vacuous. Having said that, I should also note that there is a lot to learn in considering the problem of extracting the essence of motion (enough to animate a lamp) from a rather complete representation of the human body without having to be so literal as to hardwire one set (or subset) of joint angles to another.

Position and pose

Without considering motion, my guess is that a sock puppet can be well approximated by the position in space and the pose of the hand. Accordingly, Luxomatic's computer vision algorithm extracts the approximate position and pose of a hand in an image. It assumes that the hand is at a particular scale and is rather small with respect to the size of the frame. Position and pose estimation are computed simultaneously: wherever there is motion in the image, the algorithm crops a 30x30 pixel region of the image and tries to classify the pose, assuming that the cropped image is centered on the hand. In addition to the estimated pose, this test returns a rough estimate of how much the cropped image looks like any one of a set of training images of cropped hands that were collected before runtime. In Luxomatic, pose is specified by the yaw and pitch of the hand about the wrist; these are exactly the degrees of freedom present in the head of a real Luxo lamp.
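As a rough sketch of this test (not the actual Luxomatic code; see the tech report below for the real algorithm), the pose classification can be thought of as a nearest-neighbor comparison of the 30x30 crop against the training images. The function and variable names, and the normalized-correlation measure, are illustrative assumptions:

    # Minimal sketch: classify a candidate crop by nearest-neighbor match
    # against the training crops and their (yaw, pitch) labels.
    import numpy as np

    def classify_pose(crop, train_crops, train_poses):
        """crop: 30x30 grayscale patch centered on a candidate hand location.
        train_crops: (N, 30, 30) array of hand images collected before runtime.
        train_poses: (N, 2) array of (yaw, pitch) labels for those images.
        Returns the (yaw, pitch) of the best match and a similarity score."""
        x = crop.astype(float).ravel()
        x = (x - x.mean()) / (x.std() + 1e-6)          # normalize for lighting
        best_score, best_pose = -np.inf, None
        for example, pose in zip(train_crops, train_poses):
            y = example.astype(float).ravel()
            y = (y - y.mean()) / (y.std() + 1e-6)
            score = np.dot(x, y) / len(x)              # normalized correlation
            if score > best_score:
                best_score, best_pose = score, pose
        return best_pose, best_score                   # score ~ "looks like a hand"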

Here are the 100 images of the hand that were used to compute a representation of what hands look like to Luxomatic's camera:

More on the details of the vision algorithm may be found in Vismod tech report 512.

The two-dimensional position of the hand in the image is mapped to a three-dimensional position of the Luxo head by simply setting the depth to a constant value. Stereo vision techniques might be used to extract approximate depth information in the future.
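A minimal sketch of this mapping, with made-up scale and depth constants standing in for whatever values the real system uses:

    # Sketch of the image-to-world mapping with an assumed fixed depth and a
    # simple scaling from pixels to scene units (all constants illustrative).
    FIXED_DEPTH = 0.5     # depth is not estimated from the image
    SCALE = 0.005         # scene units per pixel

    def head_position(u, v, image_w=320, image_h=240):
        """Map the hand's image position (u, v) to a 3-D head position."""
        x = (u - image_w / 2) * SCALE
        y = (image_h / 2 - v) * SCALE   # image y grows downward
        return (x, y, FIXED_DEPTH)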

The position of the lamp's head in space is then used to compute the three joint angles that specify the configuration of the arm of the lamp. This computation is a simple instance of inverse kinematics, a well known problem in computer graphics.
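For concreteness, here is a minimal sketch of that inverse kinematics step, treating the arm as a base swivel plus two equal-length links; the link lengths and angle conventions are assumptions, not Luxomatic's actual model:

    # Two-link IK sketch: base swivel toward the target, then law of cosines
    # for the elbow and shoulder in the vertical plane containing the target.
    import math

    L1 = L2 = 0.25   # lower and upper arm lengths (illustrative)

    def arm_angles(x, y, z):
        """Return (swivel, shoulder, elbow) angles reaching head position (x, y, z)."""
        swivel = math.atan2(x, z)                  # rotate base toward the target
        r = math.hypot(x, z)                       # horizontal reach in the swivel plane
        d = min(math.hypot(r, y), L1 + L2 - 1e-6)  # clamp to reachable distance
        cos_elbow = (d * d - L1 * L1 - L2 * L2) / (2 * L1 * L2)
        elbow = math.acos(max(-1.0, min(1.0, cos_elbow)))
        shoulder = math.atan2(y, r) - math.atan2(L2 * math.sin(elbow),
                                                 L1 + L2 * math.cos(elbow))
        return swivel, shoulder, elbow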

Given the three joint angles of the arm and the yaw and pitch of the head, the Luxo graphic can be rendered.

System architecture

The vision algorithm currently runs on a 200 MHz SGI Indy workstation. Video images from a monochrome camera are digitized in real time by the Indy. The joint angles of the lamp, once computed, are sent via an RPC (Internet) connection to an Indigo2, which renders a scene with the Luxo lamp. The Indigo2 is equipped with a Galileo board, which supports output of the rendered image to a standard NTSC monitor.
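The per-frame message is tiny: just the lamp's five joint angles. The following fragment sketches the idea with a plain TCP socket rather than the actual RPC interface; the host name and port are made up:

    # Not the actual RPC call; just the idea of shipping the five joint
    # angles to the rendering machine once per vision frame.
    import socket, struct

    RENDERER_HOST, RENDERER_PORT = "indigo2", 5005   # illustrative address

    def send_pose(sock, swivel, shoulder, elbow, yaw, pitch):
        """Pack the lamp's joint angles as five floats and send one message."""
        sock.sendall(struct.pack("!5f", swivel, shoulder, elbow, yaw, pitch))

    # sock = socket.create_connection((RENDERER_HOST, RENDERER_PORT))
    # send_pose(sock, *angles)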

Currently, the frame rate of the system is influenced by the speed of motion of the hand in the image: the search for the hand is limited to those pixels that change substantially from one frame to the next. Thus the frame rate can vary from 8 Hz to 30 Hz depending upon the amount of motion in the scene, but is typically around 20 Hz.
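A sketch of that motion-gated search, using simple frame differencing; the change threshold and sampling step are illustrative rather than the system's actual values:

    # Restrict the hand search to pixels that changed between frames.
    import numpy as np

    MOTION_THRESHOLD = 20   # gray-level change considered "substantial"

    def candidate_centers(frame, prev_frame, step=8):
        """Return (row, col) locations with enough frame-to-frame change to
        be worth testing with the 30x30 pose classifier."""
        diff = np.abs(frame.astype(int) - prev_frame.astype(int))
        moving = diff > MOTION_THRESHOLD
        rows, cols = np.nonzero(moving[::step, ::step])   # coarse sampling
        return list(zip(rows * step, cols * step))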

So does it work?

So far, few people besides the author have had the opportunity to use Luxomatic. My experience with it suggests that there is indeed a feeling of realtime control of the lamp. With a bit of tweaking and honing of the vision algorithm, an animator may find the system an interesting way to capture expressive human motions.

The system is very responsive to changes in position. The pose information, however, tends to be a bit noisy, such that without using a Kalman filter to smooth the yaw and pitch of the head, the lamp appears to have a mild case of Tourette's syndrome. Smoothing the head joint angles has the drawback, however, of compromising the responsiveness of the head animation such that very quick motions are not always rendered.
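For reference, a minimal scalar filter of the kind used to smooth each head angle might look like the following; the noise variances shown are illustrative and would be tuned to trade smoothness against responsiveness:

    # A scalar Kalman filter (random-walk model), one instance per head angle.
    class AngleSmoother:
        def __init__(self, process_var=0.01, measurement_var=0.5):
            self.x = 0.0        # smoothed angle estimate
            self.p = 1.0        # estimate variance
            self.q = process_var
            self.r = measurement_var

        def update(self, measured_angle):
            self.p += self.q                      # predict: the angle drifts a little
            k = self.p / (self.p + self.r)        # Kalman gain
            self.x += k * (measured_angle - self.x)
            self.p *= (1.0 - k)
            return self.x

    # yaw_filter, pitch_filter = AngleSmoother(), AngleSmoother()
    # smoothed_yaw = yaw_filter.update(noisy_yaw)   # once per frame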

Most people who have seen the system in operation have missed the fact that the depth component of the head position in space is not modeled.

(6.0MB/no audio) This QuickTime 4 video shows the Luxomatic system. The actual system runs much faster than shown here. But you get the idea.

Two lamps

The system architecture is easily extended to handle two people each animating a lamp. In deference to Luxo Jr., in which a father (mother?) lamp and a kid lamp are the characters, the two-lamp Luxomatic system has one person controlling each lamp.

This screen capture shows both lamps being controlled simultaneously by two people (well, it's just me in a canned image in both, but you get the idea):

The two-lamp Luxomatic is the kind of system that might be interesting for collaborative storytelling or play. Imagine two children acting out a story with these virtual sock puppets. They might be in the same room or connected by a slim Internet connection: the communications link between the computer doing the vision and the renderer (five joint angles per lamp at roughly 20 Hz, or a few hundred bytes per second) is well within the capabilities of today's phone lines.

In such a scenario one "performer" sends what the character is doing to the other, and both render locally on their own computers to see the action. Or perhaps each performer sees only what the other character is doing, and the graphics depict the first-person view of his own character. Furthermore, the entire scene might be viewed by a third party (the "audience"), or simply recorded for later viewing.

Future work: mappings

One topic for future consideration is how the operation of mapping from one motion character (e.g. the hand) to another (the lamp) scales when there are many and perhaps very different motion characters. Consider taking another mapping step from the lamp to Aladdin's magic carpet. Could you animate the magic carpet with your hand?

Now imagine having a whole network of different motion characters. By using successive mappings to move from one representation to another, is it the case that we can sidestep the question of what is the right representation of human form (or motion)? Taken together, such a set of mappings implicitly encodes a representation that is independent of any one particular representation. This approach has consequences for dimensionality reduction, pattern recognition, and the day-to-day problems of performance animation.

For example, some motion characters may be better suited for certain computational tasks than others. A particular character may hold representations useful for gesture recognition (for example, Hidden Markov Models), or it may collapse some degrees of freedom that are irrelevant to its application, or accentuate ("caricature") others.

Another trick to explore is the interpolation of mappings. That is, given the mapping from character A to character B and the mapping of A to another character C, can a mapping from A to some mixture of B and C be computed? Characters B and C may in fact be instances of a single character (e.g. "happy" and "sad"). This idea can be extended to interpolation among more than two characters as well.
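A toy sketch of what such an interpolated mapping might look like, assuming each mapping is simply a function from hand parameters to a character's joint angles (the function names are hypothetical):

    # Blend the outputs of two mappings to animate a mixture of characters B and C.
    def blended_mapping(map_to_B, map_to_C, t):
        """Return a mapping that is a fraction t of the way from B to C (0 <= t <= 1)."""
        def mapping(hand_params):
            b = map_to_B(hand_params)
            c = map_to_C(hand_params)
            return [(1.0 - t) * bi + t * ci for bi, ci in zip(b, c)]
        return mapping

    # happy_to_sad = blended_mapping(map_to_happy, map_to_sad, 0.3)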

The Seagull is another vision-based performance animation system I've put together.


Andy Wilson, awilson@media.mit.edu