Smart Spaces
"Smart spaces" are ordinary environments equipped with
visual and audio sensing systems that can perceive and
react to people without requiring them to wear any special
equipment. The "smart desk" research project consists of a
self-calibrating visual system for 3D motion capture of a
person, a face-tracking and facial-expression analysis
system, and an audio-based expression and speech
analysis system. Using these sensing systems, prototype
applications have been developed for literal and
non-literal physical and expressive animation of virtual
characters that the user can interact with in real time.
"Smart Spaces" was a booth in
1996
Siggraph
Digital Bayou
.
The Bayou was a haven away from the vendor floor intended to
showcase interesting research in graphics and interaction.
The Smart Spaces demonstration was built on top of a diverse set of
technologies, and each application combines some subset of these tools to
solve a unique challenge. Brief illustrative sketches of several of these
components follow the list:
- Vision: Head and hands are tracked as flesh-colored blobs in real time
  using color image processing. The three-dimensional position of the head
  and hands is estimated from blob correspondences among multiple cameras;
  position, orientation, and approximate size are extracted. Blob features
  allow self-calibration of the cameras by tracking the head and hands over
  a short time period, so painstaking manual calibration is unnecessary.
- Audition: Segmentation extracts speech events from background noise, and
  prosody analysis determines the pitch, volume, and timing of speech.
  Pitch, volume, and timing are used to synthesize "wah wah" utterances by
  manipulating a sampled bugle note, giving the system the ability to mimic
  speech, and to drive other aspects of expressive behavior.
- Facial Expression: The face is tracked by a computer-controlled
  pan/tilt/zoom camera using a statistical description of skin color, and
  the mouth is detected using a learned statistical model. Face orientation
  and mouth shape are extracted, and these facial parameters control the
  expression and head orientation of an animated character.
- Gesture Recognition: The position of the hands over time is statistically
  modeled; the models are computed automatically from a set of example
  motions from several people. The input stream is recognized by computing
  a quantitative score of how similar the input is to the stored models,
  and the user then gets feedback showing how the input motion differs from
  a given model.
- Dynamic Simulation: An animated character is controlled by a dynamic
  model that reacts to several potential fields: the position of the head
  and hands, gravity, and behavioral priors. Behavioral priors let the
  animation go beyond purely physical simulation, for example producing
  correct elbow placement and, in the future, capturing more general habits
  of the user that constrain motion.
- Performance Animation: The mapping from head and hands to an animated
  character is set interactively by showing the system example
  correspondences; interpolation is then used to drive the animation. The
  mapping can also be chosen on the fly from a set of existing mappings by
  recognizing features of the head and hand motion that are consistent with
  a particular mapping.
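The flesh-colored blob tracking above (and the skin-color face tracking
under Facial Expression) can be illustrated with a simple statistical color
model: each pixel is scored by its Mahalanobis distance to a Gaussian model
of skin color in chromaticity space, and a blob is summarized by the mean
and covariance of the pixels it claims. The Python/NumPy sketch below is
only an illustration of that idea; the mean, covariance, and threshold are
placeholder values, not the calibrated model used in the actual system.

    import numpy as np

    # Hypothetical skin-color model in normalized (r, g) chromaticity space.
    # These numbers are placeholders, not the system's calibrated values.
    SKIN_MEAN = np.array([0.45, 0.31])
    SKIN_COV = np.array([[0.0020, 0.0005],
                         [0.0005, 0.0010]])
    SKIN_COV_INV = np.linalg.inv(SKIN_COV)
    THRESHOLD = 9.0   # squared Mahalanobis distance cutoff (about 3 sigma)

    def skin_mask(image):
        """Classify pixels as skin by Mahalanobis distance in chromaticity space.

        image: H x W x 3 float array of RGB values in [0, 1].
        """
        rgb_sum = image.sum(axis=2) + 1e-6
        chroma = np.stack([image[..., 0] / rgb_sum,        # normalized red
                           image[..., 1] / rgb_sum], -1)   # normalized green
        diff = chroma - SKIN_MEAN
        d2 = np.einsum('...i,ij,...j->...', diff, SKIN_COV_INV, diff)
        return d2 < THRESHOLD

    def blob_statistics(mask):
        """Summarize a binary mask as a blob: centroid and spatial covariance.

        A real tracker would first split the mask into connected components
        (head, left hand, right hand); this sketch treats it as one blob.
        """
        ys, xs = np.nonzero(mask)
        if len(xs) < 2:
            return None
        pts = np.stack([xs, ys], axis=1).astype(float)
        centroid = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False)   # orientation and approximate size
        return centroid, cov

Blob centroids like these, observed in two or more cameras over a short
period, supply the point correspondences that make the self-calibration
described above possible.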
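The prosody analysis under Audition reduces speech to pitch, volume, and
timing. A minimal sketch of one way to do that, assuming a monophonic 1-D
signal, short-time RMS energy for volume, an autocorrelation peak for
pitch, and an energy gate to segment speech from background noise; the
frame size and thresholds are illustrative assumptions, not the system's
actual parameters.

    import numpy as np

    FRAME = 1024          # samples per analysis frame (assumed)
    ENERGY_GATE = 0.02    # RMS level separating speech from background (assumed)

    def frame_pitch(frame, sample_rate, fmin=80.0, fmax=400.0):
        """Estimate pitch of one frame from the autocorrelation peak within
        the expected voice range; returns 0.0 for unvoiced frames."""
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        lo = int(sample_rate / fmax)
        hi = min(int(sample_rate / fmin), len(ac) - 1)
        if hi <= lo or ac[0] <= 0:
            return 0.0
        lag = lo + np.argmax(ac[lo:hi])
        return sample_rate / lag if ac[lag] > 0.3 * ac[0] else 0.0

    def prosody(signal, sample_rate):
        """Return (time, volume, pitch) per frame, gating out quiet
        background frames as non-speech."""
        signal = np.asarray(signal, dtype=float)
        results = []
        for start in range(0, len(signal) - FRAME, FRAME):
            frame = signal[start:start + FRAME]
            volume = float(np.sqrt(np.mean(frame ** 2)))
            if volume < ENERGY_GATE:
                continue                       # background noise: skip
            pitch = frame_pitch(frame, sample_rate)
            results.append((start / sample_rate, volume, pitch))
        return results

A stream of (time, volume, pitch) triples like this is enough to reshape a
sampled bugle note into the "wah wah" mimicry described above, or to drive
other expressive parameters.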
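The gesture recognizer compares a hand trajectory against models learned
from example motions and reports a quantitative score, plus feedback on
where the input deviates. A minimal sketch, assuming trajectories are
resampled to a fixed length and modeled with a per-frame mean and standard
deviation over the examples; the actual system's models and features are
not specified here.

    import numpy as np

    def resample(trajectory, length=50):
        """Resample a T x D trajectory of hand positions to a fixed length."""
        trajectory = np.asarray(trajectory, dtype=float)
        t_old = np.linspace(0.0, 1.0, len(trajectory))
        t_new = np.linspace(0.0, 1.0, length)
        return np.stack([np.interp(t_new, t_old, trajectory[:, d])
                         for d in range(trajectory.shape[1])], axis=1)

    def learn_model(examples, length=50):
        """Per-frame mean and spread over several people's example motions."""
        stack = np.stack([resample(e, length) for e in examples])  # N x L x D
        return stack.mean(axis=0), stack.std(axis=0) + 1e-3

    def score(model, trajectory):
        """Quantitative similarity score (mean squared z-score, lower is
        better) and per-frame deviations usable as feedback to the user."""
        mean, std = model
        x = resample(trajectory, len(mean))
        z = (x - mean) / std
        per_frame = (z ** 2).sum(axis=1)   # where the input differs from the model
        return float(per_frame.mean()), per_frame

The per-frame deviation is the kind of feedback mentioned above: it shows
the user which part of the motion differed from the stored model, as in the
T'ai Chi Teacher application below.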
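The dynamic simulation treats the character as a physical system pulled by
several potential fields: attraction toward the sensed head and hand
positions, gravity, and behavioral priors such as a preferred joint
placement. A minimal sketch for a single character point (say, an elbow)
modeled as a damped particle; the gains, time step, and rest pose are all
illustrative assumptions.

    import numpy as np

    DT = 1.0 / 30.0        # simulation step matching a 30 Hz sensing rate (assumed)
    GRAVITY = np.array([0.0, -9.8, 0.0])

    def step(pos, vel, targets, rest_pose,
             k_target=40.0, k_prior=5.0, damping=4.0):
        """Advance one particle of the character by one time step.

        targets:   sensed 3D points (head, hands) that attract the particle
        rest_pose: behavioral prior, e.g. a natural elbow position
        """
        pos = np.asarray(pos, dtype=float)
        vel = np.asarray(vel, dtype=float)
        force = GRAVITY.copy()
        for t in targets:
            force += k_target * (np.asarray(t) - pos)   # pull toward sensed blobs
        force += k_prior * (np.asarray(rest_pose) - pos)  # behavioral prior
        force -= damping * vel                           # keep the motion stable
        vel = vel + DT * force
        pos = pos + DT * vel
        return pos, vel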
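The performance-animation mapping is defined by example correspondences
between the performer's pose and the character's pose, with interpolation
in between. A minimal sketch using radial-basis-function interpolation over
the examples; the kernel, its width, and the structure of the pose vectors
are assumptions, not the system's actual representation.

    import numpy as np

    class ExampleMapping:
        """Map head-and-hands features to character parameters by interpolating
        between user-provided example correspondences."""

        def __init__(self, body_examples, character_examples, width=0.5):
            self.X = np.asarray(body_examples, dtype=float)       # N x d_in
            Y = np.asarray(character_examples, dtype=float)       # N x d_out
            self.width = width
            K = self._kernel(self.X)                               # N x N
            self.weights = np.linalg.solve(K + 1e-6 * np.eye(len(K)), Y)

        def _kernel(self, X):
            # Gaussian radial-basis kernel between poses X and the examples.
            d2 = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * self.width ** 2))

        def __call__(self, body_pose):
            """Interpolated character parameters for a new body pose."""
            k = self._kernel(np.asarray(body_pose, dtype=float)[None, :])
            return (k @ self.weights)[0]

Selecting among several stored mappings on the fly could then be done by
scoring the incoming head-and-hand motion against features associated with
each mapping, much as in the gesture-recognition sketch above.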
A large number of applications were showcased at the Digital Bayou to
demonstrate the flexibility of the underlying technologies. The applications covered a wide range of
domains, including animation by example, gesture understanding, education,
entertainment, and information retrieval:
- Whacka Game: Whack the wuggles and pop the bubbles! The puppet character
  follows your movements and keeps score. Uses vision technology. Video
  available.
- Waldorf: Waldorf mimics your movement, voice, and facial expression.
  Eerie, huh? Uses vision, facial expression, audition, and dynamic
  simulation technologies.
- Luxo Lamp: Animate the Luxo lamp. Inspired by how people gesture when
  describing actions. Uses vision and performance animation technologies.
- T'ai Chi Teacher: Watch the master as he performs t'ai chi, try the moves
  yourself, then watch as the master rates your performance. Uses vision
  and gesture recognition technologies.
- Seagull: First show the bird how you flap your wings, then take control
  and soar over the landscape. Uses vision and performance animation
  technologies.
- Netspace: Navigate a three-dimensional web-space with body movements and
  voice commands. Uses vision and speech recognition technologies.
- Text Actor: Choreograph a dynamic typographic actor with your voice. Uses
  audition and speech recognition technologies.
Christopher R. Wren,
wren@media.mit.edu
Last modified: Mon Dec 30 14:50:42 EST 1996