
Research Projects

This page contains descriptions and on-line demos of many research projects in Vismod. More information about these projects and others is available in the descriptions accessible from the people pages or from the faculty project descriptions below:

Perceptual Intelligence - Sandy Pentland
Affective Computing - Rosalind Picard
Seeing Action - Aaron Bobick

Behavior and Learning Workgroup

The Behavior and Learning Workgroup is an ongoing project at the Media Lab in which professors and students meet weekly to discuss how learning and intelligent behavior arise in animals, humans, and machines. Weekly presentations from guest speakers cover topics ranging from statistical inference to song learning in birds to interactive synthetic characters. The group then typically holds open brainstorming sessions, searching for new ideas, explanations, and practical projects.

Conversational Context

We describe one aspect of an ongoing project broadly named "The Facilitator". A facilitator or mediator plays an important role in group meetings and conversations without necessarily understanding the details of the subject at hand. Could a machine play such a role, given that it can have only a coarse idea of the actions and words of the people in a meeting? Conversational context detection is an approach that tracks the broad subject areas and the general 'roles' that people take on over the course of the meeting. It then attempts to augment the meeting in various ways, for example by asking relevant questions about the topic currently being discussed.

Dyna

Dyna is a framework for perception of human motion that explicitly acknowledges several important facts about the human form: it is a three-dimensional object, it obeys the laws of physics, and it is subject to numerous physiological and habitual constraints. Combining these facts into a recursive estimation framework provides a powerful tool for human-centered, artificial perception.
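
As a rough illustration of what "recursive estimation" means here, the sketch below runs a one-dimensional Kalman filter with a constant-velocity motion model: each step predicts the state from a simple physical prior, then corrects the prediction with a new measurement. This is not Dyna's model, which is three-dimensional and includes physiological constraints. The function names, the [position, velocity] state layout, and the noise values q and r are all assumptions chosen for the example.

```python
def kalman_step(state, cov, z, dt=1.0, q=1e-3, r=0.25):
    """One predict/update cycle for a [position, velocity] state.

    state: (pos, vel); cov: 2x2 covariance as nested lists;
    z: noisy position measurement. q and r are assumed noise levels.
    """
    pos, vel = state
    (p00, p01), (p10, p11) = cov

    # Predict: the constant-velocity prior moves the state forward.
    pos_p = pos + dt * vel
    vel_p = vel
    # Covariance prediction F P F^T + Q, with F = [[1, dt], [0, 1]], Q = q*I.
    a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11 + q

    # Update: blend the prediction with the position measurement.
    s = a00 + r                 # innovation variance
    k0, k1 = a00 / s, a10 / s   # Kalman gain for position and velocity
    resid = z - pos_p
    new_state = (pos_p + k0 * resid, vel_p + k1 * resid)
    new_cov = [[(1 - k0) * a00, (1 - k0) * a01],
               [a10 - k1 * a00, a11 - k1 * a01]]
    return new_state, new_cov


def track(measurements):
    """Filter a sequence of position measurements; return the final state."""
    state, cov = (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]
    for z in measurements:
        state, cov = kalman_step(state, cov, z)
    return state
```

Feeding track() a sequence of positions that advance by one unit per step should yield a velocity estimate close to 1, even though velocity is never measured directly. The motion model supplies that link, which is the role the physical and physiological constraints play in the description above.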

The Netrek Collective

We endeavor to build advanced interfaces for the game Netrek. We intend to build interfaces that allow high-level control over an entire Netrek team, which involves more resources than a single person could possibly control explicitly. These interfaces will employ perceptual intelligence technologies such as vision-based body and face tracking, context-embedded speech analysis, and gesture analysis.

Automatic Language Learning

We are working on a model of language acquisition which learns from natural spoken and visual input. Practical applications include adaptive speech interfaces for information browsing, assistive technologies, education, and entertainment.


DyPERS, the 'Dynamic Personal Enhanced Reality System', uses augmented reality and computer vision to autonomously retrieve 'media memories' based on associations with real objects the user encounters. These memories are evoked as audio and video clips relevant to the user and overlaid on the real objects. The system uses an adaptive, audio-visual learning system running on a tetherless wearable computer.


A (S)ilhouette-based (I)nteractive (D)ual-screen (E)nvironment. In this system, we overcome the inherent problems associated with blue-screening, while opening a new venue for multi-screen interaction. The system allows computer vision systems to robustly sense people within a multi-screen environment without requiring them to wear any special clothing or hardware.

A Virtual Personal Aerobics Instructor

Unlike workout video tapes or TV exercise shows, this system allows the user to create and personalize an aerobics session to meet the user's needs and desires. Various media technologies and computer vision algorithms are used to enhance the interaction of the character by enabling it to watch and talk to the user (instead of just the user watching the TV).

The KidsRoom

An interactive, narrative playspace using computer vision action recognition.

Recognition of Action

In this project, we analyze visual motion characteristics to classify various human activities (e.g. sitting, waving, crouching). This real-time activity detector is incorporated into a variety of interesting and fun tasks, as in "The KidsRoom" and the "Virtual Aerobics Instructor".

Smart Rooms

Smart Rooms act like invisible butlers. They have cameras, microphones, and other sensors, and use these inputs to try to interpret what people are doing in order to help them. We have already built smart rooms that can recognize who is in the room and can interpret their hand gestures, and smart car interiors that know when drivers are trying to turn, stop, pass, etc., without being told.

Smart Spaces

Smart Spaces was a demonstration in New Orleans at Siggraph 1996. It showcased many different perceptual technologies as well as a wide variety of interface applications.

Smart Desks

Smart Desks are a type of Smart Room, but specialized for the business environment. The goal of the Smart Desk project is to develop a desk that acts like a good office assistant. Such a desk should know your work habits and preferences, remember where you put things, know when you are feeling frustrated or tired, and know enough about your work to anticipate many of your needs. The Smart Desk project will try to accomplish this by using cameras, microphones, and biosensors to monitor you as you work, and active agent technology to figure out how to help.

Smart Chair

Smart chairs are intelligent devices that are aware of the user's activities (posture, movement, and sitting habits) and respond with feedback mechanisms that assist the user in a friendly manner. They may one day interface with smart rooms, smart desks, and smart clothes to form an interactive electronic environment that is friendly and useful to the user.

Smart Clothes

Smart Clothes act like human assistants. By building microprocessors, cameras, microphones, and wireless communication into clothing we can help users get around in the world. We have already built demonstrators that can help you remember people's names and help you find your way around town. We are now working with clothing designers to make these devices not only helpful but also fashionable and attractive.


Pfinder (Person Finder) is the real-time, vision-based sensor first used in the SmartRoom Project. Pfinder provides real-time human body analysis for several research projects. It has been deployed in several locations (research labs and museums alike) and has processed thousands of users.

Computers Watching Football

The goal of this project is to study the problems associated with automatic video annotation. In the first stage of this project we have developed a method for tracking football players directly from real video. In future work we will use the recovered player trajectories as input to an automatic play labeling system.


Smartcams are TV cameras which can operate without a cameraman. A Smartcam receives a request for a shot in verbal form, e.g., "Close-up guest", and, using computer vision techniques, looks for the subject in the studio and adjusts panning and zooming until an adequate framing is found. Smartcams are also a very interesting domain in which to develop concepts and architectures suitable for multiple levels of representation and use of context in computer vision problems.

Face Recognition

We have developed a fully automatic system for detection, recognition and model-based coding of faces for potential applications such as video telephony, database image compression, and automatic face recognition. The system consists of three parts: object detection and alignment (itself performed in two stages), contrast normalization, and Karhunen-Loeve (eigenspace) feature extraction, whose output is used for both recognition and coding. This leads to a compact representation of the face that can be used for both recognition and image compression. Good-quality facial images are automatically generated using approximately 100 bytes of encoded data. The system has been successfully tested on a database of nearly 2000 facial photographs from the ARPA FERET database with a detection rate of 97%. Recognition rates as high as 99% have been obtained on a subset of the FERET database consisting of 2 frontal views of 155 individuals. In September 1996, we participated in the final round of FERET tests administered by the US Army Research Laboratory, which consisted of a large gallery test containing nearly 3,800 frontal images. Our automatic system was found to be the top competitor, typically by a margin of 10% over the next best competitor.
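
The eigenspace idea can be sketched in a few lines. The example below is not the system described above: it omits detection, alignment, and contrast normalization, and it uses plain power iteration as a stand-in for a proper eigen-solver. All function names and the toy 4-pixel "images" are invented for illustration. It shows how one small set of coefficients can serve both purposes, nearest-neighbour recognition and approximate reconstruction (coding).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_eigenspace(images, k, iters=300):
    """Return (mean, basis): up to k principal directions of flattened images."""
    n, dim = len(images), len(images[0])
    mean = [sum(col) / n for col in zip(*images)]
    centered = [[x - m for x, m in zip(img, mean)] for img in images]
    basis = []
    for c in range(k):
        # Asymmetric start vector, unlikely to be orthogonal to the target.
        v = [float(j + 1 + c) for j in range(dim)]
        for _ in range(iters):
            # Deflation: remove components along directions already found.
            for b in basis:
                proj = dot(v, b)
                v = [x - proj * y for x, y in zip(v, b)]
            # Multiply by the covariance: C v = (1/n) * sum_i (x_i . v) x_i.
            w = [0.0] * dim
            for x in centered:
                s = dot(x, v) / n
                w = [wj + s * xj for wj, xj in zip(w, x)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                return mean, basis  # no variance left; fewer than k directions
            v = [x / norm for x in w]
        basis.append(v)
    return mean, basis


def encode(image, mean, basis):
    """Project an image onto the eigenspace: a few coefficients per face."""
    centered = [x - m for x, m in zip(image, mean)]
    return [dot(centered, b) for b in basis]


def decode(coeffs, mean, basis):
    """Approximate reconstruction from the coefficients (the coding use)."""
    out = list(mean)
    for c, b in zip(coeffs, basis):
        out = [o + c * x for o, x in zip(out, b)]
    return out


def recognize(image, gallery, mean, basis):
    """Label of the gallery entry nearest to the image in coefficient space."""
    probe = encode(image, mean, basis)

    def dist(item):
        return sum((p - q) ** 2 for p, q in zip(probe, item[1]))

    return min(gallery, key=dist)[0]


# Toy data: four 4-pixel "images" of two people (invented for illustration).
train = [[9, 1, 0, 0], [10, 0, 1, 0], [0, 1, 9, 0], [1, 0, 10, 0]]
labels = ["alice", "alice", "bob", "bob"]
mean, basis = fit_eigenspace(train, k=1)
gallery = [(lab, encode(img, mean, basis)) for lab, img in zip(labels, train)]
print(recognize([8, 1, 1, 0], gallery, mean, basis))
```

With this toy data a single coefficient is enough to separate the two people, so the final line prints "alice". A real face system would keep many more coefficients per image.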

Photobook and FourEyes

Photobook is an image database query environment which supports query-by-content and multiple views of the images. For example, the user can say "show me the images which look like this one" in terms of overall content; or "show me the images whose Fourier transform looks like this one's". The user can choose among a large repertoire of algorithms for such queries, including an algorithm which incorporates feedback into the other algorithms and allows them to "learn" the user's concept of image similarity. We are currently applying Photobook, under the new name "FourEyes", to interactive scene segmentation and annotation.
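
A query-by-content search of this kind reduces to ranking database entries by the distance between feature vectors, with the feature chosen by the user. The sketch below is only an illustration under simplifying assumptions: images are flat lists of gray levels, and the two features (a coarse histogram and a 1-D Fourier magnitude spectrum) and all function names are hypothetical stand-ins, not Photobook's actual algorithms or interface.

```python
import cmath


def histogram(pixels, bins=4, max_value=256):
    """Normalized gray-level histogram: a crude 'overall content' feature."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // max_value, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def fourier_magnitude(pixels):
    """Magnitudes of the 1-D discrete Fourier transform of the pixel list."""
    n = len(pixels)
    return [abs(sum(p * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, p in enumerate(pixels))) / n
            for k in range(n)]


def query(database, example, feature, top=3):
    """Names of the `top` entries whose features are closest to the example's."""
    target = feature(example)

    def distance(item):
        return sum((a - b) ** 2 for a, b in zip(feature(item[1]), target))

    return [name for name, _ in sorted(database, key=distance)[:top]]


# Toy database of 8-pixel "images" (invented for illustration).
database = [
    ("dark", [10, 20, 10, 20, 15, 12, 18, 11]),
    ("bright", [240, 250, 245, 235, 250, 248, 239, 244]),
    ("stripes", [0, 255, 0, 255, 0, 255, 0, 255]),
]
print(query(database, [12, 14, 16, 18, 20, 22, 24, 26], histogram, top=1))
print(query(database, [0, 200, 0, 200, 0, 200, 0, 200], fourier_magnitude, top=1))
```

Passing the feature function as a parameter mirrors the idea of letting the user choose among algorithms: the same example image can retrieve different results depending on which notion of similarity is selected.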

Affective Computing

Recent neurological evidence indicates that emotions are not a luxury; they are essential for "reason" to function normally, even in rational decision-making. Furthermore, emotional expression is a natural and significant part of human interaction. Whether it is used to indicate like/dislike or interest/disinterest, emotion plays a key role in multimedia information retrieval, user preference modeling, and human-computer interaction. Affective computing is a new area of research focusing on computing that relates to, arises from, or deliberately influences emotions. The focus of the present project is on giving computers the ability to recognize affect. Current applications include better learning systems (computer recognizes interest, frustration, or pleasure of pupil), and smarter "things" such as a steering wheel/seatbelt that sense when a driver is angry or incapacitated.

Speech and Audio Processing

Speech and Audio Processing encompasses a set of projects that push the bounds of adaptive communication and multimodal interfaces. Toco the Toucan, inspired by models of language acquisition and animal learning, is a system that combines speech recognition, vision, and machine learning. Audio Wearables are a class of devices that use speech and audio as their primary interface.


The FaceView project is concerned with observing, understanding, and synthesizing actions of the face and head. The current work on this project is focused on the areas of head-tracking, facial expression recognition, and non-rigid deformation of head models for animation.

"Seamless Hockney" --- painting with looks

A new technique has been developed for estimating the projective (homographic) coordinate transformation between pairs of images of a static scene, taken with a camera that is free to pan, tilt, rotate about its optical axis, and zoom. The new algorithm is applied to the task of constructing high-resolution still images from video. This approach generalizes inter-frame camera motion estimation methods which have previously used an affine model and/or which have relied upon finding points of correspondence between the image frames. The new method, which allows an image to be created by "painting with video", is used in conjunction with WearCam so that image mosaics can be generated simply by looking around, that is, by "painting with looks".
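
For readers unfamiliar with the projective model, the sketch below shows what such a coordinate transformation does to a single image point, and how transformations between successive frames can be chained to bring every frame into one reference frame for a mosaic. It does not show the estimation method described above. The function names and the nested-list matrix layout are assumptions made for the example. An affine transformation is the special case whose bottom row is [0, 0, 1], so that the divisor w is always 1.

```python
def apply_homography(h, x, y):
    """Map the point (x, y) through a 3x3 projective matrix (nested lists)."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    # Dividing by w is what distinguishes the projective from the affine case.
    return xh / w, yh / w


def compose(a, b):
    """Matrix product a*b: the transformation that applies b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Affine special case: a pure translation by (5, -2).
translation = [[1, 0, 5], [0, 1, -2], [0, 0, 1]]
# Projective case: the bottom row makes the divisor depend on x.
keystone = [[1, 0, 0], [0, 1, 0], [0.5, 0, 1]]

print(apply_homography(translation, 1, 1))
print(apply_homography(keystone, 2, 4))
print(apply_homography(compose(translation, keystone), 2, 4))
```

Composing the frame-to-frame matrices in this way is what lets a sequence of views taken while panning, tilting, and zooming be registered into a single mosaic.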


Combining mobile multimedia with wearable computing and wireless communications gives rise to a new form of connectivity, through an antenna mounted on the tallest building in the city. This connectivity extends across the MIT campus and nearby cities as well. One obvious application is a personal safety device, where connectivity through the Internet could allow friends and relatives to look out for one another, forming a "Safety Net". This "global village" would be far preferable to relying on more Orwellian methods of crime reduction such as pole-top cameras mounted throughout the city. Another application of WearCam is the Personal Visual Assistant, or visual memory prosthetic.
