Archive


This page contains links to papers and web sites authored by project members.

Each entry has a title, a summary, a list of authors, the date of publication or last update, and links to the material in various formats.

Papers

A Bayesian Computer Vision System for Modeling Human Interactions

In this paper we describe a real-time computer vision and machine learning system for modeling and recognizing human behaviors in a visual surveillance task. The system is particularly concerned with detecting when interactions between people occur and classifying the type of interaction. Examples of interesting interaction behaviors include following another person, altering one's path to meet another, and so forth. Our system combines top-down with bottom-up information in a closed feedback loop, with both components employing a statistical Bayesian approach. We propose and compare two state-based learning architectures, hidden Markov models (HMMs) and coupled hidden Markov models (CHMMs), for modeling behaviors and interactions. The CHMM model is shown to be substantially more efficient and accurate than the HMM. Finally, a synthetic agent training system is used to develop a priori models for recognizing human behaviors and interactions. We demonstrate the ability to use these a priori models to accurately classify real human behaviors and interactions with no additional tuning or training.
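
As a rough illustration of the state-based classification idea, the sketch below trains one Gaussian hidden Markov model per interaction class and labels a new trajectory by maximum likelihood. It uses the hmmlearn package and invented feature trajectories; the paper's coupled CHMM architecture is not shown, so this corresponds only to the HMM baseline.

    # Minimal sketch: one HMM per behavior class, classification by likelihood.
    # The CHMM variant compared in the paper couples two such chains and is
    # not implemented here. Features and classes are invented for the example.
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)

    def make_trajectory(speed, n=50):
        """Hypothetical 2-D feature sequence, e.g. (velocity, heading change)."""
        return rng.normal(speed, 0.1, size=(n, 2))

    # Train one HMM per interaction class on example trajectories.
    classes = {"follow": 0.5, "approach": 1.0}
    models = {}
    for name, speed in classes.items():
        X = np.vstack([make_trajectory(speed) for _ in range(10)])
        lengths = [50] * 10  # sequence boundaries within the stacked array
        m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[name] = m

    # Classify a new trajectory by maximum log-likelihood.
    test = make_trajectory(1.0)
    print(max(models, key=lambda name: models[name].score(test)))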

By:           Nuria Oliver, Barbara Rosario and Alex Pentland
Date:         January 1999
Formats:      .ps

DyPERS: Dynamic Personal Enhanced Reality System

DyPERS, 'Dynamic Personal Enhanced Reality System', is a wearable system that uses augmented reality and computer vision to autonomously retrieve 'media memories' based on associations with real objects the user encounters. These memories are audio and video clips recorded by the user and overlaid on the real objects the user looks at. The user's visual and auditory scene is stored in real time by the system (upon request) and is then associated (by user input) with a snapshot of a visual object. The object acts as a key which is detected by a real-time vision system when it is in view, triggering DyPERS to play back the appropriate audio-visual sequence. The vision system is a probabilistic algorithm capable of discriminating between hundreds of everyday objects under varying viewing conditions (lighting, pose changes, etc.). The record-and-associate paradigm of the system has many potential applications. Results of using the system in a museum tour scenario are described.
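
The record-and-associate loop can be sketched as a lookup keyed on a global image signature. The toy matcher below uses a plain color histogram compared by histogram intersection as a stand-in for the paper's probabilistic recognizer; the object names, clips, and images are invented.

    # Toy sketch of record-and-associate: each stored object is keyed by a
    # color histogram of its recorded view; a query view retrieves the
    # closest key's media clip. Not the paper's actual recognition algorithm.
    import numpy as np

    def histogram(image, bins=8):
        """Global RGB histogram of an image array shaped (H, W, 3), values 0-255."""
        h, _ = np.histogramdd(image.reshape(-1, 3),
                              bins=(bins,) * 3, range=[(0, 256)] * 3)
        return h.ravel() / h.sum()

    def match(query, database):
        """Return the stored object whose histogram intersection is largest."""
        return max(database,
                   key=lambda k: np.minimum(database[k][0], query).sum())

    rng = np.random.default_rng(1)
    # database: object name -> (histogram of the recorded key view, media clip)
    db = {
        "painting":  (histogram(rng.integers(0, 256, (64, 64, 3))), "clip_01.mov"),
        "sculpture": (histogram(rng.integers(0, 256, (64, 64, 3))), "clip_02.mov"),
    }
    query_view = rng.integers(0, 256, (64, 64, 3))
    print("play:", db[match(histogram(query_view), db)][1])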

By:           Tony Jebara, Bernt Schiele, Nuria Oliver and Alex Pentland
Date:         May 1998
Formats:      .ps

Tracking Conversational Context for Machine Mediation of Human Discourse

We describe a system that tracks conversational context using speech recognition and topic modeling. Topics are characterized by the frequency of words in each class, which lets us reliably detect, in real time, the currently active topic of a group discussion involving several individuals. One application of this 'situational awareness' is a computer that acts as a mediator of the group meeting, offering feedback and relevant questions to stimulate further conversation. The system also provides a temporal analysis of the meeting's evolution. We demonstrate this application and discuss other possible impacts of conversational situation awareness.
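
A minimal sketch of topic detection from per-class word frequencies, in the spirit of the description above (Laplace smoothing is added so unseen words do not zero out a topic); the topics and transcripts are invented, and real input would come from the speech recognizer.

    # Sketch: score each topic by the smoothed log frequency of the words in
    # an utterance; the most probable topic is the active conversational
    # context. Topics and training transcripts are hypothetical.
    from collections import Counter
    import math

    corpus = {
        "schedule": "meet monday deadline agenda meeting time slot",
        "budget":   "cost dollars budget spend funding estimate cost",
    }

    counts = {t: Counter(text.split()) for t, text in corpus.items()}
    totals = {t: sum(c.values()) for t, c in counts.items()}
    vocab = {w for c in counts.values() for w in c}

    def topic(utterance):
        """Return the topic with the highest smoothed log-frequency score."""
        def score(t):
            return sum(math.log((counts[t][w] + 1) / (totals[t] + len(vocab)))
                       for w in utterance.split())
        return max(counts, key=score)

    print(topic("what is the cost estimate"))  # expected: budget
    print(topic("can we meet monday"))         # expected: schedule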

By:           Tony Jebara, Yuri Ivanov, Ali Rahimi, and Alex Pentland
Date:         August 2000
Formats:      .ps, .pdf, HTML

Framing Through Perception

Context is an essential source of information for systems that rely on real-world inputs. However, it is frequently ignored because modeling context, by definition, requires modeling features outside the chosen domain. We model context using peripheral perception, that is, non-attentional features. This gives a natural and intuitive definition of what it means to model context. We present the results of two experiments in the domain of wearable sensors (camera and microphone).

By:           Brian Clarkson and Alex Pentland
Date:         September 10-13, 2000
Formats:      .ps, .pdf, HTML

Recognition of Visual Activities and Interactions by Stochastic Parsing

This paper describes a probabilistic syntactic approach to the detection and recognition of temporally extended activities and interactions between multiple agents. The fundamental idea is to divide the recognition problem into two levels. At the lower level, standard independent probabilistic event detectors propose candidate detections of low-level features. The outputs of these detectors provide the input stream for a stochastic context-free grammar parsing mechanism. The grammar and parser provide longer-range temporal constraints, disambiguate uncertain low-level detections, and allow the inclusion of a priori knowledge about the structure of temporal events in a given domain. We develop a real-time system and demonstrate the approach in several experiments on gesture recognition and video surveillance. In the surveillance application we show how the system correctly interprets activities of multiple, interacting objects.
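
The two-level idea can be sketched with an off-the-shelf stochastic context-free grammar parser: low-level detections become a token stream, and the grammar recovers the most probable activity parse. The grammar, event labels, and probabilities below are invented, and NLTK's ViterbiParser stands in for the paper's parsing mechanism, which additionally handles detector uncertainty (omitted here by using hard-decision tokens).

    # Sketch: a stochastic context-free grammar over a stream of low-level
    # event labels. Grammar and probabilities are invented for illustration.
    import nltk

    grammar = nltk.PCFG.fromstring("""
        INTERACTION -> APPROACH MEET DEPART [0.7] | APPROACH DEPART [0.3]
        APPROACH -> 'enter' 'walk' [0.6] | 'enter' [0.4]
        MEET -> 'stop' 'talk' [1.0]
        DEPART -> 'walk' 'exit' [1.0]
    """)

    parser = nltk.ViterbiParser(grammar)
    # Hard-decision outputs of hypothetical low-level event detectors.
    events = ['enter', 'walk', 'stop', 'talk', 'walk', 'exit']
    for tree in parser.parse(events):
        print(tree)  # most probable parse, with its probability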

By:           Yuri Ivanov and Aaron Bobick
Date:         July 2000
Appears in:   IEEE Transactions on Pattern Analysis and Machine Intelligence

Unsupervised Clustering of Ambulatory Audio and Video

A truly personal and reactive computer system should have access to the same information as its user, including the ambient sights and sounds. To this end, we have developed a system for extracting events and scenes from natural audio/visual input. We find that our system can, without any prior labeling of data, cluster the audio/visual data into events such as passing through doors and crossing the street. It also hierarchically clusters these events into scenes, yielding clusters that correlate with visiting the supermarket or walking down a busy street.
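
A minimal sketch of the two-stage idea on synthetic data: cluster short-time audio/visual features into events, then hierarchically cluster per-minute event histograms into scenes. The features and cluster counts are invented, and the off-the-shelf scikit-learn clusterers stand in for the system's actual models.

    # Sketch: instants -> events (k-means), minutes -> scenes (agglomerative).
    # Feature vectors and cluster counts are invented for the example.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    rng = np.random.default_rng(0)
    # Hypothetical 10-D audio/visual features, one row per second (10 minutes).
    features = rng.normal(size=(600, 10))

    # Stage 1: cluster individual seconds into events.
    events = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)

    # Summarize each minute by its histogram of event labels.
    event_hist = np.stack([np.bincount(m, minlength=8)
                           for m in events.reshape(10, 60)])

    # Stage 2: hierarchically cluster minutes into scenes.
    print(AgglomerativeClustering(n_clusters=3).fit_predict(event_hist))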

By:           Brian Clarkson and Alex Pentland
Date:         1999
Formats:      HTML

Recognizing User's Context from Wearable Sensors: Baseline System

We describe experiments in recognizing a person's situation from only a wearable camera and microphone. The types of situations considered in these experiments are coarse locations (such as at work, in a subway, or in a grocery store) and coarse events (such as being in a conversation or walking down a busy street) that would require only global, non-attentional features to distinguish them.
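
A toy sketch of recognition from global, non-attentional features: the mean frame color plus short-time audio energy, classified by nearest class centroid. The situations and data are invented; the actual baseline system's features and classifier differ.

    # Toy sketch: global (non-attentional) features from camera + microphone,
    # classified by nearest class centroid. Situations and data are invented.
    import numpy as np

    rng = np.random.default_rng(2)

    def global_features(image, audio):
        """Concatenate mean RGB of a frame with log energy of an audio chunk."""
        return np.concatenate([image.mean(axis=(0, 1)),
                               [np.log(np.mean(audio ** 2) + 1e-9)]])

    def sample(brightness, loudness):
        """Synthetic stand-in for one second of wearable camera/microphone data."""
        return global_features(rng.uniform(0, brightness, size=(48, 64, 3)),
                               rng.normal(0, loudness, size=8000))

    # Class centroids estimated from labeled examples (here, synthetic).
    situations = {"office": (0.8, 0.1), "subway": (0.3, 1.0)}
    centroids = {s: np.mean([sample(*p) for _ in range(20)], axis=0)
                 for s, p in situations.items()}

    x = sample(0.3, 1.0)
    print(min(centroids, key=lambda s: np.linalg.norm(centroids[s] - x)))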

By:           Brian Clarkson, Kenji Mase and Alex Pentland
Date:         March 4, 2000
Formats:      .ps, .pdf, HTML

Document Format Definitions

HTML    web page (HyperText Markup Language)
.ps     PostScript
.pdf    Portable Document Format (Adobe Acrobat)
.doc    Microsoft Word
.rtf    Rich Text Format
.zip    archive file for Windows
.tar    archive file for UNIX

Copyright 2000, MIT Media Laboratory.
For problems or questions regarding this web site, contact viswebmaster@media.mit.edu.
Last updated: September 25, 2000.