This page contains links to papers
and web sites authored by project members.
Each entry has a title, a summary, a list of authors, a publication or last-updated date, and links to the material in various formats.

A Bayesian Computer Vision System
for Modeling Human Interactions
In this paper we describe a real-time computer vision and machine learning
system for modeling and recognizing human behaviors in a visual surveillance
task. The system is particularly concerned with detecting when interactions
between people occur, and classifying the type of interaction. Examples of
interesting interaction behaviors include following another person, altering
one's path to meet another, and so forth. Our system combines top-down with
bottom-up information in a closed feedback loop, with both components
employing a statistical Bayesian approach. We propose and compare two
different state-based learning architectures, namely HMMs and CHMMs, for
modeling behaviors and interactions. The CHMM model is shown to work much more
efficiently and accurately. Finally, a synthetic agent training system is used
to develop a priori models for recognizing human behaviors and interactions.
We demonstrate the ability to use these a priori models to accurately classify
real human behaviors and interactions with no additional tuning or training.
By: Nuria Oliver, Barbara Rosario and Alex Pentland
Date: January 1999
Formats: .ps
DyPERS: Dynamic Personal Enhanced
Reality System
DyPERS, 'Dynamic Personal Enhanced Reality System', is a wearable system
which uses augmented reality and computer vision to autonomously retrieve
'media memories' based on associations with real objects the user encounters.
These are evoked as audio and video clips taken by the user and overlaid on
top of real objects the user looks at. The user's visual and auditory scene is
stored in real-time by the system (upon request) and is then associated (by
user input) with a snapshot of a visual object. The object acts as a key
which is detected by a real-time vision system when it is in view, triggering
DyPERS to play back the appropriate audio-visual sequence. The vision system
is a probabilistic algorithm which is capable of discriminating between
hundreds of everyday objects under varying viewing conditions (lighting, pose
changes, etc.). The record-and-associate paradigm of the system has many
potential applications. Results of the use of the system in a museum's tour
scenario are described.
By: Tony Jebara, Bernt Schiele, Nuria Oliver and Alex Pentland
Date: May 1998
Formats: .ps
Tracking Conversational Context for
Machine Mediation of Human Discourse
We describe a system that tracks conversational context using speech recognition and topic modeling. Topics are described by computing the
frequency of words for each class. We thus reliably detect, in real-time, the
currently active topic of a group discussion involving several individuals.
One application of this 'situational awareness' is a computer that acts as a
mediator of the group meeting, offering feedback and relevant questions to
stimulate further conversation. It also provides a temporal analysis of the
meeting's evolution. We demonstrate this application and discuss other
possible impacts of conversational situation awareness.
By: Tony Jebara, Yuri Ivanov, Ali Rahimi, and Alex Pentland
Date: August 2000
Formats: .ps, .pdf, HTML
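Illustration for the entry above (Tracking Conversational Context): the abstract says topics are described by computing the frequency of words for each class. The sketch below shows one simple way such a scheme could work, using smoothed unigram frequencies to score a window of recognized words against each topic. It is not the authors' implementation; the function names, the smoothing constant, and the toy training text are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_topic_models(labeled_docs):
    """Build one word-frequency table per topic from {topic: [text, ...]}."""
    models = {}
    for topic, texts in labeled_docs.items():
        counts = Counter()
        for text in texts:
            counts.update(text.lower().split())
        models[topic] = counts
    return models


def classify_window(models, words, smoothing=1.0):
    """Return the topic whose word frequencies best explain the given words.

    Scores are sums of log relative frequencies with additive smoothing
    (an assumed choice), so unseen words do not zero out a topic.
    """
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    best_topic, best_score = None, -math.inf
    for topic, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[w] + smoothing) / (total + smoothing * vocab_size))
            for w in words
        )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Toy training data (hypothetical), standing in for transcribed speech.
models = train_topic_models({
    "weather": ["rain snow wind forecast", "sunny cloud rain"],
    "food": ["pizza pasta salad", "bread cheese pizza"],
})
current_topic = classify_window(models, ["rain", "wind"])
```

In a live setting the `words` argument would be a sliding window over the most recent speech recognizer output, re-scored as new words arrive; the window length is a tuning choice this sketch leaves open.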
Framing through
Context is an essential line of information for systems that rely on real
world inputs. However, it is frequently ignored because modeling context by
definition requires modeling features outside of the chosen domain. We model
context by using peripheral perception, which basically means non-attentional
features. This naturally and intuitively defines what it means to model
context. We give the results of two experiments in the domain of wearable
sensors (camera and microphone).
By: Brian Clarkson and Alex Pentland
Date: September 10-13, 2000
Formats: .ps, .pdf, HTML
Recognition of Visual Activities and
Interactions by Stochastic Parsing
This paper describes a probabilistic syntactic approach to the detection and
recognition of temporally extended activities and interactions between multiple agents. The fundamental idea is to
divide the recognition problem into two levels. The lower level detections are performed using standard independent probabilistic
event detectors to propose candidate detections of low level features. The outputs of these detectors provide the input stream for a
stochastic context-free grammar parsing mechanism. The grammar and parser provide longer range temporal constraints, disambiguate
uncertain low level detections, and allow the inclusion of a priori knowledge about the structure of temporal events in a given domain.
We develop a real-time system and demonstrate the approach in several experiments on gesture
recognition and in video surveillance. In the surveillance application we show how the system
correctly interprets activities of multiple, interacting objects.
By: Yuri Ivanov and Aaron Bobick
Date: July 2000
Appears in: Transactions on Pattern Analysis and Machine Intelligence
Unsupervised Clustering of Ambulatory Audio and Video
A truly personal and reactive computer system should have access to the
same information as its user, including the ambient sights and sounds. To this
end, we have developed a system for extracting events and scenes from natural
audio/visual input. We find our system can (without any prior labeling of
data) cluster the audio/visual data into events, such as passing through doors
and crossing the street. Also, we hierarchically cluster these events into
scenes and get clusters that correlate with visiting the supermarket, or
walking down a busy street.
By: Brian Clarkson and Alex Pentland
Date: 1999
Formats: HTML
Recognizing User's Context from Wearable Sensors: Baseline System
We describe experiments in recognizing a person’s situation from only a
wearable camera and microphone. The types of situations considered in these
experiments are coarse locations (such as at work, in a subway or in a grocery
store) and coarse events (such as in a conversation or walking down a busy
street) that would require only global, non-attentional features to
distinguish them.
By: Brian Clarkson, Kenji Mase and Alex Pentland
Date: March 4, 2000
Formats: .ps, .pdf, HTML

HTML web page (HyperText Markup Language)
.ps PostScript
.doc Microsoft Word
.rtf Rich Text
.zip archive file for Windows
.tar archive file for UNIX