Adaptive Multimodal Interfaces

Deb Roy and Alex Pentland

Introduction

We are developing a human-computer interface (HCI) which combines multimodal input processing with machine learning in an interactive environment. The goal is an interface which can be taught to understand communicative primitives. Our current work focuses on understanding combinations of deictic (pointing) gestures and speech.

A central problem which we address is how a machine can learn words by interacting with a person. We are interested in learning both acoustic models of words and the semantic associations of the words for a constrained domain. Semantics are defined in terms of (possibly complex) associations between human input and appropriate machine actions.
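As a deliberately simplified illustration of what we mean by such associations, the Python sketch below pairs each learned word with an acoustic model and a machine action; all class and function names are ours, not components of the actual system.

# Minimal sketch of "semantics as associations": each learned word is paired
# with an acoustic model and a machine action. All names are illustrative only.
from typing import Callable, Dict, Tuple

Action = Callable[[], None]

class AssociativeLexicon:
    def __init__(self) -> None:
        # word label -> (acoustic model, associated machine action)
        self.entries: Dict[str, Tuple[object, Action]] = {}

    def associate(self, label: str, acoustic_model: object, action: Action) -> None:
        self.entries[label] = (acoustic_model, action)

    def act_on(self, label: str) -> None:
        _, action = self.entries[label]
        action()  # carry out the machine action the word has come to mean

# Example: the word "ball" becomes associated with highlighting a ball object.
lexicon = AssociativeLexicon()
lexicon.associate("ball", acoustic_model=None, action=lambda: print("highlight ball"))
lexicon.act_on("ball")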

A second, related problem we are studying is learning temporal relations between gestures and speech. For example, we expect to learn (possibly person-dependent) temporal relations between the apex of a deictic gesture referring to an object and the word for that object embedded in a continuous speech utterance. Learning such temporal relations can then help predict the occurrence of new words or help find salient words in a continuous speech stream.
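One simple way to capture such a relation, sketched below under the assumption that gesture apexes and word onsets have already been time-stamped (the function names are ours), is to fit a Gaussian to the observed apex-to-onset offsets and use it to score candidate words in later utterances.

# Sketch: model the time offset between a deictic gesture apex and the onset of
# the word referring to the pointed-at object as a (possibly person-dependent)
# Gaussian, then use it to pick the most likely referring word in a new
# utterance. Timestamps are assumed given; all names are illustrative.
import math
from statistics import mean, stdev
from typing import List, Tuple

def fit_offset_model(apex_times: List[float], word_onsets: List[float]) -> Tuple[float, float]:
    """Fit mean and std-dev of (word onset - gesture apex) from paired examples."""
    offsets = [w - a for a, w in zip(apex_times, word_onsets)]
    return mean(offsets), stdev(offsets)

def most_salient_word(apex_time: float, words: List[Tuple[str, float]],
                      mu: float, sigma: float) -> str:
    """Return the word whose onset is most probable under the learned offset model."""
    def score(onset: float) -> float:
        z = (onset - apex_time - mu) / sigma
        return math.exp(-0.5 * z * z)  # unnormalized Gaussian likelihood
    return max(words, key=lambda w: score(w[1]))[0]

# Training pairs: gesture apex times and onsets of the words naming the objects.
mu, sigma = fit_offset_model([1.0, 4.2, 7.5], [1.3, 4.4, 7.9])
# A new utterance: candidate (word, onset) pairs around a gesture apex at t = 10.0 s.
print(most_salient_word(10.0, [("put", 9.6), ("ball", 10.3), ("there", 11.2)], mu, sigma))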

Learning in the Interface

Traditional interfaces have hard-wired assumptions about how a person will communicate. In a typical speech recognition application the system has a preset vocabulary and a (possibly statistical) grammar. For proper operation the user must restrict what she says to the words and grammar built into the system. However, studies have shown that in practice it is difficult to predict how different users will use the available input modalities to express their intents. For example, Furnas et al. [Furnas] ran a series of experiments to see how people would assign keywords to operations in a mock interface. They conclude that:

"There is no one good access term for most objects. The idea of an "obvious", "self-evident,' or "natural" term is a myth! ... Even the best possible name is not very useful...Any keyword system capable of providing a high hit rate for unfamiliar users must let them use words of their won choice for objects."

Our conclusion is that effective interfaces need adaptive mechanisms which can learn how individuals use the available modalities to communicate.

The Current System

The interface is embodied by an animated character (Toco the Toucan) which reacts in real time to the user's actions. Toco's gaze and facial expressions give the user feedback about what Toco senses. The user sits at a desk with a large projection screen attached, which displays the animated character and objects that can be manipulated in a virtual 3-D world.

The speech processor consists of a recurrent neural network which recognizes phonemes from the speech stream. The user currently wears a head-mounted microphone for speech input. The visual processor uses color-class-based skin tracking to track the user's hands. The 3-D position of each hand is estimated by triangulation, combining the hand position estimates from two overhead cameras [Azarbayejani].
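The geometry of the triangulation step can be sketched as follows: back-project a ray from each calibrated camera through its 2-D hand estimate and take the midpoint of the segment of closest approach between the two rays. This only illustrates the principle; it is not the tracker described in [Azarbayejani].

# Sketch of two-camera triangulation: given a ray (origin + direction)
# back-projected from each camera through its 2-D hand estimate, return the
# midpoint of the segment of closest approach between the two rays.
import numpy as np

def triangulate(o1, d1, o2, d2):
    # Rays: p1(t1) = o1 + t1*d1 and p2(t2) = o2 + t2*d2, with unit directions.
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    b = d1 @ d2
    w = o1 - o2
    denom = 1.0 - b * b                      # near zero if the rays are almost parallel
    t1 = (b * (d2 @ w) - (d1 @ w)) / denom
    t2 = ((d2 @ w) - b * (d1 @ w)) / denom
    p1, p2 = o1 + t1 * d1, o2 + t2 * d2
    return (p1 + p2) / 2.0                   # estimated 3-D hand position

# Two overhead cameras whose rays both pass near a hand at roughly (0.5, 0.5, 0.8).
print(triangulate([0, 0, 2], [0.5, 0.5, -1.2], [1, 0, 2], [-0.5, 0.5, -1.2]))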

The current learning component of the system associates voice labels with objects in Toco's virtual world. The user may get Toco's attention by calling his name. When Toco hears his name (his face and "voice" show when he is attentive), he is ready to learn new words or respond to learned ones. The user can point to an object in the virtual world and name it; Toco generates an HMM for the word and associates it with the object the user points to. If Toco hears a word while the user is not pointing into the virtual world, he finds the most similar previously learned acoustic model and then gestures toward the associated object.
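The flow of this interaction might be summarized by the schematic below; hmm_train, hmm_score, gesture_toward, and the other names are placeholders rather than the system's actual components.

# Schematic of Toco's learn/respond behavior. hmm_train, hmm_score, and
# gesture_toward stand in for the real speech-modeling and animation components.
word_models = []   # list of (acoustic model, associated object) pairs

def on_utterance(audio, pointed_object, hmm_train, hmm_score, gesture_toward):
    """Handle one utterance while Toco is attentive."""
    if pointed_object is not None:
        # Teaching: the user points at an object and names it, so train an HMM
        # for the word and associate it with the pointed-at object.
        word_models.append((hmm_train(audio), pointed_object))
    elif word_models:
        # Responding: no pointing, so find the closest previously learned
        # acoustic model and gesture toward its associated object.
        model, obj = max(word_models, key=lambda pair: hmm_score(pair[0], audio))
        gesture_toward(obj)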

Future Work

We are studying protocols used to teach language skills to infants, parrots, and primates as inspiration for the training protocols which will be used to teach Toco. The goal is to make the training process robust, gradual, natural, and engaging for the user. We wish to use natural protocols, such as pointing to objects while naming them, as elements of the training interface.

Some other areas of future work include:

- Statistically modeling temporal relationships between deictic gestures and speech for spatial management tasks

- Incorporating explicit and implicit feedback into the interactive protocol between Toco and the user to facilitate learning

- Creation of a game which users can play with Toco as a vehicle for data collection

- Transition to continuous speech recognition (the current system assumes isolated words or phrases)

References

Azarbayejani, A., Wren, C., and Pentland, A. (1996). Real-time 3-D tracking of the human body (TR #374). In Proceedings of IMAGE'COM 96, Bordeaux, France, May 1996.

Furnas, G.W., Landauer, T.K., Gomez, L.M., and Dumais, S.T. (1987). The vocabulary problem in human-system communication. Communications of the Association for Computing Machinery, 30, 964-971.

