We are working on interfaces that enable people to communicate through natural modalities such as speech and gesture. The problem with previous multimodal interfaces is that users must learn which words and gestures the system can understand before using the interface. Our approach is to use statistical machine learning techniques to enable the interface to learn from the user. These ideas are demonstrated by an animated character called Toco the Toucan. Toco watches where the user points using computer vision and listens to the user's speech using a phonetic speech recognizer. As the user interacts with Toco, Toco learns the acoustic models and meanings of spoken words.
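One way the word-meaning half of this learning could work is by accumulating cross-modal co-occurrence statistics: each time the recognizer reports a word while the vision system sees the user pointing at an object, the pairing is counted, and a word's meaning is taken to be its most frequently co-occurring object. The sketch below is an illustrative assumption, not the system's actual algorithm, and all names in it are hypothetical:

```python
from collections import defaultdict

class WordMeaningLearner:
    """Illustrative sketch (not the system's actual method): learn word
    meanings from co-occurrences between recognized words and the
    objects the user points at when those words are heard."""

    def __init__(self):
        # counts[word][object] = number of times the word was heard
        # while the user pointed at that object
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, word, attended_object):
        """Record one interaction: a word heard while pointing at an object."""
        self.counts[word][attended_object] += 1

    def meaning(self, word):
        """Return the object most strongly associated with the word,
        or None if the word has never been observed."""
        objects = self.counts.get(word)
        if not objects:
            return None
        return max(objects, key=objects.get)

learner = WordMeaningLearner()
# Noisy interactions: "ball" is usually, but not always, said while
# pointing at the ball.
for obj in ["ball", "ball", "cup"]:
    learner.observe("ball", obj)
learner.observe("cup", "cup")
```

Counting co-occurrences is the simplest such estimator; a fuller treatment would weight counts by recognizer confidence and normalize them into conditional probabilities of an object given a word.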