The following list of chapters outlines this thesis. Concepts are introduced at levels ranging from the practical to the theoretical, with new contributions at each.
The Action-Reaction Learning framework is initially discussed. The approach treats present activity as an input and future activity as an output, and attempts to uncover a probabilistic mapping between them (i.e. a prediction). In particular, by learning from a series of human interactions, one can treat the past interaction of two individuals as an input and try to predict the most likely reaction of the participants. The subsystems are coarsely described, including the perceptual engine, the learning engine and the synthesis engine.
The perceptual system (a computer-vision-based tracker) is presented for acquiring real-time visual data from human users. A technique for the reliable recovery of three blobs corresponding to the head and hands is shown, as well as a simple graphical output (the synthesis engine) that generates a corresponding stick figure.
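The blob-recovery idea can be sketched in miniature: treat skin-colored pixel coordinates as samples from a three-component Gaussian mixture (head plus two hands) and run a few EM iterations to recover the blob centroids. This is an illustrative simplification, not the thesis' tracker; the color model, priors, and frame-to-frame correspondence are all omitted, and the initialization here simply spreads the guesses along the x-axis.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_blobs(points, n_iter=20):
    """EM for a 3-blob spherical Gaussian mixture over 2-D pixel coords."""
    K = 3
    # Spread initial means along x (leftmost, median, rightmost point).
    idx = np.argsort(points[:, 0])
    means = points[idx[[0, len(points) // 2, -1]]]
    var = np.full(K, points.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - means[None]) ** 2).sum(-1)   # (N, K)
        resp = pi * np.exp(-0.5 * d2 / var) / var                # unnormalized
        resp /= resp.sum(1, keepdims=True)                       # E-step
        Nk = resp.sum(0)
        means = (resp.T @ points) / Nk[:, None]                  # M-step
        var = (resp * d2).sum(0) / (2 * Nk)                      # spherical, 2-D
        pi = Nk / len(points)
    return means

# Synthetic "head" and "hands" pixel clouds around three centers.
blobs = [rng.normal(c, 2.0, size=(100, 2))
         for c in [(0.0, 0.0), (30.0, 10.0), (-30.0, 10.0)]]
means = fit_blobs(np.vstack(blobs))
print(np.sort(means[:, 0]))  # x-centroids near -30, 0, 30
```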
The output of the vision system is discussed as a time series and a variety of pre-processing steps are motivated. This forms a pre-processing engine for the machine learning stage. The representation of the time series data is discussed as well as the feature space used to describe gestures (the actions and the reactions of the human users).
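One common way to expose such a time series to a learner is to summarize a short window of past frames as a single fixed-length feature vector. The sketch below (hypothetical names and window size, not the thesis' exact representation) stacks consecutive frames of tracked coordinates into overlapping feature vectors.

```python
import numpy as np

WINDOW = 5  # number of past frames summarized per feature vector (illustrative)

def make_windows(track: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """Stack `window` consecutive frames of a (T, D) track into
    overlapping feature vectors of shape (T - window + 1, window * D)."""
    T, D = track.shape
    return np.stack([track[t:t + window].ravel()
                     for t in range(T - window + 1)])

# Example: 10 frames of 6-D data (x, y for the head and both hands).
stream = np.random.randn(10, 6)
feats = make_windows(stream)
print(feats.shape)  # (6, 30)
```

Each row of `feats` then plays the role of a short-term-memory description of recent activity, suitable as an input (action) or target (reaction) for the learning stage.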
The machine learning engine is investigated and several learning issues are brought into consideration, chief among them the relative advantages and disadvantages of generative versus discriminative learning. The chapter argues for the use of a conditional density (versus a joint density) on the actions and reactions between two human participants and also proposes a Bayesian formalism for it.
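The joint/conditional distinction can be made concrete with a toy discrete model (a made-up example, not from the thesis): a joint density p(x, y) over action x and reaction y induces a conditional via Bayes' rule, p(y|x) = p(x, y) / p(x), and it is this conditional that a prediction task actually scores.

```python
import numpy as np

# A discrete joint distribution over action x (rows) and reaction y (columns).
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])

p_x = joint.sum(axis=1)              # marginal over actions, p(x)
conditional = joint / p_x[:, None]   # p(y | x); each row sums to 1

print(conditional)  # e.g. p(y=1 | x=1) = 0.45 / 0.50 = 0.9
```

A learner that maximizes joint likelihood spends capacity modeling p(x) as well, whereas the prediction task only requires the rows of `conditional` to be right; this is the motivation for estimating p(y|x) directly.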
The Generalized Bound Maximization framework is introduced as a general tool for optimization. The purpose of the optimization is to resolve some of the learning and estimation issues central to the Action-Reaction Learning paradigm. A set of techniques is presented and tested, forming a toolbox that is utilized in subsequent derivations.
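The flavor of bound maximization can be shown with a one-line toy example (not the thesis' GBM machinery): since |d²/dx² sin(x)| ≤ 1, the quadratic g(x) = sin(x₀) + cos(x₀)(x − x₀) − ½(x − x₀)² lower-bounds sin(x) and touches it at x₀. Repeatedly jumping to the bound's peak, x₀ + cos(x₀), monotonically increases sin(x).

```python
import math

x = 0.0
for _ in range(20):
    # argmax of the current quadratic lower bound on sin(x) at the point x
    x = x + math.cos(x)

print(x)  # converges toward pi/2 ~ 1.5708, the maximizer of sin(x)
```

Each iterate is guaranteed not to decrease the objective because the bound touches the function at the current point and lies below it everywhere else.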
The previous chapter's machinery is applied to the particular problem of learning a conditional density. This results in the derivation of the Conditional Expectation Maximization algorithm. This algorithm is shown to be optimal in the sense of maximum conditional likelihood and is used as the learning system in the ARL framework. The machine learning system's model of the relation between input and output (a conditioned mixture of Gaussians) is presented and implementation issues are addressed.
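Inference in a conditioned mixture of Gaussians can be sketched as follows (a hedged illustration with hand-picked scalar parameters, not parameters learned by Conditional Expectation Maximization): gates g_k(x) select among linear experts, and the prediction is the gate-weighted mixture of expert means, E[y|x] = Σ_k g_k(x)(a_k x + b_k).

```python
import numpy as np

def predict(x, mu, tau, a, b):
    """Return E[y|x] for a conditioned mixture of Gaussians with
    Gaussian gates centered at mu (widths tau) and linear experts a, b."""
    logits = -0.5 * ((x - mu) / tau) ** 2   # unnormalized log-gates
    g = np.exp(logits - logits.max())
    g /= g.sum()                            # gate responsibilities over experts
    return float(g @ (a * x + b))           # gate-weighted expert means

# Two experts: one active near x = -1, the other near x = +1 (made-up values).
mu  = np.array([-1.0, 1.0])   # gate centers
tau = np.array([0.5, 0.5])    # gate widths
a   = np.array([0.0, 2.0])    # expert slopes
b   = np.array([1.0, 0.0])    # expert offsets

print(predict(-1.0, mu, tau, a, b))  # dominated by expert 0, near 1.0
print(predict(1.0, mu, tau, a, b))   # dominated by expert 1, near 2.0
```

The learning problem is then to fit the gate and expert parameters so that the conditional likelihood of observed reactions given actions is maximized.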
The details of the integrated system as a whole are shown, including the interaction and data-flow between the sub-systems. This forms a high-level summary of the perceptual unit's functionality, the temporal features, the probabilistic inference and the graphical realization of the synthetic character and its animation.
The usage of the system is presented, including a particular application domain. The task is put forth and the training and testing conditions are described. The system is demonstrated to learn some simple gestural behaviours purely from training observations of two humans and then to interact appropriately with a single human participant. Effectively, the system learns to play not by being explicitly programmed or supervised but simply by observing other human participants. Quantitative and qualitative results are presented.
Important extensions to the work are described including more sophisticated perceptual systems, temporal modeling, probabilistic learning and synthesis systems. In addition, other modes of learning are also presented including continuous online learning. Conclusions and contributions are summarized.
This appendix gives a non-intuitive proof-by-example that motivates the distinction between direct conditional estimation and conditioned joint density estimation. The advantages of conditional probability densities are worked out formally and argue favorably for the input-output learning that is utilized by the ARL system.