Facial pose, 3D structure and position provide a vital source of information for applications such as face recognition, gaze tracking and interactive environments. We describe a real-time system that automatically recovers these measurements from real-world video streams. These two key requirements (real-world video and real-time operation) limit the types of image processing we can perform: computations must be fast without sacrificing generality or robustness across a wide variety of face tracking scenarios. We propose a system that marries robust face detection with fast face tracking. The system gracefully reverts to face detection when tracking fails and then re-initializes fast face tracking. Tracking is accomplished by maximizing normalized correlation over translation, rotation and scale. However, tracking is intimately coupled with feedback from a parametrized structure from motion framework, which allows us to overcome some limitations of linearized 2D image patches by simultaneously recovering the underlying global 3D structure.
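The correlation-based tracking step can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: it searches over translation only (the full system also spans rotation and scale), and all function names and parameters here are our own assumptions.

```python
import numpy as np

def normalized_correlation(patch, window):
    """Zero-mean normalized correlation between two equally sized patches."""
    p = patch - patch.mean()
    w = window - window.mean()
    denom = np.sqrt((p * p).sum() * (w * w).sum())
    return (p * w).sum() / denom if denom > 0 else 0.0

def track_patch(image, template, center, search=8):
    """Exhaustively search a (2*search+1)^2 translation window and return
    the top-left position maximizing normalized correlation with the
    template, along with the best score."""
    h, w = template.shape
    cy, cx = center
    best_score, best_pos = -np.inf, center
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = cy + dy, cx + dx
            window = image[y:y + h, x:x + w]
            if window.shape != template.shape:
                continue  # skip positions that fall off the image
            score = normalized_correlation(template, window)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```

A practical implementation would extend the search space to rotation and scale (e.g. by warping the template) and use a coarse-to-fine strategy rather than exhaustive search.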
Motion provides a strong cue for estimating 3D structure, pose and camera geometry. However, stable and accurate structure from motion (SfM) has typically been a purely bottom-up process requiring high-quality feature tracking, and it is usually constrained exclusively by rigidity assumptions. The estimation of 3D shape can be further constrained if the range of admissible 3D structures is defined a priori. In other words, if only faces are to be tracked, SfM can be restricted by 3D head models of human faces so that unlikely configurations are eliminated. We describe a global tracking framework that takes advantage of automatic initialization and 3D parametrized structural estimation to perform reliable feature tracking.
The details of this tracking system are discussed, starting with initialization, which is performed via automatic detection of facial features. The components of our face detection algorithm include skin classification, symmetry transforms, 3D normalization and eigenface analysis. Once the initial locations of these facial interest points are determined, the system tracks them using 2D SSD correlation patches (spanning rotation, scale and translation). However, such tracking alone cannot cope with 3D out-of-plane rotations and other non-linear changes. Thus, the 2D tracking and its noise characteristics are coupled to a structure from motion algorithm that simultaneously recovers an estimate of the pose and of the underlying 3D structure of the face. This structure is further constrained by a training set of 3D laser-scanned heads represented as a parametrized eigenspace, which prevents invalid 3D shape estimates in the structure from motion computation. The final filtered 3D facial structure and pose estimate is fed back to control the 2D feature tracking at the next iteration, overcoming some of its inherent 2D limitations.
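The eigenspace constraint on recovered shape can be illustrated as a linear projection: the raw 3D point estimate from SfM is projected onto the span of the first few eigen-head basis vectors, so shapes far from the training set of scanned heads are suppressed. This is a hedged sketch under the assumption of a simple orthogonal projection; the names (`constrain_shape`, `eigenheads`) and the fixed truncation `k` are illustrative, not the paper's API.

```python
import numpy as np

def constrain_shape(raw_shape, mean_shape, eigenheads, k=5):
    """Project a raw 3D shape estimate onto the span of the first k
    eigen-head basis vectors, eliminating unlikely head configurations.

    raw_shape:  (3N,) flattened 3D point estimate from SfM
    mean_shape: (3N,) mean shape of the training heads
    eigenheads: (3N, K) orthonormal eigenvectors of the training set
    """
    coeffs = eigenheads[:, :k].T @ (raw_shape - mean_shape)
    return mean_shape + eigenheads[:, :k] @ coeffs
```

Because the projection is idempotent, feeding the constrained shape back through the filter leaves it unchanged, which is convenient in an iterative feedback loop.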
The fully integrated system is displayed in Figure 1. Note the fast face tracking loop and the slower face detection loop. The system switches between these two modes using eigenface measurements: if the object being tracked remains face-like, tracking continues; otherwise, reliable face detection is used to search the whole image for a new face. In addition, note the coupling of feature tracking, structure from motion and 3D eigen head modeling. This closed-loop feedback prevents tracking from straying off course.
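One common way to realize such an eigenface measurement is the "distance from face space" test: reconstruct the candidate patch from the eigenface basis and compare the residual energy to a threshold. The sketch below assumes this standard formulation; the function name, the flattening convention and the threshold are our own illustrative choices, not details taken from the system.

```python
import numpy as np

def is_face(patch, mean_face, eigenfaces, threshold):
    """Distance-from-face-space test for mode switching.

    patch:      candidate image region (any shape; flattened internally)
    mean_face:  (D,) mean of the training faces
    eigenfaces: (D, K) orthonormal eigenface basis
    Returns True if the reconstruction residual is below threshold,
    i.e. the patch is face-like and tracking may continue.
    """
    x = patch.ravel() - mean_face
    coeffs = eigenfaces.T @ x          # project into face space
    residual = x - eigenfaces @ coeffs  # part unexplained by the basis
    return np.linalg.norm(residual) < threshold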