The correlation-based feature trackers begin by tracking in a nearest-neighbour sense, searching locally for the facial features. However, at each iteration, the Kalman filter computes an estimate of the rigid 3D structure that could correspond to the motion of the set of 2D SSD trackers. This global estimate is weighted using the noise characteristics and residuals of the 2D tracking. Once this structure is computed and estimates of orientation and camera focal length are found, the 3D structure is filtered using an eigenspace of 3D head shape. The final 3D structure, motion and focal length are used to project the feature points back onto the image, yielding an estimated position for each 2D feature tracker. At the next frame in the sequence, a correlation-based search is performed starting both at this 3D-derived estimate and at the feature tracker's own previous destination. The better match of these two searches is then fed back into the Kalman filter as the 2D spatial observation vector and the loop continues. Two searches are performed for each SSD tracker since the EKF may perform worse than straight nearest-neighbour searching before structural convergence. The feedback from the adaptive Kalman filter maintains a sense of 3D structure and enforces a global collaboration between the separate 2D trackers.
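As a rough illustration of this per-tracker update (a sketch, not the authors' implementation), the following Python fragment runs the two correlation searches described above, one seeded at the EKF-projected position and one at the tracker's previous destination, and keeps the lower-SSD match as the observation returned to the Kalman filter. The function names, the exhaustive square search window, and the top-left patch convention are assumptions made for the example.

```python
import numpy as np

def ssd_search(image, template, start, radius):
    """Exhaustive SSD (sum of squared differences) search for `template`
    in a square window of +/- `radius` pixels around the integer start
    position (x, y). Returns (best_xy, best_ssd)."""
    th, tw = template.shape
    best_ssd, best_xy = np.inf, (int(start[0]), int(start[1]))
    x0, y0 = best_xy
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0:
                continue                      # window falls outside the image
            patch = image[y:y + th, x:x + tw]
            if patch.shape != template.shape:
                continue                      # window falls outside the image
            ssd = np.sum((patch.astype(float) - template.astype(float)) ** 2)
            if ssd < best_ssd:
                best_ssd, best_xy = ssd, (x, y)
    return best_xy, best_ssd

def track_feature(image, template, prev_xy, ekf_xy, radius=8):
    """Perform the two searches per tracker: one seeded at the previous 2D
    destination and one at the position obtained by projecting the EKF's 3D
    structure back into the image. The lower-SSD match is the 2D observation
    fed back to the Kalman filter."""
    cand_prev = ssd_search(image, template, prev_xy, radius)
    cand_ekf = ssd_search(image, template, ekf_xy, radius)
    return min(cand_prev, cand_ekf, key=lambda c: c[1])[0]
```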
In addition, at each iteration, the orientation of the face is computed and used to warp the face image back into a frontal view, from which the 'distance from face space' (DFFS) is computed. If the DFFS is below a threshold, tracking continues; otherwise, the system reverts to the initial detection stage.
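A minimal sketch of this DFFS gate follows, assuming the warped face has already been produced, the eigenfaces are an orthonormal set of flattened basis images, and a mean face image is available; the names and the threshold handling are illustrative rather than taken from the original system.

```python
import numpy as np

def dffs(warped_face, mean_face, eigenfaces):
    """'Distance from face space' of a frontally warped face image: the
    reconstruction residual left after projecting the mean-subtracted image
    onto the leading eigenfaces (rows of `eigenfaces`, assumed orthonormal)."""
    x = warped_face.ravel().astype(float) - mean_face.ravel()
    coeffs = eigenfaces @ x                 # coordinates in face space
    residual = x - eigenfaces.T @ coeffs    # component outside face space
    return np.linalg.norm(residual)

def tracking_still_valid(warped_face, mean_face, eigenfaces, threshold):
    """Continue tracking while DFFS stays below the threshold; a False
    return signals a revert to the initial face-detection stage."""
    return dffs(warped_face, mean_face, eigenfaces) < threshold
```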