The correlation-based feature trackers begin by tracking in a nearest-neighbour sense, searching locally for the facial features. However, at each iteration, the Kalman filter computes an estimate of the rigid 3D structure that could correspond to the motion of the set of 2D SSD trackers. This global estimate is weighted using the noise characteristics and residuals of the 2D tracking. Once this structure is computed and estimates of orientation and camera focal length are found, the 3D structure is filtered using an eigenspace of 3D head shape. The final 3D structure, motion and focal length are used to project the feature points back onto the image, yielding an estimated position for each 2D feature tracker. At the next frame in the sequence, a correlation-based search is performed starting both at this 3D-derived estimate and at the feature tracker's own previous destination. The better match of these two searches is then fed back into the Kalman filter as the 2D spatial observation vector and the loop continues. Two searches are performed for each SSD tracker since the EKF may perform worse than straight nearest-neighbour searching before structural convergence. The feedback from the adaptive Kalman filter maintains a sense of 3D structure and enforces a global collaboration between the separate 2D trackers.
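As a rough illustration of this per-tracker update (a sketch, not the authors' implementation), the following Python fragment runs the two correlation searches described above, one seeded at the EKF-projected position and one at the tracker's previous destination, and keeps the lower-SSD match as the observation returned to the Kalman filter. The function names, the exhaustive square search window, and the top-left patch convention are assumptions made for the example.

```python
import numpy as np

def ssd_search(image, template, start, radius):
    """Exhaustive SSD (sum of squared differences) search for `template`
    in a square window of +/- `radius` pixels around the integer start
    position (x, y). Returns (best_xy, best_ssd)."""
    th, tw = template.shape
    best_ssd, best_xy = np.inf, (int(start[0]), int(start[1]))
    x0, y0 = best_xy
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0:
                continue                      # window falls outside the image
            patch = image[y:y + th, x:x + tw]
            if patch.shape != template.shape:
                continue                      # window falls outside the image
            ssd = np.sum((patch.astype(float) - template.astype(float)) ** 2)
            if ssd < best_ssd:
                best_ssd, best_xy = ssd, (x, y)
    return best_xy, best_ssd

def track_feature(image, template, prev_xy, ekf_xy, radius=8):
    """Perform the two searches per tracker: one seeded at the previous 2D
    destination and one at the position obtained by projecting the EKF's 3D
    structure back into the image. The lower-SSD match is the 2D observation
    fed back to the Kalman filter."""
    cand_prev = ssd_search(image, template, prev_xy, radius)
    cand_ekf = ssd_search(image, template, ekf_xy, radius)
    return min(cand_prev, cand_ekf, key=lambda c: c[1])[0]
```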
In addition, at each iteration, the orientation of the face is computed and used to warp the face image back into a frontal view, from which the 'distance from face space' (DFFS) is computed. If the DFFS is below a threshold, tracking continues; otherwise, the system reverts to the initial detection stage.
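A minimal sketch of this DFFS gate follows, assuming the warped face has already been produced, the eigenfaces are an orthonormal set of flattened basis images, and a mean face image is available; the names and the threshold handling are illustrative rather than taken from the original system.

```python
import numpy as np

def dffs(warped_face, mean_face, eigenfaces):
    """'Distance from face space' of a frontally warped face image: the
    reconstruction residual left after projecting the mean-subtracted image
    onto the leading eigenfaces (rows of `eigenfaces`, assumed orthonormal)."""
    x = warped_face.ravel().astype(float) - mean_face.ravel()
    coeffs = eigenfaces @ x                 # coordinates in face space
    residual = x - eigenfaces.T @ coeffs    # component outside face space
    return np.linalg.norm(residual)

def tracking_still_valid(warped_face, mean_face, eigenfaces, threshold):
    """Continue tracking while DFFS stays below the threshold; a False
    return signals a revert to the initial face-detection stage."""
    return dffs(warped_face, mean_face, eigenfaces) < threshold
```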