Having determined the locations of facial features in the image, it is now possible to define a number of windows on the face which will be used for template matching via SSD (sum of squared differences) correlation [5]. Using a simple mapping, a set of windows is overlaid upon the face automatically from the data gathered in the face detection stage. A typical initialization result is shown in Figure 9. Eight tracking windows are initialized automatically on the nose, the mouth tips and the eyes, as shown. These windowed correlation trackers acquire templates from the image and minimize the SSD of the underlying image patch from one frame to the next. The image patches first undergo contrast and brightness compensation. Registration of the image patch from one frame to the next is then accomplished by minimizing the SSD of the compensated patches over translation, scaling and rotation parameters. A linear approximation of the behaviour of the image patch under small translation, scaling and rotation perturbations can be used to recover the motion of the image patch. Only simple linear computations are required for this (i.e. no explicit searching), rendering the computation quite efficient.
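The compensation and matching cost described above can be illustrated with a short sketch. The text does not specify the exact form of the contrast and brightness compensation, so the zero-mean, unit-variance normalization used here is an assumption, as are the function names `normalize_patch` and `ssd`.

```python
import numpy as np


def normalize_patch(patch):
    """Remove brightness (mean) and contrast (standard deviation) from a patch.

    Assumption: compensation is modelled as zero-mean, unit-variance scaling.
    """
    patch = np.asarray(patch, dtype=float)
    centered = patch - patch.mean()
    std = centered.std()
    # A flat patch has no contrast to normalize; return the centered patch.
    return centered / std if std > 1e-12 else centered


def ssd(template, patch):
    """Sum of squared differences between two compensated patches of equal shape."""
    diff = normalize_patch(template) - normalize_patch(patch)
    return float(np.sum(diff ** 2))
```

With this normalization, a patch that differs from the template only by a gain and an offset yields an SSD of essentially zero, so only structural changes in the patch contribute to the cost.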
Given an image at time 0, we wish to find the motion parameter vector that minimizes the SSD objective defined in Equation 5.
Here the motion is parametrized by a vector that allows translation, rotation and scaling of the image patch. Solving for this parameter vector in an optimal sense is performed by computing the pseudo-inverse of a matrix composed of the motion templates. Such a solution is only valid for small displacements, and smoothing is used to extend the applicable range of the solution.
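The linearized solve described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact formulation of Equation 5: the names `motion_templates` and `estimate_motion`, the parameter ordering (x translation, y translation, rotation in radians, relative scale), the choice of the window centre as the origin of rotation and scaling, and the use of finite-difference gradients are all ours. Smoothing, iteration, and the contrast and brightness compensation are omitted for brevity.

```python
import numpy as np


def motion_templates(template):
    """Build one column per motion parameter: tx, ty, rotation, scale.

    Each column is the first-order change of the template under a unit
    perturbation of that parameter (image gradient dotted with the
    displacement field the parameter induces).
    """
    template = np.asarray(template, dtype=float)
    gy, gx = np.gradient(template)  # derivatives along rows (y) and columns (x)
    h, w = template.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xc = xs - (w - 1) / 2.0  # coordinates relative to the window centre
    yc = ys - (h - 1) / 2.0
    columns = [
        gx,                   # translation in x: displacement (1, 0)
        gy,                   # translation in y: displacement (0, 1)
        -yc * gx + xc * gy,   # small rotation:   displacement (-y, x)
        xc * gx + yc * gy,    # small scaling:    displacement (x, y)
    ]
    return np.stack([c.ravel() for c in columns], axis=1)


def estimate_motion(template, patch):
    """Least-squares motion estimate from the pseudo-inverse of the templates.

    Model (first-order): patch(x) = template(x - d(x)), so
    template - patch is approximately M @ mu.
    Returns (mu, residual), where residual is the SSD left after the fit.
    """
    M = motion_templates(template)
    diff = (np.asarray(template, dtype=float) - np.asarray(patch, dtype=float)).ravel()
    mu = np.linalg.pinv(M) @ diff
    residual = float(np.sum((diff - M @ mu) ** 2))
    return mu, residual
```

For a patch displaced by a fraction of a pixel, `estimate_motion(template, patch)` returns a parameter vector close to the true displacement without any explicit search. For larger motions the first-order approximation breaks down, which is consistent with the need for smoothing to extend the applicable range.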
The minimum value of the SSD objective is also recovered by the process, which gives us a cue for the reliability of the resulting optimal motion estimate.
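One simple way to turn the minimum SSD into a reliability cue is to convert it to a per-pixel RMS error and compare it against a threshold. The sketch below is only an illustration: the function names and the default threshold `max_rms` are hypothetical and would need tuning to the actual intensity range and compensation used.

```python
import math


def residual_rms(min_ssd, num_pixels):
    """Per-pixel RMS error corresponding to the minimum SSD of a window."""
    if num_pixels <= 0:
        raise ValueError("num_pixels must be positive")
    return math.sqrt(min_ssd / num_pixels)


def is_track_reliable(min_ssd, num_pixels, max_rms=0.1):
    """Flag a tracker as reliable when its residual RMS is below a threshold.

    Assumption: max_rms is a hypothetical tuning parameter, not a value
    taken from the text.
    """
    return residual_rms(min_ssd, num_pixels) <= max_rms
```

Normalizing by the number of pixels keeps the cue comparable across windows of different sizes.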
Unfortunately, minimizing over rotations, scaling and translations cannot account for other 3D or complex changes in the image region. Such changes might be induced by out-of-plane 3D rotations, occlusions or noise, and could easily mislead the estimate of the motion parameters. Thus, the correlation window typically loses track of its feature if the feature undergoes change beyond the span of the 2D motion model. In addition, due to the local nature of the tracking algorithm, it is extremely unlikely that feature tracking will recover from such a failure without external assistance. Even if multiple features are being tracked, without a strong coupling between them feature tracking will eventually fail. As unpredictable effects such as 3D structure, occlusion and noise interfere with the 2D tracking, each of the feature trackers will stray off in turn and yield invalid spatial trajectories.
What is desired is a global framework that overcomes some of the difficulties inherent in simple 2D tracking by coupling the individual trackers to a global 3D structure. The outputs of the trackers are integrated appropriately to achieve a global explanation of the scene which can be fed back to constrain their individual behaviour and avoid feature loss.