The results presented in this section demonstrate that expression models based on local histograms outperform optic-flow models when the images are not perfectly aligned and normalized, while remaining at least as good as optic-flow models on perfectly aligned images. If we are to build a system that can reliably extract information from natural interactions, it must be robust against tracking and normalization errors. The lowest-level, shortest-time-scale expressions that we model should therefore be ones that can be extracted with a high degree of confidence from unconstrained video. These low-level models can then be used to build mid-level and high-level structures that capture the emotional or cognitive meaning of a conversation or interaction.
Table 1 shows our results using local histograms and optic flow for images obtained without compensating for head-tracker errors. In this case we have both translational error (between 5% and 10%) and rotational error (measured in degrees), as well as minor scale changes. We used 44 expression sequences for training and 38 for testing, consisting of 1719 and 1417 difference images respectively, depicting the following eye expressions: blink, looking left, looking right, raising the eyebrow, and looking up. Images were recorded at 30 frames/second; the mean length of a typical expression was 39 frames, with a standard deviation of 19 frames. Unlike in the study by Donato et al. [6], expressions were not normalized to a constant length, because the variation in length itself carries important information: a long blink, for example, may occur because a person is drowsy, whereas regular blinks are short. Training and testing expressions were obtained from two separate data recordings made on two different days. All the expression models were trained on the histogram coefficients or the flow coefficients using three-state left-to-right HMMs.
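As a concrete illustration of this pipeline, the sketch below shows one plausible way to compute per-frame local-histogram features from difference images and to train a three-state left-to-right HMM per expression. It is a minimal sketch, assuming hmmlearn's GaussianHMM; the grid size, bin count, and all function names are illustrative choices of ours, not the implementation used for these experiments.

```python
import numpy as np
from hmmlearn import hmm

def local_histogram_features(prev_frame, frame, grid=(4, 4), bins=16):
    """Per-frame feature vector: local histograms of the difference image.

    The 4x4 grid and 16 bins are illustrative parameters, not the
    paper's actual settings.
    """
    diff = frame.astype(np.float32) - prev_frame.astype(np.float32)
    h, w = diff.shape
    gh, gw = h // grid[0], w // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            patch = diff[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            hist, _ = np.histogram(patch, bins=bins, range=(-255, 255))
            feats.append(hist / max(patch.size, 1))  # normalize per patch
    return np.concatenate(feats)

def make_left_to_right_hmm(n_states=3):
    """Build a three-state left-to-right Gaussian HMM.

    Zeros in the start/transition parameters are preserved by
    Baum-Welch re-estimation, so the topology stays left-to-right.
    """
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag",
                            n_iter=50,
                            init_params="mc")  # keep our startprob/transmat
    model.startprob_ = np.array([1.0, 0.0, 0.0])  # always start in state 0
    model.transmat_ = np.array([[0.5, 0.5, 0.0],  # self-loop or advance
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])  # absorbing final state
    return model

def train_expression_model(sequences):
    """Fit one HMM on all training sequences of a single expression.

    `sequences` is a list of (T_i, D) arrays of per-frame feature
    vectors (e.g. local-histogram or flow coefficients).
    """
    X = np.concatenate(sequences)          # hmmlearn expects stacked frames
    lengths = [len(s) for s in sequences]  # plus per-sequence lengths
    model = make_left_to_right_hmm()
    model.fit(X, lengths)
    return model
```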
Table 2 shows our results using local histograms and optic flow for images that were accurately normalized. Again, all the expression models were trained using three-state left-to-right HMMs. On average we had 10 sequences per expression for both training and testing.
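With one HMM per expression, a test sequence can then be labeled by evaluating its log-likelihood under every model and choosing the best-scoring one. The sketch below continues the hypothetical code above and is illustrative rather than the authors' implementation.

```python
def classify(sequence, models):
    """Label a test sequence by maximum log-likelihood.

    `models` maps an expression name (e.g. "blink") to a trained HMM;
    `sequence` is a (T, D) array of per-frame feature vectors.
    """
    scores = {name: m.score(sequence) for name, m in models.items()}
    return max(scores, key=scores.get)
```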