The results presented in this section demonstrate that expression models based on local histograms outperform optic-flow models when the images are not perfectly aligned and normalized, while remaining at least as good as optic-flow models on perfectly aligned images. If we are to build a system that can reliably extract information from natural interactions, it must be robust against tracking and normalization errors. The lowest-level, shortest-time-scale expressions that we model should therefore be ones that can be extracted with a high degree of confidence from unconstrained video. These low-level models can then be used to build mid-level and high-level structures that capture the emotional or cognitive meaning of a conversation or interaction.
Table 1 shows our results using local histograms and optic flow for images obtained without compensating for head-tracker errors. In this case we have both translational error (between 5% and 10%) and rotational error (measured in degrees), as well as minor scale changes. We used 44 expression sequences for training and 38 for testing, consisting of 1719 and 1417 difference images respectively, depicting the following eye expressions: blink, looking left, looking right, raising the eyebrow, and looking up. Images were recorded at 30 frames/second; the mean length of a typical expression was 39 frames, with a standard deviation of 19 frames. Unlike in the study by Donato et al. [6], expressions were not normalized to a constant length, because the variation in length itself carries important information: a long blink, for example, may occur because a person is drowsy, whereas regular blinks are short. Training and testing expressions were obtained from two separate data recordings made on two different days. All the expression models were trained on the histogram coefficients or the flow coefficients using three-state left-to-right HMMs.
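As a concrete illustration of this pipeline, the sketch below shows one plausible way to compute per-frame local-histogram features from difference images and to train a three-state left-to-right HMM per expression. It is a minimal sketch, assuming hmmlearn's GaussianHMM; the grid size, bin count, and all function names are illustrative choices of ours, not the implementation used for these experiments.

```python
import numpy as np
from hmmlearn import hmm

def local_histogram_features(prev_frame, frame, grid=(4, 4), bins=16):
    """Per-frame feature vector: local histograms of the difference image.

    The 4x4 grid and 16 bins are illustrative parameters, not the
    paper's actual settings.
    """
    diff = frame.astype(np.float32) - prev_frame.astype(np.float32)
    h, w = diff.shape
    gh, gw = h // grid[0], w // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            patch = diff[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            hist, _ = np.histogram(patch, bins=bins, range=(-255, 255))
            feats.append(hist / max(patch.size, 1))  # normalize per patch
    return np.concatenate(feats)

def make_left_to_right_hmm(n_states=3):
    """Build a three-state left-to-right Gaussian HMM.

    Zeros in the start/transition parameters are preserved by
    Baum-Welch re-estimation, so the topology stays left-to-right.
    """
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag",
                            n_iter=50,
                            init_params="mc")  # keep our startprob/transmat
    model.startprob_ = np.array([1.0, 0.0, 0.0])  # always start in state 0
    model.transmat_ = np.array([[0.5, 0.5, 0.0],  # self-loop or advance
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])  # absorbing final state
    return model

def train_expression_model(sequences):
    """Fit one HMM on all training sequences of a single expression.

    `sequences` is a list of (T_i, D) arrays of per-frame feature
    vectors (e.g. local-histogram or flow coefficients).
    """
    X = np.concatenate(sequences)          # hmmlearn expects stacked frames
    lengths = [len(s) for s in sequences]  # plus per-sequence lengths
    model = make_left_to_right_hmm()
    model.fit(X, lengths)
    return model
```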
Table 2 shows our results using local histograms and optic flow for images that were accurately normalized. Again, all the expression models were trained using three-state left-to-right HMMs. On average we had 10 sequences per expression for both training and testing.
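With one HMM per expression, a test sequence can then be labeled by evaluating its log-likelihood under every model and choosing the best-scoring one. The sketch below continues the hypothetical code above and is illustrative rather than the authors' implementation.

```python
def classify(sequence, models):
    """Label a test sequence by maximum log-likelihood.

    `models` maps an expression name (e.g. "blink") to a trained HMM;
    `sequence` is a (T, D) array of per-frame feature vectors.
    """
    scores = {name: m.score(sequence) for name, m in models.items()}
    return max(scores, key=scores.get)
```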