For training, the sentences are divided automatically into five equal portions to provide an initial segmentation into component signs. Initial estimates for the means and variances of the output probabilities are then obtained by iteratively running Viterbi alignment on the training data and recomputing the means and variances from the vectors pooled in each segment. Entropic's Hidden Markov Model ToolKit (HTK) is used as the basis for this step and for all other HMM modeling and training tasks. The results of the initial alignment are fed into a Baum-Welch re-estimator, whose estimates are, in turn, refined by embedded training, which ignores any initial segmentation. For recognition, HTK's Viterbi recognizer is used both with and without the part-of-speech grammar based on the known form of the sentences. Context-dependent models are not used, since they would require significantly more data to train; however, a similar effect can be achieved with the strong grammar on this data set. Recognition runs five times faster than real time.
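The flat-start initialization described above can be sketched as follows. This is an illustrative reconstruction, not HTK's actual implementation; the frame list and feature dimensionality are assumptions.

```python
from statistics import mean, pvariance

def flat_start(frames, n_segments=5):
    """Divide a sentence's frames into equal portions and pool each
    portion to get initial per-state Gaussian means and variances.
    `frames` is a list of equal-length feature vectors (one per frame)."""
    size = len(frames) // n_segments
    means, variances = [], []
    for s in range(n_segments):
        # Last segment absorbs any leftover frames
        seg = frames[s * size:] if s == n_segments - 1 \
              else frames[s * size:(s + 1) * size]
        cols = list(zip(*seg))  # transpose: one tuple per feature dimension
        means.append([mean(c) for c in cols])
        variances.append([pvariance(c) for c in cols])
    return means, variances

# Illustrative: 20 frames of 2-dimensional features
frames = [[float(i), float(2 * i)] for i in range(20)]
mu, var = flat_start(frames)
assert len(mu) == 5 and len(mu[0]) == 2
```

In the real system these pooled statistics would only seed the models; the Viterbi alignment and Baum-Welch passes described above then refine them.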
Word recognition accuracy results are shown in Table 2; where they differ, the percentage of words correctly recognized is shown in parentheses next to the accuracy rates. Accuracy is calculated as Acc = (N - D - S - I) / N, where N is the total number of words, D the number of deletions, S the number of substitutions, and I the number of insertions.
An additional ``relative features'' test is provided in the results.
For this test, absolute (x,y) position is removed from the feature
vector. This provides a sense of how the recognizer performs
when only relative features are available. Such may be the case
in daily use; the signer may not place himself in the same location
each time the system is used.
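Stripping absolute position from the feature vector can be sketched as below. The feature layout shown is a hypothetical example, not the paper's exact vector.

```python
# Hypothetical per-frame feature layout (an assumption for illustration):
# [x, y, dx, dy, angle] -- absolute screen position followed by
# relative features such as velocity and hand angle.
ABSOLUTE_IDX = {0, 1}  # indices of absolute (x, y) position

def to_relative(frame):
    """Drop absolute (x, y) position, keeping only relative features."""
    return [v for i, v in enumerate(frame) if i not in ABSOLUTE_IDX]

assert to_relative([120.0, 45.0, -2.0, 1.5, 0.3]) == [-2.0, 1.5, 0.3]
```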
Table 2: word recognition accuracy (percentage of words correctly recognized in parentheses when different).

experiment                         | training set                 | independent test set
all features                       | 94.1%                        | 91.9%
relative features                  | 89.6%                        | 87.2%
all features, unrestricted grammar | 81.0% (87%)                  | 74.5% (83%)
                                   | (D=31, S=287, I=137, N=2390) | (D=3, S=76, I=41, N=470)
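The accuracy and percent-correct figures can be reproduced from the deletion, substitution, and insertion counts in the unrestricted-grammar rows; the helper below is a sketch of that arithmetic, following HTK's convention that accuracy also penalizes insertions while percent correct does not.

```python
def htk_scores(N, D, S, I):
    """Word accuracy and percent correct from error counts.
    N: reference words, D: deletions, S: substitutions, I: insertions."""
    accuracy = (N - D - S - I) / N  # insertions count against accuracy
    correct = (N - D - S) / N       # percent correct ignores insertions
    return accuracy, correct

# Counts from the unrestricted-grammar training-set row of Table 2
acc, corr = htk_scores(N=2390, D=31, S=287, I=137)
print(f"{acc:.1%} accuracy, {corr:.0%} correct")  # → 81.0% accuracy, 87% correct
```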
The 94.1% and 91.9% accuracies using the part-of-speech grammar show that the HMM topologies are sound and that the models generalize well. However, the subject's variability in body rotation and position is known to be a problem with this data set. Thus, signs that are distinguished by the hands' positions in relation to the body were confused, since the absolute positions of the hands were measured in screen coordinates. With the relative feature set, the absolute positions of the hands are removed from the feature vector. While this change increases the error rate slightly, it demonstrates the feasibility of allowing the subject to vary his location in the room while signing, possibly removing a constraint from the system.
The error rates of the ``unrestricted'' experiment better indicate where problems may occur when extending the system. Without the grammar, signs with repetitive or long gestures were often inserted twice for each actual occurrence. In fact, insertions caused more errors than substitutions. Thus, the sign ``shoes'' might be recognized as ``shoes shoes,'' which is a viable hypothesis without a language model. However, a practical solution to this problem is to use context training and a statistical grammar.
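One way a statistical grammar of the kind suggested above could suppress repeated-sign insertions is a bigram model: a transition from a sign to itself that is unattested in training scores poorly. The corpus, smoothing scheme, and vocabulary size below are illustrative assumptions, not the paper's.

```python
from collections import defaultdict
import math

class BigramGrammar:
    """Toy bigram language model. A low P(w | w) discourages
    hypotheses like 'shoes shoes' arising from doubled insertions."""
    def __init__(self, sentences):
        self.counts = defaultdict(lambda: defaultdict(int))
        for sent in sentences:
            for prev, cur in zip(["<s>"] + sent, sent + ["</s>"]):
                self.counts[prev][cur] += 1

    def log_prob(self, prev, cur, alpha=1.0, vocab=10):
        # Add-alpha smoothing over an assumed vocabulary size
        c = self.counts[prev]
        return math.log((c[cur] + alpha) / (sum(c.values()) + alpha * vocab))

corpus = [["I", "like", "shoes"], ["you", "like", "coats"]]
lm = BigramGrammar(corpus)
# The unattested repetition 'shoes shoes' scores lower than 'like shoes'
assert lm.log_prob("shoes", "shoes") < lm.log_prob("like", "shoes")
```

In a recognizer, these log probabilities would be added to the acoustic (here, gesture) scores during the Viterbi search, biasing it away from spurious repetitions.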