For training, the sentences are divided automatically into five equal portions to provide an initial segmentation into component signs. Initial estimates for the means and variances of the output probabilities are then obtained by iteratively running Viterbi alignment on the training data and recomputing the means and variances from the vectors pooled in each segment. Entropic's Hidden Markov Model ToolKit (HTK) is used as the basis for this step and for all other HMM modeling and training tasks. The results of the initial alignment are fed into a Baum-Welch re-estimator, whose estimates are, in turn, refined by embedded training, which ignores any initial segmentation. For recognition, HTK's Viterbi recognizer is used both with and without the part-of-speech grammar based on the known form of the sentences. Context-dependent models are not used, since they would require significantly more data to train; however, a similar effect can be achieved with the strong grammar on this data set. Recognition runs five times faster than real time.
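The flat-start initialization described above can be sketched as follows. This is an illustrative reconstruction, not HTK's actual implementation; the frame list and feature dimensionality are assumptions.

```python
from statistics import mean, pvariance

def flat_start(frames, n_segments=5):
    """Divide a sentence's frames into equal portions and pool each
    portion to get initial per-state Gaussian means and variances.
    `frames` is a list of equal-length feature vectors (one per frame)."""
    size = len(frames) // n_segments
    means, variances = [], []
    for s in range(n_segments):
        # Last segment absorbs any leftover frames
        seg = frames[s * size:] if s == n_segments - 1 \
              else frames[s * size:(s + 1) * size]
        cols = list(zip(*seg))  # transpose: one tuple per feature dimension
        means.append([mean(c) for c in cols])
        variances.append([pvariance(c) for c in cols])
    return means, variances

# Illustrative: 20 frames of 2-dimensional features
frames = [[float(i), float(2 * i)] for i in range(20)]
mu, var = flat_start(frames)
assert len(mu) == 5 and len(mu[0]) == 2
```

In the real system these pooled statistics would only seed the models; the Viterbi alignment and Baum-Welch passes described above then refine them.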
Word recognition accuracy results are shown in Table 2; where they differ, the percentage of words correctly recognized is shown in parentheses next to the accuracy rates. Accuracy is calculated as Acc = (N - D - S - I) / N, where N is the total number of words, D the number of deletions, S the number of substitutions, and I the number of insertions.
An additional ``relative features'' test is provided in the results.
For this test, absolute (x,y) position is removed from the feature
vector. This provides a sense of how the recognizer performs
when only relative features are available. Such may be the case
in daily use; the signer may not place himself in the same location
each time the system is used.
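Stripping absolute position from the feature vector can be sketched as below. The feature layout shown is a hypothetical example, not the paper's exact vector.

```python
# Hypothetical per-frame feature layout (an assumption for illustration):
# [x, y, dx, dy, angle] -- absolute screen position followed by
# relative features such as velocity and hand angle.
ABSOLUTE_IDX = {0, 1}  # indices of absolute (x, y) position

def to_relative(frame):
    """Drop absolute (x, y) position, keeping only relative features."""
    return [v for i, v in enumerate(frame) if i not in ABSOLUTE_IDX]

assert to_relative([120.0, 45.0, -2.0, 1.5, 0.3]) == [-2.0, 1.5, 0.3]
```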
Table 2: word recognition accuracy (percentage of words correctly recognized in parentheses when different).

experiment                         | training set                 | independent test set
all features                       | 94.1%                        | 91.9%
relative features                  | 89.6%                        | 87.2%
all features, unrestricted grammar | 81.0% (87%)                  | 74.5% (83%)
                                   | (D=31, S=287, I=137, N=2390) | (D=3, S=76, I=41, N=470)
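The accuracy and percent-correct figures can be reproduced from the deletion, substitution, and insertion counts in the unrestricted-grammar rows; the helper below is a sketch of that arithmetic, following HTK's convention that accuracy also penalizes insertions while percent correct does not.

```python
def htk_scores(N, D, S, I):
    """Word accuracy and percent correct from error counts.
    N: reference words, D: deletions, S: substitutions, I: insertions."""
    accuracy = (N - D - S - I) / N  # insertions count against accuracy
    correct = (N - D - S) / N       # percent correct ignores insertions
    return accuracy, correct

# Counts from the unrestricted-grammar training-set row of Table 2
acc, corr = htk_scores(N=2390, D=31, S=287, I=137)
print(f"{acc:.1%} accuracy, {corr:.0%} correct")  # → 81.0% accuracy, 87% correct
```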
The 94.1% and 91.9% accuracies using the part-of-speech grammar show that the HMM topologies are sound and that the models generalize well. However, the subject's variability in body rotation and position is known to be a problem with this data set. Thus, signs that are distinguished by the hands' positions in relation to the body were confused, since the absolute positions of the hands were measured in screen coordinates. With the relative feature set, the absolute positions of the hands are removed from the feature vector. While this change increases the error rate slightly, it demonstrates the feasibility of allowing the subject to vary his location in the room while signing, possibly removing a constraint from the system.
The error rates of the ``unrestricted'' experiment better indicate where problems may occur when extending the system. Without the grammar, signs with repetitive or long gestures were often inserted twice for each actual occurrence. In fact, insertions caused more errors than substitutions. Thus, the sign ``shoes'' might be recognized as ``shoes shoes,'' which is a viable hypothesis without a language model. However, a practical solution to this problem is to use context training and a statistical grammar.
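One way a statistical grammar of the kind suggested above could suppress repeated-sign insertions is a bigram model: a transition from a sign to itself that is unattested in training scores poorly. The corpus, smoothing scheme, and vocabulary size below are illustrative assumptions, not the paper's.

```python
from collections import defaultdict
import math

class BigramGrammar:
    """Toy bigram language model. A low P(w | w) discourages
    hypotheses like 'shoes shoes' arising from doubled insertions."""
    def __init__(self, sentences):
        self.counts = defaultdict(lambda: defaultdict(int))
        for sent in sentences:
            for prev, cur in zip(["<s>"] + sent, sent + ["</s>"]):
                self.counts[prev][cur] += 1

    def log_prob(self, prev, cur, alpha=1.0, vocab=10):
        # Add-alpha smoothing over an assumed vocabulary size
        c = self.counts[prev]
        return math.log((c[cur] + alpha) / (sum(c.values()) + alpha * vocab))

corpus = [["I", "like", "shoes"], ["you", "like", "coats"]]
lm = BigramGrammar(corpus)
# The unattested repetition 'shoes shoes' scores lower than 'like shoes'
assert lm.log_prob("shoes", "shoes") < lm.log_prob("like", "shoes")
```

In a recognizer, these log probabilities would be added to the acoustic (here, gesture) scores during the Viterbi search, biasing it away from spurious repetitions.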