Next: Discussion and Future Work Up: Real-Time American Sign Language Previous: The second person view:

The first person view: a wearable-based recognizer

For the second experiment, the same 500 sentences were collected by a different subject. Sentences were re-signed whenever a mistake was made. The full 500-sentence database is available by anonymous ftp from vismod.media.mit.edu under pub/asl. The subject took care to look forward while signing so as not to confound the tracking with head rotation, though some variation can still be seen. Often, several frames at the beginning and end of a sentence's data show the hands at a resting position. To take this into account, another token, "silence" (in deference to the speech convention), was added to the lexicon. While this "sign" is trained with the rest, it is not included when calculating the accuracy measurement.

The resulting word accuracies from the experiment are listed in Table 3. In this experiment, 400 sentences were used for training and an independent 100 sentences were used for testing. A new grammar was added for this experiment. This grammar simply restricts the recognizer to five-word sentences without regard to part of speech. Thus, the percentage of correct words expected by chance using this "5-word" grammar would be 2.5%. Deletions and insertions are still possible with this grammar, since a repeated word can be scored as a deletion plus an insertion instead of two substitutions.

**Table 3:** Word accuracy of wearable computer system

| grammar | training set | independent test set |
|---|---|---|
| part-of-speech | 99.3% | 97.8% |
| 5-word sentence | 98.2% (98.4%) (D=5, S=36, I=5, N=2500) | 97.8% |
| unrestricted | 96.4% (97.8%) (D=24, S=32, I=35, N=2500) | 96.8% (98.0%) (D=4, S=6, I=6, N=500) |

Word accuracies; percent correct in parentheses where different. The 5-word grammar limits the recognizer output to 5 words selected from the vocabulary. The other grammars are as before.
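
The relationship between the error counts and the percentages in Table 3 can be checked with a short calculation. The sketch below assumes the usual speech-recognition definitions, which this section does not restate: accuracy = (N - D - S - I) / N and percent correct = (N - D - S) / N, where D, S, and I are the deletion, substitution, and insertion counts and N is the number of reference words. The function name `word_scores` is hypothetical and introduced only for illustration.

```python
def word_scores(n, deletions, substitutions, insertions):
    """Return (accuracy, percent_correct) as percentages.

    Assumed definitions (not restated in this section):
      accuracy        = (N - D - S - I) / N
      percent correct = (N - D - S) / N
    """
    if n <= 0:
        raise ValueError("n must be a positive word count")
    correct = n - deletions - substitutions
    accuracy = 100.0 * (correct - insertions) / n
    percent_correct = 100.0 * correct / n
    return accuracy, percent_correct


# Error counts copied from Table 3.
table_rows = {
    "5-word, training set": dict(n=2500, deletions=5, substitutions=36, insertions=5),
    "unrestricted, training set": dict(n=2500, deletions=24, substitutions=32, insertions=35),
    "unrestricted, test set": dict(n=500, deletions=4, substitutions=6, insertions=6),
}

for name, counts in table_rows.items():
    acc, corr = word_scores(**counts)
    print(f"{name}: accuracy {acc:.1f}%, percent correct {corr:.1f}%")
```

Under these assumed definitions the counts reproduce the tabulated figures. Insertions lower the accuracy but not the percent correct, which is why the two numbers differ only in rows where insertions occur.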

Interestingly, for the part-of-speech, 5-word, and unrestricted tests, the accuracies are essentially the same, suggesting that all the signs in the lexicon can be distinguished from each other using this feature set and method. As in the previous experiment, repeated words represent 25% of the errors in the unrestricted grammar test. In fact, if a simple repeated-word filter is applied to the recognizer output as a post-processing step, the unrestricted grammar test accuracy becomes 97.6%, almost exactly that of the most restrictive grammar! A careful look at the details of the part-of-speech and 5-word grammar tests indicates that the same beginning and ending pronoun restriction may have hurt the performance of the part-of-speech grammar! Thus, the strong grammars are superfluous for this task. In addition, the very similar results between the fair-test and test-on-training cases indicate that the HMMs' training converged and generalized extremely well for the task.
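
The text does not say how the repeated-word filter was implemented. One plausible reading, shown here only as a minimal sketch, is that consecutive duplicate words in the recognizer output are collapsed into a single occurrence. The function name `collapse_repeats` and the example word sequence are hypothetical.

```python
from itertools import groupby


def collapse_repeats(words):
    """Collapse runs of identical consecutive words into one occurrence.

    Assumption: the "repeated word filter" drops immediate repetitions;
    the original implementation is not described in the text.
    """
    # groupby() groups adjacent equal items, so keeping one key per group
    # removes immediate repeats while preserving word order.
    return [word for word, _run in groupby(words)]


# Hypothetical recognizer output with one spuriously repeated word.
recognized = ["i", "want", "want", "book", "you"]
print(collapse_repeats(recognized))
```

A filter of this kind also removes legitimate repetitions, so it only makes sense when no valid sentence contains the same word twice in a row.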

The main result is the high accuracies themselves, which indicate that harder tasks should be attempted. However, why is the wearable system so much more accurate than the desk system? There are several possible factors. First, the wearable system has fewer occlusion problems, both with the face and between the hands. Second, the wearable data set did not have the problem with body rotation that the first data set experienced. Third, each data set was created and verified by separate subjects, with successively better data recording methods. Controlling for these various factors requires a new experiment, described in the next section.


Thad Starner
1998-09-17