
Related Work

Following a path similar to that of early speech recognition, many previous attempts at machine sign language recognition concentrated on isolated signs or fingerspelling. Space does not permit a thorough review [19], but, in general, most attempts relied either on instrumented gloves or on a desk-based camera system and used some form of template matching or neural networks for recognition. However, current extensible systems are beginning to employ hidden Markov models (HMMs).

Hidden Markov models are used prominently and successfully in speech recognition and, more recently, in handwriting recognition. Consequently, they seem ideal for visual recognition of the complex, structured hand gestures found in sign languages. Explicit segmentation at the word level is not necessary for either training or recognition. Language and context models can be applied at several different levels, and much of the related development of this technology has been done by the speech recognition community [6].
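To illustrate why no explicit word-level segmentation is required, the following is a minimal sketch of the standard HMM forward algorithm, which scores an observation sequence by summing over all hidden state paths (and hence over all implicit segmentations). This is a toy discrete-observation example, not the implementation of any system described here; the two-state model parameters are invented for demonstration.

```python
# Minimal sketch of the HMM forward algorithm (illustrative only; not the
# system described in the text). The probability of an observation sequence
# is computed by marginalizing over all hidden state paths, so the input
# never needs to be segmented explicitly.

def forward(obs, pi, A, B):
    """Return P(obs | model) for a discrete-observation HMM.

    obs : list of observation symbol indices
    pi  : initial state distribution, pi[i]
    A   : transition probabilities, A[i][j] = P(next state j | state i)
    B   : emission probabilities, B[i][k] = P(symbol k | state i)
    """
    n = len(pi)
    # Initialization: alpha[i] = pi[i] * B[i][obs[0]]
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: sum over the previous state at each time step.
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    # Termination: total probability of the sequence.
    return sum(alpha)

# Hypothetical two-state, two-symbol model for demonstration.
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
print(forward([0, 1, 0], pi, A, B))  # prints 0.10893
```

Baum-Welch re-estimation, mentioned below, builds on the same quantities: the forward (and a symmetric backward) pass supplies the expected state occupancies used to update `pi`, `A`, and `B` iteratively.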

When the authors first reported this project in 1995 [15,18], very few uses of HMMs could be found in the computer vision literature [22,13]. At the time, continuous-density HMMs were only beginning to appear in the speech community; continuous gesture recognition was scarce; gesture lexicons were very small; and automatic training through Baum-Welch re-estimation was uncommon. Results were not reported with the standard accuracy measures accepted in the speech and handwriting recognition communities, and training and testing databases were often identical or dependent in some manner.
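The standard measure in question is word accuracy, Acc = (N - S - D - I) / N, where N is the number of reference words and S, D, and I are the substitution, deletion, and insertion counts from a minimum-edit-distance alignment of the recognized word string against the reference. A minimal sketch of how it is computed follows; the example sentences are invented for illustration.

```python
# Sketch of the standard speech-community word accuracy measure:
# Acc = (N - S - D - I) / N. With unit costs, the minimum total error
# count S + D + I is exactly the Levenshtein distance between the
# reference and hypothesis word sequences.

def word_accuracy(reference, hypothesis):
    """Return word accuracy of a hypothesis word list against a reference."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum word errors aligning reference[:i] with hypothesis[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # i deletions
    for j in range(m + 1):
        d[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return (n - d[n][m]) / n

# Hypothetical example: one insertion and one substitution in five words.
ref = "my name is john doe".split()
hyp = "my name is a jon doe".split()
print(word_accuracy(ref, hyp))  # prints 0.6
```

Note that, unlike simple percent-correct, word accuracy penalizes insertions and can therefore be negative, which is why dependent training and testing sets or nonstandard measures can make early results hard to compare.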

Since then, HMM-based gesture recognizers for other tasks have appeared in the literature [21,2], and, last year, several HMM-based continuous sign language systems were demonstrated. In a submission to UIST'97, Liang and Ouhyoung's work on Taiwanese Sign Language [8] shows very encouraging results with a glove-based recognizer. Their HMM-based system recognizes 51 postures, 8 orientations, and 8 motion primitives; combined, these constituents can form a lexicon of 250 words that can be continuously recognized in real time with 90.5% accuracy. At ICCV'98, Vogler and Metaxas described a desk-based 3D camera system that achieves 89.9% word accuracy on a 53-word lexicon [20]. Since the vision process is computationally expensive in their implementation, an electromagnetic tracker is used interchangeably with the three mutually orthogonal calibrated cameras for collecting experimental data.


Thad Starner
1998-09-17