In this project, we track the hands using a single camera in real time without the aid of gloves or markings; only the natural color of the hands is needed. For vision-based sign recognition, the camera can be mounted either in the position of an observer of the signer or at the point of view of the signer himself. These two views can be thought of as second-person and first-person viewpoints, respectively.
Training for a second-person viewpoint is appropriate in the rare
instance when the translation system is to be worn by a hearing person
to translate the signs of a mute or deaf individual. However, such a
system is also appropriate when a signer wishes to control or dictate
to a desktop computer, as is the case in the first experiment. Figure
2 shows the viewpoint of the desk-based
experiment.
The first-person system observes the signer's hands from much the same
viewpoint as the signer himself. Figure 3 shows
the placement of the camera in the cap used in the second experiment,
and demonstrates the resulting
viewpoint. The camera was attached to an SGI for development;
however, current hardware allows for the entire system to be
unobtrusively embedded in the cap itself as a wearable computer. A
matchstick-sized camera such as the Elmo QN401E can be embedded in
the front seam above the brim. The brim can be made into a
relatively good-quality speaker by lining it with a PVDF transducer
(used in thin consumer-grade stereo speakers). Finally, a PC/104-based
CPU, digitizer, and batteries can be placed at the back of the head.
See Starner et al. [17] and the MIT Wearable
Computing Site
(http://wearables.www.media.mit.edu/projects/wearables/)
for more detailed information about wearable computing and related
technologies.
A wearable computer system provides the greatest utility for an ASL to spoken English translator. It can be worn by the signer whenever communication with a non-signer might be necessary, such as for business or on vacation. Providing the signer with a self-contained and unobtrusive first-person view translation system is more feasible than trying to provide second-person translation systems for everyone whom the signer might encounter during the day.
For both systems, color NTSC composite video is captured and analyzed at 320 by 243 pixel resolution. This lower resolution avoids video interlace effects. A Silicon Graphics 200MHz R4400 Indy workstation maintains hand tracking at 10 frames per second, a frame rate which Sperling et al. [14] found sufficient for human recognition. To segment each hand initially, the algorithm scans the image until it finds a pixel of the appropriate color, determined by an a priori model of skin color. Given this pixel as a seed, the region is grown by checking the eight nearest neighbors for the appropriate color. Each pixel checked is considered part of the hand; this, in effect, performs a simple morphological dilation upon the resultant image, which helps to prevent edge and lighting aberrations. The centroid is calculated as a by-product of the growing step and is stored as the seed for the next frame. Since both hands share the same skin tone and cannot be distinguished by color alone, the labels ``left hand'' and ``right hand'' are simply assigned to the leftmost and rightmost blobs, respectively.
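The seeded region-growing step can be sketched as follows. This is an illustrative Python reconstruction, not the original implementation (which ran on the SGI workstation); the function and parameter names and the queue-based flood fill are our own, and `skin_model` stands in for the a priori skin-color test.

```python
import numpy as np
from collections import deque

def grow_hand_region(frame_rgb, seed, skin_model):
    """Seeded region growing, roughly as described in the text.

    frame_rgb : HxWx3 uint8 image (320x243 in the paper's setup)
    seed      : (row, col) of a skin-colored pixel found by scanning
    skin_model: callable(pixel) -> bool, the a priori skin-color test

    Every pixel *checked* (not only every matching pixel) is marked as
    hand, which acts like a one-pixel morphological dilation.
    Returns the binary mask and the centroid of the grown region.
    """
    h, w, _ = frame_rgb.shape
    mask = np.zeros((h, w), dtype=bool)
    visited = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    visited[seed] = True

    while queue:
        r, c = queue.popleft()
        mask[r, c] = True                      # checked pixels count as hand
        if not skin_model(frame_rgb[r, c]):
            continue                           # do not grow past non-skin pixels
        for dr in (-1, 0, 1):                  # eight nearest neighbors
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and not visited[rr, cc]:
                    visited[rr, cc] = True
                    queue.append((rr, cc))

    rows, cols = np.nonzero(mask)
    centroid = (cols.mean(), rows.mean())      # (x, y); reused as the next frame's seed
    return mask, centroid
```

Marking every checked pixel, rather than only the matching ones, is what produces the dilation effect described above.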
Note that an a priori model of skin color may not be appropriate in some situations. For example, with a mobile system, lighting can change the appearance of the hands drastically. However, the image in Figure 3 provides a clue to addressing this problem, at least for the first-person view. The smudge at the bottom of the image is actually the signer's nose. Since the camera is mounted on a cap, the nose always stays in the same place relative to the image. Thus, the signer's nose can be used as a calibration object for generating a model of the hands' skin color for tracking. While this calibration system has been prototyped, it was not used in these experiments.
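A minimal sketch of such a nose-based calibration, compatible with the `skin_model` test used above, might look like the following. The fixed bounding box, the normalized-chromaticity color space, and the Gaussian threshold are all illustrative assumptions, not details taken from the prototype.

```python
import numpy as np

def calibrate_skin_from_nose(frame_rgb, nose_box=(230, 243, 140, 180)):
    """Build a simple skin-color model from the nose region.

    The nose stays in a fixed part of the cap-mounted view, so a fixed
    bounding box (rows r0:r1, cols c0:c1; the values here are
    placeholders) can be sampled at the start of a session.  A Gaussian
    is fit in normalized rg chromaticity, which is fairly robust to
    brightness changes.
    """
    r0, r1, c0, c1 = nose_box
    patch = frame_rgb[r0:r1, c0:c1].reshape(-1, 3).astype(float)
    rgb_sum = patch.sum(axis=1, keepdims=True) + 1e-6
    chroma = patch[:, :2] / rgb_sum                     # (r, g) chromaticities
    mean = chroma.mean(axis=0)
    cov = np.cov(chroma, rowvar=False)
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(2))

    def skin_model(pixel, threshold=9.0):
        """True if the pixel's chromaticity is close to the nose's."""
        p = pixel.astype(float)
        c = p[:2] / (p.sum() + 1e-6)
        d = c - mean
        return float(d @ cov_inv @ d) < threshold       # squared Mahalanobis distance
    return skin_model
```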
After extracting the hand blobs from the scene, second moment analysis is performed on each blob. A sixteen-element feature vector is constructed from each hand's x and y position, change in x and y between frames, area (in pixels), angle of the axis of least inertia (found from the first eigenvector of the blob) [5], length of this eigenvector, and eccentricity of the bounding ellipse.
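The per-hand half of this feature vector can be derived from the blob's first and second moments. The sketch below is a plausible reconstruction under the description above; the exact scaling of the axis length and the particular eccentricity formula are assumptions.

```python
import numpy as np

def blob_features(mask, prev_centroid):
    """Per-hand features from second moment analysis of a binary blob.

    Returns the eight values listed in the text for one hand: x, y,
    dx, dy, area, angle of the axis of least inertia, length of that
    axis (from the dominant eigenvalue), and eccentricity of the fitted
    ellipse.  Concatenating both hands gives the 16-element vector.
    prev_centroid is the (x, y) centroid from the previous frame.
    """
    rows, cols = np.nonzero(mask)
    area = float(rows.size)
    x, y = cols.mean(), rows.mean()
    dx, dy = x - prev_centroid[0], y - prev_centroid[1]

    # Central second moments of the blob.
    xc, yc = cols - x, rows - y
    mu20, mu02 = (xc ** 2).mean(), (yc ** 2).mean()
    mu11 = (xc * yc).mean()

    # Eigen-decomposition of the 2x2 covariance gives the axis of least inertia.
    cov = np.array([[mu20, mu11], [mu11, mu02]])
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    major, minor = eigvals[1], eigvals[0]
    axis = eigvecs[:, 1]                          # dominant eigenvector
    angle = np.arctan2(axis[1], axis[0])
    length = 2.0 * np.sqrt(major)                 # axis length, up to a scale factor
    eccentricity = np.sqrt(1.0 - minor / major) if major > 0 else 0.0

    return np.array([x, y, dx, dy, area, angle, length, eccentricity])
```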
When tracking skin tones, the above analysis helps to model situations of hand ambiguity implicitly. When a hand occludes either the other hand or the face (or the nose in the case of the wearable version), color tracking alone cannot resolve the ambiguity. Since the face remains in the same area of the frame, its position can be determined and discounted. However, the hands move rapidly and occlude each other often. When occlusion occurs, the hands appear to the above system as a single blob of larger than normal area with significantly different moments than either of the two hands in the previous frame. In this implementation, each of the two hands is assigned the features of this single blob whenever occlusion occurs. While not as informative as tracking each hand separately, this method still retains a surprising amount of discriminating information: the occlusion event itself is implicitly modeled, and the combined position and moment information is retained. This method, combined with the time context provided by hidden Markov models, is sufficient to distinguish between many different signs in which hand occlusion occurs.
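The occlusion rule itself is simple to express: when skin tracking returns only one blob, that blob's features are assigned to both hands. The helper below is hypothetical and builds on the `blob_features()` sketch above; the blob list and previous-centroid bookkeeping are our own framing.

```python
import numpy as np

def frame_features(blobs, prev_left, prev_right):
    """Assemble the 16-element frame vector, handling hand occlusion.

    blobs: list of binary masks returned by skin tracking this frame.
    prev_left, prev_right: (x, y) centroids of each hand last frame.
    When the hands (or a hand and the face/nose) merge into one blob,
    that single blob's features are assigned to both hands, so the
    occlusion event is encoded implicitly in the feature stream.
    """
    if len(blobs) >= 2:
        # Leftmost blob -> ``left hand'', rightmost -> ``right hand''.
        blobs = sorted(blobs, key=lambda m: np.nonzero(m)[1].mean())
        left = blob_features(blobs[0], prev_left)
        right = blob_features(blobs[-1], prev_right)
    else:
        # Occlusion: the single merged blob stands in for both hands.
        left = blob_features(blobs[0], prev_left)
        right = blob_features(blobs[0], prev_right)
    return np.concatenate([left, right])
```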