In this project, we track the hands using a single camera in real time without the aid of gloves or markings; only the natural color of the hands is needed. For vision-based sign recognition, the camera can be mounted either in the position of an observer of the signer or at the point of view of the signer himself. These two views can be thought of as second-person and first-person viewpoints, respectively.
Training for a second-person viewpoint is appropriate in the rare
instance when the translation system is to be worn by a hearing person
to translate the signs of a mute or deaf individual. However, such a
system is also appropriate when a signer wishes to control or dictate
to a desktop computer, as is the case in the first experiment. Figure
2 shows the viewpoint of the desk-based
experiment.
The first-person system observes the signer's hands from much the same
viewpoint as the signer himself. Figure 3 shows
the placement of the camera in the cap used in the second experiment,
and demonstrates the resulting
viewpoint. The camera was attached to an SGI for development;
however, current hardware allows for the entire system to be
unobtrusively embedded in the cap itself as a wearable computer. A
matchstick-sized camera such as the Elmo QN401E can be embedded in
the front seam above the brim. The brim can be made into a
relatively good-quality speaker by lining it with a PVDF transducer
(used in thin consumer-grade stereo speakers). Finally, a PC/104-based
CPU, digitizer, and batteries can be placed at the back of the head.
See Starner et al. [17] and the MIT Wearable
Computing Site
(http://wearables.www.media.mit.edu/projects/wearables/)
for more detailed information about wearable computing and related
technologies.
A wearable computer system provides the greatest utility for an ASL to spoken English translator. It can be worn by the signer whenever communication with a non-signer might be necessary, such as for business or on vacation. Providing the signer with a self-contained and unobtrusive first-person view translation system is more feasible than trying to provide second-person translation systems for everyone whom the signer might encounter during the day.
For both systems, color NTSC composite video is captured and analyzed at 320 by 243 pixel resolution. This lower resolution avoids video interlace effects. A Silicon Graphics 200MHz R4400 Indy workstation maintains hand tracking at 10 frames per second, a frame rate which Sperling et al. [14] found sufficient for human recognition. To segment each hand initially, the algorithm scans the image until it finds a pixel of the appropriate color, determined by an a priori model of skin color. Given this pixel as a seed, the region is grown by checking the eight nearest neighbors for the appropriate color. Each pixel checked is considered part of the hand; this, in effect, performs a simple morphological dilation upon the resultant image, which helps to prevent edge and lighting aberrations. The centroid is calculated as a by-product of the growing step and is stored as the seed for the next frame. Since both hands share the same skin tone and cannot be distinguished by color alone, the labels ``left hand'' and ``right hand'' are simply assigned to the leftmost and rightmost blobs, respectively.
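The seeded region-growing step can be sketched as follows. This is an illustrative Python reconstruction, not the original implementation (which ran on the SGI workstation); the function and parameter names and the queue-based flood fill are our own, and `skin_model` stands in for the a priori skin-color test.

```python
import numpy as np
from collections import deque

def grow_hand_region(frame_rgb, seed, skin_model):
    """Seeded region growing, roughly as described in the text.

    frame_rgb : HxWx3 uint8 image (320x243 in the paper's setup)
    seed      : (row, col) of a skin-colored pixel found by scanning
    skin_model: callable(pixel) -> bool, the a priori skin-color test

    Every pixel *checked* (not only every matching pixel) is marked as
    hand, which acts like a one-pixel morphological dilation.
    Returns the binary mask and the centroid of the grown region.
    """
    h, w, _ = frame_rgb.shape
    mask = np.zeros((h, w), dtype=bool)
    visited = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    visited[seed] = True

    while queue:
        r, c = queue.popleft()
        mask[r, c] = True                      # checked pixels count as hand
        if not skin_model(frame_rgb[r, c]):
            continue                           # do not grow past non-skin pixels
        for dr in (-1, 0, 1):                  # eight nearest neighbors
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and not visited[rr, cc]:
                    visited[rr, cc] = True
                    queue.append((rr, cc))

    rows, cols = np.nonzero(mask)
    centroid = (cols.mean(), rows.mean())      # (x, y); reused as the next frame's seed
    return mask, centroid
```

Marking every checked pixel, rather than only the matching ones, is what produces the dilation effect described above.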
Note that an a priori model of skin color may not be appropriate in some situations. For example, with a mobile system, lighting can change the appearance of the hands drastically. However, the image in Figure 3 provides a clue to addressing this problem, at least for the first-person view. The smudge at the bottom of the image is actually the signer's nose. Since the camera is mounted on a cap, the nose always stays in the same place relative to the image. Thus, the signer's nose can be used as a calibration object for generating a model of the hands' skin color for tracking. While this calibration system has been prototyped, it was not used in these experiments.
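A minimal sketch of such a nose-based calibration, compatible with the `skin_model` test used above, might look like the following. The fixed bounding box, the normalized-chromaticity color space, and the Gaussian threshold are all illustrative assumptions, not details taken from the prototype.

```python
import numpy as np

def calibrate_skin_from_nose(frame_rgb, nose_box=(230, 243, 140, 180)):
    """Build a simple skin-color model from the nose region.

    The nose stays in a fixed part of the cap-mounted view, so a fixed
    bounding box (rows r0:r1, cols c0:c1; the values here are
    placeholders) can be sampled at the start of a session.  A Gaussian
    is fit in normalized rg chromaticity, which is fairly robust to
    brightness changes.
    """
    r0, r1, c0, c1 = nose_box
    patch = frame_rgb[r0:r1, c0:c1].reshape(-1, 3).astype(float)
    rgb_sum = patch.sum(axis=1, keepdims=True) + 1e-6
    chroma = patch[:, :2] / rgb_sum                     # (r, g) chromaticities
    mean = chroma.mean(axis=0)
    cov = np.cov(chroma, rowvar=False)
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(2))

    def skin_model(pixel, threshold=9.0):
        """True if the pixel's chromaticity is close to the nose's."""
        p = pixel.astype(float)
        c = p[:2] / (p.sum() + 1e-6)
        d = c - mean
        return float(d @ cov_inv @ d) < threshold       # squared Mahalanobis distance
    return skin_model
```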
After extracting the hand blobs from the scene, second moment analysis is performed on each blob. A sixteen-element feature vector is constructed from each hand's x and y position, change in x and y between frames, area (in pixels), angle of the axis of least inertia (found from the first eigenvector of the blob) [5], length of this eigenvector, and eccentricity of the bounding ellipse.
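The per-hand half of this feature vector can be derived from the blob's first and second moments. The sketch below is a plausible reconstruction under the description above; the exact scaling of the axis length and the particular eccentricity formula are assumptions.

```python
import numpy as np

def blob_features(mask, prev_centroid):
    """Per-hand features from second moment analysis of a binary blob.

    Returns the eight values listed in the text for one hand: x, y,
    dx, dy, area, angle of the axis of least inertia, length of that
    axis (from the dominant eigenvalue), and eccentricity of the fitted
    ellipse.  Concatenating both hands gives the 16-element vector.
    prev_centroid is the (x, y) centroid from the previous frame.
    """
    rows, cols = np.nonzero(mask)
    area = float(rows.size)
    x, y = cols.mean(), rows.mean()
    dx, dy = x - prev_centroid[0], y - prev_centroid[1]

    # Central second moments of the blob.
    xc, yc = cols - x, rows - y
    mu20, mu02 = (xc ** 2).mean(), (yc ** 2).mean()
    mu11 = (xc * yc).mean()

    # Eigen-decomposition of the 2x2 covariance gives the axis of least inertia.
    cov = np.array([[mu20, mu11], [mu11, mu02]])
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    major, minor = eigvals[1], eigvals[0]
    axis = eigvecs[:, 1]                          # dominant eigenvector
    angle = np.arctan2(axis[1], axis[0])
    length = 2.0 * np.sqrt(major)                 # axis length, up to a scale factor
    eccentricity = np.sqrt(1.0 - minor / major) if major > 0 else 0.0

    return np.array([x, y, dx, dy, area, angle, length, eccentricity])
```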
When tracking skin tones, the above analysis helps to model situations of hand ambiguity implicitly. When a hand occludes either the other hand or the face (or the nose in the case of the wearable version), color tracking alone cannot resolve the ambiguity. Since the face remains in the same area of the frame, its position can be determined and discounted. However, the hands move rapidly and occlude each other often. When occlusion occurs, the hands appear to the above system as a single blob of larger than normal area with significantly different moments than either of the two hands in the previous frame. In this implementation, each of the two hands is assigned the features of this single blob whenever occlusion occurs. While not as informative as tracking each hand separately, this method still retains a surprising amount of discriminating information: the occlusion event itself is implicitly modeled, and the combined position and moment information is retained. This method, combined with the time context provided by hidden Markov models, is sufficient to distinguish between many different signs in which hand occlusion occurs.
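The occlusion rule itself is simple to express: when skin tracking returns only one blob, that blob's features are assigned to both hands. The helper below is hypothetical and builds on the `blob_features()` sketch above; the blob list and previous-centroid bookkeeping are our own framing.

```python
import numpy as np

def frame_features(blobs, prev_left, prev_right):
    """Assemble the 16-element frame vector, handling hand occlusion.

    blobs: list of binary masks returned by skin tracking this frame.
    prev_left, prev_right: (x, y) centroids of each hand last frame.
    When the hands (or a hand and the face/nose) merge into one blob,
    that single blob's features are assigned to both hands, so the
    occlusion event is encoded implicitly in the feature stream.
    """
    if len(blobs) >= 2:
        # Leftmost blob -> ``left hand'', rightmost -> ``right hand''.
        blobs = sorted(blobs, key=lambda m: np.nonzero(m)[1].mean())
        left = blob_features(blobs[0], prev_left)
        right = blob_features(blobs[-1], prev_right)
    else:
        # Occlusion: the single merged blob stands in for both hands.
        left = blob_features(blobs[0], prev_left)
        right = blob_features(blobs[0], prev_right)
    return np.concatenate([left, right])
```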