The notion of grouping atomic parts of a scene together to form blob-like entities based on proximity and visual appearance is a natural one, and has been of interest to visual scientists since the Gestalt psychologists studied grouping criteria early in this century.
In modern computer vision processing we seek to group image pixels together and to segment images based on visual coherence, but the features obtained from such efforts are usually taken to be the boundaries, or contours, of these regions rather than the regions themselves. In very complex scenes, such as those containing people or natural objects, contour features have proven unreliable and difficult to find and use.
The blob representation that we use was developed by Pentland and Kauth et al. [13, 8] as a way of extracting an extremely compact, structurally meaningful description of multi-spectral satellite (MSS) imagery. In this method, a feature vector is formed at each pixel by appending the (x,y) spatial coordinates to the spectral (or textural) components of the imagery. These feature vectors are then clustered so that image properties such as color and spatial similarity combine to form coherent connected regions, or ``blobs,'' in which all the pixels have similar image properties. This blob description method is, in fact, a special case of recent Minimum Description Length (MDL) algorithms [4, 16].
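The feature construction described above can be illustrated with a minimal sketch. It is not the clustering procedure of the cited work: the text says only that the feature vectors are clustered, so plain k-means is substituted here as an assumed stand-in. The names `pixel_features`, `kmeans`, `blob_labels`, and the `spatial_weight` parameter are hypothetical. Note also that k-means alone does not guarantee connected regions; the spatial coordinates only encourage them.

```python
import random


def pixel_features(image, spatial_weight=1.0):
    """Return one (x, y, c1, ..., cn) feature vector per pixel, in row-major order.

    `image` is a list of rows; each pixel is a tuple of spectral components.
    `spatial_weight` (an assumed knob) scales position relative to color.
    """
    features = []
    for y, row in enumerate(image):
        for x, spectral in enumerate(row):
            features.append((x * spatial_weight, y * spatial_weight, *spectral))
    return features


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(features, k, iterations=20, seed=0):
    """Plain k-means, used here only as a stand-in clustering step."""
    rng = random.Random(seed)
    centers = rng.sample(features, k)
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assign each feature vector to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(f, centers[j]))
            for f in features
        ]
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = [f for f, label in zip(features, labels) if label == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def blob_labels(image, k, spatial_weight=1.0):
    """Cluster the pixels and return the labels as a 2-D grid matching the image."""
    width = len(image[0])
    labels, _ = kmeans(pixel_features(image, spatial_weight), k)
    return [labels[i:i + width] for i in range(0, len(labels), width)]
```

For example, an 8x8 image whose left half is red and whose right half is blue, clustered with `blob_labels(image, 2)`, yields one label for each half. Raising `spatial_weight` makes the clusters more spatially compact at the expense of color homogeneity.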
The Pfinder system is related to body-tracking research such as that of Rehg and Kanade, Rohr, and Gavrila and Davis, which uses kinematic models, and that of Pentland and Horowitz and Metaxas and Terzopoulos, which uses dynamic models. However, in contrast to Pfinder, these other systems all require accurate initialization and use local image features. Consequently, they have difficulty with occlusion and require massive computational resources.
Functionally, our systems are perhaps most closely related to the work of Bichsel and of Baumberg and Hogg. These systems segment the person from the background in real time using only a standard workstation. Their limitation is that they analyze only the silhouette of the person, not the person's shape or internal features. Consequently, they cannot track head and hands, determine body pose, or recognize any but the simplest gestures.