The low-level features extracted from video comprise the final element of
our system. The system tracks blobs: regions that are visually similar and
spatially coherent. We can represent these 2-D regions by their
low-order statistics. Clusters of 2-D points have 2-D spatial means
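and covariance matrices, which we shall denote $\mu$ and $\Sigma$.
The blob spatial statistics are described in terms of their second-order
properties; for computational convenience we will interpret this as a
Gaussian model over a 2-D image point $\mathbf{x}$:

$$
p(\mathbf{x}; \mu, \Sigma) = \frac{1}{2\pi\,|\Sigma|^{1/2}}
\exp\!\left[-\tfrac{1}{2}(\mathbf{x}-\mu)^{T}\Sigma^{-1}(\mathbf{x}-\mu)\right]
$$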
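
As a minimal sketch of these low-order statistics, the following Python computes the spatial mean and covariance of a cluster of 2-D pixel coordinates and evaluates the corresponding 2-D Gaussian density. The function names `blob_statistics` and `gaussian_density` are illustrative assumptions, not part of the system described here, and the covariance uses the population (divide-by-n) normalization, which is also an assumption.

```python
import math


def blob_statistics(points):
    """Return (mean, covariance) for a list of (x, y) pixel coordinates.

    Hypothetical helper: the covariance is the population estimate
    (normalized by n), returned as a 2x2 tuple of tuples.
    """
    n = len(points)
    if n == 0:
        raise ValueError("a blob needs at least one pixel")
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))


def gaussian_density(point, mean, cov):
    """Evaluate the 2-D Gaussian density with the given mean and covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Squared Mahalanobis distance, using the closed-form 2x2 inverse.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))
```

For example, the four corner pixels `[(0, 0), (2, 0), (0, 2), (2, 2)]` have mean `(1.0, 1.0)` and an identity covariance, so the density at the mean is `1 / (2 * math.pi)`.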
These 2-D features are the input to the 3-D blob estimation equation used by Azarbayejani and Pentland [1]. This observation equation relates the 2-D distribution of pixel values to a tracked object's 3-D position and orientation.
These observations supply constraints on the underlying 3-D human model. Due to their statistical nature, observations are easily modeled as soft constraints. Observations are integrated into the dynamic evolution of the system by modeling them as descriptions of potential fields, as discussed in Section 2.2.1.
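
One common way to read a Gaussian observation as a soft constraint is as a quadratic potential: the energy is half the squared Mahalanobis distance from the observed mean, and the restoring force is its negative gradient. The sketch below illustrates that reading in 2-D only. It is an assumption made for illustration, and the formulation in Section 2.2.1 may differ. The name `quadratic_potential` is hypothetical.

```python
def quadratic_potential(point, mean, cov):
    """Return (energy, force) of a Gaussian-shaped soft constraint in 2-D.

    Assumed form: energy = 0.5 * r^T cov^{-1} r with r = point - mean,
    and force = -cov^{-1} r, which pulls the point toward the mean.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Gradient of the energy: cov^{-1} r, via the closed-form 2x2 inverse.
    gx = (d * dx - b * dy) / det
    gy = (-c * dx + a * dy) / det
    energy = 0.5 * (dx * gx + dy * gy)
    return energy, (-gx, -gy)
```

Under this reading, a direction with large observed variance yields a weak force, so uncertain observations constrain the model less than confident ones.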