The low-level features extracted from video comprise the final element
of our system. The system tracks blobs: regions that are visually
similar and spatially coherent [13,8]. We can
represent these 2-D regions by their low-order statistics. Clusters
of 2-D points have 2-D spatial means and covariance matrices,
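which we shall denote $\mu_s$ and $\Lambda_s$.
The blob spatial
statistics are described in terms of their second-order properties;
for computational convenience we will interpret this as a Gaussian
model:
$$\Pr(O) = \frac{\exp\left[-\frac{1}{2}(O - \mu_s)^T \Lambda_s^{-1} (O - \mu_s)\right]}{(2\pi)^{m/2}\,|\Lambda_s|^{1/2}}$$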
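As a minimal illustration of the Gaussian interpretation above, the following Python sketch computes the 2-D spatial mean and covariance of a cluster of pixel coordinates and evaluates the corresponding Gaussian density at an image point. The function names `blob_statistics` and `gaussian_density` are hypothetical, and the covariance is normalized by N (the maximum-likelihood estimate), which is an assumption; the text does not specify the estimator used by the system.

```python
import math


def blob_statistics(points):
    """Return the 2-D spatial mean and 2x2 covariance of a pixel cluster.

    `points` is a sequence of (x, y) pixel coordinates belonging to one blob.
    Assumption: the covariance is normalized by N, not N - 1.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))


def gaussian_density(p, mean, cov):
    """Evaluate the 2-D Gaussian density with the given mean and covariance at p."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # Squared Mahalanobis distance, using the closed-form 2x2 inverse.
    maha = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))


# Example: four pixels at the corners of a 2x2 square.
mean, cov = blob_statistics([(0, 0), (2, 0), (0, 2), (2, 2)])
peak = gaussian_density(mean, mean, cov)
```

The density is highest at the blob mean and falls off with Mahalanobis distance, so an elongated blob tolerates larger deviations along its major axis than across it.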
These 2-D features are the input to the 3-D blob estimation equation used by Azarbayejani and Pentland [1]. This observation equation relates the 2-D distribution of pixel values to a tracked object's 3-D position and orientation.
These observations supply constraints on the underlying 3-D human model. Due to their statistical nature, observations are easily modeled as soft constraints. Observations are integrated into the dynamic evolution of the system by modeling them as descriptions of potential fields, as discussed in Section 4.2.
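One way to picture a statistical observation acting as a soft constraint is to treat the negative log of a Gaussian as a quadratic potential whose gradient pulls a predicted point toward the observed mean. The sketch below is only an illustration under that assumption: it works purely in the 2-D observation space, the name `potential_and_force` is hypothetical, and it is not the formulation of Section 4.2.

```python
def potential_and_force(p, mean, cov):
    """Quadratic potential and restoring force for a 2-D Gaussian observation.

    Assumed potential: U(p) = 0.5 * (p - mean)^T inv(cov) (p - mean).
    The force is the negative gradient of U: -inv(cov) (p - mean).
    """
    (a, b), (_, d) = cov
    det = a * d - b * b
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # inv(cov) applied to (p - mean), using the closed-form 2x2 inverse.
    gx = (d * dx - b * dy) / det
    gy = (-b * dx + a * dy) / det
    energy = 0.5 * (dx * gx + dy * gy)
    return energy, (-gx, -gy)


# A blob that is more uncertain along x (variance 4) than along y (variance 1).
cov = ((4.0, 0.0), (0.0, 1.0))
energy_x, force_x = potential_and_force((3.0, 1.0), (1.0, 1.0), cov)
energy_y, force_y = potential_and_force((1.0, 3.0), (1.0, 1.0), cov)
```

In this sketch the same displacement costs less energy along the direction in which the observation is less certain, so a broad covariance constrains the model weakly and a tight one constrains it strongly.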