We assume that most of the time Pfinder will be processing a scene consisting of a relatively static environment, such as an office, and a single moving person. Consequently, it is appropriate to use different types of models for the scene and for the person.
We model the scene surrounding the human as a texture surface; each point on the texture surface is associated with a mean color value and a distribution about that mean. Color is expressed in YUV space. The color distribution of each pixel is modeled as a Gaussian with a full covariance matrix. Thus, for instance, a fluttering white curtain in front of a black wall will have a color covariance that is very elongated in the luminance direction but narrow in the chrominance directions.
We define $\mu_0$ to be the mean (Y,U,V) of a point on the texture surface, and $K_0$ to be the covariance of that point's distribution. The spatial position of the point is treated implicitly because, given a particular image pixel at location $(x,y)$, we need only consider the color mean and covariance of the corresponding texture location. In Pfinder the scene texture map is considered to be class zero.
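As a concrete illustration, the following minimal sketch (in Python, with hypothetical array names \texttt{mu0} and \texttt{K0} and an arbitrarily chosen image size) shows how per-pixel statistics of this kind might be stored, and how the squared Mahalanobis distance of an observed pixel from the class-zero distribution could be computed:

\begin{verbatim}
import numpy as np

H, W = 240, 320                        # hypothetical image size
mu0 = np.zeros((H, W, 3))              # per-pixel mean (Y, U, V)
K0 = np.tile(np.eye(3), (H, W, 1, 1))  # per-pixel full 3x3 color covariance

def mahalanobis_sq(row, col, pixel_yuv):
    """Squared Mahalanobis distance of an observed (Y, U, V) value
    from the scene (class-zero) distribution at one texture location."""
    d = pixel_yuv - mu0[row, col]
    return d @ np.linalg.inv(K0[row, col]) @ d
\end{verbatim}

Storing a full covariance, rather than a diagonal one, is what lets the model capture correlated color variation such as the fluttering-curtain example above.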
One of the key outputs of Pfinder is an indication of which scene pixels are occluded by the human and which are visible. This information is critical for low-bandwidth coding (the background need not be coded) and for the video/graphics compositing required for ``augmented reality'' applications.
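A sketch of how such an occlusion map could drive compositing, assuming a hypothetical boolean array \texttt{support} that marks pixels where the person occludes the scene:

\begin{verbatim}
import numpy as np

def composite(scene, person, support):
    """Replace scene pixels with person pixels wherever the support
    map says the person occludes the scene; graphics layers can be
    composited the same way, in depth order."""
    out = scene.copy()
    out[support] = person[support]
    return out
\end{verbatim}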
In each frame, the statistics of visible pixels are recursively updated using a simple adaptive filter.
This allows us to compensate for changes in lighting and even for object movement. For instance, if a person moves a book, the texture map changes both where the book was and where it now is. Because we track the person, we know that these changed areas are not part of the person and therefore still belong to the scene texture model, so we update their statistics to the new values. The updating process is recursive, and even large changes in illumination can be substantially compensated for within two or three seconds.
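A simple adaptive filter of this kind can be written as the exponentially weighted recursive update $\mu_t = \alpha\,y_t + (1-\alpha)\,\mu_{t-1}$, where $y_t$ is the observed pixel value. The sketch below assumes that form, with a hypothetical gain $\alpha$, applied only at pixels marked as visible (unoccluded) scene:

\begin{verbatim}
import numpy as np

def update_scene_mean(mu, frame_yuv, visible, alpha=0.2):
    """One recursive update step: blend the new frame into the stored
    per-pixel means at visible (unoccluded) scene pixels only.
    Old values are forgotten geometrically, so a sustained change in
    illumination is largely absorbed after a few tens of frames."""
    mu = mu.copy()
    mu[visible] = alpha * frame_yuv[visible] + (1.0 - alpha) * mu[visible]
    return mu
\end{verbatim}

The per-pixel covariance $K_0$ can be updated with an analogous recursive rule.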