Before the system attempts to locate people in a scene, it must learn
the scene. To accomplish this, Pfinder begins by acquiring a sequence
of video frames that do not contain a person. Typically this sequence
is relatively long, a second or more, in order to obtain a good
estimate of the color covariance associated with each image pixel.
For computational efficiency, color models are built in both the
standard (Y,U,V) and brightness-normalized color spaces.
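To make the initialization concrete, the following is a minimal sketch of the kind of per-pixel statistics this step computes. The frame acquisition, the exact RGB-to-YUV conversion, the brightness-normalization formula, and all function names here are illustrative assumptions, not details taken from Pfinder itself.

```python
import numpy as np

def rgb_to_yuv(frames):
    """Convert (..., 3) float RGB values to (Y, U, V)."""
    m = np.array([[ 0.299,  0.587,  0.114],
                  [-0.147, -0.289,  0.436],
                  [ 0.615, -0.515, -0.100]])
    return frames @ m.T

def brightness_normalize(yuv, eps=1e-6):
    """One plausible normalization: divide U and V by Y."""
    y = yuv[..., :1]
    return np.concatenate([y, yuv[..., 1:] / (y + eps)], axis=-1)

def learn_scene(frames):
    """Estimate per-pixel mean and covariance from person-free frames.

    frames: (T, H, W, 3) array of color values.
    Returns (mean, cov) with shapes (H, W, 3) and (H, W, 3, 3).
    """
    mean = frames.mean(axis=0)
    diff = frames - mean                                   # (T, H, W, 3)
    cov = np.einsum('thwi,thwj->hwij', diff, diff) / (frames.shape[0] - 1)
    return mean, cov

# Example: roughly one second of video at 30 fps, 120x160 pixels.
rgb = np.random.rand(30, 120, 160, 3)
yuv = rgb_to_yuv(rgb)
mean_yuv, cov_yuv = learn_scene(yuv)
mean_norm, cov_norm = learn_scene(brightness_normalize(yuv))
```

A longer person-free sequence simply adds terms to the mean and covariance sums, which is why a second or more of video yields a more reliable covariance estimate.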
With a static camera (the normal case), the pixels of the input images
correspond exactly to the points in the scene texture model, which makes
processing very efficient. To accommodate camera rotation and zooming,
the input image must first be transformed back into the coordinate
system of the scene texture model before the input image and scene model
are compared. Although we can estimate the camera transform parameters
in real time [2, 13], we cannot currently apply the transform to the
input image in real time.
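The sketch below illustrates this alignment step under the assumption that the estimated rotation/zoom transform is available as a 3x3 homography mapping input-frame coordinates to scene-model coordinates; the use of OpenCV's warpPerspective and the function name are implementation choices of this sketch, not the paper's.

```python
import cv2
import numpy as np

def align_to_scene_model(frame, H, model_shape):
    """Warp an input frame into the scene texture model's coordinates.

    frame:       input image from the (rotated or zoomed) camera.
    H:           3x3 homography mapping frame coords to model coords
                 (assumed already estimated elsewhere).
    model_shape: (rows, cols) of the scene texture model.
    """
    rows, cols = model_shape
    # Resample the frame so its pixels line up with the model's points.
    return cv2.warpPerspective(frame, H, (cols, rows))

# With a static camera H is the identity, the warp is effectively a copy,
# and per-pixel comparison against the scene model can proceed directly.
frame = np.zeros((120, 160, 3), dtype=np.uint8)
aligned = align_to_scene_model(frame, np.eye(3), (120, 160))
```

The per-pixel resampling over the whole image is what makes applying the transform expensive relative to estimating its parameters, which is consistent with the real-time limitation noted above.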