

The Inverse Observation Model

It only remains to bring information from the human model back down to the initial stages of the vision system. In the absence of this information, the pixel classification decisions were forced to rely solely on temporal smoothness constraints in the 2-D image plane. The decision rule takes the form:


\begin{displaymath}
\Gamma_{ij} = \arg \max_{k}
\left[ \Pr( {\bf o}_{ij} \vert {\mbox{\boldmath$ \mu $}}_k, {\mbox{\boldmath$ \Lambda $}}_k ) \right] \qquad (6)
\end{displaymath}

where ${\bf o}_{ij}$ is the observation at pixel $(i,j)$, $\Gamma_{ij}$ is the label assigned to that pixel, and $({\mbox{\boldmath$ \mu $}}_k, {\mbox{\boldmath$ \Lambda $}}_k)$ are the mean and covariance (the first- and second-order statistics) of model $k$.
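As an illustration of decision rule (6), the following is a minimal NumPy sketch, assuming each model $k$ is a full Gaussian over a $d$-dimensional per-pixel observation vector. The function names `gaussian_log_likelihood` and `classify_pixels` are hypothetical, and working with log-likelihoods (which leaves the argmax unchanged) is our choice for numerical stability, not something the text prescribes.

```python
import numpy as np

def gaussian_log_likelihood(obs, mean, cov):
    """Log of a multivariate normal density evaluated at each row of obs (N, d)."""
    d = mean.shape[0]
    diff = obs - mean                                        # deviations from the model mean
    cov_inv = np.linalg.inv(cov)
    mahal = np.einsum("ni,ij,nj->n", diff, cov_inv, diff)    # squared Mahalanobis distances
    _, logdet = np.linalg.slogdet(cov)                       # log |Lambda_k|
    return -0.5 * (mahal + logdet + d * np.log(2.0 * np.pi))

def classify_pixels(features, means, covs):
    """Label each pixel with the model k that maximises Pr(o_ij | mu_k, Lambda_k).

    features: (H, W, d) array of per-pixel observations o_ij.
    means, covs: per-model means (d,) and covariances (d, d).
    Returns an (H, W) integer label image Gamma.
    """
    h, w, d = features.shape
    obs = features.reshape(-1, d)
    scores = np.stack([gaussian_log_likelihood(obs, m, c) for m, c in zip(means, covs)])
    return scores.argmax(axis=0).reshape(h, w)
```

For example, with two unit-covariance models centred at $(0, 0)$ and $(5, 5)$, pixels whose observations lie near $(0, 0)$ receive label 0 and the others label 1.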

Since the human model exists in 3-D, a projection operation is required to convert the model's 3-D predictions into the 2-D features used by the vision system. Given the current state of the model ${\bf q}$, it is possible to compute the state of the individual link that corresponds to a specific tracked feature (for example, the hand) and, from it, the 3-D mean and covariance of that feature. Then, given a model of the camera, the projection of these statistics into 2-D can be computed; we denote it $({\mbox{\boldmath$ \mu $}}^{*},{\mbox{\boldmath$ \Lambda $}}^{*})$. For the first moment (the mean), this calculation is a perspective projection:

\begin{displaymath}
{\mbox{\boldmath$ \mu $}}^{*} = \left[ \begin{array}{c} u \\ v \end{array} \right]
= \left[ \begin{array}{c} \frac{x}{1 + \frac{z}{f}} \\
\frac{y}{1 + \frac{z}{f}} \end{array} \right]
\end{displaymath}

where $(x, y, z)$ is the mean of the 3-D description of the link, $f$ is the focal length of the camera model, and $(u, v)$ is the projection of the mean into 2-D. Projecting the second moment (the covariance) is more difficult, because the perspective projection of a Gaussian distribution is not itself Gaussian. We therefore employ an approximation:

\begin{displaymath}
{\mbox{\boldmath$ \Lambda $}}^{*} = \frac{{\mbox{\boldmath$ \Lambda $}}_{xy}}{(1 + \frac{z}{f})^2}
\end{displaymath}

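where ${\mbox{\boldmath$ \Lambda $}}_{xy}$ is the orthogonal projection of the 3-D covariance onto the image plane, i.e., its upper-left $2 \times 2$ block over the $x$ and $y$ coordinates. In effect, the 2-D covariance is the orthographic covariance scaled by the square of the same perspective factor, $1/(1 + \frac{z}{f})$, that is applied to the mean.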
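The two projection formulas can be sketched together as follows. This is a minimal illustration assuming the link's 3-D mean and $3 \times 3$ covariance are already available in camera coordinates; the function name `project_link` is hypothetical.

```python
import numpy as np

def project_link(mean3d, cov3d, f):
    """Project a 3-D Gaussian link description into 2-D image coordinates.

    mean3d: (x, y, z) mean of the link; cov3d: its 3x3 covariance; f: focal length.
    Returns (mu_star, lambda_star) using the camera model given in the text.
    """
    x, y, z = mean3d
    scale = 1.0 + z / f                          # perspective divisor 1 + z/f
    mu_star = np.array([x / scale, y / scale])   # (u, v)
    lambda_xy = np.asarray(cov3d)[:2, :2]        # orthogonal projection: keep the x-y block
    lambda_star = lambda_xy / scale**2           # approximate projection of the second moment
    return mu_star, lambda_star
```

For instance, a link with mean $(2, 4, 10)$ and covariance $\mathrm{diag}(4, 8, 1)$ seen with $f = 10$ has divisor $1 + 10/10 = 2$, giving ${\mbox{\boldmath$ \mu $}}^{*} = (1, 2)$ and ${\mbox{\boldmath$ \Lambda $}}^{*} = \mathrm{diag}(1, 2)$.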

Since the vision system uses a stochastic framework, it is necessary to represent this link projection as a probabilistic model:

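\begin{displaymath}
\Pr( {\bf o} \vert {\bf q} ) =
\frac{\exp\left( -\frac{1}{2} \left( {\bf o} - {\mbox{\boldmath$ \mu $}}^{*}_{k} \right)^{T} \left( {\mbox{\boldmath$ \Lambda $}}^{*}_{k} \right)^{-1} \left( {\bf o} - {\mbox{\boldmath$ \mu $}}^{*}_{k} \right) \right)}
{2\pi \, {\left\vert {\mbox{\boldmath$ \Lambda $}}^{*}_{k}\right\vert}^{\frac{1}{2}}}
\end{displaymath}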
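Evaluating this density for a single 2-D observation can be sketched as below, assuming ${\bf o}$, ${\mbox{\boldmath$ \mu $}}^{*}_{k}$ and ${\mbox{\boldmath$ \Lambda $}}^{*}_{k}$ are expressed in the same image coordinates. The function name `projected_link_density` is hypothetical; solving a linear system instead of forming the inverse explicitly is a standard numerical choice.

```python
import numpy as np

def projected_link_density(o, mu_star, lambda_star):
    """Evaluate the 2-D Gaussian Pr(o | q) defined by the projected link statistics."""
    diff = np.asarray(o, dtype=float) - mu_star
    mahal = diff @ np.linalg.solve(lambda_star, diff)            # (o - mu*)^T Lambda*^{-1} (o - mu*)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(lambda_star))     # 2*pi*|Lambda*|^(1/2) for a 2-D Gaussian
    return np.exp(-0.5 * mahal) / norm
```

At the projected mean with identity covariance the density is $1/(2\pi)$, and it falls off with the squared Mahalanobis distance from ${\mbox{\boldmath$ \mu $}}^{*}_{k}$.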

Once the 3-D model features have been projected into 2-D camera coordinates, they can be integrated into the 2-D probabilistic decision framework of Equation (6). This supplies the Maximum A Posteriori decision rule with the considerably stronger prior information contained in the higher-level models, rather than leaving it to rely on 2-D temporal smoothness constraints alone.
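The text does not spell out how the projected models enter the decision rule, so the following is only a sketch under a stated assumption: each projected link Gaussian is treated as an (unnormalized across models) spatial log-prior over pixel positions and added to the per-pixel log-likelihood, which is the log form of Bayes' rule. It also assumes ${\mbox{\boldmath$ \mu $}}^{*}$ has already been converted to pixel units, with $(u, v)$ taken as (column, row). The names `projected_log_prior` and `map_labels` are hypothetical.

```python
import numpy as np

def projected_log_prior(height, width, mu_star, lambda_star):
    """Hypothetical spatial log-prior: log of the projected 2-D Gaussian at every pixel.

    Assumes (u, v) = (column, row) in pixel units, i.e. mu_star is in image coordinates.
    """
    v, u = np.mgrid[0:height, 0:width]              # row index v, column index u
    diff = np.stack([u, v], axis=-1) - mu_star      # (H, W, 2) offsets from mu*
    inv = np.linalg.inv(lambda_star)
    mahal = np.einsum("hwi,ij,hwj->hw", diff, inv, diff)
    log_norm = np.log(2.0 * np.pi * np.sqrt(np.linalg.det(lambda_star)))
    return -0.5 * mahal - log_norm

def map_labels(log_lik, log_prior):
    """Label each pixel with argmax_k [log Pr(o_ij | model k) + log prior_k(i, j)].

    log_lik, log_prior: (K, H, W) stacks of per-model log-likelihoods and log-priors.
    """
    return np.argmax(log_lik + log_prior, axis=0)
```

With uninformative likelihoods, each pixel goes to the link whose projection lies closest in the Mahalanobis sense; a sufficiently strong likelihood at a pixel can still override the prior.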



1999-02-13