It is well known that the shape and motion recovered in SfM problems
such as this one are subject to an arbitrary common scale factor, and
that this factor cannot be recovered from the images alone. (The
imaging geometry and the rotation are recoverable and are not subject
to this scaling.) In two-frame problems with no information about true
lengths in the scene, the scale factor is usually set by fixing the
length of the ``baseline'' between the two cameras, which corresponds
to the magnitude of the translational motion.
It is equally acceptable to fix any other single length associated with the motion or the structure. In many previous formulations, including [10,42], some component of the translational motion is fixed at a finite value. This is poor practice for two reasons. First, if the fixed quantity (e.g., the magnitude of the translation) is actually zero or small, the estimation becomes numerically ill-conditioned. Second, every component of the motion is generally dynamic, so the scale changes at every frame. This is disastrous for stability, and it also requires a post-processing step to bring the estimates back to a common scale.
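The ambiguity, and the difficulty with fixing a motion length, can be written out explicitly. The sketch below assumes a standard pinhole model with the origin at the center of projection; the symbols $f$, $R_k$, $\mathbf{t}_k$, $\mathbf{X}_i$, $s$, and $c$ are introduced here for illustration only and are not necessarily the parametrization used elsewhere in this paper. Barred quantities denote the true (metric) values, assumed nonzero where they appear in a denominator.

```latex
% Assumed pinhole model: point i expressed in the camera frame at frame k
\begin{align}
  \mathbf{p}_{ik} &= R_k \mathbf{X}_i + \mathbf{t}_k
                   = (x_{ik},\, y_{ik},\, z_{ik})^{\top}, &
  (u_{ik},\, v_{ik}) &= \frac{f}{z_{ik}}\,(x_{ik},\, y_{ik}).
\end{align}
% Scaling structure and translation by any s > 0 leaves the image unchanged;
% R_k and f are untouched by the scaling
\[
  \mathbf{X}_i \to s\,\mathbf{X}_i,\quad \mathbf{t}_k \to s\,\mathbf{t}_k
  \;\;\Longrightarrow\;\;
  \mathbf{p}_{ik} \to s\,\mathbf{p}_{ik},\qquad
  \frac{f\,(s\,x_{ik})}{s\,z_{ik}} = \frac{f\,x_{ik}}{z_{ik}} = u_{ik}.
\]
% Fixing a motion length at every frame versus fixing one static structure length
\begin{align}
  \|s_k\,\bar{\mathbf{t}}_k\| = 1 \;&\Longrightarrow\;
  s_k = \frac{1}{\|\bar{\mathbf{t}}_k\|}
  \quad \text{(varies with $k$; unbounded as $\|\bar{\mathbf{t}}_k\| \to 0$)}, \\
  \|s\,\bar{\mathbf{X}}_1\| = c \;&\Longrightarrow\;
  s = \frac{c}{\|\bar{\mathbf{X}}_1\|}
  \quad \text{(the same for every frame)}.
\end{align}
```

The first normalization ties the scale to a quantity that changes from frame to frame and may vanish, whereas the second ties it to a quantity that is constant for a rigid scene.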
A better approach to setting the scale is to fix a static parameter.
Since we are dealing with rigid objects, all of the shape parameters
are static. Thus, fixing any one of these establishes
a uniform scale for all motion and structure parameters over the
entire sequence. The result is a well-conditioned, stable
representation. Setting the scale is simple and elegant in the EKF: the
initial variance on the chosen shape parameter is set to zero, which
fixes that parameter at its initial value. All other parameters then
automatically scale themselves to accommodate this constraint. This
behavior can be observed in the experimental results.
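The zero-variance mechanism can be illustrated with a deliberately small sketch. It is not the filter used in this paper: it assumes a toy two-parameter state, one static depth `Z` and one time-varying translation `t`, observed only through the scale-invariant disparity `t / Z`. All names and numerical values (`ekf_step`, `Z_fixed`, the noise levels, the random-walk model for `t`) are hypothetical choices made for this example.

```python
import numpy as np

def ekf_step(x, P, z, Q, R):
    """One EKF predict/update for state x = [Z, t], measuring disparity t / Z."""
    # Predict: identity dynamics. Q carries noise only on the motion parameter t.
    P = P + Q
    Z, t = x
    h = np.array([t / Z])                   # predicted measurement
    H = np.array([[-t / Z**2, 1.0 / Z]])    # Jacobian of h with respect to [Z, t]
    S = H @ P @ H.T + R                     # innovation covariance (1x1)
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain (2x1)
    x = x + K @ (z - h)
    P = (np.eye(2) - K @ H) @ P
    return x, P

rng = np.random.default_rng(0)
Z_true = 4.0                  # true depth, unknown to the filter
Z_fixed = 1.0                 # arbitrary initial value; this choice sets the scale
x = np.array([Z_fixed, 0.5])  # initial guess for [Z, t]
P = np.diag([0.0, 1.0])       # zero initial variance on Z pins it at Z_fixed
Q = np.diag([0.0, 1e-4])      # Z is static; t is modeled as a random walk
R = np.array([[1e-4]])        # measurement noise variance (std 0.01)

for k in range(100):
    t_true = 1.0 + 0.01 * k   # the motion changes at every frame
    z = np.array([t_true / Z_true]) + rng.normal(0.0, 0.01, size=1)
    x, P = ekf_step(x, P, z, Q, R)

scale = Z_fixed / Z_true      # single scale factor implied by the fixed depth
print(f"Z estimate: {x[0]:.3f} (fixed at {Z_fixed})")
print(f"t estimate: {x[1]:.3f}, scaled truth: {scale * t_true:.3f}")
```

Because the row and column of `P` belonging to `Z` are zero, the corresponding Kalman gain is zero and `Z` never moves; the estimate of `t` tracks the true translation multiplied by the one constant factor `Z_fixed / Z_true`. The process noise on the fixed parameter must also be zero, which is consistent with treating shape as static.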