In deriving an efficient M-step for the mixture of Gaussians, we call upon further bounding techniques that follow the CE-step and yield a monotonically convergent learning algorithm. The form of the conditional model we will train is obtained by conditioning a joint mixture of Gaussians. We write the conditional density in an experts-gates form as in Equation 8. We use unnormalized Gaussian gates, since conditional models do not require true marginal densities over the input (i.e. densities that necessarily integrate to 1). Also, note that the parameters of the gates are independent of the parameters of the experts.
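For concreteness, a minimal numerical sketch of evaluating such an experts-gates conditional density is given below. It assumes a generic parameterization (gate weights alpha, gate means mu_x and covariances S_xx over the input, and expert parameters A, b, S_y for the conditioned Gaussians); these names are illustrative and are not the paper's notation.

    import numpy as np
    from scipy.stats import multivariate_normal

    def conditional_density(y, x, alpha, mu_x, S_xx, A, b, S_y):
        # p(y | x) = sum_m gate_m(x) p(y | x, m) / sum_n gate_n(x), where
        # gate_m(x) is an unnormalized Gaussian over the input and expert m
        # is the conditioned Gaussian N(y; A[m] x + b[m], S_y[m]).
        gates = np.array([a * multivariate_normal.pdf(x, mean=m, cov=S)
                          for a, m, S in zip(alpha, mu_x, S_xx)])
        experts = np.array([multivariate_normal.pdf(y, mean=Am @ x + bm, cov=Sm)
                            for Am, bm, Sm in zip(A, b, S_y)])
        return gates @ experts / gates.sum()

Because the gate values appear in both the numerator and the denominator, rescaling them by a common constant leaves p(y | x) unchanged, which is why the gates need not be true (normalized) marginal densities.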
Both gates and experts are therefore optimized independently and have no variables in common. An update is performed over the experts and then over the gates; if each of these updates causes an increase, we converge to a local maximum of conditional log-likelihood (as in Expectation Conditional Maximization [5]).
To update the experts, we hold the gates fixed and merely take derivatives of the Q function with respect to the expert parameters and set them to zero. Each expert is effectively decoupled from the other terms (gates, other experts, etc.), so the solution reduces to maximizing the log of a single conditioned Gaussian and is analytically straightforward.
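As an illustration of how decoupled this step is, the sketch below assumes the CE-step supplies non-negative per-point weights h[t] for a given expert; maximizing the weighted log of the conditioned Gaussian N(y; A x + b, S) is then a weighted least-squares fit followed by a weighted residual covariance. The names h, A, b, S are illustrative, not the paper's notation.

    import numpy as np

    def update_expert(X, Y, h):
        # X: (T, dx) inputs, Y: (T, dy) outputs, h: (T,) non-negative weights.
        Xa = np.hstack([X, np.ones((len(X), 1))])            # append a bias column
        W = np.diag(h)
        # weighted normal equations for the regression coefficients [A | b]
        coef = np.linalg.solve(Xa.T @ W @ Xa, Xa.T @ W @ Y)  # shape (dx + 1, dy)
        A, b = coef[:-1].T, coef[-1]
        resid = Y - Xa @ coef
        S = (resid.T * h) @ resid / h.sum()                  # weighted residual covariance
        return A, b, S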
Similarly, to update the gate mixing proportions, derivatives of the Q function are taken with respect to them and set to zero. With the other parameters held fixed, the resulting update equation for the mixing proportions is evaluated numerically (Equation 10).
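Because this stationarity condition is evaluated numerically, one concrete way to carry out the step is a constrained numerical maximization of the Q function over the simplex of mixing proportions with the other parameters held fixed, as sketched below. Here q_of_alpha is a placeholder for the Q function restricted to the mixing proportions; it is not Equation 10 itself.

    import numpy as np
    from scipy.optimize import minimize

    def update_mixing_proportions(q_of_alpha, alpha0):
        # Maximize Q over alpha >= 0 with sum(alpha) = 1, other parameters fixed.
        cons = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)
        bnds = [(0.0, 1.0)] * len(alpha0)
        res = minimize(lambda a: -q_of_alpha(np.asarray(a)), np.asarray(alpha0),
                       bounds=bnds, constraints=cons, method='SLSQP')
        return res.x

Any routine that solves the constrained stationarity condition would serve here; SLSQP is used only because it handles the equality constraint directly.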