
EM and Conditional Likelihood

For joint densities, the tried and true EM algorithm [3] maximizes joint likelihood over the data. However, EM is less useful for conditional density estimation and maximum conditional likelihood problems, where one typically resorts to other local optimization techniques such as gradient descent or second-order Hessian methods [2]. We therefore introduce CEM, a variant of EM that targets conditional likelihood while maintaining desirable convergence properties. The CEM algorithm operates by directly bounding and decoupling the conditional likelihood, which simplifies the M-step calculations.

In EM, a complex density optimization is broken down into a two-step iteration using the notion of missing data. The unknown data components are estimated in the E-step, and a simplified maximization over the complete data is performed in the M-step. In more practical terms, EM is a bound maximization: the E-step finds a lower bound on the likelihood and the M-step maximizes that bound.
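To make the bound-maximization view concrete, here is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture (the component count, initialization, data, and iteration budget are illustrative assumptions, not part of the algorithm in [3]):

    # Minimal EM sketch for a two-component 1-D Gaussian mixture (illustrative only).
    import numpy as np

    def em_gaussian_mixture(x, n_iter=50):
        # Arbitrary initialization of mixing weights, means, and variances.
        w = np.array([0.5, 0.5])
        mu = np.array([x.min(), x.max()])
        var = np.array([x.var(), x.var()])
        for _ in range(n_iter):
            # E-step: posterior responsibilities h[i, m]; these define the lower bound.
            joint = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
            h = joint / joint.sum(axis=1, keepdims=True)
            # M-step: maximize the bound -> closed-form weighted maximum-likelihood updates.
            Nm = h.sum(axis=0)
            w = Nm / len(x)
            mu = (h * x[:, None]).sum(axis=0) / Nm
            var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / Nm
        return w, mu, var

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 200)])
        print(em_gaussian_mixture(x))

The M-step is available in closed form here precisely because the bound decouples the components, which is the property lost in the conditional case discussed below.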



 
$\displaystyle p( {\bf x}_i, {\bf y}_i \vert \Theta) = \sum_{m=1}^M p(m, {\bf x}_i, {\bf y}_i \vert \Theta)$     (1)


Consider a complex joint density $p({\bf x}_i, {\bf y}_i \vert \Theta)$ which is best described by a discrete (or continuous) summation of simpler models (Equation 1). The summation is over the `missing components' $m$.



 
$\displaystyle \Delta l \;=\; \sum_{i=1}^N \log \left( p( {\bf x}_i, {\bf y}_i \vert \Theta^t) \right) - \log \left( p( {\bf x}_i, {\bf y}_i \vert \Theta^{t-1}) \right) \;=\; \sum_{i=1}^N \log \frac{\sum_{m=1}^M p(m, {\bf x}_i, {\bf y}_i \vert \Theta^t)}{\sum_{n=1}^M p(n, {\bf x}_i, {\bf y}_i \vert \Theta^{t-1})}$     (2)


By appealing to Jensen's inequality, EM obtains a lower bound on the incremental log-likelihood over a data set (Equation 2). Jensen's inequality bounds the logarithm of the sum, with the result that the logarithm is applied to each simple model $p(m, {\bf x}_i, {\bf y}_i \vert \Theta)$ individually. It then becomes straightforward to compute the derivatives with respect to $\Theta$ and set them to zero for maximization (the M-step).
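For a single data point, the Jensen step can be spelled out as follows, where $h_{im}$ is taken to be the posterior responsibility of component $m$ for point $i$ under $\Theta^{t-1}$ (the standard EM choice, and the same quantity that appears in Equation 5):

\begin{displaymath}
\log \frac{\sum_{m=1}^M p(m, {\bf x}_i, {\bf y}_i \vert \Theta^t)}
          {\sum_{n=1}^M p(n, {\bf x}_i, {\bf y}_i \vert \Theta^{t-1})}
\;=\;
\log \sum_{m=1}^M h_{im} \,
\frac{p(m, {\bf x}_i, {\bf y}_i \vert \Theta^t)}
     {p(m, {\bf x}_i, {\bf y}_i \vert \Theta^{t-1})}
\;\geq\;
\sum_{m=1}^M h_{im}
\log \frac{p(m, {\bf x}_i, {\bf y}_i \vert \Theta^t)}
          {p(m, {\bf x}_i, {\bf y}_i \vert \Theta^{t-1})},
\qquad
h_{im} = \frac{p(m, {\bf x}_i, {\bf y}_i \vert \Theta^{t-1})}
              {\sum_{n=1}^M p(n, {\bf x}_i, {\bf y}_i \vert \Theta^{t-1})}
\end{displaymath}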



 
$\displaystyle p( {\bf y}_i \vert {\bf x}_i , \Theta) = \sum_{m=1}^M p(m, {\bf y}_i \vert {\bf x}_i, \Theta) = \frac{\sum_{m=1}^M p(m, {\bf x}_i, {\bf y}_i \vert \Theta)}{\sum_{m=1}^M p(m, {\bf x}_i \vert \Theta)}$     (3)


However, the elegance of EM is compromised when we consider a conditional density as in Equation 3. The corresponding incremental conditional log-likelihood, $\Delta l^c$, is given in Equation 4.



 
$\displaystyle \Delta l^c \;=\; \sum_{i=1}^N \log \left( p( {\bf y}_i \vert {\bf x}_i, \Theta^t) \right) - \log \left( p( {\bf y}_i \vert {\bf x}_i, \Theta^{t-1}) \right) \;=\; \sum_{i=1}^N \log \frac{\sum_{m=1}^M p(m, {\bf x}_i, {\bf y}_i \vert \Theta^t)}{\sum_{n=1}^M p(n, {\bf x}_i, {\bf y}_i \vert \Theta^{t-1})} - \log \frac{\sum_{m=1}^M p(m, {\bf x}_i \vert \Theta^t)}{\sum_{n=1}^M p(n, {\bf x}_i \vert \Theta^{t-1})}$     (4)


The above is a difference between the logarithm of a ratio of joints and the logarithm of a ratio of marginals. If Jensen's inequality is applied to the second term in Equation 4, it yields an upper bound since the term is subtracted (this would compromise convergence). Thus, only the first ratio can be lower-bounded with Jensen's inequality (Equation 5), where $h_{im}$ is the posterior responsibility of component $m$ for data point $i$ under $\Theta^{t-1}$.



\begin{displaymath}
\Delta l^c \;\geq\; \sum_{i=1}^N \sum_{m=1}^M { h}_{im} \log \frac{p(m, {\bf x}_i, {\bf y}_i \vert \Theta^t)}{p(m, {\bf x}_i, {\bf y}_i \vert \Theta^{t-1})} \;-\; \sum_{i=1}^N \log \frac{\sum_{m=1}^M p(m, {\bf x}_i \vert \Theta^t)}{\sum_{n=1}^M p(n, {\bf x}_i \vert \Theta^{t-1})}
\end{displaymath} (5)
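As a quick numerical sanity check on the direction of this bound, the following sketch uses arbitrary random positive values in place of the component joints and marginals (the inequality holds for any such values) and verifies that the right-hand side of Equation 5 never exceeds $\Delta l^c$:

    # Numerical sanity check (illustrative): the bound of Equation 5 should never exceed
    # the incremental conditional log-likelihood of Equation 4.
    import numpy as np

    rng = np.random.default_rng(1)
    N, M = 100, 3
    joint_t   = rng.uniform(0.01, 1.0, (N, M))   # stands in for p(m, x_i, y_i | Theta^t)
    joint_tm1 = rng.uniform(0.01, 1.0, (N, M))   # stands in for p(m, x_i, y_i | Theta^{t-1})
    marg_t    = rng.uniform(0.01, 1.0, (N, M))   # stands in for p(m, x_i | Theta^t)
    marg_tm1  = rng.uniform(0.01, 1.0, (N, M))   # stands in for p(m, x_i | Theta^{t-1})

    # Equation 4: difference between a ratio of joints and a ratio of marginals.
    delta_lc = (np.log(joint_t.sum(1) / joint_tm1.sum(1))
                - np.log(marg_t.sum(1) / marg_tm1.sum(1))).sum()

    # Equation 5: Jensen lower bound applied to the first ratio only.
    h = joint_tm1 / joint_tm1.sum(1, keepdims=True)      # responsibilities under Theta^{t-1}
    bound = ((h * np.log(joint_t / joint_tm1)).sum()
             - np.log(marg_t.sum(1) / marg_tm1.sum(1)).sum())

    assert bound <= delta_lc + 1e-12
    print(delta_lc, bound)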


Note the lingering logarithm of a sum, which prevents a simple M-step. At this point, one would resort to a Generalized EM (GEM) approach, which requires gradient or second-order ascent techniques for the M-step. For example, Jordan et al. overcome the difficult M-step that arises in the mixture of experts architecture with an Iteratively Reweighted Least Squares (IRLS) algorithm [4].
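For illustration only (this is not the IRLS update of [4]), the following sketch performs the kind of direct gradient ascent on the conditional log-likelihood that such an approach requires; the two-component joint Gaussian mixture with equal weights and fixed identity covariances, the data, and the finite-difference gradient are all assumptions made for brevity:

    # Illustrative sketch: gradient ascent on the conditional log-likelihood of an assumed
    # two-component joint Gaussian mixture over (x, y) with equal weights and identity
    # covariances. Only the component means are updated; the gradient is finite-differenced.
    import numpy as np

    def cond_loglik(means, x, y):
        # log p(y_i | x_i) = log sum_m p(m, x_i, y_i) - log sum_m p(m, x_i);
        # the equal mixing weights add the same constant to both terms and cancel.
        d_joint = (x[:, None] - means[:, 0]) ** 2 + (y[:, None] - means[:, 1]) ** 2
        d_marg = (x[:, None] - means[:, 0]) ** 2
        log_joint = np.log(np.exp(-0.5 * d_joint).sum(1) / (2 * np.pi))
        log_marg = np.log(np.exp(-0.5 * d_marg).sum(1) / np.sqrt(2 * np.pi))
        return (log_joint - log_marg).sum()

    rng = np.random.default_rng(2)
    x = rng.normal(0.0, 1.0, 300)
    y = np.where(x > 0, x + 1.0, x - 1.0) + rng.normal(0.0, 0.3, 300)
    means = rng.normal(0.0, 1.0, (2, 2))          # component means in (x, y)

    for _ in range(200):                          # ascent iterations in place of a closed-form M-step
        grad = np.zeros_like(means)
        for idx in np.ndindex(*means.shape):      # finite-difference gradient
            eps = np.zeros_like(means)
            eps[idx] = 1e-5
            grad[idx] = (cond_loglik(means + eps, x, y) - cond_loglik(means - eps, x, y)) / 2e-5
        means += 0.01 * grad / len(x)
    print("conditional log-likelihood:", cond_loglik(means, x, y))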

