Consider the 4-cluster (x,y) data in Figure 3(a). The data is modeled with a conditional density p(y|x) using only 2 Gaussian models. Estimating the density with CEM yields the p(y|x) shown in Figure 3(b). CEM exhibits monotonic conditional likelihood growth (Figure 3(c)) and obtains a more conditionally likely model. In the EM case, a joint p(x,y) clusters the data as in Figure 3(d); conditioning it yields the p(y|x) in Figure 3(e). Figure 3(f) depicts EM's non-monotonic evolution of conditional log-likelihood. EM produces a superior joint likelihood but an inferior conditional likelihood. Note how the CEM algorithm used its limited resources to capture the multimodal nature of the distribution in y while ignoring the spurious bimodal clustering in the x feature space. These properties are critical for a good conditional density p(y|x).
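To make the EM case concrete, the step of conditioning a fitted joint mixture can be sketched as follows. This is an illustrative snippet, not the paper's code: the 2-component diagonal-covariance mixture parameters below are hypothetical stand-ins for a model fit by EM, and `p_y_given_x` shows how p(y|x) arises as a mixture in y whose weights are re-scaled by each component's marginal likelihood of x.

```python
# Sketch (assumed parameters, not the paper's code): conditioning a joint
# 2-component Gaussian mixture p(x, y) to obtain p(y | x).
import numpy as np

def gauss(z, mu, var):
    # 1-D Gaussian density.
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical fitted joint mixture: weights pi_k, means (mu_x, mu_y),
# diagonal covariances (var_x, var_y) per component (rows = components).
pi = np.array([0.5, 0.5])
mu = np.array([[-1.0, -1.0], [1.0, 1.0]])
var = np.array([[0.5, 0.2], [0.5, 0.2]])

def p_y_given_x(y, x):
    # Responsibilities w_k(x) are proportional to pi_k * p_k(x): the
    # x-marginal re-weights each component before mixing in y.
    w = pi * gauss(x, mu[:, 0], var[:, 0])
    w /= w.sum()
    # With diagonal covariance, p_k(y | x) is just component k's y-marginal.
    return float(np.sum(w * gauss(y, mu[:, 1], var[:, 1])))
```

Because the weights w_k(x) depend on x, the resulting p(y|x) can shift mass between modes in y as x varies, which is exactly the behavior the figure contrasts between EM and CEM.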
For comparison, standard databases from the UCI repository were used. Mixture models were trained with EM and CEM, maximizing joint and conditional likelihood respectively. Regression results are shown in Table 1. CEM exhibited monotonic conditional log-likelihood growth and outperformed the other methods, including EM with the same 2-Gaussian model (EM2 and CEM2).
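For the regression comparison, a fitted conditional mixture yields a point prediction via the conditional mean E[y|x]. A minimal sketch, again with hypothetical 2-component parameters standing in for a trained model:

```python
# Sketch (assumed parameters, not the paper's code): regression from a
# conditional Gaussian mixture, predicting yhat(x) = E[y | x].
import numpy as np

pi = np.array([0.5, 0.5])                  # hypothetical mixture weights
mu = np.array([[-1.0, -1.0], [1.0, 1.0]])  # component means (mu_x, mu_y)
var_x = np.array([0.5, 0.5])               # component variances in x

def predict(x):
    # Each component contributes its y-mean, weighted by how likely x is
    # under that component's x-marginal.
    w = pi * np.exp(-0.5 * (x - mu[:, 0]) ** 2 / var_x) / np.sqrt(2 * np.pi * var_x)
    w /= w.sum()
    return float(w @ mu[:, 1])
```

Since EM and CEM produce different component placements, the same predictor formula applied to each model gives the differing regression accuracies reported in Table 1.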