Supervised learning from incomplete 
data via an EM approach 
Zoubin Ghahramani and Michael I. Jordan 
Department of Brain & Cognitive Sciences 
Massachusetts Institute of Techuology 
Cambridge, MA 02139 
Abstract 
Real-world learning tasks may involve high-dimensional data sets 
with arbitrary patterns of missing data. In this paper we present 
a framework based on maximum likelihood density estimation for 
learning from such data. sets. We use mixture models for the den- 
sity estimates and make two distinct appeals to the Expectation- 
Maximization (EM) principle (Dempster et al., 1977) in deriving 
a learning algorithm--EM is used both for the estimation of mix- 
ture components and for coping with missing data. The result- 
ing algorithm is applicable to a wide range of supervised as well 
as unsupervised learning problems. Results kom a classification 
benchmark--the iris data set--are presented. 
1 Introduction 
Adaptive systems generally operate in environments that are fraught with imper- 
fections; nonetheless they must cope with these imperfections and learn to extract 
as much relevant information as needed for their particular goals. One form of 
imperfection is incompleteness in sensing information. Incompleteness can arise ex- 
trinsically from the data generation process and intrinsically from failures of the 
system's sensors. For example, an object recognition system must be able to learn 
to classify images with occlusions, and a robotic controller must be able to integrate 
multiple sensors even when only a fraction may operate at any given time. 
In this paper we present a framework--derived from parametric statistics--for learn- 
120 
Supervised Learning from Incomplete Data via an EM Approach 121 
ing from data sets with arbitrary patterns of iucompleteness. Learning in this frame- 
work is a classical estimation problem requiring an explicit probabilistic model and 
an algorithm for estimating the parameters of the model. A possible disadvantage 
of parametric methods is their lack of flexibility when compared with nonparamet- 
ric methods. This problem, however, can be largely circumvented by the use of 
mixture models (McLachlan and Basford, 1988). Mixture models combine much of 
the flexibility of nonparametric methods with certain of the analytic advantages of 
parametric methods. 
Mixture models have been utilized recently for supervised learning problems in the 
form of the "mixlures of experts" architecture (Jacobs et al., 1991; Jordan and 
Jacobs, 1994). This architecture is a parametric regression model with a modular 
structure similar to the nonparametric decision tree and adaptive spline models 
(Breiman et al., 1984; Friedman, 1991). The approach presented here differs from 
these regression-based approaches in that the goal of learning is to estimate the 
density of the data. No distinction is made between input and output variables; the 
joint density is estimated and this estimate is then used to form an input/output 
map. Similar approaches have been discussed by Specht (1991) and Tresp et al. 
(1993). To estimate the vector function y -- f(x) the joint density P(x,y) is esti- 
mated and, given a particular input x, the conditional density P(yl x) is formed. 
To obtain a single estimate of y rather than the full conditional density one can 
evaluate  = E(y[x), the expectation of y given x. 
The density-based approach to learning can be exploited in several ways. First, 
having an estimate of the joint density allows for the representation of any rela- 
tion between the variables. From P(x,y), we can estimate ' -- f(x), the inverse 
 _. f-1 (y), or any other relation between two subsets of the elements of the con- 
catenated vector (x, y). 
Second, this density-based approach is applicable both to supervised learning and 
unsupervised learning in exactly the same way. The only distinction between su- 
pervised and unsupervised learning in this framework is whether some portion of 
the data vector is denoted as "input" and another portion as "target". 
Third, as we discuss in this paper, the density-based approach deals naturally with 
incomplete data, i.e. missing values in the data set. This is because the problem 
of estimating mixture densities can itself be viewed as a missing data problem (the 
"labels" for the component densities are missing) and an Expectation-Maximization 
(EM) algorithm (Dempster et al., 1977) can be developed to handle both kinds of 
missing data. 
Density estimation using EM 
This section outlines the basic learning algorithm for finding the maximum like- 
lihood parameters of a mixture model (Dempster et al., 1977; Duda and Hart, 
1973; Nowlan, 1991). We assume that the data Y = {x,..., XN} are generated 
independently froin a mixture density 
M 
P(xi) = E P(xilcJ;OJ )P(j)' (1) 
j=l 
122 Ghahramani and Jordan 
where each component of the mixture is denoted cod and parametrized by Oj. From 
equation (1) and the independence assumption we see that the log likelihood of the 
parameters given the data set is 
N M 
t(01x) - log P(xilw; (2) 
i=1 j=l 
By the maximum likelihood principle the best model of the data has parameters 
that maximize/(01l'). This function, however, is not easily maximized numerically 
because it involves the log of a sum. 
Intuitively, there is a "credit-assignment" problem: it is not clear which component 
of the mixture generated a given data point and thus which parameters to adjust 
to fit that data point. The EM algorithm for mixture models is an iterative method 
for solving this credit-assignment problem. The intuition is that if one had access 
to a "hidden" random variable z that indicated which data point was genera. ted 
by which component, then the maximization problem would decouple into a set 
of simple maximizations. Using the indicator variable z, a "complete-data" log 
likelihood function can be written 
N M 
/,(0IX, Z) - 1ogP(x, lz,;O)P(zi;O), (3) 
i=1 j=l 
which does not involve a log of a summation. 
Since z is unknown l, cannot be utilized directly, so we instead work with its ex- 
pectation, denoted by Q(OlOk). As shown by (Dempster et al., 1977), t(01x) can be 
maximized by iterating the following two steps: 
E step: Q(OlOk) = 
M step: 0k+ = argmax Q(OlOk). (4) 
0 
The E (Expectation) step computes the expected complete data log likelihood and 
the M (Maximization) step finds the parameters that maximize this likelihood. 
These two steps form the basis of the EM algorithm; in the next two sections we 
will outline how they can be used for real and discrete density estimation. 
2.1 Real-valued data: mixture of Gaussians 
Real-valued data can be lnodeled as a mixture of Gaussians. For this model the 
E-step simplifies to computing hij = E[zij [xi, Ok], the probability that Gaussian j, 
as defined by the parameters estimated at time step k, generated data point. i. 
^ k)T  - 
I1-/2exp{-(xi - tj Ej 'k(xi- )} 
hij = 1 [[-1/2xp{  x . (5) 
The M-step re-estimates the means and covariances of the Gaussians I using the 
data set weighted by the hij' 
a) ' k+l _ E4_-i hijxl 
Ij -- EiN__i hij ' 
^k+l E/V_.i hij(xi /.3k.' + 1 ) (xi . k+l)T 
(6) 
Though this derivation assumes equal priors for the Gaussians, if the priors are viewed 
as mixing parameters they can also be learned in the maximization step. 
Supervised Learning from Incomplete Data via an EM Approach 123 
2.2 Discrete-valued data: mixture of Bernoullis 
D-dimensional binary data x = (Xl,..., Xd,...ZD), Zd ( {0, 1}, can be modeled as 
a mixture of M Bernoulli densities. That is, 
M D 
P(xlO ) = y. P(wj) H pJ(1 - jd) (1-xa). 
j----1 d=l 
(7) 
For this model tile E-step involves computing 
d=l /jd k -- Ija] 
% = i - 
/-l=l 11d=l tld k -- r  
(8) 
and the M-step again re-estimates the parameters by 
/,k4-1 __ Y"-/N'-I hijXi (9) 
More generally, discrete or categorical data can be modeled as generated by a mix- 
ture of multinonlial densities and similar derivations for the learning algorithm can 
be applied. Finally, the exteusion to data with mixed real, binary, and categorical 
dimensions call be readily derived by assuming a joint density with mixed compo- 
nents of the three types. 
3 Learning from incomplete data 
In the previous section we presented one aspect of the EM algorithm: learning 
mixture models. Another important application of EM is to learning froin data 
sets with missing values (Little and Rubin, 1987; Dempster et al., 1977). This 
application has been pursued in the statistics literature for non-mixture density 
estimation problems; in this paper we combine this application of EM with that of 
learning mixture parameters. 
We assume that. the data set , = {Xl,..., XN} is divided into an observed com- 
ponent .o and a missing component ,l'm. Similarly, each data vector xi is divided 
X  
into ( i, x?) where each data vector can have different missing components--this 
would be denoted by superscript mi and oi, but we have simplified the notation for 
the sake of clarity. 
To handle missing data we rewrite the EM algorithln as follows 
E step: O(OlOk ) = E[I(OI.t',xm,z)IX,Oi] 
m step: 0+1 = argmax O(O[Ok). (10) 
0 
Comparing to equation (4) we see that aside from the indicator variables Z we have 
added a second form of incomplete data., ,m, correspouding to the missing values 
in the data set. The E-step of the algorithm estimates both these forms of missing 
information; in essence it uses the current estimate of the data density to complete 
the missing values. 
124 Ghahramani and Jordan 
3.1 Real-valued data: mixture of Gaussians 
We start by writing the log likelihood of the complete data, 
N M N M 
/,(0IX, Xm, Z) - ZZzijlogP(xilzi,O)+ ZEzijlogP(zilO). (11) 
i j i j 
We can ignore the second term since we will only be estimating the parameters of 
the P(xilzi, 0). Using equation (11) for the mixture ot' Gaussians wc note that if 
only the indicator variables zl are missing, the E step can be reduced to estimating 
E[zli xi, 0]. For the ce we are interested in, with two types of missing data zi and 
x, we expand equation (11) using m and o superscripts to denote subvectors and 
submatrices of the parameters matching the missing and observed components of 
the data, 
N M 
n 1 1 o i,OO(x  
/e(01X,xm, z) = zij[log2w+loglsl-(xi _])Tf . _) 
i j 
--(X o T -1 I m 
-) E 'm(x?--7)--(x , 
Note that after taking the expectation, the sufficient statistics for the parameters 
involve three unknown terms, zij, zisx, and zoxx r. Thus we must compute: 
    .. m m 
E[zalx ? 0] E[zox?lx 7 0] and E[z,ax i x i Tlx?,0 ]. 
One intuitive approach to dealing with missing data is to use the current estimate 
of the data density to compute the expectat. ion of the missing data in an E-step, 
complete the data with these expectations, and then use this completed data to re- 
estimate parameters in an M-step. However, this intuition fails even when dealing 
with a single two-dimensional Gaussian; the expectation of the missing data always 
lies along a line, which bies the estimate of the covariance. On the other hand, 
the approach arising from application of the EM algorithm specifies that one should 
use the current density estimate to compute the expectation of whatever incomplete 
terms appear in the likelihood maximization. For the mixture of Gaussians these 
incomplete terms involve interactions between the indicator variable zi5 and the 
first and second moments of x. Thus, simply computing the expectation of the 
missing data zl and x fi'om our model and substituting those values iuto the M 
step is not sufficient to guarantee an increde in the likelihood of the parameters. 
The above terms can be computed as follows: E[zj ]x, 0] is again hj, the proba- 
bility  defined in (5) measured only on the observed dimensions of xi, and 
o ' EE9-(x? -- P'7)), (12) 
= h[x21z5 = = + _, _, 
Defining   E[xlzij = 1 x9 0] the regression of x on x? using Gaussian j 
Z[z,x7x?lx? = h(s?m mooo-s?o 
, - -5  + xr). (1) 
The M-step uses these expectations substituted into equations (6)a aud (6)b to 
re-estimate the means and covariances. To re-estimate the mean vector, /xj, we 
substitute the values E[xnlzij = 1,x,0k] for the missing components of xi in 
equation (6)a. To re-estimate the covariance matrix we substitute the values 
E[x?x? T Izij = 1, x?, 0k] for the outer product matrices involving the ntissing com- 
ponents of xi in equation (6)b. 
Supervised Learning from Incomplete Data via an EM Approach 125 
3.2 Discrete-valued data: mixture of Bernoullis 
For the Bernoulli mixture the sufficient statistics for the M-step involve the incom- 
plete terms E[zijlx?, 0k] and E[zijxlx?, 0k]. The first is equal to hij calculated over 
the observed subvector of xi. The second, since we assume that within a class the 
h m 
individual dimensions of the Bernoulli variable are independent, is simply ijttj  
The M-step uses these expectations substituted into equation (9). 
4 Supervised learning 
If each vector xi in the data set is composed of an "input" subvector, xii, and a 
"target" or outpnt subvector, x?, then learning the joint density of thc input and 
target is a form of supervised learning. In supervised learning we generally wish to 
predict the output variables from the input variables. In this section we will outline 
how this is achieved using the estimated density. 
4.1 Function approximation 
For real-valued function approximation we have assumed that the density is esti- 
mated using a mixture of Gaussians. Given an input vector xii we extract all the 
relevant information from the density p(xi,x ) by conditionalizing to 
For a single Gaussian this conditional density is normal, and, since P(x', x ) is a 
mixture of Gaussians so is P(xlxi). In principle, this conditional density is the 
final output of the density estimator. That is, given a particular input the net- 
work returns the complete conditional density of the output. However, since many 
applications require a single estimate of the output, we note three ways to ob- 
tain estimates  of x  = f(xi): the least squares estimate (LSE), which takes 
:(xii) = E(x[xii); stochastic sampling (STOCH), which samples according to 
the distribution .(xi)  P(xlx); single colnponent LSE (SLSE), which takes 
(xii) = E(x[x},wj) where j = argmaxk P(zklx'i). For a given input, SLSE picks 
the Gaussian with highest posterior and approximates the output with the LSE 
estimator given by that Gaussian alone. 
The conditional expectation or LSE estimator for a Gaussian mixture is 
oi ii-  _ i 
O(xii ) = ]])= hij[tty + E E (xi - tt})] (14) 
M ' 
which is a convex sum of linear approximations, where the weights h;z vary non- 
linearly according to equation (14) over the input space. The LSE estimator on a 
Gaussian mixture has interesting relations to algorithms such as CART (Breiman 
et al., 1984), MARS (Friedman, 1991), and mixtures of experts (Jacobs ctal., 1991; 
Jordan and Jacobs, 1994), in that the mixture of Gaussians competitively parti- 
tions the input space, and learns a linear regression surface on each partition. This 
similarity h also been noted by Tresp et al. (1993) . 
The stochastic estimator (STOCH) and the single component estimator (SLSE) are 
better suited than any least squares method for learning non-convex inverse maps, 
where the mean of several solutions to an inverse might not be a sohllion. These 
126 Ghahramani and Jordan 
Figure 1: Classification of the iris data 
set. 100 data points were used for train- 
ing and 50 for testing. Each data point 
consisted of 4 real-valued attributes and 
one of three class labels. The figure 
shows classification performance q- 1 
standard error (, = 5) as a function 
of proportion missing features for the 
EM algorithm and for mean imputa- 
tion (MI), a common heuristic where the 
missing values are replaced with their 
unconditional means. 
Classification with missing inputs 
0 20 40 60 80 100 
% missing features 
estimators take advantage of the explicit representation of the input/output density 
by selecting one of the several solutions to the inverse. 
4.2 Classification 
Classification problems involve learning a mapping from an input space into a set 
of discrete class labels. The density estimation framework presented in this paper 
lends itself to solving classification problems by estimating the joint density of the 
input and class label using a mixture model. For example, if the inputs have real- 
valued attributes and there are D class labels, a mixture model with Gaussian and 
multinomial components will be used: 
M 
P(x,C = d[O)=  P(coj) tJd exp{ 1 )TE-I(x ttj)}, (15) 
j=l (2w)n/2[jll]2 --(X- ttj -- 
denoting the joint probability that the data point is x and belongs to class d, 
where the ttjd are the parameters for the multinomial. Once this density has been 
estimated, the maximum likelihood label for a particular input x may be obtained 
by computing P( = dlx , 0). Similarly, the class conditional densities can be derived 
by evaluating P(xlC = d, 0). Conditionalizing over classes in this way yields class 
conditional densities which are in turn mixtures of Gaussians. Figure 1 shows 
the performance of the EM algorithm on an example classification problem with 
varying proportions of missing features. We have also applied these algorithms to 
the problems of clustering 35-dimensional greyscale images and approximating the 
kinematics of a three-joint planar arm from incomplete data. 
5 Discussion 
Density estimation in high dimensions is generally considered to be more difficult-- 
requiring more parameters--than function approximation. The density-estimation- 
based approach to learning, however, has two advantages. First, it permits ready in- 
corporation of results from the statistical literature on missing data to yield flexible 
supervised and unsupervised learning architectures. This is achieved by combining 
two branches of application of the EM algorithm yielding a set of learning rules for 
mixtures under incomplete sampling. 
Supervised Learning from Incomplete Data via an EM Approach 127 
Second, estimating the density explicitly enables us to represent any relation be- 
tween the variables. Density estimation is fundamentally more general than function 
approxilnation and this generality is needed for a large class of learning problems 
arising from inverting causal systems (Ghahramani, 1994). These problclns cannot 
be solved easily by traditional function approximation techniques since the data is 
not generated from noisy samples of a function, but rather of a relation. 
Acknowledgements 
Thanks to D. M. Titterington and David Cohn for helpful comments. This project 
was supported in part by grants from the McDonnell-Pew Foundation, ATR Andi- 
tory and Visual Perception Research Laboratories, Siemens Corporation, the Na- 
tional Science Foundation, and the Office of Naval Research. The iris data set was 
obtained from the UCI Repository of Machine Learning Databases. 
References 
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification 
and Regression Trees. Wadsworth International Group, Behnont, CA. 
Dempster, A. P., Laird, N.M., and Rubin, D. B. (1977). Maximum likelihood from 
incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 
39:1-38. 
Duda, R. O. and Hart, P. E. (1973). Pattern Classification and SceTe Analysis. 
Wiley, New York. 
Friedman, J. tI. (1991). Multivariate adaptive regression splines. Tbc Annls of 
Statistics, 19:1-141. 
Ghahramani, Z. (1994). Solving inverse problems using an EM approach to density 
estimation. In Proceedings of the 1993 Connectionist Models Summer School. 
Erlbaum, Hillsdale, NJ. 
Jacobs, R., Jordan, M., NowInn, S., and Hinton, G. (1991). Adaptive mixtnre of 
local experts. Neural Computation, 3:79-87. 
Jordan, M. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM 
algorithm. Neural Computation, 6:181-214. 
Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. 
Wiley, New York. 
McLachlan, G. and Basford, K. (1988). Mixture models: Inference and applications 
to clustering. Marcel Dekker. 
Nowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algo- 
rithms based on Fitting Statistical Mixtures. CMU-CS-91-126, School of Com- 
puter Science, Carnegie Mellon University, Pittsburgh, PA. 
Specht, D. F. (1991). A general regression neural network. IEEE Trans. Neural 
Networks, 2(6):568-576. 
Tresp, V., Ho!latz, J., and Ahmad, S. (1993). Network structuring and training 
using rule-based knoxvledge. In Hanson, S. J., Cowan, J. D., and Giles, C. L., 
editors, Advances in Neural Information Processing Systems 5. Morgan Kauf- 
man Publishers, San Mateo, CA. 
