Mixture Density Estimation 
Jonathan Q. Li 
Department of Statistics 
Yale University 
P.O. Box 208290 
New Haven, CT 06520 
Qiang. Li@aya. yale. edu 
Andrew R. Barron 
Department of Statistics 
Yale University 
P.O. Box 208290 
New Haven, CT 06520 
Andrew. Barron@yale. edu 
Abstract 
Gaussian mixtures (or so-called radial basis function networks) for 
density estimation provide a natural counterpart to sigmoidal neu- 
ral networks for function fitting and approximation. In both cases, 
it is possible to give simple expressions for the iterative improve- 
ment of performance as components of the network are introduced 
one at a time. In particular, for mixture density estimation we show 
that a k-component mixture estimated by maximum likelihood (or 
by an iterative likelihood improvement that we introduce) achieves 
log-likelihood within order 1/k of the log-likelihood achievable by 
any convex combination. Consequences for approximation and es- 
timation using Kullback-Leibler risk are also given. A Minimum 
Description Length principle selects the optimal number of compo- 
nents k that minimizes the risk bound. 
I Introduction 
In density estimation, Gaussian mixtures provide flexible-basis representations for 
densities that can be used to model heterogeneous data in high dimensions. We 
introduce ar index of regularity cy of density functions f with respect to mixtures 
of densities from a given family. Mixture models with k components are shown to 
achieve Kullback-Leibler approximation error bounded by c/k for every k. Thus 
in a manner analogous to the treatment of sinusoidal and sigrnoidal networks in 
Barron [1],[2], we find classes of density functions f such that reasonable size net- 
works (not exponentially large as function of the input dimension) achieve suitable 
approximation and estimation error. 
Consider a parametric family G = (b0(x), x E 3J C R d' :0 E O C R d) of probability 
density functions parameterized by 0 & O. Then consider the class C = CONV(G') 
of density functions for which there is a mixture representation of the form 
f(x) = fo 0)P(dO) (1) 
where 0 () are density functions from G and P is a probability measure on . 
The main theme of the paper is to give approximation and estimation bounds of 
arbitrary densities by finite mixture densities. We focus our attention on densities 
280 J. Q. Li and A. R. Barron 
inside C first and give an approximation error bound by finite mixtures for arbi- 
trary f E C. The approximation error is measured by Kullback-Leibler divergence 
between two densities, defined as 
D(fllg ) = f f(x) log[f(x)/g(x)]dx. (2) 
In density estimation, D is more natural to use than the L 2 distance often seen in 
the function fitting literature. Indeed, D is invariant under scale transformations 
(and other 1-1 transformation of the variables) and it has an intrinsic connection 
with Maximum Likelihood, one of the most useful methods in the mixture density 
estimation. The following result quantifies the approximation error. 
THEOREM 1 Let G = {qb0(x)  0 e O} and C= CONV(G). Let f(x) = 
f c)o(x)P(dO)  . There exists fk, a k-component mixture of trio, such that 
(3) 
O(fllfk)-< k 
In the bound, we have 
f f 
f o(x)p(do) 
and ' = 411og(3v) + a], where 
dx, (4) 
log (x) 
a= sup (5) 
01,02,X 
Here, a characterizes an upper bound of the log ratio of the densities in G, when 
the parameters are restricted to O and the variable to 
Note that the rate of convergence, l/k, is not related to the dimensions of 
The behavior of the constants, though, depends on the choices of G and the target 
f. 
For example we may take G to be the Gaussian location family, which we restrict 
to a set r which is a cube of side-length A. Likewise we restrict the parameters to 
be in the same cube. Then, 
dA : 
a < (6) 
In this case, a is linear in dimension. 
The value of c depends on the target density f. Suppose f is a finite mixture with 
M components, then 
< (7) 
with equality if and only if those M components are disjoint. Indeed, suppose 
f(x) M 
-- i=lPiqbOi(X), then piqbo,(X)/iM=l piqboi(X) _ 1 and hence 
M 
c}: f '4M--(P'bO' (x))bO' (x)dx < f E(1)cfio,(x)dx: M. (8) 
M -- 
--i=1PiO0, (X) i=1 
Genovese and Wasserman [3] deal with a similar setting. A Kullback-Leibler ap- 
proximation bound of order 1/Vr for one-dimensional mixtures of Gaussians is 
given by them. 
In the more general case that f is not necessarily in , we have a competitive 
optimality result. Our density approximation is nearly at least as good as any gp 
in . 
Mixture Density Estimation 281 
THEOREM 2 For every gp(x) - f 
2 
O(fllf) 5 O(fllgP) q- 
(9) 
Here, 
2 f fck(x)P(dO) f(x)dx. (10) 
In particular, we can take infimum over all gr E C, and still obtain a bound. 
Let D(flIC) = infgec D(fllg). A theory of information projection shows that if 
there exists a sequence of fk such that D(fllfk ) -+ D(fll), then f converges to 
a function f*, which achieves D(fll ). Note that f* is not necessarily an element 
in . This is developed in Li[4] building on the work of Bell and Cover[5]. As a 
consequence of Theorem 2 we have 
2 
O(fllf) 5 O(fllf*) q- 
(11) 
where c s is the smallest limit of , for sequences of P achieving D(fllg) that 
f,* 
approaches the infimum D(fllC ). 
We prove Theorem 1 by induction in the following section. An appealing feature 
of such an approach is that it provides an iterative estimation procedure which 
allows us to estimate one component at a time. This greedy procedure is shown 
to perform almost as well as the full-mixture procedures, while the computational 
task of estimating one component is considerably easier than estimating the full 
mixtures. 
Section 2 gives the iterative construction of a suitable approximation, while Section 
3 shows how such mixtures may be estimated from data. Risk bounds are stated in 
Section 4. 
2 An iterative construction of the approximation 
We provide an iterative construction of f's in the following fashion. Suppose 
during our discussion of approximation that f is given. We seek a k-component 
mixture f close to f. Initialize fx by choosing a single component from G to 
minimize D(fllfx ) = D(fllq5o). Now suppose we have fk-l(X). Then let f(x) = 
(1 - a)f-l(X) + acfo(x) where a and 0 are chosen to minimize D(fllf ). More 
generally let f be any sequence of k-component mixtures, for k = 1, 2, ... such 
that D(fllfk) <_ mina,0 D(fll(1 - a)fk-x q- aO). We prove that such sequences fk 
achieve the error bounds in Theorem 1 and Theorem 2. 
Those familiar with the iterative Hilbert space approximation results of Jones[6], 
Barron[I], and Lee, Bartlett and Williamson[7], will see that we follow a similar 
strategy. The use of L2 distance measures for density approximation involves L2 
norms of component densities that are exponentially large with dimension. Naive 
Taylor expansion of the Kullback-Leibler divergence leads to an L2 norm approxi- 
mation (weighted by the reciprocal of the density) for which the difficulty remains 
(Zeevi gc Meir[8], Li[9]). The challenge for us was to adapt iterative approximation 
to the use of Kullback-Leibler divergence in a manner that permits the constant 
a in the bound to involve the logarithm of the density ratio (rather than the ratio 
itself) to allow more manageable constants. 
282 J. Q. Li and A. R. Barron 
The proof establishes the inductive relationship 
Dk _< (1 -o)Dk-1 + o?B, 
where B is bounded and D = D(fllf). By choosing a = 1,a2 
thereafter a = 2/k, it's easy to see by induction that D _< 4B/k. 
(12) 
= 1/2 and 
To get (12) we establish a quadratic upper bound for -log/x - 
' f -- 
-log ((1-a)f_+a0) Three key analytic inequalities regarding to the logarithm 
will be handy for us, 
forr_>ro >0, 
and 
- log(to) + ro - 1]( r _ 1) 2 
-log(r) _< -(r- 1) +[ (to - 1) 2 
(13) 
2[-log(r) +r- 1] < logr, (14) 
r-1 - 
- log(r) + r - 1 
(r- 1) 2 
_< 1/2 + log-(r) 
(15) 
Note that in our application, ro is a ratio of densities in . Thus we obtain an upper 
bound for log-(ro) involving a. Indeed we find that (1/2 + log-(ro)) _< 7/4 where 
7 is as defined in the theorem. 
In the case that f is in , we take g = f. Then taking the expectation with respect 
to f of both sides of (16), we acquire a quadratic upper bound for Dk, noting that 
r = L Also note that D is a function of 0. The greedy algorithm chooses 0 to 
minimize D(O). Therefore 
Plugging the upper bound (16) for Dk(O) into (17), we have 
_ __ + .2 (7/4) + - log(ro)]f(x)dxP(dO). (18) 
g g 
(16) 
Now apply (14) and (15) respectively. We get 
log(r) < log(to) aqb a2 (1/2 log- 
.... + + (to)) + - log(ro). 
- g -/ 
where log-(.) is the negative part of the logarithm. The proof of of inequality (13) 
-- 1og(r)+r--1 
is done by verifying that (r_1)2 is monotone decreasing in r. Inequalities (14) 
and (15) are shown by separately considering the cases that r < I and r > i (as 
well as the limit as r -+ 1). To get the inequalities one multiplies through by (r- 1) 
or (r - 1) 2, respectively, and then takes derivatives to obtain suitable monotonicity 
in r as one moves away from r = 1. 
(x-a)f_+ao and ro = (1-a)f/_t where 
Now apply the inequality (13) with r = /7 /7 , 
g is an arbitrary density in  with g = f coP(dO). Note that r >_ ro in this case 
because ao > 0. Plug in r = ro + a  at the right side of (13) and expand the 
square. Then we get 
aqb - log(to) + ro aqb)] 2 
-log(r) < -(to+---1)+[ -!][(ro-1)+( 
- g (ro - 1) 2 g 
log(to) + a 2 -[- log(to) + ro- 1] log(to) + ro- 1] 
g (to 1) 2 + 2a-[- ' 
- g ro- 1 
Mxture Density Estimation 283 
where ro = (1- a)fk-x (x)/g(x) and P is chosen to satisfy f0 c)o(x)P(dO) = g(x). 
Thus 
z>k < (1 - f 
- 
f(x)dx(7/4) + a log(1 - a) - a - log(1 - a). 
(19) 
It can be shown that a log(1 - a) - a - log(1 - a) _< O. Thus we have the desired 
inductive relationship, 
D < (1- a)Dk-1 + a2c,e7/4. (20) 
Therefore, D _< k' 
In the case that f does not have a mixture representation of the form f boP(dO), 
i.e. f is outside the convex hull , we take D to be f f(x) log gr-eiZ2dx for any given 
fk(x) 
g.,:,(x) = f c)o(x)P(dO). The above analysis then yields D - D(fllf)- D(fllg1) _< 
Cp 
,,, as desired. That completes the proof of Theorems I and 2. 
3 A greedy estimation procedure 
The connection between the K-L divergence and the MLE helps to motivate the 
following estimation procedure for fk if we have data Xx, ...,X, sampled from f. 
The iterative construction of f can be turned into a sequential maximum likeli- 
hood estimation by changing minD(flirt, ) to max ElL1 log ft,(Xi) at each step. A 
surprising result is that the resulting estimator f has a log likelihood almost at 
least as high as log likelihood achieved by any density gp in C with a difference of 
order 1/k. We formally state it as 
- ^ - 'P (21) 
1 logfk(Xi) >_ 1 logg1:,(Xi) - 7 k 
n n 
i--1 i--1 
gp E . Here Fn is the empirical distribution, 
for all 
(I/n) -4__1 2 where 
Xl ,P 
f 
cx,r = (f . 
for which c.,p = 
(22) 
The proof of this result (21) follows as in the proof in the last section, except that 
now we take D = EF, logg.,:,(X)/f(X) to be the expectation with respect to F 
instead of with respect to the density f. 
Let's look at the computation at each step to see the benefits this new greedy 
procedure can bring for us. We have f(x) = (1- a)f-l(/) + acfo(x) with 0 and 
a chosen to maximize 
n 
' log[(1 - a)fk-1 (Xi) + ac)o(Xi)] (23) 
i----1 
which is a simple two component mixture problem, with one of the two components, 
f-x(x), fixed. To achieve the bound in (21), a can either be chosen by this iterative 
maximum likelihood or it can be held fixed at each step to equal a (which as 
before is a = 2/k for k > 2). Thus one may replace the MLE-computation of a k- 
component mixture by successive MLE-computations of two-component mixtures. 
The resulting estimate is guaranteed to have almost at least as high a likelihood as 
is achieved by any mixture density. 
284 J. Q. Li and A. R. Barron 
A disadvantage of the greedy procedure is that it may take a number of steps to 
adequately downweight poor initial choices. Thus it is advisable at each step to re- 
tune the weights of convex combinations of previous components (and even perhaps 
to adjust the locations of these components), in which case, the result from the 
previous iterations (with k - I components) provide natural initialization for the 
search at step k. The good news is that as long as for each k, given ]k-l, the ]k is 
chosen among k component mixtures to achieve likelihood at least as large as the 
choice achieving max0 in__x log[(1 - ak)fk-l(Xi) + akqb0(Xi)], that is, we require 
that 
ElogA(Xi) >_ moaxElog[(1-ok)f_x(Xi)+okqbo(Xi)], (24) 
i=1 i=1 
then the conclusion (21) will follow. 
In particular, our likelihood results and risk bound results apply both to the case 
that ]k is taken to be global maximizer of the likelihood over k-component mixtures 
as well as to the case that ]k is the result of the greedy procedure. 
4 Risk bounds for the MLE and the iterative MLE 
The metric entropy of the family G is controlled to obtain the risk bound and 
to determine the precisions with which the coordinates of the parameter space are 
allowed to be represented. Specifically, the following Lipschitz condition is assumed: 
for 0 E 0 C R d and x E X C R d, 
d 
sup I logo(x) -logr;bo, (x)l _< B '- IO -0. I 
x:' j=l 
(25) 
where 0j is the j-th coordinate of the parameter vector. Note that such a condition 
is satisfied by a Gaussian family with x restricted to a cube with sidelength A 
and has a location parameter 0 that is also prescribed to be in the same cube. In 
particular, if we let the variance be a 2, we may set B - 2A/a 2. 
Now we can state the bound on the K-L risk of fk. 
THEOREM 3 Assume the condition (5). Also assume 0 to be a cube with side- 
length A. Let fk(x) be either the maximizer of the likelihood over k-component 
mixtures or more generally any sequence of density estimates fk satisfying (24). 
We have 
 c 2 2kd 
E(D(fllfk)) - D(fllC) < . log(nnBe). (26) 
- k n 
From the bound on risk, a best choice of k would be of order roughly v/ leading 
to a bound on ED(f[[fk) - D(f[[) of order 1/v/ to within logarithmic factors. 
However the best such bound occurs with k - 7cf,.x//x/2dlog(nABe) which is 
not available when the value of cL, is unknown. More importantly, k should not be 
chosen merely to optimize an upper bound on risk, but rather to balance whatever 
approximation and estimation sources of error actually occur. Toward this end we 
optimize a penalized likelihood criterion related to the minimum description length 
principle, following Barron and Cover [10]. 
Let l(k) be a function of k that satisfies E3__l e -l(k) _< 1, such as l(k) = 2 log(k+ 1). 
Mixture Density Estimation 285 
A penalized MLE (or MDL) procedure picks k by minimizing 
1  log 1 2kdlOg(nABe) 
- + + 
n i--1 fk(Xi) n 
(27) 
Then we have 
i,. 2kd 
E(D(fllfi))- D(fll) < nn{72 c2 
_ ----' + 7 n 
log(.4s) + (2s) 
A proof of these risk bounds is given in Li[4]. It builds on general results for 
maximum likelihood and penalized maximum likelihood procedures. 
Recently, Dasgupta [11] has established a randomized algorithm for estimating mix- 
tures of Gaussians, in the case that data are drawn from a finite mixture of suffi- 
ciently separated Gaussian components with common covariance, that runs in time 
linear in the dimension and quadratic in the sample size. However, present forms 
of his algorithm require impractically large sample sizes to get reasonably accurate 
estimates of the density. It is not yet known how his techniques will work for more 
general mixtures. Here we see that iterative likelihood maximization provides a 
better relationship between accuracy, sample size and number of components. 
References 
[1] Barron, Andrew (1993) Universal Approximation Bounds for Superpositions of a 
Sigmoidal Function. IEEE Transactions on Information Theory 39, No. 3:930-945 
[2] Barron, Andrew (1994) Approximation and Estimation Bounds for Artificial 
Neural Networks. Machine Learning 14: 115-133. 
[3] Genovese, Chris and Wasserman, Larry (1998) Rates of Convergence for the 
Gaussian Mixture Seive. Manuscript. 
[4] Li, Jonathan Q. (1999) Estimation of Mixture Models. Ph.D Dissertation. The 
Department of Statistics. Yale University. 
[5] Bell, Robert and Cover, Thomas (1988) Game-theoretic optimal portfolios. Man- 
agement Science 34: 724-733. 
[6] Jones, Lee (1992) A simple lemma on greedy approximation in Hilbert space 
and convergence rates for projection pursuit regression and neural network training. 
Annals of Statistics 20: 608-613. 
[7] Lee, W.S., Bartlett, P.L. and Williamson R.C. (1996) Efficient Agnostic Learn- 
ing of Neural Networks with Bounded Fan-in. IEEE Transactions on Information 
Theory 42, No. 6: 2118-2132. 
[8] Zeevi, Assaf and Meir Ronny (1997) Density Estimation Through Convex Com- 
binations of Densities: Approximation and Estimation Bounds. NeUral Networks 
10, No.1: 99-109. 
[9] Li, Jonathan Q. (1997) Iterative Estimation of Mixture Models. Ph.D. Prospec- 
tus. The Department of Statistics. Yale University. 
[10] Barron, Andrew and Cover, Thomas (1991) Minimum Complexity Density 
Estimation. IEEE Transactions on Information Theory 37: 1034-1054. 
[11] Dasgupta, Sanjoy (1999) Learning Mixtures of Gaussians. Proc. IEEE Conf. 
on Foundations of Computer Science, 634-644. 
