An Alternative Model for Mixtures of Experts
Lei Xu 
Dept. of Computer Science, The Chinese University of Hong Kong 
Shatin, Hong Kong, Email lxu@cs.cuhk.hk 
Michael I. Jordan 
Dept. of Brain and Cognitive Sciences 
MIT 
Cambridge, MA 02139 
Geoffrey E. Hinton 
Dept. of Computer Science 
University of Toronto 
Toronto, M5S 1A4, Canada 
Abstract 
We propose an alternative model for mixtures of experts which uses 
a different parametric form for the gating network. The modified 
model is trained by the EM algorithm. In comparison with earlier 
models--trained by either EM or gradient ascent--there is no need 
to select a learning stepsize. We report simulation experiments 
which show that the new architecture yields faster convergence. 
We also apply the new model to two problem domains: piecewise 
nonlinear function approximation and the combination of multiple 
previously trained classifiers. 
1 INTRODUCTION
For the mixtures of experts architecture (Jacobs, Jordan, Nowlan & Hinton, 1991),
the EM algorithm decouples the learning process in a manner that fits well with the
modular structure and yields a considerably improved rate of convergence (Jordan &
Jacobs, 1994). The favorable properties of EM have also been shown by theoretical
analyses (Jordan & Xu, in press; Xu & Jordan, 1994).
It is difficult to apply EM to some parts of the mixtures of experts architecture
because of the nonlinearity of the softmax gating network. This makes the
maximization with respect to the parameters of the gating network nonlinear and
analytically unsolvable even for the simplest generalized linear case. Jordan and
Jacobs (1994) suggested a double-loop approach in which an inner loop of
iteratively reweighted least squares (IRLS) is used to perform the nonlinear
optimization. However, this requires extra computation, and the stepsize must be
chosen carefully to guarantee the convergence of the inner loop.
We propose an alternative model for mixtures of experts which uses a different
parametric form for the gating network. This form is chosen so that the maximization
with respect to the parameters of the gating network can be handled analytically.
Thus, a single-loop EM can be used, and no learning stepsize is required to
guarantee convergence. We report simulation experiments which show that the new
architecture yields faster convergence. We also apply the model to two problem
domains. One is a piecewise nonlinear function approximation problem with smooth
blending of pieces specified by polynomial, trigonometric, or other prespecified
basis functions. The other is to combine classifiers developed previously, a general
problem with a variety of applications (Xu et al., 1991, 1992). Xu and Jordan
(1993) proposed to solve the problem by using the mixtures of experts architecture
and suggested an algorithm for bypassing the difficulty caused by the softmax
gating networks. Here, we show that the algorithm of Xu and Jordan (1993) can be
regarded as a special case of the single-loop EM given in this paper and that the
single-loop EM also provides a further improvement.
2 MIXTURES OF EXPERTS AND EM LEARNING 
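The mixtures of experts model is based on the following conditional mixture:
$$P(y \mid x, \Theta) = \sum_{j=1}^{K} g_j(x, \nu)\, P(y \mid x, \theta_j),$$
$$P(y \mid x, \theta_j) = \frac{1}{(2\pi)^{d_y/2} |\Gamma_j|^{1/2}} \exp\left\{ -\tfrac{1}{2} [y - f_j(x, w_j)]^T \Gamma_j^{-1} [y - f_j(x, w_j)] \right\}, \tag{1}$$
where $x \in R^n$, $d_y$ is the dimension of $y$, $\Theta$ consists of $\nu$ and
$\{\theta_j\}_1^K$, and each $\theta_j$ consists of $w_j$ and $\Gamma_j$. The
vector $f_j(x, w_j)$ is the output of the j-th expert net. The scalar
$g_j(x, \nu)$, $j = 1, \ldots, K$, is given by the softmax function:
$$g_j(x, \nu) = \frac{e^{\beta_j(x, \nu)}}{\sum_i e^{\beta_i(x, \nu)}}. \tag{2}$$
In this equation, $\beta_j(x, \nu)$, $j = 1, \ldots, K$, are the outputs of the gating network.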
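As a rough illustration of eqs. (1) and (2), the sketch below evaluates the conditional mixture for scalar x and y. The linear form of the gating outputs, the function names, and the parameter layout are assumptions made for the example and are not taken from the paper.

```python
import math

def softmax(betas):
    """Numerically stable softmax over the gating outputs beta_j (eq. 2)."""
    top = max(betas)
    exps = [math.exp(b - top) for b in betas]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density, the scalar case of eq. (1)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(y, x, gate_params, expert_params):
    """P(y | x) = sum_j g_j(x) * N(y; w_j * x + b_j, var_j).

    gate_params   -- list of (v_j, c_j); beta_j(x) = v_j * x + c_j (assumed linear)
    expert_params -- list of (w_j, b_j, var_j) for linear experts (hypothetical layout)
    """
    g = softmax([v * x + c for v, c in gate_params])
    return sum(gj * gaussian_pdf(y, w * x + b, var)
               for gj, (w, b, var) in zip(g, expert_params))

# Gating outputs 0 and ln 3 give gating probabilities 1/4 and 3/4.
gates = softmax([0.0, math.log(3.0)])

# Two identical experts: the mixture collapses to a single Gaussian.
density = mixture_density(0.0, 1.0,
                          [(0.0, 0.0), (1.0, 0.0)],
                          [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)])
```

Because the gating probabilities sum to one, the mixture of two identical experts equals the single expert density regardless of the gating parameters.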
The parameter $\Theta$ is estimated by Maximum Likelihood (ML), where the log
likelihood is given by $L = \sum_t \ln P(y^{(t)} \mid x^{(t)}, \Theta)$. The ML
estimate can be found iteratively using the EM algorithm as follows. Given the
current estimate $\Theta^{(k)}$, each iteration consists of two steps.
(1) E-step. For each pair $\{x^{(t)}, y^{(t)}\}$, compute
$h_j^{(k)}(y^{(t)} \mid x^{(t)}) = P(j \mid x^{(t)}, y^{(t)})$, and
then form a set of objective functions:
$$Q_j^e(\theta_j) = \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) \ln P(y^{(t)} \mid x^{(t)}, \theta_j), \quad j = 1, \ldots, K;$$
$$Q^g(\nu) = \sum_t \sum_j h_j^{(k)}(y^{(t)} \mid x^{(t)}) \ln g_j(x^{(t)}, \nu). \tag{3}$$
(2) M-step. Find a new estimate $\Theta^{(k+1)} = \{\{\theta_j^{(k+1)}\}_{j=1}^K, \nu^{(k+1)}\}$ with:
$$\theta_j^{(k+1)} = \arg\max_{\theta_j} Q_j^e(\theta_j), \quad j = 1, \ldots, K; \qquad \nu^{(k+1)} = \arg\max_{\nu} Q^g(\nu). \tag{4}$$
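The E-step posterior follows from Bayes' rule: each expert's weight for a sample is its gating probability times its likelihood, normalised over the experts. A minimal sketch, assuming the gating probabilities and expert likelihoods for one sample have already been computed (the function name is hypothetical):

```python
def responsibilities(gate_probs, expert_likelihoods):
    """Posterior h_j = g_j * P(y | x, theta_j) / sum_i g_i * P(y | x, theta_i)."""
    joint = [g * p for g, p in zip(gate_probs, expert_likelihoods)]
    total = sum(joint)
    return [v / total for v in joint]

# Equal gating probabilities; the second expert explains y three times better.
h = responsibilities([0.5, 0.5], [0.2, 0.6])
```

With equal gating probabilities the posterior is driven entirely by the likelihood ratio, so here the two experts receive weights of one quarter and three quarters.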
In certain cases, for example when $f_j(x, w_j)$ is linear in the parameters $w_j$,
$\max_{\theta_j} Q_j^e(\theta_j)$ can be solved by solving
$\partial Q_j^e / \partial \theta_j = 0$. When $f_j(x, w_j)$ is nonlinear
with respect to $w_j$, however, the maximization cannot be performed analytically.
Moreover, due to the nonlinearity of softmax, $\max_{\nu} Q^g(\nu)$ cannot be solved
analytically in any case. There are two possibilities for attacking these nonlinear
optimization problems. One is to use a conventional iterative optimization technique
(e.g., gradient ascent) to perform one or more inner-loop iterations. The other is to
simply find a new estimate such that
$Q_j^e(\theta_j^{(k+1)}) \ge Q_j^e(\theta_j^{(k)})$ and $Q^g(\nu^{(k+1)}) \ge Q^g(\nu^{(k)})$.
Usually, the algorithms that perform a full maximization during the M step are referred
to as "EM" algorithms, and algorithms that simply increase the Q function during the
M step as "GEM" algorithms. In this paper we will further distinguish between
EM algorithms requiring and not requiring an iterative inner loop by designating
them as double-loop EM and single-loop EM respectively.
Jordan and Jacobs (1994) considered the case of linear
$\beta_j(x, \nu) = \nu_j^T [x, 1]$ and semi-linear $f_j(w_j^T [x, 1])$ with
nonlinear $f_j(\cdot)$. They proposed a double-loop EM algorithm by using the IRLS
method to implement the inner-loop iteration. For more general nonlinear
$\beta_j(x, \nu)$ and $f_j(x, \theta_j)$, Jordan and Xu (in press) showed that an
extended IRLS can be used for this inner loop. It can be shown that IRLS and the
extension are equivalent to solving eq. (3) by the so-called Fisher Scoring method.
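To make the inner loop concrete, here is a small sketch of a Newton (Fisher scoring) update for the softmax gating net in the two-expert case with scalar x, where the first gating output is fixed at 0 and the second is linear in x. This is an illustrative reduction, not the implementation used by Jordan and Jacobs (1994); the function name, the `step` argument, and the toy posteriors are assumptions.

```python
import math

def irls_gating_step(xs, h2, v, step=1.0):
    """One Newton/IRLS step on Q^g for K = 2, beta_1 = 0, beta_2 = v[0]*x + v[1].

    xs -- scalar inputs; h2 -- E-step posteriors of expert 2 (soft targets).
    Returns the updated v and the gradient evaluated at the incoming v.
    """
    g0 = g1 = 0.0          # gradient of Q^g w.r.t. (slope, intercept)
    a = b = d = 0.0        # entries of the 2x2 matrix X^T W X = [[a, b], [b, d]]
    for x, h in zip(xs, h2):
        p = 1.0 / (1.0 + math.exp(-(v[0] * x + v[1])))   # g_2(x, v)
        w = p * (1.0 - p)                                 # IRLS weight
        g0 += (h - p) * x
        g1 += h - p
        a += w * x * x
        b += w * x
        d += w
    det = a * d - b * b
    dv0 = (d * g0 - b * g1) / det      # solve the 2x2 system by Cramer's rule
    dv1 = (a * g1 - b * g0) / det
    return [v[0] + step * dv0, v[1] + step * dv1], (g0, g1)

# Toy posteriors that increase with x (hypothetical values).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
h2 = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
v = [0.0, 0.0]
for _ in range(10):                    # inner loop: repeated within one M-step
    v, grad = irls_gating_step(xs, h2, v)
```

The loop at the bottom is the inner loop of a double-loop EM: it has to be run inside every M-step, and the `step` factor is the stepsize that may need rescaling when the plain update diverges.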
3 A NEW GATING NET AND A SINGLE-LOOP EM 
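To sidestep the need for a nonlinear optimization routine in the inner loop of the
EM algorithm, we propose the following modified gating network:
$$g_j(x, \nu) = \frac{\alpha_j P(x \mid \nu_j)}{\sum_i \alpha_i P(x \mid \nu_i)}, \qquad \sum_j \alpha_j = 1, \quad \alpha_j \ge 0,$$
$$P(x \mid \nu_j) = a_j(\nu_j)^{-1} b_j(x) \exp\{ c_j(\nu_j)^T t_j(x) \}, \tag{5}$$
where $\nu = \{\alpha_j, \nu_j, j = 1, \ldots, K\}$, $t_j(x)$ is a vector of sufficient
statistics, and the $P(x \mid \nu_j)$'s are density functions from the exponential
family. The most common example is the Gaussian:
$$P(x \mid \nu_j) = \frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - m_j)^T \Sigma_j^{-1} (x - m_j) \right\}. \tag{6}$$
In eq. (5), $g_j(x, \nu)$ is actually the posterior probability $P(j \mid x)$ that $x$
is assigned to the partition corresponding to the j-th expert net, obtained from
Bayes' rule:
$$g_j(x, \nu) = P(j \mid x) = \alpha_j P(x \mid \nu_j) / P(x, \nu), \qquad P(x, \nu) = \sum_i \alpha_i P(x \mid \nu_i). \tag{7}$$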
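The modified gating net can be sketched for scalar x with Gaussian component densities as follows. The function names are hypothetical, and the one-dimensional restriction is an assumption made to keep the example short.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density, the scalar case of eq. (6)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gating_posteriors(x, alphas, means, variances):
    """g_j(x) = alpha_j P(x | nu_j) / sum_i alpha_i P(x | nu_i), as in eq. (7)."""
    weighted = [a * gaussian_pdf(x, m, s2)
                for a, m, s2 in zip(alphas, means, variances)]
    total = sum(weighted)              # this is P(x, nu)
    return [w / total for w in weighted]

# Midway between two symmetric components the gate is undecided.
g_mid = gating_posteriors(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
# Well inside the second component's region the gate favours expert 2.
g_right = gating_posteriors(2.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The gate is a ratio of weighted densities rather than a softmax over network outputs, which is what later allows its parameters to be updated in closed form.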
Inserting this $g_j(x, \nu)$ into the model eq. (1), we get
$$P(y \mid x, \Theta) = \sum_j \frac{\alpha_j P(x \mid \nu_j)}{P(x, \nu)} P(y \mid x, \theta_j). \tag{8}$$
If we do ML estimation directly on this $P(y \mid x, \Theta)$ and derive an EM
algorithm, we again find that the maximization $\max_{\nu} Q^g(\nu)$ cannot be solved
analytically. To avoid this difficulty, we rewrite eq. (8) as:
$$P(y, x) = P(y \mid x, \Theta) P(x, \nu) = \sum_j \alpha_j P(x \mid \nu_j) P(y \mid x, \theta_j). \tag{9}$$
This suggests an asymmetrical representation for the joint density. We accordingly
perform ML estimation based on $L' = \sum_t \ln P(y^{(t)}, x^{(t)})$ to determine the
parameters $\alpha_j, \nu_j, \theta_j$ of the gating net and the expert nets. This can
be done by the following EM algorithm:
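(1) E-step. Compute
$$h_j^{(k)}(y^{(t)} \mid x^{(t)}) = \frac{\alpha_j^{(k)} P(x^{(t)} \mid \nu_j^{(k)}) P(y^{(t)} \mid x^{(t)}, \theta_j^{(k)})}{\sum_i \alpha_i^{(k)} P(x^{(t)} \mid \nu_i^{(k)}) P(y^{(t)} \mid x^{(t)}, \theta_i^{(k)})}. \tag{10}$$
Then let $Q_j^e(\theta_j)$, $j = 1, \ldots, K$, be the same as given in eq. (3), and
decompose $Q^g(\nu)$ further into
$$Q_j^g(\nu_j) = \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) \ln P(x^{(t)} \mid \nu_j), \quad j = 1, \ldots, K;$$
$$Q^{\alpha} = \sum_t \sum_j h_j^{(k)}(y^{(t)} \mid x^{(t)}) \ln \alpha_j, \quad \text{with } \alpha = \{\alpha_1, \ldots, \alpha_K\}. \tag{11}$$
(2) M-step. Find a new estimate for $j = 1, \ldots, K$:
$$\theta_j^{(k+1)} = \arg\max_{\theta_j} Q_j^e(\theta_j), \qquad \nu_j^{(k+1)} = \arg\max_{\nu_j} Q_j^g(\nu_j),$$
$$\alpha^{(k+1)} = \arg\max_{\alpha} Q^{\alpha}, \quad \text{s.t. } \sum_j \alpha_j = 1. \tag{12}$$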
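The maximization for the expert nets is the same as in eq. (4). However, for the
gating net the maximization now becomes analytically solvable as long as
$P(x \mid \nu_j)$ is from the exponential family. That is, we have:
$$\nu_j^{(k+1)} = \frac{\sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)})\, t_j(x^{(t)})}{\sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)})}, \qquad \alpha_j^{(k+1)} = \frac{1}{N} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}), \tag{13}$$
where $N$ is the number of training pairs and $\nu_j$ is taken in the expectation
parameterization of the family.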
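The update for the mixing proportions follows from a standard Lagrange-multiplier argument applied to $Q^{\alpha}$ under the constraint in eq. (12). The short derivation below is a sketch of that step; $N$ (the number of training pairs) and $\lambda$ are symbols introduced here for the derivation.

```latex
\begin{aligned}
\mathcal{L}(\alpha, \lambda)
  &= \sum_t \sum_j h_j^{(k)}(y^{(t)} \mid x^{(t)}) \ln \alpha_j
     + \lambda \Big( \sum_j \alpha_j - 1 \Big), \\
\frac{\partial \mathcal{L}}{\partial \alpha_j}
  &= \frac{1}{\alpha_j} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}) + \lambda = 0
  \;\Longrightarrow\;
  \alpha_j = -\frac{1}{\lambda} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}), \\
\sum_j \alpha_j = 1
  &\;\Longrightarrow\;
  -\lambda = \sum_t \sum_j h_j^{(k)}(y^{(t)} \mid x^{(t)}) = N
  \quad \text{since } \sum_j h_j^{(k)}(y^{(t)} \mid x^{(t)}) = 1 \text{ for every } t, \\
\alpha_j^{(k+1)}
  &= \frac{1}{N} \sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)}).
\end{aligned}
```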
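In particular, when $P(x \mid \nu_j)$ is a Gaussian density, the update becomes:
$$m_j^{(k+1)} = \frac{\sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)})\, x^{(t)}}{\sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)})},$$
$$\Sigma_j^{(k+1)} = \frac{\sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)})\, [x^{(t)} - m_j^{(k+1)}][x^{(t)} - m_j^{(k+1)}]^T}{\sum_t h_j^{(k)}(y^{(t)} \mid x^{(t)})}. \tag{14}$$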
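Putting the E-step and the closed-form M-step together gives a loop with no inner iteration and no stepsize. The sketch below is a minimal scalar version under stated assumptions: x and y are one-dimensional, experts are linear with a Gaussian noise variance that is also re-estimated by weighted ML, and the toy data, initial values, and all names are hypothetical rather than taken from the paper's experiments.

```python
import math
import random

def normal_pdf(z, mean, var):
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(xs, ys, params):
    """One single-loop EM iteration for scalar x, y and linear experts.

    params: one dict per expert with keys alpha, m, s2 (gating net) and
    w, b, g2 (expert net, y ~ N(w*x + b, g2)). Returns (new_params, loglik),
    where loglik is the joint log likelihood L' at the incoming params.
    """
    n = len(xs)
    h, loglik = [], 0.0
    for x, y in zip(xs, ys):                          # E-step, eq. (10)
        joint = [p["alpha"] * normal_pdf(x, p["m"], p["s2"])
                 * normal_pdf(y, p["w"] * x + p["b"], p["g2"]) for p in params]
        total = sum(joint)
        loglik += math.log(total)
        h.append([v / total for v in joint])
    new_params = []
    for j in range(len(params)):                      # M-step, all closed form
        hj = [row[j] for row in h]
        sh = sum(hj)
        m = sum(a * x for a, x in zip(hj, xs)) / sh               # eq. (14)
        s2 = sum(a * (x - m) ** 2 for a, x in zip(hj, xs)) / sh   # eq. (14)
        my = sum(a * y for a, y in zip(hj, ys)) / sh
        sxy = sum(a * (x - m) * (y - my) for a, x, y in zip(hj, xs, ys)) / sh
        w = sxy / s2                                  # weighted least squares
        b = my - w * m
        g2 = sum(a * (y - w * x - b) ** 2 for a, x, y in zip(hj, xs, ys)) / sh
        new_params.append({"alpha": sh / n, "m": m, "s2": s2,
                           "w": w, "b": b, "g2": g2})
    return new_params, loglik

# Toy data from two linear regimes (assumed noise variance 0.3).
random.seed(0)
xs, ys = [], []
for _ in range(300):
    if random.random() < 0.25:
        x = random.uniform(-1.0, 1.5)
        y = 0.8 * x + 0.4
    else:
        x = random.uniform(1.0, 4.0)
        y = 0.8 * x + 2.4
    xs.append(x)
    ys.append(y + random.gauss(0.0, 0.3 ** 0.5))

params = [{"alpha": 0.5, "m": 0.0, "s2": 1.0, "w": 0.0, "b": 0.0, "g2": 1.0},
          {"alpha": 0.5, "m": 3.0, "s2": 1.0, "w": 0.0, "b": 3.0, "g2": 1.0}]
logliks = []
for _ in range(30):
    params, ll = em_step(xs, ys, params)
    logliks.append(ll)
```

Since every M-step quantity is maximised exactly, the recorded joint log likelihood should not decrease from one iteration to the next, which is a convenient sanity check when experimenting with this kind of update.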
Two issues deserve to be emphasized further: 
(1) The gating nets eq. (2) and eq. (5) become identical when
$\beta_j(x, \nu) = \ln \alpha_j + \ln b_j(x) + c_j(\nu_j)^T t_j(x) - \ln a_j(\nu_j)$.
In other words, the gating net in eq. (5) explicitly uses this function family
instead of the function family defined by a multilayer feedforward network.
(2) It follows from eq. (9) that
$\max \ln P(y, x \mid \Theta) = \max [\ln P(y \mid x, \Theta) + \ln P(x \mid \nu)]$.
So, the solution given by eqs. (10) through (14) is actually different from the one
given by the original eqs. (3) and (4). The former tries to model both the mapping
from $x$ to $y$ and the input $x$, while the latter only models the mapping from $x$
to $y$. In fact, here we learn the parameters of the gating net and the expert nets
via an asymmetrical representation eq. (9) of the joint density $P(y, x)$ which
includes $P(y \mid x)$ implicitly. However, in the testing phase, the total output
still follows eq. (8).
4 PIECEWISE NONLINEAR APPROXIMATION 
The simple form $f_j(x, w_j) = w_j^T [x, 1]$ is not the only case to which
single-loop EM applies. Whenever $f_j(x, w_j)$ can be written in a form linear in
the parameters:
$$f_j(x, w_j) = \sum_i w_{i,j} \phi_{i,j}(x) + w_{0,j} = w_j^T [\phi_j(x), 1], \tag{15}$$
where $\phi_{i,j}(x)$ are prespecified basis functions,
$\max_{\theta_j} Q_j^e(\theta_j)$, $j = 1, \ldots, K$, in eq. (3) is still a weighted
least squares problem that can be solved analytically. One useful special case is
when $\phi_{i,j}(x)$ are canonical polynomial terms $x_1^{r_1} \cdots x_d^{r_d}$,
$r_i \ge 0$. In this case, the mixture of experts model implements piecewise
polynomial approximations. Another case is that $\phi_{i,j}(x)$ is
$\prod_i \sin^{r_i}(j \pi x_i) \cos^{r_i}(j \pi x_i)$, $r_i \ge 0$, in which
case the mixture of experts implements piecewise trigonometric approximations.
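For a scalar input and a polynomial basis, the expert M-step reduces to weighted least squares on the normal equations. The following sketch assumes a cubic basis and uses a small hand-written Gaussian elimination so that it needs no external libraries; the helper names and the toy weights are hypothetical.

```python
def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def weighted_poly_fit(xs, ys, hs, degree=3):
    """Minimise sum_t h_t (y_t - w^T phi(x_t))^2 with phi(x) = [x^degree, ..., x, 1]."""
    phis = [[x ** p for p in range(degree, -1, -1)] for x in xs]
    k = degree + 1
    a = [[sum(h * phi[i] * phi[j] for h, phi in zip(hs, phis)) for j in range(k)]
         for i in range(k)]
    b = [sum(h * phi[i] * y for h, phi, y in zip(hs, phis, ys)) for i in range(k)]
    return solve_linear(a, b)

# Noise-free cubic y = 0.5 x^3 - x + 2 with arbitrary positive weights.
xs = [-1.0 + 0.25 * i for i in range(11)]
ys = [0.5 * x ** 3 - x + 2.0 for x in xs]
hs = [0.2 + 0.05 * i for i in range(11)]
w = weighted_poly_fit(xs, ys, hs)
```

Here `hs` plays the role of the E-step posteriors for one expert. On noise-free data the fit recovers the generating coefficients whatever the weights, because the exact cubic has zero weighted residual.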
5 COMBINING MULTIPLE CLASSIFIERS 
Given pattern classes $C_i$, $i = 1, \ldots, M$, we consider classifiers $e_j$ that
for each input $x$ produce an output $P_j(y \mid x)$:
$$P_j(y \mid x) = [p_j(1 \mid x), \ldots, p_j(M \mid x)], \qquad p_j(i \mid x) \ge 0, \quad \sum_i p_j(i \mid x) = 1. \tag{16}$$
The problem of Combining Multiple Classifiers (CMC) is to combine these
$P_j(y \mid x)$'s to give a combined estimate of $P(y \mid x)$. Xu and Jordan (1993)
proposed to solve CMC problems by regarding the problem as a special example of the
mixture density problem eq. (1) with the $P_j(y \mid x)$'s known and only the gating
net $g_j(x, \nu)$ to be learned. In Xu and Jordan (1993), one problem encountered was
also the nonlinearity of softmax gating networks, and an algorithm was proposed to
avoid the difficulty.
Actually, the single-loop EM given by eq. (10) and eq. (13) can be directly used
to solve the CMC problem. In particular, when $P(x \mid \nu_j)$ is Gaussian, eq. (13)
becomes eq. (14). Assuming that $\alpha_1 = \cdots = \alpha_K$ in eq. (7), eq. (10)
becomes
$$h_j^{(k)}(y^{(t)} \mid x^{(t)}) = \frac{P(x^{(t)} \mid \nu_j^{(k)}) P_j(y^{(t)} \mid x^{(t)})}{\sum_i P(x^{(t)} \mid \nu_i^{(k)}) P_i(y^{(t)} \mid x^{(t)})}.$$
If we divide both the numerator and denominator by $\sum_i P(x^{(t)} \mid \nu_i^{(k)})$,
we get
$$h_j^{(k)}(y^{(t)} \mid x^{(t)}) = \frac{g_j(x^{(t)}, \nu^{(k)}) P_j(y^{(t)} \mid x^{(t)})}{\sum_i g_i(x^{(t)}, \nu^{(k)}) P_i(y^{(t)} \mid x^{(t)})}.$$
Comparing this equation with eq. (7a) in Xu and Jordan (1993), we can see that the
two equations are actually the same. Despite the different notation, $c_j(x)$ and
$P_j(y^{(t)} \mid x^{(t)})$ in Xu and Jordan (1993) are the same as $g_j(x, \nu)$ and
$P(y^{(t)} \mid x^{(t)}, \theta_j)$ in Section 3. So the algorithm of Xu and
Jordan (1993) is a special case of the single-loop EM given in Section 3.
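A small sketch of how the single-loop EM might be applied to classifier combination: the classifier outputs are held fixed and only the Gaussian gating parameters are updated, with equal mixing proportions as assumed above. The scalar input, the toy classifiers, and all function names are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cmc_em_step(xs, labels, clf_probs, means, variances):
    """One EM step; clf_probs[j][t][i] = p_j(i | x_t) from fixed classifier j."""
    k = len(means)
    h = []
    for t, (x, y) in enumerate(zip(xs, labels)):       # E-step, equal alphas
        joint = [normal_pdf(x, means[j], variances[j]) * clf_probs[j][t][y]
                 for j in range(k)]
        total = sum(joint)
        h.append([v / total for v in joint])
    new_means, new_vars = [], []
    for j in range(k):                                 # M-step, eq. (14)
        sh = sum(row[j] for row in h)
        m = sum(row[j] * x for row, x in zip(h, xs)) / sh
        new_means.append(m)
        new_vars.append(sum(row[j] * (x - m) ** 2 for row, x in zip(h, xs)) / sh)
    return new_means, new_vars, h

def combine(x, probs_at_x, means, variances):
    """Combined estimate sum_j g_j(x) P_j(. | x) with equal alphas."""
    dens = [normal_pdf(x, m, v) for m, v in zip(means, variances)]
    total = sum(dens)
    n_classes = len(probs_at_x[0])
    return [sum(d / total * p[i] for d, p in zip(dens, probs_at_x))
            for i in range(n_classes)]

xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

def toy_probs(reliable_side):
    """Toy classifier: 0.9 on the true class on its reliable side, else 0.5."""
    out = []
    for x, y in zip(xs, labels):
        p_true = 0.9 if x * reliable_side > 0 else 0.5
        row = [1.0 - p_true, 1.0 - p_true]
        row[y] = p_true
        out.append(row)
    return out

clf_probs = [toy_probs(-1.0), toy_probs(1.0)]   # classifier 0 is good for x < 0
means, variances = [-0.5, 0.5], [1.0, 1.0]
for _ in range(5):
    means, variances, h = cmc_em_step(xs, labels, clf_probs, means, variances)
combined = combine(-1.0, [clf_probs[0][2], clf_probs[1][2]], means, variances)
```

In this toy setting the gating densities drift toward the region where each classifier is reliable, so the combined estimate at x = -1 leans on the first classifier.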
6 SIMULATION RESULTS 
We compare the performance of the EM algorithm presented earlier with the model
of mixtures of experts presented by Jordan and Jacobs (1994). As shown in Fig.
1(a), we consider a mixture of experts model with K = 2. For the expert nets,
each $P(y \mid x, \theta_j)$ is Gaussian given by eq. (1) with linear
$f_j(x, w_j) = w_j^T [x, 1]$. For the new gating net, each $P(x \mid \nu_j)$ in
eq. (5) is Gaussian given by eq. (6). For the old gating net eq. (2),
$\beta_1(x, \nu) = 0$ and $\beta_2(x, \nu) = \nu^T [x, 1]$. The learning speeds of
the two are significantly different. The new algorithm takes k = 15 iterations for
the log-likelihood to converge to the value of -1271.8. These iterations require
about 1,351,383 MATLAB flops. For the old algorithm, we use the IRLS algorithm
given in Jordan and Jacobs (1994) for the inner loop iteration. In experiments,
we found that it usually took a large number of iterations for the inner loop to
converge. To save computations, we limit the maximum number of inner-loop
iterations to 10. We found that this saved computation without obviously
influencing the overall performance. From Fig. 1(b), we see that the outer loop
converges in about 16 iterations. Each inner loop takes 290,498 flops and the entire
process requires 5,312,695 flops. So, we see that the new algorithm yields a speedup
of about 5,312,695/1,351,383 = 3.9. Moreover, no external adjustment is needed to
ensure the convergence of the new algorithm. But for the old one the direct use
of IRLS can make the inner loop diverge and we need to appropriately rescale the
updating stepsize of IRLS.
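The cost comparison can be checked with a few lines of arithmetic. The figures below are the flop counts and iteration count quoted in the text; the variable names are ours, and treating the inner-loop total as 16 times the per-loop cost is an assumption about how the counts relate.

```python
new_flops = 1_351_383      # single-loop EM, total flops to convergence
old_flops = 5_312_695      # double-loop EM, total flops to convergence
inner_flops = 290_498      # flops per inner IRLS loop
outer_iters = 16           # outer iterations of the double-loop EM

speedup = old_flops / new_flops
inner_total = outer_iters * inner_flops          # flops spent in inner loops
inner_share = inner_total / old_flops            # fraction of the old cost
```

Under that assumption, most of the double-loop cost is spent inside the IRLS inner loops, which is the part the single-loop EM removes.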
Figs. 2(a) and (b) show the results of a simulation of a piecewise polynomial
approximation problem utilizing the approach described in Section 4. We consider a
mixture of experts model with K = 2. For expert nets, each $P(y \mid x, \theta_j)$
is Gaussian given by eq. (1) with
$f_j(x, w_j) = w_{3,j} x^3 + w_{2,j} x^2 + w_{1,j} x + w_{0,j}$. In the new gating
net eq. (5), each $P(x \mid \nu_j)$ is again Gaussian given by eq. (6). We see that
the higher order nonlinear regression has been fit quite well.
For multiple classifier combination, the problem and data are the same as in Xu and
Jordan (1993). Table 1 shows the classification results. Com-old and Com-new
denote the method given in Xu and Jordan (1993) and in Section 5 respectively. We
see that both improve the classification rate of each individual network considerably
and that Com-new improves on Com-old.

               Classifier e_1   Classifier e_2   Com-old   Com-new
Training set   89.9%            93.3%            98.6%     99.4%
Testing set    89.2%            92.7%            98.0%     99.0%

Table 1: A comparison of the correct classification rates
Figure 1: (a) 1000 samples from $y = a_1 x + a_2 + \varepsilon$, $a_1 = 0.8$,
$a_2 = 0.4$, $x \in [-1, 1.5]$ with prior $\alpha_1 = 0.25$ and
$y = a_1' x + a_2' + \varepsilon$, $a_1' = 0.8$, $a_2' = 2.4$, $x \in [1, 4]$ with
prior $\alpha_2 = 0.75$, where $x$ is a uniform random variable and $\varepsilon$ is
from Gaussian $N(0, 0.3)$. The two lines through the clouds are the estimated models
of two expert nets. The fits obtained by the two learning algorithms are almost the
same. (b) The evolution of the log-likelihood. The solid line is for the modified
learning algorithm. The dotted line is for the original learning algorithm (the
outer loop iteration).
7 REMARKS 
Recently, Ghahramani and Jordan (1994) proposed solving function approximation
problems by using a mixture of Gaussians to estimate the joint density of the input
and output (see also Specht, 1991; Tresp et al., 1994). In the special case of linear
$f_j(x, w_j) = w_j^T [x, 1]$ and Gaussian $P(x \mid \nu_j)$ with equal priors, the
method given in Section 3 provides the same result as Ghahramani and Jordan (1994),
although the parameterizations of the two methods are different. However, the method
of this paper also applies to nonlinear $f_j(x, w_j) = w_j^T [\phi_j(x), 1]$ for
piecewise nonlinear approximation or more generally $f_j(x, w_j)$ that is nonlinear
with respect to $w_j$, and applies to cases in which $P(y, x \mid \nu_j, \theta_j)$
is not Gaussian, as well as the case of combining multiple classifiers. Furthermore,
the methods proposed in Sections 3 and 4 can also be extended to the hierarchical
mixture of experts architecture (Jordan & Jacobs, 1994) so that single-loop EM can be
used to facilitate its training.
References 
Ghahramani, Z., & Jordan, M.I. (1994). Function approximation via density
estimation using the EM approach. In Cowan, J.D., Tesauro, G., & Alspector, J.
(Eds.), Advances in NIPS 6. San Mateo, CA: Morgan Kaufmann.
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991). Adaptive mixtures
of local experts. Neural Computation, 3, 79-87.
Figure 2: Piecewise 3rd-order polynomial approximation. (a) 1000 samples from
$y = a_1 x^3 + a_3 x + a_4 + \varepsilon$, $x \in [-1, 1.5]$ with prior
$\alpha_1 = 0.4$ and $y = a_2' x^2 + a_3' x + a_4' + \varepsilon$, $x \in [1, 4]$
with prior $\alpha_2 = 0.6$, where $x$ is a uniform random variable and
$\varepsilon$ is from Gaussian $N(0, 0.15)$. The two curves through the clouds are
the estimated models of two expert nets. (b) The evolution of the log-likelihood.
Jordan, M.I., & Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM
algorithm. Neural Computation, 6, 181-214.
Jordan, M.I., & Xu, L. (in press). Convergence results for the EM approach to
mixtures-of-experts architectures. Neural Networks.
Specht, D. (1991). A general regression neural network. IEEE Trans. Neural
Networks, 2, 568-576.
Tresp, V., Ahmad, S., & Neuneier, R. (1994). Training neural networks with
deficient data. In Cowan, J.D., Tesauro, G., & Alspector, J. (Eds.), Advances in
NIPS 6. San Mateo, CA: Morgan Kaufmann.
Xu, L., Krzyzak, A., & Suen, C.Y. (1991). Associative switch for combining multiple
classifiers. Proc. of 1991 IJCNN, Vol. I. Seattle, 43-48.
Xu, L., Krzyzak, A., & Suen, C.Y. (1992). Several methods for combining multiple
classifiers and their applications in handwritten character recognition. IEEE Trans.
on SMC, Vol. SMC-22, 418-435.
Xu, L., & Jordan, M.I. (1993). EM learning on a generalized finite mixture model
for combining multiple classifiers. Proceedings of World Congress on Neural
Networks, Vol. IV. Portland, OR, 227-230.
Xu, L., & Jordan, M.I. (1994). On convergence properties of the EM algorithm for
Gaussian mixtures. Submitted to Neural Computation.
