Ensemble Learning 
for Multi-Layer Networks 
David Barber* 
Christopher M. Bishop t 
Neural Computing Research Group 
Department of Applied Mathematics and Computer Science 
Aston University, Birmingham B4 7ET, U.K. 
http://www. ncrg. aston. ac. uk/ 
Abstract 
Bayesian treatments of learning in neural networks are typically 
based either on local Gaussian approximations to a mode of the 
posterior weight distribution, or on Markov chain Monte Carlo 
simulations. A third approach, called ensemble learning, was in- 
troduced by Hinton and van Camp (1993). It aims to approximate 
the posterior distribution by minimizing the Kullback-Leibler di- 
vergence between the true posterior and a parametric approximat- 
ing distribution. However, the derivation of a deterministic algo- 
rithm relied on the use of a Gaussian approximating distribution 
with a diagonal covariance matrix and so was unable to capture 
the posterior correlations between parameters. In this paper, we 
show how the ensemble learning approach can be extended to full- 
covariance Gaussian distributions while remaining computationally 
tractable. We also extend the framework to deal with hyperparam- 
eters, leading to a simple re-estimation procedure. Initial results 
from a standard benchmark problem are encouraging. 
I Introduction 
Bayesian techniques have been successfully applied to neural networks in the con- 
text of both regression and classification problems (MacKay 1992; Neal 1996). In 
contrast to the maximum likelihood approach which finds only a single estimate 
for the regression parameters, the Bayesian approach yields a distribution of weight 
parameters, p(wlD), conditional on the training data D, and predictions are ex- 
*Present address: SNN, University of Nijmegen, Geert Grooteplein 21, Nijmegen, The 
Netherlands. http: //www. mbfys. kun. nl/snn/email: davidbmbfys. kun. nl 
tPresent address: Microsoft Research Limited, St George House, Cambridge CB2 3NH, 
UK. http://www.research.microsoft. corn ema/l: cmbishopnicrosoft. corn 
396 D. Barber and C. M. Bishop 
pressed in terms of expectations with respect to the posterior distribution (Bishop 
1995). However, the corresponding integrals over weight space are analytically in- 
tractable. One well-established procedure for approximating these integrals, known 
as Laplace's method, is to approximate the posterior distribution by a Gaussian, 
centred at a mode of p(wlD), in which the covariance of the Gaussian is deter- 
mined by the local curvature of the posterior distribution (MacKay 1995). The 
required integrations can then be performed analytically. More recent approaches 
involve Markov chain Monte Carlo simulations to generate samples from the poste- 
rior (Neal 1996). However, such techniques can be computationally expensive, and 
they also suffer from the lack of a suitable convergence criterion. 
A thirl approach, called ensemble learning, was introduced by Hinton and van 
Camp (1993) and again involves finding a simple, analytically tractable, approxi- 
mation to the true posterior distribution. Unlike Laplace's method, however, the 
approximating distribution is fitted globally, rather than locally, by minimizing a 
Kullback-Leibler divergence. Hinton and van Camp (1993) showed that, in the case 
of a Gaussian approximating distribution with a diagonal covariance, a determin- 
istic learning algorithm could be derived. Although the approximating distribution 
is no longer constrained to coincide with a mode of the posterior, the assumption 
of a diagonal covariance prevents the model from capturing the (often very strong) 
posterior correlations between the parameters. MacKay (1995) suggested a modifi- 
cation to the algorithm by including linear preprocessing of the inputs to achieve a 
somewhat richer class of approximating distributions, although this was not imple- 
mented. In this paper we show that the ensemble learning approach can be extended 
to allow a Gaussian approximating distribution with an general covariance matrix., 
while still leading to a tractable algorithm. 
1.1 The Network Model 
We consider a two-layer feed-forward network having a single output whose value 
is given by 
H 
f(x, w) ---- E via(ui'x) (1) 
i----1 
where w is a k-dimensional vector representing all of the adaptive parameters in the 
model, x is the input vector, {ui),i - 1,..., H are the input-to-hidden weights, 
and (vi),i - 1,..., H are the hidden-to-output weights. The extension to multi- 
ple outputs is straightforward. For reasons of analytic tractability, we choose the 
sigmoidal hidden-unit activation function er(a) to be given by the error function 
er(a) = exp (-s2/2) ds (2) 
which (when appropriately scaled) is quantitatively very similar to the standard 
logistic sigmoid. Hidden unit biases are accounted for by appending the input 
vector with a node that is always unity. In the current implementation there are 
no output biases (and the output data is shifted to give zero mean), although the 
formalism is easily extended to include adaptive output biases (Barber and Bishop 
1997).-The data set consists of N pairs of input vectors and corresponding target 
output values D - {x',t '},/ -- 1,...,N. We make the standard assumption 
of Gaussian noise on the target values, with variance /-. The likelihood of the 
training data is then proportional to exp(-/ED), where the training error ED is 
1 
E(w) = w) - 2 . (3) 
Ensemble Lea ming for Multi-Layer Networks 397 
The prior distribution over weights is chosen to be a Gaussian of the form 
p(w) oc exp (-Ew(w)) 
(4) 
where Ew(w) = wTAw, and A is a matrix of hyperparameters. The treatment of 
/3 and A is dealt with in Section 2.1. From Bayes' theorem, the posterior distribution 
over weights can then be written 
1 
p(wlD)- 2 exp (-/3ED(W)- Ew(w)) 
(5) 
where Z is a normalizing constant. Network predictions on a novel example are 
given by the posterior average of the network output 
(f(x)) = f f(x, w)p(wID) dw. 
(6) 
This represents an integration over a high-dimensional space, weighted by a pos- 
terior distribution p(w[D) which is exponentially small except in narrow regions 
whose locations are unknown a-priori. The accurate evaluation of such integrals is 
thus very difficult. 
2 'Ensemble Learning 
Integrals of the form (6) may be tackled by approximating p(wID) by a simpler 
distribution Q(w). In this paper we choose this approximating distribution to be 
a Gaussian with mean W and covariance C. We determine the values of W and C 
by minimizing the Kullback-Leibler divergence between the network posterior and 
approximating Gaussian, given by 
[ Q(w) 
-- /Q(W)ln[p(-lD) } dw (7) 
= / Q(w)lnQ(w)dw - / Q(w)lnp(wlD)dw. (8) 
The first term in (8) is the negative entropy of a Gaussian distribution, and is easily 
evaluated to give  In det (C) + const. 
From (5) we see that the posterior dependent term in (8) contains two parts that 
depend on the prior and likelihood 
Q(w)Ew(w)dw + / Q(W)ED(W)dW. 
(9) 
Note that the normalization coefficient Z - in (5) gives rise to a constant additive 
term in the KL divergence and so can be neglected. The prior term E(w) is 
quadratic in w, and integrates to give Tr(CA) + wTAw. This leaves the data 
dependent term in (9) which we write as 
N 
L '- f Q(W)ED(w)dw -- - E/(x't) (10) 
2 
u=l 
where 
f Q(w)(i(x,w)f aw-,t f (11) 
398 D. Barber and C. M. Bishop 
For clarity, we concentrate only on the first term in (11), as the calculation of the 
term linear in f(x, w) is similar, though simpler. Writing the Gaussian integral 
over Q as an average, (), the first term of (11) becomes 
H 
E  
(12) 
where 
Oij ---- (Cvv - CvuCuu-lCuv)ij q- vivj (14) 
q'o = C---C-,,=iCv=j,-C-- -, (15) 
= -C - (16) 
'ij 2Cuu u,v=jVi. 
Although the remaining integration in (13) over u is not analytically tractable, we 
can make use of the following result to reduce it to a one-dimensional integration 
(a (z'a + ao) o' (z-b + bo))z = a(zlal + ao)a 
v/lal 2 (1 q- Ibl 2) - (aTb) 2 
(17) 
where a and b are vectors and ao, b0 are scalar offsets. The average on the left of 
(17) is over an isotropic multi-dimensional Gaussian, p(z) cr exp(-z'rz/2), while the 
average on the right is over the one-dimensional Gaussian p(z) cr exp(-z2/2). This 
result follows from the fact that the vector z only occurs through the scalar product 
with a and b, and so we can choose a coordinate system in which the first two 
components of z lie in the plane spanned by a and b. All orthogonal components 
do not appear elsewhere in the integrand, and therefore integrate to unity. 
The integral we desire, (13) is only a little more complicated than (17) and can be 
evaluated by first transforming the coordinate system to an isotopic basis z, and 
then differentiating with respect to elements of the covariance matrix to 'pull down' 
the required linear and quadratic terms in the a-independent pre-factor of (13). 
These derivatives can then be reduced to a form which requires only the numerical 
evaluation of (17). We have therefore succeeded in reducing the calculation of the 
KL divergence to analytic terms together with a single one-dimensional numerical 
integration of the form (17), which we compute using Gaussian quadrature . 
Similar techniques can be used to evaluate the derivatives of the KL divergence with 
respect to the mean and covariance matrix (Barber and Bishop 1997). Together with 
the KL divergence, these derivatives are then used in a scaled conjugate gradient 
optimizer to find the parameters W and C that represent the best Gaussian fit. 
The number of parameters in the covariance matrix scales quadratically with the 
number of weight parameters. We therefore have also implemented a version with 
Although (17) appears to depend on 4 parameters, it can be expressed in terms of 3 
independent parameters. An alternative to performing quadrature during training would 
therefore be to compute a 3-dimensional look-up table in advance. 
i,j=l 
To simplify the notation, we denote the set of input-to-hidden weights (u,..., UH) 
by u and the set of hidden-to-output weights, (V,...,VH) by v. Similarly, we 
partition the covariance matrix C into blocks, C, Cv, C, and C = C. As 
the components of v do not enter the non-linear sigrnoid functions, we can directly 
integrate over v, so that each term in the summation (12) gives 
((Oij+(U--H)Tqtij(U--H)+"i(U--H))a(uTxi)a(uTxJ)) (13) 
Ensemble Learning for Multi-Layer Networks 399 
Posterior Laplace fit Minimum KLD fit Minimum KL fit 
Figure 1: Laplace and minimum Kullback-Leibler Gaussian fits to the posterior. 
The Laplace method underestimates the local posterior mass by basing the covario 
ance matrix on the mode alone, and has KL value 41. The minimum Kullback- 
Leibler Gaussian fit with a diagonal covariance matrix (KLD) gives a KL value 
of 4.6, while the minimum Kullback-Leibler Gaussian with full covariance matrix 
achieves a value of 3.9. 
a constrained covariance matrix 
$ 
C = diag(d,... ,d2) + y. s/s/T (18) 
i=1 
which is the form of covariance used in factor analysis (Bishop 1997). This reduces 
the number of free parameters in the covariance matrix from k(k + 1)/2 to k(s + 1) 
(representing k(s + 1) - s(s - 1)/2 independent degrees of freedom) which is now 
linear in k. Thus, the number of parameters can be controlled by changing s and, 
unlike a diagonal covariance matrix, this model can still capture the strongest of the 
posterior correlations. The value of s should be as large as possible, subject only 
to computational cost limitations. There is no 'over-fitting' as s is increased since 
more flexible distributions Q(w) simply better approximate the true posterior. 
We illustrate the optimization of the KL divergence using a toy problem involving 
the posterior distribution for a two-parameter regression problem. Figure i shows 
the true posterior together with approximations obtained from Laplace's method, 
ensemble learning with a diagonal covariance Gaussian, and ensemble learning using 
an unconstrained Gaussian. 
2.1 Hyperparameter Adaptation 
So far, we have treated the hyperparameters as fixed. We now extend the ensemble 
learning formalism to include hyperparameters within the Bayesian framework. For 
simplicity, we consider a standard isotropic prior covariance matrix of the form 
A -- aI, and introduce hyperpriors given by Gamma distributions 
lnp(c) = ln{ca-exp(-)} +const (19) 
lnp(fi) = ln{fiC-lexp(-)}+ cOnst (20) 
400 D. Barber and C. M. Bishop 
where a, b, c, d are constants. 
hyperparameters is given by 
in which 
The joint posterior distribution of the weights and 
p (w, a, fi]D) cr p(D]w, fi)p(w]a)p(a)p(fi) 
(21) 
N 
lnp(DIw, fi) = -tiEr) + -lnfi + const (22) 
k 
lnp(wla ) = -etlwl 2 + lnet +const (23) 
We follow MacKay (1995) by modelling the joint posterior p (w, a, fiID) by a fac- 
torized approximating distribution of the form 
Q(w)R(et)$(fi) (24) 
where Q(w) is a Gaussian distribution as before, and the functional forms of R and 
$ are left unspecified. We then minimize the KL divergence 
.T [Q, R, S] = /Q(w)R(a)S(fi)ln { Q(w)R(a)S(fi) } dw da dfi. (25) 
p(w, et, ID) 
Consider first the dependence of (25) on Q(w) 
f[Q] = - Q(w)R(et)$(fi) {-flED(w) - lwl 2 - In Q(w)} + const (26) 
- -fQ(w){-E>(w)--lwl2-1nQ(w)}+const (27) 
where 5 = f R(a)ada and  = f $(fi)fidfi. We see that (27) has the form of (8), 
except that the fixed hyperparameters are now replaced with their average values. 
To calculate these averages, consider the dependence of the functional f on 
I 
.FIR] - - Q(w)R(et)S(fi) - Iwl   lnet(a- 1)lnet--- 
: - fR(a){a + (r- 1)lna-lnR(a)} det +const (28) 
8 
 TrC+ 1/b. We recognise (28) as the Kullback- 
where r =  + a and 1/s = lwl 2 + 
Leibler divergence between R(a) and a Gamma distribution. Thus the optimum 
R(a) is also Gamma distributed 
R(a) (x ar- exp (-7) . (29) 
We therefore obtain 5 = rs. 
A similar procedure for S(fi) gives j= uv, where u =  +c and 1/v = (Ez))+ i/d, 
in which (ED) has already been calculated during the optimization of Q(w). 
This defines an iterative procedure in which we start by initializing the hyperparam- 
eters (using the mean of the hyperprior distributions) and then alternately optimize 
the KL divergence over Q(w) and re-estimate 5 and fl. 
3 Results and Discussion 
As a preliminary test of our method on a standard benchmark problem, we ap- 
plied the minimum KL procedure to the Boston Housing dataset. This is a one 
Ensemble Lea ming for Multi-Layer Networks 401 
Method I Test Error 
Ensemble (s = 1) 0.22 
Ensemble (diagonal) 0.28 
Laplace 0.33 
Table 1: Comparison of ensemble learning with Laplace's method. The test error 
is defined to be the mean squared error over the test set of 378 examples. 
dimensional regression problem, with 13 inputs, in which the data for 128 training 
examples was obtained from the DELVE archive 2. We trained a network of four 
hidden units, with covariance matrix given by (18) with s = 1, and specified broad 
hyperpriors on ( and/ (a = 0.25, b = 400, c = 0.05, and d = 2000). Predictions are 
made by evaluating the integral in (6). This integration can be done analytically 
as a consequence of the form of the sigmoid function given in (2). 
We compared the performance of the KL method against the Laplace framework of 
MacKay (1995) which also treats hyperparameters through a re-estimation proce- 
dure. In addition we also evaluated the performance of the ensemble method using 
a diagonal covariance matrix. Our results are summarized in Table 1. 
Acknowledgements 
We would like to thank Chris Williams for helpful discussions. Supported by EPSRC 
grant GR/J75425: Novel Developments in Learning Theory for Neural Networks. 
References 
Barber, D. and C. M. Bishop (1997). On computing the KL divergence for 
Bayesian neural networks. Technical report, Neural Computing Research 
Group, Aston University, Birmingham, U.K. 
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford Univer- 
sity Press. 
Bishop, C. M. (1997). Latent variables, mixture distributions and topographic 
mappings. Technical report, Aston University. To appear in Proceedings of 
the NATO Advanced Study Institute on Learning in Graphical Models, Erice. 
Hinton, G. E. and D. van Camp (1993). Keeping neural networks simple by 
minimizing the description length of the weights. In Proceedings of the Sixth 
Annual Conference on Computational Learning Theory, pp. 5-13. 
MacKay, D. J. C. (1992). A practical Bayesian framework for back-propagation 
networks. Neural Computation , (3), 448-472. 
MacKay, D. J. C. (1995). Developments in probabilistic modelling with neural 
networks ensemble learning. In Neural Networks: Artificial Intelligence and 
Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural 
Networks, Nijmegen, Netherlands, 1,-15 September 1995, Berlin, pp. 191-198. 
Springer. 
MacKay, D. J. C. (1995). Probable networks and plausible predictions - a re- 
view of practical Bayesian methods for supervised neural networks. Network: 
Computation in Neural Systems 6(3), 469-505. 
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. Lecture 
Notes in Statistics 118. 
See hp://www.cs.uorono.ca/~delve/ 
