Selecting weighting factors in logarithmic 
opinion pools 
Tom Heskes 
Foundation for Neural Networks, University of Nijmegen 
Geert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands 
tom@mbfys.kun.nl 
Abstract 
A simple linear averaging of the outputs of several networks as 
e.g. in bagging [3], seems to follow naturally from a bias/variance 
decomposition of the sum-squared error. The sum-squared error of 
the average model is a quadratic function of the weighting factors 
assigned to the networks in the ensemble [7], suggesting a quadratic 
programming algorithm for finding the "optimal" weighting factors. 
If we interpret the output of a network as a probability statement, 
the sum-squared error corresponds to minus the loglikelihood or 
the Kullback-Leibler divergence, and linear averaging of the out- 
puts to logarithmic averaging of the probability statements: the 
logarithmic opinion pool. 
The crux of this paper is that this whole story about model aver- 
aging, bias/variance decompositions, and quadratic programming 
to find the optimal weighting factors, is not specific for the sum- 
squared error, but applies to the combination of probability state- 
ments of any kind in a logarithmic opinion pool, as long as the 
Kullback-Leibler divergence plays the role of the error measure. As 
examples we treat model averaging for classification models under 
a cross-entropy error measure and models for estimating variances. 
I INTRODUCTION 
In many simulation studies it has been shown that combining the outputs of several 
trained neural networks yields better results than relying on a single model. For 
regression problems, the most obvious combination seems to be a simple linear 
Selecting Weighting Factors in Logarithmic Opinion Pools 267 
averaging of the network outputs. From a bias/variance decomposition of the sum- 
squared error it follows that the error of the so obtained average model is always 
smaller or equal than the average error of the individual models. In [7] simple linear 
averaging is generalized to weighted linear averaging, with different weighting factors 
for the different networks in the ensemble. A slightly more involved bias/variance 
decomposition suggests a rather straightforward procedure for finding "optimal" 
weighting factors. 
Minimizing the sum-squared error is equivalent to maximizing the loglikelihood of 
the training data under the assumption that a network output can be interpreted 
as an estimate of the mean of a Gaussian distribution with fixed variance. In 
these probabilistic terms, a linear averaging of network outputs corresponds to a 
logarithmic rather than linear averaging of probability statements. 
In this paper, we generalize the regression case to the combination of probability 
statements of any kind. Using the Kullback-Leibler divergence as the error mea- 
sure, we naturally arrive at the so-called logarithmic opinion pool. A bias/variance 
decomposition similar to the one for sum-squared error then leads to an objective 
method for selecting weighting factors. 
Selecting weighting factors in any combination of probability statements is known 
to be a difficult problem for which several suggestions have been made. These 
suggestions range from rather involved supra-Bayesian methods to simple heuristics 
(see e.g. [1, 6] and references therein). The method that follows from our analysis 
is probably somewhere in the middle: easier to compute than the supra-Bayesian 
methods and more elegant than simple heuristics. 
To stress the generality of our results, the presentation in the next section will be 
rather formal. Some examples will be given in Section 3. Section 4 discusses how 
the theory can be transformed into a practical procedure. 
2 LOGARITHMIC OPINION POOLS 
Let us consider the general problem of building a probability model of a variable y 
given a particular input x. The "output" y may be continuous, as for example in 
regression analysis, or discrete, as for example in classification. In the latter case 
integrals over y should be replaced by summations over all possible values of y. 
Both x and y may be vectors of several elements; the one-dimensional notation is 
chosen for convenience. We suppose that there is a "true" conditional probability 
model q(ylx) and have a whole ensemble (also called pool or committee) of experts, 
each supplying a probability model p,(ylx). p(x) is the unconditional probability 
distribution of inputs. An unsupervised scenario, as for example treated in [8], is 
obtained if we simply neglect the inputs x or consider them constant. 
We define the distance between the true probability q(ylx) and an estimate p(ylx) 
to be the Kullback-Leibler divergence 
K(q,p) -- - J dx p(x) J dy q(y]x) log [p(y]x)' 
[q(Yl x) 
If the densities p(x) and q(y]x) correspond to a data set containing a finite number 
P of combinations {x ', y' }, minus the Kullback divergence is, up to an irrelevant 
268 T. Heskes 
constant, equivalent to the loglikelihood defined as 
1 E log p(y' Ix ') 
L(p, -- p . 
The more formal use of the Kullback-Leibter divergence instead of the loglikelihood 
is convenient in the derivations that follow. 
Weighting factors w are introduced to indicate the reliability of each of the experts 
a. In the following we will work with the constraints Y' w = 1, which is used in 
some of the proofs, and w _> 0 for all experts a, which is not strictly necessary, but 
makes it easier to interpret the weighting factors and helps to prevent overfitting 
when weighting factors are optimized (see details below). 
We define the average model 15(y]x) to be the one that is closest to the given set of 
models: 
(ylx) = argmin  w,K(p,p,). 
P(Yl x) 
Introducing a Lagrange multiplier for the constraint f dxp(ylx ) = 1, we immediately 
find the solution 
1 
(Ylx): Z(x)1 I[p(ylx)lw ' (1) 
with normalization constant 
Z(x): f dy . (2) 
o 
This is the logarithmic opinion pool, to be contrasted with the linear opinion pool, 
which is a linear averageof the probabilities. In fact, logarithmic opinion pools have 
been proposed to overcome some of the weaknesses of the linear opinion pool. For 
example, the logarithmic opinion pool is "externally Bayesian", i.e., can be derived 
from joint probabilities using Bayes' rule [2]. A drawback of the logarithmic opinion 
pool is that if any of the experts assigns probability zero to a particular outcome, 
the complete pool assigns probability zero, no matter what the other experts claim. 
This property of the logarithmic opinion pool, however, is only a drawback if the 
individual density functions are not carefully estimated. The main problem for both 
linear and logarithmic opinion pools is how to choose the weighting factors w. 
The Kullback-Leibler divergence of the opinion pool 15(y]x) can be decomposed into 
a term containing the Kullback-Leibler divergences of individual models and an 
"ambiguity" term: 
K(q,15) = E w,K(q,p,) - E w,K(15, p,) -- E- A . 
Proof: The first term in (3) follows immediately from the numerator in (1), the 
second term is minus the logarithm of the normalization constant Z(x) in (2) which 
can, using (1), be rewritten as 
A=/dxp(x)log[Z(z)]= fdxp(x)log [ ] 
(Vlx) 
Selecting Weighting Factors in Logarithmic Opinion Pools 269 
for any choice of y' for which 15(y' I z) is nonzero. Integration over y' with probability 
measure 5(y'lx ) then yields (3). 
Since the ambiguity A is always larger than or equal to zero, we conclude that the 
Kullback-Leibler divergence of the logarithmic opinion pool is never larger than the 
average Kullback-Leibler divergences of individual experts. The larger the ambigu- 
ity, the larger the benefit of combining the experts' probability assessments. Note 
that by using Jensen's inequality, it is also possible to show that the Kullback-Leibler 
divergence of the linear opinion pool is smaller or equal to the average Kullback- 
Leibler divergences of individual experts. The expression for the ambiguity, defined 
as the difference between these two, is much more involved and more difficult to 
interpret (see e.g. [10]). 
The ambiguity of the logarithmic opinion pool depends on the weighting factors 
we,, not only directly as expressed in (3), but also through p(ylx). We can make 
this dependency somewhat more explicit by writing 
1 
1EwawfiK(p a )+ Ewa[K(15, pa)-K(pa,p)] (4) 
.4 = . 
Proof: Equation (3) is valid for any choice of q(ylx). Substitute q(ylx) - pt3(ylx), 
multiply left- and righthand side by w, and sum over fl. Simple manipulation of 
terms than yields the result. 
Alas, the Kullback-Leibler divergence is not necessarily symmetric, i.e., in general 
K(pl,p2)  K(p2,pl). However, the difference K(pl,p2)- K(p2,pl) is an order 
of magnitude smaller than the divergence K(p,p2) itself. More formally, writing 
pl (y[x) = [1 +e(y[x)]p(y[x) with e(y[x) small, we can easily show that K(p, p2) is of 
order (some integral over) e:(yl x) whereas K(pl,p2)- K(p:,p) is of order ea(ylx). 
Therefore, if we have reason to assume that the different models are reasonably 
close together, we can, in a first approximation, and will, to make things tractable, 
neglect the second term in (4) to arrive at 
1 
K(q,p)  E waK(q,pa) -  E waw [K(pa,p) q- K(p,pa)] . (5) 
The righthand side of this expression is quadratic in the weighting factors w, a 
property which will be very convenient later on. 
3 EXAMPLES 
Regression. The usual assumption in regression analysis is that the output func- 
tionally depends on the input x, but is blurred by Gaussian noise with standard 
deviation a. In other words, the probability model of an expert a can be written 
-(y- L(x)) 
p(ylx) = 2-a exp 5 5 j (6) 
The function f(x) corresponds to the network's estimate of the "true" regression 
given input x. The logarithmic opinion pool (1) also leads to a normal distribution 
with the same standard deviation a and with regression estimate 
270 T. Heskes 
In this case the Kullback-Leibler divergence 
K(p, p) = 2 2 
is symmetric, which makes (5) exact instead of an approximation. In [7], this has 
all been derived starting from a sum-squared error measure. 
Variance estimation. There has been some recent interest in using neural net- 
works not only to estimate the mean of the target distribution, but also its variance 
(see e.g. [9] and references therein). In fact, one can use the probability density (6) 
with input-dependent (). We will consider the simpler situation in which an 
input-dependent model is fitted to residuals y, after a regression model has been 
fitted to estimate the mean (see also [5]). The probability model of expert a can 
be written 
2 ' 
where 1/z(x) is the experts' estimate of the residual variance given input x. The 
logarithmic opinion pool is of the same form with z(x) replaced by 
Here the Kullback-Leibler divergence 
t,7 (i6, p) =  dx p(x) log 1 
z) z(x) 
is asymmetric. We can use (3) to write the Kullback-Leibler divergence of the 
opinion pool explicitly in terms of the weighting factors w. The approximation (5), 
with 
 f [()- ()]2 
c(p.,p) + c(p,p) =  d p(x) (x)() , 
is much more appealing and easier to handle. 
Classification. In a two-class classification problem, we can treat y as a discrete 
variable having two possible realizations, e.g., y  {-1, 1}. A convenient represen- 
tation for a properly normalized probability distribution is 
1 
P(YlX) = 1 + exp[-2h(x)y] ' 
In this logistic representation, the logarithmic opinion pool has the same form with 
= 
o 
The Kullback-Leibler divergence is asymmetric, but yields the simpler form 
K(p,pt3 ) + K(pt3,p ) - f dx p(x) [tanh(h(x)) - tanh(h(x))][h(x) - h(x)], 
to be used in the approximation (5). For a finite set of patterns, minus the loglike- 
lihood yields the well-known cross-entropy error. 
Selecting Weighting Factors in Logarithmic Opinion Pools 271 
The probability models in these three examples are part of the exponential family. 
The mean f,,, inverse variance z, and logit h,, are the canonical parameters. It is 
straightforward to show that, with constant dispersion across the various experts, 
the canonical parameter of the logarithmic opinion pool is always a weighted average 
of the canonical parameters of the individual experts. Slightly more complicated 
expressions arise when the experts are allowed to have different estimates for the 
dispersion or for probability models that do not belong to the exponential family. 
4 SELECTING WEIGHTING FACTORS 
The decomposition (3) and approximation (5) suggest an objective method for 
selecting weighting factors in logarithmic opinion pools. We will sketch this method 
for an ensemble of models belonging to the same class, say feedforward neural 
networks with a fixed number of hidden units, where each model is optimized on a 
different bootstrap replicate of the available data set. 
Suppose that we have available a data set consisting of P combinations {x 
As suggested in [3], we construct different models by training them on different 
bootstrap replicates of the available data set. Optimizing nonlinear models is often 
an unstable process: small differences in initial parameter settings or two almost 
equivalent bootstrap replicates can result in completely different models. Neural 
networks, for example, are notorious for local minima and plateaus in weight space 
where models might get stuck. Therefore, the incorporation of weighting factors, 
even when models are constructed using the same procedure, can yield a better gen- 
eralizing opinion pool. In [4] good results have been reported on several regression 
problems. Balancing clearly outperformed bagging, which corresponds to w,, = 1In 
with n the number of experts, and bumping, which proposes to keep a single expert. 
Each example in the available data set can be viewed as a realization of an unknown 
probability density characterized by p(x) and q(ylz). We would like to choose the 
weighting factors w,, such as to minimize the Kullback-Leibler divergence K(q, I5) of 
the opinion pool. If we accept the approximation (5), we can compute the optimal 
weighting factors once we know the individual Kullbacks K(q, p,) and the Kullbacks 
between different models K(p,,p). Of course, both q(ylx) and p(x) are unknown, 
and thus we have to settle for estimates. 
In an estimate for K(p,,p) we can simply replace the average over p(x) by an 
average over all inputs x t' observed in the data set: 
q- K(p/3,pc,) dy[P(Ylx) - 
A similar straightforward replacement for q(yl x) in an estimate for K(q,p,) is bi- 
ased, since each expert has, at least to some extent, been overfitted on the data set. 
In [4] we suggest how to remove this bias for regression models minimizing sum- 
squared errors. Similar compensations can be found for other probability models. 
Having estimates for both the individual Kullback-Leibler divergences K (q, p) and 
the cross terms K(p, p), we can optimize for the weighting factors w. Under the 
constraints y' w -- i and w >_ 0 the approximation (5) leads to a quadratic pro- 
gramming problem. Without this approximation, optimizing the weighting factors 
becomes a nasty exercise in nonlinear programming. 
272 T. Heskes 
The solution of the quadratic programming problem usually ends up at the edge of 
the unit cube with many weighting factors equal to zero. On the one hand, this is 
a beneficial property, since it implies that we only have to keep a relatively small 
number of models for later processing. On the other hand, the obtained weight- 
ing factors may depend too strongly on our estimates of the individual Kullbacks 
K(q,p,). The following version prohibits this type of overfitting. Using simple 
statistics, we obtain a rough indication for the accuracy of our estimates K(q, p,). 
This we use to generate several, say on the order of 20, different samples with 
estimates {K(q,pl),...,K(q,p,)). For each of these samples we solve the corre- 
sponding quadratic programming problem and obtain a set of weighting factors. 
The final weighting factors are obtained by averaging. In the end, there are less 
experts with zero weighting factors, at the advantage of a more robust procedure. 
Acknowledgement s 
I would like to thank David Tax, Bert Kappen, Pi[rre van de Lear, Wim Wiegerinck, 
and the anonymous referees for helpful suggestions. This research was supported 
by the Technology Foundation STW, applied science division of NWO and the 
technology programme of the Ministry of Economic Affairs. 
References 
[1] J. Benediktsson and P. Swain. Consensus theoretic classification methods. 
IEEE Transactions on Systems, Man, and Cybernetics, 22:688-704, 1992. 
[2] R. Bordley. A multiplicative formula for aggregating probability assessments. 
Management Science, 28:1137-1148, 1982. 
[3] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996. 
[4] 
T. Heskes. Balancing between bagging and bumping. In M. Mozer, M. Jordan, 
and T. Petsche, editors, Advances in Neural Information Processing Systems 
9, pages 466-472, Cambridge, 1997. MIT Press. 
[5] 
T. Heskes. Practical confidence and prediction intervals. In M. Mozer, M. Jor- 
dan, and T. Petsche, editors, Advances in Neural Information Processing Sys- 
tems 9, pages 176-182, Cambridge, 1997. MIT Press. 
[6] R. Jacobs. Methods for combining experts' probability assessments. Neural 
Computation, 7:867-888, 1995. 
[7] 
A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and 
active learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances 
in Neural Information Processing Systems 7, pages 231-238, Cambridge, 1995. 
MIT Press. 
[8] P. Smyth and D. Wolpert. Stacked density estimation. These proceedings, 1998. 
[9] P. Williams. Using neural networks to model conditional multivariate densities. 
Neural Computation, 8:843-854, 1996. 
[10] D. Wolpert. On bias plus variance. Neural Computation, 9:1211-1243, 1997. 
