Efficient Approaches to Gaussian Process 
Classification 
Lehel Csat6, Ernest Fokou, Manfred Opper, Bernhard Schottky 
Neural Computing Research Group 
School of Engineering and Applied Sciences 
Aston University Birmingham B4 7ET, UK. 
{opperm, csat ol}aston. ac. uk 
Ole Winther 
Theoretical Physics II, Lund University, S51vegatan 14 A, 
S-223 62 Lund, Sweden 
wintherthep. lu. se 
Abstract 
We present three simple approximations for the calculation of 
the posterior mean in Gaussian Process classification. The first 
two methods are related to mean field ideas known in Statistical 
Physics. The third approach is based on Bayesian online approach 
which was motivated by recent results in the Statistical Mechanics 
of Neural Networks. We present simulation results showing: 1. that 
the mean field Bayesian evidence may be used for hyperparameter 
tuning and 2. that the online approach may achieve a low training 
error fast. 
I Introduction 
Gaussian processes provide promising non-parametric Bayesian approaches to re- 
gression and classification [2, 1]. In these statistical models, it is assumed that the 
likelihood of an output or target variable y for a given input x  R N can be written 
as P(yla(x)) where a: R S -- R are functions which have a Gaussian prior distri- 
bution, i.e. a is (a priori) assumed to be a Gaussian random field. This means that 
any finite set of field variables a(xi), i - 1,...,l are jointly Gaussian distributed 
with a given covariance E[a(xi)a(xj)] - K(xi,xj) (we will also assume a zero mean 
throughout the paper). 
Predictions on a(x) for novel inputs x, when a set D of m training examples (xi, yi) 
i - 1,..., m, is given, can be computed from the posterior distribution of the m + 1 
variables a(x) and a(xl),...,a(x,). A major technical problem of the Gaussian 
process models is the difficulty of computing posterior averages as high dimensional 
integrals, when the likelihood is not Gaussian. This happens for example in classifi- 
cation problems. So far, a variety of approximation techniques have been discussed: 
Monte Carlo sampling [2], the MAP approach [4], bounds on the likelihood [3] and 
a TAP mean field approach [5]. In this paper, we will introduce three different novel 
methods for approximating the posterior mean of the random field a(x), which we 
think are simple enough to be used in practical applications. Two of the techniques 
252 L. Csat6, E. Fokoud, M. Opper, B. Schottky and O. lt4nther 
are based on mean field ideas from Statistical Mechanics, which in contrast to the 
previously developed TAP approach are easier to implement. They also yield simple 
approximations to the total likelihood of the data (the evidence) which can be used 
to tune the hyperparameters in the covariance kernel K (The Bayesian evidence (or 
MLII) framework aims at maximizing the likelihood of the data). 
We specialize to the case of a binary classification problem, where for simplicity, 
the class label y - -+-1 is assumed to be noise free and the likelihood is chosen as 
P(y[a) = O(ya) , (1) 
where (x) is the unit step function, which equals 1 for x > 0 and zero else. We 
are interested in computing efficient approximations to the posterior mean (a(x)), 
which we will use for a prediction of the labels via y = sign(a(x)), where (...) 
denotes the posterior expectation. If the posterior distribution of a(x) is symmetric 
around its mean, this will give the Bayes optimal prediction. 
Before starting, let us add two comments on the likelihood (1). First, the MAP 
approach (i.e. predicting with the fields a that maximize the posterior) would not 
be applicable, because it gives the trivial result a(x) = 0. Second, noise can be 
easily introduced within a probit model [2], all subsequent calculations will only be 
slightly altered. Moreover, the Gaussian average involved in the definition of the 
probit likelihood can always be shifted from the likelihood into the Gaussian process 
prior, by a redefinition of the fields a (which does not change the prediction), leaving 
us with the simple likelihood (1) and a modified process covariance [5]. 
2 Exact Results 
At first glance, it may seem that in order to calculate (a(x)/ we have to deal with 
the joint posterior of the fields ai = a(xi), i - 1,..., m together with the field at 
the test point a(x). This would imply that for any test point, a different new m + 1 
dimensional average has to be performed. Actually, we will show that this is not the 
case. As above let E denote the expectation over the Gaussian prior. The posterior 
expectation at any point, say x 
E [a(x)1-Ij P(yjlay)] 
(x)) = (2) 
can by integration by parts-for any likelihood-be written as 
( 01nP(Yjlaj) I (3) 
(a(x)) = y. K(x, xj)ojyj and oj '- yj Oaj 
showing that cj is not dependent on the test point x. It is therefore not necessary 
to compute a m + 1 dimensional average for every prediction. 
We have chosen the specific definition (3) in order to stress the similarity to pre- 
dictions with Support Vector Machines (for the likelihood (1), the aj will come 
out nonnegative). In the next sections we will develop three approaches for an 
approximate computation of the aj. 
3 Mean Field Method I: Ensemble Learning 
Our first goal is to approximate the true posterior distribution 
m 
P(alD") = Z V/(2vr)- detK e 
(4) 
Efficient Approaches to Gaussian Process Classification 253 
of a =' (a,. .. ,am) by a simpler, tractable distribution q. Here, K denotes the 
covariance matrix with elements Kij '- K(xi, xj). In the variational mean field 
approach-known as ensemble learning in the Neural Computation Community,- 
the relative entropy distance KL(q,p) = f da q(a)In - is minimized in the family 
m 
of product distributions q(a) = Hj=x qj(aj). This is in contrast to [3], where a 
variational bound on the likelihood is computed. We get 
KL(q,p) = E / daiqi(ai)ln qi(ai) + 
 P(yilai) 
$ 
1 1 (ai)o 
+ 
i,j,ij i 
where (...)0 denotes expectation w.r.t. q. By setting the functional derivative 
of KL(q,p) with respect to qi(a) equal to zero, we find that the best product 
distribution is a Gaussian prior times the original Likelihood: 
1 (-"i) 2 
qi(a) cr P(yila) e  , (5) 
where rni = -Ai j,jvi(K-x)ij(aj)o and Ai = [K-x]/. Using this specific form 
for the approximated posterior q(a), replacing the average over the true posterior in 
(3) by the approximation (5), we get (using the likelihood (1)) a set of rn nonlinear 
equations in the unknowns cj: 
(01nP(yjlaj)) 1 D() 
o- 
and rnj - E KjiYiti - jyjotj , 
i 
(6) 
where D(z) - e-Z2/2/x/- and (z) = f_Zoodt D(t). As a useful byproduct of 
the variational approximation, an upper bound on the Bayesian evidence P(D) = 
f da ,r(a)P(Dla ) can be derived. (,r denotes the Gaussian process prior and 
m 
P(Dla) = 1-Ij= P(Yjlaj))  The bound can be written in terms of the mean field 
'free energy' as 
-lnP(D) <_ Sqlnq(a) - Eqln[vr(a)P(Dla)] 
_Eln  [ rnj \ 1 
+ 2. (7) 
i 1 
+  In der K -  E In hi 
i 
which can be used as a yardstick for selecting appropriate hyperparameters in the 
covariance kernel. 
The ensemble learning approach has the little drawback, that it requires inversion 
of the covariance matrix K and, for the free energy (7) one must compute a deter- 
minant. A second, simpler approximation avoids these computations. 
4 Mean Field Theory II: A 'Naive' Approach 
The second mean field theory aims at working directly with the variables otj. As a 
starting point, we consider the partition function (evidence), 
m 
Z = P(D) = / dze -"T<" H P(yjlzj), (8) 
j=l 
254 L. Csat6, E. Fokoud, 3/1. Opper, B. Schottky and O. Winther 
which follows from (4) by a standard Gaussian integration, introducing the Fourier 
eaeiazP(y a) with i being the imaginary 
transform of the Likelihood 15(ylz) - f  
unit. It is tempting to view (8) as a normalizing partition function for a Gaussian 
process zi having covariance matrix K -x and likelihood/5. Unfortunately, /5 is 
not a real number and precludes a proper probabilistic interpretation. Neverthe- 
less, dealing formally with the complex measure defined by (8), integration by parts 
shows that one has yjaj = -i<zj>,, where the brackets (...), denote a average over 
the complex measure. This suggests a simple approximation for calculating the 
aj. One may think of trying a saddle-point (or steepest descent) approximation 
to (8) and replace <zj), by the value of zj (in the complex z plane) which makes 
the integrand stationary thereby neglecting the fluctuations of the zj. Hence, this 
approximation would treat expectations of products as (zizj), as (zi), Zj),, which 
may be reasonable for i  j, but definitely not for the self-correlation i - j. Ac- 
cording to the general formalism of mean field theories (outlined e.g. in [6]), one 
2 separately. This can 
can improve on that idea, by treating the 'self-interactions' z i 
be done by replacing all zi (except in the form z) by a new variable/i by inserting 
a Dirac 5 function representation 5(z -/) - f into (8) and integrate 
over the z and a variables exactly (the integral factorizes), and finally perform a 
saddle-point integration over the m and/ variables. The details of this calculation 
will be given elsewhere. Within the saddle-point approximation, we get the system 
of nonlinear equations 
I  and mj--  Kji(-ii)-  KjiYiOq (9) 
cj -- -iyj/j --  (yj m  
V/- ! i,ij i,ij 
which is of the same form as (6) with Aj replaced by the simpler Kjj. These 
equations have also been derived by us in [5] using a Callen identity, but our present 
derivation allows also for an approximation to the evidence. By plugging the saddle- 
point values back into the partition function, we get 
-lnP(D)  - ln Yi +  y.yici(Ki -SijKii)yjcj 
 j 
which is also simpler to compute than (7) but does not give a bound on the true 
evidence. 
5 A sequential Approach 
Both previous algorithms do not give an explicit expression for the posterior mean, 
but require the solution of a set of nonlinear equations. These must be obtained 
by an iterative procedure. We now present a different approach for an approximate 
computation of the posterior mean, which is based on a single sequential sweep 
through the whole dataset giving an explicit update of the posterior. 
The algorithm is based on a recently proposed Bayesian approach to online learning 
(see [8] and the articles of Opper and Winther& Solla in [9]). Its basic idea applied 
to the Gaussian process scenario, is as follows: Suppose, that qt is a Gaussian 
approximation to the posterior after having seen t examples. This means that we 
approximate the posterior process by a Gaussian process with mean <a(x))t and 
covariance Kt(x,y), starting with (a(x)>0 - 0 and K0(x,y) = K(x,y). After a 
new data point Yt+l is observed, the posterior is updated according to Bayes rule. 
The new non-Gaussian posterior {t is projected back into the family of Gaussians 
by choosing the closest Gaussian qt+l minimizing the relative entropy KL({t, qt+l) 
Efficient Approaches to Gaussian Process Classifican'on 255 
in order to keep the loss of information small. This projection is equivalent to a 
matching of the first two moments of t and qt+l. E.g., for the first moment we get 
= (x) 
(p(yt+lla(xt+l))>t -- (a(x)>t q- l(t)Kt(x, xt+l) 
where the second line follows again from an integration by parts and n(t) : 
= y+<(+)> 
'+ '(*') with zt and a 2(t) : Kt (xt+, xt+). This recursion and 
7 (,,) (t) 
the corresponding one for Kt can be solved by the ansatz 
= 
j=l 
= + 
i,j 
where the vector a(t) = (a,..., at, 0, 0,...) and the matrix C(t) (which has also 
only t x t nonzero elements) are updated as 
a(t+l) = a(t)+(t)(C(t)k++e+)y 
C(t + 1) = C(t) + 2(t)(C(t)kt+ + e+)(C(t)kt+ + e+)  (12) 
where 2(t) =  (z) -(,)} , k is the vector with elements Kts, 
j = 1 ..., t and  denotes the element-wise product between vectors. The sequen- 
tial algorithm defined by (10)-(12) has the advantage of not requiring any matrix 
inversions. There is also no need to solve a numerical optimization problem at each 
time as in the approach of [11] where a different update of a Gaussian posterior ap- 
proximation was proposed. Since we do not require a linearization of the likelihood, 
the method is not equivalent to the extended Kalman Filter approach. 
Since it is possible to compute the evidence of the new datapoint P(yt+) = 
based on the old posterior, we can compute a further approximation 
to the log evidence for m data via In P(D) - 
6 Simulations 
We present two sets of simulations for the mean field approaches. In the first, we test 
the Bayesian evidence framework for tuning the hyperparameters of the covariance 
function (kernel). In the second, we test the ability of the sequential approach to 
achieve low training error and a stable test error for fixed hyperparameters. 
For the evidence framework, we give simulation results for both mean field free 
energies (7) and (10) on a single data set, 'Pima Indian Diabetes (with 200/332 
training/test-examples and input dimensionality d = 7) [7]. The results should 
therefore not be taken as a conclusive evidence for the merits of these approaches, 
but simply as an indication that they may give reasonable results. We use the 
( 1EldWl(X l --X) 2) A 
radial basis function covariance function K(x, x ) = exp -3  
diagonal term v is added to the covariance matrix corresponding to a Gaussian noise 
added to the fields with variance v [5]. The free energy, -lnP(D) is minimized 
by gradient descent with respect to v and the lengthscale parameters wx,..., wd 
and the mean field equations for cj are solved by iteration before each update of 
the hyperparameters (further details will be given elsewhere). Figure I shows the 
evolution of the naive mean free energy and the test error starting from uniform 
256 L. Csat6, E. Fokou, M. Opper, B. Schottky and O. Winther 
ws. It typically requires of the order of 10 iteration steps of the cj-equations 
between each hyperparameter update. We also used hybrid approaches, where the 
free energy was minimized by one mean field algorithm and the hyperparameters 
used in the other. As it may be seen from table 1, the naive mean field theory can 
overestimate the free energy (since the ensemble free energy is an upper bound to 
the free energy). The overestimation is not nearly as severe at the minimum of the 
naive mean field free energy. Another interesting observation is that as long as the 
same hyperparameters are used the actual performance (as measured by the test 
error) is not very sensitive to the algorithm used. This also seems to be the case 
for the TAP mean field approach and Support Vector Machines [5]. 
115 74 
70 
' 66 
62 
0 20 40 60 80 0 20 40 60 80 
Iterations Iterations 
Figure 1: Hyperparameter optimization for the Pima Indians data set using the 
naive mean field free energy. Left figure: The free energy as a function of the 
number of hyperparameter updates. Right figure: The test error count (out of 332) 
as a function of the number of hyperparameter updates. 
'114 ,, 
e-  
I , 
-113 , 
e- 
'"112 
111 
Table 1: Pima Indians dataset. Hyperparameters found by free energy minimiza- 
tion. Left column gives the free energy - In P(D) used in hyperparameter optimiza- 
tion. Test error counts in range 63- 75 have previously been reported [5] 
Ensemble MF Naive MF 
Free Energy minimization Error - In P(D) Error - In P(D) 
Ensemble Mean Field, eq. (7) 72 100.6 70 183.2 
Naive Mean Field, eq. (10) 62 107.0 62 110.9 
For the sequential algorithm, we have studied the sonar [10] and crab [7] datasets. 
Since we have not computed an approximation to the evidence so far, a simple fixed 
polynomial kernel was used. Although a probabilistic justification of the algorithm 
is only valid, when a single sweep through the data is used (the independence of the 
data is assumed), it is tempting to reuse the same data and iterate the procedure as a 
heuristic. The two plots show that in this way, only a small improvement is obtained, 
and it seems that the method is rather efficient in extracting the information from 
the data in a single presentation. For the sonar dataset, a single sweep is enough 
to achieve zero training error. 
Acknowledgements: BS would like to thank the Leverhulme Trust for their support 
(F/250/K). The work was also supported by EPSRC Grant GR/L52093. 
Efficient Approaches to Gaussian Process Classification 257 
5O 
45 
40 
35 
30 
25 
20 
15 
10 
5 
0 
0 
I -- Training Error 
[ - - Test Error 
20 40 60 80 1 O0 120 140 160 
Iteration # 
0 50 1 O0 150 200 
Iteration # 
Figure 2: Training and test errors during learning for the sonar (left) and crab 
dataset (right). The vertical dash-dotted line marks the end of the training set 
and the starting point of reusing of it. The kernel function used is K(x, x ) = 
(1 + x. x/m) k with order k = 2 (m is the dimension of inputs). 
References 
[1] Williams C.K.I. and Rasmussen C.E., Gaussian Processes for Regression, in Neural 
Information Processing Systems 8, Touretzky D.S, Mozer M.C. and Hasselmo M.E. 
(eds.), 514-520, MIT Press (1996). 
[2] Neal R.M, Monte Carlo Implementation of Gaussian Process Models for Bayesian 
Regression and Classification, Technical Report 9702, Department of Statistics, Uni- 
versity of Toronto (1997). 
[3] Gibbs M.N. and Mackay D.J.C., Variational Gaussian Process Classifiers, Preprint 
Cambridge University (1997). 
[4] Williams C.K.I. and Barber D, Bayesian Classification with Gaussian Processes, IEEE 
Trans Pattern Analysis and Machine Intelligence, 20 1342-1351 (1998). 
[5] Opper M. and Winther O. Gaussian Processes for Classification: Mean 
Field Algorithms, Submitted to Neural Computation, http://www.thep.lu.se 
/tf2/staff/winther/ (1999). 
[6] Zinn-Justin J, Quantum Field Theory and Critical Phenomena, Clarendon Press Ox- 
ford (1990). 
[7] Ripley B.D, Pattern Recognition and Neural Networks, Cambridge University Press 
(1996). 
[8] Opper M., Online versus Offiine Learning from Random Examples: General Results, 
Phys. Rev. Lett. 77, 4671 (1996). 
[9] Online Learning in Neural Networks, Cambridge University Press, D. Saad (ed.) 
(1998). 
[10] Gorman R.P and Sejnowski T.J, Analysis of hidden units in a layered network trained 
to classify sonar targets, Neural Networks 1, (1988). 
[11] Jaakkola T. and Haussler D. Probabilistic kernel regression, In Online Proceedings 
of 7-th Int. Workshop on AI and Statistics (1999), 
http://uncertainty99.microsoft.com/proceedings.htm. 
