General Bounds on Bayes Errors for 
Regression with Gaussian Processes 
Manfred Opper 
Neural Computing Research Group 
Dept. of Electronic Engineering 
and Computer Science, 
Aston University, 
Birmingham, B4 7ET 
United Kingdom 
oppermast on. ac. uk 
Francesco Vivarelli 
Centro Ricerche Ambientali 
Montecatini, 
via Ciro Menotti, 48 
48023 Marina di Ravenna, 
Italy 
fvivarellicramont. it 
Abstract 
Based on a simple convexity lemma, we develop bounds for differ- 
ent types of Bayesian prediction errors for regression with Gaussian 
processes. The basic bounds are formulated for a fixed training set. 
Simpler expressions are obtained for sampling from an input distri- 
bution which equals the weight function of the covariance kernel, 
yielding asymptotically tight results. The results are compared 
with numerical experiments. 
1 Introduction
Nonparametric Bayesian models based on Gaussian priors on function spaces
are becoming increasingly popular in the neural computation community
(see e.g. [1, 2, 3, 4, 7]). Since the model classes considered in this approach are
infinite dimensional, applying Vapnik-Chervonenkis type methods to
determine bounds for the learning curves is nontrivial and, to our knowledge,
has not been done so far. In these methods, the target function to be learnt is
fixed and input data are drawn independently at random from a fixed (unknown)
distribution. The approach of this paper is different. Here, we assume that the target
is actually drawn at random from a known prior distribution, and we are interested
in developing simple bounds on the average prediction performance (with respect
to the prior) which hold for a fixed set of inputs. Only at a later stage is an average
over the input distribution taken.
2 Regression with Gaussian processes 
To explain the Gaussian process scenario for regression problems [4], we assume that
observations $y \in \mathbb{R}$ at input points $x \in \mathbb{R}^D$ are corrupted values of a
function $\theta(x)$ by an independent Gaussian noise with variance $\sigma^2$. The
appropriate stochastic model is given by the likelihood
$$p_\theta(y|x) = \frac{e^{-(y-\theta(x))^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}} \qquad (1)$$
The goal of a learner is to give an estimate of the function $\theta(x)$, based on a set of
observed example data $D_n = ((x_1, y_1), \ldots, (x_n, y_n))$. As the prior information about
the unknown function $\theta(x)$ we assume that $\theta$ is a realization of a Gaussian random
field with zero mean and covariance
$$C(x, x') = \mathbb{E}[\theta(x)\theta(x')]. \qquad (2)$$
It is useful to expand the random functions as
$$\theta(x) = \sum_{k=0}^{\infty} w_k \phi_k(x) \qquad (3)$$
in a complete set of deterministic functions $\phi_k(x)$ with random Gaussian coefficients
$w_k$. As is well known, if the $\phi_k$ are chosen as orthonormal eigenfunctions of the
integral equation
$$\int C(x,x')\,\phi_k(x')\,p(x')\,dx' = \lambda_k \phi_k(x), \qquad (4)$$
with eigenvalues $\lambda_k$ and a nonnegative weight function $p(x)$, the a priori statistics
of the $w_k$ is simple. They are independent Gaussian variables which satisfy
$\mathbb{E}[w_k w_l] = \lambda_k \delta_{kl}$.
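The eigenvalue problem (4) can be explored numerically. The following is a minimal
sketch, not part of the original experiments: it assumes the one-dimensional
squared-exponential kernel and standard normal weight function used in the
simulations section, and it approximates the spectrum by the eigenvalues of the
kernel matrix divided by the number of sampled inputs (a Nystrom-type estimate).
The function names, the sample size m and the seed are illustrative choices.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale):
    """Squared-exponential kernel exp(-(x - y)^2 / (2 * lengthscale^2))."""
    d = x[:, None] - y[None, :]
    return np.exp(-d**2 / (2.0 * lengthscale**2))

def nystrom_eigenvalues(m=1000, lengthscale=1.0, seed=0):
    """Approximate the eigenvalues of (4) from m inputs drawn from p(x) = N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(m)
    K = rbf_kernel(x, x, lengthscale)
    # Eigenvalues of K / m approximate the lambda_k; sort in decreasing order.
    return np.sort(np.linalg.eigvalsh(K / m))[::-1]

def analytic_eigenvalues(num, lengthscale=1.0):
    """lambda_k = a * b**k, the closed form quoted in the simulations section."""
    c = 2.0 / (1.0 + 2.0 / lengthscale**2 + np.sqrt(1.0 + 4.0 / lengthscale**2))
    a, b = np.sqrt(c), c / lengthscale**2
    return a * b ** np.arange(num)

approx = nystrom_eigenvalues()
exact = analytic_eigenvalues(5)
print("sampled :", approx[:5])
print("analytic:", exact)
```

The sampled values only approximate the analytic ones, and the agreement degrades
for the small eigenvalues; the sum of the sampled eigenvalues equals C(x,x) = 1 by
construction.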
3 Prediction and Bayes error 
Usually, the posterior mean of $\theta(x)$ is chosen as the prediction $\hat\theta(x)$ on a new point
$x$ based on a dataset $D_n = (x_1,y_1),\ldots,(x_n,y_n)$. Its explicit form can be easily
derived by using the expansion $\hat\theta(x) = \sum_k \hat w_k \phi_k(x)$, and the fact that for Gaussian
random variables, their mean coincides with their most probable value. Maximizing
the log posterior with respect to the $w_k$, one finds for the infinite dimensional vector
$\hat w = (\hat w_k)_{k=0,\ldots,\infty}$ the result $\hat w = (\sigma^2 I + \Lambda V)^{-1} b$, where
$V_{kl} = \sum_{i=1}^n \phi_k(x_i)\phi_l(x_i)$, $\Lambda_{kl} = \lambda_k \delta_{kl}$ and
$b_k = \sum_{i=1}^n \lambda_k y_i \phi_k(x_i)$. Fixing the set of inputs $x^n$, the Bayesian
prediction error at a point $x$ is given by
$$\varepsilon(x|x^n) = \mathbb{E}\left[(\theta(x) - \hat\theta(x))^2\right]. \qquad (5)$$
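Evaluating (5) yields, after some work, the expression
$$\varepsilon(x|x^n) = \sigma^2\,\mathrm{Tr}\left\{(\sigma^2 I + \Lambda V)^{-1}\Lambda U(x)\right\} \qquad (6)$$
with the matrix $U_{kl}(x) = \phi_k(x)\phi_l(x)$. $U$ has the properties that
$\sum_{i=1}^n U(x_i) = V$ and $\int dx\, p(x)\, U(x) = I$. We define the Bayesian training error
as the empirical average of the error (5) at the $n$ datapoints of the training set and
the Bayesian generalization error as the average error over all $x$ weighted by the
function $p(x)$. We get
$$\varepsilon_t(x^n) = \frac{\sigma^2}{n}\,\mathrm{Tr}\left\{(\sigma^2 I + \Lambda V)^{-1}\Lambda V\right\} \qquad (7)$$
$$\varepsilon_g(x^n) = \sigma^2\,\mathrm{Tr}\left\{(\sigma^2 I + \Lambda V)^{-1}\Lambda\right\} \qquad (8)$$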
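The weight-space expressions of this section can be checked against the familiar
kernel form of Gaussian process regression. The sketch below is only an
illustration under stated assumptions: it uses a hypothetical finite basis
$\phi_k(x) = \cos(kx)$ with hypothetical prior variances $\lambda_k = 2^{-k}$, which is
enough to test the posterior mean, the pointwise error and the trace form of the
training error. The generalization error additionally needs the basis to be
orthonormal with respect to $p(x)$, which this toy basis is not, so it is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, sigma2 = 8, 5, 0.1
lam = 0.5 ** np.arange(K)            # hypothetical prior variances lambda_k
x = rng.uniform(-1.0, 1.0, n)        # fixed training inputs
y = rng.standard_normal(n)           # arbitrary targets

def features(x):
    """Hypothetical basis phi_k(x) = cos(k x), k = 0..K-1; shape (len(x), K)."""
    return np.cos(np.outer(x, np.arange(K)))

Phi = features(x)
Lam = np.diag(lam)
V = Phi.T @ Phi                      # V_kl = sum_i phi_k(x_i) phi_l(x_i)
b = lam * (Phi.T @ y)                # b_k = lambda_k sum_i y_i phi_k(x_i)
A = sigma2 * np.eye(K) + Lam @ V

# Weight-space posterior mean w_hat = (sigma^2 I + Lambda V)^{-1} b.
w_hat = np.linalg.solve(A, b)

# Kernel form of the same predictor, C(x, x') = sum_k lambda_k phi_k(x) phi_k(x').
C = Phi @ Lam @ Phi.T
G = C + sigma2 * np.eye(n)
x_new = np.array([0.3])
pred_weight = features(x_new) @ w_hat
pred_kernel = (features(x_new) @ Lam @ Phi.T) @ np.linalg.solve(G, y)

# Pointwise error (trace form) versus the kernel-form posterior variance.
phi_new = features(x_new)[0]
U = np.outer(phi_new, phi_new)
err_trace = sigma2 * np.trace(np.linalg.solve(A, Lam @ U))
k_new = Phi @ (lam * phi_new)
err_kernel = phi_new @ (lam * phi_new) - k_new @ np.linalg.solve(G, k_new)

# Training error as a trace in weight space and in kernel space.
eps_t = sigma2 / n * np.trace(np.linalg.solve(A, Lam @ V))
eps_t_kernel = sigma2 / n * np.trace(C @ np.linalg.inv(G))

print(pred_weight, pred_kernel)
print(err_trace, err_kernel)
print(eps_t, eps_t_kernel)
```

Each printed pair should agree to numerical precision, since $\Lambda V$ and the kernel
matrix share their nonzero eigenvalues.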
4 Entropic error 
In order to understand the next type of error [9], we assume that the data arrive
sequentially, one after the other. The predictive distribution after $t - 1$ training data
at the new input $x_t$ is the posterior expectation of the likelihood (1), i.e.
$$\hat P(y|x_t, D_{t-1}) = \mathbb{E}\left[P_\theta(y|x_t)\,|\,D_{t-1}\right].$$
Let $L_t$ be the Bayesian average of the relative entropy (or Kullback-Leibler
divergence) between the predictive distribution and the true distribution $P_\theta$ from
which the data were generated, i.e. $L_t = \mathbb{E}[D_{KL}(P_\theta \| \hat P)]$. It can also be shown that
$L_t = \frac{1}{2}\ln\left(1 + \varepsilon(x_t|x^{t-1})/\sigma^2\right)$. Hence, when the prediction error is small,
we will have
$$L_t \simeq \frac{\varepsilon(x_t|x^{t-1})}{2\sigma^2}. \qquad (9)$$
The cumulative entropic error $E_e(x^n)$ is defined by summing up all the losses (which
gives an integrated learning curve) from $t = 1$ up to time $n$, and one can show that
$$E_e(x^n) = \sum_{t=1}^{n} L_t(x_t, D_{t-1}) = \mathbb{E}\,D_{KL}(P^n \| \hat P^n) = \frac{1}{2}\,\mathrm{Tr}\ln\left(I + \Lambda V/\sigma^2\right) \qquad (10)$$
where $P^n = \prod_{i=1}^n P_\theta(y_i|x_i)$ and $\hat P^n = \mathbb{E}\left[\prod_{i=1}^n P_\theta(y_i|x_i)\right]$.
The first equality may be found e.g. in [9], and the second follows from direct
calculation.
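The closed form of the cumulative entropic error can be checked numerically. The
sketch below is an illustration under assumptions not in the original: it uses a
squared-exponential kernel with an arbitrary lengthscale, takes the Gaussian
closed form $L_t = \frac{1}{2}\ln(1 + \varepsilon/\sigma^2)$ for each sequential loss, and uses
Sylvester's determinant identity to rewrite $\mathrm{Tr}\ln(I + \Lambda V/\sigma^2)$ as
$\ln\det(I + K/\sigma^2)$ with the kernel matrix $K$ of the inputs.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=0.5):
    """Squared-exponential kernel; the lengthscale is an arbitrary choice."""
    d = x[:, None] - y[None, :]
    return np.exp(-d**2 / (2.0 * lengthscale**2))

def posterior_variance(x_new, x_train, sigma2):
    """Prediction error eps(x_new | x_train) for a zero-mean GP with this kernel."""
    if len(x_train) == 0:
        return 1.0  # prior variance C(x, x) = 1
    K = rbf_kernel(x_train, x_train) + sigma2 * np.eye(len(x_train))
    k = rbf_kernel(x_train, np.array([x_new]))[:, 0]
    return 1.0 - k @ np.linalg.solve(K, k)

rng = np.random.default_rng(1)
sigma2, n = 0.1, 12
x = rng.standard_normal(n)

# Sum of the sequential losses L_t = 0.5 * ln(1 + eps(x_t | x^{t-1}) / sigma^2).
cumulative = sum(
    0.5 * np.log1p(posterior_variance(x[t], x[:t], sigma2) / sigma2)
    for t in range(n)
)

# Closed form: 0.5 * ln det(I + K / sigma^2).
K = rbf_kernel(x, x)
sign, logdet = np.linalg.slogdet(np.eye(n) + K / sigma2)
closed_form = 0.5 * logdet

print(cumulative, closed_form)
```

The two numbers coincide because the determinant of the noisy kernel matrix
factorises into the sequential predictive variances $\sigma^2 + \varepsilon(x_t|x^{t-1})$.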
5 Bounds for fixed set of inputs 
In order to get bounds on (7),(8) and (10), we use a lemma, which has been used in 
Quantum Statistical Mechanics to get bounds on the free energy. The lemma (for 
the special function f(x) = e -x) was proved by Sir Rudolf Peierls in 1938 [10]. In 
order to keep the paper self contained, we have included the proof in the appendix. 
Lemma 1 Let $H$ be a real symmetric matrix and $f$ a convex real function. Then
$\mathrm{Tr}\, f(H) \ge \sum_k f(H_{kk})$.
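By noting that for concave functions the bound goes in the other direction, we
immediately get
$$\varepsilon_t(x^n) \le \frac{\sigma^2}{n} \sum_k \frac{\lambda_k V_{kk}}{\sigma^2 + \lambda_k V_{kk}} \le \sigma^2 \sum_k \frac{\lambda_k v_k}{\sigma^2 + n \lambda_k v_k} \qquad (11)$$
$$\varepsilon_g(x^n) \ge \sigma^2 \sum_k \frac{\lambda_k}{\sigma^2 + \lambda_k V_{kk}} \ge \sigma^2 \sum_k \frac{\lambda_k}{\sigma^2 + n \lambda_k v_k} \qquad (12)$$
$$E_e(x^n) \le \frac{1}{2} \sum_k \ln\left(1 + V_{kk}\lambda_k/\sigma^2\right) \le \frac{1}{2} \sum_k \ln\left(1 + n v_k \lambda_k/\sigma^2\right) \qquad (13)$$
where in the rightmost inequalities, we assume that all $n$ inputs are in a compact
region $\mathcal{D}$, and we define $v_k = \sup_{x \in \mathcal{D}} \phi_k^2(x)$.¹

¹ The entropic case may also be proved by Hadamard's inequality.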
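One way to see how the lemma produces these bounds is sketched below. The
symmetrised matrix $\tilde V = \Lambda^{1/2} V \Lambda^{1/2}$ and the assumption $\lambda_k > 0$ are
choices made for this sketch; the text above does not spell out the intermediate
step. Since $\Lambda V$ and $\tilde V$ are similar matrices, traces of functions of
either agree, and $\tilde V_{kk} = \lambda_k V_{kk}$.

```latex
\begin{align*}
\varepsilon_g(x^n)
  &= \sigma^2\,\mathrm{Tr}\,\bigl(\sigma^2\Lambda^{-1} + V\bigr)^{-1}
   \;\ge\; \sigma^2 \sum_k \frac{1}{\sigma^2/\lambda_k + V_{kk}}
   = \sigma^2 \sum_k \frac{\lambda_k}{\sigma^2 + \lambda_k V_{kk}}
  && \bigl(f(h) = 1/h \text{ convex for } h > 0\bigr), \\
\varepsilon_t(x^n)
  &= \frac{\sigma^2}{n}\,\mathrm{Tr}\, g(\tilde V)
   \;\le\; \frac{\sigma^2}{n} \sum_k \frac{\lambda_k V_{kk}}{\sigma^2 + \lambda_k V_{kk}}
  && \bigl(g(h) = h/(\sigma^2 + h) \text{ concave for } h \ge 0\bigr), \\
E_e(x^n)
  &= \tfrac{1}{2}\,\mathrm{Tr}\,\ln\bigl(I + \tilde V/\sigma^2\bigr)
   \;\le\; \tfrac{1}{2} \sum_k \ln\bigl(1 + \lambda_k V_{kk}/\sigma^2\bigr)
  && \bigl(\ln \text{ concave}\bigr).
\end{align*}
```

The rightmost inequalities then follow from $V_{kk} = \sum_{i=1}^n \phi_k^2(x_i) \le n v_k$ together
with the monotonicity in $V_{kk}$ of each summand.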
6 Average case bounds 
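Next, we assume that the input data are drawn at random and denote by $\langle \ldots \rangle$
the expectations with respect to the distribution. We do not have to assume
independence here, but only the fact that all marginal distributions for the $n$ inputs
are identical. Using Jensen's inequality, we get
$$\langle \varepsilon_t(x^n) \rangle \le \sigma^2 \sum_k \frac{\lambda_k u_k}{\sigma^2 + n \lambda_k u_k} \qquad (14)$$
$$\langle \varepsilon_g(x^n) \rangle \ge \sigma^2 \sum_k \frac{\lambda_k}{\sigma^2 + n \lambda_k u_k} \qquad (15)$$
$$\langle E_e(x^n) \rangle \le \frac{1}{2} \sum_k \ln\left(1 + n u_k \lambda_k/\sigma^2\right) \qquad (16)$$
where now $u_k = \langle \phi_k^2(x) \rangle$. This result is especially simple when the weighting
function $p(x)$ is a probability density and the inputs have the marginal distribution
$p(x)$. In this case, we simply have $u_k = 1$, and training and generalization
error sandwich the bound
$$\varepsilon_b = \sigma^2 \sum_k \frac{\lambda_k}{\sigma^2 + n \lambda_k}, \qquad (17)$$
i.e. $\langle \varepsilon_t(x^n) \rangle \le \varepsilon_b \le \langle \varepsilon_g(x^n) \rangle$.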
We expect that the bound $\varepsilon_b$ becomes asymptotically exact when $n \to \infty$. This
should be intuitively clear, because training and generalization error approach each
other asymptotically. This fact may also be understood from (9), which shows that
the cumulative entropic error is, within a factor of $1/(2\sigma^2)$, asymptotically equal to
the cumulative generalization error. By integrating the lower bound (17) over $n$, we
obtain precisely the upper bound on $E_e$ with a factor $2\sigma^2$, showing that upper and
lower bounds show the same behaviour.
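The integration mentioned above can be written out explicitly. This is a worked
step added for clarity, treating $n$ as a continuous variable $n'$ and taking $u_k = 1$:

```latex
\begin{align*}
\int_0^n \varepsilon_b(n')\,dn'
  &= \sigma^2 \sum_k \int_0^n \frac{\lambda_k}{\sigma^2 + n'\lambda_k}\,dn'
   = \sigma^2 \sum_k \ln\!\left(1 + \frac{n\lambda_k}{\sigma^2}\right) \\
  &= 2\sigma^2 \cdot \frac{1}{2} \sum_k \ln\!\left(1 + \frac{n\lambda_k}{\sigma^2}\right).
\end{align*}
```

The last factor is the right-hand side of the average-case entropic bound with
$u_k = 1$, which is where the factor $2\sigma^2$ comes from.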
7 Simulations 
We have compared our bounds with simulations for the average training error and 
generalization error for the case that the data are drawn from p(x). Results for the 
entropic error will be given elsewhere. 
We have specialized on the case where the covariance kernel is of the RBF form
$C(x,x') = \exp[-(x - x')^2/(2\lambda^2)]$, and $p(x) = (2\pi)^{-1/2} e^{-x^2/2}$, for which, following Zhu
et al. (1997) [11], the k-th eigenvalue of the spectrum ($k = 0, \ldots, \infty$) can be written
as $\lambda_k = a b^k$, where $a = \sqrt{c}$, $b = c/\lambda^2$, $c = 2\left(1 + 2/\lambda^2 + \sqrt{1 + 4/\lambda^2}\right)^{-1}$, and $\lambda$
is the lengthscale of the process. We estimated the averages of the training and
generalisation errors over the distribution of the datasets by a Monte Carlo
approximation, evaluating the exact analytical expressions (7) and (8) for each
training set. To begin
with, let us consider $x \in \mathbb{R}$. We sampled the 1-dimensional input space, generating
100 training sets whose data points were normally distributed around zero with
unit variance. For each training set, the expected training and generalisation errors
for a GP have been evaluated using up to 1000 data points. We set the value
of the lengthscale² $\lambda$ to 0.1 and we let the noise level $\sigma^2$ assume several values
($\sigma^2 = 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1$). Figure 1 shows the results we obtained when
$\sigma^2 = 0.1$ (Figure 1(a)) and $\sigma^2 = 1$ (Figure 1(b)). The bound $\varepsilon_b(n)$ lies between the
training and learning curves, being an upper bound for $\varepsilon_t(n)$ and a lower bound for
$\varepsilon_g(n)$. This bound is tighter for the processes with higher noise level; in particular,
for large datasets the error bars on the curves $\varepsilon_t(n)$ and $\varepsilon_g(n)$ overlap the bound
$\varepsilon_b(n)$. The curves $\varepsilon_t(n)$, $\varepsilon_g(n)$ and $\varepsilon_b(n)$ approach zero as $O(\log(n)/n)$.

² The value of the lengthscale $\lambda$ has the effect of stretching the training and learning
curves; thus the results of the experiments performed with different $\lambda$ are qualitatively
similar to those presented.

[Figure 1: two panels plotting the errors against n (axis ticks at 10, 100, 1000).
(a) $\lambda = 0.1$, $\sigma^2 = 0.1$; (b) $\lambda = 0.1$, $\sigma^2 = 1$.]
Figure 1: The Figures show the graphs of the training and learning curves with their
bound $\varepsilon_b(n)$ obtained with $\lambda = 0.1$; the noise level is set to 0.1 in Figure 1(a) and to
1 in Figure 1(b). In all the graphs, $\varepsilon_t(n)$ and $\varepsilon_g(n)$ are drawn by the solid line and their
95% confidence interval is marked by the dotted curves. The bound $\varepsilon_b(n)$ is drawn by the
dash-dotted lines.
Our bounds can also be applied to higher dimensions $D > 1$ using the covariance
$$C(x,x') = \exp\left(-\|x - x'\|^2/(2\lambda^2)\right) \qquad (18)$$
for $x, x' \in \mathbb{R}^D$. Obviously the integral kernel $C$ is just a direct product of RBF
kernels, one for each coordinate of $x$ and $x'$. The eigenvalue problem (4) can be
immediately reduced to the one for a single variable. Eigenfunctions and eigenvalues
are simply products of those for the single coordinate problems. Hence, using a bit
of combinatorics, the bound $\varepsilon_b$ can be written as
$$\varepsilon_b = \sum_{k=0}^{\infty} \binom{k+D-1}{k} \frac{\sigma^2 a^D b^k}{\sigma^2 + n a^D b^k}, \qquad (19)$$
where $a$ and $b$ have been defined above. We performed experiments when $x \in \mathbb{R}^2$
and $x \in \mathbb{R}^5$. The correlation lengths along each direction of the input space have
been set to 1 and the noise level was $\sigma^2 = 1.0$. The graphs of the curves, with their
error bars, are reported in Figure 2(a) (for $x \in \mathbb{R}^2$) and in Figure 2(b) (for $x \in \mathbb{R}^5$).
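The D-dimensional bound is straightforward to evaluate numerically. The sketch
below assumes the per-coordinate eigenvalues $a b^k$ of the squared-exponential
kernel with unit-variance Gaussian inputs, counts the index tuples whose entries
sum to $k$ with a binomial coefficient, and truncates the sum at an arbitrary
`kmax`; the function names are illustrative.

```python
import numpy as np
from math import comb

def rbf_a_b(lengthscale):
    """Constants a and b of the 1-D spectrum lambda_k = a * b**k."""
    c = 2.0 / (1.0 + 2.0 / lengthscale**2 + np.sqrt(1.0 + 4.0 / lengthscale**2))
    return np.sqrt(c), c / lengthscale**2

def bound_d(n, dim, lengthscale=1.0, sigma2=1.0, kmax=400):
    """Truncated evaluation of the D-dimensional bound eps_b(n)."""
    a, b = rbf_a_b(lengthscale)
    total = 0.0
    for k in range(kmax):
        lam = a**dim * b**k                  # eigenvalue shared by all tuples summing to k
        mult = comb(k + dim - 1, dim - 1)    # number of such index tuples
        total += mult * sigma2 * lam / (sigma2 + n * lam)
    return total

for dim in (1, 2, 5):
    print(dim, [bound_d(n, dim) for n in (10, 100, 1000)])
```

At n = 0 the sum reduces to the total prior variance, which equals 1 for this
kernel, a convenient sanity check on the multiplicities.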
8 Discussion 
Based on the minimal requirements on training inputs and covariances, we con- 
jecture that our bounds cannot be improved much without making more detailed 
assumptions on models and distributions. We can observe from the simulations 
that the tightness of the bound $\varepsilon_b(n)$ depends on the dimension of the input space.
In particular, for large datasets $\varepsilon_b(n)$ is tighter for small dimension of the input
space; Figure 2(a) shows this quite clearly since $\varepsilon_b(n)$ overlaps the error bars of the
training and learning curves for large $n$. Numerical simulations performed using
modified Bessel covariance functions of order $r$ (describing random processes $r - 1$
times mean square differentiable) have shown that the bound $\varepsilon_b(n)$ becomes tighter
for smoother processes.

[Figure 2: two panels plotting the errors against n (axis ticks at 10, 100, 1000).
(a) d = 2; (b) d = 5.]
Figure 2: The Figures show the graphs of the training and learning curves with their
bound $\varepsilon_b(n)$ obtained with the squared exponential covariance function with $\lambda = 1$ and
$\sigma^2 = 1$; the input space is $\mathbb{R}^2$ (Figure 2(a)) and $\mathbb{R}^5$ (Figure 2(b)). In all the Figures,
$\varepsilon_t(n)$ and $\varepsilon_g(n)$ are drawn by the solid line and their 95% confidence interval is marked
by the dotted curves. The bound $\varepsilon_b(n)$ is drawn by the dash-dotted lines.
Acknowledgement: We are grateful for many inspiring discussions with C.K.I. 
Williams. M.O. would like to thank Peter Sollich for his conjecture that (17) is an 
exact lower bound on the generalization error, which motivated part of this work. 
F. V. was supported by a studentship of British Aerospace. 
9 Appendix: Proof of Lemma 1
Let {(J)} be a complete set of orthonormal eigenvectors and {Ei} the correspond- 
ing set of eigenvalues of H, i.e. we have the properties '.t Hk,l i) -- Ei( i), 
i a(i) a(i) = 5k, and E (i)?) = 5ij. Then we get 
Xk 
i k i 
k k l 
k 
The second equality follows from orthonormality, because ((i))2 = 1. The 
inequality uses the fact that by completeness, for any k, we have i((i))2 = 1 
and we may regard the ((i))2 as probabilities, such that by convexity, Jensen's 
inequality can be used. After using the eigenvalue equation, the sum over i was 
carried out with the help of the completeness relation, in order to obtain the last 
line. 
308 34. Opper and E varelli 
References 
[1] D. J. C. MacKay, Gaussian Processes, A Replacement for Neural
Networks, NIPS tutorial 1997. May be obtained from
http://wol.ra.phy.cam.ac.uk/pub/mackay/.
[2] R. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, 
Springer (1996). 
[3] C. K. I. Williams, Computing with Infinite Networks, in Neural Information 
Processing Systems 9, M. C. Mozer, M. I. Jordan and T. Petsche, eds., 295-301. 
MIT Press (1997). 
[4] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Regression, in 
Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and 
M. E. Hasselmo eds., 514-520, MIT Press (1996). 
[5] R. M. Neal, Monte Carlo Implementation of Gaussian Process Models for 
Bayesian Regression and Classification, Technical Report CRG-TR-97-2, Dept. 
of Computer Science, University of Toronto (1997). 
[6] M. N. Gibbs and D. J. C. MacKay, Variational Gaussian Process Classifiers,
Preprint Cambridge University (1997).
[7] D. Barber and C. K. I. Williams, Gaussian Processes for Bayesian Classification 
via Hybrid Monte Carlo, in Neural Information Processing Systems 9, M. C. 
Mozer, M. I. Jordan and T. Petsche, eds., 340-346. MIT Press (1997). 
[8] C. K. I. Williams and D. Barber, Bayesian Classification with Gaussian Pro- 
cesses, Preprint Aston University (1997). 
[9] D. Haussler and M. Opper, Mutual Information, Metric Entropy and Cumu- 
lative Relative Entropy Risk, The Annals of Statistics, Vol 25, No 6, 2451 
(1997). 
[10] R. Peierls, Phys. Rev. 54, 918 (1938). 
[11] H. Zhu, C. K. I. Williams, R. Rohwer and M. Morciniec, Gaussian Re- 
gression and Optimal Finite Dimensional Linear Models, Technical report 
NCRG/97/011, Aston University (1997). 
