Predictive Approaches For Choosing 
Hyperparameters in Gaussian Processes 
S. Sundararajan 
Computer Science and Automation 
Indian Institute of Science 
Bangalore 560 012, India 
sundarcsa. iisc. ernet. in 
S. Sathiya Keerthi 
Mechanical and Production Engg. 
National University of Singapore 
10 Kentridge Crescent, Singapore 119260 
mpesskguppy. rope. nus. edu. sg 
Abstract 
Gaussian Processes are powerful regression models specified by
parametrized mean and covariance functions. Standard approaches
to estimating these parameters (known as hyperparameters) are the
Maximum Likelihood (ML) and Maximum A Posteriori (MAP)
approaches. In this paper, we propose and investigate predictive
approaches, namely, maximization of Geisser's Surrogate
Predictive Probability (GPP) and minimization of the mean square
error with respect to GPP (referred to as Geisser's Predictive mean
square Error (GPE)), to estimate the hyperparameters. We also
derive results for the standard Cross-Validation (CV) error and
make a comparison. These approaches are tested on a number of
problems, and experimental results show that they are strongly
competitive with existing approaches.
1 Introduction
Gaussian Processes (GPs) are powerful regression models that have gained popular- 
ity recently, though they have appeared in different forms in the literature for years. 
They can be used for classification also; see MacKay (1997), Rasmussen (1996) and 
Williams and Rasmussen (1996). Here, we restrict ourselves to regression problems. 
Neal (1996) showed that a large class of neural network models converge to a Gaus- 
sian Process prior over functions in the limit of an infinite number of hidden units. 
Although GPs can be created using infinite networks, often GPs are specified di- 
rectly using parametric forms for the mean and covariance functions (Williams and 
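Rasmussen (1996)). We assume that the process is zero mean. Let $Z_N = \{X_N, y_N\}$,
where $X_N = \{x(i) : i = 1, \ldots, N\}$ and $y_N = \{y(i) : i = 1, \ldots, N\}$. Here, $y(i)$
represents the output corresponding to the input vector $x(i)$. Then, the Gaussian
prior over the functions is given by

$$p(y_N | X_N, \tilde{\theta}) = \frac{\exp\left(-\frac{1}{2} y_N^T \Sigma_N^{-1} y_N\right)}{(2\pi)^{N/2} |\Sigma_N|^{1/2}} \qquad (1)$$

where $\Sigma_N$ is the covariance matrix with $(i,j)$th element
$[\Sigma_N]_{i,j} = \tilde{C}(x(i), x(j); \tilde{\theta})$
and $\tilde{C}(\cdot\,; \tilde{\theta})$ denotes the parametrized covariance function. Now, assuming that the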
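observed output $t_N$ is modeled as $t_N = y_N + e_N$ and $e_N$ is zero mean multivariate
Gaussian with covariance matrix $\sigma^2 I_N$ and is independent of $y_N$, we get

$$p(t_N | X_N, \theta) = \frac{\exp\left(-\frac{1}{2} t_N^T C_N^{-1} t_N\right)}{(2\pi)^{N/2} |C_N|^{1/2}} \qquad (2)$$

where $C_N = \Sigma_N + \sigma^2 I_N$. Therefore, $[C_N]_{i,j} = [\Sigma_N]_{i,j} + \sigma^2 \delta_{i,j}$, where $\delta_{i,j} = 1$ when
$i = j$ and zero otherwise. Note that $\theta = (\tilde{\theta}, \sigma^2)$ is the new set of hyperparameters.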
Then, the predictive distribution of the output $y(N+1)$ for a test case $x(N+1)$ is
also Gaussian with mean and variance

$$\hat{y}(N+1) = k_{N+1}^T C_N^{-1} t_N \qquad (3)$$

and

$$\sigma^2_{y(N+1)} = b_{N+1} - k_{N+1}^T C_N^{-1} k_{N+1} \qquad (4)$$

where $b_{N+1} = C(x(N+1), x(N+1); \theta)$ and $k_{N+1}$ is an $N \times 1$ vector with $i$th
element given by $C(x(N+1), x(i); \theta)$. Now, we need to specify the covariance
function $C(\cdot\,; \theta)$. Williams and Rasmussen (1996) found the following covariance
function to work well in practice.
$$\tilde{C}(x(i), x(j); \tilde{\theta}) = a_0 + a_1 \sum_{p=1}^{M} x_p(i) x_p(j) + v_0 \exp\left(-\frac{1}{2} \sum_{p=1}^{M} w_p \left(x_p(i) - x_p(j)\right)^2\right) \qquad (5)$$

where $x_p(i)$ is the $p$th component of the $i$th input vector $x(i)$. The $w_p$ are the
Automatic Relevance Determination (ARD) parameters. Note that
$C(x(i), x(j); \theta) = \tilde{C}(x(i), x(j); \tilde{\theta}) + \sigma^2 \delta_{i,j}$. Also, all the parameters are
positive and it is convenient to use a logarithmic scale. Hence, $\theta$ is given by
$\log(a_0, a_1, v_0, w_1, \ldots, w_M, \sigma^2)$.
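
The covariance function (5) and the predictive equations (3) and (4) can be
sketched in a few lines of NumPy. This is only an illustrative sketch: the function
names cov_tilde and gp_predict are hypothetical, and the choice to include the
noise term in b_{N+1} is an assumption (drop it to obtain the variance of the
noise-free output).

```python
import numpy as np

def cov_tilde(X1, X2, a0, a1, v0, w):
    """Noise-free covariance of eq. (5) between the rows of X1 and X2."""
    linear = X1 @ X2.T                            # sum_p x_p(i) x_p(j)
    diff = X1[:, None, :] - X2[None, :, :]        # pairwise input differences
    sq_dist = (w * diff ** 2).sum(axis=2)         # ARD-weighted squared distance
    return a0 + a1 * linear + v0 * np.exp(-0.5 * sq_dist)

def gp_predict(X, t, x_new, a0, a1, v0, w, noise_var):
    """Predictive mean (3) and variance (4) for a single test input x_new."""
    N = X.shape[0]
    C = cov_tilde(X, X, a0, a1, v0, w) + noise_var * np.eye(N)   # C_N
    k = cov_tilde(X, x_new[None, :], a0, a1, v0, w)[:, 0]        # k_{N+1}
    # Assumption: b_{N+1} includes the noise term (variance of a noisy target).
    b = cov_tilde(x_new[None, :], x_new[None, :], a0, a1, v0, w)[0, 0] + noise_var
    mean = k @ np.linalg.solve(C, t)
    var = b - k @ np.linalg.solve(C, k)
    return mean, var
```

Solving the linear systems with np.linalg.solve avoids forming the explicit
inverse of C_N when only a single prediction is needed.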
Then, the question is: how do we handle $\theta$? More sophisticated techniques
like Hybrid Monte Carlo (HMC) methods (Rasmussen (1996) and Neal (1997))
are available which can numerically integrate over the hyperparameters to make
predictions. Alternatively, we can estimate $\theta$ from the training data. We restrict
ourselves to the latter approach here. In the classical approach, $\theta$ is assumed to be
deterministic but unknown and the estimate is found by maximizing the likelihood
(2). That is, $\theta_{ML} = \arg\max_{\theta}\, p(t_N | X_N, \theta)$. In the Bayesian approach, $\theta$ is
assumed to be random and a prior $p(\theta)$ is specified. Then, the MAP estimate
$\theta_{MP}$ is obtained as $\theta_{MP} = \arg\max_{\theta}\, p(t_N | X_N, \theta)\, p(\theta)$, with the motivation that
the predictive distribution $p(y(N+1) | x(N+1), Z_N)$ can be approximated as
$p(y(N+1) | x(N+1), Z_N, \theta_{MP})$. With this background, in this paper we propose
and investigate different predictive approaches to estimate the hyperparameters
from the training data.
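
As a point of reference for the ML and MAP estimates described above, the
following sketch evaluates the negative logarithm of the likelihood (2) and of
the unnormalized posterior, with the hyperparameters on the logarithmic scale.
The helper names and the user-supplied log_prior callable are assumptions made
for illustration; the particular prior used in the experiments is not reproduced here.

```python
import numpy as np

def unpack(log_theta):
    """Split theta = log(a0, a1, v0, w_1..w_M, sigma^2) into positive values."""
    theta = np.exp(log_theta)
    return theta[0], theta[1], theta[2], theta[3:-1], theta[-1]

def build_C(log_theta, X):
    """C_N = Sigma_N + sigma^2 I_N, with Sigma_N from covariance function (5)."""
    a0, a1, v0, w, noise_var = unpack(log_theta)
    diff = X[:, None, :] - X[None, :, :]
    Sigma = a0 + a1 * (X @ X.T) + v0 * np.exp(-0.5 * (w * diff ** 2).sum(axis=2))
    return Sigma + noise_var * np.eye(X.shape[0])

def neg_log_likelihood(log_theta, X, t):
    """-log p(t_N | X_N, theta) from eq. (2); its minimizer is theta_ML."""
    C = build_C(log_theta, X)
    L = np.linalg.cholesky(C)                     # C = L L^T
    alpha = np.linalg.solve(C, t)                 # C^{-1} t
    log_det = 2.0 * np.log(np.diag(L)).sum()      # log |C|
    return 0.5 * t @ alpha + 0.5 * log_det + 0.5 * len(t) * np.log(2.0 * np.pi)

def neg_log_posterior(log_theta, X, t, log_prior):
    """-log [p(t_N | X_N, theta) p(theta)]; its minimizer is theta_MP.

    log_prior is a user-supplied callable (an assumption of this sketch).
    """
    return neg_log_likelihood(log_theta, X, t) - log_prior(log_theta)
```

Either function can be handed to a general-purpose gradient-based optimizer
over log_theta.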
2 Predictive approaches for choosing hyperparameters 
Geisser (1975) proposed the Predictive Sample Reuse (PSR) methodology, which can be
applied to both model selection and parameter estimation problems. The basic
idea is to define a partition scheme $P(N, n, \Gamma)$ such that
$P^{(i)}_{N-n} = (Z^{(i)}_{N-n}; Z^{(i)}_n)$ is the $i$th partition belonging to a set $\Gamma$ of
partitions, with $Z^{(i)}_{N-n}$, $Z^{(i)}_n$ representing the
$N - n$ retained and $n$ omitted data sets respectively. Then, the unknown $\theta$ is
estimated (or a model $M_j$ is chosen among a set of models indexed by $j = 1, \ldots, J$)
by optimizing a predictive measure that measures the predictive performance
on the omitted observations $X^{(i)}_n$ by using the retained observations $Z^{(i)}_{N-n}$,
averaged over the partitions ($i \in \Gamma$). In the special case of $n = 1$, we have the
leave one out strategy. Note that this approach was independently presented in the
name of cross-validation (CV) by Stone (1974). The well known examples are the
standard CV error and the negative of average predictive likelihood. Geisser and Eddy
(1979) proposed to maximize $\prod_{i=1}^{N} p(t(i) | x(i), Z_N^{(i)}, M_j)$ (known as Geisser's
surrogate Predictive Probability (GPP)) by synthesizing Bayesian and PSR methodology
in the context of (parametrized) model selection. Here, we propose to maximize
$\prod_{i=1}^{N} p(t(i) | x(i), Z_N^{(i)}, \theta)$ to estimate $\theta$, where $Z_N^{(i)}$ is obtained from $Z_N$ by removing
the $i$th sample. Note that $p(t(i) | x(i), Z_N^{(i)}, \theta)$ is nothing but the predictive
distribution $p(y(i) | x(i), Z_N^{(i)}, \theta)$ evaluated at $y(i) = t(i)$. Also, we introduce the notion of
Geisser's Predictive mean square Error (GPE), defined as $\frac{1}{N} \sum_{i=1}^{N} E\left((y(i) - t(i))^2\right)$
(where the expectation operation is defined with respect to $p(y(i) | x(i), Z_N^{(i)}, \theta)$), and
propose to estimate $\theta$ by minimizing GPE.
2.1 Expressions for GPP and its gradient 
The objective function corresponding to GPP is given by

$$G(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p(t(i) | x(i), Z_N^{(i)}, \theta) \qquad (6)$$

From (3) and (4) we get

$$G(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \frac{(t(i) - \hat{y}(i))^2}{\sigma^2_{y(i)}} + \frac{1}{2N} \sum_{i=1}^{N} \log \sigma^2_{y(i)} + \frac{1}{2} \log 2\pi \qquad (7)$$

where $\hat{y}(i) = [c_i^{(i)}]^T [C_N^{(i)}]^{-1} t_N^{(i)}$ and
$\sigma^2_{y(i)} = c_{ii} - [c_i^{(i)}]^T [C_N^{(i)}]^{-1} c_i^{(i)}$. Here, $C_N^{(i)}$
is an $(N-1) \times (N-1)$ matrix obtained from $C_N$ by removing the $i$th column and
$i$th row. Similarly, $t_N^{(i)}$ and $c_i^{(i)}$ are obtained from $t_N$ and $c_i$ (i.e., the $i$th column of
$C_N$) respectively by removing the $i$th element. Then, $G(\theta)$ and its gradient can be
computed efficiently using the following result.
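Theorem 1 The objective function $G(\theta)$ under the Gaussian Process model is given
by

$$G(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \frac{q_N^2(i)}{\bar{c}_{ii}} - \frac{1}{2N} \sum_{i=1}^{N} \log \bar{c}_{ii} + \frac{1}{2} \log 2\pi \qquad (8)$$

where $\bar{c}_{ii}$ denotes the $i$th diagonal entry of $C_N^{-1}$ and $q_N(i)$ denotes the $i$th element
of $q_N = C_N^{-1} t_N$. Its gradient is given by

$$\frac{\partial G(\theta)}{\partial \theta_j} = \frac{1}{2N} \sum_{i=1}^{N} \left(1 + \frac{q_N^2(i)}{\bar{c}_{ii}}\right) \left(\frac{s_{j,i}}{\bar{c}_{ii}}\right) + \frac{1}{N} \sum_{i=1}^{N} q_N(i) \left(\frac{r_j(i)}{\bar{c}_{ii}}\right) \qquad (9)$$

where $s_{j,i} = \bar{c}_i^T \frac{\partial C_N}{\partial \theta_j} \bar{c}_i$, $r_j = -C_N^{-1} \frac{\partial C_N}{\partial \theta_j} C_N^{-1} t_N$
and $q_N = C_N^{-1} t_N$. Here, $\bar{c}_i$
denotes the $i$th column of the matrix $C_N^{-1}$.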
Thus, using (8) and (9) we can compute the GPP and its gradient. We will give 
meaningful interpretation to the different terms shortly. 
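
The result of Theorem 1 can be checked numerically. The sketch below, with
hypothetical function names, evaluates (8) and (9) from a single inverse of C_N
and compares the value with N explicit leave-one-out predictions as in (7).
The list dC_list of derivative matrices is an input the caller is assumed to supply.

```python
import numpy as np

def gpp_and_grad(C, t, dC_list):
    """GPP objective (8) and its gradient (9).

    dC_list holds the matrices dC_N/dtheta_j, one per hyperparameter.
    """
    N = len(t)
    C_inv = np.linalg.inv(C)
    q = C_inv @ t                                # q_N = C_N^{-1} t_N
    c_bar = np.diag(C_inv)                       # diagonal entries of C_N^{-1}
    G = ((q ** 2 / c_bar).sum() - np.log(c_bar).sum()) / (2 * N) \
        + 0.5 * np.log(2 * np.pi)
    grad = []
    for dC in dC_list:
        s = (C_inv * (dC @ C_inv)).sum(axis=0)   # s_{j,i} for all i at once
        r = -C_inv @ (dC @ q)                    # r_j
        grad.append(((1 + q ** 2 / c_bar) * s / c_bar).sum() / (2 * N)
                    + (q * r / c_bar).sum() / N)
    return G, np.array(grad)

def gpp_brute_force(C, t):
    """GPP objective via (7), using N explicit leave-one-out predictions."""
    N = len(t)
    total = 0.0
    for i in range(N):
        keep = np.arange(N) != i
        C_i = C[np.ix_(keep, keep)]              # C_N^{(i)}
        c_i = C[keep, i]                         # c_i^{(i)}
        y_hat = c_i @ np.linalg.solve(C_i, t[keep])
        var = C[i, i] - c_i @ np.linalg.solve(C_i, c_i)
        total += (t[i] - y_hat) ** 2 / var + np.log(var)
    return total / (2 * N) + 0.5 * np.log(2 * np.pi)
```

The first routine needs one inversion of an N x N matrix, whereas the brute-force
version solves N systems of size N - 1, which is why the closed form matters in practice.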
2.2 Expressions for CV function and its gradient 
We define the CV function as

$$H(\theta) = \frac{1}{N} \sum_{i=1}^{N} (t(i) - \hat{y}(i))^2 \qquad (10)$$
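where $\hat{y}(i)$ is the mean of the conditional predictive distribution as given above.
Now, using the following result we can compute $H(\theta)$ efficiently.

Theorem 2 The CV function $H(\theta)$ under the Gaussian model is given by

$$H(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{q_N(i)}{\bar{c}_{ii}}\right)^2 \qquad (11)$$

and its gradient is given by

$$\frac{\partial H(\theta)}{\partial \theta_j} = \frac{2}{N} \sum_{i=1}^{N} \left(\frac{q_N(i)\, r_j(i)}{\bar{c}_{ii}^2} + \frac{q_N^2(i)}{\bar{c}_{ii}^3}\, s_{j,i}\right) \qquad (12)$$

where $s_{j,i}$, $r_j$, $q_N(i)$ and $\bar{c}_{ii}$ are as defined in theorem 1.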
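
A short numerical sketch of Theorem 2 follows. The function names are
hypothetical, and the gradient is written as the direct derivative of (11) using
the quantities s_{j,i} and r_j of theorem 1; it should be read as an illustration
under those assumptions rather than a reference implementation.

```python
import numpy as np

def cv_and_grad(C, t, dC_list):
    """CV function (11) and its gradient for each matrix dC_N/dtheta_j."""
    C_inv = np.linalg.inv(C)
    q = C_inv @ t                                # q_N
    c_bar = np.diag(C_inv)                       # diagonal of C_N^{-1}
    H = ((q / c_bar) ** 2).mean()                # mean squared LOO residual
    grad = []
    for dC in dC_list:
        s = (C_inv * (dC @ C_inv)).sum(axis=0)   # s_{j,i}
        r = -C_inv @ (dC @ q)                    # r_j
        grad.append(2.0 * (q * r / c_bar ** 2 + q ** 2 * s / c_bar ** 3).mean())
    return H, np.array(grad)

def loo_residuals(C, t):
    """Leave-one-out residuals t(i) - y_hat(i) computed one sample at a time."""
    N = len(t)
    res = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        c_i = C[keep, i]
        res[i] = t[i] - c_i @ np.linalg.solve(C[np.ix_(keep, keep)], t[keep])
    return res
```

Comparing cv_and_grad with the mean of the squared loo_residuals on a small
problem is a quick way to confirm the identity t(i) - y_hat(i) = q_N(i) / c_bar_ii.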
2.3 Expressions for GPE and its gradient 
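The GPE function is defined as

$$G_E(\theta) = \frac{1}{N} \sum_{i=1}^{N} \int (t(i) - y(i))^2\, p(y(i) | x(i), Z_N^{(i)}, \theta)\, dy(i) \qquad (13)$$

which can be readily simplified to

$$G_E(\theta) = \frac{1}{N} \sum_{i=1}^{N} (t(i) - \hat{y}(i))^2 + \frac{1}{N} \sum_{i=1}^{N} \sigma^2_{y(i)} \qquad (14)$$

On comparing (14) with (10), we see that while the CV error minimizes the deviation
from the predictive mean, GPE takes the predictive variance also into account. Now,
the gradient can be written as

$$\frac{\partial G_E(\theta)}{\partial \theta_j} = \frac{\partial H(\theta)}{\partial \theta_j} + \frac{1}{N} \sum_{i=1}^{N} \left(\frac{1}{\bar{c}_{ii}}\right)^2 \bar{c}_i^T \frac{\partial C_N}{\partial \theta_j} \bar{c}_i \qquad (15)$$

where we have used the results $\sigma^2_{y(i)} = \frac{1}{\bar{c}_{ii}}$,
$\frac{\partial \bar{c}_{ii}}{\partial \theta_j} = e_i^T \frac{\partial C_N^{-1}}{\partial \theta_j} e_i$ and
$\frac{\partial C_N^{-1}}{\partial \theta_j} = -C_N^{-1} \frac{\partial C_N}{\partial \theta_j} C_N^{-1}$.
Here $e_i$ denotes the $i$th column vector of the identity matrix $I_N$.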
2.4 Interpretations 
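More insight can be obtained from reparametrizing the covariance function as follows.

$$C(x(i), x(j); \theta) = \sigma^2 \left(\tilde{a}_0 + \tilde{a}_1 \sum_{p=1}^{M} x_p(i) x_p(j) + \tilde{v}_0 \exp\left(-\frac{1}{2} \sum_{p=1}^{M} w_p \left(x_p(i) - x_p(j)\right)^2\right) + \delta_{i,j}\right) \qquad (16)$$

where $a_0 = \sigma^2 \tilde{a}_0$, $a_1 = \sigma^2 \tilde{a}_1$, $v_0 = \sigma^2 \tilde{v}_0$. Let us define
$P(x(i), x(j); \theta) = \frac{1}{\sigma^2} C(x(i), x(j); \theta)$. Then $P_N^{-1} = \sigma^2 C_N^{-1}$. Therefore,
$\bar{c}_{i,j} = \frac{\bar{p}_{i,j}}{\sigma^2}$, where $\bar{c}_{i,j}$, $\bar{p}_{i,j}$
denote the $(i,j)$th element of the matrices $C_N^{-1}$ and $P_N^{-1}$ respectively. From theorem
2 (see (10) and (11)) we have $t(i) - \hat{y}(i) = \frac{q_N(i)}{\bar{c}_{ii}} = \frac{\bar{q}_N(i)}{\bar{p}_{ii}}$.
Then, we can rewrite (8) as

$$G(\theta) = \frac{1}{2N\sigma^2} \sum_{i=1}^{N} \frac{\bar{q}_N^2(i)}{\bar{p}_{ii}} - \frac{1}{2N} \sum_{i=1}^{N} \log \bar{p}_{ii} + \frac{1}{2} \log 2\pi\sigma^2 \qquad (17)$$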
Here, $\bar{q}_N = P_N^{-1} t_N$ and $\bar{p}_i$, $\bar{p}_{ii}$ denote, respectively, the $i$th column and $i$th diagonal
entry of the matrix $P_N^{-1}$. Now, by setting the derivative of (17) with respect to $\sigma^2$
to zero, we can infer the noise level as

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \frac{\bar{q}_N^2(i)}{\bar{p}_{ii}} \qquad (18)$$

Similarly, the CV error (10) can be rewritten as

$$H(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{\bar{q}_N^2(i)}{\bar{p}_{ii}^2} \qquad (19)$$

Note that $H(\theta)$ is dependent only on the ratio of the hyperparameters (i.e., on
$\tilde{a}_0$, $\tilde{a}_1$, $\tilde{v}_0$) apart from the ARD parameters. Therefore, we cannot infer the noise
level uniquely. However, we can estimate the ARD parameters and the ratios
$\tilde{a}_0$, $\tilde{a}_1$, $\tilde{v}_0$. Once we have estimated these parameters, we can use (18) to
estimate the noise level. Next, we note that the noise level preferred by the GPE
criterion is zero. To see this, first let us rewrite (14) under reparametrization as

$$G_E(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{\bar{q}_N^2(i)}{\bar{p}_{ii}^2} + \frac{\sigma^2}{N} \sum_{i=1}^{N} \frac{1}{\bar{p}_{ii}} \qquad (20)$$

Since $\bar{q}_N(i)$ and $\bar{p}_{ii}$ are independent of $\sigma^2$, it follows that the GPE prefers zero as
the noise level, even though the true noise level is not zero. Therefore, this approach
can be applied when either the noise level is known or a good estimate of it is available.
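
The interpretation above can be illustrated numerically. The sketch below,
with hypothetical function names, takes the normalized covariance matrix
P_N = C_N / sigma^2 as given and evaluates the noise estimate (18), the
reparametrized GPP objective (17) and the CV error (19).

```python
import numpy as np

def _q_and_p(P, t):
    """Return q_bar = P_N^{-1} t_N and the diagonal p_bar of P_N^{-1}."""
    P_inv = np.linalg.inv(P)
    return P_inv @ t, np.diag(P_inv)

def noise_estimate(P, t):
    """Noise level (18) implied by the GPP criterion."""
    q_bar, p_bar = _q_and_p(P, t)
    return (q_bar ** 2 / p_bar).mean()

def gpp_reparam(P, t, noise_var):
    """GPP objective in the reparametrized form (17)."""
    N = len(t)
    q_bar, p_bar = _q_and_p(P, t)
    return ((q_bar ** 2 / p_bar).sum() / noise_var - np.log(p_bar).sum()) / (2 * N) \
        + 0.5 * np.log(2 * np.pi * noise_var)

def cv_reparam(P, t):
    """CV error (19); note that noise_var does not appear."""
    q_bar, p_bar = _q_and_p(P, t)
    return ((q_bar / p_bar) ** 2).mean()
```

For a fixed P_N, gpp_reparam is minimized over noise_var at the value returned
by noise_estimate, while cv_reparam is unchanged when C_N is rescaled, which is
why the CV error alone cannot determine the noise level.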
3 Simulation results 
We carried out simulation on four data sets. We considered MacKay's robot arm 
problem and its modified version introduced by Neal (1996). We used the same 
data set as MacKay (2-inputs and 2-outputs), with 200 examples in the training 
set and 200 in the test set. This data set is referred to as 'data set 1' in Table 
1. Next, to evaluate the ability of the predictive approaches in estimating the 
ARD parameters, we carried out simulation on the robot arm data with 6 inputs 
(Neal's version), denoted as 'data set 2' in Table 1. This data set was generated by 
adding four further inputs, two of which were copies of the two inputs corrupted 
by additive zero mean Gaussian noise of standard deviation 0.02 and two further 
irrelevant Gaussian noise inputs with zero mean and unit variance (Williams and 
Rasmussen (1996)). The performance measures chosen were average of Test Set 
Error (normalized by true noise level of 0.0025) and average of negative logarithm of 
predictive probability (NLPP) (computed from Gaussian density function with (3) 
and (4)). Friedman's (1991) data sets 1 and 2 were based on the problem of predicting
impedance and phase respectively from four parameters of an electrical circuit.
Training sets of three different sizes (50, 100, 200) and with a signal-to-noise ratio
of about 3:1 were replicated 100 times and, for each training set (at each sample
size N), the scaled integral squared error
(ISE = $\int_D (y(x) - \hat{y}(x))^2\, dx \,/\, \mathrm{var}_D\, y(x)$) and NLPP were
computed using 5000 data points randomly generated from a uniform distribution
over D (Friedman (1991)). In the case of GPE (denoted as $G_E$ in the tables), we
used a noise level estimate generated from a Gaussian distribution with mean $NL_T$
(the true noise level) and standard deviation $0.03\,NL_T$. In the case of CV, we estimated
the hyperparameters in the reparametrized form and estimated the noise level using
(18). In the case of MAP (denoted as MP in the tables), we used the same prior
given in Rasmussen (1996). The GPP approach is denoted as $G_P$ in the tables. For
all these methods, the conjugate gradient (CG) algorithm (Rasmussen (1996)) was used
to optimize the hyperparameters. The termination criterion (relative function error)
with a tolerance of $10^{-7}$ was used, but with a constraint on the maximum number
of CG iterations set to 100. In the case of the robot arm data sets, the algorithm
was run with ten different initial conditions and the best solution (chosen from
the respective best objective function value) is reported. The optimization was carried
out separately for the two outputs and the results reported are the average TSE and
NLPP. In the case of Friedman's data sets, the optimization algorithm was run
with three different initial conditions and the best solution was picked. When
N = 200, the optimization algorithm was run with only one initial condition. For
all the data sets, both the inputs and outputs were normalized to zero mean and
unit variance.

Table 1: Results on robot arm data sets. Average of normalized test set error (TSE)
and negative logarithm of predictive probability (NLPP) for various methods.

         Data Set: 1          Data Set: 2
         TSE      NLPP        TSE      NLPP
ML       1.126    -1.512      1.131    -1.512
MP       1.131    -1.511      1.181    -1.489
G_P      1.115    -1.524      1.116    -1.516
CV       1.112    -1.518      1.146    -1.514
G_E      1.111    -1.524      1.112    -1.524

Table 2: Results on Friedman's data sets. Average of scaled integral squared error
and negative logarithm of predictive probability (given in brackets) for different
training sample sizes and various methods.

         Data Set: 1                              Data Set: 2
         N = 50       N = 100      N = 200        N = 50       N = 100      N = 200
ML       0.43(7.24)   0.19(6.71)   0.10(6.49)     0.26(1.05)   0.16(0.82)   0.11(0.68)
MP       0.42(7.18)   0.22(6.78)   0.12(6.56)     0.25(1.01)   0.16(0.82)   0.11(0.69)
G_P      0.47(7.29)   0.20(6.65)   0.10(6.44)     0.33(1.25)   0.20(0.86)   0.12(0.70)
CV       0.55(7.27)   0.22(6.67)   0.10(6.44)     0.42(1.36)   0.21(0.91)   0.13(0.70)
G_E      0.35(7.10)   0.15(6.60)   0.08(6.37)     0.28(1.20)   0.18(0.85)   0.12(0.63)
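
The performance measures used in this section can be written down compactly.
The sketch below is one reading of them and rests on assumptions: the test set
error is taken to be the mean squared error divided by the noise level, and the
scaled ISE is approximated by a Monte Carlo average over points drawn uniformly
from D. All function names are hypothetical.

```python
import numpy as np

def nlpp(t, mean, var):
    """Average negative log predictive probability under Gaussian predictions
    with the given predictive means and variances."""
    return np.mean(0.5 * np.log(2 * np.pi * var) + 0.5 * (t - mean) ** 2 / var)

def normalized_tse(t, mean, noise_level):
    """Mean squared test error divided by the true noise level (assumption)."""
    return np.mean((t - mean) ** 2) / noise_level

def scaled_ise(y_true, y_pred):
    """Monte Carlo estimate of the scaled ISE from uniformly sampled points."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```

Under this reading, a normalized TSE close to 1 means that the test error is
close to the noise floor, and a scaled ISE of 1 corresponds to predicting the
mean of the target function everywhere.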
From Table 1, we see that the performances (both TSE and NLPP) of the predic- 
tive approaches are better than ML and MAP approaches for both the data sets. 
In the case of data set 2, we observed that like ML and MAP methods, all the 
predictive approaches rightly identified the irrelevant inputs. The performance of 
GPE approach is the best on the robot arm data and demonstrates the usefulness 
of this approach when a good noise level estimate is available. In the case of Friedman's
data set 1 (see Table 2), the important observation is that the performances
(both ISE and NLPP) of GPP, CV approaches are relatively poor at low sample size 
(N = 50) and improve very well as N increases. Note that the performances of the 
predictive approaches are better compared to the ML and MAP methods starting 
from N = 100 onwards (see NLPP). Again, GPE gives the best performance and 
the performance at low sample size (N = 50) is also quite good. In the case of 
Friedman's data set 2, the ML and MAP approaches perform better compared to 
the predictive approaches except GPE. The performances of GPP and CV improve 
as N increases and are very close to the ML and MAP methods when N = 200.
Next, it is clear that the MAP method gives the best performance at low sample 
size. This behavior, we believe, is because the prior plays an important role and 
hence is very useful. Also, note that unlike data set 1, the performance of GPE is 
inferior to ML and MAP approaches at low sample sizes and improves over these 
approaches (see NLPP) as N increases. This suggests that the knowledge of the 
noise level alone is not the only issue. The basic issue we think is that the predictive 
approaches estimate the predictive performance of a given model from the training 
samples. Clearly, the quality of the estimate will become better as N increases. 
Also, knowing the noise level improves the quality of the estimate. 
4 Discussion 
Simulation results indicate that the size N required to get good estimates of predictive
performance depends on the problem. When N is sufficiently large, we
find that the predictive approaches perform better than the ML and MAP approaches.
The sufficient number of samples can be as low as 100, as is evident from our results
on Friedman's data set 1. Also, the MAP approach is the best when N is very low.
As one would expect, the performances of the ML and MAP approaches are nearly
the same as N increases. The comparison with the existing approaches indicates that
the predictive approaches developed here are strongly competitive. The overall cost
for computing the function and the gradient (for all three predictive approaches)
is $O(MN^3)$. The cost for making a prediction is the same as the one required for the ML
and MAP methods. The proofs of the results and detailed simulation results will
be presented in another paper (Sundararajan and Keerthi, 1999).
References 
Friedman, J.H., (1991) Multivariate Adaptive Regression Splines, Ann. of Stat., 19, 1-141. 
Geisser, S., (1975) The Predictive Sample Reuse Method with Applications, Journal of 
the American Statistical Association, 70, 320-328. 
Geisser, S., and Eddy, W.F, (1979) A Predictive Approach to Model Selection, Journal 
of the American Statistical Association, 74, 153-160. 
MacKay, D.J.C. (1997) Gaussian Processes - A replacement for neural networks?, Available
in Postscript via URL http://wol.ra.phy.cam.ac.uk/mackay/.
Neal, R.M. (1996) Bayesian Learning for Neural Networks, New York: Springer-Verlag.
Neal, R.M. (1997) Monte Carlo Implementation of Gaussian Process Models for Bayesian 
Regression and Classification. Tech. Rep. No. 9702, Dept. of Statistics, University of 
Toronto. 
Rasmussen, C. (1996) Evaluation of Gaussian Processes and other Methods for Non-Linear
Regression, Ph.D. Thesis, Dept. of Computer Science, University of Toronto.
Stone, M. (1974) Cross-Validatory Choice and Assessment of Statistical Predictions (with 
discussion), Journal of Royal Statistical Society, ser. B, 36, 111-147. 
Sundararajan, S., and Keerthi, S.S. (1999) Predictive Approaches for Choosing
Hyperparameters in Gaussian Processes, submitted to Neural Computation, available at:
http://guppy.mpe.nus.edu.sg/~mpessk/gp/gp.html.
Williams, C.K.I., and Rasmussen, C.E. (1996) Gaussian Processes for Regression. In 
Advances in Neural Information Processing Systems 8, ed. by D.S.Touretzky, M.C.Mozer, 
and M.E.Hasselmo. MIT Press. 
