Hyperparameters, Evidence and 
Generalisation for an Unrealisable Rule 
Glenn Marion and David Saad 
glen@ph.ed.ac.uk, D.Saad@ed.ac.uk 
Department of Physics, University of Edinburgh, 
Edinburgh, EH9 3JZ, U.K. 
Abstract 
Using a statistical mechanical formalism we calculate the evidence, 
generalisation error and consistency measure for a linear perceptron 
trained and tested on a set of examples generated by a non-linear 
linear teacher. The teacher is said to be unrealisable because the 
student can never model it without error. Our model allows us to 
interpolate between the known case of a linear teacher and an unrealisable, 
non-linear teacher. A comparison of the hyperparameters 
which maximise the evidence with those that optimise the perfor- 
mance measures reveals that, in the non-linear case, the evidence 
procedure is a misleading guide to optimising performance. Finally, 
we explore the extent to which the evidence procedure is unreliable 
and find that, despite being sub-optimal, in some circumstances it 
might be a useful method for fixing the hyperparameters. 
1 INTRODUCTION 
The analysis of supervised learning or learning from examples is a major field of 
research within neural networks. In general, we have a probabilistic teacher¹ which 
maps an N-dimensional input vector x to output y_t(x) according to some distribution 
P(y_t | x). We are supplied with a data set D = {(y_t(x^μ), x^μ) : μ = 1…p} 
generated from P(y_t | x) by independently sampling the input distribution, P(x), 
p times. One attempts to optimise a model mapping (a student), parameterised by 
¹This accommodates teachers with deterministic output corrupted by noise. 
some vector w, with respect to the underlying teacher. The training error E_w(D) 
is some measure of the difference between the student and teacher outputs over 
the set D. Simply minimising the training error leads to the problem of over-fitting. 
In order to make successful predictions outwith the set D it is essential to have 
some prior preference for particular rules. Occam's razor is an expression of our 
preference for the simplest rules which account for the data. Clearly E_w(D) is an 
unsatisfactory performance measure since it is limited to the training examples. 
Very often we are interested in the student's ability to model a random example 
drawn from P(y_t | x)P(x) but not necessarily in the training set; one measure 
of this performance is the generalisation error. It is also desirable to predict, or 
estimate, the level of this error. The teacher is said to be an unrealisable rule, for 
the student in question, if the minimum generalisation error is non-zero. 
One can consider the supervised learning paradigm within the context of Bayesian 
inference. In particular, MacKay [MacKay 92(a)] advocates the evidence procedure 
as a 'principled' method which, in some situations, does seem to improve performance 
[Thodberg 94]. However, in others, as MacKay points out, the evidence 
procedure can be misleading [MacKay 92(b)]. 
In this paper we do not seek to comment on the validity of the evidence procedure 
as an approximation to hierarchical Bayes (see for example [Wolpert and Strauss 
94]). Rather, we ask: which performance measures do we seek to optimise, and under 
what conditions will the evidence procedure optimise them? Theoretical results 
have been obtained for a linear perceptron trained on data produced by a linear 
perceptron [Bruce and Saad 94]. They suggest that the evidence procedure is a 
useful guide to optimising the learning algorithm's performance. 
In what follows we examine the evidence procedure for the case of a linear perceptron 
learning a non-linear teacher. In the next section we review the Bayesian scheme 
and define the evidence and the relevant performance measures. In section 3 we 
introduce our student and teacher and discuss the calculation. Finally, in section 4 
we examine the extent to which the evidence procedure optimises performance. 
2 BAYESIAN FORMALISM 
2.1 THE EVIDENCE 
If we take E(D) to be the usual sum squared error and assume that our data is 
corrupted by Gaussian noise with variance 1/2/ then the probability, or likelihood, 
of the data(D) being produced given the model w and/ is P(D I/, w) o e -r"(v). 
In order to incorporate Occams Razor we also assume a prior distribution on the 
teacher rules, that is, we believe a priori in some rules more strongly than others. 
Specifically we believe that P(w I 7) o e -*c(). Multiplying the likelihood by 
the prior we obtain the post training or student distribution a P(w I D,7,/) o 
e -r"(v)-'rc(w). It is clear that the most probable model w* is given by minimising 
the composite cost function/E,(D)+7C(w ) with respect to the weights (w). This 
formalises the trade off between fitting the data and minimising student complexity. 
In this sense the Bayesian viewpoint coincides with the usual backprop standpoint. 
Zlntegrating this over/ and 7 gives us the posterior P(w I D). 
Hyperparameters, Evidence and Generalisation for an Unrealisable Rule 257 
In fact, it should be noted that stochastic minimisation can also give rise to the 
same post-training distribution [Seung et al 92]. The parameters β and γ are known 
as the hyperparameters. Here we consider C(w) = w·w, in which case γ is termed 
the weight decay. 
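For a linear student the minimiser of the composite cost βE_w(D) + γC(w) has a closed form. The sketch below is our own illustration (not taken from the paper): it takes E_w(D) as the plain sum of squared errors of a linear model y_s = w·x and C(w) = w·w, in which case the most probable weights reduce to ridge regression.

```python
import numpy as np

def map_weights(X, y, beta, gamma):
    """Most probable weights w* minimising beta*E_w(D) + gamma*C(w),
    with E_w(D) = sum_mu (y_mu - w.x_mu)^2 and C(w) = w.w.
    Setting the gradient to zero gives ridge regression with
    effective weight decay lambda = gamma/beta."""
    N = X.shape[1]
    lam = gamma / beta
    return np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
```

Note that only the ratio γ/β enters w*: rescaling both hyperparameters together leaves the most probable student unchanged.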
The evidence is the normalisation constant in the above expression for the post-training 
distribution: that is, the probability of the data set D given the hyperparameters, 
P(D | γ, β). The evidence procedure fixes the hyperparameters to the values 
that maximise this probability. 
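With a Gaussian likelihood and prior the evidence is a Gaussian integral over w and can be written down exactly. The following is our own sketch, assuming the normalisations P(D | β, w) = (β/π)^{p/2} e^{−βE_w(D)} and P(w | γ) = (γ/π)^{N/2} e^{−γw·w} for a linear student y_s = w·x:

```python
import numpy as np

def log_evidence(X, y, beta, gamma):
    """ln P(D | gamma, beta) for a linear student: the Gaussian integral
    over w of likelihood * prior, evaluated in closed form by completing
    the square in the exponent."""
    p, N = X.shape
    A = beta * X.T @ X + gamma * np.eye(N)       # curvature of the exponent
    w_star = beta * np.linalg.solve(A, X.T @ y)  # most probable weights
    E = np.sum((y - X @ w_star) ** 2)            # training error at w*
    return (0.5 * p * np.log(beta / np.pi)
            + 0.5 * N * np.log(gamma)
            - 0.5 * np.linalg.slogdet(A)[1]
            - beta * E
            - gamma * w_star @ w_star)
```

The evidence procedure then amounts to maximising this function over (β, γ), for example on a grid.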
2.2 THE PERFORMANCE MEASURES 
Many performance measures have been introduced in the literature (see e.g. [Krogh 
and Hertz 92] and [Seung et al 92]). Here, we consider the squared difference between 
the average (over the post-training distribution) of the student output, ⟨y_s(x)⟩_w, 
and that of the teacher, y_t(x), averaged over all possible test questions and teacher 
outputs, P(y_t, x), and finally over all possible sets of data, D: 

    ε_g = ⟨(y_t(x) − ⟨y_s(x)⟩_w)²⟩_{P(x,y_t), D} 

This is equivalent to the generalisation error given by Krogh and Hertz. 
Another factor we can consider is the variance of the output over the student distribution, 
⟨(y_s(x) − ⟨y_s(x)⟩_w)²⟩_{w,P(x)}. This gives us a measure of the confidence 
we should have in our post-training distribution, and could possibly be calculated if 
we could estimate the input distribution P(x). Here we extend Bruce and Saad's 
definition [Bruce and Saad 94] of the consistency measure δ_c to include unrealisable 
rules by adding the asymptotic error ε_∞ = lim_{p→∞} ε_g: 

    δ_c = ⟨(y_s(x) − ⟨y_s(x)⟩_w)²⟩_{w,P(x),D} − ε_g + ε_∞ 

We regard δ_c = 0 as optimal since then the variance over our student distribution 
is an accurate prediction of the decaying part of the generalisation error. 
We can consider both these performance measures as objective functions measuring 
the student's ability to mimic the underlying teacher. Clearly, they can only be 
calculated in theory and, perhaps, estimated in practice. In contrast, the evidence 
is only a function of our assumptions and the data and the evidence procedure is, 
therefore, a practical method of setting the hyperparameters. 
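In practice one could estimate such measures by Monte Carlo on a held-out test set. A sketch under our own conventions (linear student y_s = w·x, Gaussian posterior P(w | D, γ, β) ∝ e^{−βE_w(D) − γw·w}); the function names are illustrative, not the paper's:

```python
import numpy as np

def performance_estimates(X, y, X_test, y_test, beta, gamma):
    """Estimate the generalisation error eps_g = <(y_t - <y_s>_w)^2> on a
    test set, together with the mean posterior variance of the student
    output, whose comparison with eps_g underlies the consistency measure."""
    p, N = X.shape
    A = beta * X.T @ X + gamma * np.eye(N)       # posterior precision (up to 2)
    w_bar = beta * np.linalg.solve(A, X.T @ y)   # posterior mean <w>
    cov = np.linalg.inv(2 * A)                   # posterior covariance of w
    eps_g = np.mean((y_test - X_test @ w_bar) ** 2)
    out_var = np.einsum('ij,jk,ik->i', X_test, cov, X_test)  # Var[y_s(x)] per x
    return eps_g, out_var.mean()
```

Comparing the second return value with the decaying part of the first is the empirical analogue of checking δ_c = 0.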
3 THE MODEL 
In our model the student is simply a linear perceptron. The output for an input 
vector x^μ is given by y_s = w·x^μ/√N. The examples, against which the student 
is trained and tested, are produced by sampling the input distribution, P(x), and 
then generating outputs from the distribution 

    P(y_t | x) = Σ_{n=1}^{n̄} P(y_t | x, n) P(n) P(x | n) / [Σ_{m=1}^{n̄} P(m) P(x | m)] 
Figure 1: A 2-teacher in 1D. The average output ⟨y_t⟩(x): (i) for D_w = 0, (ii) for 
D_w > 0 (σ_1 = σ_2) and (iii) with D_w > 0 (σ_1 ≠ σ_2). 
where P(y_t | x, n) ∝ exp(−[y_t − w_t^n·x]²/(2σ_n²)), P(x | n) is N(a_n, σ_0²)³ and P(n) 
is chosen such that Σ_{n=1}^{n̄} P_n = 1. Thus, each component in the sum is a linear 
perceptron, whose output is corrupted by Gaussian noise of variance σ_n², and we 
refer to this teacher as an n̄-teacher. 
In what follows, for simplicity, we consider a two-teacher (n̄ = 2) with a_n = 0. The 
parameter D_w = |w_t^1 − w_t^2|²/N and the input distribution determine the form of the 
teacher. This is shown in Figure 1, which displays the average output of a 2-teacher 
with a one-dimensional input vector. For σ_1 = σ_2, D_w controls the variance about 
a linear mean output, and for fixed σ_1 ≠ σ_2, D_w controls the non-linearity of the 
teacher. In the latter case, in the large-N limit the variance of P(y_t | x) is zero. 
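As an illustration, data from such a two-teacher with a_n = 0 (so both components share the input distribution) can be sampled as follows. This is our own sketch: the parameter values and the 1/√N output scaling are assumptions, the latter chosen to match the student's convention.

```python
import numpy as np

def sample_two_teacher(p, w1, w2, sigmas=(0.5, 1.5), sigma0=1.0,
                       probs=(0.5, 0.5), rng=None):
    """Draw p examples from a 2-teacher with a_n = 0: each example picks a
    component n with probability P_n, draws x ~ N(0, sigma0^2 I) and emits
    y_t = w_n.x/sqrt(N) plus Gaussian noise of standard deviation sigma_n."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = len(w1)
    W = np.stack([w1, w2])
    X = rng.normal(0.0, sigma0, size=(p, N))
    n = rng.choice(2, size=p, p=probs)               # component label per example
    noise = rng.normal(0.0, 1.0, size=p) * np.asarray(sigmas)[n]
    y = (X * W[n]).sum(axis=1) / np.sqrt(N) + noise
    return X, y, n
```

Setting the two noise levels equal while keeping |w_t^1 − w_t^2| > 0 gives the linear regime with extra effective noise; unequal noise levels give the non-linear regime.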
We can now explicitly write the evidence and perform the integration over the 
student parameters (the weights w). Taking the logarithm of the resulting expression 
leads to ln P(D | γ, β) = −N f(D), where f is analogous to a free energy in 
statistical physics. Here we use the convention that summations are implied where 
repeated indices occur. 
aWhere N(:, a 2) denotes a normal distribution with mean : and variance 
The performance measures for this model can be written in terms of the posterior 
mean student ⟨w⟩ and an effective noise level σ_eff², which combines the real noise 
on the examples with the effective noise due to the mismatch between student and 
teacher. 
In order to pursue the calculation we consider the average of f(D) over all possible 
data sets, just as, earlier, we defined our performance measures as averages over all 
data sets. This is somewhat artificial, as we would normally be able to calculate 
f(D) and be interested in the generalisation error for our learning algorithm given a 
particular instance of the data. However, here we consider the thermodynamic limit 
(i.e., N, p → ∞ s.t. α = p/N = const.) in which, due to our sampling assumptions, 
the behaviour for typical examples of D coincides with that of the average. Details 
of the calculation will be published elsewhere [Marion and Saad 95]. 
4 RESULTS AND DISCUSSION 
We can now examine the evidence and the performance measures for our unlearnable 
problem. We note that in two limits we recover the learnable, linear-teacher case: 
specifically, if the probability of picking one of the component teachers is zero, or if 
both component teacher vectors are aligned. In what follows we set P_1 = P_2 and 
normalise the components of the teacher such that |w_t^n| = 1. 
Firstly let us consider the performance measures. The asymptotic value of both ε_g 
and |δ_c| for large α is determined by the effective noise level σ_eff². This is the 
minimum generalisation error attainable and reflects the effective noise level due to 
the mismatch between student and teacher. 
We note here that the generalisation error is a function of λ = γ/β, rather than β 
and γ independently. Figure 2a shows the generalisation error plotted against α. The 
addition of unlearnability (D_w > 0) has a similar effect to the addition of noise on 
the examples. The appearance of the hump can be easily understood: if there is no 
noise, or λ is large enough, then there is a steady reduction in ε_g. However, if this is 
not so, then for small α the student learns this effective noise and the generalisation 
error increases with α. As the student gets more examples the effects of the noise 
begin to average out and the student starts to learn the rule. The point at which the 
generalisation error starts to decrease is influenced by the effective noise level and 
the prior constraint. Figure 2b shows the absolute value of the consistency measure 
vs α for non-optimal λ. Again we see that unlearnability acts as an effective noise. 
For a few examples with λ small, or with large effective noise, the student distribution 
is narrowed until its variance is zero. However, the generalisation error is still increasing 
(as described above) and |δ_c| increases to a local maximum; it then asymptotically 
tends to ε_∞. If there is no noise, or λ is large enough, then |δ_c| steadily reduces as 
the number of examples increases. 
We now examine the evidence procedure. Firstly we define β_ev(γ) and γ_ev(β) to 
be the hyperparameters which maximise the evidence. The evidence procedure 
Figure 2: The performance measures ((a) generalisation error; (b) consistency 
measure). Graph (a) shows ε_g for finite λ: a(i) and a(ii) are the learnable case, 
with noise in the latter case; a(iii) shows that the effect of adding unlearnability 
is qualitatively the same as adding noise. Graph (b) shows the modulus of the 
consistency error vs α: curves b(i) and b(ii) are the learnable case without and 
with noise respectively; curve b(iii) is an unlearnable case with the same noise level. 
picks the point in hyperparameter space where these curves coincide. We denote 
the asymptotic values of β_ev(γ) and γ_ev(β) in the limit of large α by β_∞ and γ_∞ 
respectively. 
In the linear case (D_w = 0) the evidence procedure assignments of the hyperparameters 
(for finite α) coincide with β_∞ and γ_∞ and also optimise ε_g and δ_c, in 
agreement with [Bruce and Saad 94]. This is shown in Figure 3a where we plot 
the β which optimises the evidence (β_ev), the consistency measure (β_c) and the 
generalisation error (β_g) versus γ. The point at which the three curves coincide is 
the point in the β-γ plane identified by the evidence procedure. However, we note 
here that, if one of the hyperparameters is poorly determined, then maximising the 
evidence with respect to the other is a misleading guide to optimising performance, 
even in the linear case. 
The results for an unrealisable rule in the linear regime (D_w > 0, σ_1 = σ_2) are 
similar to the learnable case, but with an increased noise due to the unlearnability. 
The evidence procedure still optimises performance. 
In the non-linear regime (D_w > 0, σ_1 ≠ σ_2) the evidence procedure fails to 
minimise either performance measure. This is shown in Figure 3b, where the evidence-procedure 
point does not lie on β_g(γ) or β_c(γ). Indeed, its hyperparameter 
assignments do not coincide with β_∞ and γ_∞ but are α dependent. 
How badly does the evidence procedure fail? We define the percentage degradation 
in generalisation performance as δ_g = 100 · (ε_g(λ_ev) − ε_g^opt)/ε_g^opt, where λ_ev is 
the evidence procedure assignment and ε_g^opt is the optimal generalisation error with 
respect to λ. This is plotted in Figure 4a. We also define 
δ_g^c = 100 · |δ_c(λ_ev)|/ε_g(λ_ev). This measures the error in using the variance of the 
Figure 3: The evidence procedure: optimal β vs γ ((a) linear case; (b) non-linear 
case). In both graphs (i) is the evidence (β_ev), (ii) the generalisation error (β_g) 
and (iii) the consistency measure (β_c). The point which the evidence procedure 
picks in the linear case is that where all three curves coincide, whereas in the 
non-linear case it coincides only with the evidence curve. 
post-training distribution to estimate the generalisation error as a percentage of 
the generalisation error itself. Examples of this quantity are plotted in Figure 4b. 
There are three important points to note concerning δ_g and δ_g^c. Firstly, the larger 
the deviation from a linear rule, the greater the error. Secondly, it is the 
magnitude of the effective noise due to unlearnability, relative to the real noise, which 
determines this error. In other words, if the real noise is large enough to swamp the 
non-linearity of the rule then the evidence procedure will not be very misleading. 
Finally, the magnitude of the error for relatively large deviations from linearity is 
only a few percent, and thus the evidence procedure might well be a reasonable, if 
not optimal, method for setting the hyperparameters. However, clearly it would be 
preferable to improve our student space to enable it to model the teacher. 
5 CONCLUSION 
We have examined the generalisation error, the consistency measure and the evi- 
dence procedure within a model which allows us to interpolate between a learnable 
and an unlearnable scenario. We have seen that the unlearnability acts like an ef- 
fective noise on the examples. Furthermore, we have seen that for a linear student 
the evidence procedure breaks down, in that it fails to optimise performance, when 
the teacher output is non-linear. However, even for relatively large deviations of 
the teacher from linearity the evidence procedure is close to optimal. 
Bayesian methods, such as the evidence procedure, are based on the assumption 
that the student or hypothesis space contains the teacher generating the data. In 
our case, in the non-linear regime, this is clearly not true and so it is perhaps 
not surprising that the evidence procedure is sub-optimal. Whether or not such a 
breakdown of the evidence procedure is a generic feature of a mismatch between 
the hypothesis space and the teacher is a matter for further study. 
Figure 4: The relative degradation in performance, compared to the optimal, when 
using the evidence procedure to set the hyperparameters. Graph (a) shows the 
percentage degradation in generalisation performance δ_g: a(i) has D_w = 1 with the 
real noise level σ = 1; a(ii) has this noise level reduced to σ = 0.1; and a(iii) has 
increased non-linearity, D_w = 3, and σ = 1. Graph (b) shows the error made in 
predicting the generalisation error from the variance of the post-training distribution, 
as a percentage of the generalisation error itself, δ_g^c: b(i) and b(ii) have the same 
parameter values as a(i) and a(ii), whilst b(iii) has D_w = 3 and σ = 0.1. 
Acknowledgments 
We are very grateful to Alastair Bruce and Peter Sollich for useful discussions. GM 
is supported by an E.P.S.R.C. studentship. 
References 
Bruce, A.D. and Saad, D. (1994) Statistical mechanics of hypothesis evaluation. 
J. of Phys. A: Math. Gen. 27:3355-3363 
Krogh, A. and Hertz, J. (1992) Generalisation in a linear perceptron in the 
presence of noise. J. of Phys. A: Math. Gen. 25:1135-1147 
MacKay, D.J.C. (1992a) Bayesian interpolation. Neural Comp. 4:415-447 
MacKay, D.J.C. (1992b) A practical Bayesian framework for backprop networks. 
Neural Comp. 4:448-472 
Marion, G. and Saad, D. (1995) A statistical mechanical analysis of a Bayesian 
inference scheme for an unrealisable rule. To appear in J. of Phys. A: Math. Gen. 
Seung, H. S., Sompolinsky, H. and Tishby, N. (1992) Statistical mechanics of 
learning from examples. Phys. Rev. A, 45:6056-6091 
Thodberg, H.H. (1994) Bayesian backprop in action: pruning, ensembles, error 
bars and application to spectroscopy. Advances in Neural Information Processing 
Systems 6:208-215. Cowan et al. (Eds.), Morgan Kaufmann, San Mateo, CA 
Wolpert, D. H. and Strauss, C. E. M. (1994) What Bayes has to say about 
the evidence procedure. To appear in Maximum entropy and Bayesian methods. G. 
Heidbreder (Ed.), Kluwer. 
