Online learning from finite training sets 
in nonlinear networks 
Peter Sollich* 
Department of Physics 
University of Edinburgh 
Edinburgh EH9 3JZ, U.K. 
P.Sollich@ed.ac.uk 
David Barber† 
Department of Applied Mathematics 
Aston University 
Birmingham B4 7ET, U.K. 
D.Barber@aston.ac.uk 
Abstract 
Online learning is one of the most common forms of neural net- 
work training. We present an analysis of online learning from finite 
training sets for non-linear networks (namely, soft-committee ma- 
chines), advancing the theory to more realistic learning scenarios. 
Dynamical equations are derived for an appropriate set of order 
parameters; these are exact in the limiting case of either linear 
networks or infinite training sets. Preliminary comparisons with 
simulations suggest that the theory captures some effects of finite 
training sets, but may not yet account correctly for the presence of 
local minima. 
1 INTRODUCTION 
The analysis of online gradient descent learning, as one of the most common forms 
of supervised learning, has recently stimulated a great deal of interest [1, 5, 7, 3]. In 
online learning, the weights of a network ('student') are updated immediately after 
presentation of each training example (input-output pair) in order to reduce the 
error that the network makes on that example. One of the primary goals of online 
learning analysis is to track the resulting evolution of the generalization error - the 
error that the student network makes on a novel test example, after a given number 
of example presentations. In order to specify the learning problem, the training 
outputs are assumed to be generated by a teacher network of known architecture. 
Previous studies of online learning have often imposed somewhat restrictive and 
*Royal Society Dorothy Hodgkin Research Fellow 
†Supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory for 
Neural Networks 
unrealistic assumptions about the learning framework. These restrictions are either 
that the size of the training set is infinite, or that the learning rate is small [1, 5, 4]. 
Finite training sets present a significant analytical difficulty as successive weight 
updates are correlated, giving rise to highly non-trivial generalization dynamics. 
For linear networks, the difficulties encountered with finite training sets and non- 
infinitesimal learning rates can be overcome by extending the standard set of de- 
scriptive ('order') parameters to include the effects of weight update correlations [7]. 
In the present work, we extend our analysis to nonlinear networks. The particular 
model we choose to study is the soft-committee machine, which is capable of repre- 
senting a rich variety of input-output mappings. Its online learning dynamics has 
been studied comprehensively for infinite training sets [1, 5]. In order to carry out 
our analysis, we adapt tools originally developed in the statistical mechanics liter- 
ature which have found application, for example, in the study of Hopfield network 
dynamics[2]. 
2 MODEL AND OUTLINE OF CALCULATION 
For an N-dimensional input vector x, the output of the soft committee machine is 
given by 
y = \sum_{l=1}^{L} g(h_l) \qquad (1)

where the nonlinear activation function g(h_l) = \mathrm{erf}(h_l/\sqrt{2}) acts on the 
activations h_l = w_l^T x/\sqrt{N} (the factor 1/\sqrt{N} is for convenience only). This is a 
neural network with L hidden units, input to hidden weight vectors w_l, l = 1..L, 
and all hidden to output weights set to 1. 
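To make the model concrete, equation (1) can be evaluated in a few lines of code. This is an illustrative sketch only; the function name, array shapes, and use of NumPy are our choices, not the paper's.

```python
import math
import numpy as np

def soft_committee_output(W, x):
    """Output y = sum_l g(h_l) of a soft committee machine (eq. (1)),
    with g(h) = erf(h/sqrt(2)) and activations h_l = w_l.x/sqrt(N).
    W has shape (L, N): one hidden weight vector per row."""
    N = len(x)
    h = W @ x / math.sqrt(N)  # activations h_l, one per hidden unit
    return sum(math.erf(hl / math.sqrt(2.0)) for hl in h)
```

For example, with W = 0 every activation vanishes and the output is 0; with strongly saturated units, each erf contributes ±1.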
In online learning the student weights are adapted on a sequence of presented exam- 
ples to better approximate the teacher mapping. The training examples are drawn, 
with replacement, from a finite set, \{(x^\mu, y^\mu), \mu = 1..p\}. This set remains fixed 
during training. Its size relative to the input dimension is denoted by \alpha = p/N. 
We take the input vectors x^\mu as samples from an N-dimensional Gaussian distri- 
bution with zero mean and unit covariance matrix. The training outputs y^\mu are assumed to 
be generated by a teacher soft committee machine with hidden weight vectors w_m^*, 
m = 1..M, with additive Gaussian noise corrupting its activations and output. 
The discrepancy between the teacher and student on a particular training exam- 
ple (x, y), drawn from the training set, is given by the squared difference of their 
corresponding outputs, 

E = \frac{1}{2} \Big[ \sum_l g(h_l) - y \Big]^2 = \frac{1}{2} \Big[ \sum_l g(h_l) - \sum_m g(k_m + \xi_m) - \xi_0 \Big]^2 

where the student and teacher activations are, respectively, 

h_l = \frac{1}{\sqrt{N}}\, w_l^T x, \qquad k_m = \frac{1}{\sqrt{N}}\, (w_m^*)^T x \qquad (2) 

and \xi_m, m = 1..M, and \xi_0 are noise variables corrupting the teacher activations and 
output, respectively. 
Given a training example (x, y), the student weights are updated by a gradient 
descent step with learning rate \eta, 

w_l' - w_l = -\eta \nabla_{w_l} E = -\frac{\eta}{\sqrt{N}}\, x\, \partial_{h_l} E \qquad (3) 
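The online update (3), applied to examples drawn with replacement from the fixed training set, can be sketched as follows. All names and the loop structure are our own illustration; the gradient uses \partial E/\partial h_l = g'(h_l)[\sum_k g(h_k) - y] with g(h) = erf(h/\sqrt{2}), so g'(h) = \sqrt{2/\pi}\, e^{-h^2/2}.

```python
import math
import numpy as np

def g(h):
    return math.erf(h / math.sqrt(2.0))

def g_prime(h):  # g'(h) = sqrt(2/pi) * exp(-h^2 / 2)
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * h * h)

def online_train(W, W_teacher, X, xi, xi0, eta, steps, rng):
    """Online gradient descent (eq. (3)) on a fixed training set.
    X: (p, N) inputs; xi: (p, M) teacher activation noise; xi0: (p,) output
    noise. W: (L, N) student weights, updated in place."""
    N = W.shape[1]
    for _ in range(steps):
        mu = rng.integers(len(X))            # draw an example with replacement
        x = X[mu]
        h = W @ x / math.sqrt(N)             # student activations h_l
        k = W_teacher @ x / math.sqrt(N)     # teacher activations k_m
        y = sum(g(km + xim) for km, xim in zip(k, xi[mu])) + xi0[mu]
        err = sum(g(hl) for hl in h) - y
        for l in range(len(h)):              # w_l <- w_l - (eta/sqrt(N)) x dE/dh_l
            W[l] -= (eta / math.sqrt(N)) * err * g_prime(h[l]) * x
    return W
```

In a noise-free, realizable setting the training error on the fixed set decreases as the student approaches the teacher.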
The generalization error is defined to be the average error that the student makes on 
a test example selected at random (and uncorrelated with the training set), which 
we write as \epsilon_g = \langle E \rangle. 
Although one could, in principle, model the student weight dynamics directly, this 
will typically involve too many parameters, and we seek a more compact representa- 
tion for the evolution of the generalization error. It is straightforward to show that 
the generalization error depends, not on a detailed description of all the network 
weights, but only on the overlap parameters Q_{ll'} = \frac{1}{N} w_l^T w_{l'} and R_{lm} = \frac{1}{N} w_l^T w_m^* 
[1, 5, 7]. In the case of infinite \alpha, it is possible to obtain a closed set of equations 
governing the overlap parameters Q, R [5]. For finite training sets, however, this is 
no longer possible, due to the correlations between successive weight updates[7]. 
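The dependence of \epsilon_g on the overlaps alone can be made explicit: for g(h) = erf(h/\sqrt{2}) and zero-mean Gaussian test inputs, \langle g(u)g(v) \rangle = \frac{2}{\pi} \arcsin\big( C_{uv}/\sqrt{(1+C_{uu})(1+C_{vv})} \big) for jointly Gaussian (u, v) with covariance C [5]. The sketch below is our own code, with T_{mm'} = \frac{1}{N} (w_m^*)^T w_{m'}^* the fixed teacher-teacher overlaps.

```python
import math

def gg_average(cuu, cvv, cuv):
    """<g(u) g(v)> for zero-mean Gaussian (u, v) with the given
    (co)variances, where g(h) = erf(h / sqrt(2))."""
    return (2.0 / math.pi) * math.asin(cuv / math.sqrt((1.0 + cuu) * (1.0 + cvv)))

def generalization_error(Q, R, T):
    """eps_g = 1/2 <(sum_l g(h_l) - sum_m g(k_m))^2> from the overlaps
    Q_ll' (student-student), R_lm (student-teacher), T_mm' (teacher-teacher)."""
    L, M = len(Q), len(T)
    eg = 0.0
    for l in range(L):                       # student-student terms
        for lp in range(L):
            eg += gg_average(Q[l][l], Q[lp][lp], Q[l][lp])
    for m in range(M):                       # teacher-teacher terms
        for mp in range(M):
            eg += gg_average(T[m][m], T[mp][mp], T[m][mp])
    for l in range(L):                       # student-teacher cross terms
        for m in range(M):
            eg -= 2.0 * gg_average(Q[l][l], T[m][m], R[l][m])
    return 0.5 * eg
```

For L = M = 1, perfect learning (Q = R = T = 1) gives \epsilon_g = 0, and an uncorrelated student (R = 0, Q = T = 1) gives \epsilon_g = 1/3.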
In order to overcome this difficulty, we use a technique developed originally to study 
statistical physics systems[2]. Initially, consider the dynamics of a general vector of 
order parameters, denoted by f, which are functions of the network weights w. If 
the weight updates are described by a transition probability T(w -> w'), then an 
approximate update equation for f is 
\Delta f = \Big\langle \int dw'\, T(w \to w')\, [\, f(w') - f(w) \,] \Big\rangle_{\delta(f(w) - f)} \qquad (4) 

Intuitively, the integral in the above equation expresses the average change of f 
caused by a weight update w \to w', starting from (given) initial weights w. Since 
our aim is to develop a closed set of equations for the order parameter dynamics, we 
need to remove the dependency on the initial weights w. The only information we 
have regarding w is contained in the chosen order parameters f, and we therefore 
average the result over the 'subshell' of all w which correspond to these values of 
the order parameters. This is expressed as the \delta-function constraint in equation (4). 
It is clear that if the integral in (4) depends on w only through f(w), then the 
average is unnecessary and the resulting dynamical equations are exact. This is in 
fact the case for \alpha \to \infty and f = (Q, R), the standard order parameters mentioned 
above[5]. If this cannot be achieved, one should choose a set of order parameters to 
obtain approximate equations which are as close as possible to the exact solution. 
The motivation for our choice of order parameters is based on the linear perceptron 
case where, in addition to the standard parameters Q and R, the overlaps projected 
onto eigenspaces of the training input correlation matrix A = \frac{1}{N} \sum_{\mu=1}^{p} x^\mu (x^\mu)^T are 
required². We therefore split the eigenvalues of A into F equal blocks (\gamma = 1...F) 
containing N^\gamma = N/F eigenvalues each, ordering the eigenvalues such that they 
increase with \gamma. We then define projectors P^\gamma onto the corresponding eigenspaces 
and take as order parameters: 
Q^\gamma_{ll'} = \frac{1}{N}\, w_l^T P^\gamma w_{l'}, \qquad 
R^\gamma_{lm} = \frac{1}{N}\, w_l^T P^\gamma w_m^*, \qquad 
U^\gamma_{lm} = \frac{1}{N}\, w_l^T P^\gamma b_m \qquad (5) 

where the b_m are linear combinations of the noise variables and training inputs, 

b_m = \frac{1}{\sqrt{N}} \sum_{\mu=1}^{p} \xi_m^\mu\, x^\mu \qquad (6) 
Here we assume that the system size N is large enough that the mean values of the 
parameters alone describe the dynamics sufficiently well (i.e., self-averaging holds). 
²The order parameters actually used in our calculation for the linear perceptron [7] are 
Laplace transforms of these projected order parameters. 
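In a simulation, the projected order parameters (5) can be measured directly by diagonalizing A and building the block projectors explicitly. The following sketch is our own code (it omits the noise overlaps U^\gamma) and assumes F divides N:

```python
import numpy as np

def projected_overlaps(W_student, W_teacher, X, F):
    """Projected order parameters Q^g, R^g of eq. (5): eigenvalues of the
    input correlation matrix A = (1/N) sum_mu x_mu x_mu^T are split into
    F equal blocks of increasing eigenvalue, with one projector per block."""
    N = X.shape[1]
    A = X.T @ X / N
    evals, vecs = np.linalg.eigh(A)          # eigh returns ascending eigenvalues
    block = N // F                           # N^gamma = N/F eigenvalues per block
    Qg, Rg = [], []
    for gamma in range(F):
        V = vecs[:, gamma * block:(gamma + 1) * block]
        P = V @ V.T                          # projector onto gamma-th eigenspace
        Qg.append(W_student @ P @ W_student.T / N)
        Rg.append(W_student @ P @ W_teacher.T / N)
    return Qg, Rg
```

Since the projectors resolve the identity, the blocks sum back to the unprojected overlaps: \sum_\gamma Q^\gamma = Q and \sum_\gamma R^\gamma = R.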
As F \to \infty, these order parameters become functionals of a continuous variable a. 
The updates for the order parameters (5) due to the weight updates (3) can be 
found by taking the scalar products of (3) with either projected student or teacher 
weights, as appropriate. This then introduces the following activation 'components', 

h_l^\gamma = \frac{1}{\sqrt{N}}\, w_l^T P^\gamma x, \qquad k_m^\gamma = \frac{1}{\sqrt{N}}\, (w_m^*)^T P^\gamma x \qquad (7) 

so that the student and teacher activations are h_l = \sum_\gamma h_l^\gamma and k_m = \sum_\gamma k_m^\gamma, 
respectively. For the linear perceptron, the chosen order parameters form a complete 
set: the dynamical equations close, without need for the average in (4). 
For the nonlinear case, we now sketch the calculation of the order parameter update 
equations (4). Taken together, the integral over w' (a sum of p discrete terms in 
our case, one for each training example) and the subshell average in (4), define 
an average over the activations (2), their components (7), and the noise variables 
m, 0. These variables turn out to be Gaussian distributed with zero mean, and 
therefore only their covariances need to be worked out. One finds that these are in 
fact given by the naive training set averages. For example, 
\langle h_l^\gamma k_m \rangle = \frac{1}{p} \sum_{\mu=1}^{p} h_l^{\gamma\mu} k_m^\mu = \frac{1}{\alpha N}\, w_l^T P^\gamma A\, w_m^* = \frac{a^\gamma}{\alpha}\, R^\gamma_{lm} \qquad (8) 

where we have used P^\gamma A = a^\gamma P^\gamma, with a^\gamma the eigenvalue of A in the \gamma-th 
eigenspace; this is well defined for F \to \infty (see [6] for details of the eigenvalue 
spectrum). The correlations of the activations and noise variables explicitly ap- 
pearing in the error in (3) are calculated similarly to give, 
\langle h_l h_{l'} \rangle = \frac{1}{\alpha} \sum_\gamma a^\gamma Q^\gamma_{ll'}, \qquad 
\langle h_l \xi_m \rangle = \frac{1}{\alpha} \sum_\gamma U^\gamma_{lm}, \qquad 
\langle \xi_m \xi_{m'} \rangle = \sigma^2 \delta_{mm'}, \qquad 
\langle \xi_0^2 \rangle = \sigma_0^2 \qquad (9) 

where the final equation defines the noise variances. The projected over- 
laps between teacher weight vectors are T^\gamma_{mm'} = \frac{1}{N}\, (w_m^*)^T P^\gamma w_{m'}^*. We will assume that 
the teacher weights and training inputs are uncorrelated, so that T^\gamma_{mm'} is indepen- 
dent of \gamma. The required covariances of the 'component' activations are 

\langle h_l^\gamma h_{l'}^{\gamma'} \rangle = \delta_{\gamma\gamma'}\, \frac{a^\gamma}{\alpha}\, Q^\gamma_{ll'}, \qquad 
\langle h_l^\gamma k_m^{\gamma'} \rangle = \delta_{\gamma\gamma'}\, \frac{a^\gamma}{\alpha}\, R^\gamma_{lm}, \qquad 
\langle k_m^\gamma k_{m'}^{\gamma'} \rangle = \delta_{\gamma\gamma'}\, \frac{a^\gamma}{\alpha}\, T^\gamma_{mm'}, \qquad 
\langle h_l^\gamma \xi_m \rangle = \frac{1}{\alpha}\, U^\gamma_{lm} \qquad (10) 
³Note that the limit F \to \infty is taken after the thermodynamic limit, i.e., F \ll N. This 
ensures that the number of order parameters is always negligible compared to N (otherwise 
self-averaging would break down). 
Figure 1: \epsilon_g vs t for student and teacher with one hidden unit (L = M = 1); 
\alpha = 2, 3, 4 from above, learning rate \eta = 1. Noise of equal variance was added to 
both activations and output: (a) \sigma^2 = \sigma_0^2 = 0.01, (b) \sigma^2 = \sigma_0^2 = 0.1. Simulations 
for N = 100 are shown by circles; standard errors are of the order of the symbol 
size. The bottom dashed lines show the infinite training set result for comparison. 
F = 10 was used for calculating the theoretical predictions; the curve marked "+" 
in (b), with F = 20 (and \alpha = 2), shows that this is large enough to be effectively in 
the F \to \infty limit. 
Using equation (3) and the definitions (7), we can now write down the dynamical 
equations, replacing the number of updates n by the continuous variable t = n/N 
in the limit N \to \infty: 

\frac{dQ^\gamma_{ll'}}{dt} = -\eta \left\langle h^\gamma_{l'}\, \partial_{h_l} E + h^\gamma_l\, \partial_{h_{l'}} E \right\rangle + \frac{\eta^2 a^\gamma}{\alpha F} \left\langle \partial_{h_l} E\, \partial_{h_{l'}} E \right\rangle, \qquad 
\frac{dR^\gamma_{lm}}{dt} = -\eta \left\langle k^\gamma_m\, \partial_{h_l} E \right\rangle \qquad (11) 

(with an analogous equation for U^\gamma_{lm}), where the averages are over zero mean 
Gaussian variables, with covariances (9,10). Using the explicit form of the error E, 
we have 

\partial_{h_l} E = g'(h_l) \Big[ \sum_k g(h_k) - \sum_m g(k_m + \xi_m) - \xi_0 \Big] 

which, together with the equations (11), completes the description of the dynamics. 
The Gaussian averages in (11) can be straightforwardly evaluated in a manner 
similar to the infinite training set case [5], and we omit the rather cumbersome 
explicit form of the resulting equations. 

We note that, in contrast to the infinite training set case, the student activations 
h_l and the noise variables \xi_m and \xi_0 are now correlated through equation (10). 
Intuitively, this is reasonable as the weights become correlated, during training, 
with the examples in the training set. In calculating the generalization error, on the 
other hand, such correlations are absent, and one has the same result as for infinite 
training sets. The dynamical equations (11), together with (9,10), constitute our 
main result. They are exact in the limits of either a linear network (R, Q, T \to 0, 
so that g(h) is effectively linear) or \alpha \to \infty, and can be integrated numerically in 
a straightforward way. In principle, the limit F \to \infty should be taken but, as shown 
below, relatively small values of F can be used in practice. 
3 RESULTS AND DISCUSSION 
We now discuss the main consequences of our result (11), comparing the resulting 
predictions for the generalization dynamics, \epsilon_g(t), to the infinite training set theory 
Figure 2: \epsilon_g vs t for two hidden units (L = M = 2). Left: \alpha = 0.5, with \alpha = \infty 
shown by dashed line for comparison; no noise. Right: \alpha = 4, no noise (bottom) 
and noise on teacher activations and outputs of variance 0.1 (top). Simulations for 
N = 100 are shown by small circles; standard errors are less than the symbol size. 
Learning rate \eta = 2 throughout. 
and to simulations. Throughout, the teacher overlap matrix is set to T_{ij} = \delta_{ij} 
(orthogonal teacher weight vectors of length \sqrt{N}). 
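Such a teacher can be constructed, for instance, by orthonormalizing M random Gaussian vectors and rescaling them to length \sqrt{N}; a sketch with our own naming:

```python
import math
import numpy as np

def orthogonal_teacher(M, N, rng):
    """Teacher weight vectors w*_m with overlap matrix T_mm' = delta_mm',
    i.e. mutually orthogonal rows of length sqrt(N), built from the QR
    decomposition of a random Gaussian matrix (requires M <= N)."""
    G = rng.standard_normal((N, M))
    Q, _ = np.linalg.qr(G)            # Q has orthonormal columns
    return math.sqrt(N) * Q.T         # shape (M, N), rows of length sqrt(N)
```

By construction, (1/N) W W^T then equals the identity matrix of size M.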
In figure 1, we study the accuracy of our method as a function of the training 
set size for a nonlinear network with one hidden unit at two different noise levels. 
The learning rate was set to \eta = 1 for both (a) and (b). For small activation 
and output noise (\sigma^2 = 0.01), figure 1(a), there is good agreement with the sim- 
ulations for \alpha down to \alpha = 3, below which the theory begins to underestimate 
the generalization error, compared to simulations. Our finite \alpha theory, however, 
is still considerably more accurate than the infinite \alpha predictions. For larger noise 
(\sigma^2 = 0.1, figure 1(b)), our theory provides a reasonable quantitative estimate of the 
generalization dynamics for \alpha \geq 3. Below this value there is significant disagree- 
ment, although the qualitative behaviour of the dynamics is predicted quite well, 
including the overfitting phenomenon beyond t \approx 10. The infinite \alpha theory in this 
case is qualitatively incorrect. 
In the two hidden unit case, figure 2, our theory captures the initial evolution of 
\epsilon_g(t) very well, but diverges significantly from the simulations at larger t; neverthe- 
less, it provides a considerable improvement on the infinite \alpha theory. One reason for 
the discrepancy at large t is that the theory predicts that different student hidden 
units will always specialize to individual teacher hidden units for t \to \infty, whatever 
the value of \alpha. This leads to a decay of \epsilon_g from a plateau value at intermediate times 
t. In the simulations, on the other hand, this specialization (or symmetry breaking) 
appears to be inhibited or at least delayed until very large t. This can happen even 
for zero noise and \alpha \geq L, where the training data should contain enough 
information to force student and teacher weights to be equal asymptotically. The 
reason for this is not clear to us, and deserves further study. Our initial investiga- 
tions, however, suggest that symmetry breaking may be strongly delayed due to the 
presence of saddle points in the training error surface with very 'shallow' unstable 
directions. 
When our theory fails, which of its assumptions are violated? It is conceivable 
that multiple local minima in the training error surface could cause self-averaging 
to break down; however, we have found no evidence for this; see figure 3(a). On 
the other hand, the simulation results in figure 3(b) clearly show that the implicit 
assumption of Gaussian student activations, as discussed before eq. (8), can be 
violated. 
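A simple way to probe this Gaussianity assumption in simulations, our suggestion rather than the paper's procedure, is to measure the excess kurtosis of a student activation across the training set, which vanishes when the activation is Gaussian-distributed:

```python
import math
import numpy as np

def activation_excess_kurtosis(w, X):
    """Excess kurtosis of the activation h = w.x/sqrt(N) over a training
    set X of shape (p, N); approximately zero if h is Gaussian, as the
    theory implicitly assumes."""
    N = len(w)
    h = X @ w / math.sqrt(N)
    h = h - h.mean()                 # center before taking moments
    m2 = np.mean(h ** 2)
    m4 = np.mean(h ** 4)
    return m4 / m2 ** 2 - 3.0        # 0 for a Gaussian distribution
```

Before training, with weights independent of the Gaussian inputs, the statistic is close to zero; a large value after training would signal a breakdown of the Gaussian assumption.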
Figure 3: (a) Variance of \epsilon_g(t = 20) vs input dimension N for student and teacher 
with two hidden units (L = M = 2), \alpha = 0.5, \eta = 2, and zero noise. The bottom 
curve shows the variance due to different random choices of training examples from 
a fixed training set ('training history'); the top curve also includes the variance due 
to different training sets. Both are compatible with the 1/N decay expected if self- 
averaging holds (dotted line). (b) Distribution (over training set) of the activation 
h_1 of the first hidden unit of the student. Histogram from simulations for N = 1000, 
all other parameter values as in (a). 
In summary, the main theoretical contribution of this paper is the extension of online 
learning analysis for finite training sets to nonlinear networks. Our approximate 
theory does not require the use of replicas and yields ordinary first order differential 
equations for the time evolution of a set of order parameters. Its central implicit 
assumption (and its Achilles' heel) is that the student activations are Gaussian 
distributed. In comparison with simulations, we have found that it is more accurate 
than the infinite training set analysis at predicting the generalization dynamics for 
finite training sets, both qualitatively and also quantitatively for small learning 
times t. Future work will have to show whether the theory can be extended to cope 
with non-Gaussian student activations without incurring the technical difficulties 
of dynamical replica theory [2], and whether this will help to capture the effects of 
local minima and, more generally, 'rough' training error surfaces. 
Acknowledgments: We would like to thank Ansgar West for helpful discussions. 
References 
[1] M. Biehl and H. Schwarze. Journal of Physics A, 28:643-656, 1995. 
[2] A. C. C. Coolen, S. N. Laughton, and D. Sherrington. In NIPS 8, pp. 253-259, 
MIT Press, 1996; S. N. Laughton, A. C. C. Coolen, and D. Sherrington. Journal 
of Physics A, 29:763-786, 1996. 
[3] See for example: The dynamics of online learning. Workshop at NIPS'95. 
[4] T. Heskes and B. Kappen. Physical Review A, 44:2718-2762, 1991. 
[5] D. Saad and S. A. Solla. Physical Review E, 52:4225, 1995. 
[6] P. Sollich. Journal of Physics A, 27:7771-7784, 1994. 
[7] P. Sollich and D. Barber. In NIPS 9, pp. 274-280, MIT Press, 1997; Europhysics 
Letters, 38:477-482, 1997. 
