Dynamics of Supervised Learning with 
Restricted Training Sets and Noisy Teachers 
A.C.C. Coolen 
Dept of Mathematics 
King's College London 
The Strand, London WC2R 2LS, UK 
tcoolen@mth.kcl.ac.uk
C.W.H. Mace 
Dept of Mathematics 
King's College London 
The Strand, London WC2R 2LS, UK 
cmace@mth.kcl.ac.uk
Abstract 
We generalize a recent formalism to describe the dynamics of supervised 
learning in layered neural networks, in the regime where data recycling 
is inevitable, to the case of noisy teachers. Our theory generates reliable 
predictions for the evolution in time of training- and generalization er- 
rors, and extends the class of mathematically solvable learning processes 
in large neural networks to those situations where overfitting can occur. 
1 Introduction 
Tools from statistical mechanics have been used successfully over the last decade to study 
the dynamics of learning in layered neural networks (for reviews see e.g. [1] or [2]). The 
simplest theories result upon assuming the data set to be much larger than the number 
of weight updates made, which rules out recycling and ensures that any distribution of 
relevance will be Gaussian. Unfortunately, both in terms of applications and in terms of 
mathematical interest, this regime is not the most relevant one. Most complications and 
peculiarities in the dynamics of learning arise precisely due to data recycling, which creates 
for the system the possibility to improve performance by memorizing answers rather than 
by learning an underlying rule. The dynamics of learning with restricted training sets was 
first studied analytically in [3] (linear learning rules) and [4] (systems with binary weights). 
The latter studies were ahead of their time, and did not get the attention they deserved just 
because at that stage even the simpler learning dynamics without data recycling had not 
yet been studied. More recently attention has moved back to the dynamics of learning 
in the recycling regime. Some studies aimed at developing a general theory [5, 6, 7], 
some at finding exact solutions for special cases [8]. All general theories published so far 
have in common that they as yet considered realizable scenarios: the rule to be learned
was implementable by the student, and overfitting could not yet occur. The next hurdle is 
that where restricted training sets are combined with unrealizable rules. Again some have 
turned to non-typical but solvable cases, involving Hebbian rules and noisy [9] or 'reverse 
wedge' teachers [10]. More recently the cavity method has been used to build a general 
theory [11] (as yet for batch learning only). In this paper we generalize the general theory 
launched in [6, 5, 7], which applies to arbitrary learning rules, to the case of noisy teachers. 
We will mirror closely the presentation in [6] (dealing with the simpler case of noise-free 
teachers), and we refer to [5, 7] for background reading on the ideas behind the formalism. 
238 A. C. C. Coolen and C. W. H. Mace 
2 Definitions 
As in [6, 5] we restrict ourselves for simplicity to perceptrons. A student perceptron operates
a linear separation, parametrised by a weight vector $J \in \Re^N$:
$$S: \{-1,1\}^N \to \{-1,1\}, \qquad S(\xi) = \mathrm{sgn}[J\cdot\xi]$$
It aims to emulate a teacher operating a similar rule, which, however, is characterized by a
variable weight vector $B \in \Re^N$, drawn at random from a distribution $P(B)$ such as
$$\text{output noise:}\quad P(B) = \lambda\,\delta[B+B^*] + (1-\lambda)\,\delta[B-B^*] \qquad (1)$$
$$\text{Gaussian weight noise:}\quad P(B) = \left[\Sigma\sqrt{2\pi/N}\right]^{-N} e^{-\frac{1}{2}N(B-B^*)^2/\Sigma^2} \qquad (2)$$
The parameters $\lambda$ and $\Sigma$ control the amount of teacher noise, with the noise-free teacher
$B = B^*$ recovered in the limits $\lambda \to 0$ and $\Sigma \to 0$. The student modifies $J$ iteratively, using
examples of input vectors $\xi$ which are drawn at random from a fixed (randomly composed)
training set containing $p = \alpha N$ vectors $\xi^\mu \in \{-1,1\}^N$ with $\alpha > 0$, and the corresponding
values of the teacher outputs. We choose the teacher noise to be consistent, i.e. the answer
given by the teacher to a question $\xi^\mu$ will remain the same when that particular question
re-appears during the learning process. Thus $T(\xi^\mu) = \mathrm{sgn}[B^\mu\cdot\xi^\mu]$, with $p$ teacher weight
vectors $B^\mu$, drawn randomly and independently from $P(B)$, and we generalize the training
set accordingly to $\tilde D = \{(\xi^1, B^1), \ldots, (\xi^p, B^p)\}$. Consistency of teacher noise is natural
in terms of applications, and a prerequisite for overfitting phenomena. Averages over the
training set will be denoted as $\langle\ldots\rangle_{\tilde D}$; averages over all possible input vectors $\xi \in \{-1,1\}^N$
as $\langle\ldots\rangle_\xi$. We analyze two classes of learning rules, of the form $J(\ell+1) = J(\ell) + \Delta J(\ell)$:
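$$\text{on-line:}\quad \Delta J(\ell) = \frac{\eta}{N}\left\{\xi(\ell)\,G[J(\ell)\cdot\xi(\ell), B(\ell)\cdot\xi(\ell)] - \gamma J(\ell)\right\}$$
$$\text{batch:}\quad \Delta J(\ell) = \frac{\eta}{N}\left\{\langle \xi\,G[J(\ell)\cdot\xi, B\cdot\xi]\rangle_{\tilde D} - \gamma J(\ell)\right\} \qquad (3)$$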
In on-line learning one draws at each step $\ell$ a question/answer pair $(\xi(\ell), B(\ell))$ at random
from the training set. In batch learning one iterates a deterministic map which is an
average over all data in the training set. Our performance measures are the training and
generalization errors, defined as follows (with the step function $\theta[x > 0] = 1$, $\theta[x < 0] = 0$):
$$E_t(J) = \langle\theta[-(J\cdot\xi)(B\cdot\xi)]\rangle_{\tilde D}, \qquad E_g(J) = \langle\theta[-(J\cdot\xi)(B^*\cdot\xi)]\rangle_\xi \qquad (4)$$
We introduce macroscopic observables, tailored to the present problem, generalizing [5, 6]:
$$Q[J] = J^2, \qquad R[J] = J\cdot B^*, \qquad P[x,y,z;J] = \langle\delta[x-J\cdot\xi]\,\delta[y-B^*\cdot\xi]\,\delta[z-B\cdot\xi]\rangle_{\tilde D} \qquad (5)$$
As in [5, 6] we eliminate technical subtleties by assuming the number of arguments $(x, y, z)$
for which $P[x, y, z; J]$ is evaluated to go to infinity after the limit $N \to \infty$ has been taken.
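As a concrete illustration of these definitions, the following minimal Python sketch builds a training set with consistent output noise, as in (1), and implements one on-line and one batch step of rule (3). The function names, the use of NumPy, and the normalisation of $B^*$ to unit length (chosen so that the clean teacher fields have unit variance) are assumptions made for this sketch only.

```python
import numpy as np

def make_training_set(N, alpha, lam, rng):
    """Draw p = alpha*N binary inputs and consistent output-noise teacher fields."""
    p = int(alpha * N)
    B_star = rng.standard_normal(N)
    B_star /= np.linalg.norm(B_star)           # assumption: B* normalised to unit length
    xi = rng.choice([-1.0, 1.0], size=(p, N))  # inputs xi^mu in {-1,1}^N
    y = xi @ B_star                            # clean teacher fields B*.xi^mu
    flipped = rng.random(p) < lam              # B^mu = -B* with probability lam, eq. (1)
    z = np.where(flipped, -y, y)               # noisy teacher fields, drawn once (consistent noise)
    return xi, y, z, B_star

def online_step(J, xi, z, G, eta, gamma, rng):
    """One on-line update of rule (3): a single randomly drawn example."""
    mu = rng.integers(len(z))
    x = J @ xi[mu]
    return J + (eta / J.size) * (xi[mu] * G(x, z[mu]) - gamma * J)

def batch_step(J, xi, z, G, eta, gamma):
    """One batch update of rule (3): average over the whole training set."""
    x = xi @ J
    force = (xi * G(x, z)[:, None]).mean(axis=0)
    return J + (eta / J.size) * (force - gamma * J)

def hebbian(x, z):
    """Hebbian rule G[x,z] = sgn(z)."""
    return np.sign(z)
```

Because the noisy fields `z` are drawn once and then reused, a recycled question always receives the same (possibly wrong) answer, which is the consistency property assumed in the text.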
3 Derivation of Macroscopic Laws 
Upon generalizing the calculations in [6, 5], one finds for on-line learning: 
Q= 21 xdydz P[x.y.z]xO[x.z]-2l.Q+l  xdydz P[x.y.z]O[x.z] (6) 
dt = 1 xdydz P[x. y. z] y6[x. z] - lvR (7) 
= - -.[x', d]- 
/ 
-, dx'dy'dz' dx'dy'dz'6[x'.z]A[x.y.z;x'.y'.z'] + , {xP[x.y.z]} 
+  dx'dy'dz' P[x'.y'.z']6[x'.z']P[x.y.z] (8) 
Supervised Learning with Restricted Training Sets 239 
The complexity of the problem is concentrated in a Green's function: 
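$$\mathcal{A}[x,y,z;x',y',z'] = \lim_{N\to\infty}\Big\langle\Big\langle\Big\langle [1-\delta_{\xi\xi'}]\,\delta[x-J\cdot\xi]\,\delta[y-B^*\cdot\xi]\,\delta[z-B\cdot\xi]\,(\xi\cdot\xi')\,\delta[x'-J\cdot\xi']\,\delta[y'-B^*\cdot\xi']\,\delta[z'-B'\cdot\xi']\Big\rangle_{\tilde D}\Big\rangle_{\tilde D}\Big\rangle_{QRP;t}$$
It involves a conditional average of the form $\langle K[J]\rangle_{QRP;t} = \int dJ\; p_t(J|Q,R,P)\,K[J]$, with
$$p_t(J|Q,R,P) = \frac{p_t(J)\,\delta[Q-Q[J]]\,\delta[R-R[J]]\prod_{xyz}\delta[P[x,y,z]-P[x,y,z;J]]}{\int dJ\; p_t(J)\,\delta[Q-Q[J]]\,\delta[R-R[J]]\prod_{xyz}\delta[P[x,y,z]-P[x,y,z;J]]}$$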
in which $p_t(J)$ is the weight probability density at time $t$. The solution of (6,7,8) can be
used to generate the $N \to \infty$ performance measures (4) at any time:
$$E_t = \int dx\,dy\,dz\; P[x,y,z]\,\theta[-xz], \qquad E_g = \pi^{-1}\arccos[R/\sqrt{Q}] \qquad (9)$$
Expansion of these equations in powers of $\eta$, and retaining only the terms linear in $\eta$, gives
the corresponding equations describing batch learning. So far this analysis is exact.
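The expressions (9) are straightforward to evaluate once $Q$, $R$ and the field statistics are known. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the training error is estimated from samples of the fields $(x, z)$ rather than from the density $P[x,y,z]$, and the finite-size check uses an arbitrary student vector with $B^*$ normalised to unit length. It compares the arccos formula for $E_g$ with a direct estimate over random binary questions.

```python
import numpy as np

def generalization_error(Q, R):
    """E_g from eq. (9), valid for N -> infinity."""
    return np.arccos(R / np.sqrt(Q)) / np.pi

def training_error(x, z):
    """E_t from eq. (9), with P[x,y,z] replaced by samples of (x, z) over the training set."""
    return np.mean(np.asarray(x) * np.asarray(z) < 0)

# Finite-N sanity check of the E_g formula (all values below are illustrative choices).
rng = np.random.default_rng(1)
N, M = 100, 20000
B_star = rng.standard_normal(N)
B_star /= np.linalg.norm(B_star)                          # assumption: unit-length clean teacher
J = rng.standard_normal(N) / np.sqrt(N) + 0.5 * B_star    # arbitrary student vector
xi = rng.choice([-1.0, 1.0], size=(M, N))                 # fresh random questions
eg_sampled = np.mean((xi @ J) * (xi @ B_star) < 0)        # fraction of disagreements
eg_formula = generalization_error(J @ J, J @ B_star)
```

For large $N$ the two estimates should agree up to sampling fluctuations, since the student and clean-teacher fields become jointly Gaussian.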
4 Closure of Macroscopic Laws 
As in [6, 5] we close our macroscopic laws (6,7,8) by making the two key assumptions 
underlying dynamical replica theory: 
(i) For $N \to \infty$ our macroscopic observables obey closed dynamic equations.
(ii) These equations are self-averaging with respect to the specific realization of $\tilde D$.
Assumption (i) implies that probability variations within $\{Q, R, P\}$ subshells are either absent or
irrelevant to the macroscopic laws. We may thus make the simplest choice for $p_t(J|Q,R,P)$:
$$p_t(J|Q,R,P) \to \delta[Q-Q[J]]\,\delta[R-R[J]]\prod_{xyz}\delta[P[x,y,z]-P[x,y,z;J]] \qquad (10)$$
The procedure (10) leads to exact laws if our observables $\{Q, R, P\}$ indeed obey closed
equations for $N \to \infty$. It is a maximum entropy approximation if not. Assumption (ii) allows us
to average the macroscopic laws over all training sets; it is observed in simulations, and
proven using the formalism of [4]. Our assumptions (10) result in the closure of (6,7,8),
since now the Green's function can be written in terms of $\{Q, R, P\}$. The final ingredient
of dynamical replica theory is doing the average of fractions with the replica identity
$$\left\langle\frac{\int dJ\; W[J|\tilde D]\,G[J|\tilde D]}{\int dJ\; W[J|\tilde D]}\right\rangle_{\rm sets} = \lim_{n\to 0}\int dJ^1\cdots dJ^n\left\langle G[J^1|\tilde D]\prod_{a=1}^{n}W[J^a|\tilde D]\right\rangle_{\rm sets}$$
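The replica identity can be motivated by a short formal argument, sketched here under the usual assumption that the limit $n \to 0$ may be taken by analytic continuation from integer $n$. The normalising denominator is written as a power of the integral, which for integer $n$ factorises over $n$ copies (replicas) of the weight vector:

```latex
\frac{\int dJ\, W[J|\tilde D]\, G[J|\tilde D]}{\int dJ\, W[J|\tilde D]}
  = \lim_{n\to 0} \left[\int dJ\, W[J|\tilde D]\right]^{n-1} \int dJ\, W[J|\tilde D]\, G[J|\tilde D]
  = \lim_{n\to 0} \int dJ^1 \cdots dJ^n\; G[J^1|\tilde D] \prod_{a=1}^{n} W[J^a|\tilde D]
```

The benefit is that the average over training sets now acts on a product rather than on a fraction.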
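Our problem has been reduced to calculating (non-trivial) integrals and averages. One
finds that $P[x,y,z] = P[x,z|y]P[y]$ with $P[y] = (2\pi)^{-1/2}\exp[-\frac{1}{2}y^2]$. With the short-hands
$Dy = P[y]dy$ and $\langle f(x,y,z)\rangle = \int Dy\,dx\,dz\; P[x,z|y]f(x,y,z)$ we can write the resulting
macroscopic laws, for the case of output noise (1), in the following compact way:
$$\frac{d}{dt}Q = 2\eta(V - \gamma Q) + \eta^2 Z, \qquad \frac{d}{dt}R = \eta(W - \gamma R) \qquad (11)$$
$$\frac{d}{dt}P[x,z|y] = \frac{1}{\alpha}\int dx'\,P[x',z|y]\left\{\delta[x-x'-\eta G[x',z]] - \delta[x-x']\right\} + \frac{1}{2}\eta^2 Z\frac{\partial^2}{\partial x^2}P[x,z|y]$$
$$\qquad - \eta\frac{\partial}{\partial x}\left\{P[x,z|y]\left[U(x-Ry) + Wy - \gamma x + [V - RW - (Q-R^2)U]\,\Phi[x,y,z]\right]\right\} \qquad (12)$$
with
$$U = \langle\Phi[x,y,z]\,G[x,z]\rangle, \quad V = \langle x\,G[x,z]\rangle, \quad W = \langle y\,G[x,z]\rangle, \quad Z = \langle G^2[x,z]\rangle$$
The solution of (12) is at any time of the following form:
$$P[x,z|y] = (1-\lambda)\,\delta[y-z]\,P^+[x|y] + \lambda\,\delta[y+z]\,P^-[x|y] \qquad (13)$$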
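Finding the function $\Phi[x,y,z]$ (in replica symmetric ansatz) requires solving a saddle-point
problem for a scalar observable $q$ and two functions $M^\pm[x|y]$. Upon introducing
$$B = \frac{\sqrt{qQ-R^2}}{Q(1-q)}, \qquad \langle f[x,y]\rangle^\pm_\star = \frac{\int dx\; M^\pm[x|y]\,e^{Bxs}f[x,y]}{\int dx\; M^\pm[x|y]\,e^{Bxs}}$$
(with $\int dx\; M^\pm[x|y] = 1$ for all $y$) the saddle-point equations acquire the form
$$\text{for all } X, y: \qquad P^\pm[X|y] = \int Ds\;\langle\delta[X-x]\rangle^\pm_\star \qquad (14)$$
$$\langle(x-Ry)^2\rangle + (qQ-R^2)\left[1-\frac{1}{\alpha}\right] = \frac{qQ + Q - 2R^2}{\sqrt{qQ-R^2}}\int Dy\,Ds\; s\left[(1-\lambda)\langle x\rangle^+_\star + \lambda\langle x\rangle^-_\star\right] \qquad (15)$$
The equations (14) which determine $M^\pm[x|y]$ have the same structure as the corresponding
(single) equation in [5, 6], so the proofs in [5, 6] again apply, and the solutions $M^\pm[x|y]$,
given a $q$ in the physical range $q \in [R^2/Q, 1]$, are unique. The function $\Phi[x,y,z]$ is then
given by
$$\Phi[X,y,z] = \frac{\int Ds\; s\left\{(1-\lambda)\,\delta[z-y]\,\langle\delta[X-x]\rangle^+_\star + \lambda\,\delta[z+y]\,\langle\delta[X-x]\rangle^-_\star\right\}}{\sqrt{qQ-R^2}\;P[X,z|y]} \qquad (16)$$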
Working out predictions from these equations is generally CPU-intensive, mainly due to 
the functional saddle-point equation (14) to be solved at each time step. However, as in [7] 
one can construct useful approximations of the theory, with increasing complexity: 
(i) Large $\alpha$ approximation (giving the simplest theory, without saddle-point equations)
(ii) Conditionally Gaussian approximation for $M^\pm[x|y]$ (with $y$-dependent moments)
(iii) Annealed approximation of the functional saddle-point equation 
5 Benchmark Tests: The Limits $\alpha \to \infty$ and $\lambda \to 0$
We first show that in the limit $\alpha \to \infty$ our theory reduces to the simple $(Q, R)$ formalism
of infinite training sets, as worked out for noisy teachers in [12]. Upon making the ansatz
$$P^+[x|y] = P^-[x|y] = [2\pi(Q-R^2)]^{-1/2}\, e^{-\frac{1}{2}[x-Ry]^2/(Q-R^2)} \qquad (17)$$
one finds
$$M^\pm[x|y] = P^\pm[x|y], \qquad q = R^2/Q, \qquad \Phi[x,y,z] = (x-Ry)/(Q-R^2)$$
Insertion of our ansatz into (12), followed by rearranging of terms and usage of the above
expression for $\Phi[x,y,z]$, shows that (12) is satisfied. The remaining equations (11) involve
only averages over the Gaussian distribution (17), and indeed reduce to those of [12]:
$$\frac{1}{\eta}\frac{d}{dt}Q = (1-\lambda)\left\{2\langle x\,G[x,y]\rangle + \eta\langle G^2[x,y]\rangle\right\} + \lambda\left\{2\langle x\,G[x,-y]\rangle + \eta\langle G^2[x,-y]\rangle\right\} - 2\gamma Q$$
$$\frac{1}{\eta}\frac{d}{dt}R = (1-\lambda)\langle y\,G[x,y]\rangle + \lambda\langle y\,G[x,-y]\rangle - \gamma R$$
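The $(Q, R)$ flow above can be integrated numerically for any rule $G$, since the averages are over the Gaussian ansatz (17) only. The following sketch is one possible way to do this, not the procedure used by the authors: it evaluates the averages on a fixed grid and advances $Q$ and $R$ with explicit Euler steps. The grid size, step size, and all function names are assumptions made for illustration.

```python
import numpy as np

# Even number of grid points, so that no node sits on the discontinuity at y = 0.
_grid = np.linspace(-6.0, 6.0, 300)
_wts = np.exp(-0.5 * _grid**2)
_wts /= _wts.sum()                        # discretised Gaussian measure

def gaussian_averages(G, Q, R, lam):
    """V = <x G>, W = <y G>, Z = <G^2> over ansatz (17), mixing z = +y and z = -y."""
    y, u = np.meshgrid(_grid, _grid, indexing="ij")
    x = R * y + np.sqrt(Q - R**2) * u     # x given y is Gaussian: mean R*y, variance Q - R^2
    w2 = np.outer(_wts, _wts)
    def avg(f):
        return np.sum(w2 * ((1 - lam) * f(G(x, y)) + lam * f(G(x, -y))))
    return avg(lambda g: x * g), avg(lambda g: y * g), avg(lambda g: g**2)

def qr_flow(G, Q, R, eta, gamma, lam, t_max, dt=0.01):
    """Explicit Euler integration of the infinite-training-set (Q, R) equations."""
    for _ in range(int(round(t_max / dt))):
        V, W, Z = gaussian_averages(G, Q, R, lam)
        Q, R = (Q + dt * eta * (2 * V + eta * Z - 2 * gamma * Q),
                R + dt * eta * (W - gamma * R))
    return Q, R

hebbian = lambda x, z: np.sign(z)                     # G[x,z] = sgn(z)
perceptron = lambda x, z: np.sign(z) * (x * z < 0)    # G[x,z] = sgn(z) theta[-xz]
```

For the Hebbian rule the averages are known in closed form, $V = (1-2\lambda)R\sqrt{2/\pi}$, $W = (1-2\lambda)\sqrt{2/\pi}$ and $Z = 1$, which gives a simple check on the grid approximation.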
Next we turn to the limit $\lambda \to 0$ (restricted training sets & noise-free teachers) and show that
here our theory reproduces the formalism of [6, 5]. Now we make the following ansatz:
$$P^+[x|y] = P[x|y], \qquad P[x,z|y] = \delta[z-y]\,P[x|y] \qquad (18)$$
Insertion shows that for $\lambda = 0$ solutions of this form indeed solve our equations, giving
$\Phi[x,y,z] \to \Phi[x,y]$ and $M^+[x|y] = M[x|y]$, and leaving us exactly with the formalism
of [6, 5] describing the case of noise-free teachers and restricted training sets (apart from
some new terms due to the presence of weight decay, which was absent in [6, 5]).
[Figure 1: two panels, errors (vertical axis, 0 to 0.5) versus time $t$ (0 to 15); curves labelled by $\alpha \in \{0.5, 1, 2, 4\}$.]
Figure 1: On-line Hebbian learning: conditionally Gaussian approximation versus exact
solution in [9] ($\eta = 1$, $\lambda = 0.2$). Left: $\gamma = 0.1$, right: $\gamma = 0.5$. Solid lines: approximated
theory, dashed lines: exact result. Upper curves: $E_g$ as functions of time (here the two
theories agree), lower curves: $E_t$ as functions of time.
6 Benchmark Tests: Hebbian Learning 
The special case of Hebbian learning, i.e. $G[x,z] = \mathrm{sgn}(z)$, can be solved exactly at any
time, for arbitrary $\{\alpha, \lambda, \gamma\}$ [9], providing yet another excellent benchmark for our theory.
For batch execution of Hebbian learning the macroscopic laws are obtained upon expanding
(11,12) and retaining only those terms which are linear in $\eta$. All integrations can now be
done and all equations solved explicitly, resulting in $U = 0$, $Z = 1$, $W = (1-2\lambda)\sqrt{2/\pi}$, and
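$$Q = Q_0\,e^{-2\eta\gamma t} + \frac{2R_0(1-2\lambda)}{\gamma}\sqrt{\frac{2}{\pi}}\;e^{-\eta\gamma t}[1-e^{-\eta\gamma t}] + \left[\frac{2}{\pi}(1-2\lambda)^2 + \frac{1}{\alpha}\right]\frac{[1-e^{-\eta\gamma t}]^2}{\gamma^2}$$
$$R = R_0\,e^{-\eta\gamma t} + (1-2\lambda)\sqrt{2/\pi}\;[1-e^{-\eta\gamma t}]/\gamma, \qquad q = \left[\alpha R^2 + [1-e^{-\eta\gamma t}]^2/\gamma^2\right]/\alpha Q$$
$$P^\pm[x|y] = [2\pi(Q-R^2)]^{-1/2}\; e^{-\frac{1}{2}\left[x - Ry \mp \mathrm{sgn}(y)[1-e^{-\eta\gamma t}]/\alpha\gamma\right]^2/(Q-R^2)} \qquad (19)$$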
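From these results, in turn, follow the performance measures $E_g = \pi^{-1}\arccos[R/\sqrt{Q}]$ and
$$E_t = \frac{1}{2} - \frac{1}{2}(1-\lambda)\int Dy\;\mathrm{erf}\left[\frac{|y|R + [1-e^{-\eta\gamma t}]/\alpha\gamma}{\sqrt{2(Q-R^2)}}\right] + \frac{1}{2}\lambda\int Dy\;\mathrm{erf}\left[\frac{|y|R - [1-e^{-\eta\gamma t}]/\alpha\gamma}{\sqrt{2(Q-R^2)}}\right]$$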
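To see how these batch Hebbian results behave, they can be evaluated numerically. The sketch below is an illustration under stated assumptions: it uses the closed forms as read here for consistent output noise (with $W = (1-2\lambda)\sqrt{2/\pi}$, a Gaussian student-field distribution of variance $Q - R^2$, and means shifted by $\pm\mathrm{sgn}(y)[1-e^{-\eta\gamma t}]/\alpha\gamma$), it requires $\gamma > 0$ and $Q_0 > R_0^2$, and it approximates the Gaussian integral over $y$ by a sum on a grid. The function name and default values are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def batch_hebbian(t, alpha, lam, gamma, eta=1.0, Q0=1.0, R0=0.0):
    """Batch Hebbian Q, R, E_g, E_t at time t (assumes gamma > 0 and Q0 > R0**2)."""
    w = (1 - 2 * lam) * math.sqrt(2 / math.pi)       # W = (1 - 2*lam) * sqrt(2/pi)
    decay = math.exp(-eta * gamma * t)
    c = (1 - decay) / gamma                          # [1 - exp(-eta*gamma*t)] / gamma
    R = R0 * decay + w * c
    Q = Q0 * decay**2 + 2 * R0 * w * decay * c + (w**2 + 1 / alpha) * c**2
    E_g = math.acos(R / math.sqrt(Q)) / math.pi
    y = np.linspace(-8.0, 8.0, 4000)                 # grid for the Gaussian measure Dy
    Dy = np.exp(-0.5 * y**2) / math.sqrt(2 * math.pi) * (y[1] - y[0])
    s = math.sqrt(2 * (Q - R**2))
    shift = c / alpha                                # memorisation shift of the field mean
    integrand = ((1 - lam) * _erf((np.abs(y) * R + shift) / s)
                 - lam * _erf((np.abs(y) * R - shift) / s))
    E_t = 0.5 - 0.5 * float(np.sum(Dy * integrand))
    return Q, R, E_g, E_t
```

At $t = 0$ and $\lambda = 0$ the training and generalization errors coincide, as they should for a random training set. At later times and small $\alpha$ the training error drops well below the generalization error, which is the signature of memorisation of the (partly wrong) answers.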
Comparison with the exact solution, calculated along the lines of [9] or, equivalently,
obtained upon putting t << -2 in [9], shows that the above expressions are all exact.
For on-line execution we cannot (yet) solve the functional saddle-point equation in general.
However, some analytical predictions can still be extracted from (11,12,13):
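$$Q = Q_0\,e^{-2\eta\gamma t} + \frac{2R_0(1-2\lambda)}{\gamma}\sqrt{\frac{2}{\pi}}\;e^{-\eta\gamma t}[1-e^{-\eta\gamma t}] + \left[\frac{2}{\pi}(1-2\lambda)^2 + \frac{1}{\alpha}\right]\frac{[1-e^{-\eta\gamma t}]^2}{\gamma^2} + \frac{\eta}{2\gamma}[1-e^{-2\eta\gamma t}]$$
$$R = R_0\,e^{-\eta\gamma t} + (1-2\lambda)\sqrt{2/\pi}\;[1-e^{-\eta\gamma t}]/\gamma$$
$$\int dx\; x\,P^\pm[x|y] = Ry \pm \mathrm{sgn}(y)\,[1-e^{-\eta\gamma t}]/\alpha\gamma$$
with $U = 0$, $W = (1-2\lambda)\sqrt{2/\pi}$, $V = WR + [1-e^{-\eta\gamma t}]/\alpha\gamma$, and $Z = 1$. Comparison with the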
results in [9] shows that the above expressions, and thus also that of Eg, are all fully exact, 
at any time. Observables involving P[x, y, z] (including the training error) are not as easily 
solved from our equations. Instead we used the conditionally Gaussian approximation 
(found to be adequate for the noiseless Hebbian case [5, 6, 7]). The result is shown in 
figure 1. The agreement is reasonable, but significantly less than that in [6]; apparently 
teacher noise adds to the deformation of the field distribution away from a Gaussian shape. 
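The on-line Hebbian laws for $Q$ and $R$ follow from (11) once $U = 0$, $Z = 1$, $W = (1-2\lambda)\sqrt{2/\pi}$ and $V = WR + [1-e^{-\eta\gamma t}]/\alpha\gamma$ are inserted. The sketch below checks this numerically: it integrates (11) with explicit Euler steps and compares the outcome with a closed-form solution. The closed form in the code is this sketch's own solution of the two linear differential equations (valid for $\gamma > 0$), and the function names and step size are assumptions.

```python
import math

def online_hebbian_closed_form(t, alpha, lam, gamma, eta, Q0, R0):
    """Solution of (11) for on-line Hebbian learning with output noise (gamma > 0)."""
    w = (1 - 2 * lam) * math.sqrt(2 / math.pi)
    decay = math.exp(-eta * gamma * t)
    c = (1 - decay) / gamma
    R = R0 * decay + w * c
    Q = (Q0 * decay**2 + 2 * R0 * w * decay * c + (w**2 + 1 / alpha) * c**2
         + eta * (1 - decay**2) / (2 * gamma))      # extra term from eta^2 * Z with Z = 1
    return Q, R

def online_hebbian_euler(t_max, alpha, lam, gamma, eta, Q0, R0, dt=1e-3):
    """Explicit Euler integration of (11) with the Hebbian values of V, W, Z."""
    w = (1 - 2 * lam) * math.sqrt(2 / math.pi)
    Q, R = Q0, R0
    for k in range(int(round(t_max / dt))):
        c = (1 - math.exp(-eta * gamma * k * dt)) / gamma
        V = w * R + c / alpha
        Q, R = (Q + dt * (2 * eta * (V - gamma * Q) + eta**2),
                R + dt * eta * (w - gamma * R))
    return Q, R
```

The only difference with the batch case is the term proportional to $\eta^2 Z$ in the equation for $Q$, which reflects the fluctuations of single-example updates.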
[Figure 2: four panels. Left column: errors $E$ versus time $t$ (0 to 10). Right column: field distributions versus $x$ (-3 to 3).]
Figure 2: Large $\alpha$ approximation versus numerical simulations (with $N = 10{,}000$), for
$\gamma = 0$ and $\lambda = 0.2$. Top row: Perceptron rule, with $\eta$ = … Bottom row: Adatron rule,
with $\eta$ = … Left: training errors $E_t$ and generalisation errors $E_g$ as functions of time, for
$\alpha \in \{…, 1, 2\}$. Lines: approximated theory, markers: simulations (circles: $E_t$, squares: $E_g$).
Right: joint distributions for student field and teacher noise $P^\pm[x] = \int dy\; P[x, y, z = \pm y]$
(upper: $P^+[x]$, lower: $P^-[x]$). Histograms: simulations, lines: approximated theory.
7 Non-Linear Learning Rules: Theory versus Simulations 
In the case of non-linear learning rules no exact solution is known against which to test our 
formalism, leaving numerical simulations as the yardstick. We have evaluated numerically 
the large $\alpha$ approximation of our theory for Perceptron learning, $G[x,z] = \mathrm{sgn}(z)\,\theta[-xz]$,
and for Adatron learning, $G[x,z] = \mathrm{sgn}(z)\,|x|\,\theta[-xz]$. This approximation leads to the
following fully explicit equation for the field distributions:
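$$\frac{d}{dt}P^\pm[x|y] = \frac{1}{\alpha}\int dx'\,P^\pm[x'|y]\left\{\delta[x-x'-\eta G[x',\pm y]] - \delta[x-x']\right\} + \frac{1}{2}\eta^2 Z\frac{\partial^2}{\partial x^2}P^\pm[x|y]$$
$$\qquad - \eta\frac{\partial}{\partial x}\left\{P^\pm[x|y]\left[Wy - \gamma x + U[\bar{x}^\pm(y) - Ry] + (V-RW)\frac{x - \bar{x}^\pm(y)}{Q-R^2}\right]\right\}$$
with
$$U = \frac{1}{Q-R^2}\int Dy\,dx\,\left\{(1-\lambda)P^+[x|y]\,[x-\bar{x}^+(y)]\,G[x,y] + \lambda P^-[x|y]\,[x-\bar{x}^-(y)]\,G[x,-y]\right\}$$
$$V = \int Dy\,dx\; x\left\{(1-\lambda)P^+[x|y]\,G[x,y] + \lambda P^-[x|y]\,G[x,-y]\right\}$$
$$W = \int Dy\,dx\; y\left\{(1-\lambda)P^+[x|y]\,G[x,y] + \lambda P^-[x|y]\,G[x,-y]\right\}$$
$$Z = \int Dy\,dx\,\left\{(1-\lambda)P^+[x|y]\,G^2[x,y] + \lambda P^-[x|y]\,G^2[x,-y]\right\}$$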
and with the short-hands $\bar{x}^\pm(y) = \int dx\; x\,P^\pm[x|y]$. The result of our comparison is shown
in figure 2. Note: $E_t$ increases monotonically with $\alpha$, and $E_g$ decreases monotonically
with $\alpha$, at any $t$. As in the noise-free formalism [7], the large $\alpha$ approximation appears to
capture the dominant terms both for $\alpha \to \infty$ and for $\alpha \to 0$. The predictive power of our
theory is mainly limited by numerical constraints. For instance, the Adatron learning rule
generates singularities at $x = 0$ in the distributions $P^\pm[x|y]$ (especially for small $\eta$) which,
although predicted by our theory, are almost impossible to capture in numerical solutions.
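The simulation side of such a comparison can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function names, the random initial student with $Q \approx 1$, the unit-length clean teacher, and the small system size in the usage example are assumptions, and time is measured as the number of on-line updates divided by $N$.

```python
import numpy as np

def perceptron(x, z):
    """Perceptron rule G[x,z] = sgn(z) theta[-xz]: update only on errors."""
    return np.sign(z) * (x * z < 0)

def adatron(x, z):
    """Adatron rule G[x,z] = sgn(z) |x| theta[-xz]: error-driven, proportional to |x|."""
    return np.sign(z) * np.abs(x) * (x * z < 0)

def simulate_online(G, N, alpha, lam, eta, gamma, t_max, seed=0):
    """On-line learning from a fixed training set with consistent output noise."""
    rng = np.random.default_rng(seed)
    p = int(alpha * N)
    B_star = rng.standard_normal(N)
    B_star /= np.linalg.norm(B_star)               # assumption: unit-length clean teacher
    xi = rng.choice([-1.0, 1.0], size=(p, N))
    y = xi @ B_star
    z = np.where(rng.random(p) < lam, -y, y)       # flipped answers stay flipped
    J = rng.standard_normal(N) / np.sqrt(N)        # assumption: random start with Q close to 1
    for _ in range(int(t_max * N)):                # one unit of time = N updates
        mu = rng.integers(p)
        x = J @ xi[mu]
        J += (eta / N) * (xi[mu] * G(x, z[mu]) - gamma * J)
    E_t = np.mean((xi @ J) * z < 0)                # error on the (noisy) training answers
    E_g = np.arccos(J @ B_star / np.linalg.norm(J)) / np.pi
    return E_t, E_g
```

A call such as `simulate_online(perceptron, N=200, alpha=2.0, lam=0.2, eta=1.0, gamma=0.0, t_max=5.0)` returns one sample of $(E_t, E_g)$. Much larger $N$, as used for figure 2, would be needed to suppress finite-size fluctuations.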
8 Discussion 
We have shown how a recent theory to describe the dynamics of supervised learning with 
restricted training sets (designed to apply in the data recycling regime, and for arbitrary on- 
line and batch learning rules) [5, 6, 7] in large layered neural networks can be generalized 
successfully in order to deal also with noisy teachers. In our generalized approach the joint 
distribution $P[x, y, z]$ for the fields of student, 'clean' teacher, and noisy teacher is taken to
be a dynamical order parameter, in addition to the conventional observables $Q$ and $R$. From
the order parameter set $\{Q, R, P\}$ we derive the generalization error $E_g$ and the training
error $E_t$. Following the prescriptions of dynamical replica theory one finds a diffusion
equation for $P[x, y, z]$, which we have evaluated by making the replica-symmetric ansatz.
We have carried out several orthogonal benchmark tests of our theory: (i) for $\alpha \to \infty$ (no
data recycling) our theory is exact, (ii) for $\lambda \to 0$ (no teacher noise) our theory reduces
to that of [5, 6, 7], and (iii) for batch Hebbian learning our theory is exact. For on-line
Hebbian learning our theory is exact with regard to the predictions for $Q$, $R$, $E_g$ and the
$y$-dependent conditional averages $\int dx\; x\,P^\pm[x|y]$, at any time, and a crude approximation
of our equations already gives reasonable agreement with the exact results [9] for $E_t$. For
non-linear learning rules (Perceptron and Adatron) we have compared numerical solution
of a simple large $\alpha$ approximation of our equations to numerical simulations, and found
satisfactory agreement. This paper is a preliminary presentation of results obtained in the 
second stage of a research programme aimed at extending our theoretical tools in the arena 
of learning dynamics, building on [5, 6, 7]. Ongoing work is aimed at systematic applica- 
tion of our theory and its approximations to various types of non-linear learning rules, and 
at generalization of the theory to multi-layer networks. 
References 
[1] Mace C.W.H. and Coolen A.C.C. (1998), Statistics and Computing 8, 55
[2] Saad D. (ed.) (1998), On-Line Learning in Neural Networks (Cambridge: CUP)
[3] Hertz J.A., Krogh A. and Thorbergsson G.I. (1989), J. Phys. A 22, 2133
[4] Horner H. (1992a), Z. Phys. B 86, 291 and Horner H. (1992b), Z. Phys. B 87, 371
[5] Coolen A.C.C. and Saad D. (1998), in On-Line Learning in Neural Networks, Saad
D. (ed.), (Cambridge: CUP)
[6] Coolen A.C.C. and Saad D. (1999), in Advances in Neural Information Processing
Systems 11, Kearns M.S., Solla S.A., Cohn D.A. (eds.), (MIT press)
[7] Coolen A.C.C. and Saad D. (1999), preprints KCL-MTH-99-32 & KCL-MTH-99-33
[8] Rae H.C., Sollich P. and Coolen A.C.C. (1999), in Advances in Neural Information
Processing Systems 11, Kearns M.S., Solla S.A., Cohn D.A. (eds.), (MIT press)
[9] Rae H.C., Sollich P. and Coolen A.C.C. (1999), J. Phys. A 32, 3321
[10] Inoue J.I. (1999), private communication
[11] Wong K.Y.M., Li S. and Tong Y.W. (1999), preprint cond-mat/9909004
[12] Biehl M., Riegler P. and Stechert M. (1995), Phys. Rev. E 52, 4624
