Dynamics of On-Line Gradient Descent 
Learning for Multilayer Neural Networks 
David Saad* 
Dept. of Comp. Sci. & App. Math. 
Aston University 
Birmingham B4 7ET, UK 
Sara A. Solla t 
CONNECT, The Niels Bohr Institute 
Blegdamsdvej 17 
Copenhagen 2100, Denmark 
Abstract 
We consider the problem of on-line gradient descent learning for 
general two-layer neural networks. An analytic solution is pre- 
sented and used to investigate the role of the learning rate in con- 
trolling the evolution and convergence of the learning process. 
Learning in layered neural networks refers to the modification of internal parameters 
{J} which specify the strength of the interneuron couplings, so as to bring the map 
fj implemented by the network as close as possible to a desired map f. The 
degree of success is monitored_through the generalization error, a measure of the 
dissimilarity between fj and f. 
Consider maps from an N-dimensional input space  onto a scalar (, as arise in 
the formulation of classification and regression tasks. Two-layer networks with an 
arbitrary number of hidden units have been shown to be universal approximators 
[1] for such N-to-one dimensional maps. Information about the desired map f is 
provided through independent examples (,(), with ( = f() for all/. The 
examples are used to train a student network with N input units, K hidden units, 
and a single linear output unit; the target map f is defined through a teacher 
network of similar architecture except for the number M of hidden units. We 
investigate the emergence of generalization ability in an on-line learning scenario 
[2], in which the couplings are modified after the presentation of each example so 
as to minimize the corresponding error. The resulting changes in {J} are described 
as a dynamical evolution; the number of examples plays the role of time. 
In this paper we limit our discussion to the case of the soft-committee machine 
[2], in which all the hidden units are connected to the output unit with positive 
couplings of unit strength, and only the input-to-hidden couplings are adaptive. 
* D.Saad@aston.ac.uk 
/On leave from AT&T Bell Laboratories, Holmdel, NJ 07733, USA 
Dynamics of On-line Gradient Descent Learning for Multilayer Neural Networks 303 
Consider the student network: hidden unit i receives information from input unit 
r through the weight Jir, and its activation under presentation of an input pattern 
 = ((,...,(N) is zi = Ji' , with Ji: (Jii,..., Jilv) defined as the vector of 
incoming weights onto the i-th hidden unit. The output of the student network is 
cr(J,) = -/K=l g (ji. ), where g is the activation function of the hidden units, 
taken here to be the error function g(z) -- erf(z/v), and J _= {Ji}i<i<K is the set 
of input-to-hidden adaptive weights. 
Training examples are of the form (, (). The components of the independently 
drawn input vectors  are uncorrelated random variables with zero mean and 
unit variance. The corresponding output ( is given by a deterministic teacher 
whose internal structure is the same as for the student network but may differ in 
the number of hidden units. Hidden unit n in the teacher network receives input 
information through the weight vector B, = (B, 1,..., B,y), and its activation 
under presentation of the input pattern  is y = B,  . The corresponding 
output is ( = -,M__l  (B,. ). We will use indices i,j,k,l... to refer to units 
in the student network, and n, m,... for units in the teacher network. 
The error made by a student with weights J on a given input  is given by the 
quadratic deviation 
1 ]2 1 [ K M] 2 
e(a,)_-- [cr(a,)-( =  yg(xi)-yg(y,) (1) 
i=1 n=l 
Performance on a typical input defines the generalization error ca(J ) -- 
< e(J,) >{e) through an average over all possible input vectors , to be per- 
formed implicitly through averages over the activations x = (x,..., x:) and 
Y = (Y,..., YM). Note that both < xi >=< y >= 0; second order correlations are 
given by the overlaps among the weight vectors associated with the various hidden 
units: < xi xk > = Ji  Jk ---- Qik, < xi Yn > = Ji ' Bn ---- lrin, and < y, y, > = 
B  B. -- T,. Averages over x and y are performed using the resulting multivari- 
ate Gaussian probability distribution, and yield an expression for the generalization 
error in terms of the parameters Qi, Rir, and Tm [3]. For g(x) -_- erf(x/v) the 
result is: 
eg(J) = 1 {/ arcsin Qi Trm 
7 v/1 q- Qii x/1 + Q +  arcsin 41 + T x/1 + 
nm 
-2 y arcsin v/1 + Qii x/1 + T,, ' (2) 
in 
The parameters T, are characteristic of the task to be learned and remain fixed. 
The overlaps Qit: and ]in, which characterize the correlations among the various 
student units and their degree of specialization towards the implementation of the 
desired task, are determined by the student weights J and evolve during training. 
A gradient descent rule for the update of the student weights results in J'+ = 
J' +  ' , where the learning rate r/has been scaled with the input size N, and 
r--1 
is defined in terms of both the activation function g and its derivative gr. The time 
evolution of the overlaps ]in and Qit: can be explicitly written in terms of similar 
304 D. SAAD, S. A. SOLLA 
difference equations. In the large N limit the normalized number of examples 
 =/tin can be interpreted as a continuous time variable, leading to the equations 
of motion 
dRin 
= r/<Si y >{} , 
da 
dQi 
= r/< 5i xk >{} +r/< 5k xi >{} +r/2 < 5 5 >{} , (4) 
da 
to be averaged over all possible ways in which an example can be chosen at a given 
time step. The dependence on the current input  is only through the activations 
x and y; the corresponding averages can be performed analytically for g(x) = 
erf(x/v), resulting in a set of coupled first-order differential equations [3]. These 
dynamical equations are exact, and provide a novel tool used here to analyze the 
learning process for a general soft-committee machine with an arbitrary number K 
of hidden units, trained to implement a task defined through a teacher of similar 
architecture except for the number M of hidden units. In what follows we focus on 
uncorrelated teacher vectors of unit length, T, = 
The time evolution of the overlaps tiin and Qit follows from integrating the equa- 
tions of motion (4) from initial conditions determined by a random initialization of 
the student vectors {Ji}l<i<K. Random initial norms Qii for the student vectors 
are taken here from a uniform distribution in the [0, 0.5] interval. Overlaps Qit 
between independently chosen student vectors Ji and J, or tiin between Ji and 
an unknown teacher vector B are small numbers, of order 1/v/-ff for N >> K, M, 
and taken here from a uniform distribution in the [0, 10 -2] interval. 
We show in Fig. la-c the evolution of the overlaps and generalization error for a 
realizable case: K = M = 3 and r/ = 0.1. This example illustrates the succes- 
sive regimes of the learning process. The system quickly evolves into a symmetric 
subspace controlled by an unstable suboptimal solution which exhibits no differenti- 
ation among the various student hidden units. Trapping in the symmetric subspace 
prevents the specialization needed to achieve the optimal solution, and the gener- 
alization error remains finite, as shown by the plateau in Fig. lc. The symmetric 
solution is unstable, and the perturbation introduced through the random initializa- 
tion of the overlaps tiin eventually takes over: the student units become specialized 
and the matrix R of student-teacher overlaps tends towards the matrix T, except 
for a permutational symmetry associated with the arbitrary labeling of the student 
hidden units. The generalization error plateau is followed by a monotonic decrease 
towards zero once the specialization begins and the system evolves towards the 
optimal solution. The evolution of the overlaps and generalization error for the un- 
realizable case K < M is characterized by qualitatively similar stages, except that 
the asymptotic behavior is controlled by a suboptimal solution which reflects the 
differences between student and teacher architectures. 
Curves for the time evolution of the generalization error for different values of 
shown in Fig. ld for K = M -- 3 identify trapping in the symmetric subspace 
as a small r/ phenomenon. We therefore consider the equations of motion (4) in 
the small r/ regime. The term proportional to r/2 is neglected and the resulting 
truncated equations of motion are used to investigate a phase characterized by 
students of similar norms: Qii - Q for all 1 _< i _< K, similar correlations among 
themselves: Qik -- C for all i - k, and similar correlations with the teacher vectors: 
Rin - / for all 1 < i < K, 1 _< n _< M. The resulting dynamical equations exhibit 
a fixed point solution at 
Q* = C* M M - K 2 + x/K 4 - K 
= K 2 2M-1 and R*= (5) 
Dynamics of On-line Gradient Descent Learning for Multilayer Neural Networks 305 
(a) 
0.8-- 
- 0.6- 
0.4- 
o.o/ 
I I 
0 2000 4000 
6000 8000 
(b) 
0.8' 
0.6- 
0.4- 
0.2- 
0.0 
0 
I I 
2000 4000 6000 
8000 
(c) 
(d) 
0.08- 
0.06- 
0.04- 
0.02- 
0.0 
0 
I I I 
2000 4000 6000 8000 
0.08 ........... ?'/o.  
0 2000 4000 6000 
0.06- 
Figure 1: Dependence of the overlaps and the generalization error on the normal- 
ized number of examples a for a three-node student learning a three-node teacher 
characterized by Trm -- 8.. Results for r/= 0.1 are shown for (a) student-student 
overlaps Qik and (b) student-teacher overlaps tin. The generalization error is shown 
in (c), and again in (d) for different values of the learning rate. 
for the general case, which reduces to 
Q* = C* = 1 and R* = QV/' 1 
1 = x/K(5)c- 1) (6) 
in the realizable case (K = M), where the corresponding generalization error is 
given by 
ca- r -Karcsin (7) 
A simple geometrical picture explains the relation Q* = C'* = M(R*) 2 at the 
symmetric fixed point. The learning process confines the student vectors {Ji} to 
the ubspace So spanned by the set of teacher vectors {B.}. For T.. = 6.. 
the teacher vectors form an orthonormal set: B, = e., with e..e. = 6.. for 
1 _< n, m <_ M, and provide an expansion for the weight vectors of the trained 
student: J' = -. tinen. The student-teacher overlaps tin are independent of i in 
the symmetric phase and independent of n for an isotropic teacher: tin = /* for 
all 1 _< i _< K and 1 _< n _< M. The expansion J? = R* -. e. for all i results in 
q* = C'* = M(R*) . 
306 D. SAAD, S. A. SOLLA 
The length of the symmetric plateau is controlled by the degree of asymmetry in the 
initial conditions [2] and by the learning rate r/. The small r/analysis predicts trap- 
ping times inversely proportional to r/, in quantitative agreement with the shrinking 
plateau of Fig. ld. The increase in the height of the plateau with decreasing r/is 
a second order effect, as the truncated equations of motion predict a unique value: 
e = 0.0203 for If = M - 3. The mechanism for the second order effect is re- 
vealed by an examination of Fig. la: the student-student overlaps do agree with 
the prediction C* = 0.2 of the small r/analysis for K = M = 3, but the norms of 
the student vectors remain larger, at Q - Q* + A. The gap A between diagonal 
and off-diagonal elements is observed numerically to increase with increasing r/, and 
is responsible for the excess generalization error. A first order expansion in A at 
R=R*,C=C*,andQ=Q*+Ayields 
K{r ( 1 ) /2K-1 }" 
e a = -- - K arcsin + A (8) 
r  - V2K+I ' 
in agreement with the trend observed in Fig. ld for the realizable case. 
The excess norm A of the student vectors corresponds to a residual component in 
Ji not confined to the subspace $B. The weight vectors of the trained student can 
be written as Ji: R* 5-, e,, + J, with J .e, = 0 for all 1 _< n <_ M. Student 
weight vectors are not constrained to be identical; they differ through orthogonal 
components J/J- which are typically uncorrelated: jL .jL = 0 for i  k. Correlations 
Qik =  C do satisfy C = C* = M(R*) 2, but norms Qii = Q are given by Q = Q*+A, 
with A -II j1112 Learning at very small r/tends to eliminate J- and confine the 
student vectors to $B. 
Escape from the symmetric subspace signals the onset of hidden unit specialization. 
As shown in Fig. lb, the process is driven by a breaking of the uniformity of the 
student-teacher correlations: each student node becomes increasingly specialized to 
a specific teacher node, while its overlap with the remaining teacher nodes decreases 
and eventually decays to its asymptotic value. In the realizable case this asymptotic 
value is zero, while in the unrealizable case two different non-zero asymptotic values 
distinguish weak overlaps with teacher nodes imitated by other student vectors from 
more significant overlaps with those teacher nodes not specifically imitated by any 
of the student vectors. 
The matrix of student-teacher overlaps can no longer be characterized by a unique 
parameter, as we need to distinguish between a dominant overlap R between a 
given student node and the teacher node it begins to imitate, secondary overlaps $ 
between the same student node and the teacher nodes to which other student nodes 
are being assigned, and residual overlaps U with the remaining teacher nodes. The 
student hidden nodes can be relabeled so as to bring the matrix of student-teacher 
overlaps to the form lrin ---- lrin q- $(1 -- 5in)O(K -- n) q- U(1 - O(K - n)), where 
the step function 6} is 0 for negative arguments and I otherwise. The emerging 
differentiation among student vectors results in a decrease of the overlaps Qik -- ' 
for i -fi k, while their norms Qii - Q increase. The matrix of student-student 
overlaps takes the form Qik - QSik q- C(1 - tik ). 
Here we limit our description of the onset of specialization to the realizable case, for 
which Ri. = RSi. +S(1-Si.). The small r/analysis is extended to allow for S - R in 
order to describe the escape from the symmetric subspace. The resulting dynamical 
equations are linearized around the fixed point solution at Q* = C* = 1/(2K- 1) 
and R* = S* = 1/v/K(2K - 1), and the generalization error is expanded around its 
fixed point value (7) to first order in the corresponding deviations q, c, r, and s. The 
analysis identifies a relevant perturbation with q = c = 0 and s = -r/(K-1), which 
Dynamics of On-line Gradient Descent Learning for Multilayer Neural Networks 307 
Figure 2: Dependence of the 
two leading decay eigenvalues on 
the learning rate r/in the realiz- 
able case: h (curved line) and 
he (straight line) are shown for 
M=K--3. 
0.3 
0.2- 
0.1- 
0.0- 
0.0 
0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 
to first order in 1/K. The optimal solution with e* = 0 is not accessible for 
r/ > r/max. Exponential convergence of R, S, C, an Q to their optimal values 
is guaranteed for all learning rates in the range (0, r/max); in this regime the gener- 
alization error decays exponentially to eg -- 0, with a rate controlled by the slowest 
decay mode. An expansion of % in terms of r = 1 - R, s, c, and q = 1 - Q reveals 
that of the leading modes whose eigenvalues are shown in Fig. 2 only the mode as- 
sociated with h contributes to the decay of the linear term, while the decay of the 
second order term is controlled by the mode associated with h2 and dominates the 
convergence if 2,k2 < hi. The learning rate r/opt which guarantees fastest asymptotic 
decay for the generalization error follows from h(r/opt) = 2he(r/opt). 
The asymptotic convergence of unrealizable learning is an intrinsically more com- 
plicated process that cannot be described in closed analytic form. The asymptotic 
leaves the generalization error unchanged and explains the behavior illustrated in 
Fig. la-b. It is the differentiation between R and S which signals the escape from the 
symmetric subspace; the differentiation between Q and C occurs for larger values 
of ct. The relevant perturbation corresponds to an enhancement of the overlap 
R = R* + r between a given student node and the teacher node it is learning to 
imitate, while the overlap S = S* + s between the same student node and the 
remaining teacher nodes is weakened. The time constant associated with this mode 
is r = (7r/2K)(2K - 1)/2(2K + 1) 3/2, with r  2rK in the large K limit. 
It is in the subsequent convergence to an asymptotic solution that the realizable 
and unrealizable cases exhibit fundamental differences. We examine first the real- 
izable scenario, in which the system converges to an optimal solution with perfect 
generalization. 
As the specialization continues, the dominant overlaps R grow, and the secondary 
overlaps S decay to zero. Further specialization involves the decay to zero of the 
student-student correlations C and the growth of the norms Q of the student vectors. 
To investigate the convergence to the optimal solution we linearize the equations 
of motion around the asymptotic fixed point at S* = C* = 0, R* = Q* = 1, 
with  = 0. We describe convergence to the optimal solution by applying the full 
equations of motion (4) to a phase characterized by Ri, = RSir, + S(1 - 5i) and 
Qik -- QSik + C(1 - in). 
Linearization of the full equations of motion around the asymptotic fixed point 
results in four eigenvalues; the dependence of the two largest eigenvalues on r/ is 
shown in Fig. 2 for M - K - 3. An initially slow mode corresponds to the 
eigenvalue h2, which remains negative for all values of r/, while the eigenvalue h 
for the initially fast mode becomes positive as r exceeds rmax, given by 
 75 - 42V (9) 
r/,ax = K 25x/- - 42 
308 D. SAAD, S. A. SOLLA 
values of the order parameters and the generalization error depend on the learn- 
ing rate r/; convergence to an optimal solution with minimal generalization error 
requires r/- 0 as a -* c. Optimal values for the order parameters follow from a 
small r/analysis, equivalent to neglecting jx and assuming student vectors confined 
to $B. The resulting expansion Ji : -nM_-i lrinen, with tii -- t, lrin -- S for 
1 < n < K, n  i, and Rin : U for K + 1 _< n < M, leads to 
Q = R + (x - 1)s 2 + (M - , c: 2Rs + - 2)s (M- c)v 2 . (10) 
The equations of motion for the remaining parameters R, S, and U exhibit a fixed 
point solution which controls the asymptotic behavior. This solution cannot be 
obtained analytically, but numerical results are well approximated to order (1/K 3) 
by 
where L -- M- K. The corresponding fixed point values Q* and C* follow from 
Eq. (10). Note that R* is lower than for the realizable case, and that correlations 
U* (significant) and S* (weaker) between student vectors and the teacher vectors 
they do not imitate are not eliminated. The asymptotic generalization error is given 
by 
, 1 L [4K2(7 r- 3) +4K(2VrJ - 3) + 1] (12) 
ea = 24 K 2 
to order (l/K2). Note its proportionality to the mismatch L between teacher and 
student architectures. 
Learning at fixed and sufficiently small r/ results in exponential convergence to 
an asymptotic solution whose fixed point coordinates are shifted from the values 
discussed above. The solution is suboptimal; the resulting increase in e; from its 
optimal value (12) is easily obtained to first order in r/, and it is also proportional 
to L. We have investigated convergence to the optimal solution (12) for schedules 
of the form r(ct) = r/0/(ct - ct0) z for the decay of the learning rate. A constant rate 
r/0 is used for ct _< ct0; the monotonic decrease of r for ct > ct0 is switched on after 
specialization begins. Asymptotic convergence requires 0 < z _< 1; fastest decay of 
the generalization error is achieved for z = 1/2. 
Specialization as described here and illustrated in Fig.1 is a simultaneous process in 
which each student node acquires a strong correlation with a specific teacher node 
while correlations to other teacher nodes decrease. Such synchronous escape from 
the symmetric phase is characteristic of learning scenarios where the target task is 
defined through an isotropic teacher. In the case of a graded teacher we find that 
specialization occurs through a sequence of escapes from the symmetric subspace, 
ordered according to the relevance of the corresponding teacher nodes [3]. 
Acknowledgement The work was supported by the EU grant CHRX-CT92-0063. 
References 
[1] G. Cybenko, Math. Control Signals and Systems 2, 303 (1989). 
[2] M. Biehl and H. Schwarze, J. Phys. A 28,643 (1995). 
[3] D. Saad and S. A. Solla, Phys. Rev. E, 52, 4225 (1995). 
