Temporal Dynamics of Generalization in 
Neural Networks 
Changfeng Wang 
Department of Systems Engineering 
University Of Pennsylvania 
Philadelphia, PA 19104 
wangpender. ee. upenn. edu 
Santosh S. Venkatesh 
Department of Electrical Engineering 
University Of Pennsylvania 
Philadelphia, PA 19104 
venkat eshee. upenn. edu 
Abstract 
This paper presents a rigorous characterization of how a general 
nonlinear learning machine generalizes during the training process 
when it is trained on a random sample using a gradient descent 
algorithm based on reduction of training error. It is shown, in 
particular, that best generalization performance occurs, in general, 
before the global minimum of the training error is achieved. The 
different roles played by the complexity of the machine class and 
the complexity of the specific machine in the class during learning 
are also precisely demarcated. 
1 INTRODUCTION 
In learning machines such as neural networks, two major factors that affect the 
'goodness of fit' of the examples are network size (complexity) and training time. 
These are also the major factors that affect the generalization performance of the 
network. 
Many theoretical studies exploring the relation between generalization performance 
and machine complexity support the parsimony heuristics suggested by Occam's ra- 
zor, to wit that amongst machines with similar training performance one should opt 
for the machine of least complexity. Multitudinous numerical experiments (cf. [5]) 
suggest, however, that machines of larger size than strictly necessary to explain the 
data can yield generalization performance similar to that of smaller machines (with 
264 Changfeng Wang, Santosh S. Venkatesh 
similar empirical error) if learning is optimally stopped at a critical point before 
the global minimum of the training error is achieved. These results seem to fly in 
contradiction with a learning theoretic interpretation of Occam's razor. 
In this paper, we ask the following question: How does the gradual reduction of 
training error affect the generalization error when a machine of given complexity is 
trained on a finite number of examples? Namely, we study the simultaneous effects 
of machine size and training time on the generalization error when a finite sample 
of examples is available. 
Our major result is a rigorous characterization of how a given learning machine 
generalizes during the training process when it is trained using a learning algorithm 
based on minimization of the empirical error (or a modification of the empirical 
error). In particular, we are enabled to analytically determine conditions for the 
existence of a finite optimal stopping time in learning for achieving optimal general- 
ization. We interpret the results in terms of a time-dependent, effective machine size 
which forms the link between generalization error and machine complexity during 
learning viewed as an evolving process in time. 
Our major results are obtained by introducing new theoretical tools which allow us 
to obtain finer results than would otherwise be possible by direct applications of the 
uniform strong laws pioneered by Vapnik and Cervonenkis (henceforth refered to as 
VC-theory). The different roles played by the complexity of the machine class and 
the complexity of the specific machine in the class during learning are also precisely 
demarcated in our results. 
Since the generalization error is defined in terms of an abstract loss function, the 
results find wide applicability including but not limited to regression (square-error 
loss function) and density estimation (log-likelihood loss) problems. 
2 THE LEARNING PROBLEM 
We consider the problem of learning from examples a relation between two vectors 
x and y determined by a fixed but unknown probability distribution P(x, y). This 
model includes, in particular, the input-output relation described by 
y:g(x,), (1) 
where g is some unknown function of x and , which are random vectors on the same 
probability space. The vector x can be viewed as the input to an unknown system, 
 a random noise term (possibly dependent on x), and y the system's output. 
The hypothesis class from which the learning procedure selects a candidate function 
(hypothesis) approximating g is a parametric family of functions 7-/a: { f(x, O)  
0  Oa C_ 3 a } indexed by a subset Oa of d-dimensional Euclidean space. For 
example, if x 6 3 rn and y is a scalar, 7-/a can be the class of functions computed 
by a feedforward neural network with one hidden layer comprised of h neurons and 
activation function b, viz., 
h rn 
Temporal Dynamics of Generalization in Neural Networks 265 
In the above, d = (rn + 2)h denotes the number of adjustable parameters. 
The goal of learning within the hypothesis class 7-/a is to find the best approx- 
imation of the relation between x and y in 7-/a from a finite set of n examples 
Dn = { (Xl, yl),..., (xn, yn)} drawn by independent sampling from the distribu- 
tion P(x, y). A learning algorithm is simply a map which, for every sample 
(n >_ 1), produces a hypothesis in 7-/a. 
In practical learning situations one first selects a network of fixed structure 
fixed hypothesis class 7-/a), and then determines the "best" weight vector 0* (or 
equivalently, the best function f(x, 0') in this class) using some training algorithm. 
The proximity of an approximation f(x,O) to the target function g(x,) at each 
point x is measured by a loss function q  (f(x,0),g(x,))  3 +. For a given 
hypothesis class, the function f(.,O) is completely determined by the parameter 
vector 0. With g fixed, the loss function may be written, with a slight abuse of 
notation, as a map q(x, y, 0) from 3 m x 3 x O into 3 + . Examples of the forms of loss 
functions are the familiar square-law loss function q(x, y, O) = (g(x, ) - f(z, 0)) 2 
commonly used in regression and learning in neural networks, and the Kulback- 
Leibler distance (or relative entropy) q(x, y, 0) = In p(yl.(*,0)) for density estimation, 
p(ylg(,)) 
where p(y]i(x,O)) denotes the conditional density function of y given i(x,). 
The closeness of f(., ) to g(.) is measured by the expected (ensemble) loss or error 
(O,d)  / q(x,y,O)P(dx, dy). 
The optimal approximation f(., 0') is such that (0', d) = min0eoa (0, d). In 
similar fashion, we define the corresponding empirical loss (or training error) by 
f o) 
n(O,d)- q(x,y,O)Pn(dx, dy) =  . 
i--1 
where Pndenotes the joint empirical distribution of input-output pairs (x,y). 
The global minimum of the empirical error over Oa is denoted by t, namely, 
t = arg min0eo n0, d). An iterative algorithm for minimizing n0, d) (or a modi- 
fication of it) over Oa generates at each epoch t a random vector Ot'l)n - Oa. The 
quantity (Ot,d) = E f q(x, Ot)P(dx, d) is referred to as the generalization error 
of 0t. We are interested in the properties of the process { 0t 't = 1,2,...}, and the 
time-evolution of the sequence { (0t, d) 't = 1, 2,... }. 
Note that each 0t is a functional of Pn When P = Pn learning reduces to an opti- 
mization problem. Deviations from optimality arise intrinsically as a consequence 
of the discrepancy between Pnand P. The central idea of this work is to analyze 
the consequence of the deviation Ang Pn- P on the generalization error. 
To simplify notation, we henceforth suppress d and write simply O, (0) and n(O) 
instead of Oa, (O,d), and n(O, d), respectively. 
2.1 Regularity Conditions 
We will be interested in the local behavior of learning algorithms. Consequently, we 
assume that O is a compact set, and 0* is the unique global minimum of (0) on O. 
266 Changfeng Wang, Santosh S. Venkatesh 
It can be argued that these assumptions are an idealization of one of the following 
situations: 
 A global algorithm is used which is able to find the global minimum of 
n (0), and we are interested in the stage of training when 0t has entered a 
region 0 where 0* is the only global minimum of (0); 
 A local algorithm is used, and the algorithm has entered a region 0 which 
contains 0* as the unique global minimum of (0) or as a unique local 
minimum with which we are content. 
In the sequel, we write 0/00 to denote the gradient operator with respect to the 
r 2 I a 
vector 0, and likewise write 02/002 to denote the matrix of operators [ooioo d J i,j=l' 
In the rest of the development we assume the following regularity conditions: 
A1. The loss function q(x, y, .) is twice continuously differential for all 0  0 and 
for almost all (x, y); 
A2. P(x, y) has compact support; 
A3. The optimal network 0* is an interior point of O; 
A4. The matrix (0') = 2 era. 
,:ov ) is nonsingular. 
These assumptions are typically satisfied in neural network applications. We will 
also assume that the learning algorithm converges to the global minimum of nO) 
over ) (note that t may not be a true global minimum, so the assumption applies to 
gradient descent algorithms which converge locally). It is easy to demonstrate that 
for each such algorithm, there exists an algorithm which decreases the empirical 
error monotonically at each step of iteration. Thus, without loss of generality, we 
also assume that all the algorithms we consider have this monotonicity property. 
3 GENERALIZATION DYNAMICS 
3.1 First Phase of Learning 
The quality of learning based on the minimization of the empirical error depends 
on the value of the quantity sure InO) - (0)1. Under the above assumptions, it 
is shown in [3] that 
(0)-'n(0)%0 llnn 
k, Vrj and (0)=(0')+ 0 1 ---n-n 
Therefore, for any iterative algorithm for minimizing nO), in the initial phase of 
learning the reduction of training error is essentially equivalent to the reduction of 
generalization error. It can be further shown that this situation persists until the 
estimates Ot enter an n -6' neighborhood of t, where 5n-- 1/2. 
The basic tool we have used in arriving at this conclusion is the VC-method. The 
characterization of the precise generalization properties of the machine after Ot 
enters an n - neighborhood of the limiting solution needs a more precise language 
than can be provided by the VC-method, and is the main content of the rest of this 
work. 
Temporal Dynamics of Generalization in Neural Networks 267 
3.2 Learning by Gradient Descent 
In the following, we focus on generalization properties when the machine is trained 
using the gradient descent algorithm (Backpropagation is a Gauss-Seidel implemen- 
tation of this algorithm); in particular, the adaptation is governed by the recurrence 
Ot+l = ot- (t >_ o), (2) 
where the positive quantity e governs the rate of learning. Learning and general- 
ization properties for other algorithms can be studied using similar techniques. 
Replace nby  in (2) and let { 0 F ,t _ 0 ) denote the generated sequence of vectors. 
We can show (though we will not do so here) that the weight vector Ot is asymp- 
totically normally distributed with expectation 0 F and covariance matrix with all 
entries of order O(). It is precisely the deviation of Ot from 0 F caused by the 
perturbation of amount An- Pn - P to the true distribution P which results 
in interesting artifacts such as a finite optimal stopping time when the number of 
examples is finite. 
3.3 The Main Equation of Generalization Dynamics 
Under the regularity conditions mentioned in the last section, we can find the gen- 
eralization error at each epoch of learning as an explicit function of the number of 
iterations, machine parameters, and the initial error. Denote by A1 _ A2 _ ... _ Ad 
the eigenvalues of the matrix (O*) and suppose T is the orthogonal diagonaliz- 
ing matrix for (O*), viz., TqS(O*)T = diag(Al,...,Ad). Set 6 -- (61,...,6d)   
T(0o - 0') and for each i let Yi denote the ith diagonal element of the d x d matrix 
TtE{(oq(x,O*)) (ooq(x,O*))t}T. Also let S(O,p) denote the open ball of radius p 
at 0. 
MAIN THEOREM Under Assumptions A1-Ad, the generalization error of the ma- 
chine trained according to (2) is governed by the following equation for all starting 
points 0o  $(0', n -r) (0 < r _ ), and uniformly for all t _ O: 
d 
= + 
..._ 
2 + ,,(1 - e,,) 2t} + O(n-a"). (3) 
If Oo f $(O*,n-), then the generalization dynamics is governed by the following 
equation valid for all r > O: 
d 
i=1 i 
1 -eAi)t+Ci(tl)(1 - eAi) TM 5}+O(n-r), (4) 
where tl is the smallest t such that E I[I- = An -r for some A > O, and 
C(tl), C(tl) >_ 0 are constants depending on network parameters and tl. 
In the special case when the data is generated by the following additive noise model 
y=g(x)+, (5) 
268 Changfeng Wang, Santosh S. Venkatesh 
with E [1 x] : 0, and E [2[x] : 2: constant, if g(x): f(x,O*) and the loss 
function q(x, O) is given by the square-error loss function, the above equation reduces 
to the following form: 
(T 2 d 
= 
i=1 
+ - + O(n-3r). 
In particular, if f(x, O) is linear in O, we obtain our previous result [4] for linear 
machines. The result (3) is hence a substantive extension of the earlier result to 
very general settings. It is noted that the extension goes beyond nonlinearity and 
the original additive noise data generating model--we no longer require that the 
'true' model be contained in the hypothesis class. 
3.4 Effective Complexity 
Write ci = 1/i/hi. The effective complexity of the nonlinear machine Ot at t is defined 
to be 
d 
i=1 
Analysis shows that the term ci indicates the level of sensitivity of output of the 
machine to the ith component of the normalized weight vector, 0; C(O*, d, t) denotes 
the degree to which the approximation power of the machine is invoked by the 
learning process at epoch t. Indeed, as t - 0% C(O*,d,t) - Ca = E{=i ci, which 
is the complexity of the limiting machine t which represents the maximal fitting 
of examples to the machine (i.e., minimized training error). For the additive noise 
data generating model (5) and square-error loss function, the effective complexity 
becomes, 
d 
(1- 
i=1 
The sum can be interpreted as the effective number of parameters used at epoch t. 
At the end of training, it becomes exactly the number of parameters of the machine. 
Now write 07 g 0* +(00-0')(1- eA) t. With these definitions, (3) can be rewritten 
to give the following approximation error and complexity error decomposition of 
generalization error in the learning process: 
C(O*,a,t) 
e(0) = s(0;)+ 2n + O(n -3r) (t >_ 0). (6) 
The first term on the right-hand-side, (0), denotes the approximation error at 
epoch t and is the error incurred in using 07 as an approximation of the 'truth.' Note 
that the approximation error depends on time t and the initial value 00, but not 
the examples. Clearly, it is the error one would obtain at epoch t in minimizing the 
function (0) (as opposed to nO)) using the same learning algorithm and starting 
with the same step length e and initial value 00. The second term on the right- 
hand-side is the complexity error at epoch t. This is the part of the generalization 
error at t due to the substitution of nO) for (0). 
Temporal Dynamics of Generalization in Neural Networks 269 
The overfitting phenomena in learning is often intuitively attributed to the 'fitting 
of noise.' We see that is only partly correct: it is in fact due to the increasing use 
of the capacity of the machine, that the complexity penalty becomes increasingly 
large, this being true even when the data is clean, i.e., when  _= 0! Therefore, 
we see that (6) gives an exact trade-off of the approximation error and complexity 
error in the learning process. 
For the case of large initial error, we see from the main theorem that the complexity 
error is essentially the same as that at the end of training, when the initial error 
is reduced to about the same order as before. The reduction of the training error 
leads to monotone decrease in generalization error in this case. 
3.5 Optimal Stopping Time 
We can phrase the following succinct open problem in learning in neural networks: 
When should learning be ideally stopped? The question was answered for linear 
machines which is a special form of neural networks in [4]. This section extends the 
result to general nonlinear machines (including neural networks) in regular cases. 
For this purpose, we write the generalization error in the following form: 
= + 6(t) + O(n-3r). 
where 
d 
() ---- (T2 E(li(1- )i) 2t -- di(1- (,)li)  }, 
i=1 
and di and li are machine parameters. The time-evolution of generalization error 
during the learning process is completely determined by the function q(t). 
Define min  { T __ 0' (0r) __ (Ot) for all  _ 0 }, that is tmin denotes an epoch 
at which the generalization error is minimized. The smallest such number will be 
referred to as the optimal stopping time of learning. In general we have ci > 0 for 
all i. In this case, it is possible to determine that there is a finite optimal stopping 
time. More specifically, there exists two constants tl and tu which depend on the 
machine parameters such that tl _< train __ tu. Furthermore, it can be shown that 
the function qb(t) decreases monotonically for t _< tt and increases monotonically 
for all t >_ tu. Finally, we can relate the generalization performance when learning 
is optimally stopped to the best achievable performance by means of the following 
inequality: 
(Otmin ) _< (0') -Jr- 2n ' 
where n = O(n ) is a constant depending on li and di's, and is in the interval (0, ], 
and Ca denotes, as before, the limiting value of the effective machine complexity 
C(O*,d,t) as t - 
In the pathological case where there exists i such that ci = O, there may not exist 
a finite optimal stopping time. However, even in such cases, it can be shown that 
if ln(1 - fax)/ln(1 - eAa) < 2, a finite optimal stopping time still exists. 
270 Changfeng Wang, Santosh S. Venkatesh 
4 CONCLUDING REMARKS 
This paper describes some major results of our recent work on a rigorous characteri- 
zation of the generalization process in neural network types of learning machines. In 
particular, we have shown that reduction of training error may not lead to improved 
generalization performance. Two major techniques involved are the uniform weak 
law (VC-theory) and differentiable statistical functionals, with the former deliver- 
ing an initial estimate, and the latter giving finer results. The results shows that 
the complexity (e.g. VC-dimension) of a machine class does not suffice to describe 
the r61e of machine complexity in generalization during the learning process; the 
appropriate complexity notion required is a time-varying and algorithm-dependent 
concept of effective machine complexity. 
Since results in this work contain parameters which are typically unknown, they 
cannot be used directly in practical situations. However, it is possible to frame 
criteria overcoming such difficulties. More details of the work described here and 
its extensions and applications can be found in [3]. The methodology adopted here 
is also readily adapted to study the dynamical effect of regularization on the learning 
process [3]. 
Acknowledgements 
This research was supported in part by the Air Force Office of Scientific Research 
under grant F49620-93-1-0120. 
References 
[1] Kolmogorov, A. and V. Tihomirov (1961). e-entropy and e-capacity of sets in 
functional spaces. Amer. Math. Soc. Trans. (Set. ), 17:277-364. 
[2] Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. 
Springer-Verlag, New York. 
[3] Wang, C. (1994). A Theory of Generalization in Learning Machines. rh. D. 
Thesis, University of Pennsylvania. 
[4] Wang, C., S.S. Venkatesh, and J. S. Judd (1993). Optimal stopping and effec- 
tive machine size in learning. Proceedings of NIPS'93. 
[5] Weigend, A. (1993). On overtraining and the effective number of hidden units. 
Proceedings of the 1993 Connectionist Models Summer School. 335-342. Ed. 
Mozer, M. C. et al. Hillsdale, NJ: Erlbaum Associates. 
