Geometry of Early Stopping in Linear Networks
Robert Dodier * 
Dept. of Computer Science 
University of Colorado 
Boulder, CO 80309 
Abstract 
A theory of early stopping as applied to linear models is presented. 
The backpropagation learning algorithm is modeled as gradient 
descent in continuous time. Given a training set and a validation 
set, all weight vectors found by early stopping must lie on a cer- 
tain quadric surface, usually an ellipsoid. Given a training set and 
a candidate early stopping weight vector, all validation sets have 
least-squares weights lying on a certain plane. This latter fact can 
be exploited to estimate the probability of stopping at any given 
point along the trajectory from the initial weight vector to the least- 
squares weights derived from the training set, and to estimate the 
probability that training goes on indefinitely. The prospects for 
extending this theory to nonlinear models are discussed. 
1 INTRODUCTION
'Early stopping' is the following training procedure: 
Split the available data into a training set and a "validation" set. 
Start with initial weights close to zero. Apply gradient descent 
(backpropagation) on the training data. If the error on the valida- 
tion set increases over time, stop training. 
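The procedure above can be sketched in a few lines for a linear model. This is a minimal illustration, not the author's code: it uses discrete gradient steps with a small learning rate `lr` (an assumed parameter) in place of the continuous-time model analyzed later, and the function names are hypothetical.

```python
def sq_error(X, y, w):
    """One-half the sum of squared errors of the linear model y ~ w.x."""
    return 0.5 * sum((sum(wj * xj for wj, xj in zip(w, x)) - yi) ** 2
                     for x, yi in zip(X, y))

def gradient(X, y, w):
    """Gradient of sq_error with respect to the weights w."""
    g = [0.0] * len(w)
    for x, yi in zip(X, y):
        r = sum(wj * xj for wj, xj in zip(w, x)) - yi   # residual for this observation
        for j, xj in enumerate(x):
            g[j] += r * xj
    return g

def early_stopping(Xt, yt, Xv, yv, w0, lr=1e-3, max_steps=100_000):
    """Gradient descent on the training set; stop when validation error rises.

    Returns (weights, steps taken, stopped_early).
    """
    w = list(w0)
    val = sq_error(Xv, yv, w)
    for step in range(max_steps):
        g = gradient(Xt, yt, w)
        w_next = [wj - lr * gj for wj, gj in zip(w, g)]
        val_next = sq_error(Xv, yv, w_next)
        if val_next > val:
            return w, step, True          # keep the weights from before the rise
        w, val = w_next, val_next
    return w, max_steps, False            # validation error never rose
```

With a one-dimensional example whose training least-squares weight is 1 and whose validation least-squares weight is 0.5, starting from zero the loop halts close to 0.5, where the validation error is smallest along the path.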
This training method, as applied to neural networks, is of relatively recent origin. 
The earliest references include Morgan and Bourlard [4] and Weigend et al. [7]. 
*Address correspondence to: dodier@cs.colorado.edu
Finnoff et al. [2] studied early stopping empirically. While the goal of a theory of 
early stopping is to analyze its application to nonlinear approximators such as sig- 
moidal networks, this paper will deal mainly with linear systems and only marginally 
with nonlinear systems. Baldi and Chauvin [1] and Wang et al. [6] have also ana- 
lyzed linear systems. 
The main result of this paper can be summarized as follows. It can be shown 
(see Sec. 5) that the most probable stopping point on a given trajectory (fixing 
the training set and initial weights) is the same no matter what the size of the 
validation set. That is, the most probable stopping point (considering all possible 
validation sets) for a finite validation set is the same as for an infinite validation 
set. (If the validation data is unlimited, then the validation error is the same as the 
true generalization error.) However, for finite validation sets there is a dispersion 
of stopping points around the best (most probable and least generalization error) 
stopping point, and this increases the expected generalization error. See Figure 1 
for an illustration of these ideas. 
2 MATHEMATICAL PRELIMINARIES 
In what follows, backpropagation will be modeled as a process in continuous time. 
This corresponds to letting the learning rate approach zero. This continuum model 
simplifies the necessary algebra while preserving the important properties of early 
stopping. Let the inputs be denoted $X = (x_{ij})$, so that $x_{ij}$ is the j'th component of
the i'th observation; there are p components of each of the n observations. Likewise,
let $y = (y_i)$ be the (scalar) outputs observed when the inputs are $X$. Our regression
model will be a linear model, $y_i = w'x_i + \epsilon_i$, $i = 1, \ldots, n$. Here $\epsilon_i$ represents
independent, identically distributed (i.i.d.) Gaussian noise, $\epsilon_i \sim N(0, \sigma^2)$. Let
$E(w) = \frac{1}{2}\|Xw - y\|^2$ be one-half the usual sum of squared errors.
The error gradient with respect to the weights is $\nabla E(w) = w'X'X - y'X$. The
backprop algorithm is modeled as $\dot{w} = -\nabla E(w)$. The least-squares solution, at
which $\nabla E(w) = 0$, is $w_* = (X'X)^{-1}X'y$. Note the appearance here of the
input correlation matrix, $X'X = (\sum_{k=1}^{n} x_{ki}x_{kj})$. The properties of this matrix
determine, to a large extent, the properties of the least-squares solutions we find. It
turns out that as the number of observations n increases without bound, the matrix
$\frac{1}{n}(X'X)$ converges with probability one to the population covariance matrix of
the inputs. We will find that the correlation matrix plays an important role in the
analysis of early stopping.
We can rewrite the error E using a diagonalization of the correlation matrix $X'X = S\Lambda S'$.
Omitting a few steps of algebra,
$$E(w) = \frac{1}{2}\sum_{k=1}^{p}\lambda_k v_k^2 + \frac{1}{2}\,y'(y - Xw_*) \tag{1}$$
where $v = S'(w - w_*)$, $w_*$ is the least-squares solution, and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$.
In this sum we see that the magnitude of the k'th term is proportional to the corresponding
characteristic value, so moving w toward $w_*$ in the direction corresponding to the largest
characteristic value yields the greatest reduction of error. Likewise, moving in the direction
corresponding to the smallest characteristic value gives the least reduction of error.
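The omitted algebra can be filled in along the following lines. This is a sketch of one possible route to Eq. 1, not necessarily the author's. It assumes only the normal equations $X'(Xw_* - y) = 0$ satisfied by the least-squares solution $w_*$ and the diagonalization $X'X = S\Lambda S'$ with orthogonal $S$.

```latex
\begin{align*}
E(w) &= \tfrac{1}{2}\,\|X(w - w_*) + (Xw_* - y)\|^2 \\
     &= \tfrac{1}{2}(w - w_*)'X'X(w - w_*)
        + (w - w_*)'X'(Xw_* - y)
        + \tfrac{1}{2}\,\|Xw_* - y\|^2 \\
     &= \tfrac{1}{2}(w - w_*)'S\Lambda S'(w - w_*)
        + \tfrac{1}{2}\,y'(y - Xw_*) \\
     &= \tfrac{1}{2}\sum_{k=1}^{p}\lambda_k v_k^2
        + \tfrac{1}{2}\,y'(y - Xw_*),
        \qquad v = S'(w - w_*).
\end{align*}
```

The cross term in the second line vanishes by the normal equations, and the same identity gives $\|Xw_* - y\|^2 = y'(y - Xw_*) - w_*'X'(y - Xw_*) = y'(y - Xw_*)$.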
So far, we have implicitly considered only one set of data; we have assumed all data
is used for training. Now let us distinguish training data, $X_t$ and $y_t$, from validation
data, $X_v$ and $y_v$; there are $n_t$ training and $n_v$ validation observations. Each set of
data has its own least-squares weight vector, $w_t$ and $w_v$, and its own error gradient,
$\nabla E_t(w)$ and $\nabla E_v(w)$. Also define $M_t = X_t'X_t$ and $M_v = X_v'X_v$ for convenience.
The early stopping method can be analyzed in terms of these pairs of matrices,
gradients, and least-squares weight vectors.
3 THE MAGIC ELLIPSOID 
Consider the early stopping criterion, $\frac{dE_v}{dt}(w) = 0$. Applying the chain rule,
$$\frac{dE_v}{dt} = \frac{dE_v}{dw}\,\frac{dw}{dt} = \nabla E_v \cdot (-\nabla E_t), \tag{2}$$
where the last equality follows from the definition of gradient descent. So the early 
stopping criterion is the same as saying 
$$\nabla E_t \cdot \nabla E_v = 0, \tag{3}$$
that is, at an early stopping point, the training and validation error gradients are 
perpendicular, if they are not zero. 
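As a numerical illustration of Eq. 2, the sketch below compares the closed-form rate of change of the validation error, the negated dot product of the two gradients, with a finite-difference estimate taken along one small gradient step on the training error. The helper names and the step size `dt` are assumptions made for this example.

```python
def sq_error(X, y, w):
    """One-half the sum of squared errors of the linear model y ~ w.x."""
    return 0.5 * sum((sum(wj * xj for wj, xj in zip(w, x)) - yi) ** 2
                     for x, yi in zip(X, y))

def gradient(X, y, w):
    """Gradient of sq_error with respect to the weights w."""
    g = [0.0] * len(w)
    for x, yi in zip(X, y):
        r = sum(wj * xj for wj, xj in zip(w, x)) - yi
        for j, xj in enumerate(x):
            g[j] += r * xj
    return g

def validation_error_rate(Xt, yt, Xv, yv, w):
    """dE_v/dt under gradient flow on the training error (Eq. 2)."""
    gt, gv = gradient(Xt, yt, w), gradient(Xv, yv, w)
    return -sum(a * b for a, b in zip(gv, gt))

def finite_difference_rate(Xt, yt, Xv, yv, w, dt=1e-6):
    """Change in validation error over one small step along -grad E_t, divided by dt."""
    gt = gradient(Xt, yt, w)
    w_next = [wj - dt * gj for wj, gj in zip(w, gt)]
    return (sq_error(Xv, yv, w_next) - sq_error(Xv, yv, w)) / dt
```

A negative rate means the validation error is still falling at `w`; early stopping halts where the rate reaches zero, which is the perpendicularity condition of Eq. 3.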
Consider now the set of all points in the weight space such that the training and 
validation error gradients are perpendicular. These are the points at which early 
stopping may stop. It turns out that this set of points has an easily described shape. 
The condition given by Eq. 3 is equivalent to 
$$0 = \nabla E_t \cdot \nabla E_v = (w - w_t)'M_t M_v(w - w_v). \tag{4}$$
Note that all correlation matrices are symmetric, so $M_t'M_v = M_t M_v$. We see that
Eq. 4 gives a quadratic form. Let us put Eq. 4 into a standard form. Toward this 
end, let us define some useful terms. Let 
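$$M = M_t M_v, \tag{5}$$
$$\tilde{M} = \tfrac{1}{2}(M + M') = \tfrac{1}{2}(M_t M_v + M_v M_t), \tag{6}$$
$$\bar{w} = \tfrac{1}{2}(w_t + w_v), \tag{7}$$
$$\Delta w = w_t - w_v, \tag{8}$$
and
$$\hat{w} = -\tfrac{1}{4}\tilde{M}^{-1}(M - M')\Delta w. \tag{9}$$
Now an important result can be stated. The proof is omitted.
Proposition 1. $\nabla E_t \cdot \nabla E_v = 0$ is equivalent to
$$(w - \bar{w} - \hat{w})'\tilde{M}(w - \bar{w} - \hat{w}) = \tfrac{1}{4}\Delta w'\left[\tilde{M} + \tfrac{1}{4}(M - M')'\tilde{M}^{-1}(M - M')\right]\Delta w. \quad \Box \tag{10}$$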
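Although the proof is omitted in the source, the equivalence can be checked by completing the square. The following is a sketch of one way to do it, under the assumption that $\tilde{M} = \frac{1}{2}(M + M')$ is invertible; the shorthand $u$ and $d$ is introduced here only for the derivation, and $\hat{w}$ is the offset of Eq. 9.

```latex
\begin{align*}
0 &= (w - w_t)'M(w - w_v) = (u - d)'M(u + d),
     \qquad u = w - \bar{w},\quad d = \tfrac{1}{2}\Delta w \\
  &= u'\tilde{M}u + u'(M - M')d - d'\tilde{M}d \\
  &= (u - \hat{w})'\tilde{M}(u - \hat{w}) - \hat{w}'\tilde{M}\hat{w} - d'\tilde{M}d,
     \qquad \hat{w} = -\tfrac{1}{2}\tilde{M}^{-1}(M - M')d
       = -\tfrac{1}{4}\tilde{M}^{-1}(M - M')\Delta w .
\end{align*}
```

Moving the two constant terms to the other side gives $(u - \hat{w})'\tilde{M}(u - \hat{w}) = \frac{1}{4}\Delta w'\tilde{M}\Delta w + \frac{1}{16}\Delta w'(M - M')'\tilde{M}^{-1}(M - M')\Delta w$, which is Eq. 10. The second line uses $u'Mu = u'\tilde{M}u$ and $d'Mu = u'M'd$.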
The matrix $\tilde{M}$ of the quadratic form given by Eq. 10 is "usually" positive definite.
As the number of observations $n_t$ and $n_v$ of training and validation data increase
without bound, $\tilde{M}$ converges to a positive definite matrix. In what follows it will
always be assumed that $\tilde{M}$ is indeed positive definite. Given this, the locus defined
by $\nabla E_t \cdot \nabla E_v = 0$ is an ellipsoid. The centroid is $\bar{w} + \hat{w}$ (the midpoint of
Eq. 7 shifted by the offset of Eq. 9), the orientation is determined by
the characteristic vectors of $\tilde{M}$, and the length of the k'th semiaxis is $\sqrt{c/\lambda_k}$, where
c is the constant on the righthand side of Eq. 10 and $\lambda_k$ is the k'th characteristic
value of $\tilde{M}$.
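The ellipsoid can be made concrete in two dimensions. The sketch below is an illustration under assumed inputs (any symmetric 2x2 matrices `Mt`, `Mv` and weight vectors `wt`, `wv`); it builds the centre, the matrix of the quadratic form, and the constant c of Eq. 10, and computes the semi-axis lengths from the characteristic values. Since both least-squares weight vectors satisfy Eq. 4 trivially, they should lie on the ellipse, which gives a simple check.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def matvec(A, v):
    return [A[i][0] * v[0] + A[i][1] * v[1] for i in range(2)]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def inv2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def magic_ellipse(Mt, Mv, wt, wv):
    """Return (centre, M_tilde, c) of the quadric in Eq. 10 for a 2-D problem."""
    M = matmul(Mt, Mv)                                              # Eq. 5
    M_tilde = [[0.5 * (M[i][j] + M[j][i]) for j in range(2)] for i in range(2)]  # Eq. 6
    A = [[M[i][j] - M[j][i] for j in range(2)] for i in range(2)]   # M - M'
    dw = [wt[0] - wv[0], wt[1] - wv[1]]                             # Eq. 8
    A_dw = matvec(A, dw)
    Minv_A_dw = matvec(inv2(M_tilde), A_dw)
    w_hat = [-0.25 * x for x in Minv_A_dw]                          # Eq. 9
    centre = [0.5 * (wt[i] + wv[i]) + w_hat[i] for i in range(2)]
    c = 0.25 * (dot(dw, matvec(M_tilde, dw)) + 0.25 * dot(A_dw, Minv_A_dw))
    return centre, M_tilde, c

def quad_form(w, centre, M_tilde):
    """Left-hand side of Eq. 10 evaluated at w."""
    u = [w[0] - centre[0], w[1] - centre[1]]
    return dot(u, matvec(M_tilde, u))

def semi_axes(M_tilde, c):
    """Semi-axis lengths sqrt(c / lambda_k) for a symmetric positive definite 2x2 matrix."""
    a, b, d = M_tilde[0][0], M_tilde[0][1], M_tilde[1][1]
    mean, radius = 0.5 * (a + d), math.hypot(0.5 * (a - d), b)
    return [math.sqrt(c / (mean + radius)), math.sqrt(c / (mean - radius))]
```

The closed-form characteristic values used in `semi_axes` are specific to the 2x2 case; a general-purpose eigenvalue routine would be needed in higher dimensions.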
4 THE MAGIC PLANE 
Given the least-squares weight vector $w_t$ derived from the training data and a
candidate early stopping weight vector $w_{es}$, any least-squares weight vector $w_v$
from a validation set must lie on a certain plane, the 'magic plane.' The proof of
this statement is omitted.
Proposition 2. The condition that $w_t$, $w_v$, and $w_{es}$ all lie on the magic ellipsoid,
$$(w_t - \bar{w} - \hat{w})'\tilde{M}(w_t - \bar{w} - \hat{w}) = (w_v - \bar{w} - \hat{w})'\tilde{M}(w_v - \bar{w} - \hat{w}) = (w_{es} - \bar{w} - \hat{w})'\tilde{M}(w_{es} - \bar{w} - \hat{w}) = c, \tag{11}$$
with c the right-hand side of Eq. 10, implies
$$(w_t - w_{es})'M w_v = (w_t - w_{es})'M w_{es}. \quad \Box \tag{12}$$
This shows that $w_v$ lies on a plane, the magic plane, with normal $M'(w_t - w_{es})$.
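One short way to see Eq. 12, offered as a sketch rather than as the omitted proof: by Proposition 1, lying on the magic ellipsoid is the same as satisfying Eq. 4, so evaluating Eq. 4 at $w = w_{es}$ and rearranging gives the plane directly.

```latex
\begin{align*}
0 &= (w_{es} - w_t)'M(w_{es} - w_v) \\
\Longleftrightarrow\quad
(w_t - w_{es})'M\,w_v &= (w_t - w_{es})'M\,w_{es} .
\end{align*}
```

Read as a condition on $w_v$ with $w_t$, $w_{es}$ and $M$ held fixed, this is linear in $w_v$, which is why the locus is a plane.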
The reader will note a certain difficulty here, namely that $M = M_t M_v$ depends on
the particular validation set used, as does $w_v$. However, we can make progress by
considering only a fixed correlation matrix $M_v$ and letting $w_v$ vary. Let us suppose
the input vectors $x = (x_1, x_2, \ldots, x_p)$ are i.i.d. Gaussian with mean zero and
some covariance $\Sigma$. (Here the inputs are random but they are observed exactly, so
the error model $y = w'x + \epsilon$ still applies.) Then
$$\mathbb{E}(M_v) = \mathbb{E}(X_v'X_v) = n_v\Sigma,$$
so in Eq. 12 let us replace $M_v$ with its expected value $n_v\Sigma$. That is, we can
approximate Eq. 12 with
$$(w_t - w_{es})'M_t\Sigma\,w_v = (w_t - w_{es})'M_t\Sigma\,w_{es}. \tag{13}$$
Now consider the probability that a particular point $w(t)$ on the trajectory from
$w(0)$ to $w_t$ is an early stopping point, that is, $\nabla E_t(w(t)) \cdot \nabla E_v(w(t)) = 0$. This is
exactly the probability that Eq. 12 is satisfied, and approximately the probability
that Eq. 13 is satisfied. This latter approximation is easy to calculate: it is the
mass of an infinitesimally thin slab cutting through the distribution of least-squares
validation weight vectors. Given the usual additive noise model $y = w'x + \epsilon$ with $\epsilon$
being i.i.d. Gaussian distributed noise with mean zero and variance $\sigma^2$, the
least-squares weights $w$ are approximately distributed about the true weights $w^*$ as
$$w - w^* \sim N(0, \sigma^2(X'X)^{-1}) \tag{14}$$
when the number of data is large.
Consider now the plane $\Pi = \{w : w'\hat{n} = k\}$. The probability mass on this plane as
it cuts through a Gaussian distribution $N(\mu, C)$ is then
$$p_\Pi(k, \hat{n}) = (2\pi\,\hat{n}'C\hat{n})^{-1/2}\exp\left(-\frac{1}{2}\,\frac{(k - \hat{n}'\mu)^2}{\hat{n}'C\hat{n}}\right)ds \tag{15}$$
where ds denotes an infinitesimal arc length. (See, for example, Sec. VIII-9.3 of 
von Mises [3].) 
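Eq. 15 is the density of the scalar projection of a Gaussian vector onto the plane normal, evaluated at the offset k. A minimal sketch, with hypothetical names and without the arc-length factor ds:

```python
import math

def plane_density(k, n_hat, mu, C):
    """Density of n_hat'w at the value k when w ~ N(mu, C)."""
    p = len(n_hat)
    mean = sum(n_hat[i] * mu[i] for i in range(p))                       # n_hat' mu
    var = sum(n_hat[i] * C[i][j] * n_hat[j] for i in range(p) for j in range(p))  # n_hat' C n_hat
    return math.exp(-0.5 * (k - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
```

For a unit normal along the first axis and a standard Gaussian, `plane_density(0.0, [1.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` reduces to the standard normal density at zero.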
[Plot omitted; horizontal axis: arc length along trajectory.]
Figure 1: Histogram of early stopping points along a trajectory, with bins of equal
arc length. An approximation to the probability of stopping (Eq. 16) is superimposed.
Altogether 1000 validation sets were generated for a certain training set; of
these, 288 gave "don't start" solutions, 701 gave early stopping solutions (which are
binned here) somewhere on the trajectory, and 11 gave "don't stop" solutions.
5 PROBABILITY OF STOPPING AT A GIVEN POINT 
Let us apply Eq. 15 to the problem at hand. Our normal is $\hat{n} = n_v\Sigma M_t(w_t - w_{es})$
and the offset is $k = \hat{n}'w_{es}$. A formal statement of the approximation of $p_\Pi$ can
now be made.
Proposition 3. Assuming the validation correlation matrix $X_v'X_v$ equals the mean
correlation matrix $n_v\Sigma$, the probability of stopping at a point $w_{es} = w(t)$ on the
trajectory from $w(0)$ to $w_t$ is approximately
$$p_\Pi(t) = p_\Pi(k(t), \hat{n}(t)) = (2\pi\,\hat{n}'C\hat{n})^{-1/2}\exp\left(-\frac{1}{2}\,\frac{(k - \hat{n}'w^*)^2}{\hat{n}'C\hat{n}}\right), \tag{16}$$
with
$$\hat{n}'C\hat{n} = n_v\sigma^2(w_t - w_{es})'M_t\Sigma M_t(w_t - w_{es}). \quad \Box \tag{17}$$
How useful is this approximation? Simulations were carried out in which the initial 
weight vector $w(0)$ and the training data ($n_t = 20$) were fixed, and many validation
sets of size $n_v = 20$ were generated (without fixing $X_v'X_v = n_v\Sigma$). The trajectory was
divided into segments of equal length and histograms of the number of early stopping 
weights on each segment were constructed. A typical example is shown in Figure 1. 
It can be seen that the empirical histogram is well-approximated by Eq. 16. 
If for some $w(t)$ on the trajectory the magic plane cuts through the true weights
$w^*$, then $p_\Pi$ will have a peak at t. As the number of validation data $n_v$ increases,
the variance of $w_v$ decreases and the peak narrows, but the position $w(t)$ of the
peak does not move. As $n_v \to \infty$ the peak becomes a spike at $w(t)$. That is, the
peak of $p_\Pi$ for a finite validation set is the same as if we had access to the true
generalization error. In this sense, early stopping does the right thing.
It has been observed that when early stopping is employed, the validation error 
may decrease forever and never rise - thus the 'early stopping' procedure yields the 
least-squares weights. How common is this phenomenon? Let us consider a fixed 
training set and a fixed initial weight vector, so that the trajectory is fixed. Letting
the validation set range over all possible realizations, let us denote by $P_\Pi(t) =
P_\Pi(k(t), \hat{n}(t))$ the probability that training stops at time t or later. $1 - P_\Pi(0)$ is the
probability that validation error rises immediately upon beginning training, and let
us agree that $P_\Pi(\infty)$ denotes the probability that validation error never increases.
This $P_\Pi(t)$ is approximately the mass that lies beyond the plane $\hat{n}'w_v = \hat{n}'w_{es}$,
"beyond" meaning the points $w_v$ such that $(w_v - w_{es})'\hat{n} > 0$; by Eqs. 2 and 4 these are
the validation weights for which the validation error is still decreasing at $w_{es}$. (The
identification of $P_\Pi$ with the mass to one side of the plane is not exact because
intersections of magic planes are ignored.) As Eq. 15 has the form of a Gaussian p.d.f.,
it is easy to show that
$$P_\Pi(k, \hat{n}) = G\left(\frac{\hat{n}'w^* - k}{(\hat{n}'C\hat{n})^{1/2}}\right) \tag{18}$$
where G denotes the standard Gaussian c.d.f., $G(z) = (2\pi)^{-1/2}\int_{-\infty}^{z}\exp(-t^2/2)\,dt$.
Recall that we take the normal $\hat{n}$ of the magic plane through $w_{es}$ as
$\hat{n} = n_v\Sigma M_t(w_t - w_{es})$. For t = 0 there is no problem with Eq. 18 and an approximation for the
"never-starting" probability is stated in the next proposition.
Proposition 4. The probability that validation error increases immediately upon
beginning training ("never starting"), assuming the validation correlation matrix
$X_v'X_v$ equals the mean correlation matrix $n_v\Sigma$, is approximately
$$1 - P_\Pi(0) = 1 - G\left(\frac{\sqrt{n_v}}{\sigma}\,\frac{(w_t - w(0))'M_t\Sigma\,(w^* - w(0))}{[(w_t - w(0))'M_t\Sigma M_t(w_t - w(0))]^{1/2}}\right). \quad \Box \tag{19}$$
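The never-starting probability can be evaluated numerically once the plane normal is fixed. The sketch below assumes the reading used in Section 5, namely a normal proportional to $\Sigma M_t(w_t - w_{es})$ and validation weights distributed as $N(w^*, \sigma^2(n_v\Sigma)^{-1})$; the function names are hypothetical, and the standard Gaussian c.d.f. is computed from `math.erf`.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def std_normal_cdf(z):
    """Standard Gaussian c.d.f. G(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def continue_probability(Mt, Sigma, wt, w_es, w_true, sigma, nv):
    """Approximate probability that validation error is still decreasing at w_es."""
    d = [a - b for a, b in zip(wt, w_es)]
    u = matvec(Mt, d)                     # M_t (w_t - w_es)
    s = matvec(Sigma, u)                  # Sigma M_t (w_t - w_es): plane normal up to the factor n_v
    num = dot(s, [a - b for a, b in zip(w_true, w_es)])
    den = sigma * math.sqrt(dot(u, s))    # sigma * [(w_t - w_es)' M_t Sigma M_t (w_t - w_es)]^(1/2)
    return std_normal_cdf(math.sqrt(nv) * num / den)

def never_start_probability(Mt, Sigma, wt, w0, w_true, sigma, nv):
    """Approximate probability that validation error rises immediately at w(0)."""
    return 1.0 - continue_probability(Mt, Sigma, wt, w0, w_true, sigma, nv)
```

With identity matrices for `Mt` and `Sigma`, `w0` at the origin, `wt` and `w_true` both equal to `[1.0, 0.0]`, unit noise and four validation points, the argument of G is 2, so the never-starting probability is the upper tail of the standard Gaussian beyond two standard deviations.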
With similar arguments we can develop an approximation to the "never-stopping" 
probability. 
Proposition 5. The probability that training continues indefinitely ("never stopping"),
assuming the validation correlation matrix $X_v'X_v$ equals the mean correlation
matrix $n_v\Sigma$, is approximately
$$P_\Pi(\infty) = G\left(\frac{\sqrt{n_v}}{\sigma}\,\frac{(w^* - w_t)'\Sigma M_t(\pm s^*)}{[(s^*)'M_t\Sigma M_t s^*]^{1/2}}\right). \tag{20}$$
In Eq. 20 pick $+s^*$ if $(w_t - w(0))'s^* > 0$, otherwise pick $-s^*$. $\Box$
Simulations are in good agreement with the estimates given by Propositions 4 and 
5. 
6 EXTENDING THE THEORY TO NONLINEAR SYSTEMS
It may be possible to extend the theory presented in this paper to nonlinear approx- 
imators. The elementary concepts carry over unchanged, although it will be more 
difficult to describe them algebraically. In a nonlinear early stopping problem, there 
will be a surface corresponding to the magic ellipsoid on which $\nabla E_t \perp \nabla E_v$, but
this surface may be nonconvex or not simply connected. Likewise, corresponding
to the magic plane there will be a surface on which least-squares validation weights
must fall, but this surface need not be flat or unbounded.
It is customary in the world of statistics to apply results derived for linear systems 
to nonlinear systems by assuming the number of data is very large and various 
regularity conditions hold. If the errors $\epsilon$ are additive, the least-squares weights
again have a Gaussian distribution. As in the linear case, the Hessian of the total 
error appears as the inverse of the covariance of the least-squares weights. In this 
asymptotic (large data) regime, the standard results for linear regression carry over 
to nonlinear regression mostly unchanged. This suggests that the linear theory of 
early stopping will also apply to nonlinear regression models, such as sigmoidal 
networks, when there is much data. 
However, it should be noted that the asymptotic regression theory is purely local 
- it describes only what happens in the neighborhood of the least-squares weights. 
As the outcome of early stopping depends upon the initial weights and the trajec- 
tory taken through the weight space, any local theory will not suffice to analyze 
early stopping. Nonlinear effects such as local minima and non-quadratic basins
cannot be accounted for by a linear or asymptotically linear theory, and these may 
play important roles in nonlinear regression problems. This may invalidate direct 
extrapolations of linear results to nonlinear networks, such as that given by Wang 
and Venkatesh [5]. 
7 ACKNOWLEDGMENTS 
This research was supported by NSF Presidential Young Investigator award IRI- 
9058450 and grant 90-21 from the James S. McDonnell Foundation to Michael C. 
Mozer. 
References 
[1] Baldi, P., and Y. Chauvin. "Temporal Evolution of Generalization during Learning
in Linear Networks," Neural Computation 3, 589-603 (Winter 1991).
[2] Finnoff, W., F. Hergert, and H. G. Zimmermann. "Extended Regularization
Methods for Nonconvergent Model Selection," in Advances in NIPS 5, S. Hanson,
J. Cowan, and C. L. Giles, eds., pp 228-235. San Mateo, CA: Morgan
Kaufmann Publishers. 1993.
[3] von Mises, R. Mathematical Theory of Probability and Statistics. New York:
Academic Press. 1964.
[4] Morgan, N., and H. Bourlard. "Generalization and Parameter Estimation in
Feedforward Nets: Some Experiments," in Advances in NIPS 2, D. Touretzky,
ed., pp 630-637. San Mateo, CA: Morgan Kaufmann. 1990.
[5] Wang, C., and S. Venkatesh. "Temporal Dynamics of Generalization in Neural
Networks," in Advances in NIPS 7, G. Tesauro, D. Touretzky, and T. Leen, eds.,
pp 263-270. Cambridge, MA: MIT Press. 1995.
[6] Wang, C., S. Venkatesh, and J. S. Judd. "Optimal Stopping and Effective Machine
Complexity in Learning," in Advances in NIPS 6, J. Cowan, G. Tesauro, and J.
Alspector, eds., pp 303-310. San Francisco: Morgan Kaufmann. 1994.
[7] Weigend, A., B. Huberman, and D. Rumelhart. "Predicting the Future: A
Connectionist Approach," Int'l J. Neural Systems 1, 193-209 (1990).
