A Simple Weight Decay Can Improve 
Generalization 
Anders Krogh* 
CONNECT, The Niels Bohr Institute 
Blegdamsvej 17 
DK-2100 Copenhagen, Denmark 
krogh@cse.ucsc.edu 
John A. Hertz 
Nordita 
Blegdamsvej 17 
DK-2100 Copenhagen, Denmark 
hertz@nordita.dk 
Abstract 
It has been observed in numerical simulations that a weight decay can improve
generalization in a feed-forward neural network. This paper explains
why. It is proven that a weight decay has two effects in a linear network. 
First, it suppresses any irrelevant components of the weight vector by 
choosing the smallest vector that solves the learning problem. Second, if 
the size is chosen right, a weight decay can suppress some of the effects of 
static noise on the targets, which improves generalization quite a lot. It 
is then shown how to extend these results to networks with hidden layers 
and non-linear units. Finally the theory is confirmed by some numerical 
simulations using the data from NetTalk. 
1 INTRODUCTION
Many recent studies have shown that the generalization ability of a neural network 
(or any other 'learning machine') depends on a balance between the information in 
the training examples and the complexity of the network, see for instance [1,2,3]. 
Bad generalization occurs if the information does not match the complexity, e.g. 
if the network is very complex and there is little information in the training set. 
In this last instance the network will be over-fitting the data, and the opposite 
situation corresponds to under-fitting. 
*Present address: Computer and Information Sciences, Univ. of California Santa Cruz, 
Santa Cruz, CA 95064. 
Often the number of free parameters, i.e. the number of weights and thresholds, is 
used as a measure of the network complexity, and algorithms have been developed
that minimize the number of weights while still keeping the error on the training
examples small [4,5,6]. This minimization of the number of free parameters is not 
always what is needed. 
A different way to constrain a network, and thus decrease its complexity, is to limit 
the growth of the weights through some kind of weight decay. It should prevent the 
weights from growing too large unless it is really necessary. It can be realized by 
adding a term to the cost function that penalizes large weights, 
$$E(w) = E_0(w) + \frac{1}{2} \lambda \sum_i w_i^2, \tag{1}$$
where $E_0$ is one's favorite error measure (usually the sum of squared errors), and
$\lambda$ is a parameter governing how strongly large weights are penalized. The vector $w$
contains all free parameters of the network and will be called the weight vector. If
gradient descent is used for learning, the last term in the cost function leads to a
new term $-\lambda w_i$ in the weight update:
$$\dot{w}_i \propto -\frac{\partial E_0}{\partial w_i} - \lambda w_i. \tag{2}$$
Here it is formulated in continuous time. If the gradient of E0 (the 'force term') 
were not present this equation would lead to an exponential decay of the weights. 
Obviously there are infinitely many possibilities for choosing other forms of the 
additional term in (1), but here we will concentrate on this simple form. 
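The exponential decay mentioned above can be checked with a minimal sketch. It assumes a simple Euler discretization of equation (2) with a time step `dt` and a proportionality constant of one; the function names and all numerical values are illustrative and do not come from the paper.

```python
import math


def decay_step(w, grad_e0, lam, dt):
    """One Euler step of dw_i/dt = -dE0/dw_i - lam * w_i (equation 2)."""
    return [wi - dt * (gi + lam * wi) for wi, gi in zip(w, grad_e0)]


def run_without_force(w0, lam, dt, steps):
    """Integrate the update with the force term switched off (grad E0 = 0)."""
    w = list(w0)
    zero_grad = [0.0] * len(w)
    for _ in range(steps):
        w = decay_step(w, zero_grad, lam, dt)
    return w


# Illustrative values (assumptions, not taken from the paper).
w0 = [1.0, -2.0]
lam, dt, steps = 0.5, 0.001, 2000

w_final = run_without_force(w0, lam, dt, steps)
# Continuous-time prediction: w(t) = w(0) * exp(-lam * t) with t = steps * dt.
w_expected = [wi * math.exp(-lam * steps * dt) for wi in w0]
```

For a small step size the integrated weights agree closely with the exponential prediction; the remaining difference is the discretization error of the Euler scheme.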
It has been known for a long time that a weight decay of this form can improve
generalization [7], but this has not been widely recognized until now. The aim of this paper
is to analyze this effect both theoretically and experimentally. Weight decay as a 
special kind of regularization is also discussed in [8,9]. 
2 FEED-FORWARD NETWORKS 
A feed-forward neural network implements a function of the inputs that depends
on the weight vector $w$; this function is called $f_w$. For simplicity it is assumed that there is
only one output unit. When the input is $\xi$ the output is $f_w(\xi)$. Note that the input
vector is a vector in the $N$-dimensional input space, whereas the weight vector is a
vector in the weight space, which has a different dimension $W$.
The aim of the learning is not only to learn the examples, but to learn the underlying 
function that produces the targets for the learning process. First, we assume that 
this target function can actually be implemented by the network. This means there 
exists a weight vector $u$ such that the target function is equal to $f_u$. The network
with parameters u is often called the teacher, because from input vectors it can 
produce the right targets. The sum of squared errors is 
i -.[f,,(,,)_ f (,?,)]: (3) 
Zo(,-,,) = i , 
where p is the number of training patterns. The learning equation (2) can then be 
written 
Ofw() _Awi. (4) 
Now the idea is to expand this around the solution u, but first the linear case will 
be analyzed in some detail. 
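As a rough illustration of the learning equation (4), the following sketch integrates it for a single tanh unit learning from a teacher of the same form. The tanh unit, the Euler time step `dt`, the patterns and all numerical values are assumptions chosen for the example; they are not the networks studied in the paper.

```python
import math


def f(w, x):
    """Single tanh unit, an illustrative stand-in for the network function f_w."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def grad_f(w, x):
    """Gradient of f_w(x) with respect to w for the tanh unit."""
    y = f(w, x)
    return [(1.0 - y * y) * xi for xi in x]


def training_error(w, u, patterns):
    """E_0(w) = 1/2 * sum over patterns of [f_u(x) - f_w(x)]^2 (equation 3)."""
    return 0.5 * sum((f(u, x) - f(w, x)) ** 2 for x in patterns)


def learn(w, u, patterns, lam, dt, steps):
    """Euler integration of equation (4): error-weighted gradient minus lam * w."""
    w = list(w)
    for _ in range(steps):
        dw = [-lam * wi for wi in w]
        for x in patterns:
            err = f(u, x) - f(w, x)
            g = grad_f(w, x)
            dw = [d + err * gi for d, gi in zip(dw, g)]
        w = [wi + dt * d for wi, d in zip(w, dw)]
    return w


# Illustrative teacher, patterns and settings (assumptions).
teacher = [1.0, -0.5]
patterns = [[1.0, 0.0], [0.5, 1.0], [-1.0, 1.0]]
w_start = [0.0, 0.0]

w_end = learn(w_start, teacher, patterns, lam=0.001, dt=0.1, steps=5000)
e_start = training_error(w_start, teacher, patterns)
e_end = training_error(w_end, teacher, patterns)
```

With a small weight decay the student ends close to the teacher, and the training error drops well below its starting value.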
3 THE LINEAR PERCEPTRON 
The simplest kind of 'network' is the linear perceptron characterized by 
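$$f_w(\xi) = N^{-1/2} \sum_i w_i \xi_i, \tag{5}$$
where the $N^{-1/2}$ is just a convenient normalization factor. Here the dimension of
the weight space ($W$) is the same as the dimension of the input space ($N$).
The learning equation then takes the simple form
$$\dot{w}_i \propto \sum_\mu N^{-1} \sum_j [u_j - w_j] \xi_j^\mu \xi_i^\mu - \lambda w_i. \tag{6}$$
Defining
$$v_i \equiv u_i - w_i \tag{7}$$
and
$$A_{ij} \equiv N^{-1} \sum_\mu \xi_i^\mu \xi_j^\mu \tag{8}$$
it becomes
$$\dot{v}_i \propto -\sum_j A_{ij} v_j + \lambda (u_i - v_i). \tag{9}$$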
Transforming this equation to the basis where A is diagonal yields 
$$\dot{v}_r \propto -(A_r + \lambda) v_r + \lambda u_r, \tag{10}$$
where $A_r$ are the eigenvalues of $A$, and a subscript $r$ indicates transformation to this
basis. The generalization error is defined as the error averaged over the distribution 
of input vectors 
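$$F \equiv \left\langle [f_u(\xi) - f_w(\xi)]^2 \right\rangle_\xi = \left\langle N^{-1} \Big( \sum_i v_i \xi_i \Big)^2 \right\rangle_\xi = N^{-1} \sum_{ij} v_i v_j \langle \xi_i \xi_j \rangle_\xi = N^{-1} \sum_i v_i^2. \tag{11}$$
Here it is assumed that $\langle \xi_i \xi_j \rangle_\xi = \delta_{ij}$. The generalization error $F$ is thus proportional
to $|v|^2$, which is also quite natural.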
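The proportionality between the generalization error and the squared length of $v = u - w$ can be verified exactly in a small case. The sketch below assumes, purely for illustration, inputs drawn uniformly from $\{-1, +1\}^N$, for which the input components are uncorrelated with unit variance, and enumerates all $2^N$ inputs. The weight vectors are arbitrary example values.

```python
import itertools
import math


def linear_output(w, x):
    """Linear perceptron f_w(x) = N^(-1/2) * sum_i w_i x_i (equation 5)."""
    n = len(w)
    return sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(n)


def generalization_error(u, w):
    """Average of [f_u(x) - f_w(x)]^2 over all inputs x in {-1, +1}^N."""
    n = len(u)
    inputs = list(itertools.product([-1.0, 1.0], repeat=n))
    total = sum((linear_output(u, x) - linear_output(w, x)) ** 2 for x in inputs)
    return total / len(inputs)


# Arbitrary example teacher and student (assumptions).
u = [1.0, -2.0, 0.5, 3.0]
w = [0.5, -1.0, 1.5, 2.0]
v = [ui - wi for ui, wi in zip(u, w)]

f_enumerated = generalization_error(u, w)
f_formula = sum(vi * vi for vi in v) / len(v)   # N^(-1) * |v|^2
```

Both quantities agree up to rounding, since the cross terms $v_i v_j$ with $i \neq j$ average to zero over this input distribution.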
The eigenvalues of the covariance matrix $A$ are non-negative, and its rank can easily
be shown to be less than or equal to $p$. It is also easily seen that all eigenvectors
belonging to eigenvalues larger than 0 lie in the subspace of weight space spanned
by the input patterns $\xi^1, \ldots, \xi^p$. This subspace, called the pattern subspace, will
be denoted $V_p$, and the orthogonal subspace is denoted by $V_p^\perp$. When there are
sufficiently many examples they span the whole space, and there will be no zero
eigenvalues. This can only happen for $p \geq N$.
When $\lambda = 0$ the solution to (10) inside $V_p$ is just a simple exponential decay to
$v_r = 0$. Outside the pattern subspace $A_r = 0$, and the corresponding part of $v$ will
be constant. Any weight vector which has the same projection onto the pattern
subspace as $u$ gives a learning error 0. One can think of this as a 'valley' in the
error surface given by $u + V_p^\perp$.
The training set contains no information that can help us choose between all these
solutions to the learning problem. When learning with a weight decay $\lambda > 0$, the
part of $w$ in $V_p^\perp$, which is constant for $\lambda = 0$, will decay to zero asymptotically
(as $e^{-\lambda t}$, where $t$ is the time).
An infinitesimal weight decay will therefore choose the solution with the smallest
norm out of all the solutions in the valley described above. This solution can be
shown to be the optimal one on average.
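The selection of the smallest-norm solution can be illustrated with a minimal sketch that integrates the linear learning equation (6) for $p = 1$ pattern in $N = 3$ dimensions. The Euler time step, the teacher, the pattern and the starting point are assumptions made for the example only.

```python
def learn_linear(w0, u, patterns, lam, dt, steps):
    """Euler integration of the linear learning equation (6)."""
    n = len(u)
    w = list(w0)
    for _ in range(steps):
        dw = [-lam * wi for wi in w]
        for x in patterns:
            # N^(-1) * sum_j (u_j - w_j) x_j: scaled output error on pattern x
            err = sum((uj - wj) * xj for uj, wj, xj in zip(u, w, x)) / n
            dw = [d + err * xi for d, xi in zip(dw, x)]
        w = [wi + dt * d for wi, d in zip(w, dw)]
    return w


def norm2(w):
    """Squared Euclidean norm."""
    return sum(wi * wi for wi in w)


# Illustrative setup (assumptions): one pattern, three weights.
teacher = [1.0, 2.0, 3.0]
patterns = [[1.0, 1.0, 0.0]]
w_start = [0.0, 0.0, 5.0]   # has a component outside the pattern subspace

w_decay = learn_linear(w_start, teacher, patterns, lam=0.01, dt=0.5, steps=4000)
w_plain = learn_linear(w_start, teacher, patterns, lam=0.0, dt=0.5, steps=4000)

# Smallest-norm zero-error solution: projection of the teacher onto the pattern.
x = patterns[0]
scale = sum(ui * xi for ui, xi in zip(teacher, x)) / sum(xi * xi for xi in x)
w_min_norm = [scale * xi for xi in x]
```

Without decay the third weight keeps its initial value, since the single pattern carries no information about it. With a small decay that component vanishes and the result lies close to the projection `w_min_norm`, shrunk slightly because $\lambda$ is finite rather than infinitesimal.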
4 LEARNING WITH AN UNRELIABLE TEACHER 
Random errors made by the teacher can be modeled by adding a random term $\eta$ to
the targets:
$$f(\xi^\mu) = f_u(\xi^\mu) + \eta^\mu. \tag{12}$$
The variance of $\eta$ is called $\sigma^2$, and it is assumed to have zero mean. Note that these
targets are not exactly realizable by the network (for $\sigma > 0$), and therefore this is
a simple model for studying learning of an unrealizable function.
With this noise the learning equation (2) becomes 
i CK -(m-1 - vj + N-1/2.Y)- wi. (la) 
 J 
Transforming it to the bis where  is diagonal  before, 
  -(A + X)v + Xu - N -/' .. (14) 
The asymptotic solution to this equation is 
 Ur -- N-I/2Y'.i litter  
v,. = X + A,. 
(15) 
The contribution to the generalization error is the square of this summed over all
$r$. If averaged over the noise (shown by the bar) it becomes for each $r$
$$\overline{v_r^2} = \frac{\lambda^2 u_r^2 + N^{-1} \sum_{\mu\nu} \overline{\eta^\mu \eta^\nu} \xi_r^\mu \xi_r^\nu}{(\lambda + A_r)^2} = \frac{\lambda^2 u_r^2 + \sigma^2 A_r}{(\lambda + A_r)^2}. \tag{16}$$
The last expression has a minimum in $\lambda$, which can be found by putting the derivative
with respect to $\lambda$ equal to zero, $\lambda_r^{\mathrm{optimal}} = \sigma^2 / u_r^2$. Remarkably it depends only
on $u$ and the variance of the noise, and not on $A$. If it is assumed that $u$ is random
(16) can be averaged over $u$. This yields an optimal $\lambda$ independent of $r$,
$$\lambda_{\mathrm{optimal}} = \frac{\sigma^2}{\overline{u^2}}, \tag{17}$$
where $\overline{u^2}$ is the average of $N^{-1} |u|^2$.

Figure 1: Generalization error as a function of $\alpha = p/N$. The full line is
for $\lambda = \sigma^2 = 0.2$, and the dashed line for $\lambda = 0$. The dotted line is the
generalization error with no noise and $\lambda = 0$.

In this case the weight decay to some extent prevents the network from fitting the 
noise. 
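The claim that the optimal weight decay for a mode depends only on $u_r$ and the noise variance, and not on the eigenvalue $A_r$, can be checked numerically. The sketch below assumes the noise-averaged squared error of one mode has the form $(\lambda^2 u_r^2 + \sigma^2 A_r)/(\lambda + A_r)^2$, which is what squaring (15) and averaging over independent zero-mean noise of variance $\sigma^2$ gives. The grid search and all numerical values are illustrative choices.

```python
def mode_error(lam, u_r, sigma2, a_r):
    """Noise-averaged squared error of one mode: (lam^2 u_r^2 + sigma^2 A_r) / (lam + A_r)^2."""
    return (lam * lam * u_r * u_r + sigma2 * a_r) / (lam + a_r) ** 2


def best_lambda(u_r, sigma2, a_r, grid_size=20000, lam_max=2.0):
    """Grid search for the weight decay that minimizes the mode error."""
    grid = [lam_max * k / grid_size for k in range(grid_size + 1)]
    return min(grid, key=lambda lam: mode_error(lam, u_r, sigma2, a_r))


# Illustrative values (assumptions).
u_r, sigma2 = 1.5, 0.2
predicted = sigma2 / u_r ** 2   # sigma^2 / u_r^2

# The minimizer is searched for three quite different eigenvalues A_r.
found = {a_r: best_lambda(u_r, sigma2, a_r) for a_r in (0.1, 1.0, 5.0)}
```

For each eigenvalue the grid minimum falls within one grid step of `predicted`, in line with the statement that the optimum does not depend on $A_r$.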
From equation (14) one can see that the noise is projected onto the pattern subspace. 
Therefore the contribution to the generalization error from $V_p^\perp$ is the same as before,
and this contribution is on average minimized by a weight decay of any size.
Equation (17) was derived in [10] in the context of a particular eigenvalue spectrum.
Figure 1 shows the dramatic improvement in generalization error when the
optimal weight decay is used in this case. The present treatment shows that (17)
is independent of the spectrum of $A$.
We conclude that a weight decay has two positive effects on generalization in a 
linear network: 1) It suppresses any irrelevant components of the weight vector by 
choosing the smallest vector that solves the learning problem. 2) If the size is chosen 
right, it can suppress some of the effect of static noise on the targets. 
5 NON-LINEAR NETWORKS
It is not possible to analyze a general non-linear network exactly, as done above 
for the linear case. By a local linearization it is, however, possible to draw some
interesting conclusions from the results in the previous section. 
Assume the function is realizable, $f = f_u$. Then learning corresponds to solving the
$p$ equations
$$f_w(\xi^\mu) = f(\xi^\mu), \quad \mu = 1, \ldots, p, \tag{18}$$
in $W$ variables, where $W$ is the number of weights. For $p < W$ these equations
define a manifold in weight space of dimension at least $W - p$. Any point $\tilde{u}$ on this
manifold gives a learning error of zero, and therefore (4) can be expanded around
$\tilde{u}$. Putting $v = \tilde{u} - w$, expanding $f_w$ in $v$, and using it in (4) yields
$$\dot{v}_i \propto -\sum_\mu \frac{\partial f_w(\xi^\mu)}{\partial w_i} \sum_j \frac{\partial f_w(\xi^\mu)}{\partial w_j} v_j - \lambda v_i + \lambda \tilde{u}_i = -\sum_j A_{ij}(\tilde{u}) v_j - \lambda v_i + \lambda \tilde{u}_i. \tag{19}$$
(The derivatives in this equation should be taken at $\tilde{u}$.)
The analogue of $A$ is defined as
$$A_{ij}(\tilde{u}) \equiv \sum_\mu \frac{\partial f_w(\xi^\mu)}{\partial w_i} \frac{\partial f_w(\xi^\mu)}{\partial w_j}. \tag{20}$$
Since it is of outer product form (like $A$) its rank $R(\tilde{u}) \leq \min\{p, W\}$. Thus when
$p < W$, $A$ is never of full rank. The rank of $A$ is of course equal to $W$ minus the
dimension of the manifold mentioned above.
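The rank bound can be illustrated with a minimal sketch that builds the matrix of equation (20) as a sum of outer products of output gradients and computes its rank. The tiny network (one tanh hidden unit with an output weight, so $W = 3$), the patterns and the weight values are all assumptions made for the example, and the rank routine is a plain Gaussian elimination with a numerical tolerance.

```python
import math


def grad_output(w, x):
    """Gradient of f_w(x) = w[2] * tanh(w[0]*x[0] + w[1]*x[1]) with respect to w."""
    h = math.tanh(w[0] * x[0] + w[1] * x[1])
    dh = 1.0 - h * h
    return [w[2] * dh * x[0], w[2] * dh * x[1], h]


def a_matrix(w, patterns):
    """A_ij = sum over patterns of (df/dw_i) * (df/dw_j), as in equation (20)."""
    n = len(w)
    a = [[0.0] * n for _ in range(n)]
    for x in patterns:
        g = grad_output(w, x)
        for i in range(n):
            for j in range(n):
                a[i][j] += g[i] * g[j]
    return a


def rank(matrix, tol=1e-10):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            factor = m[i][c] / m[r][c]
            m[i] = [mi - factor * mr for mi, mr in zip(m[i], m[r])]
        r += 1
    return r


# Illustrative weights and patterns (assumptions): W = 3 weights.
weights = [0.7, -0.3, 1.2]
patterns = [[1.0, 0.5], [-0.5, 1.0]]

rank_p1 = rank(a_matrix(weights, patterns[:1]))   # p = 1
rank_p2 = rank(a_matrix(weights, patterns))       # p = 2
```

In both cases the rank stays below $W = 3$, so at least one direction in weight space is left undetermined by the training patterns.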
From these simple observations one can argue that good generalization should not 
be expected for p < W. This is in accordance with other results (cf. [3]), and with 
current 'folk-lore'. The difference from the linear case is that the 'rain gutter' need 
not be (and most probably is not) linear, but curved in this case. There may in fact
be other valleys or rain gutters disconnected from the one containing $u$. One can
also see that if $A$ has full rank, all points in the immediate neighborhood of $w = u$
give a learning error larger than 0, i.e. there is a simple minimum at u. 
Assume that the learning finds one of these valleys. A small weight decay will 
pick out the point in the valley with the smallest norm among all the points in the 
valley. In general it cannot be proven that picking that solution is the best strategy.
But, at least from a philosophical point of view, it seems sensible, because it is (in a
loose sense) the solution with the smallest complexity--the one that Ockham would 
probably have chosen. 
The value of a weight decay is more evident if there are small errors in the targets. 
In that case one can go through exactly the same line of arguments as for the linear
case to show that a weight decay can improve generalization, and even with the
same optimal choice (17) of $\lambda$. This is strictly true only for small errors (where the
linear approximation is valid). 
6 NUMERICAL EXPERIMENTS 
A weight decay has been tested on the NetTalk problem [11]. In the simulations 
back-propagation derived from the 'entropic error measure' [12] with a momentum 
term fixed at 0.8 was used. The network had 7 × 26 input units, 40 hidden units and
26 output units, in all about 8400 weights. It was trained on 400 to 5000 random
words from the data base of around 20,000 words, and tested on a different set of
1000 random words. The training set and test set were independent from run to 
run. 
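The figure of about 8400 weights can be reproduced with a short count. The sketch assumes full connectivity between adjacent layers and one threshold per hidden and output unit; neither assumption is stated explicitly in the text, and the function name is illustrative.

```python
def count_parameters(layer_sizes, with_thresholds=True):
    """Free parameters of a fully connected feed-forward net (assumed layout)."""
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    thresholds = sum(layer_sizes[1:]) if with_thresholds else 0
    return weights + thresholds


layers = [7 * 26, 40, 26]   # input, hidden and output units as described above

n_weights = count_parameters(layers, with_thresholds=False)   # 182*40 + 40*26 = 8320
n_params = count_parameters(layers)                           # 8320 + 66 = 8386
```

Under these assumptions the count is 8320 connection weights, or 8386 free parameters including thresholds, consistent with the quoted "about 8400".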
Figure 2: The top full line corresponds to the generalization error after 300 epochs 
(300 cycles through the training set) without a weight decay. The lower full line is 
with a weight decay. The top dotted line is the lowest error seen during learning 
without a weight decay, and the lower dotted line with a weight decay. The size of the
weight decay was $\lambda = 0.00008$.
Inset: Same figure except that the error rate is shown instead of the squared error.
The error rate is the fraction of wrong phonemes when the phoneme vector with 
the smallest angle to the actual output is chosen, see [11]. 
Results are shown in fig. 2. There is a clear improvement in generalization error 
when weight decay is used. There is also an improvement in error rate (inset of
fig. 2), but it is less pronounced in terms of relative improvement. Results shown
here are for a weight decay of $\lambda = 0.00008$. The values 0.00005 and 0.0001 were also
tried and gave basically the same curves. 
7 CONCLUSION 
It was shown how a weight decay can improve generalization in two ways: 1) It 
suppresses any irrelevant components of the weight vector by choosing the smallest 
vector that solves the learning problem. 2) If the size is chosen right, a weight decay 
can suppress some of the effect of static noise on the targets. Static noise on the 
targets can be viewed as a model of learning an unrealizable function. The analysis 
assumed that the network could be expanded around an optimal weight vector, and 
therefore it is strictly valid only in a small neighborhood around that vector.
The improvement from a weight decay was also tested by simulations. For the 
NetTalk data it was shown that a weight decay can decrease the generalization 
error (squared error) and also, although less significantly, the actual mistake rate 
of the network when the phoneme closest to the output is chosen. 
Acknowledgements
AK acknowledges support from the Danish Natural Science Council and the Danish 
Technical Research Council through the Computational Neural Network Center 
(CONNECT).
References 
[1] D.B. Schwartz, V.K. Samalam, S.A. Solla, and J.S. Denker. Exhaustive learning.
Neural Computation, 2:371-382, 1990.
[2] N. Tishby, E. Levin, and S.A. Solla. Consistent inference of probabilities in
layered networks: predictions and generalization. In International Joint Conference
on Neural Networks, pages 403-410, (Washington 1989), IEEE, New York, 1989.
[3] E.B. Baum and D. Haussler. What size net gives valid generalization? Neural
Computation, 1:151-160, 1989.
[4] Y. Le Cun, J.S. Denker, and S.A. Solla. Optimal brain damage. In D.S. Touretzky,
editor, Advances in Neural Information Processing Systems, pages 598-605,
(Denver 1989), Morgan Kaufmann, San Mateo, 1990.
[5] H.H. Thodberg. Improving generalization of neural networks through pruning.
International Journal of Neural Systems, 1:317-326, 1990.
[6] D.H. Weigend, D.E. Rumelhart, and B.A. Huberman. Generalization by
weight-elimination with application to forecasting. In R.P. Lippmann et al.,
editors, Advances in Neural Information Processing Systems, pages 875-882,
(Denver 1989), Morgan Kaufmann, San Mateo, 1991.
[7] G.E. Hinton. Learning translation invariant recognition in a massively parallel
network. In G. Goos and J. Hartmanis, editors, PARLE: Parallel Architectures
and Languages Europe. Lecture Notes in Computer Science, pages 1-13,
Springer-Verlag, Berlin, 1987.
[8] J. Moody. Generalization, weight decay, and architecture selection for nonlinear
learning systems. These proceedings.
[9] D. MacKay. A practical bayesian framework for backprop networks. These
proceedings.
[10] A. Krogh and J.A. Hertz. Generalization in a Linear Perceptron in the Presence
of Noise. To appear in Journal of Physics A 1992.
[11] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce
english text. Complex Systems, 1:145-168, 1987.
[12] J.A. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural
Computation. Addison-Wesley, Redwood City, 1991.
