Second Order Properties of Error Surfaces: 
Learning Time and Generalization 
Yann Le Cun
AT&T Bell Laboratories
Crawfords Corner Rd.
Holmdel, NJ 07733, USA

Ido Kanter
Department of Physics
Bar Ilan University
Ramat Gan, 52100 Israel

Sara A. Solla
AT&T Bell Laboratories
Crawfords Corner Rd.
Holmdel, NJ 07733, USA

Abstract
The learning time of a simple neural network model is obtained through an 
analytic computation of the eigenvalue spectrum for the Hessian matrix, 
which describes the second order properties of the cost function in the 
space of coupling coefficients. The form of the eigenvalue distribution 
suggests new techniques for accelerating the learning process, and provides 
a theoretical justification for the choice of centered versus biased state 
variables. 
1 INTRODUCTION
Consider the class of learning algorithms which explore a space {W} of possible 
couplings looking for optimal values W* for which a cost function E(W) is minimal. 
The dynamical properties of searches based on gradient descent are controlled by 
the second order properties of the E(W) surface. An analytic investigation of such 
properties provides a characterization of the time scales involved in the relaxation 
to the solution W*. 
The discussion focuses on layered networks with no feedback, a class of architectures 
remarkably successful at perceptual tasks such as speech and image recognition. 
We derive rigorous results for the learning time of a single linear unit, and discuss 
their generalization to multi-layer nonlinear networks. Causes for the slowest time 
constants are identified, and specific prescriptions to eliminate their effect result in 
practical methods to accelerate convergence. 
2 LEARNING BY GRADIENT DESCENT 
Multi-layer networks are composed of model neurons interconnected through a feed-forward
graph. The state $x_i$ of the i-th neuron is computed from the states $\{x_j\}$ of
the set $S_i$ of neurons that feed into it through the total input (or induced local field)
$a_i = \sum_{j \in S_i} w_{ij} x_j$. The coefficient $w_{ij}$ of the linear combination is the coupling from
neuron j to neuron i. The local field $a_i$ determines the state $x_i$ through a nonlinear
differentiable function f called the activation function: $x_i = f(a_i)$. The activation
function is often chosen to be the hyperbolic tangent or a similar sigmoid function. 
The connection graph of multi-layer networks has no feedback loops, and the stable 
state is computed by propagating state information from the input units (which 
receive no input from other units) to the output units (which propagate no infor- 
mation to other units). The initialization of the state of the input units through 
an input vector X results in an output vector O describing the state of the output 
units. The network thus implements an input-output map, $O = O(X, W)$, which
depends on the values assigned to the vector W of synaptic couplings. 
The learning process is formulated as a search in the space W, so as to find an 
optimal configuration W* which minimizes a function E(W). Given a training set 
of p input vectors $X^\mu$ and their desired outputs $D^\mu$, $1 \le \mu \le p$, the cost function
$$E(W) = \frac{1}{2p} \sum_{\mu=1}^{p} \| D^\mu - O(X^\mu, W) \|^2 \qquad (2.1)$$
measures the discrepancy between the actual behavior of the system and the desired 
behavior. The minimization of E with respect to W is usually performed through 
iterative updates using some form of gradient descent: 
$$W(k+1) = W(k) - \eta \nabla E, \qquad (2.2)$$
where $\eta$ is used to adjust the size of the updating step, and $\nabla E$ is an estimate of the
gradient of E with respect to W. The commonly used Back-Propagation algorithm,
popularized by Rumelhart, Hinton, and Williams (1986), provides an efficient way
of estimating $\nabla E$ for multi-layer networks.
The dynamical behavior of learning algorithms based on the minimization of E(W) 
through gradient descent is controlled by the properties of the E(W) surface. The 
goal of this work is to gain better understanding of the structure of this surface 
through an investigation of its second derivatives, as contained in the Hessian matrix 
H. 
3 SECOND ORDER PROPERTIES 
We now consider a simple model which can be investigated analytically: an N-dimensional
input vector feeding onto a single output unit with a linear activation
function $f(a) = a$. The output corresponding to input $X^\mu$ is given by
$$O^\mu = \sum_{i=1}^{N} w_i x_i^\mu = W^T X^\mu, \qquad (3.1)$$
where $x_i^\mu$ is the i-th component of the $\mu$-th input vector, and $w_i$ is the coupling
from the i-th input unit to the output.
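The rule for weight updates
$$W(k+1) = W(k) - \frac{\eta}{p} \sum_{\mu=1}^{p} (O^\mu - d^\mu) X^\mu \qquad (3.2)$$
follows from the gradient of the cost function
$$E(W) = \frac{1}{2p} \sum_{\mu=1}^{p} (d^\mu - O^\mu)^2 = \frac{1}{2p} \sum_{\mu=1}^{p} (d^\mu - W^T X^\mu)^2. \qquad (3.3)$$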
Note that the cost function of Eq. (3.3) is quadratic in W, and can be rewritten as
$$E(W) = \frac{1}{2} (W^T R W - 2 Q^T W + c), \qquad (3.4)$$
where R is the covariance matrix of the input, $R_{ij} = \frac{1}{p} \sum_{\mu=1}^{p} x_i^\mu x_j^\mu$, a symmetric
and nonnegative N x N matrix; the N-dimensional vector Q has components
$q_i = \frac{1}{p} \sum_{\mu=1}^{p} d^\mu x_i^\mu$, and the constant $c = \frac{1}{p} \sum_{\mu=1}^{p} (d^\mu)^2$. The gradient is given by
$\nabla E = RW - Q$, while the Hessian matrix of second derivatives is H = R.
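The quadratic form and its gradient can be checked numerically. The sketch below compares the direct mean squared error of the linear unit with the quadratic form, and the batch gradient with RW - Q. It assumes a prefactor of 1/2 in the quadratic form (which is what makes the gradient equal to RW - Q); the Gaussian data, the seed, and the sizes N = 5 and p = 40 are arbitrary choices for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily for reproducibility
N, p = 5, 40                     # illustrative sizes, not from the paper

X = rng.normal(size=(p, N))      # row mu holds the input vector X^mu
d = rng.normal(size=p)           # desired outputs d^mu
W = rng.normal(size=N)           # an arbitrary point in weight space

R = X.T @ X / p                  # R_ij = (1/p) sum_mu x_i^mu x_j^mu
Q = X.T @ d / p                  # q_i  = (1/p) sum_mu d^mu x_i^mu
c = d @ d / p                    # c    = (1/p) sum_mu (d^mu)^2


def cost(w):
    """Mean squared error of the linear unit, with the 1/2 prefactor."""
    return 0.5 * np.mean((d - X @ w) ** 2)


quadratic_form = 0.5 * (W @ R @ W - 2 * Q @ W + c)   # assumed form of Eq. (3.4)
batch_gradient = (X @ W - d) @ X / p                 # (1/p) sum_mu (O^mu - d^mu) X^mu

print(np.isclose(cost(W), quadratic_form))
print(np.allclose(batch_gradient, R @ W - Q))
```

Both comparisons hold for any W, since the identity is algebraic rather than statistical.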
The solution space of vectors W* which minimize E(W) is the subspace of solutions
of the linear equation RW = Q, resulting from $\nabla E = 0$. This subspace reduces to
a point if R is full rank. The diagonalization of R provides a diagonal matrix $\Lambda$
formed by its eigenvalues, and a matrix U formed by its eigenvectors. Since R is
nonnegative, all eigenvalues satisfy $\lambda \ge 0$.
Consider now a two-step coordinate transformation: a translation V' = W - W*
provides new coordinates centered at the solution point; it is followed by a rotation
$V = U^T V' = U^T (W - W^*)$ onto the principal axes of the error surface, where the
columns of U are the eigenvectors of R. In the new coordinate system
$$E(V) = \frac{1}{2} V^T \Lambda V + E_0, \qquad (3.5)$$
with $\Lambda = U^T R U$ and $E_0 = E(W^*)$. Then $\partial E / \partial v_j = \lambda_j v_j$, and
$\partial^2 E / \partial v_j \partial v_k = \lambda_j \delta_{jk}$. The eigenvalues of the input covariance matrix give the second derivatives
of the error surface with respect to its principal axes.
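The change of coordinates can be illustrated with a short numerical sketch. NumPy's `eigh` returns the eigenvectors as the columns of U, so the rotation is written here with the transpose of U; this convention, the Gaussian data, and the sizes are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed
N, p = 4, 30                     # illustrative sizes, with p > N so that R is full rank
X = rng.normal(size=(p, N))
d = rng.normal(size=p)

R = X.T @ X / p
Q = X.T @ d / p


def cost(w):
    """Mean squared error of the linear unit, with the 1/2 prefactor."""
    return 0.5 * np.mean((d - X @ w) ** 2)


W_star = np.linalg.solve(R, Q)   # solution of R W* = Q
lam, U = np.linalg.eigh(R)       # eigenvalues (ascending) and eigenvectors (columns of U)

W = rng.normal(size=N)           # an arbitrary weight vector
V = U.T @ (W - W_star)           # translate to the minimum, then rotate onto principal axes
E0 = cost(W_star)
diagonal_form = 0.5 * np.sum(lam * V ** 2) + E0   # cost in the rotated coordinates

print(np.isclose(cost(W), diagonal_form))
```

In the rotated coordinates the cost is a sum of independent parabolas, one per eigenvalue, which is what decouples the update equations.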
In the new coordinate system the Hessian matrix is the diagonal matrix $\Lambda$, and the
rule for weight updates becomes a set of N decoupled equations:
$$V(k+1) = V(k) - \eta \Lambda V(k). \qquad (3.6)$$
The evolution of each component along a principal direction is given by
$$v_j(k) = (1 - \eta \lambda_j)^k v_j(0), \qquad (3.7)$$
so that $v_j$ will converge to zero (and thus $w_j$ to the solution $w_j^*$) provided that
$0 < \eta < 2/\lambda_j$. In this regime $v_j$ decays to zero exponentially, with characteristic
time $\tau_j = (\eta \lambda_j)^{-1}$. The range $1/\lambda_j < \eta < 2/\lambda_j$ corresponds to underdamped
dynamics: the step size is large and convergence to the solution occurs through
oscillatory behavior. The range $0 < \eta < 1/\lambda_j$ corresponds to overdamped dynamics:
the step size is small and convergence requires many iterations. Critical damping
occurs for $\eta = 1/\lambda_j$; if such choice is possible, the solution is reached in one iteration
(Newton's method).
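The damping regimes can be made concrete by iterating a single decoupled mode. The following is a minimal sketch; the curvature value 2.0 and the four step sizes are hypothetical values chosen so that each regime appears once.

```python
def mode_trajectory(eta, lam, v0=1.0, steps=6):
    """Iterate v(k+1) = v(k) - eta * lam * v(k) for one principal direction."""
    v, history = v0, [v0]
    for _ in range(steps):
        v = v - eta * lam * v
        history.append(v)
    return history


lam = 2.0                                   # hypothetical curvature along one axis
overdamped = mode_trajectory(0.25, lam)     # eta < 1/lam: monotone decay
critical = mode_trajectory(0.50, lam)       # eta = 1/lam: zero after one step
underdamped = mode_trajectory(0.75, lam)    # 1/lam < eta < 2/lam: sign alternates, size shrinks
divergent = mode_trajectory(1.25, lam)      # eta > 2/lam: size grows at every step

for name, traj in [("overdamped", overdamped), ("critical", critical),
                   ("underdamped", underdamped), ("divergent", divergent)]:
    print(name, traj[:4])
```

Each trajectory is a geometric sequence with ratio (1 - eta * lam), so the four cases correspond to ratios of 0.5, 0, -0.5, and -1.5.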
If all eigenvalues are equal, $\lambda_j = \lambda$ for all $1 \le j \le N$, the Hessian matrix is
diagonal: $H = \lambda I$. Convergence can be obtained in one iteration, with optimal
step size $\eta = 1/\lambda$, and learning time $\tau = 1$. This highly symmetric case occurs
when cross-sections of E(W) are hyperspheres in the N-dimensional space {W}.
Such a high degree of symmetry is rarely encountered: correlated inputs result in
nondiagonal elements for H, and the principal directions are rotated with respect
to the original coordinates. The cross-sections of E(W) are elliptical, with different
eigenvalues along different principal directions. Convergence requires $0 < \eta < 2/\lambda_j$
for all $1 \le j \le N$, thus $\eta$ must be chosen in the range $0 < \eta < 2/\lambda_{\max}$, where $\lambda_{\max}$ is
the largest eigenvalue. The slowest time constant in the system is $\tau_{\max} = (\eta \lambda_{\min})^{-1}$,
where $\lambda_{\min}$ is the lowest nonzero eigenvalue. The optimal step size $\eta = 1/\lambda_{\max}$ thus
leads to $\tau_{\max} = \lambda_{\max}/\lambda_{\min}$ for the decay along the principal direction of smallest
nonzero curvature. A distribution of eigenvalues in the range $\lambda_{\min} \le \lambda \le \lambda_{\max}$
results in a distribution of learning times, with average $\langle \tau \rangle = \lambda_{\max} \langle 1/\lambda \rangle$.
This analysis demonstrates that learning dynamics in quadratic surfaces are fully 
controlled by the eigenvalue distribution of the Hessian matrix. It is thus of interest 
to investigate such eigenvalue distribution. 
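The effect of an eigenvalue spread on the time constants can be seen in a small worked example. The three eigenvalues below are hypothetical; the step size is set to the inverse of the largest one, and the residual along each mode is compared with the exponential estimate based on the slowest time constant.

```python
import math

eigenvalues = [4.0, 1.0, 0.1]            # hypothetical Hessian spectrum
eta = 1.0 / max(eigenvalues)             # step size set to 1 / (largest eigenvalue)
time_constants = [1.0 / (eta * lam) for lam in eigenvalues]   # one time constant per mode


def residual(lam, k):
    """Fraction of the initial component left along a mode after k steps."""
    return (1.0 - eta * lam) ** k


k = 40
residuals = [residual(lam, k) for lam in eigenvalues]
# Continuous-time approximation for the slowest mode: exp(-k / tau_max)
exponential_estimate = math.exp(-k / max(time_constants))

print(time_constants)
print(residuals, exponential_estimate)
```

With these values the stiffest mode is removed in one step, while the softest mode still retains roughly a third of its initial amplitude after 40 steps, in line with a slowest time constant equal to the ratio of the largest to the smallest eigenvalue.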
4 EIGENVALUE SPECTRUM 
The simple linear unit of Eq. (3.1) leads to the error function (3.4), for which the
Hessian is given by the covariance matrix
$$R_{ij} = \frac{1}{p} \sum_{\mu=1}^{p} x_i^\mu x_j^\mu. \qquad (4.1)$$
It is assumed that the input components $\{x_i^\mu\}$ are independent, and drawn from a
distribution with mean m and variance v. The size of the training set is quantified by
the ratio $\alpha = p/N$ between the number of training examples and the dimensionality
of the input vector. The eigenvalue spectrum has been computed (Le Cun, Kanter,
and Solla, 1990), and it exhibits three dominant features:
(a) If p < N, the rank of the matrix R is p. The existence of (N - p) zero eigenvalues
out of N results in a delta function contribution of weight $(1 - \alpha)$ at $\lambda = 0$ for $\alpha < 1$.
(b) A continuous part of the spectrum,
$$\rho(\lambda) = \frac{\sqrt{4 \alpha v^2 - (\lambda \alpha - v(1 + \alpha))^2}}{2 \pi \lambda v} \qquad (4.2)$$
within the bounded interval $\lambda_- < \lambda < \lambda_+$, with $\lambda_\pm = (1 \pm \sqrt{\alpha})^2 v / \alpha$ (Krogh and
Hertz, 1991). Note that $\rho(\lambda)$ is controlled only by the variance v of the distribution
from which the inputs are drawn. The bounds $\lambda_\pm$ are well defined, and of order
one. For all $\alpha < 1$, $\lambda_- > 0$, indicating a gap at the lower end of the spectrum.
(c) An isolated eigenvalue of order N, $\lambda_N$, present in the case of biased inputs
($m \ne 0$).
True correlations between pairs (j, k) of input components might lead to a quite 
different spectrum from the one described above. 
The continuous part (4.2) of the eigenvalue spectrum has been computed in the
$N \to \infty$ limit, while keeping $\alpha$ constant and finite. The magnitude of finite size
effects has been investigated numerically for $N \le 200$ and various values of $\alpha$.
Results for N = 200, shown in Fig. 1, indicate that finite size effects are negligible:
the distribution $\rho(\lambda)$ is bounded within the interval $[\lambda_-, \lambda_+]$, in good agreement
with the theoretical prediction, even for such small systems. The result (4.2) is
thus applicable in the finite $p = \alpha N$ case, an important regime given the limited
availability of training data in most learning problems.
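The finite-size behavior can be probed with a single random draw. The sketch below builds the covariance matrix for N = 200 inputs taking values +1 or -1 with equal probability (m = 0, v = 1) and compares its extreme eigenvalues with the predicted edges of the continuous spectrum. The seed, the single trial, and the choice of ratio 4 between examples and inputs are assumptions for illustration; the paper's histograms average over 100 trials.

```python
import numpy as np


def spectrum_bounds(alpha, v=1.0):
    """Predicted lower and upper edges of the continuous part of the spectrum."""
    lower = (1 - np.sqrt(alpha)) ** 2 * v / alpha
    upper = (1 + np.sqrt(alpha)) ** 2 * v / alpha
    return lower, upper


def empirical_eigenvalues(N, alpha, rng):
    """Eigenvalues of R = (1/p) sum_mu X^mu (X^mu)^T for +/-1 inputs."""
    p = int(round(alpha * N))
    X = rng.choice([-1.0, 1.0], size=(p, N))
    R = X.T @ X / p
    return np.linalg.eigvalsh(R)   # ascending order


rng = np.random.default_rng(0)     # arbitrary seed
N, alpha = 200, 4.0
eigs = empirical_eigenvalues(N, alpha, rng)
lam_minus, lam_plus = spectrum_bounds(alpha)

print(lam_minus, lam_plus)         # predicted edges: 0.25 and 2.25
print(eigs.min(), eigs.max())      # extreme eigenvalues of this one sample
```

For inputs of unit magnitude every diagonal element of R equals 1, so the mean eigenvalue is exactly 1 regardless of the sample.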
Figure 1: Spectral density $\rho(\lambda)$ predicted by Eq. (4.2) for m = 0, v = 1, and
$\alpha$ = 0.6, 1.2, 4, and 16. Experimental histograms for $\alpha$ = 0.6 (full squares) and
$\alpha$ = 4 (open squares) are averages over 100 trials with N = 200 and $x_i^\mu = \pm 1$ with
probability 1/2 each.
The existence of a large eigenvalue $\lambda_N$ is easily understood by considering the
structure of the covariance matrix R in the $p \to \infty$ limit, a regime for which a
detailed analysis is available in the adaptive filters literature (Widrow and Stearns,
1985). In this limit, all off-diagonal elements of R are equal to $m^2$, and all diagonal
elements are equal to $v + m^2$. The eigenvector $U_N = (1 \ldots 1)$ thus corresponds to
the eigenvalue $\lambda_N = N m^2 + v$. The remaining (N - 1) eigenvalues are all equal
to v (note that the continuous part of the spectrum collapses onto a delta function
at $\lambda_- = \lambda_+ = v$ as $p \to \infty$), thus satisfying $\mathrm{tr}\, R = N(m^2 + v)$. The large part
of $\lambda_N$ is eliminated for centered distributions with m = 0, such as $x = \pm 1$ with
probability 1/2, or x = 3, -1, -2 with probability 1/3. Note that although m is
crucial in controlling the existence of an isolated eigenvalue of order N, it plays no
role in the spectral density of Eq. (4.2).
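The isolated eigenvalue is easy to observe numerically. The sketch below uses inputs drawn uniformly from [0, 1], for which m = 1/2 and v = 1/12, and compares the largest eigenvalue of R with the large-p prediction N m^2 + v, before and after subtracting the mean of each component. The input distribution, the seed, and the sizes N = 100 and p = 2000 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)              # arbitrary seed
N, p = 100, 2000                            # illustrative sizes
m, v = 0.5, 1.0 / 12.0                      # mean and variance of uniform [0, 1] inputs

X = rng.uniform(0.0, 1.0, size=(p, N))      # biased inputs: every component has mean m
eigs = np.linalg.eigvalsh(X.T @ X / p)      # ascending order
predicted_large = N * m ** 2 + v            # large-p prediction for the isolated eigenvalue

Xc = X - X.mean(axis=0)                     # subtract the empirical mean of each component
eigs_centered = np.linalg.eigvalsh(Xc.T @ Xc / p)

print(eigs[-1], predicted_large)            # isolated eigenvalue versus prediction
print(eigs[-2], eigs_centered[-1])          # remaining eigenvalues stay of order v
```

After centering, the largest eigenvalue drops from order N to order v, which is the reduction in eigenvalue spread that the analysis attributes to m = 0.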
5 LEARNING TIME 
Consider the learning time $\tau = \alpha (\lambda_{\max}/\lambda_{\min})$. The eigenvalue ratio $(\lambda_{\max}/\lambda_{\min})$
measures the maximum number of iterations, and the factor of $\alpha$ accounts for the
time needed for each presentation of the full training set.
For m = 0, $\lambda_{\max} = \lambda_+$ and $\lambda_{\min} = \lambda_-$. The learning time $\tau = \alpha (\lambda_+/\lambda_-)$ can be
easily computed using Eq. (4.2): $\tau = \alpha (1 + \sqrt{\alpha})^2 / (1 - \sqrt{\alpha})^2$. As a function of $\alpha$, $\tau$
diverges at $\alpha = 1$, and, surprisingly, goes through a minimum at $\alpha = (1 + \sqrt{2})^2 = 5.83$
before diverging linearly for $\alpha \to \infty$. Numerical simulations were performed to
estimate $\tau$ by counting the number T of presentations of training examples needed
to reach an allowed error level through gradient descent. If the prescribed error level
is sufficiently close to the minimum error $E_0$, T is controlled by the slowest mode,
and it provides a good estimate for $\tau$. Numerical results for T as a function of
$\alpha$, shown in Fig. 2, were obtained by training a single linear neuron on randomly
generated vectors. As predicted, the curve exhibits a clear maximum at $\alpha = 1$,
as well as a minimum between $\alpha = 4$ and $\alpha = 5$. The existence of such an optimal
training set size for fast learning is a surprising result.
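The location of the theoretical minimum can be verified directly from the closed-form learning time. The sketch below evaluates it on a grid above the divergence at ratio 1 and compares the best grid point with (1 + sqrt(2))^2; the grid limits and spacing are arbitrary choices for the example.

```python
import math


def learning_time(alpha):
    """Ratio alpha * (upper edge / lower edge) for centered inputs; undefined at alpha = 1."""
    s = math.sqrt(alpha)
    return alpha * (1 + s) ** 2 / (1 - s) ** 2


# Coarse grid search for the minimum above the divergence (grid choice is arbitrary).
grid = [1.5 + 0.001 * i for i in range(20000)]   # ratios from 1.5 to about 21.5
alpha_best = min(grid, key=learning_time)
alpha_theory = (1 + math.sqrt(2)) ** 2           # equals 3 + 2*sqrt(2), about 5.83

print(alpha_best, alpha_theory)
print(learning_time(alpha_theory))               # equals 17 + 12*sqrt(2), about 33.97
```

Setting the derivative of the logarithm of the learning time with respect to sqrt(alpha) to zero gives the quadratic s^2 - 2s - 1 = 0, whose positive root is 1 + sqrt(2), in agreement with the grid search.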
Figure 2: Number of iterations T (averaged over 20 trials) needed to train a linear
neuron with N = 100 inputs, plotted as a function of $\alpha$. The $x_i^\mu$ are uniformly distributed between -1 and +1.
Initial and target couplings W are chosen randomly from a uniform distribution
within the $[-1, +1]^N$ hypercube. Gradient descent is considered complete when the
error reaches the prescribed value of 0.001 above the $E_0 = 0$ minimum value.
Biased inputs ($m \ne 0$) produce a large eigenvalue $\lambda_{\max} = \lambda_N$, proportional to N
and responsible for slow convergence. A simple approach to reducing the learning
time is to center each input variable $x_j$ by subtracting its mean. An obvious source
of systematic bias m is the use of activation functions which restrict the state
variables to the [0,1] interval. Symmetric activation functions such as the hyperbolic
tangent are empirically known to yield faster convergence than their nonsymmetric
counterparts such as the logistic function. Our results provide an explanation for
this observation.
The extension of these results to multi-layer networks rests on the observation that
each neuron i receives state information $\{x_j\}$ from the $j \in S_i$ neurons that feed
into it, and can be viewed as minimizing a local objective function $E_i$ whose Hessian
involves the covariance matrix of such inputs. If all input variables are
uncorrelated and have zero mean, no large eigenvalues will appear. But states with
$\bar{x}_j = m \ne 0$ produce eigenvalues proportional to the number of input neurons $N_i$ in
the set $S_i$, resulting in slow convergence if the connectivity is large. An empirically
known solution to this problem, justified by our theoretical analysis, is to use individual
learning rates $\eta_i$ inversely proportional to the number of inputs $N_i$ to the i-th
neuron. Yet another approach is to keep a running estimate of the average $\bar{x}_j$ and
use centered state variables $\tilde{x}_j = x_j - \bar{x}_j$. Such an algorithm results in considerable
reductions in learning time.
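One possible way to keep a running estimate of the average is sketched below. The exponential moving average, the `rate` parameter, and the class name `RunningCenter` are assumptions made for this example; the text does not specify how the running estimate is computed. The states are drawn uniformly from [0, 1] to mimic units with a logistic-like range.

```python
import numpy as np


class RunningCenter:
    """Tracks an exponential running mean of each input and returns centered values.

    The exponential form and the `rate` parameter are assumptions of this sketch.
    """

    def __init__(self, n_inputs, rate=0.01):
        self.mean = np.zeros(n_inputs)
        self.rate = rate

    def __call__(self, x):
        self.mean += self.rate * (x - self.mean)   # update the running estimate of the mean
        return x - self.mean                       # centered state variables


def largest_eigenvalue(A):
    """Largest eigenvalue of the covariance matrix built from the rows of A."""
    return np.linalg.eigvalsh(A.T @ A / len(A))[-1]


rng = np.random.default_rng(2)                     # arbitrary seed
N, p = 50, 5000                                    # illustrative sizes
X = rng.uniform(0.0, 1.0, size=(p, N))             # states confined to [0, 1], mean 0.5

center = RunningCenter(N)
Xc = np.array([center(x) for x in X])

lam_biased = largest_eigenvalue(X)
lam_centered = largest_eigenvalue(Xc[1000:])       # skip the transient of the running mean

print(lam_biased, lam_centered)
```

In this sketch the largest eigenvalue falls from order N to order v once the running mean has settled, so the admissible step size grows accordingly.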
6 CONCLUSIONS 
Our results are based on a rigorous calculation of the eigenvalue spectrum for a sym- 
metric matrix constructed from the outer product of random vectors. The spectral 
density provides a full description of the relaxation of a single adaptive linear unit, 
and yields a surprising result for the optimal size of the training set in batch learn- 
ing. Various aspects of the dynamics of learning in multi-layer networks composed 
of nonlinear units are clarified: the theory justifies known empirical methods and 
suggests novel approaches to reduce learning times. 
References 
A. Krogh and J. A. Hertz (1991), 'Dynamics of generalization in linear perceptrons', 
in Advances in Neural Information Processing Systems 3, ed. by D. S. Touretzky 
and R. Lippman, Morgan Kaufmann (California). 
Y. Le Cun, I. Kanter, and S. A. Solla (1990), 'Eigenvalues of covariance matrices: 
application to neural-network learning', Phys. Rev., to be published. 
D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986), 'Learning representations 
by back-propagating errors', Nature 323, 533-536.
B. Widrow and S. D. Stearns (1985), Adaptive Signal Processing, Prentice-Hall 
(New Jersey). 
