Asymptotics of Gradient-based 
Neural Network Training Algorithms 
Sayandev Mukherjee 
saymukhee. cornell. edu 
School of Electrical Engineering 
Cornell University 
Ithaca, NY 14853 
Terrence L. Fine 
tlf ineee. cornell. edu 
School of Electrical Engineering 
Cornell University 
Ithaca, NY 14853 
Abstract 
We study the asymptotic properties of the sequence of iterates of 
weight-vector estimates obtained by training a multilayer feedfor- 
ward neural network with a basic gradient-descent method using 
a fixed learning constant and no batch-processing. In the one- 
dimensional case, an exact analysis establishes the existence of a 
limiting distribution that is not Gaussian in general. For the gen- 
eral case and small learning constant, a linearization approximation 
permits the application of results from the theory of random ma- 
trices to again establish the existence of a limiting distribution. 
We study the first few moments of this distribution to compare 
and contrast the results of our analysis with those of techniques of 
stochastic approximation. 
i INTRODUCTION 
The wide applicability of neural networks to problems in pattern classification and 
signal processing has been due to the development of efficient gradient-descent al- 
gorithms for the supervised training of multilayer feedforward neural networks with 
differentiable node functions. A basic version uses a fixed learning constant and up- 
dates all weights after each training input is presented (on-line mode) rather than 
after the entire training set has been presented (batch mode). The properties of 
this algorithm as exhibited by the sequence of iterates are not yet well-understood. 
There are at present two major approaches. 
336 Sayandev Mukherjee, Terrence L. Fine 
Stochastic approximation techniques (Bucklew,Kurtz,Sethares, 1993; Finnoff, 1993; 
Kuan,Hornik, 1991; White, 1989) study the limiting behavior of the stochastic pro- 
cess that is the piecewise-constant or piecewise-linear interpolation of the sequence 
of weight-vector iterates (assuming infinitely many i.i.d. training inputs) as the 
learning constant approaches zero. It can be shown (Bucklew,Kurtz,Sethares, 1993; 
Finnoff, 1993) that as the learning constant tends to zero, the fluctuation between 
the paths and their limit, suitably normalized, tends to a Gaussian diffusion process. 
Leen and Moody (1993) and Orr and Leen (1993) have considered the Markov 
process formed by the sequence of iterates (again, assuming infinitely many i.i.d. 
training inputs) for a fixed nonzero learning constant. This approach has the merit 
of dealing with the nonzero learning constant case and of linking the study of the 
training algorithm with the well-developed literature on Markov processes. 
In particular, it is possible to solve (Leen,Moody, 1993) for the asymptotic distribu- 
tion of the sequence of weight-vector iterates from the Chapman-Kolmogorov equa- 
tion after certain assumptions have been used to simplify it considerably. However, 
the assumptions are unrealistic: in particular, the assumption of detailed balance 
does not hold in more than one dimension. This approach also fails to establish the 
existence of a limiting distribution in the general case. 
This paper follows the nethod of considering the sequence of weight-vector iter- 
ates as a discrete-time continuous state-space Markov process, when the learning 
constant is fixed and nonzero. We shall first seek to establish the existence of an 
asymptotic distribution, and then examine this distribution through its first few 
moments. 
It can be proved (Mukherjee, 1994), using Foster's criteria (Tweedie, 1976) for the 
positive-recurrence of a Markov process, that when a single sigmoidal node with one 
parameter is trained using the iterative form of the basic gradient-descent training 
algorithm (without batch-processing), the sequence of iterates of the parameter has 
a limiting distribution which is in general non-Gaussian, thereby qualifying the oft- 
stated claims in the literature (see, for example, (Bucklew,Kurtz,Sethares, 1993; 
Finnoff, 1993; White, 1989)). However, this method proves to be intractable in the 
multiple parameter case. 
2 THE GENERAL CASE AND LINEARIZATION IN Wn 
The general version of this problem for a neural network V with scalar out- 
put involves training V with the i.i.d. training sequence {(X___n,Yn)} , loss function 
__ 1 
(x_,y,w) [y- V(x,w)] 2 (x E ]Ra, y E ]R,w  ]R m) and the gradient-descent 
updating equation for the estimates of the weight vector given by 
Wn+ 1 ---- W n - iVw__(x__,y,w)l(x+,y,+,w) 
'-- W n -- [Yn+l - rl(Wn,Xn+l)]Vw_rl(w__,x_)l___w,,.,x,,.+l. 
As is customary in this kind of analysis, the training set is assumed infinite, so that 
{Wn}n__o forms a homogeneous Markov process in discrete time. In our analysis, 
the training data is assumed to come from the model 
Y = v(w , _x) + z, 
Asymptomatics of Gradient-Based Neural Network Training Algorithms 33 7 
where Z and X are independent, and Z has zero mean and variance a 2. Hence, 
the unrestricted Bayes estimator of Y given X, E(YIX ) = r/(w,X), is in the class 
of neural network estimators, and w  is the goal of training. For convenience, we 
define ITV -- W- w . 
Assuming that / is small and that after a while, successive iterates, with high 
probability, jitter about in a close neighborhood of the optimal value w , we make 
the important assumption that 
, = ore k) () 
for some 0 < k < 1 (see Section 4) I Applying Taylor series expansions to r/ 
and V_r/and neglecting all terms Op(Ig l+2k) and higher, we obtain the following 
linearized form of the updating equation: 
/n+l -- An+lln + Bn+ 1 , 
where 
(2) 
(3) 
(4) 
do not depend on W. The matrices {(An+,Bn+l) } form an i.i.d. sequence, but 
An+l and B+ 1 are dependent for each n. Hence the linearized W again forms a 
homogeneous Markov process in discrete time. 
In what follows we analyze this process in the hope that its asymptotics agree with 
those of the original Markov process. 
3 EXISTENCE OF A LIMITING DISTRIBUTION 
Let A, B, G, J denote random matrices with the common distributions of the i.i.d. 
sequences {An}, {Bn}, {Gn}, and {Jn} respectively, and let T' IR m --. IR m be 
the random affine transformation 
 AwL+B_. 
The following result establishes the existence of a limiting distribution of W n. 
Lemma 1 (Berger Thm. V, p.162) Suppose 
',[log + IIAII + log + IIl[] < 
E log IIA,A,_... A I1 < 
where 
log + x - log x V O. 
Then the following conclusions hold: 
XI.e., (Ve > O)(3MJ(n)P(-k[l__nl[ <_ M) > 1 -e. 
oo; (5) 
0 for some n (6) 
338 Sayandev Mukherjee, Terrence L. Fine 
Unique stationary distribution: There exists a unique random variable 
W  IR m, upto distribution, that is stationary with respect to T (i.e., 17V is 
independent of T, and TITV has the same distribution as 17V). 
2. Asymptotic stationarity: We have convergence in distribution: 
Our choice of norm is the operator norm for the matrix A, 
][A[[ = max 
where are the eigenvalues of X, and the Euclidean norm for the vector 
We first verify (5). From the inequality Vx  IR, log+x < x 2, it is easily seen 
that if r/is a feedforward net where all activation functions are twice-continuously 
differentiable in the weights, all hidden-layer activation functions are bounded and 
have bounded derivatives up to order 2, and if the training sequence (X, Y) is 
i.i.d. with finite fourth moments, then (5) holds for the Euclidean norm for B and 
the Frobenius norm for A, [IA][ 2 = n=l En=l ]Alii 2. Since 
m m 
(max IA(A)I) 2 _<  IA(A)I 2 _<   I&jl 2, (7) 
i= 
we see that (5) also holds for the operator norm of A. 
Assumption (6) forces the product A-.. A1 to tend to 0 mx m almost surely (Berger, 
1993, p.146) and therefore removes the dependence of the asymptotic distribution 
of {17Vn} on that of the initial value ITV 0. A sufficient condition for (6) is given by 
the following lemma. 
Lemma 2 Suppose lEG is positive definite (note that it is positive semidefinite by 
definition), and for all n, lEA n < cx>. Then (6) holds for suJ2ficiently small, positive 
l. 
Proof: By assumption, min A(lEG) = 5 > 0 for some 5. 
I n 
Let H =  i=i(Gi - ZiJi). By the Strong Law of Large Numbers applied to the 
i.i.d. random matrices (Gi - ZiJi), we have H -4 lEG a.s., so 
minA(Hn)-* minA(lEG) a.s. (8) 
Applying (7) to rain A(H,), it is easily shown that the same conditions on r/ and 
the training sequence that are sufficient for (5) also give SUPn 1E[min A(Hn)] 2 < 
which in turn implies that {min A(Hn)} are uniformly integrable. Together with (8), 
this implies (Love, 1977, p.165) that minA(Hn) --, minA(lEG) in L 1. Hence there 
Asymptomatics of Gradient-Based Neural Network Training Algorithms 339 
exists some (nonrandom) N, say, such that EIminA(HN ) -- minA(EG)l <_ 6/2. 
Since 
[]Emin (HN) -- min,X(EG)l _< El min(HN) - min (EG)  6/2, 
we therefore have 
E min (Gi- ZiJi)  min(EG)-6/2 =6-6/2 =6/2 > 0. (9) 
We shall prove that (6) holds for this N ( m) by showing that 
log IIANAN-x    Axil = log [IANAN-x '" Axil 0, 
For our choice of norm, we therefore want 1F, [log (max IA(AN... A1)I) 2] < 0. From 
Jensen's inequality, it is sufficient to have log lE[max IA(AN'" A1)1] 2 < 0, or equiv- 
alently, 
E[maxIA(AN...A1)[] 2 < 1. (10) 
Now, since N is fixed, we can choose u small enough that 
N 
AN'" A1 =[m - [1E(C_i - ZiJi) q- 
i:1 
N 
Hence, A(AN ... A1) = 1 - ;tA(Y],i:l(G/- Z/J/)) + Oe(it2), and N is fixed, so 
[A(AN... A1)] 2 - 1 - 2A (G/- &J) + 
giving 
mA(AN...A)I 2 g 1--2minA (G-ZJ) +Oe(2), 
EmIA(AN ...A)] 2 5 1 - NS + o(), (11) 
where we use (9) and the observation that the structure of the lt Oe( 2) term is 
such that its expectation (guaranteed finite by the hypothesis EA N < ) is O(2), 
or o(), and we also restrict  < 1/N5 so that 
1 - 2E rain A (G - ZiJi) > 0. 
From (11), it is clear that (10) holds for all sufficiently small, positive/z (<< 1/N6). 
Therefore (6) holds for n = N. 
We can combine these two lemmas into the following theorem. 
Theorem 1 Let  be a feedforward net where all activation functions are twice- 
continuously differentiabIe in the weights, all hidden-layer activation functions are 
bounded and have bounded derivatives up to order 2, and let the training sequence 
(X___n, Yn) be i.i.d. with finite moments. Further, assume that lEG is positive definite. 
Then, for all sufficiently small, positive tz the sequence of random vectors {Vn}n=1 
obtained from the updating equation (2) has a unique limiting distribution. 
340 Sayandev Mukherjee, Terrence L. Fine 
We circumvent the generally intractable problem of finding the limiting distribution 
by calculating and investigating the behavior of its moments. 
4 MOMENTS OF THE LIMITING DISTRIBUTION 
Let us assume that the mean and variance of the limiting distribution exist, and 
that Z  A/'(0, or2). From (2) and the form of An+l and Bn+l, it is easy to show 
that 1F, I7V = 0, or 1F, W = w , so the optimal value w  is the mean of the limiting 
distribution of the sequence of iterates {Wn}. It can also be shown (Mukherjee, 
1994) that ]EI7VI7V T = (/cr2/2)lm, yielding I7V = Op(v/-fi). This is consistent with 
our assumption (1) with k = 1/2. 
_ 1/zcr 2 
In the one-dimensional case (d - m = 1), we have 1F, I7V = 0 and 1F, I7V 2 --  
if ]E[Xn+IT]t(wXn+i)] 2  O. Using these results, the fact that Z  Af(0, a2), 
1F, I7V = 0, the independence of Z and X, and assuming that 1F, X s < oo, it is not 
difficult to compute the expressions 
V  = '24[XvV (1 - ,X2v)] 
]!:?,[X2T] '2 -/X4(T] '4 q- r2T] "2) q-/2X6T]'2(r]'4/3 q- r2T]"2)] ' 
and 
where 
K1 (/-z) 
'[]74 .._. 3(]EI,2)2K1 (/j,)q- -[]:?3K2(/j,), 
E[X2tlt2(1 -/_zX27/'2) 2 q-/_If47/'4 q- 3/12X%'/"27/'2] 
, 
K2() = 
K() ' 
K(u) = [xn ' - ux(n ' + n"( - uxn') ) 
+ X ' _ X*( '* + 34"4)], 
and ' and " are evaluated at the argument wX for . 
om the above expressions, it is seen that if (.) = 1/[1 + e-')] and X h a sym- 
metric distribution (say (0, 1)), hen a  0 and   3() , implying 
ha  is non-Gaussian in general. This resuk is consisen wkh tha obtained by 
direct application of Foster's criterion (Mukherjee, 1994). 
5 
RECONCILING LINEARIZATION AND 
STOCHASTIC APPROXIMATION METHODS 
The results of stochastic approximation analysis give a Gaussian distribution for I7V 
in the limit as u - 0 (Bucklew,Kurtz,Sethares, 1993; Finnoff, 1993). However, our 
results establish that the Gaussian distribution result is not valid for small nonzero 
u in general. To reconcile these results, recall I7V = Or (v/-fi). Hence, if we consider 
Asymptomatics of Gradient-Based Neural Network Training Algorithms 341 
only moments of the normalized quantity 17V/v/- fi (and neglect higher-order terms 
in Op(v/-fi)), we obtain E(17V/v)3 = 0 and E(17V/v) 4 -- 3[]E(17V/x/-fi)212 , which 
suggests that the normalized quantity 12V/v/- fi is Gaussian in the limit of vanishing 
/, a conclusion also reached from stochastic approximation analysis. 
In support of this theoretical indication that the conclusions of our analysis (based 
on linearization for small /) might tally with those of stochastic approxima- 
tion techniques for small values of/, simulations were done on the simple one- 
dimensional training case of the previous section for 8 cases: u = 0.1, 0.2,0.3,0.5, 
and a 2 -- 0.1, 0.5 for each value of u, with w  fixed at 3. For each of the 8 cases, 
either 5 or 10 runs were made, with lengths (for the given values of u) of 810000, 
500000, 300000, and 200000 respectively. Each run gave a pair of sequences {17Vn} 
obtained by starting off at 17V0 = 0 and training the network independently twice. 
Each resulting sequence {17Vn} was then downsampled at a large enough rate that 
the true autocorrelation of the downsampled sequence was less than 0.05, followed 
by deleting the first 10% of the samples of this downsampled sequence, so as to 
remove any dependence on initial conditions that might persist. (Autocorrelation 
at lag unity for this Markov Chain was so high that when u = 0.1, a decimation rate 
of 9000 was required.) This was done to ensure that the elements of the resulting 
downsampled sequences could be assumed independent for the various hypothesis 
tests that were to follow. 
(a) For each run of each case, the empirical distribution functions of the two 
downsampled sequences thus generated were compared by means of the 
Kolmogorov-Smirnov test (Bickel,Doksum, 1977) at level 0.95, with the 
null hypothesis being that both sequences had the same actual cumulative 
distribution function (assumed continuous). This test was passed with ease 
on all trials, thereby showing that a limiting distribution existed and was 
attained by such a training algorithm. 
(b) For each run of each case, a skewness test and a kurtosis test 
(Bickel,Doksum, 1977) for normality were done at level 0.95 to test for 
normality. The sequences generated failed both tests for the (u, or) pair 
(0.1,0.1) and passed them both for the pairs (0.1,0.5), (0.3,0.1), (0.5,0.1), 
and (0.5,0.5). For the pair (0.2,0.5), the skewness test was passed and the 
kurtosis test failed, and for the pairs (0.2,0.1) and (0.3,0.5), the skewness 
test was failed and the kurtosis test passed. 
(c) All trials cleared the Kolmogorov tests (Bickel,Doksum, 1977) for normality 
at level 0.95, both when the normal distribution was taken to have the 
sample mean and variance (computed on the downsampled sequence), and 
when the normal distribution function had the asymptotic values of mean 
(zero) and variance (cr2/2). 
Hence we may conclude: 
1. The limiting distribution of (Wn) exists. 
2. For small values of/, the deviation from Gaussianhess is so small that the 
Gaussian distribution may be taken as a good approximation to the limiting 
distribution. 
342 Sayandev Mukherjee, Terrence L. Fine 
In other words, though stochastic approximation analysis states that 17V/v/- fi is 
Gaussian only in the limit of vanishing/, our simulation shows that this is a good 
approximation for small values of/ as well. 
Acknowledgements 
The research reported here was partially supported by NSF Grant SBR-9413001. 
References 
Berger, Marc A. An Introduction to Probability and Stochastic Processes. Springer- 
Verlag, New York, 1993. 
Bickel, Peter, and Doksum, Kjell. Mathematical Statistics: Basic Ideas and Selected 
Topics. Holden-Day, San Francisco, 1977. 
Bucklew, J.A., Kurtz, T.G., and Sethares, W.A. "Weak Convergence and Local 
Stability Properties of Fixed Step Size Recursive Algorithms," IEEE Trans. Inform. 
Theory, vol. 39, pp. 966-978, 1993. 
Finnoff, W. "Diffusion Approximations for the Constant Learning Rate Backprop- 
agation Algorithm and Resistence to Local Minima." In Giles, C.L., Hanson, S.J., 
and Cowan, J.D., editors, Advances in Neural Information Processing Systems 5. 
Morgan Kaufmann Publishers, San Mateo CA, 1993, p.459 if. 
Kuan, C-M, and Hornik, K. "Convergence of Learning Algorithms with Constant 
Learning Rates," IEEE Trans. Neural Networks, vol. 2, pp. 484-488, 1991. 
Leen, T.K., and Moody, J.E. "Weight Space Probability Densities in Stochastic 
Learning: I. Dynamics and Equilibria," Adv. in NIPS 5, Morgan Kaufmann Pub- 
lishers, San Mateo CA, 1993, p.451 if. 
Love, M. Probability Theory I, 4th ed. Springer-Verlag, New York, 1977. 
Mukherjee, Sayandev. Asymptotics of Gradient-based Neural Network Training Al- 
gorithms. M.S. thesis, Cornell University, Ithaca, NY, 1994. 
Orr, G.B., and Leen, T.K. "Probability densities in stochastic learning: II. Tran- 
sients and Basin Hopping Times," Adv. in NIPS 5, Morgan Kaufmann Publishers, 
San Mateo CA, 1993, p.507 if. 
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. "Learning interval represen- 
tations by error propagation." In D.E. Rumelhart and J.L. McClelland, editors, 
Parallel Distributed Processing, Ch. 8, MIT Press, Cambridge MA, 1985. 
Tweedie, R.L. "Criteria for Classifying General Markov Chains,"Adv. Appl. Prob., 
vol. 8, 737-771, 1976. 
White, H. "Some Asymptotic Results for Learning in Single Hidden-Layer Feedfor- 
ward Network Models," J. Am. Star. Assn., vol. 84, 1003-1013, 1989. 
PART IV 
REINFORCEMENT LEARNING 
