Weight Space Probability Densities 
in Stochastic Learning: 
I. Dynamics and Equilibria 
Todd K. Leen and John E. Moody 
Department of Computer Science and Engineering 
Oregon Graduate Institute of Science & Technology 
19600 N.W. von Neumann Dr. 
Beaverton, OR 97006-1999 
Abstract 
The ensemble dynamics of stochastic learning algorithms can be 
studied using theoretical techniques from statistical physics. We 
develop the equations of motion for the weight space probability 
densities for stochastic learning algorithms. We discuss equilibria 
in the diffusion approximation and provide expressions for special 
cases of the LMS algorithm. The equilibrium densities are not in 
general thermal (Gibbs) distributions in the objective function be- 
ing minimized, but rather depend upon an effective potential that 
includes diffusion effects. Finally we present an exact analytical 
expression for the time evolution of the density for a learning algo- 
rithm with weight updates proportional to the sign of the gradient. 
I Introduction: Theoretical Framework 
Stochastic learning algorithms involve weight updates of the form 
w(n + 1) = w(n) + .(n)H[w(n),z(n)] (1) 
where w e "' is the vector of m weights, p is the learning rate, H[.] e "' is the 
update function, and x(n) is the exemplar (input or input/target pair) presented 
451 
452 Leen andMoody 
to the network at the n th iteration of the learning rule. Often the update function 
is based on the gradient of a cost function H(w,x) - -O(w,x) low. We assume 
that the exemplars are i.i.d. with underlying probability density p(x). 
We are interested in studying the time evolution and steady state behavior of 
the weight space probability density P(w, n) for ensembles of networks trained by 
stochastic learning. Stochastic process theory and classical statistical mechanics 
provide tools for doing this. As we shall see, the ensemble behavior of stochas- 
tic learning algorithms is similar to that of diffusion processes in physical systems, 
although significant differences do exist. 
1.1 Dynamics of the Weight Space Probability Density 
Equation (1) defines a Markov process on the weight space. Given the particular 
input x, the single time-step transition probability density for this process is a Dirac 
delta function whose arguments satisfy the weight update (1): 
w(co' = (2) 
From this conditional transition probability, we calculate the total single time-step 
transition probability (Leen and Orr 1992, Ritter and Schulten 1988) 
W(co t  co) = ( ( co -co'- pH[wt,x]) ), (3) 
where (...) denotes integration over the measure on the random variable x. 
The time evolution of the density is given by the Kolmogorov equation 
P(co, n + 1) = faw t P(wt, n) W(co t  co), (4) 
which forms the basis for our dynamical description of the weight space probability 
density 1 
Stationary, or equilibrium, probability distributions are eigenfunctions of the tran- 
sition probability 
's(co) = f aco' 's(') w(co' -, co). (5) 
It is particularly interesting to note that for problems in which there exists an 
optimal weight co, such that 
H(CO,,x) = 0, z , 
one stationary solution is a delta function at co - co,. An important class of such 
examples are noise-free mapping problems for which weight values exist that realize 
the desired mapping over all possible input]target pairs. For such problems, the 
ensemble can settle into a sharp distribution at the optimal weights (for examples 
see Leen and Orr 1992, Orr and Leen 1993). 
Although the Kolmogorov equation can be integrated numerically, we would like 
to make further analytic progress. Towards this end we convert the Kolmogorov 
An alternative is to base the time evolution on a suitable master equation. Both 
approaches give the same results. 
Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria 453 
equation into a differential-difference equation by expanding (3) as a power series 
in p. Since the transition probability is defined in the sense of generalized functions 
(i.e. distributions), the proper way to proceed is to smear (4) with a smooth test 
function of compact support f(w) to obtain 
do; f(w) P(w,n + 1) : / dwdw' f(w) P(w',n) W(w' -+ co) (6) 
Next we use the transition probability (3) to perform the integration over w and 
expand the resulting expression as a power series in p. Finally, we integrate by 
parts to take derivatives off f, dropping the surface terms. This results in a discrete 
time version of the classic Kramers-Moyal expansion (Risken 1989) 
+ 1) - = 
0 i 
Owix Owi ... Owi 
{ (pHi, phi,... pH,) P(w, n) } , (7) 
where Hj denotes the ja ta component of the m-component vector H. 
In section 3, we present an algorithm for which the Kramers-Moyal expansion can 
be explicitly summed. In general the full expansion is not analytically tractable, 
and to make further analytic progress we will truncate it at second order to obtain 
the Fokker-Planck equation. 
1.2 The Fokker-Planck (Diffusion) Approximation 
For small enough IpHI, the Kramers-Moyal expansion (7) can be truncated to 
second order to obtain a Fokker-Planck equation: 2 
+ 1) - = 
+ 2 . (s) 
In (8), and throughout the remainder of the paper, repeated indices are summed 
over. In the Fokker-Planck approximation, only two coefficients appear: Ai(w)  
(Hi), called the drift vector, and Bij(w) -- (Hi H5), called the diffusion matrix. 
The drift vector is simply the average update applied at w. Since the diffusion 
coefficients can be strongly dependent on the position in weight space, the equilib- 
rium densities will, in general, not be thermal (Gibbs) distributions in the potential 
corresponding to (H(w,x)). This is exemplified in our discussion of equilibrium 
densities for the LMS algorithm in section 2.1 below 3. 
2 Radons et al. (1990) independently derived a Fokker-Planck equation for backpropaga- 
tion. Earlier, Ritter and Schulten (1988) derived a Fokker-Planck equation (for Kohonen's 
self-ordering feature map) that is valid in the neighborhood of a local optimum. 
aSee (Leen and Orr 1992, Orr and Leen 1993) for further examples. 
454 Leen andMoody 
2 
Equilibrium Densities in the Fokker-Planck 
Approximation 
In equilibrium the probability density is stationary, P(w, n+ 1) = P(w, n) m Ps(w), 
so the Fokker-Planck equation (8) becomes 
o j,(o;) =  A,(o;) P,(o;) -- [ B,j(o;) P,(o;) ] (9) 
0- 0,--7' O i 2 0,  
Here, we have implicitly defined the probability density current J(w). In equilib- 
rium, its divergence is zero. 
If the drift and diffusion coefficients satisfy potential conditions, then the equilibrium 
current itself is zero and detailed balance is obtained. The potential conditions are 
(Gardiner, 1990) 
[1 0 Bij(w)- Ai(w) ] (10) 
ozk oz, = o, where Zk(o:) _= 50j 
Under these conditions the solution to (9) for the equilibrium density is: 
i e_2:()/ .(w) m fdw Z(w) (11) 
es(o:) = , 
where K is a normalization constant and (w) is called the effective potential. 
In general, the potential conditions are not satisfied for stochastic learning algo- 
rithms in multiple dimensions. 4 In this respect, stochastic learning differs from 
most physical diffusion processes. However for LMS with inputs whose correlation 
matrix is isotropic, the conditions are satisfied and the equilibrium density can be 
reduced to the quadrature in (11). 
2.1 Equilibrium Density for the LMS Algorithm 
The best known on-line learning system is the LMS adaptive filter. For the LMS 
algorithm, the training examples consist of input/target pairs x(n) = {s(n), t(n) 
the model output is u(n) = w. s(n), and the cost function is the squared error: 
1 ]2_ 1 
(w,:r(n)) = [t(n)-u(n) -- [t(n)-w.s(n) (12) 
The resulting update equations (for constant learning rate/) are 
o:(.+1) = 04.) + 
(13) 
We assume that the training data are generated according to a "signal plus noise" 
model: 
t(n) = w, . s(n) + e(n) , (14) 
where co, is the "true" weight vector and e(n) is i.i.d. noise with mean zero and 
variance a 2. We denote the correlation matrix of the inputs s(n) by R and the 
4For one-dimensional algorithms, the potential conditions axe trivially satisfied. 
Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria 455 
fouri:h order correlation tensor of the inputs by $. 
origin of coordinates in weight space and define the weight error vector 
V  60 -- Od,, 
In terms of v, the weight update is 
v(n + 1) = v(n) -- p [s(n). v(n)] s(n) + pc(n) s(n). 
The drift vector d diffusion matrix are given by 
Ai = - (sis 5 )  vj = -Rij vj 
and 
It is convenient to shift the 
(15) 
The spatial dependence of the the diffusion coefficient forces the effective potential 
to soften relative to the cost function for large Ivl. This accentuates the tails of the 
distribution relative to a gaussian. 
Bij = ( si sj sk st vk v, + e 2 si sj )s,e = $ijt v vt + er 2 Rij (16) 
respectively. Notice that the diffusion matrix is quadratic in v. Thus  we move 
away from the globM minimum at v = 0, diffusive spreading of the probability 
density is enhced. Notice also that, in generM, both terms of the diffusion matrix 
contribute  anisotropy. 
We further assume that the inputs are drawn from a zero-me Gaussi process. 
This sumption Mlows us to appeal to the Gaussian moment factoring theorem 
(Haykin, 1991, p318) to express the fourth-order correlation S in terms of R 
The diffusion matrix reduces to 
B = (vRv+)R + 2(Rv)(Rv)  (17) 
To compute the effective potentiM (10 and 11) the diffusion matrix is inverted 
using the Sherm-Morrison formula (Press, 1987, p67). As a final simplification, 
we assume that the input distribution is spherically symmetric. Thus 
R=rI, 
where  denotes the identity matrix. 
Together these assumptions insure detled bMance, d we c integrate (11) in 
closed form. In figure 1, we compare the effective potential W(v) (for 1-D LMS) 
with the potentiM corresponding to the quadratic cost function. 
Fig.l: Effective potential cot r,,ctio, (solid curve) for LMS. 
456 Leen and Moody 
The equilibrium density is 
3  
Ps(v) =  1 + 7/1.1 , (is) 
where, as before, m and K denote the dimension of the weight vector and the 
normalization constant for the density respectively. For a 1-D filter, the equilibrium 
density can be found in closed form without assuming Gaussian input data. We 
find 
1 [ S v2]-(+-r) 
Ps(v) =  1+-- . (19) 
r if2 
With gaussian inputs (for which S = 3r 2) (19) properly reduces to (is) with m = 1. 
The equilibrium densities (18) and (19) are clearly not gaussian, however in the limit 
of very small pr they reduce to gaussian distributions with variance pa2/2. Figure 
2 shows a comparison between the theoretical result and a histogram of 200,000 
values of v generated by simulation with p = 0.005, and a s = 1.0. The input data 
were drawn from a zero-mean Gaussian distribution with r = 4.0. 
O' O' 
-0.2 - .1 0:0 .1 0.2 
v 
Fig.2: Equilibrium density for 1-D LMS 
3 An Exactly Summable Model 
As in the case of LMS learning above, stochastic gradient descent algorithms update 
weights based on an instantaneous estimate of the gradient of some average cost 
function (w) - ( (v, x))r. That is, the update is given by 
0 
ui(,x) = 
An alternative is to increment or decrement each weight by a fixed amount depend- 
ing only on the sign of O/Ocoi. We formulated this alternative update rule because 
it avoids a common problem for sigmoidal networks, getting stuck on "fiat spots" or 
"plateaus". The standard gradient descent update rule yields very slow movement 
on plateaus, while second order methods such as gauss-newton can be unstable. 
The sign-of-gradient update rule suffers from neither of these problems. 5 
SThe use of the sign of the gradient has been suggested previously in the stochastic 
approximation literature by Fabian (1960) and in the neural network literature by Derthick 
(1984). 
Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria 457 
If at each iteration one chooses a weight at random for updating, then the Kramers- 
Moyal expansion can be exactly summed. Thus at each iteration we 1) choose a 
weight wi and an exemplar x at random, and 2) update wi with 
Hi(w,x) = -sign (0(w,x(n)) 
(20) 
With this update rule, H 5 = d:l or 0 and Hi H 5 = 6ij (or 0). All of the coefficients 
( HiHjHk ... )x in the Kramers-Moyal expansion (7) vanish unless i = j = k = .... 
The remaining series can be summed by breaking it into odd and even parts. This 
leaves 
P(,.,,. + ) - P(,.) = 
m 
1 
- m  {P(+m'n)A(w+m)-P(w-m'")A(-m) } 
=1 
m 
1 
+ 2m  { P(w+P'n) B(w+Pi)-2P(w'n)B)(w) 
j=l 
+ V(w-p),n)B)5(w-pS) } (21) 
where pj denotes a displacement Mong wj a distance p, As(w)  (Hj(w,x) ), and 
tnt = i ..lss = 0, for al., i. 
which ce Bit(w)  0. Although exact, (21) curiously h the form of a second 
order finite difference appromation to the Fokker-Planck equation with diagonal 
diffusion matrix. This form is understdable, since the dynamics (20) restrict the 
weight vMues w to a hypercubic lattice with cell length p d generate only newest 
neighbor interactions. 
n--1 n=50 
-0.5 0.5 I 1.5 2 2.5 -0.5 0.5 I 1.5 2 2.5 
v v 
n -- 500 n -- 5000 
-0.5 0.5 I 1.5 2 2.5 -0.5 0.5 I 1.5 2 2.5 
v v 
Fig.3: Sequence of densities for the XOR problem 
As an example, figure 3 shows the cost function evaluated along a 1-D slice through 
the weight space for the XOR problem. Along this line are local and global minima 
at v --- I and v - 0 respectively. Also shown is the probability density (vertical 
lines). The sequence shows the spreading of the density from its initialization at 
the local minimum, and its eventual collection at the global minimum. 
458 Leen andMoody 
4 Discussion 
A theoretical approach that focuses on the dynamics of the weight space probability 
density, as we do here, provides powerful tools to extend understanding of stochastic 
search. Both transient and equilibrium behavior can be studied using these tools. 
We expect that knowledge of equilibrium weight space distributions can be used in 
conjunction with theories of generalization (e.g. Moody, 1992) to assess the influence 
of stochastic search on prediction error. Characterization of transient phenomena 
should facilitate the design and evaluation of search strategies such as data batching 
and adaptive learning rate schedules. Transient phenomena are treated in greater 
depth in the companion paper in this volume (Orr and Leen, 1993). 
Acknowledgements 
T. Leen was supported under grants N00014-91-J-1482 and N00014-90-J-1349 from 
ONR. J. Moody was supported under grants 89-0478 from AFOSR, ECS-9114333 
from NSF, and N00014-89-J-1228 and N00014-92-J-4062 from ONR. 
References 
Todd K. Leen and Genevieve B. Orr (1992), Weight-space probability densities and conver- 
gence times for stochastic learning. In International Joint Conference on Neural Networks, 
pages IV 158-164. IEEE, June. 
H. Ritter and K. Schulten (1988), Convergence properties of Kohonen's topology conserv- 
ing maps: Fluctuations, stability and dimension selection, Biol. Cybern., 60, 59-71. 
Genevieve B. Orr and Todd K. Leen (1993), Probability densities in stochastic learning: 
II. Transients and Basin Hopping Times. In Giles, C.L., Hanson, S.J., and Cowan, J.D. 
(eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan 
Kaufmann Publishers. 
H. Risken (1989), The Fokker-Planck Equation Springer-Verlag, Berlin. 
G. Radons, H.G. Schuster and D. Werner (1990), Fokker-Planck description of learning in 
backpropagation networks, International Neural Network Conference - INNC 90, Paris, II 
993-996, Kluwer Academic Publishers. 
C.W. Gatdiner (1990), Handbook of Stochastic Methods, nd Ed. Springer-Verlag, Berlin. 
Simon Haykin (1991), Adaptive Filter Theory, nd edition. Prentice Hall, Englewood 
Cliffs, N.J. 
W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling (1987) Numerical Recipes 
- the Art of Scientific Computing. Cambridge University Press, Cambridge / New York. 
V. Fabian (1960), Stochastic approximation methods. Czechoslovak Math J., 10, 123-159. 
Mark Derthick (1984), Variations on the Boltzmann machine learning algorithm. Technical 
Report CMU-CS-84-120, Department of Computer Science, Carnegie-Mellon University, 
Pittsburgh, PA, August. 
John E. Moody (1992), The effective number of parameters: An analysis of generaliza- 
tion and regularization in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and 
R.P. Lipmann, editors, Advances in Neural Information Processing Systems . Morgan 
Kaufmann Publishers, San Mateo, CA. 
