A Local Algorithm to Learn Trajectories 
with Stochastic Neural Networks 
Javier R. Movellan* 
Department of Cognitive Science 
University of California San Diego 
La Jolla, CA 92093-0515 
Abstract 
This paper presents a simple algorithm to learn trajectories with a 
continuous time, continuous activation version of the Boltzmann 
machine. The algorithm takes advantage of intrinsic Brownian 
noise in the network to easily compute gradients using entirely local 
computations. The algorithm may be ideal for parallel hardware 
implementations. 
This paper presents a learning algorithm to train continuous stochastic networks 
to respond with desired trajectories in the output units to environmental input 
trajectories. This task has potential applications to a variety of problems, such
as stochastic modeling of neural processes, artificial motor control, and continuous 
speech recognition. For example, in a continuous speech recognition problem, the 
input trajectory may be a sequence of fast Fourier transform coefficients, and the 
output a likely trajectory of phonemic patterns corresponding to the input. This
paper builds on recent work on diffusion networks by Movellan and McClelland
(in press) and on recent papers by Apolloni and de Falco (1991) and Neal (1992)
on asymmetric Boltzmann machines. The learning algorithm can be seen as a 
generalization of their work to the stochastic diffusion case and to the problem of 
learning continuous stochastic trajectories. 
Diffusion networks are governed by the standard connectionist differential equations
plus an independent additive noise component. The resulting process is governed
by a set of Langevin stochastic differential equations
$$da_i(t) = \lambda_i\,\mathrm{drift}_i(t)\,dt + \sigma\,dB_i(t); \quad i \in \{1, \ldots, n\} \qquad (1)$$
where $\lambda_i$ is the processing rate of the $i$th unit, $\sigma$ is the diffusion constant, which
controls the flow of entropy throughout the network, and $dB_i(t)$ is a Brownian motion
differential (Soong, 1973). The drift function is the deterministic part of the process.
For consistency I use the same drift function as in Movellan and McClelland (1992),
but many other options are possible: $\mathrm{drift}_i(t) = \sum_{j=1}^{n} w_{ij}\,a_j(t) - f^{-1}(a_i(t))$, where
$w_{ij}$ is the weight from the $j$th to the $i$th unit, and $f^{-1}$ is the inverse of a logistic
function scaled in the (min, max) interval: $f^{-1}(a) = \log\frac{a - \min}{\max - a}$.
* Part of this work was done while at Carnegie Mellon University.
In practice, diffusion networks (DNs) are simulated on digital computers with a system
of stochastic difference equations
$$a_i(t+\Delta t) = a_i(t) + \lambda_i\,\mathrm{drift}_i(t)\,\Delta t + \sigma\,z_i(t)\sqrt{\Delta t}; \quad i \in \{1, \ldots, n\} \qquad (2)$$
where $z_i(t)$ is a standard Gaussian random variable. I start the derivations of
the learning algorithm for the trajectory learning task using the discrete time process
(equation 2) and then I take limits to obtain the continuous diffusion expression.
To simplify the derivations I adopt the following notation: a trajectory
of states (input, hidden and output units) is represented as $a = [a(1) \ldots a(t_m)]$.
The trajectory vector can be partitioned into 3 consecutive row vectors representing
the trajectories of the input, hidden and output units, $a = [x \; h \; y]$.
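Equation 2 can be iterated directly. The following Python sketch does so for a small network with units scaled to the (0, 1) interval. It is an illustration under stated assumptions rather than the simulator used for the experiments: the function names, the clipping constant `eps`, and the choice to treat every unit as free-running (no clamped input units) are assumptions, not details from this paper.

```python
import math
import random

def inv_logistic(a, lo=0.0, hi=1.0):
    """Inverse of a logistic function scaled to the (lo, hi) interval."""
    return math.log((a - lo) / (hi - a))

def drift(a, W, lo=0.0, hi=1.0):
    """drift_i = sum_j w_ij * a_j - f^{-1}(a_i), computed for every unit i."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) - inv_logistic(a_i, lo, hi)
            for row, a_i in zip(W, a)]

def simulate(W, lam, sigma, a0, n_steps, dt, rng, eps=1e-6):
    """Iterate equation 2; returns (trajectory, noise) as lists of per-step lists."""
    traj, noise = [list(a0)], []
    for _ in range(n_steps):
        a = traj[-1]
        d = drift(a, W)
        z = [rng.gauss(0.0, 1.0) for _ in a]  # standard Gaussian z_i(t)
        nxt = [a_i + l_i * d_i * dt + sigma * z_i * math.sqrt(dt)
               for a_i, l_i, d_i, z_i in zip(a, lam, d, z)]
        # Assumption (not in the paper): clip to (0, 1) so the log in
        # inv_logistic stays defined after a large noise draw.
        traj.append([min(max(v, eps), 1.0 - eps) for v in nxt])
        noise.append(z)
    return traj, noise
```

Keeping the noise draws alongside the states is convenient because the learning rule derived next is written in terms of $z_i(t)$.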
The key to the learning algorithm is obtaining the gradient of the probability of 
specific trajectories. Once we know this gradient we have all the information needed 
to increase the probability of desired trajectories and decrease the probability of 
unwanted trajectories. To obtain this gradient we first need to do some derivations 
on the transition probability densities. Using the discrete time approximation to 
the diffusion process, it follows that the conditional transition probability density 
functions are multivariate Gaussian 
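$$p(a(t+\Delta t) \mid a(t)) = \frac{1}{(2\pi\sigma^2\Delta t)^{n/2}} \exp\left(-\sum_{i=1}^{n} \frac{\left(a_i(t+\Delta t) - a_i(t) - \lambda_i\,\mathrm{drift}_i(t)\,\Delta t\right)^2}{2\sigma^2\Delta t}\right) \qquad (3)$$
From equations 2 and 3 it follows that
$$\frac{\partial}{\partial w_{ij}} \log p(a(t+\Delta t) \mid a(t)) = \frac{\lambda_i}{\sigma}\,z_i(t)\sqrt{\Delta t}\;a_j(t) \qquad (4)$$
Since the network is Markovian, the probability of an entire trajectory can be
computed from the product of the transition probabilities
$$p(a) = p(a(t_0)) \prod_{t=t_0}^{t_m-\Delta t} p(a(t+\Delta t) \mid a(t)) \qquad (5)$$
The derivative of the probability of a specific trajectory follows
$$\frac{\partial p(a)}{\partial w_{ij}} = p(a)\,\frac{\lambda_i}{\sigma} \sum_{t=t_0}^{t_m-\Delta t} z_i(t)\sqrt{\Delta t}\;a_j(t) \qquad (6)$$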
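The step from the Gaussian transition density to the weight derivative is short enough to spell out. What follows is a reconstruction from equation 2 and the definition of the drift, not a quotation from another source, and it assumes that the density of the initial state $a(t_0)$ does not depend on the weights.

```latex
\begin{align*}
\log p(a(t+\Delta t) \mid a(t))
  &= -\frac{n}{2}\log(2\pi\sigma^2\Delta t)
     - \sum_{k=1}^{n}
       \frac{\left(a_k(t+\Delta t) - a_k(t) - \lambda_k\,\mathrm{drift}_k(t)\,\Delta t\right)^2}
            {2\sigma^2\Delta t}, \\
\frac{\partial}{\partial w_{ij}} \log p(a(t+\Delta t) \mid a(t))
  &= \frac{a_i(t+\Delta t) - a_i(t) - \lambda_i\,\mathrm{drift}_i(t)\,\Delta t}{\sigma^2\Delta t}
     \;\lambda_i\,\Delta t\;\frac{\partial\,\mathrm{drift}_i(t)}{\partial w_{ij}} \\
  &= \frac{\sigma\,z_i(t)\sqrt{\Delta t}}{\sigma^2\Delta t}\;\lambda_i\,\Delta t\;a_j(t)
   = \frac{\lambda_i}{\sigma}\,z_i(t)\sqrt{\Delta t}\;a_j(t), \\
\frac{\partial}{\partial w_{ij}} \log p(a)
  &= \frac{\lambda_i}{\sigma} \sum_{t} z_i(t)\sqrt{\Delta t}\;a_j(t).
\end{align*}
```

Only the noise received by unit $i$ and the activation of unit $j$ appear in the result, which is what makes the rule local.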
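In practice, the above rule is all that is needed for discrete time computer simulations.
We can obtain the continuous time form by taking limits as $\Delta t \to 0$, in which case
the sum becomes Ito's stochastic integral of $a_j(t)$ with respect to the Brownian
motion differential over a $[t_0, T]$ interval
$$\frac{\partial p(a)}{\partial w_{ij}} = p(a)\,\frac{\lambda_i}{\sigma} \int_{t_0}^{T} a_j(t)\,dB_i(t) \qquad (7)$$
A similar equation may be obtained for the $\lambda_i$ parameters
$$\frac{\partial p(a)}{\partial \lambda_i} = p(a)\,\frac{1}{\sigma} \int_{t_0}^{T} \mathrm{drift}_i(t)\,dB_i(t) \qquad (8)$$
For notational convenience I define the following random variables and refer to them
as the delta signals
$$\delta_{w_{ij}}(a) = \frac{\partial \log p(a)}{\partial w_{ij}} = \frac{\lambda_i}{\sigma} \int_{t_0}^{T} a_j(t)\,dB_i(t) \qquad (9)$$
and
$$\delta_{\lambda_i}(a) = \frac{\partial \log p(a)}{\partial \lambda_i} = \frac{1}{\sigma} \int_{t_0}^{T} \mathrm{drift}_i(t)\,dB_i(t) \qquad (10)$$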
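In a discrete time simulation the two delta signals become running sums over time steps, with $z_i(t)\sqrt{\Delta t}$ standing in for $dB_i(t)$. The Python sketch below computes them for a stored trajectory and also provides the trajectory log-probability, so the sums can be checked against finite differences. It assumes units scaled to (0, 1) and a fixed initial state; all function names are illustrative and do not come from the paper.

```python
import math

def drift(a, W):
    """drift_i = sum_j w_ij a_j - log(a_i / (1 - a_i)); (0, 1) scaling assumed."""
    return [sum(w * x for w, x in zip(row, a)) - math.log(a_i / (1.0 - a_i))
            for row, a_i in zip(W, a)]

def residual_noise(traj, W, lam, sigma, dt):
    """Recover z_i(t) from consecutive states by inverting equation 2."""
    zs = []
    for a, nxt in zip(traj[:-1], traj[1:]):
        d = drift(a, W)
        zs.append([(n_i - a_i - l_i * d_i * dt) / (sigma * math.sqrt(dt))
                   for a_i, n_i, l_i, d_i in zip(a, nxt, lam, d)])
    return zs

def log_prob(traj, W, lam, sigma, dt):
    """Log of the product of Gaussian transition densities (initial state fixed)."""
    n = len(traj[0])
    const = -0.5 * n * math.log(2.0 * math.pi * sigma ** 2 * dt)
    return sum(const - 0.5 * sum(z * z for z in zs)
               for zs in residual_noise(traj, W, lam, sigma, dt))

def delta_signals(traj, W, lam, sigma, dt):
    """Discrete sums approximating the Ito integrals in the delta signals."""
    n = len(W)
    d_w = [[0.0] * n for _ in range(n)]
    d_lam = [0.0] * n
    zs = residual_noise(traj, W, lam, sigma, dt)
    for a, z in zip(traj[:-1], zs):
        d = drift(a, W)
        for i in range(n):
            db = z[i] * math.sqrt(dt)  # Brownian increment dB_i(t)
            d_lam[i] += d[i] * db / sigma
            for j in range(n):
                d_w[i][j] += lam[i] * a[j] * db / sigma
    return d_w, d_lam
```

Because the log-probability is quadratic in each $w_{ij}$ and each $\lambda_i$, a central finite difference of `log_prob` should agree with `delta_signals` up to rounding error for any trajectory that stays inside (0, 1).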
[Figure 1. Panels A and B; horizontal axis: Time Steps (0 to 300).]
Figure 1: A) A Sample Trajectory. B) The Average Trajectory. As Time Progresses,
Sample Trajectories Become Statistically Independent, Dampening the Average.
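The approach taken in this paper is to minimize the expected value of the error
assigned to spontaneously generated trajectories, $O = E(\rho(a))$, where $\rho(a)$ is a signal
indicating the overall error of a particular trajectory and usually depends only on
the output unit trajectory. Writing $O = \int \rho(a)\,p(a)\,da$ and using
$\partial p(a)/\partial w_{ij} = p(a)\,\delta_{w_{ij}}(a)$, the necessary gradients follow
$$\frac{\partial O}{\partial w_{ij}} = E(\delta_{w_{ij}}\,\rho) \qquad (11)$$
$$\frac{\partial O}{\partial \lambda_i} = E(\delta_{\lambda_i}\,\rho) \qquad (12)$$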
Since the above learning rule does not require calculating derivatives of the $\rho$ function,
it provides great flexibility, making it applicable to a wide variety of situations.
For example, $\rho(a)$ can be the total sum of squares (TSS) between the desired and obtained
output unit trajectories, or it could be a reinforcement signal indicating whether the
trajectory is or is not desirable. Figure 1.a shows a typical output of a network trained with TSS
as the $\rho$ signal to follow a sinusoidal trajectory. The network consisted of 1 input
unit, 3 hidden units, and 1 output unit. The input was constant through time and 
the network was trained only with the first period of the sinusoid. The expected 
values in equations 11 and 12 were estimated using 400 spontaneously generated 
trajectories at each learning epoch. It is interesting to note that although the net- 
work was trained for a single period, it continued oscillating without dampening. 
However, the expected value of the activations dampened, as Figure 1.b shows. 
The dampening of the average activation is due to the fact that as time progresses, 
the effects of noise accumulate and the initially phase locked trajectories become 
independent oscillators. 
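The expectations in equations 11 and 12 can be estimated by averaging the product of delta signal and error over spontaneously generated trajectories. The Python sketch below shows one way to organize that computation. It is a sketch under assumptions, not the code used for the experiment above: the function names, the clipping of activations to (0, 1), the absence of clamped input units, and the particular sinusoidal target inside `tss_to_sinusoid` are all hypothetical.

```python
import math
import random

def drift(a, W):
    """drift_i = sum_j w_ij a_j - log(a_i / (1 - a_i)); (0, 1) scaling assumed."""
    return [sum(w * x for w, x in zip(row, a)) - math.log(a_i / (1.0 - a_i))
            for row, a_i in zip(W, a)]

def sample_with_deltas(W, lam, sigma, a0, n_steps, dt, rng, eps=1e-6):
    """Simulate one trajectory (equation 2) while accumulating the delta signals."""
    n = len(W)
    a = list(a0)
    traj = [a]
    d_w = [[0.0] * n for _ in range(n)]
    d_lam = [0.0] * n
    for _ in range(n_steps):
        d = drift(a, W)
        dB = [rng.gauss(0.0, 1.0) * math.sqrt(dt) for _ in range(n)]
        for i in range(n):
            d_lam[i] += d[i] * dB[i] / sigma
            for j in range(n):
                # Local quantities only: sender activation a_j, receiver noise dB_i.
                d_w[i][j] += lam[i] * a[j] * dB[i] / sigma
        # Assumption (not in the paper): clip so the log in drift stays defined.
        a = [min(max(a[i] + lam[i] * d[i] * dt + sigma * dB[i], eps), 1.0 - eps)
             for i in range(n)]
        traj.append(a)
    return traj, d_w, d_lam

def estimate_gradients(W, lam, sigma, a0, n_steps, dt, rho, n_samples, rng):
    """Average delta * rho over spontaneously generated trajectories."""
    n = len(W)
    g_w = [[0.0] * n for _ in range(n)]
    g_lam = [0.0] * n
    for _ in range(n_samples):
        traj, d_w, d_lam = sample_with_deltas(W, lam, sigma, a0, n_steps, dt, rng)
        err = rho(traj)
        for i in range(n):
            g_lam[i] += d_lam[i] * err / n_samples
            for j in range(n):
                g_w[i][j] += d_w[i][j] * err / n_samples
    return g_w, g_lam

def tss_to_sinusoid(traj, dt=0.05, period=1.0, out=-1):
    """Hypothetical rho: summed squared error between the last unit and a sinusoid."""
    return sum((a[out] - (0.5 + 0.4 * math.sin(2 * math.pi * t * dt / period))) ** 2
               for t, a in enumerate(traj))
```

A learning epoch would then subtract a small multiple of `g_w` and `g_lam` from the weights and processing rates; the step size is not reported here and would have to be chosen by the user. Note that `rho` is only ever evaluated, never differentiated.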
[Figure 2. Panel A: two-state hidden Markov emitter; p(transition) = 0.2; hidden state 0: p(response 1) = 0.1; hidden state 1: p(response 1) = 0.8. Panel B: average error against Learning Epoch, with a reference line marked "best possible performance".]
Figure 2: A) The Hidden Markov Emitter. B) Average Error Throughout Training.
The Bayesian Limit is Achieved at About 2000 Epochs.
The learning rule is also applicable in reinforcement situations where we just have 
an overall measure of fitness of the obtained trajectories, but we do not know 
what the desired trajectory looks like. For example, in a motor control problem 
we could use as fitness signal ($-\rho$) the distance walked by a robot controlled by
a DN. Equations 11 and 12 could then be used to gradually improve the
average distance walked by the robot. In trajectory recognition problems we could 
use an overall judgment of the likelihood of the obtained trajectories. I tried this 
last approach with a toy version of a continuous speech recognition problem. The 
"emitter" was a hidden Markov model (see Figure 2) that produced sequences of 
outputs (the equivalent of fast Fourier transform loads) that were fed as input to the receiver.
The receiver was a DN that received as input sequences of 10 outputs
from the emitter Markov model. The network's task was to guess the sequence
of hidden states of the emitter given the sequence of outputs from the emitter. 
The DN outputs were interpreted as the inferred state of the emitter. Output unit 
activations greater than 0.5 were evaluated as indicating that the emitter was in 
state 1 at that particular time. Outputs smaller than 0.5 were evaluated as state
0. To achieve optimal performance in this task the network had to combine two 
sources of information: top-down information about typical state transitions of the
emitter, and bottom-up information about the likelihood of the hidden states of the
emitter given its responses. 
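The emitter and the read-out of the receiver are simple enough to sketch in a few lines of Python. The transition and response probabilities below are read from Figure 2, pairing state 0 with 0.1 and state 1 with 0.8. The initial state distribution is not given in the text, so the uniform default `p_initial_1=0.5` is an assumption, as are the function names.

```python
import random

P_TRANSITION = 0.2               # probability of switching hidden state (Figure 2)
P_RESPONSE_1 = {0: 0.1, 1: 0.8}  # p(response 1 | hidden state) (Figure 2)

def sample_emitter(length, rng, p_initial_1=0.5):
    """Sample hidden states and binary responses from the two-state emitter.

    p_initial_1 is an assumption: the initial distribution is not stated.
    """
    states, responses = [], []
    state = 1 if rng.random() < p_initial_1 else 0
    for _ in range(length):
        states.append(state)
        responses.append(1 if rng.random() < P_RESPONSE_1[state] else 0)
        if rng.random() < P_TRANSITION:
            state = 1 - state
    return states, responses

def decode_outputs(activations, threshold=0.5):
    """Read DN output activations as inferred emitter states."""
    return [1 if a > threshold else 0 for a in activations]
```

For instance, `sample_emitter(10, random.Random(0))` yields one sequence of 10 hidden states with the 10 responses that would be fed to the receiver, and `decode_outputs` applies the 0.5 threshold described above.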
The network was trained with rules 11 and 12 using the negative log joint prob- 
ability of the DN input trajectory and the DN output trajectory as error signal. 
This signal was calculated using the transition probabilities of the emitter hidden 
Markov model and did not require knowledge of its actual state trajectories. The 
necessary gradients for equations 11 and 12 were estimated using 1000 spontaneous 
trajectories at each learning epoch. As Figure 2.b shows, the network started by producing
unlikely trajectories but continuously improved. The figure also shows the
performance expected from an optimal classifier. As training progressed the network 
approached optimal performance. 
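An error signal of this kind needs only the emitter's parameters, the observed responses, and the states inferred by the network. The Python sketch below computes the negative log joint probability of a response sequence and an inferred state sequence. The parameter values are those read from Figure 2; the uniform initial state probability and the function name are assumptions, since the text does not specify them.

```python
import math

def neg_log_joint(states, responses, p_transition=0.2,
                  p_response_1=(0.1, 0.8), p_initial_1=0.5):
    """Negative log joint probability of inferred states and observed responses.

    p_response_1[s] is p(response 1 | state s). p_initial_1 is an assumption:
    the initial state distribution is not stated in the text.
    """
    log_p = math.log(p_initial_1 if states[0] == 1 else 1.0 - p_initial_1)
    # Top-down term: how plausible the inferred state transitions are.
    for prev, cur in zip(states[:-1], states[1:]):
        log_p += math.log(p_transition if cur != prev else 1.0 - p_transition)
    # Bottom-up term: how well each inferred state explains its response.
    for s, r in zip(states, responses):
        p1 = p_response_1[s]
        log_p += math.log(p1 if r == 1 else 1.0 - p1)
    return -log_p
```

The function never looks at the emitter's actual hidden states, only at the states proposed by the network, so it can serve as the $\rho$ signal without supervision on the state trajectory.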
Acknowledgements 
This work was funded through the NIMH grant MH47566 and a grant from the 
Pittsburgh Supercomputer Center. 
References 
B. Apolloni & D. de Falco. (1991) Learning by asymmetric parallel Boltzmann
machines. Neural Computation, 3, 402-408.
R. Neal. (1992) Asymmetric Parallel Boltzmann Machines are Belief Networks.
Neural Computation, 4, 832-834.
J. Movellan & J. McClelland. (1992a) Learning continuous probability distributions
with symmetric diffusion networks. To appear in Cognitive Science.
T. Soong. (1973) Random Differential Equations in Science and Engineering.
Academic Press, New York.
