A Learning Analog Neural Network Chip 
with Continuous-Time Recurrent 
Dynamics 
Gert Cauwenberghs* 
California Institute of Technology 
Department of Electrical Engineering 
128-95 Caltech, Pasadena, CA 91125 
E-mail: gertcco. caltech. edu 
Abstract 
We present experimental results on supervised learning of dynam- 
ical features in an analog VLSI neural network chip. The recur- 
rent network, containing six continuous-time analog neurons and 42 
free parameters (connection strengths and thresholds), is trained to 
generate time-varying outputs approximating given periodic signals 
presented to the network. The chip implements a stochastic pertur- 
bative algorithm, which observes the error gradient along random 
directions in the parameter space for error-descent learning. In ad- 
dition to the integrated learning functions and the generation of 
pseudo-random perturbations, the chip provides for teacher forc- 
ing and long-term storage of the volatile parameters. The network 
learns a I kHz circular trajectory in 100 sec. The chip occupies 
2mm x 2mm in a 2pm CMOS process, and dissipates 1.2 mW. 
I Introduction 
Exact gradient-descent algorithms for supervised learning in dynamic recurrent net- 
works [1-3] are fairly complex and do not provide for a scalable implementation in 
a standard 2-D VLSI process. We have implemented a fairly simple and scalable 
*Present address: Johns Hopkins University, ECE Dept., Baltimore MD 21218-2686. 
858 
A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics 859 
learning architecture in an analog VLSI recurrent network, based on a stochastic 
perturbative algorithm which avoids calculation of the gradient based on an explicit 
model of the network, but instead probes the dependence of the network error on 
the parameters directly [4]. As a demonstration of principle, we have trained a 
small network, integrated with the learning circuitry on a CMOS chip, to gener- 
ate outputs following a prescribed periodic trajectory. The chip can be extended, 
with minor modifications to the internal structure of the cells, to accommodate 
applications with larger size recurrent networks. 
2 System Architecture 
The network contains six fully interconnected recurrent neurons with continuous- 
time dynamics, 
d 6 
r = + + , 
j=l 
with xi(t) the neuron states representing the outputs of the network, y(t) the 
external inputs to the network, and a(.) a sigmoidal activation function. The 36 
connection strengths Wi and 6 thresholds Oj constitute the free parameters to be 
learned, and the time constant r is kept fixed and identical for all neurons. Below, 
the parameters W and 0 are denoted as components of a single vector p. 
The network is trained with target output signals x(t) and x(t) for the first two 
neuron outputs. Learning consists of minimizing the time-averaged error 
12 
(p) = li__}rn  T E [x(t) -- xk(t)[Vdt , (2) 
using a distance metric with norm y. The learning algorithm [4] iteratively specifies 
incremental updates in the parameter vector p as 
p(+) = p() - t () *r  (3) 
with the perturbed error 
(k) = _1 ((p() + r()) _ (p() _ r()) ) (4) 
2 
obtained from a two-sided parallel activation of fixed-amplitude random perturba- 
tions h(k) onto the parameters pi(t); .i(t) _ -t-a with equal probabilities for both 
polarities. The algorithm basically performs random-direction descent of the error 
as a multi-dimensional extension to the Kiefer-Wolfowitz stochastic approximation 
method [5], and several related variants have recently been proposed for optimiza- 
tion [6,7] and hardware learning [8-10]. 
To facilitate learning, a teacher forcing signal is initially applied to the external 
input y according to 
yi(t) = X ff(xiT(t)-xi(t)) , i= 1,2 (5) 
providing a feedback mechanism that forces the network outputs towards the tar- 
gets [3]. A symmetrical and monotonically increasing "squashing" function for '7(-) 
serves this purpose. The teacher forcing amplitude A needs to be attenuated along 
the learning process, as to suppress the bias in the network outputs at convergence 
that might result from residual errors. 
860 Cauwenberghs 
3 Analog VLSI Implementation 
The network and learning circuitry are implemented on a single analog CMOS chip, 
which uses a transconductance current-mode approach for continuous-time opera- 
tion. Through dedicated transconductance circuitry, a wide linear dynamic range 
for the voltages is achieved at relatively low levels of power dissipation (experimen- 
tally 1.2 mW while either learning or refreshing). While most learning functions, 
including generation of the pseudo-random perturbations, are integrated on-chip in 
conjunction with the network, some global and higher-level learning functions of 
low dimensionality, such as the evaluation of the error (2) and construction of the 
perturbed error (4), are performed outside the chip for greater flexibility in tailoring 
the learning process. The structure and functionality of the implemented circuitry 
are illustrated in Figures i to 3, and a more detailed description follows below. 
3.1 Network Circuitry 
Figure 1 shows the schematics of the synapse and neuron circuitry. A synapse cell of 
single polarity is shown in Figure I (a). A high output impedance triode multiplier, 
using an adjustable regulated cascode [11], provides a constant current Ii. linear in the 
voltage Wij over a wide range. The synaptic current Iff feeds into a differential pair, 
injecting a differential current Ii.i cr(xj -0.) into the diode-connected Io+ut and I-ut output 
lines. The double-stack transistor configuration of the differential pair offers an expanded 
linear sigmoid range. The summed output currents Io+, and I-, of a row of synapses are 
collected in the output cell, Figure I (b), which also subtracts the reference currents 
and /f obtained from a reference row of "dummy" synapses defining the "zero-point" 
synaptic strength Woa for bipolar operation. The thus established current corresponds to 
the summed synaptic contributions in (1). Wherever appropriate (i = 1, 2), a differential 
transconductance element with inputs xi and x is added to supply an external input 
current for forced teacher action in accordance with (5). 
xof 
(a) co) 
Figure 1 Schematics of synapse and neuron circuitry. (a) Synapse of single polarity. 
(b) Output cell with current-to-voltage converter. 
The output current is converted to the neuron output voltage xi, through an active resistive 
element using the same regulated high output impedance triode circuitry as used in the 
synaptic current source. The feedback delay parameter r in (1) corresponds to the RC 
A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics 861 
product of the regulated triode active resistance value and the capacitance Oot. With 
Cout = 5 pF, the delay ranges between 20 and 200/sec, adjustable by the control voltage 
of the regulated cascode. Figure 2 shows the measured static characteristics of the synapse 
and neuron functions for different values of Wij and 0j ( i = j = 1), obtained by disabling 
the neuron feedback and driving the synapse inputs externally. 
II i I I I I 
'V 
, -0.4- 
-0.8 - -- - 1.6 V 
I I I I I 
4.o -o.s o.o o.s .o 
Input Voltage xj (v) 
(a) 
0.0 
.0.4 
4).6 
4).8 
! I I I I 
I I I I 
4.o -o.s o.o o.s .o 
Input Voltage x j (v) 
Figure 2 Measured static synapse and neuron characteristics, for various values of 
(a) the connection strength Wil, and (b) the threshold 0. 
3.2 Learning Circuitry 
Figure 3 (a) shows the simplified schematics of the learning and storage circuitry, replicated 
locally for every parameter (connection strength or threshold) in the network. Most of 
the variables relating to the operation of the cells are local, with exception of a few global 
signals communicating to all cells. Global signals include the sign and the amplitude 
of the perturbed error  and predefined control signals. The stored parameter and its 
binary perturbation are strictly local to the cell, in that they do not need to communicate 
explicitly to outside circuitry (except trivially through the neural network it drives), which 
simplifies the structural organization and interconnection of the learning cells. 
The parameter voltage Pi is stored on the capacitor Cstore, which furthermore couples 
to capacitor Cpert for activation of the perturbation. The perturbation bit *ri selects 
either of two complementary signals V+ and V_ with corresponding polarity. With 
the specific shape of the waveforms V+ and V- depicted in Figure 3 (b), the proper 
sequence of perturbation activations is established for observation of the complementary 
error terms in (4). The obtained global value for  is then used, in conjunction with 
the local perturbation bit *ri, to update the parameter value pi according to (3). A fine- 
resolution charge-pump, shown in the dashed-line inset of Figure 3 (a), is used for this 
purpose. The charge pump dumps either of a positive or negative update current, of equal 
amplitude, onto the storage capacitor whenever it is activated by means of an EN_UPD 
high pulse, effecting either of a given increment or decrement on the parameter value pi 
respectively. The update currents are supplied by two complementary transistors, and are 
switched by driving the source voltages of the transistors rather than their gate voltages 
in order to avoid typical clock feed-through effects. The amplitude of the incremental 
update, set proportionally to [[, is controlled by the VUPD n and Vum> p gate voltage 
levels, operated in the sub-threshold region. The polarity of the increment or decrement 
action is determined by the control signal DECR/INCR, obtained from the polarities of 
862 Cauwenberghs 
the perturbed error  and the perturbation bit *ri through an exclusive-or operation. The 
learning cycle is completed by activating the update by a high pulse on EN_UPD. The 
next learning cycle then starts with a new random bit value for the perturbation 
>0= 
o.1 pF 
 Cstor 
1 pF 
I I I I 
El'q UPD I I I I H 
- I I I I 
i i i I 
(p) (p+J (p ) 
Figure 3 Learning cell circuitry. (a) Simplified schematics. 
(b) Waveform and timing diagram. 
The random bit stream 7ri  is generated on-chip by means of a set of linear feedback 
shift registers [12]. For optimal performance, the perturbations need to satisfy certain 
statistical orthogonality conditions, and a rigorous but elaborate method to generate a 
set of uncorrelated bit streams in VLSI has been derived [13]. To preserve the scalability 
of the learning architecture and the local nature of the perturbations, we have chosen a 
simplified scheme which does not affect the learning performance to first order, as verified 
experimentally. The array of perturbation bits, configured in a two-dimensional arrange- 
ment as prompted by the location of the parameters in the network, is constructed by an 
outer-product exclusive-or operation from two generating linear sets of uncorrelated row 
and column bits on lines running horizontally and vertically across the network array. 
In the present implementation the evaluation of the error functional (2) is performed 
externally with discrete analog components, leaving some flexibility to experiment with 
different formulations of error functionals that otherwise would have been hardwired. A 
mean absolute difference ( = 1) norm is used for the metric distance, and the time- 
averaging of the error is achieved by a fourth-order Butterworth low-pass filter. The 
cut-off frequency is tuned to accommodate an AC ripple smaller than 0.1%, giving rise to 
a filter settling time extending 20 periods of the training signal. 
3.3 Long-Term Volatile Storage 
After learning, it is desirable to retain ("freeze") the learned information, in principle 
for an infinite period of time. The volatile storage of the parameter values on capacitors 
undergoes a spontaneous decay due to junction leakage and other drift phenomena, and 
needs to be refreshed periodically. For eight effective bits of resolution, a refresh rate of 
10 Hz is sufficient. Incidentally, the charge pump used for the learning updates provides 
for refresh of the parameter values as well. To that purpose, probing and multiplexing 
circuitry (not shown) are added to the learning cell of Figure 3 (a) for sequential refresh. 
In the experiment conducted here, the parameters are stored externally and refreshed 
sequentially by activating the corresponding charge pump with a DECR/INCR bit defined 
by the polarity of the observed deviation between internally probed and externally stored 
A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics 863 
values. The parameter refresh is performed in the background with a 100 msec cycle, 
and does not interfere with the continuous-time network operation. A simple internal 
analog storage method obliterating the need of external storage is described in [14], and 
is supported by the chip architecture. 
4 Learning Experiment 
As a proof of principle, the network is trained with a circular target trajectory 
defined by the quadrature-phase oscillator 
cos (2,,ft) (6) 
sin 
with A - 0.8V and f = lkHz. In principle a recurrent network of two neurons 
suffices to generate quadrature-phase oscillations, and the extra neurons in the 
network serve to accommodate the particular amplitude and frequency requirements 
and assist in reducing the nonlinear harmonic distortion. 
Clearly the initial conditions for the parameter values distinguish a trivial learning 
problem from a hard one, and training an arbitrarily initialized network may lead 
to unpredictable results of poor generality. Incidentally, we found that the majority 
of randomly initialized learning sessions fail to generate oscillatory behavior at con- 
vergence, the network being trapped in a local minimum defined by a strong point 
attractor. Even with strong teacher forcing these local minima persist. In contrast, 
we obtained consistent and satisfactory results with the following initialization of 
network parameters: strong positive diagonal connection strengths Wil = 1, zero 
off-diagonal terms Wij = 0; i  j and zero thresholds 0i - 0. The positive di- 
agonal connections Wii repel the neuron outputs from the point attractor at the 
origin, counteracting the spontaneous decay term -xi in (1). Applying non-zero 
initial values for the cross connections Wij; i  j would introduce a bias in the 
dynamics due to coupling between neurons. With zero initial cross coupling, and 
under strong initial teacher forcing, fairly fast and robust learning is achieved. 
Figure 4 shows recorded error sequences under training of the network with the tar- 
get oscillator (6), for five different sessions of 1,500 learning iterations each starting 
from the above initial conditions. The learning iterations span 60 msec each, for a 
total of 100 sec per session. The teacher forcing amplitude A is set initially to 3 V, 
and thereafter decays logarithmically over one order of magnitude towards the end 
of the sessions. Fixed values of the learning rate and the perturbation amplitude 
are used throughout the sessions, with/ - 25.6 V - and a = 12.5 mV. All five ses- 
sions show a rapid initial decrease in the error under stimulus of the strong teacher 
forcing, and thereafter undergo a region of persistent fiat error slowly tapering off 
towards convergence as the teacher forcing is gradually released. Notice that this 
fiat region does not imply slow learning; instead the learning constantly removes 
error as additional error is adiabatically injected by the relaxation of the teacher 
forcing. 
864 Cauwenberghs 
I ! I I I 
2.5 ( 12.5 mV 
2.0 
1.5 
0.5 
0.0 
0 20 40 60 80 100 
Time (set) 
Figure 4 Recorded evolution of the error during learning, 
for five different sessions on the network. 
Near convergence, the bias in the network error due to the residual teacher forcing 
becomes negligible. Figure 5 shows the network outputs and target signals at con- 
vergence, with the learning halted and the parameter refresh activated, illustrating 
the minor effect of the residual teacher forcing signal on the network dynamics. 
The oscillogram of Figure 5 (a) is obtained under a weak teacher forcing signal, 
and that of Figure 5 (b) is obtained with the same network parameters but with 
the teacher forcing signal disabled. In both cases the oscilloscope is triggered on 
the network output signals. Obviously, in absence of teacher forcing the network 
does no longer run synchronously with the target signal. However, the discrepancy 
in frequency, amplitude and shape between either of the free-running and forced 
oscillatory output waveforms and the target signal waveforms is evidently small. 
(a) (b) 
Figure 15 Oscillograms of the network outputs and target signals after learning, 
(a) under weak teacher forcing, and (b) with teacher forcing disabled. 
Top traces: z(t) and z'(t). Bottom traces: z2(t) and z2'(t). 
A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics 865 
5 Conclusion 
We implemented a small-size learning recurrent neural network in an analog VLSI 
chip, and verified its learning performance in a continuous-time setting with a simple 
dynamic test (learning of a quadrature-phase oscillator). By virtue of its scalable 
architecture, with constant requirements on interconnectivity and limited global 
communication, the network structure with embedded learning functions can be 
freely expanded in a two-dimensional arrangement to accommodate applications of 
recurrent dynamical networks requiring larger dimensionality. A present limitation 
of the implemented learning model is the requirement of periodicity on the input 
and target signals during the learning process, which is needed to allow a repetitive 
and consistent evaluation of the network error for the parameter updates. 
Acknowledgments 
Fabrication of the CMOS chip was provided through the DARPA/NSF MOSIS service. 
Financial support by the NIPS Foundation largely covered the expenses of attending the 
conference. 
References 
[1] B.A. Pearlmutter, "Learning State Space Trajectories in Recurrent Neural Networks," 
Neural Computation, vol. I (2), pp 263-269, 1989. 
[2] R.J. Williams and D. Zipset, "A Learning Algorithm for Continually Running Fully 
Recurrent Neural Networks," Neural Computation, vol. I (2), pp 270-280, 1989. 
[3] N.B. Toomarian, and J. Barhen, "Learning a Trajectory using Adjoint Functions and 
Teacher Forcing," Neural Networks, vol. 5 (3), pp 473-484, 1992. 
[4] G. Cauwenberghs, "A Fast Stochastic Error-Descent Algorithm for Supervised Learning 
and Optimization," in Advances in Neural Information Processing Systems, San Mateo, 
CA: Morgan Kaufman, vol. 5, pp 244-251, 1993. 
[5] H.J. Kushner, and D.S. Clark, "Stochastic Approximation Methods for Constrained 
and Unconstrained Systems," New York, NY: Springer-Verlag, 1978. 
[6] M.A. Styblinski, and T.-S. Tang, "Experiments in Nonconvex Optimization: Stochastic 
Approximation with Function Smoothing and Simulated Annealing," Neural Networks, 
vol. 3 (4), pp 467-483, 1990. 
[7] J.C. Spall, "Multivariate Stochastic Approximation Using a Simultaneous Perturbation 
Gradient Approximation," IEEE Trans. Automatic Control, vol. 37 (3), pp 332-341, 1992. 
[8] J. Alspector, R. Meir, B. Yuhas, and A. Jayakumar, "A Parallel Gradient Descent 
Method for Learning in Analog VLSI Neural Networks," in Advances in Neural Information 
Processing Systems, San Mateo, CA: Morgan Kaufman, vol. 5, pp 836-844, 1993. 
[9] B. Flower and M. Jabri, "Summed Weight Neuron Perturbation: An O(n) Improve- 
ment over Weight Perturbation," in Advances in Neural Information Processing Systems, 
San Mateo, CA: Morgan Kaufman, vol. 5, pp 212-219, 1993. 
[10] D. Kirk, D. Kerns, K. Fleischer, and A. Bart, "Analog VLSI Implementation of 
Gradient Descent," in Advances in Neural Information Processing Systems, San Mateo, 
CA: Morgan Kaufman, vol. 5, pp 789-796, 1993. 
[11] J.W. Fattaruso, S. Kiriaki, G. Warwar, and M. de Wit, "Self-Calibration Techniques 
for a Second-Order Multibit Sigma-Delta Modulator," in ISSCC Technical Digest, IEEE 
Press, vol. 36, pp 228-229, 1993. 
[12] S.W. Golomb, "Shift Register Sequences," San Francisco, CA: Holden-Day, 1967. 
[13] J. Alspector, J.W. Gannett, S. Haber, M.B. Parker, and R. Chu, "A VLSI-Efficient 
Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to 
Stochastic Neural Networks," IEEE T. Circuits and Systems, 38 (1), pp 109-123, 1991. 
[14] G. Cauwenberghs, and A. Yariv, "Method and Apparatus for Long-Term Multi-Valued 
Storage in Dynamic Analog Memory," U.S. Patent pending, filed 1993. 
