498 Barhen, Toomarian and Gulati 
A djoint 
Learning 
Operator Algorithms for Faster 
in Dynamical Neural Networks 
Jacob Barhen 
Nikzad Toomarian 
Sandeep Gulati 
Center for Space Microelectronics Technology 
Jet Propulsion Laboratory 
California Institute of Technology 
Pasadena, CA 91109 
ABSTRACT 
A methodology for faster supervised learning in dynamical nonlin- 
ear neural networks is presented. It exploits the concept of adjoint 
operators to enable computation of changes in the network's re- 
sponse due to perturbations in all system parameters, using the so- 
lution of a single set of appropriately constructed linear equations. 
The lower bound on speedup per learning iteration over conven- 
tional methods for calculating the neuromorphic energy gradient is 
O(N2), where N is the number of neurons in the network. 
I INTRODUCTION 
The biggest promise of artifcial neural networks as computational tools lies in the 
hope that they will enable fast processing and synthesis of complex information 
patterns. In particular, considerable efforts have recently been devoted to the for- 
mulation of efficent methodologies for learning (e.g., Rumelhart et al., 1986; Pineda, 
1988; Pearlmutter, 1989; Williams and Zipser, 1989; Barhen, Gulati and Zak, 1989). 
The development of learning algorithms is generally based upon the minimization 
of a neuromorphic energy function. The fundamental requirement of such an ap- 
proach is the computation of the gradient of this objective function with respect 
to the various parameters of the neural architecture, e.g., synaptic weights, neural 
Adjoint Operator Algorithms 499 
gains, etc. The paramount contribution to the often excessive cost of learning us- 
ing dynamical neural networks arises from the necessity to solve, at each learning 
iteration, one set of equations for each parameter of the neural system, since those 
parameters affect both directly and indirectly the network's energy. 
In this paper we show that the concept of adjoint operators, when applied to dynam- 
ical neural networks, not only yields a considerable algorithmic speedup, but also 
puts on a firm mathematical basis prior results for "recurrent" networks, the deriva- 
tions of which sometimes involved much heuristic reasoning. We have already used 
adjoint operators in some of our earlier work in the fields of energy-economy mod- 
eling (Alsmiller and Barhen, 1984) and nuclear reactor thermal hydraulics (Barhen 
et al., 1982; Toomarian et al., 1987) at the Oak Ridge National Laboratory, where 
the concept flourished during the past decade (Oblow, 1977; Cacuci et al., 1980). 
In the sequel we first motivate and construct, in the most elementary fashion, a 
computational framework based on adjoint operators. We then apply our results 
to the Cohen-Grossberg-Hopfield (CGH) additive model, enhanced with terminal 
attractor (Barhen, Gulati and Zak, 1989) capabilities. We conclude by presenting 
the results of a few typical simulations. 
2 ADJOINT OPERATORS 
Consider, for the sake of simplicity, that a problem of interest is represented by the 
following system of N coupled nonlinear equations 
(a,p) = 0 (2.1) 
where  denotes a nonlinear operator  . Let fi and i5 represent the N-vector of 
dependent state variables and the M-vector of system parameters, respectively. We 
will assume that generally M >> N and that elements of  are, in principle, inde- 
pendent. Furthermore, we will also assume that, for a specific choice of parameters, 
a unique solution of Eq. (2.1) exists. Hence,  is an implicit function of i5. A 
system "response", R, represents any result of the calculations that is of interest. 
Specifically 
R = R(,) (2.2) 
i.e., R is a known nonlinear function ofp and  and may be calculated from Eq. (2.2) 
when the solution fi in Eq. (2.1) has been obtained for a given p. The problem of 
interest is to compute the "sensitivities" of R, i.e., the derivatives of R with respect 
to parameters p, it = 1,. -., M. By definition 
dR OR OR O 
= + (2.3) 
dp, Opt, O Opt, 
lff differential operators appear in Eq. (2.1), then a corresponding set of boundary and/or 
initial conditions to specify the domain of qo must also be provided. In general an inhomogeneous 
"source" term can also be present. The learning model discussed in this paper focuses on the 
adiabatic approximation only. Nonadiabatic learning algorithms, wherein the response is defined 
as a functional, will be discussed in a forthcoming article. 
500 Barhen, Toomarian and Gulati 
Since the response R is known analytically, the computation of OR/Opt, and OR/O 
is straightforward. The quantity that needs to be determined is the vector O/Op,. 
Differentiating the state equations (2.1), we obtain a set of equations to be referred 
to as "forward" sensitivity equations 
0 0a 
= (2.4) 
c9 fi O p . c9 p , 
To simplify the notations, we are omitting the "transposed" sign and denoting the 
N by N forward sensitivity matrix 0/0 by A, the N-vector O/Opt, by  and 
the "source" N-vector -O/Opt, by g. Thus 
(2.5) 
Since the source term in Eq. (2.5) explicitly depends on /, computing dR/dpt, , 
requires solving the above system of N algebraic equations for each parameter pu. 
This difficulty is circumvented by introducing adjoint operators. Let A* denote the 
formal adjoint 2 of the operator A. The adjoint sensitivity equations can then be 
expressed as 
A* '* = '*. (2.6) 
By definition, for algebraic operators 
.q* (A"q) = .q* = (A* "q*) = us* (2.7) 
Since Eq. (2.3), can be rewritten as 
dR OR OR 
= + -- (2.8) 
dp, c9p, 
if we identify 
OR 
= 'S* = * (2.9) 
0 - - 
we observe that the source term for the adjoint equations is independent of the 
specific parameter p,. Hence, the solution of a single set of adjoint equations will 
provide all the information required to compute the gradient of R with respect to all 
parameters. To underscore that fact we shall denote ' as 0. Thus 
dR OR 
dp, = Opt, o Opt, (2.10) 
We will now apply this computational framework to a CGH network enhanced with 
terminal attractor dynamics. The model developed in the sequel differs from our 
2 Adjoint operatom can only be considered for densely defined linear operatom on Banach spaces 
(see e.g., Cacuci, 1980). For the neural application under consideration we will limit ourselves to 
real Hilbert spaces. Such spaces are self-dual. Furthermore, the domain of an adjoint operator is 
determined by selecting appropriate adjoint boundary conditions . The associated bilinear form 
evaluated on the domain boundary must thus be also generally included. 
Adjoint Operator Algorithms 501 
earlier formulations (Barhen, Gulati and Zak, 1989; Barhen, Zak and Gulati, 1989) 
in avoiding the use of constraints in the neuromorphic energy timetlon, thereby 
eliminating the need for differential equations to evolve the concomitant Lagrange 
multipliers. Also, the usual activation dynamics is transformed into a set of equiv- 
alent equations which exhibit more "congenial" numerical properties, such as "con- 
traction". 
3 APPLICATIONS TO NEURAL LEARNING 
We formalize a neural network as an adaptive dynamical system whose temporal 
evolution is governed by the following set of coupled nonlinear differential equations 
+ = > r,.. + (3.1) 
where z, represents the mean soma potential of the nth neuron and , denotes the 
synaptic coupling from the m-th to the n-th neuron. The weighting factor w, 
enforces topological considerations. The constant ,, characterizes the decay of neu- 
ron activity. The sigmoidM function g,(.) modulates the neural response, with gain 
given by 7; typically, g,(z) = tanh(7z). The "source" term I,, which includes 
dimensional considerations, encodes contribution in terms of attractor coordinates 
of the k-th training sample via the following expression 
{ [an] - [an - gv(zn) ] if n  Sx (3.2) 
I = 0 ifnSUSv 
The topographic input, output and hidden network partitions Sx, Sv and St are 
architectural requirements related to the encoding of mapping-type problems for 
which a number of possibilities exist (Barhen, Gulati and Zak, 1989; Barhen, Zak 
and Gulati, 1989). In previous articles (ibid; Zak, 1989) we have demonstrated that 
in general, for fi = (2i+ 1) - and i a strictly positive integer, such attractors have 
infinite local stability and provide opportunity for learning in real-time. Typically, 
 can be set to 1/3. Assuming an adiabatic framework, the fixed point equations 
at equilibrium, i.e., as i  0, yield 
g u,, + L (3.3) 
where u, = g,(z,) represents the neural response. The superscript  denotes 
quantities evaluated at steady state. Operational network dynamics is then given 
by 
- -- w. u,, I. (3.4) 
n  un -- gv n m n 
To proceed formally with the development of a supervised learning algorithm, we 
consider an approach based upon the minimization of a constrained "neuromorphic" 
energy function E given by the following expression 
1 
E(fi,p) =   [fin - an ]2 Vn e Sx VSy (3.5) 
k n 
502 Barhen, Toomarian and Gulati 
We relate adjoint theory to neural learning by identifying the neuromorphic energy 
function, E in Eq. (3.5), with the system response R. Also, let p denote the following 
system parameters: 
The proposed objective function enforces convergence of every neuron in $x and 
$- to attractor coordinates corresponding to the components in the input-output 
training patterns, thereby prompting the network to learn the embedded invari- 
ances. Lyapunov stability requires an energy-like function to be monotonically de- 
creasing in time. Since in our model the internal dynamical parameters of interest 
are the synaptic strengths T,m of the interconnection topology, the characteristic 
decay constants tn and the gain parameters 7n this implies that 
For each adaptive system parameter, p,, Lyapunov stability will be satisfied by the 
following choice of equations of motion 
Examples include 
dE 
p = --p (3.7) 
dp, 
dE dE dE 
n, - -Vr dTn, ;  -- r. d7 ; k - -r d 
where the time-scale parameters rT, r and rv > 0. Since E depends on p 
both directly and indirectly, previous methods required solution of a system of N 
equations for each parameter p to obtain dE/dp from d/dp. Our methodology 
(based on adjoint operators), yields all derivatives dE/dp,V p , by solving a 
single set of N linear equations. 
The nonlinear neural operator for each training pattern k, k - 1,-.. K, at equi- 
librium is given by 
[ 1 ] 
I + =o (a.8) 
o (,p) = g n  
where, without loss of generality we have set 7n to unity. So, in principle '5n - 
}fin [T, , , an,'"]. Using Eqs. (3.8), the forward sensitivity matrix can be 
computed and compactly expressed as 
Adjoint Operator Algorithms 503 
where 
[ka"l/3 , [a,  fi, ]--/3 if n  Sx 
1 if n  S U Sv 
Above, } represents the derivative of  with respect to }fi, i.e., if   
then 
}, = 1 - [}g,]' where }g' = g t- m 
(3.10) 
tanh, 
(3.11) 
Recall that the formal adjoint equation is given as A*O = g*; here 
1 
Am = ,. wren Tmn - r/m m. (3.12) 
m 
Using Eqs. (2.9) and (3.5), we can compute the formal adjoint source 
}s = OE { }fi, - }a, ifnESxOSr (3.13) 
- 0}fl. - 0 ifnESu 
The system of adjoint fixed-point equations can then be constructed using Eqs. 
(3.12) and (3.1a), to yield: 
5 1 }Ow,,T._},5.}$ = }s[ (3.14) 
m m 
Notice that the above coupled system, (3.14), is linear in }. rthermore, it 
h the same mathematical characteristics  the operational dynamics (3.4). Its 
components can be obtained  the equilibrium points, (i.e., 9i  0) of the adjoint 
neural dynamics 
m 
As an implementation example, let us conclude by deriving the learning equations 
for the synaptic strengths, T,. Recall that 
dE OE 
- + E  ''  # = (i,j) (3.16) 
We differentiate the steady state equations (3.8) with respect to Tij, to obtain the 
forward source term, 
tk Sn 
= - }. [ -- y w. . ,j }a + 0 1 
1 ,. 5i. w. uj (3.17) 
/n 
504 Barhen, Toomarian and Gulati 
Since by definition, 
obtained as 
OE/OT,, = 
nm '-- --7'7' 
0 , the explicit energy gradient contribution is 
[ wnm E n }0n }tim ] (3.18) 
It is straightforward to obtain learning equations for 7,, and n,, in a similar fashion. 
4 ADAPTIVE TIME-SCALES 
So far the adaptive learning rates, i.e., r r in Eq.(3.7), have not been specified. Now 
we will show that, by an appropriate selection of these parameters the convergence 
of the corresponding dynamical systems can be considerably improved. Without 
loss of generality, we shall assume rT = r : r v ---- r, and we shall seek r in the 
form (Barhen et al, 1989; Zak 1989) 
r I VEI (4.1) 
where VE denotes the vector with components 7TE, 7vE and VE. It is straight- 
forward to show that 
d 
I VEI = -x I VEI (4.2) 
dt 
as VE tends to zero, where X is an arbitrary positive constant. If we evaluate the 
relaxation time of the energy gradient, we find that 
Thus, for  <_ 0 the relaxation time is infinite, while for  > 0 it is finite. The 
dynamical system (3.19) suffers a qualitative change for  > 0  it loses uniqueness 
of solution. The equilibrium point I VE I = 0 becomes a singular solution being 
intersected by all the transients, and the Lipschitz condition is violated, as one can 
see from 
where I VE I tends to zero, while  is strictly positive. Such infinitely stable points 
are "terminal attractors". By analogy with our previous results we choose  = 2/3, 
which yields 
-- 
r - E [VrE 12m + Y-[V*EI2 + E [V 1 (4.5) 
The introduction of these adaptive time-scales dramatically improves the conver- 
gence of the corresponding learning dynamical systems. 
Adjoint Operator Algorithms 505 
5 SIMULATIONS 
The computational framework developed in the preceding section has been ap- 
plied to a number of problems that involve learning nonlinear mappings, including 
Exclusive-OR, the hyperbolic tangent and trignometric functions, e.g., sin. Some of 
these mappings (e.g., XOR) have been extensively benchmarked in the literature, 
and provide an adequate basis for illustrating the computational efficacy of our pro- 
posed formulation. Figures l(a)-l(d) demonstrate the temporal profile of various 
network elements during learning of the XOR function. A six neuron feedforward 
network was used, that included self-feedback on the output unit and bias. Fig. 
l(a) shows the LMS error during the training phase. The worst-case convergence of 
the output state neuron to the presented attractor is displayed in Fig. l(b). Notice 
the rapid convergence of the input state due to the terminal attractor effect. The 
behavior of the adaptive time-scale parameter r is depicted in Fig. l(c). Finally, 
Fig. l(d) shows the evolution of the energy gradient components. 
The test setup for signal processing applications, i.e., learning the sin function and 
the tanh sigmoidal nonlinearlity, included a 8-neuron fully connected network with 
no bias. In each case the network was trained using as little as 4 randomly sampled 
training points. Efficacy of recall was determined by presenting 100 random sam- 
ples. Fig. (2) and (3b) illustrate that we were able to approximate the sin and the 
hyperbolic tangent functions using 16 and 4 pairs respectively. Fig. 3(a) demon- 
strates the network performance when 4 pairs were used to learn the hyperbolic 
tangent. 
We would like to mention that since our learning methodology involves terminal 
attractors, extreme caution must be exercised when simulating the algorithms in 
a digital computing environment. Our discussion on sensitivity of results to the 
integration schemes (Barhen, Zak and Gulati, 1989) emphasizes that explicit meth- 
ods such as Euler or Runge-Kutta shall not be used, since the presence of terminal 
attractors induces extreme stiffness. Practically, this would require an integration 
time-step of infinitesimal size, resulting in numerical round-off errors of unaccept- 
able magnitude. Implicit integration techniques such as the Kaps-Rentrop scheme 
should therefore be used. 
6 CONCLUSIONS 
In this paper we have presented a theoretical framework for faster learning in dy- 
namical neural networks. Central to our approach is the concept of adjoi7,! operators 
which enables computation of network neuromorphic energy gradients with respect 
to all system parameters using the solution of a single set of linear equations. If 
CF and CA denote the computational costs associated with solving the forward and 
adjoint sensitivity equations (Eqs. 2.5 and 2.6), and if M denotes the number of 
parameters of interest in the network, the speedup achieved is 
M C- 
,F---. A -- 
CA 
506 Barhen, Toomarian and Gulati 
If we assume that Ci - CA and that M = N 2 + 2N + ..., we see that the lower 
bound on speedup per learning iteration is O(N2). Finally, particular care must be 
execrcised when integrating the dynamical systems of interest, due to the extreme 
stiffness introduced by the terminal attractor constructs. 
Acknowledgement s 
The research described in this paper was performed by the Center for Space Mi- 
croelectronics Technology, Jet Propulsion Laboratory, California Institute of Tech- 
nology, and was sponsored by agencies of the U.S. Department of Defense, and 
by the Office of Basic Energy Sciences of the U.S. Department of Energy, through 
interagency agreements with NASA. 
References 
R.G. Alsmiller, J. Barhen and J. Horwedel. (1984) "The Application of Adjoint 
Sensitivity Theory to a Liquid Fuels Supply Model", Energy, 9(3), 239-253. 
J. Barhen, D.G. Cacuci and J.J. Wagschal. (1982) "Uncertainty Analysis of Time- 
Dependent Nonlinear Systems", Nucl. $ci. Eng., 81, 23-44. 
J. Barhen, S. Gulati and M. Zak. (1989) "Neural Learning of Constrained Nonlinear 
Transformations", IEEE Computer, 22(6), 67-76. 
J. Barhen, M. Zak and S. Gulati. (1989)" Fast Neural Learning Algorithms Using 
Networks with Non-Lipschitzian Dynamics", in Proc. Neuro-Nimes '89, 55-68, EC2, 
Nanterre, France. 
D.G. Cacuci, C.F. Weber, E.M. Oblow and J.H. Marable. (1980) "Sensitivity The- 
ory for General Systems of Nonlinear Equations", Nucl. $ci. Eng., 75, 88-110. 
E.M. Oblow. (1977) "Sensitivity Theory for General Non-Linear Algebraic Equa- 
tions with Constraints", ORNL/TM-5815, Oak Ridge National Laboratory. 
B.A. Pearlmutter. (1989) "Learning State Space Trajectories in Recurrent Neural 
Networks", Neural Computation, 1(3), 263-269. 
F.J. Pineda. (1988) "Dynamics and Architecture in Neural Computation", Journal 
of Complexity, 4, 216-245. 
D.E. Rumelhart and J.L. Mclelland. (1986) Parallel and Dislribuled Procesing, MIT 
Press, Cambridge, MA. 
N. Toomarian, E. Wacholder and S. Kaizerman. (1987) "Sensitivity Analysis of 
Two-Phase Flow Problems", Nucl. $ci. Eng., 99(1), 53-81. 
R.J. Williams and D. Zipser. (1989) "A Learning Algorithm for Continually Run- 
ning Fully Recurrent Neural Networks", Neural Compulatio, 1(3), 270-280. 
M. Zak. (1989) "Terminal Attractors", Neural Networks, 2(4), 259-274. 
Adjoint Operator Algorithms 507 
(a) 
(b) 
4 
iterations 
150 
iterations 10 
iterations 
150 
(c) 
(d) 
i,'i,,re 1 (a)-(d). 
Learning the Exch, sive-OR function using a 6-neuron 
(including bias) feedforward dynamical network with 
self-feedback on the output unit. 
508 Barhen, Toomarian and Gulati 
o.ooo t 
-0.500- 
- 1.000 I I 
- 1.000 -0.500 0.000 0.500 1.000 
Figure 2. 
Learning the Sin function using a fully connected, 8-neuron 
network with no bias. The training set comprised of 
4 points that were randomly selected. 
3(a) 
1.000 - 
0,00- 
o ooo- 
-0.500 
- 1 ooo 
- 1 .ooo 
-0.500 0.000 0.500 1.000 
3(b) 
 000- 
0.500 
0.000  
-0.500 
- I.OOO 
- i.oco 
-0.500 0.000 0.500 1.000 
Figure 3. 
Learning the Hyperbolic Tangent function using a fully connected, 
8-neuron network with no bias. (a) using 4 randomly selected 
training samples; (b) using 16 rnndomly selected training samples. 
