Gradient and Hamiltonian Dynamics 
Applied to Learning in Neural Networks 
James W. Howse 
Chaouki T. Abdallah 
Gregory L. Heileman 
Department of Electrical and Computer Engineering 
University of New Mexico 
Albuquerque, NM 87131 
Abstract 
The process of machine learning can be considered in two stages: model 
selection and parameter estimation. In this paper a technique is presented 
for constructing dynamical systems with desired qualitative properties. The 
approach is based on the fact that an n-dimensional nonlinear dynamical
system can be decomposed into one gradient and (n - 1) Hamiltonian systems.
Thus, the model selection stage consists of choosing the gradient and
Hamiltonian portions appropriately so that a certain behavior is obtainable. 
To estimate the parameters, a stably convergent learning rule is presented. 
This algorithm has been proven to converge to the desired system trajectory 
for all initial conditions and system inputs. This technique can be used to 
design neural network models which are guaranteed to solve the trajectory 
learning problem. 
1 Introduction
A fundamental problem in mathematical systems theory is the identification of dy- 
namical systems. System identification is a dynamic analogue of the functional
approximation problem. A set of input-output pairs (u(t), y(t)) is given over some time
interval t ∈ [T_i, T_f]. The problem is to find a model which, for the given input sequence,
returns an approximation of the given output sequence. Broadly speaking, solving an 
identification problem involves two steps. The first is choosing a class of identifica- 
tion models which are capable of emulating the behavior of the actual system. The 
second is selecting a method to determine which member of this class of models best 
emulates the actual system. In this paper we present a class of nonlinear models and 
a learning algorithm for these models which are guaranteed to learn the trajectories 
of an example system. Algorithms to learn given trajectories of a continuous time 
system have been proposed in [6], [8], and [7] to name only a few. To our knowledge, 
no one has ever proven that the error between the learned and desired trajectories 
vanishes for any of these algorithms. In our trajectory learning system this error is 
guaranteed to vanish. Our models extend the work in [1] by showing that Cohen's 
systems are one instance of the class of models generated by decomposing the dynam- 
ics into a component normal to some surface and a set of components tangent to the 
same surface. Conceptually this formalism can be used to design dynamical systems 
with a variety of desired qualitative properties. Furthermore, we propose a provably 
convergent learning algorithm which allows the parameters of Cohen's models to be 
learned from examples rather than being programmed in advance. The algorithm is 
convergent in the sense that the error between the model trajectories and the de- 
sired trajectories is guaranteed to vanish. This learning procedure is related to one 
discussed in [5] for use in linear system identification. 
2 Constructing the Model 
First some terminology will be defined. For a system of n first order ordinary
differential equations, the phase space of the system is the n-dimensional space of all state
components. A solution trajectory is a curve in phase space described by the differ- 
ential equations for one specific starting point. At every point on a trajectory there 
exists a tangent vector. The space of all such tangent vectors for all possible solution 
trajectories constitutes the vector field for this system of differential equations. 
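These definitions can be illustrated numerically: integrating a two-state system with a forward-Euler step produces a solution trajectory, and the finite-difference tangent of that trajectory agrees with the vector field evaluated along it. The system f below is an arbitrary choice for illustration, not one taken from this paper.

```python
import numpy as np

def f(x):
    # An arbitrary two-state vector field, chosen only for illustration.
    return np.array([-x[0] + x[1], -x[1] - x[0]**3])

dt, steps = 1e-3, 2000
traj = np.empty((steps + 1, 2))
traj[0] = [1.0, -0.5]                 # one specific starting point
for k in range(steps):                # forward-Euler integration
    traj[k + 1] = traj[k] + dt * f(traj[k])

# By construction the finite-difference tangent of the computed trajectory
# matches the vector field evaluated along it (up to rounding error).
tangents = (traj[1:] - traj[:-1]) / dt
field = np.array([f(x) for x in traj[:-1]])
err = np.max(np.abs(tangents - field))
print(err < 1e-9)   # True
```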
The trajectory learning models in this paper are systems of first order ordinary dif- 
ferential equations. The form of these equations will be obtained by considering the 
system dynamics as motion relative to some surface. At each point in the state space 
an arbitrary system trajectory will be decomposed into a component normal to this 
surface and a set of components tangent to this surface. This approach was suggested 
to us by the results in [4], where it is shown that an arbitrary n-dimensional vector
field can be decomposed locally into the sum of one gradient vector field and (n - 1)
Hamiltonian vector fields. The concept of a potential function will be used to
define these surfaces. A potential function V(x) is any scalar valued function of the
system states x = [x_1, x_2, ..., x_n]^T which is at least twice continuously differentiable
(i.e. V(x) ∈ C^r, r ≥ 2). The operation [·]^T denotes the transpose of the vector. If
there are n components in the system state, the function V(x), when plotted with
respect to all of the state components, defines a surface in an (n + 1)-dimensional space.
There are two curves passing through every point on this potential surface which are
of interest in this discussion; they are illustrated in Figure 1(a). The dashed curve is
Figure 1: (a) The potential function V(x) = x_1(x_1 - 1)^2 + x_2^2 plotted versus its two
dependent variables x_1 and x_2. The dashed curve is called a level surface and is
given by V(x) = 0.5. The solid curve follows the path of steepest descent through x_0.
(b) The partitioning of a 3-dimensional vector field at the point x_0 into a 1-
dimensional portion which is normal to the surface V(x) = K and a 2-dimensional
portion which is tangent to V(x) = K. The vector -∇V(x)|_{x_0} is the normal
vector to the surface V(x) = K at the point x_0. The plane (x - x_0)^T ∇V(x)|_{x_0} = 0
contains all of the vectors which are tangent to V(x) = K at x_0. Two linearly
independent vectors are needed to form a basis for this tangent space; the pair
Q_2(x)∇V(x)|_{x_0} and Q_3(x)∇V(x)|_{x_0} that are shown are just one possibility.
referred to as a level surface; it is a surface along which V(x) = K for some constant
K. Note that in general this level surface is an (n - 1)-dimensional object. The solid curve
moves downhill along V(x) following the path of steepest descent through the point
x_0. The vector which is tangent to this curve at x_0 is normal to the level surface
at x_0. The system dynamics will be designed as motion relative to the level surfaces
of V(x). The results in [4] require n different local potential functions to achieve
arbitrary dynamics. However, the results in [1] suggest that a considerable number 
of dynamical systems can be achieved using only a single global potential function. 
A system which is capable of traversing any downhill path along a given potential
surface V(x) can be constructed by decomposing each element of the vector field
into a vector normal to the level surface of V(x) which passes through each point
and a set of vectors tangent to the level surface of V(x) which passes through the
same point. So the potential function V(x) is used to partition the n-dimensional
phase space into two subspaces. The first contains a vector field normal to some
level surface V(x) = K for K ∈ ℝ, while the second subspace holds a vector field
tangent to V(x) = K. The subspace containing all possible normal vectors to the
(n - 1)-dimensional level surface at a given point has dimension one. This is equivalent
to the statement that every point on a smooth surface has a unique normal vector.
Similarly, the subspace containing all possible tangent vectors to the level surface at
a given point has dimension (n - 1). An example of this partition in the case of a
3-dimensional system is shown in Figure 1(b). Since the space of all tangent vectors
at each point on a level surface is (n - 1)-dimensional, (n - 1) linearly independent
vectors are required to form a basis for this space.
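The orthogonality underlying this partition is easy to verify numerically: for any skew-symmetric matrix Q, the vector Q∇V(x) is orthogonal to ∇V(x), so it lies in the tangent space of the level surface through x. A minimal sketch for a 3-dimensional example follows; the potential and the matrices Q_2, Q_3 are illustrative choices, not ones from the paper.

```python
import numpy as np

def grad_V(x):
    # Gradient of the illustrative potential V(x) = x1^2 + 2*x2^2 + x3^2.
    return np.array([2 * x[0], 4 * x[1], 2 * x[2]])

x0 = np.array([1.0, -0.5, 2.0])
g = grad_V(x0)   # normal direction to the level surface through x0

# Two skew-symmetric matrices: for skew-symmetric Q, g . (Q g) = 0 always,
# so Q g lies in the tangent space of the level surface at x0.
Q2 = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
Q3 = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])

t2, t3 = Q2 @ g, Q3 @ g
print(np.dot(t2, g), np.dot(t3, g))   # 0.0 0.0: both tangent

# Together with the normal direction g, the two tangent vectors span R^3
# at this point, i.e. the 1 + (n - 1) split is a full decomposition here.
rank = np.linalg.matrix_rank(np.column_stack([g, t2, t3]))
print(rank)   # 3
```

At points where the tangent vectors Q_i g become linearly dependent, the decomposition loses directions, which is exactly the independence caveat raised for Equation (3) below.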
Mathematically, there is a straightforward way to construct dynamical systems which 
either move downhill along V() or remain at a constant height on V(). In this 
paper, dynamical systems which always move downhill along some potential surface 
are called gradient-like systems. These systems are defined by differential equations 
of the form

    \dot{x} = -P(x)\,\nabla V(x)    (1)

where P(x) is a matrix function which is symmetric (i.e. P^T = P) and positive
definite at every point x, and where ∇V(x) ≡ [∂V/∂x_1, ∂V/∂x_2, ..., ∂V/∂x_n]^T. These systems
are similar to the gradient flows discussed in [2]. The trajectories of the system 
formed by Equation (1) always move downhill along the potential surface defined by 
V(x). This can be shown by taking the time derivative of V(x), which is V̇(x) =
-[∇V(x)]^T P(x) [∇V(x)] ≤ 0. Because P(x) is positive definite, V̇(x) can only be
zero where ∇V(x) = 0; elsewhere V̇(x) is negative. This means that the trajectories
of Equation (1) always move toward a level surface of V(x) formed by "slicing" V(x)
at a lower height, as pointed out in [2]. It is also easy to design systems which remain
at a constant height on V(x). Such systems will be denoted Hamiltonian-like systems.
They are specified by the equation 
    \dot{x} = Q(x)\,\nabla V(x)    (2)

where Q(x) is a matrix function which is skew-symmetric (i.e. Q^T = -Q) at every
point x. These systems are similar to the Hamiltonian systems defined in [2]. The
elements of the vector field defined by Equation (2) are always tangent to some level 
surface of V(x). Hence the trajectories of this system remain at a constant height on
the potential surface given by V(x). Again this is indicated by the time derivative
of V(x), which in this case is V̇(x) = [∇V(x)]^T Q(x) [∇V(x)] = 0. This indicates
that the trajectories of Equation (2) always remain on the level surface on which the
system starts. So a model which can follow an arbitrary downhill path along the
potential surface V(x) can be designed by combining the dynamics of Equations (1)
and (2). The dynamics in the subspace normal to the level surfaces of V(x) can be
defined using one equation of the form in Equation (1). Similarly, the dynamics in the
subspace tangent to the level surfaces of V(x) can be defined using (n - 1) equations
of the form in Equation (2). Hence the total dynamics for the model are

    \dot{x} = \Big[-P(x) + \sum_{i=2}^{n} Q_i(x)\Big]\,\nabla V(x)    (3)
For this model the number and location of the equilibria are determined by the function
V(x), while the manner in which the equilibria are approached is determined by the
matrices P(x) and Q_i(x).
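A small numerical sketch can confirm the qualitative claims behind Equations (1)-(3): the gradient-like term makes V̇ negative, the Hamiltonian-like term makes V̇ zero, and the combined model descends to a critical point of V. The potential, P, and Q below are illustrative choices, not the ones used later in this paper.

```python
import numpy as np

def V(x):
    # Illustrative potential with minima at (+1, 0), (-1, 0), saddle at (0, 0).
    return (x[0]**2 - 1)**2 + x[1]**2

def grad_V(x):
    return np.array([4 * x[0] * (x[0]**2 - 1), 2 * x[1]])

P = np.array([[2.0, 0.5], [0.5, 1.0]])    # symmetric positive definite
Q = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric

# Pointwise checks of the two time-derivative identities: at any x,
# the Hamiltonian-like term gives Vdot = g^T Q g = 0, and the
# gradient-like term gives Vdot = -g^T P g < 0 whenever g != 0.
rng = np.random.default_rng(0)
for _ in range(100):
    g = grad_V(rng.normal(size=2))
    assert abs(g @ (Q @ g)) < 1e-12
    assert g @ (P @ g) > 0

# The combined model (3), integrated with forward Euler, descends to a
# critical point of V.
x0 = np.array([0.2, 1.5])
x = x0.copy()
dt = 1e-3
for _ in range(20000):
    x = x + dt * ((-P + Q) @ grad_V(x))

grad_norm = np.linalg.norm(grad_V(x))
print(grad_norm < 1e-3, V(x) < V(x0))
```

The skew term only redirects the descent; it never changes the set of points the system can stop at, which is why the equilibria are controlled entirely by V.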
If the potential function V(x) is bounded below (i.e. V(x) > β_1 ∀ x ∈ ℝ^n, where
β_1 is a constant), eventually increasing (i.e. lim_{||x||→∞} V(x) → ∞), and has only
a finite number of isolated local maxima and minima (i.e. in some neighborhood
of every point where ∇V(x) = 0 there are no other points where the gradient
vanishes), then the system in Equation (3) satisfies the conditions of Theorem 10
in [1]. Therefore the system will converge to one of the points where ∇V(x) = 0,
called the critical points of V(x), for all initial conditions. Note that this system
is capable of all downhill trajectories along the potential surface only if the (n - 1)
vectors Q_i(x)∇V(x), i = 2, ..., n, are linearly independent at every point x. It
is shown in [1] that the potential function

    V(x) = \lambda\Big[\int_{\gamma_1}^{x_1} L_1(\eta)\,d\eta
           + \sum_{i=2}^{n}\Big(\tfrac{1}{2}\big(x_i - L_i(x_1)\big)^2
           + \int_{\gamma_i}^{x_1} L_i(\eta)\,L_i'(\eta)\,d\eta\Big)\Big]    (4)

satisfies these three criteria. In this equation L_i(x), i = 1, ..., n, are interpolation
polynomials, λ is a real positive constant, γ_i, i = 1, ..., n, are real constants chosen
so that the integrals are positive valued, and L_i'(x_1) ≡ dL_i/dx_1.
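Under a natural reading of Equation (4), ∂V/∂x_2 = λ(x_2 - L_2(x_1)), so every critical point must lie on the interpolation curve x_2 = L_2(x_1). The sketch below checks this for n = 2 with a linear L_2 interpolating the equilibria (1, 3) and (3, 5) used in Section 4; the polynomial L_1 is a hypothetical choice made to place critical points at convenient abscissae, not the authors' exact construction.

```python
import numpy as np

lam = 1.0
L2  = lambda x1: x1 + 2.0          # linear interpolation through (1, 3), (3, 5)
dL2 = lambda x1: 1.0
# Hypothetical L1: it satisfies L1(x1) + L2(x1)*dL2(x1) = 0 exactly at
# x1 = 1, 7/3, 3, which places the critical points at those abscissae.
L1  = lambda x1: -(x1 + 2.0) + (x1 - 1.0) * (x1 - 3.0) * (x1 - 7.0 / 3.0)

def grad_V(x):
    # Partial derivatives of the reconstructed potential (4) for n = 2.
    dV1 = lam * (L1(x[0]) - (x[1] - L2(x[0])) * dL2(x[0])
                 + L2(x[0]) * dL2(x[0]))
    dV2 = lam * (x[1] - L2(x[0]))
    return np.array([dV1, dV2])

# Every critical point lies on the curve x2 = L2(x1), and the three
# prescribed points are indeed equilibria of the gradient.
equilibria = [np.array([x1, L2(x1)]) for x1 in (1.0, 7.0 / 3.0, 3.0)]
ok = all(np.allclose(grad_V(p), 0.0) for p in equilibria)
print(ok)   # True
```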
3 The Learning Rule 
In Equation (3) the number and location of equilibria can be controlled using the
potential function V(x), while the manner in which the equilibria are approached can
be controlled with the matrices P(x) and Q_i(x). If it is assumed that the locations
of the equilibria are known, then a potential function which has local minima and
maxima at these points can be constructed using Equation (4). The problem of
trajectory learning is thereby reduced to the problem of parameterizing the matrices
P(x) and Q_i(x) and finding the parameter values which cause this model to best
emulate the actual system. If the elements of P(x) and Q_i(x) are correctly chosen,
then a learning rule can be designed which makes the model dynamics converge to 
that of the actual system. Assume that the dynamics given by Equation (3) are a 
parameterized model of the actual dynamics. Using this model and samples of the 
actual system states, an estimator for states of the actual system can be designed. The 
behavior of the model is altered by changing its parameters, so a parameter estimator 
must also be constructed. The following theorem provides a form for both the state 
and parameter estimators which guarantees convergence to a set of parameters for 
which the error between the estimated and target trajectories vanishes. 
Theorem 3.1. Given the model system

    \dot{x} = \sum_{i=1}^{k} A_i f_i(x) + B\,g(u)    (5)

where A_i ∈ ℝ^{n×n} and B ∈ ℝ^{n×m} are unknown, and f_i(·) and g(·) are known smooth
functions such that the system has bounded solutions for bounded inputs u(t), choose
a state estimator of the form

    \dot{\hat{x}} = \Lambda(\hat{x} - x) + \sum_{i=1}^{k} \hat{A}_i f_i(x) + \hat{B}\,g(u)    (6)

where Λ is an (n × n) matrix of real constants whose eigenvalues must all be in the
left half plane, and \hat{A}_i and \hat{B} are the estimates of the actual parameters. Choose
parameter estimators of the form

    \dot{\hat{A}}_i = -\Gamma_P\,(\hat{x} - x)\,[f_i(x)]^T, \quad i = 1, ..., k,
    \qquad \dot{\hat{B}} = -\Gamma_P\,(\hat{x} - x)\,[g(u)]^T    (7)

where Γ_P is an (n × n) matrix of real constants which is symmetric and positive
definite, and (x̂ - x)[·]^T denotes an outer product. For these choices of state and
parameter estimators lim_{t→∞} (x̂(t) - x(t)) = 0 for all initial conditions. Furthermore,
this remains true if any of the elements of \hat{A}_i or \hat{B} are set to 0, or if any of these
matrices are restricted to being symmetric or skew-symmetric.
The proof of this theorem appears in [3]. Note that convergence of the parameter 
estimates to the actual parameter values is not guaranteed by this theorem. The 
model dynamics in Equation (3) can be cast in the form of Equation (5) by choosing
each element of P(x) and Q_i(x) to have the form

    P_{rs}(x) = \sum_{j=1}^{n}\sum_{k=0}^{l-1} \theta_{rsjk}\,\phi_k(x_j), \qquad
    Q_{rs}(x) = \sum_{j=1}^{n}\sum_{k=0}^{l-1} \Delta_{rsjk}\,\phi_k(x_j)    (8)

where (\phi_0(x_j), \phi_1(x_j), ..., \phi_{l-1}(x_j)) are a set of
orthogonal polynomials which depend on the state x_j. There is a set of such
polynomials for every state x_j, j = 1, 2, ..., n. The constants θ_{rsjk} and Δ_{rsjk} determine
the contribution of the kth polynomial which depends on the jth state to the value
of P_{rs} and Q_{rs} respectively. In this case the dynamics in Equation (3) become

    \dot{x} = \sum_{j=1}^{n}\sum_{k=0}^{l-1}\Big[-\Theta_{jk}
              + \sum_{i=2}^{n}\Delta_{ijk}\Big]\,\phi_k(x_j)\,\nabla V(x) + B\,g(u)    (9)

where Θ_{jk} is the (n × n) matrix of all values θ_{rsjk} which have the same value of j and
k. Likewise Δ_{ijk} is the (n × n) matrix of all values Δ_{rsjk}, having the same value of
j and k, which are associated with the ith matrix Q_i(x). This system has m inputs,
which may explicitly depend on time, that are represented by the m-element vector
function u(t). The m-element vector function g(·) is a smooth, possibly nonlinear,
transformation of the input function. The matrix B is an (n × m) parameter matrix
which determines how much of input s (s = 1, ..., m) affects state r (r = 1, ..., n).
Appropriate state and parameter estimators can be designed based on Equations (6)
and (7) respectively.
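The estimator pair (6)-(7) can be sketched for the simplest case k = 1, f_1(x) = x, g(u) = u, i.e. a linear true system. The particular matrices, gains, and input below are illustrative choices; as the theorem warns, only the state error is guaranteed to vanish, not the parameter error.

```python
import numpy as np

# True linear system xdot = A x + B u, with A, B treated as unknown.
A = np.array([[-1.0, 0.5], [-0.5, -1.0]])
B = np.array([[1.0], [0.5]])

Lam = -5.0 * np.eye(2)   # error feedback; eigenvalues in the left half plane
Gam = np.eye(2)          # symmetric positive definite adaptation gain

dt, steps = 1e-3, 40000
x  = np.array([[1.0], [-1.0]])                  # true state
xh = np.zeros((2, 1))                           # state estimate
Ah, Bh = np.zeros((2, 2)), np.zeros((2, 1))     # parameter estimates

e0 = np.linalg.norm(xh - x)
for k in range(steps):
    u = np.array([[np.sin(0.001 * k)]])         # scalar input signal
    e = xh - x
    # State estimator, Eq. (6): copy of the model plus stabilizing feedback.
    xh = xh + dt * (Lam @ e + Ah @ x + Bh @ u)
    # Parameter estimators, Eq. (7): outer products of the state error
    # with the regressors f(x) = x and g(u) = u.
    Ah = Ah - dt * (Gam @ e @ x.T)
    Bh = Bh - dt * (Gam @ e @ u.T)
    x = x + dt * (A @ x + B @ u)

final_err = np.linalg.norm(xh - x)
print(final_err < 0.1 * e0)   # True: the state error shrinks toward zero
```

Note that Ah and Bh need not approach A and B unless the input is sufficiently rich; a single sinusoid does not excite enough directions, which mirrors the training/testing comparison in Section 4.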
4 Simulation Results 
Now an example is presented in which the parameters of the model in Equation (9)
are trained, using the learning rule in Equations (6) and (7), on one input signal and
then are tested on a different input signal. The actual system has three equilibrium
points: two stable points located at (1, 3) and (3, 5), and a saddle point located at
(2 1/3, 4 1/3). In this example the dynamics of both the actual system and the
model are given by

    \dot{x} = \big[-P(x) + Q(x)\big]\,\nabla V(x) + b\,u(t)    (10)
where V(x) is defined in Equation (4) and u(t) is a time varying input; the entries of
P(x) are parameterized by p_1-p_6, those of Q(x) by p_7-p_9, and b by p_10. For the actual
system the parameter values were p_1 = p_4 = -4, p_2 = p_5 = -2, p_3 = p_6 = -1,
p_7 = 1, p_8 = 3, p_9 = 5, and p_10 = 1. In the model the 10 elements p_i are
treated as the unknown parameters which must be learned. Note that the first matrix
function is positive definite if the parameters p_1-p_6 are all negative valued. The
second matrix function is skew-symmetric for all values of p_7-p_9. The two input
signals used for training and testing were u_1 = 10000 (sin 1000t + sin 2000t) and
u_2 = 5000 sin 1000t. The phase space responses of the actual system to the inputs u_1
and u axe shown by the solid curves in Figures 3(b) and 3(a) respectively. Notice that 
both of these inputs produce a periodic attractor in the phase space of Equation (10). 
In order to evaluate the effectiveness of the learning algorithm the Euclidean distance 
between the actual and learned state and parameter values was computed and plotted 
versus time. The results axe shown in Figure 2. Figure 2(a) shows these statistics when 
IIAll, IIA'll } 
17.5 
2.5 
?.5 
{ IIA11, IIA'11 } 
L 5o ).oo ).5o 2oo 25o 3o0 t it. --- 
so oo so oo so 0t 
(a) (b) 
Figure 2: (a) The state and parameter errors for training using input signal u_1. The solid
curve is the Euclidean distance between the state estimates and the actual states
as a function of time. The dashed curve shows the distance between the estimated
and actual parameter values versus time.
(b) The state and parameter errors for training using input signal u_2.
training with input u_1, while Figure 2(b) shows the same statistics for input u_2. The
solid curves are the Euclidean distance between the learned and actual system states,
and the dashed curves are the distance between the learned and actual parameter
values. These statistics have two noteworthy features. First, the error between the 
learned and desired states quickly converges to very small values, regardless of how 
well the actual parameters are learned. This result was guaranteed by Theorem 3.1. 
Second, the final error between the learned and desired parameters is much lower when
the system is trained with input u_1. Intuitively this is because input u_1 excites more
frequency modes of the system than input u_2. Recall that in a nonlinear system the
frequency modes excited by a given input do not depend solely on the input because 
the system can generate frequencies not present in the input. The quality of the 
learned parameters can be qualitatively judged by comparing the phase plots using 
the learned and actual parameters for each input, as shown in Figure 3. In Figure 3(a)
the system was trained using input u_1 and tested with input u_2, while in Figure 3(b)
the situation was reversed. The solid curves are the system response using the actual
parameter values, and the dashed curves are the response for the learned parameters.
The Euclidean distance between the target and test trajectories in Figure 3(a) is in
the range (0, 0.64) with a mean distance of 0.21 and a standard deviation of 0.14. The
distance between the target and test trajectories in Figure 3(b) is in the range
(0, 4.53) with a mean distance of 0.98 and a standard deviation of 1.35. Qualitatively, 
both sets of learned parameters give an accurate response for non-training inputs. 
Figure 3: (a) A phase plot of the system response when trained with input u_1 and tested
with input u_2. The solid line is the response to the test input using the actual
parameters. The dotted line is the system response using the learned parameters.
(b) A phase plot of the system response when trained with input u_2 and tested
with input u_1.
Note that even when the error between the learned and actual parameters is large, 
the periodic attractor resulting from the learned parameters appears to have the same 
"shape" as that for the actual parameters. 
5 Conclusion 
We have presented a conceptual framework for designing dynamical systems with 
specific qualitative properties by decomposing the dynamics into a component normal 
to some surface and a set of components tangent to the same surface. We have 
presented a specific instance of this class of systems which converges to one of a finite 
number of equilibrium points. By parameterizing these systems, the manner in which 
these equilibrium points are approached can be fitted to an arbitrary data set. We
presented a learning algorithm to estimate these parameters which is guaranteed to
converge to a set of parameter values for which the error between the learned and 
desired trajectories vanishes. 
Acknowledgments 
This research was supported by a grant from Boeing Computer Services under Contract 
W-300445. The authors would like to thank Vangelis Coutsias, Tom Caudell, and Bill 
Horne for stimulating discussions and insightful suggestions. 
References 
[1] M.A. Cohen. The construction of arbitrary stable dynamics in nonlinear neural networks. 
Neural Networks, 5(1):83-103, 1992. 
[2] M.W. Hirsch and S. Smale. Differential equations, dynamical systems, and linear algebra, 
volume 60 of Pure and Applied Mathematics. Academic Press, Inc., San Diego, CA, 1974. 
[3] J.W. Howse, C.T. Abdallah, and G.L. Heileman. A gradient-Hamiltonian decomposition
for designing and learning dynamical systems. Submitted to Neural Computation, 1995.
[4] R.V. Mendes and J.T. Duarte. Decomposition of vector fields and mixed dynamics.
Journal of Mathematical Physics, 22(7):1420-1422, 1981.
[5] K.S. Narendra and A.M. Annaswamy. Stable adaptive systems. Prentice-Hall, Inc.,
Englewood Cliffs, NJ, 1989.
[6] B.A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural 
Computation, 1(2):263-269, 1989. 
[7] D. Saad. Training recurrent neural networks via trajectory modification. Complex
Systems, 6(2):213-236, 1992.
[8] M.-A. Sato. A real time learning algorithm for recurrent analog neural networks. Bio- 
logical Cybernetics, 62(2):237-241, 1990. 
