Solvable 
Models of Artificial Neural 
Networks 
Sumio Watanabe 
Information and Communication R & D Center 
Ricoh Co., Ltd. 
3-2-3, Shin-Yokohama, Kohoku-ku, Yokohama, 222 Japan 
su nfio @ip e. rd c. ricoh. co.j p 
Abstract 
Solvable models of nonlinear lem'ning machines m'e proposed, and 
learning in artificial neural networks is studied based on the theory 
of ordinary differential equations. A learning algorithm is con- 
structed, by which the optimal parameter can be found without 
any recursive procedure. The solvable models enable us to analyze 
the reason why experimental results by the error backpropagation 
often contradict the statistical learning theory. 
1 INTRODUCTION 
Recent studies have shown that learning in artificial neural networks can be under- 
stood as statistical parametric estimation using the maxbnum likelihood method 
[1], and that their generalization abilities can be estimated using the statistical 
asymptotic theory [2]. However, as is often reported, even when the number of 
parameters is too large, the error for the testing sample is not so large as the theory 
predicts. The reason for such inconsistency has not yet been clarified, because it is 
difficult for the artificial neural network to find the global optimal parameter. 
On the other hand, in order to analyze the nonlinear phenomena, exactly solvable 
models have been playing a central role in mathematical physics, for example, the 
K-dV equation, the Toda lattice, and some statistical models that satisfy the Yang- 
423 
424 Watanabe 
Baxter equation[3]. 
This paper proposes the first solvable models in the nonlinear learning problem. We 
consider simple three-layered neural networks, and show that the parameters from 
the inputs to the hidden units determine the function space that is characterized 
by a differential equation. This fact means that optimization of the parameters 
is equivalent to optimization of the differential equation. Based on this property, 
we construct a learning algorithm by which the optimal parameters can be found 
without any recursive procedure. Experimental result using the proposed algorithm 
shows that the maximum likelihood estimator is not always obtained by the error 
backpropagation, and that the conventional statistical learning theory leaves much 
to be improved. 
2 The Basic Structure of Solvable Models 
Let us consider a function fc,w(X) given by a simple neural network with 1 input 
unit, H hidden units, and 1 output unit, 
H 
fc,w(x) -  ciTw(x), (1) 
i=1 
where both c -- {ci} and tv -- {wi} are parameters to be optimized, Tw,(x) is the 
output of the i-th hidden unit. 
;e assume that {i(x) -- w,(X) is a set of independent functions in CH-class. 
The following theorem is the start point of this paper. 
Theorem 1 
Iution is {T(x) and whose H-th order coefficient is I is uniquely given by 
(Owg)(x) = (-1)" W,+ (g, 1, 2, ..., H 
where I/VH is the H-th order Wronskian, 
The H-th order differential equation whose fundamental system of so- 
(2) 
WH(gz, ;z, ..., T/) = det 
! ''. 
(H-I) (H-l) -1) 
1 2 ' ' ' (H H 
For proof, see [4]. From this theoreln, we have the following corollary. 
Corollary I Let g(x) be a CH-class function. Then the following conditions for 
g(x) and w = {wi} are equivalent. 
here eists a set c: (c,} such that 
(2) = O. 
Solvable Models of Artificial Neural Networks 425 
Example 1 Let us consider a case, Ow,(X) = exp0v,x). 
H 
g(x) =  c, exp(w,x) 
i=1 
is equivalent to {D n + pD n- + p2D n-2 +...+pn}g(x) = 0, where D = (d/dx) 
and a set {pi} is determined from {wi} by the relation, 
H 
z" +z "-1 +pz "- +...+p, = (z-w,) (Vz  c). 
i:1 
Example 2 (RBF) A hnction g(x) is given by radial bis functions, 
H 
g(x) =  ci exp{-(x- wi)2}, 
i=1 
if and only if e-'{D n +pD n- +p2D n-2 +...+pn}(e'g(x)) = 0, where aset 
{pi} is deternfined from {wi} by the relation, 
H 
z" +pz "- +p"-+...+p,, = (z- 2,) (Vz c). 
i=1 
Figure 1 shows a learning algorithm for the solvable models. When a target function 
g(x) is given, let us consider the following function approximation problem. 
H 
(') =  m,() + (). 
i=1 
Learning in the neural network is optinfizing both {ci} and {wi} such that (x) is 
minimized for some error function. From the definition of D, eq. (3) is equivMent 
to (Dmg)(z) = (D)(x), where the term (Dwg)(z) is independent of ci. Therefore, 
if we adopt IID11 s the error function to be minimized, {wi} is optimized by 
minimizing IIDgll, independenfiy of {ci}, where I111  = f I()ld. After 
is mininfized, we have (Dm.g)(z)  0, where w* is the optimized parameter. From 
the corollary 1, there exists a set {c} such that g(x)   cm(z), where {c} 
can be found using the ordinary least square method. 
3 Solvable Models 
For a general function Tw, the differential operator Dw does not always have such 
a simple form as the above examples. In this section, we consider a linear operator 
L such that the differential equation of Lw has a simple form. 
Definition A neural network  cioi (x) is called solvable if there exist functions 
a, b, and a linear operator L such that 
(Lpw,)(x) = exp(a(tvi)x + bOvi)). 
The following theore,n shows that the optimal parmneter of the solvable models can 
be found using the same algo,'ithm as Figure 1. 
426 Watanabe 
) is givenj 
g(x) = Z c q)w (x) +(x) 
i=l i 
It is difficult to optimize wi 
independently of ci 
There exits C i s.t. 
H 
equiv. I 
g(x) ilCi%?(X) r equiv. 
I 
Least Square Method .-- C i  optimized 
" I 
g(x)   d i %, (x) ( END ) 
i=l i [ 
D w g(x)= D w (x) 
I 
: minimized  W: optimized 
g( ) -' 
D, x =0 
w 
Figure 1: Structure of Solvable Models 
Theorem 2 For a solvable model of a neural network, the .following conditions are 
equivalent when wi  wj (i  j). 
(1) There exist both {ci} and {wi} such that g(x) - iH= CiTw,(X). 
(2) There exists {Pl} such that {D t + pD t- + p2D t-2 +... + pt}(Lg)(x)= O. 
(3) For arbitrary Q > O, we define a sequence {yn} by y, = (Lg)(nQ). Then, there 
exists {qi} such that Yn + qlYn- + q2Yn-2 + '" + qHYn-H = O. 
Note that {IDwLgll 2 is a quadratic form for {pi}, xvhich is easily nfininfized by the 
least square method. 'n lY' + qlY,- +'" + qHY,-Y{ 2 is also a quadratic form for 
Theorem 3 
relations. 
z H q-pz H- 
The sequences {wi), {Pi), and {qi} in the theorem 2 have the following 
q- p2z H-2 q-  .. q- P, 
z t + qz It- + q2z ti-2 + "' + qIt = 
H 
I-I(z -- a(wi)) (Vz  C), 
i=1 
H 
1-I(z- exp(a(wi)Q)) (Vz  C). 
i=1 
For proofs of the above theorelns, see [5]. These theorems sho,v that, if {Pi} or 
Solvable Models of Artificial Neural Networks 427 
{qi) is optimized for a given function g(x), then {a(wi)} can be found as a set of 
solutions of the algebraic equation. 
Suppose that a target function g(x) is given. Then, from the above theorems, 
the globally optimal parameter w* = {w r } can be found by mininfizing 
independently of {ci}. Moreover, if the hnction a(w) is a one-to-one mapping, then 
there exists w* uniquely without permutation of {w }, if and only if the quadratic 
form II{D" + p DH- + "' + p-}gll 2 is not degenerate[4]. (Remk that, if it is 
degenerate, we can use another neural network with the smMler number of hidden 
units.) 
Example 3 A neurM network without scMing 
H 
= + (4) 
i=I 
is solvable when (a)(x)  0 (a.e.), where  denotes the Fourier transform. Define 
a linear operator L by (Lg)(x)= (g)(x)/(a)(x), then, it follows that 
H 
(Lf,,c)(X) =  ci exp(- bi x). (5) 
i=1 
By the Theorem 2, the optimM {hi} can be obtMned by using the differentiM or the 
sequential equation. 
Example 4 (MLP) 
is solvable. 
that 
A three-layered perceptron 
H 
fb,c(X) --  Ci tan -1 (x q- bi ), (6) 
ai 
i--1 
Define a linear operator L by (Lg)(x) = x. (g)(x), then, it folloxvs 
H 
(Lfb,c)(X) =  ciexp(-(ai + V/ bi)x + (ai, bi)) (x >_ 0). (7) 
i=1 
where o(ai, hi) is some function of ai and hi. Since the function tan -l(x) is mono- 
tone increasing and bounded, we can expect that a neural network given by eq. 
(6) has the same ability in the function approximation problem as the ordinary 
three-layered perceptton using the sigmoid function, tanh(x). 
Example 5 (Finite Wavelet Decomposition) A finite wavelet decomposition 
H 
f,c(X) = x + ), (8) 
ai 
i----1 
is solvable when a(x) - (d/dx)"(1/(1 +/2)) (n _> 1). Define a linear operator L by 
(g)(x) = x -n. (.T'g)(x) then, it follows that 
H 
(Lfb,c)(X) = y. ciexp(--(ai + vf bi)x + t(al,bi)) (x _> 0). (9) 
i--1 
428 Watanabe 
where (ai, bi) is some function of ai and b. Note that a(x) is an analyzing wavelet, 
and that this example shows a method how to optimize parameters for the finite 
wavelet decomposition. 
4 Learning Algorithm 
We construct a learning algorithm for solvable models, as shown in Figure 1. 
< <Learning Algorithm> > 
(0) A target function g(x) is given. 
(1) {Ym} is calculated by Ym - (Lg)(mQ). 
(2) {qi} is optimized by mininfizing Y']m lyre + qlYm- q- q2Ym-2 q-'" q- qHYm-H[ 2. 
(3) {Zi} is calculated by solving z H + qlz -1 + q2 zH-2 q-... q- qH -- O. 
(4) {ivi} is determined by a(wi)= (1/Q)log zi. 
(5) {ci} is optimized by minimizing Yj(g(xj)- Yicio,(xj)) 2. 
Strictly speaking, g(x) should be given for arbitrary x. However, in the practical 
application, if the number of training samples is sufficiently large so that (Lg)(x) 
can be ahnost precisely approximated, this algorithm is available. In the third 
procedure, to solve the algebraic equation, the DKA method is applied, for example. 
5 Experimental Results and Discussion 
5.1 The backpropagation and the proposed method 
For experiments, we used a probability density function and a regression function 
given by 
i exp(- (y - h(x))2 
Q(y]x)- 2 2 2 2 ) 
I tan_(x - 1/3 i tan_(x - 2/3 
= ' + 0.02 ) 
where rr = 0.2. One hundred input samples were set at the same interval in [0,1), 
and output samples were taken from the above conditional distribution. 
Table 1 shows the relation between the number of hidden units, training errors, 
and regression en'ors. In the table, the training error in the backpropagation shows 
the square error obtained after 100,000 training cycles. The training error in the 
proposed method shows the square error by the above algorithm. And the regres- 
sion error shows the square error between the true regression curve h(x) and the 
estimated curve. 
Figure 2 shows the true and estimated regression lines: (0) the true regression 
line and sample points, (1) the estimated regression line with 2 hidden units, by 
the BP (the error backpropagation) after 100,000 training cycles, (2) the estimated 
regression line with 12 hidden units, by the BP after 100,000 training cycles, (3) the 
Solvable Models of Artificial Neural Networks 429 
Table 1: Training errors and regression errors 
Hidden Backpropagation Proposed Method 
Units Training Regression Training Regression 
2 4.1652 0.7698 4.0889 0.3301 
4 3.3464 0.4152 3.8755 0.2653 
6 3.3343 0.4227 3.5368 0.3730 
8 3.3267 0.4189 3.2237 0.4297 
10 3.3284 0.4260 3.2547 0.4413 
12 3.3170 0.4312 3.1988 0.5810 
estimated line with 2 hidden units by the proposed method, and (4) the estimated 
line with 12 hidden units by the proposed method. 
5.2 Discussion 
When the number of hidden units was small, the training errors by the BP were 
smaller, but the regression errors were larger. 5Yhen the number of hidden units 
was taken to be larger, the training error by the BP didn't decrease so much as the 
proposed method, and the regression error didn't increase so much as the proposed 
method. 
By the error backpropagation, parameters didn't reach the maximum likelihood 
estimator, or they fell into local minima. However, when the number of hidden 
units was large, the neural network without the maximum likelihood estimator 
attained the better generalization. It seems that parameters in the local minima 
were closer to the true parameter than the nmxinum likelihood estimator. 
Theoretically, in the case of the layered neural networks, the maximum likelihood 
estimator may not be subject to asymptotically normal distribution because the 
Fisher information matrix may be degenerate. This can be one reason why the 
experimental results contradict the ordinary statistical theory. Adding such a prob- 
lem, the above experimental results show that the local minimum causes a strange 
problem. In order to construct the more precise learning theory for the backprop- 
agation neural network, and to choose the better parameter for generalization, we 
maybe need a method to analyze learning and inference with a local minimum. 
6 Conclusion 
We have proposed solvable models of artificial neural networks, and studied their 
learning structure. It has been shown by the experimental results that the proposed 
method is useful in analysis of the neural network generalization problem. 
430 Watanabe 
(0) True Curve and Samples. 
Sample error sum = 3.6874 
(1) BP after 100,000 cycles 
H = 2, Etrain -- 4.1652, E,.e9 = 0.7698 
(3) Proposed Method 
H = 2, Etrain = 4.0889, Ec9 = 0.3301 
H: the number of hidden units 
Etrain: the training error 
E,.eg: the regression error 
(2) BP after 100,000 cycles 
H = 12, Et,.ain -- 3.3170, Ereg -- 0.4312 
(4) Proposed Method 
H = 12, Et,.,i, = 3.1988, Eea = 0.5810 
Figure 2: Experimental Results 
Peferences 
[1] H. White. (1989) Learning in artificial neural networks: a statistical perspective. 
Neural Computation, 1,425-464. 
[2] N.Murata, S.Yoshizawa, and S.-I.Amari.(1992) Learning Curves, Model Selection 
and Complexity of Neural Networks. Advances in Neural Information Processing 
Systems 5, San Mateo, Morgan I<aufinan, pp.607-614. 
[3] R. J. Baxter. (1982) Exactly Solved Models in Statistical Mechanics, Academic 
Press. 
[4] E. A. Coddington. (1955) Theory of ordinary differential equations, the McGraw- 
Hill Book Company, New York. 
[5] S. Watanabe. (1993) Function approximation by neural networks and solution 
spaces of differential equations. Submitted to Neural Networks. 
