72 
ANALYSIS AND COMPARISON OF DIFFERENT LEARNING 
ALGORITHMS FOR PATTERN ASSOCIATION PROBLEMS 
J. Bernasconi 
Brown Boveri Research Center 
CH-5405 Baden, Switzerland 
ABSTRACT 
We investigate the behavior of different learning algorithms 
for networks of neuron-like units. As test cases we use simple pat- 
tern association problems, such as the XOR-problem and symmetry de- 
tection problems. The algorithms considered are either versions of 
the Boltzmann machine learning rule or based on the backpropagation 
of errors. We also propose and analyze a generalized delta rule for 
linear threshold units. We find that the performance of a given 
learning algorithm depends strongly on the type of units used. In 
particular, we observe that networks with 1 units quite generally 
exhibit a significantly better learning behavior than the correspon- 
ding 0,1 versions. We also demonstrate that an adaption of the 
weight-structure to the symmetries of the problem can lead to a 
drastic increase in learning speed. 
INTRODUCTION 
In the past few years, a number of learning procedures for 
neural network models with hidden units have been proposed 1'2 They 
can all be considered as strategies to minimize a suitably chosen 
error measure. Most of these strategies represent local optimization 
procedures (e.g. gradient descent) and therefore suffer from all the 
problems with local minima or cycles. The corresponding learning 
rates, moreover, are usually very slow. 
The performance of a given learning scheme may depend critical- 
ly on a number of parameters and implementation details. General 
analytical results concerning these dependences, however, are prac- 
tically non-existent. As a first step, we have therefore attempted 
to study empirically the influence of some factors that could have a 
significant effect on the learning behavior of neural network sys- 
tems. 
Our preliminary investigations are restricted to very small 
networks and to a few simple examples. Nevertheless, we have made 
some interesting observations which appear to be rather general and 
which can thus be expected to remain valid also for much larger and 
more complex systems. 
NEURAL NETWORK MODELS FOR PATTERN ASSOCIATION 
An artificial neural network consists of a set of interconnec- 
ted units (formal neurons). The state of the i-th unit is described 
by a variable So which can be discrete (e.g. So = 0,1 or So = 1) or 
1 
continuous (e.g . 0 < So < 1 or -1 < So < +1) , and each connection 
j+i carries a weighW  hich can be oitive, zero, or negative 
x3 
American Institute of Physics 1988 
The dynamics of the network is determined by a local update 
rule, 
Si(t+l ) = f(l Wij So(t)) (1) 
j  ' 
where f is a nonlinear activation function, specifically a threshold 
function in the case of discrete units and a sigmoid-type function, 
e.g. 
f(x) = 1/(l+e -x) (2) 
or 
f(x) = (1-e-X)/(l+e -x) , (3) 
respectively, in the case of continuous units. The individual units 
can be given different thresholds by introducing an extra unit which 
always has a value of 1. 
If the network is supposed to perform a pattern association 
task, it is convenient to divide its units into input units, output 
units, and hidden units. Learning then consists in adjusting the 
weights in such a way that, for a given input pattern, the network 
relaxes (under the prescribed dynamics) to a state in which the 
output units represent the desired output pattern. 
Neural networks learn from examples (input/output pairs) which 
are presented many times, and a typical learning procedure can be 
viewed as a strategy to minimize a suitably defined error function 
F. In most cases, this strategy is a (stochastic) gradient descent 
method: To a clamped input pattern, randomly chosen from the lear- 
ning examples, the network produces an output pattern {Oi}. This is 
compared with the desired output, say {T.}, and the error F({Oi} , 
1 . 
{Ti}) is calculated. Subsequently, each wetght is changed by an 
amount proportional to the respective gradient of F, 
F 
awij = -" ' (4) 
and the procedure is repeated for a new learning example until F is 
minimized to a satisfactory level. 
In our investigations, we shall consider two different types of 
learning schemes. The first is a deterministic version of the Boltz- 
mann machine learning rule 1 and has been proposed by Yann Le Cun . 
It applies to networks with symmetric weights, Wij = Wji , so that an 
energy 
E(S) = - W.. S.S. (5) 
- (i,j) x3 x 3 
can be associated with each state S = {S:}. If X refers to the net- 
work state when only the input units areXclampe and Y to the state 
when both the input and output units are clamped, the error function 
74 
is defined as 
F = E(_Y) - E(X) 
(6) 
and the gradients are simply given by 
F 
aw.. - Y' Y' -x.x. (7) 
 3  3 
The second scheme, called backpropagation or generalized delta 
rule 1'3, probably represents the most widely used learning algorithm. 
In its original form, it applies to networks with feedforward connec- 
tions only, and it uses gradient descent to minimize the mean squared 
error of the output signal, 
oi )2 
F =  Y (T i - (8) 
1 
For a weight W.o from an (input or hidden) unit j to an output 
unit i, we simply hv 
F 
- (T - O.)f'(Y Wik Sk)S (9) 
8Wij i  k J ' 
where f' is the derivative of the nonlinear activation function 
introduced in Eq. (1), and for weights which do not connect to an 
output unit, the gradients can successively be determined by apply- 
ing the chain rule of differentiation. 
In the case of discrete units, f is a threshold function, so 
that the backpropagation algorithm described above cannot be applied. 
We remark, however, that the perceptron learning rule 4, 
AWi3 = s(T. - O )Sj (10) 
 x i ' 
is nothing else than Eq. (9) with f' replaced by a constant s. 
Therefore, we propose that a generalized delta rule for linear 
threshold units can be obtained if f' is replaced by a constant g in 
all the backpropagation expressions for 3F/3W... This generalization 
of the perceptron rule is, of course, not nque. In layered net- 
works, e.g., the value of the constant which replaces f' need not be 
the same for the different layers. 
ANALYSIS OF LEARNING ALGORITHMS 
The proposed learning algorithms suffer from all the problems 
of gradient descent on a complicated landscape. If we use small 
weight changes, learning becomes prohibitively slow, while large 
weight changes inevitably lead to oscillations which prevent the 
algorithm from converging to a good solution. The error surface, 
moreover, may contain many local minima, so that gradient descent is 
not guaranteed to find a global minimum. 
There are several ways to improve a stochastic gradient descent 
procedure. The weight changes may, e.g., be accumulated over a 
number of learning examples before the weights are actually changed. 
Another often used method consists in smoothing the weight changes 
by overrelaxation, 
F +  AWoo(k) , (11) 
AWij (k+l) = - Wo. 3 
z0 
where AW.o(k) refers to the weight change after the presentation of 
the k-thearning example (or group of learning examples, respecti- 
vely). The use of a weight decay term, 
 = 
AWi 3 8W.. 3 ' 
x3 
prevents the algorithm from generating very large weights which may 
create such high barriers that a solution cannot be found in reason- 
able time. 
Such smoothing methods suppress the occurrence of oscillations, 
at least to a certain extent, and thus allow us to use higher lear- 
ning rates. They cannot prevent, however, that the algorithm may 
become trapped in bad local minimum. An obvious way to deal with the 
problem of local minima is to restart the algorithm with different 
initial weights or, equivalently, to randomize the weights with a 
certain probability p during the learning procedure. More sophisti- 
cated approaches involve, e.g., the use of hill-climbing methods. 
The properties of the error-surface over the weight space not 
only depend on the choice of the error function F, but also on the 
network architecture, on the type of units used, and on possible 
restrictions concerning the values which the weights are allowed to 
a s s u/e. 
The performance of a learning algorithm thus depends on many 
factors and parameters. These dependences are conveniently analyzed 
in terms of the behavior of an appropriately defined learning curve. 
For our small examples, where the learning set always consists of 
all input/output cases, we have chosen to represent the performance 
of a learning procedure by the fraction of networks that are 
"perfect" after the presentation of N input patterns. (Perfect net- 
works are networks which for every input pattern produce the correct 
output). Such learning curves give us much more detailed information 
about the behavior of the system than, e.g., averaged quantities 
like the mean learning time. 
RESULTS 
In the following, we shall present and discuss some represen- 
tative results of our empirical study. All learning curves refer to 
a set of 100 networks that have been exposed to the same learning 
procedure, where we have varied the initial weights, or the sequence 
76 
of learning examples, or both. With one exception (Figure 4), the 
sequences of learning examples are always random. 
A prototype pattern association problem is the exclusive-or 
(XOR) problem. Corresponding networks have two input units and one 
output unit. Let us first consider an XOR-network with only one 
hidden unit, but in which the input units also have direct connec- 
tions to the output unit. The weights are symmetric, and we use the 
deterministic version of the Boltzmann learning rule (see Eqs. (5) 
to (7)). Figure 1 shows results for the case of tabula rasa initial 
conditions, i.e. the initial weights are all set equal to zero. If 
the weights are changed after every learning example, about 2/3 of 
the networks learn the problem with less than 25 presentations per 
pattern (which corresponds to a total number of 4 x 25 = 100 presen- 
tations). The remaining networks (about 1/3), however, never learn 
to solve the XOR-problem, no matter how many input/output cases are 
presented. This can be understood by analyzing the corresponding 
evolution-tree in weight-space which contains an attractor consis- 
ting of 14 "non-perfect" weight-configurations. The probability to 
become trapped by this attractor is exactly 1/3. If the weight 
changes are accumulated over 4 learning examples, no such attractor 
too 
v, 80- 
o 
 80- 
J 0- 
0- 
o 
0 
o 
I I I I 
o o o 
o 
o o 
 oooo 
 o o 
 o 
 o o 
e o  
 o 
o 
o 
i I I I 
20 40 60 80 100 
PRESENTATIONS/PATTERN 
Fig. 1. Learning curves for an XOR-network with one hidden unit 
(deterministic Boltzmann learning, discrete 1 units, initial 
weights zero). Full circles: weights changed after every learning 
example; open circles: weight changes accumulated over 4 learning 
examples. 
?? 
seems to exist (see Fig. 1), but for some networks learning at least 
takes an extremely long time. The same saturation effect is observed 
with random initial weights (uniformly distributed between -1 and 
+1), see Fig. 2. 
Figure 2 also exhibits the difference in learning behavior 
between networks with 1 units and such with 0,1 units. In both 
cases, weight randomization leads to a considerably improved lear- 
ning behavior. A weight decay term, by the way, has the same effect. 
The most striking observation, however, is that 1 networks learn 
much faster than 0,1 networks (the respective average learning times 
differ by about a factor of 5). In this connection, we should mention 
that  = 0.1 is about optimal for 0,1 units and that for 1 networks 
the learning behavior is practically independent of the value of . 
It therefore seems that 1 units lead to a much more well-behaved 
error-surface than 0,1 units. One can argue, of course, that a 
discrete 0,1 model can always be translated into a 1 model, but 
this would lead to an energy function which has a considerably more 
complicated weight dependence than Eq. (5). 
too 
8o 
6o 
4o 
2o 
o 
2 
5 to 2o 50 1oo 200 t000 
PRESENTATIONS / PATTERN 
Fig. 2. Learning curves for an XOR-network with one hidden unit 
(deterministic Boltzmann learning, initial weights random, weight 
changes accumulated over 5 learning examples). Circles: discrete 1 
units,  = 1; triangles: discrete 0,1 units,  = 0.1; broken curves: 
without weight randomization; solid curves: with weight randomiza- 
tion (p = 0.025). 
78 
Figures 3 and 4 refer to a feedforward XOR-network with 3 
hidden units, and to backpropagation or generalized delta rule 
learning. In all cases we have included an overrelaxation (or momen- 
tum) term with  = 0.9 (see Eq. (11)). For the networks with contin- 
uous units we have used the activation functions given by Eqs. (2) 
and (3), respectively, and a network was considered "perfect" if for 
all input/output cases the error was smaller than 0.1 in the 0,1 
case, or smaller than 0.2 in the 1 case, respectively. 
In Figure 3, the weights have been changed after every learning 
example, and all curves refer to an optimal choice of the only 
remaining parameter,  or Q, respectively. For discrete as well as 
for continuous units, the 1 networks again perform much better than 
their 0,1 counterparts. In the continuous case, the average learning 
times differ by about a factor of 7, and in the discrete case the 
discrepancy is even more pronounced. In addition, we observe that in 
1 networks learning with the generalized delta rule for discrete 
units is about twice as fast as with the backpropagation algorithm 
for continuous units. 
too 
I I I 
v 80 
o 
 60 
a 40 
' 20' 
0 I 
,5 10 20 ,50 t00 200 .500 1000 
PRESENTATIONS / PATTERN 
Fig. 3. Learning curves for an XOR-network with three hidden units 
(backpropagation/generalized delta rule, initial weights random, 
weights changed after every learning example). Open circles: discre- 
te 1 units, g = 0.05; open triangles: discrete 0,1 units, g = 0.025; 
full circles: continuous 1 units, Q = 0.125; full triangles; contin- 
uous 0,1 units, Q = 0.25. 
79 
In Figure 4, the weight changes are accumulated over all 4 
input/output cases, and only networks with continuous units are 
considered. Also in this case, the 1 units lead to an improved 
learning behavior (the optimal q-values are about 2.5 and 5.0, 
respectively). They not only lead to significantly smaller learning 
times, but 1 networks also appear to be less sensitive with respect 
to a variation of q than the corresponding 0,1 versions. 
The better performance of the 1 models with continuous units 
can partly be attributed to the steeper slope of the chosen activa- 
tion function, Eq. (3). A comparison with activation functions that 
have the same slope, however, shows that the networks with 1 units 
still perform significantly better than those with 0,1 units. If the 
weights are updated after every learning example, e.g., the reduc- 
tion in learning time remains as large as a factor of 5. In the case 
of backpropagation learning, the main reason for the better perfor- 
mance of 1 units thus seems to be related to the fact that the 
algorithm does not modify weights which emerge from a unit with 
value zero. Similar observations have been made by Stornetta and 
Huberman, s who further find that the discrepancies become even more 
pronounced if the network size is increased. 
1oo 
,, 80 
o 
 60 
u 40 
70 
I 
W=2.5 W=5.0 
2.5 
o I 
0 50 100 150 2.00 2.50 
PRESENTATIONS / PATTERN 
Fi. 4. Learning curves for an XOR-network with three hidden units 
(backpropagation, initial weights random, weight changes accumulated 
over all 4 input/output cases). Circles: continuous 1 units; 
triangles: continuous 0,1 units. 
8O 
In Figure 5, finally, we present results for a network that 
learns to detect mirror symmetry in the input pattern. The network 
consists of one output, one hidden, and four input units which are 
also directly connected to the output unit. We use the deterministic 
version of Boltzmann learning and change the weights after every 
presentation of a learning pattern. If the weights are allowed to 
assume arbitrary values, learning is rather slow and on average 
requires almost 700 presentations per pattern. We have observed, 
however, that the algorithm preferably seems to converge to solu- 
tions in which geometrically symmetric weights are opposite in sign 
and almost equal in magnitude (see also Ref. 3). This means that the 
symmetric input patterns are automatically treated as equivalent, as 
their net input to the hidden as well as to the output unit is zero. 
We have therefore investigated what happens if the weights are 
forced to be antisymmetric from the beginning. (The learning proce- 
dure, of course, has to be adjusted such that it preserves this 
antisymmetry). Figure 5 shows that such a problem-adapted weight- 
structure leads to a dramatic decrease in learning time. 
too 
 80- 
- 60- 
u_ 40- 
M 
I I I ' 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
I I I I 
o 
o 
o 
o 
 
 
 o 
o 
 o 
1o 20 
o 
o 
o 
o 
o 
o 
o 
o 
I 
o 
o 
o 
0 I I I I 
2_ 5 50 100 2_00 500 2_000 
PRESENTATIONS / PATTERN 
Fig. 5. Learning curves for a symmetry detection network with 4 
input units and one hidden unit (deterministic Boltzmann learning, 
q = 1, discrete 1 units, initial weights random, weights changed 
after every learning example). Full circles: symmetry-adapted 
weights; open circles: arbitrary weights, weight randomization 
(p = 0.05). 
81 
CONCLUSIONS 
The main results of our empirical study can be summarized as 
follows: 
- Networks with 1 units quite generally exhibit a significantly 
faster learning than the corresponding 0,1 versions. 
- In addition, 1 networks are often less sensitive to parameter va- 
riations than 0,1 networks. 
- An adaptation of the weight-structure to the symmetries of the 
problem can lead to a drastic improvement of the learning behavior. 
Our qualitative interpretations seem to indicate that the ob- 
served effects should not be restricted to the small examples consi- 
dered in this paper. It would be very valuable, however, to have 
corresponding analytical results. 
REFERENCES 
1. "Parallel Distributed Processing: Explorations in the Microstruc- 
ture of Cognition", vol. 1: "Foundations", ed. by D.E. Rumelhart 
and J.L. McClelland (MIT Press, Cambridge), 1986, Chapters 7 & 8. 
2. Y. le Cun, in "Disordered Systems and Biological Organization", 
ed. by E. Bienenstock, F. Fogelman Souli, and G. Weisbuch (Sprin- 
ger, Berlin), 1986, pp. 233-240. 
3. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Nature 323, 533 
(1986). 
4. M.L. Mnsky and S. Papeft, "Percepttons" (IT Press, Cambrdge), 
1969. 
5. W.S. Stornetta and B.A. Huberman, IEEE Conference on "Neural Net- 
works", San Dego, Calforna, 21-24 3une 1987. 
