850 
Strategies for Teaching Layered Networks 
Classification Tasks 
Ben S. Wittner I and John S. Denker 
AT&T Bell Laboratories 
Holmdel, New Jersey 07733 
Abstract 
There is a widespread misconception that the delta-rule is in some sense guaranteed to 
work on networks without hidden units. As previous authors have mentioned, there is 
no such guarantee for classification tasks. We will begin by presenting explicit counter- 
examples illustrating two different interesting ways in which the delta rule can fail. We 
go on to provide conditions which do guarantee that gradient descent will successfully 
train networks without hidden units to perform two-category classification tasks. We 
discuss the generalization of our ideas to networks with hidden units and to multi- 
category classification tasks. 
The Classification Task 
Consider networks of the form indicated in figure 1. We discuss vaxious methods for 
training such a network, that is for adjusting its weight vector, w. If we call the input 
v, the output is g(w. v), where g is some function. 
The classification task we wish to train the network to perform is the following. Given 
two finite sets of vectors, F1 and F2, output a number greater than zero when a vector in 
F1 is input, and output a number less than zero when a vector in F2 is input. Without 
significant loss of generality, we assume that g is odd (i.e. g(-s) - -g(s)). In that case, 
the task can be reformulated as follows. Define 2 
F := F U {-v such that v 6 F2} 
(1) 
and output a number greater than zero when a vector in F is input. The former 
formulation is more natural in some sense, but the later formulation is somewhat more 
convenient for analysis and is the one we use. We call vectors in F, training vectors. 
A Class of Gradient Descent Algorithms 
We denote the solution set by 
W := {w such that g(w. v) > 0 for all v  F}, 
Currently at NYNEX Science and Technology, 500 Westchester Ave., White Plains, NY 10604 
2We use both A := B and B =: A to denote "A is by definition B". 
(2) 
American Institute of Physics 1988 
851 
Wl W2 
g(w. v)) 
Figure 1: a simple network 
w 
output 
inputs 
and we are interested in rules for finding some weight vector in W. We restrict our 
attention to rules based upon gradient descent down error functions E(w) of the form 
E(w) = y] h(w. v). (3) 
vEF 
The delta-rule is of this form with 
h(w. v) = hs(w' v):= -(b - g(w. v)) 2 (4) 
for some positive number b called the target (Rumelhart, McClelland, et al.). We call 
the delta rule error function Es. 
Failure of Delta-rule Using Obtainable Targets 
Let g be any function that is odd and differentiable with g'(s) > 0 for all s. In this 
section we assume that the target b is in the range of g. We construct a set F of 
training vectors such that even though W is not empty, there is a local minimum of Es 
not located in W. In order to facilitate visualization, we begin by assuming that g is 
linear. We will then indicate why the construction works for the nonlinear case as well. 
We guess that this is the type of counter-example alluded to by Duda and Hart (p. 151) 
and by Minsky and Papert (p. 15). 
The input vectors are two dimensional. The arrows in figure 2 represent the training 
vectors in F and the shaded region is W. There is one training vector, v , in the second 
quadrant, and all the rest are in the first quadrant. The training vectors in the first 
quadrant are arranged in pairs symmetric about the ray R and ending on the line L. 
The line L is perpendicular to R, and intersects R at unit distance from the origin. 
Figure 2 only shows three of those symmetric pairs, but to make this construction work 
we might need many. The point p lies on R at a distance of g-(b) from the origin. 
We first consider the contribution to Es due to any single training vector, v. The 
contribution is 
(1/2)(b - g(w. v)) , (5) 
and is represented in figure 3 in the z-direction. Since g is linear and since b is in the 
82 
L 
-e no. '='{dYCUla 'bUtion .  {or oh-  
oa .-  is - tic ,_ 'gets 
Poi  a is t ' the co ,,,e trou_]o ectors , u a lin  
atp  ,,esu . batrib .... 
 oo it is - ol t% .. "?oa to t e first o, 'e 'pla- 
Ne, qadratic ratic tro- .error ,. at, then 
_  We _ bowl .,:.. Ughs .,:.. 'Ction 
.uran uaSider  a bott_ta bott 
 qadr..Y, r ,UScon,_..,rarilv_. tp. Sn? vectors :_ 
a; 'tic bo...; o,e coa,_.- "'noOtio C*'teep b.- - 
"um _ ..Y ro  .. o  _ .  due _ . .rilv 
Si- be SUci_, s p, b.- qadra,,  
gradien. o s a o.. ,,e p So - e bowl  
' have coafir,,s ev. Q'E.D. etly st t 
 ied the  o Con,. ' . 
Cceptees 
853 
y-axis 
Figure 3: Error surface 
We now remove the assumption that g is linear. The key observation is that 
dhs/ds = hs'(s) = (b- #(s))(-#'(s)) 
(6) 
still only has a single zero at g-(b) and so h(s) still has a single minimum at g-(b). 
The contribution to Es due to the training vectors in the first quadrant therefore still 
has a global minimum on the xy-plane at the point p. So, as in the linear case, if there 
are enough symmetric pairs of training vectors in the first quadrant, the value of Eo 
at p can be made arbitrarily lower than the value along some circle in the xy-plane 
centered around p, and Es = Eo + E will have a local minimum arbitrarily near p. 
Q.E.D. 
Failure of Delta-rule Using Unobtainable Targets 
We now consider the case where the target b is greater than any number in the range 
of g. The kind of counter-example presented in the previous section no longer exists, 
but we will show that for some choices of g, including the traditional choices, the delta 
rule can still fail. Specifically, we construct a set F of training vectors such that even 
though W is not empty, for some choices of initial weights, the path traced out by going 
down the gradient of Es never enters W. 
854 
V 1 / 
R ,e 
/ L 
2 
x-axis 
Figure 4: Counter-example for unobtainable targets 
We suppose that g has the following property. There exists a number r > 0 such that 
)km - 0. (7) 
An example of such a g is 
2 
g(s)- tanh(s) - i + e -2s 1, (8) 
for which any r greater than 1 will do. 
The solid arrows in figure 4 represent the training vectors in F and the more darkly 
shaded region is W. The set F has two elements, 
vl: [-2] and v2: ()() [ ] (9) 
The dotted ray, R lies on the diagonal {y: 
Since 
Es(w) -- hs(w' v x) + hs(w. v2), (10) 
855 
the gradient descent algorithm follows the vector field 
-VE(w) = -h'(w. vl)v 1 - h'(w. v2)v 2. (11) 
The reader can easily verify that for all w on R, 
W' V 1 = --rw' v 2. (12) 
So by equation (7), if we constrain w to move along R, 
lim -h5(w' vl) 
=0. (13) 
w-.o -/zd(w. 
Combining equations (11) and (13) we see that there is a point q somewhere on R such 
that beyond q, -VE(w) points into the region to the right of R, as indicated by the 
dotted arrows in figure 4. 
Let L be the horizontal ray extending to the right from q. Since for all s, 
g'(s) > 0 and b > g(s), (14) 
we get that 
- h6'(s) = (b- g(s))g'(s) > 0. (15) 
So since both v 1 and v 2 have a positive y-component, -VE(w) also has a positive 
y-component for all w. So once the algorithm following -VE enters the region above 
L and to the right of R (indicated by light shading in figure 4), it never leaves. Q.E.D. 
Properties to Guarantee Gradient Descent Learning 
In this section we present three properties of an error function which guarantee that 
gradient descent will not fail to enter a non-empty W. 
We call an error function of the form presented in equation (3) well formed if h is 
differentiable and has the following three properties. 
1. For all s, -h'(s) > 0 (i.e. h does not push in the wrong direction). 
2. There exists some e > 0 such that -h'(s) >  for all s _< 0 (i.e. h keeps pushing 
if there is a misclassification). 
3. h is bounded below. 
Proposition 1 If the error function is well formed, then gradient descent is guaranteed 
to enter W, provided W is not empty. 
856 
The proof proceeds by contradiction. Suppose for some starting weight vector the path 
traced out by gradient descent never enters W. Since W is not empty, there is some 
non-zero w* in W. Since F is finite, 
A := min{w*. v such that v 5 F} >' 0. 
(16) 
Let w(t) be the path traced out by the gradient descent algorithm. So 
w'(t) = -WE(w(t)): y] -a'(w(t). v)v for all t. (17) 
Since we are assuming that at least one training vector is misclassified at all times, by 
properties 1 and 2 and equation (17), 
So 
w*. w'(t) _> , for all t. 
[w'(t)[ > /]w*[ =:  > 0 for all t. 
(18) 
(19) 
By equations (17) and (19), 
dE(w(t))/dt = VE. w'(t)= -w'(t). w'(t) < _2 < 0 for all t. (20) 
This means that 
E(w(t))  -oe as t  oo. (21) 
But property 3 and the fact that F is finite guarantee that E is bounded below. This 
contradicts equation (21) and finishes the proof. 
Consensus and Compromise 
So far we have been concerned with the case in which F is separable (i.e. W is not 
empty). What kind of behavior do we desire in the non-separable case? One might 
hope that the algorithm will choose weights which produce correct results for as many 
of the training vectors as possible. We suggest that this is what gradient descent using 
a well formed error function does. 
From investigations of many well formed error functions, we suspect the following well 
formed error function is representative. Let !/(s) = s, and for some b > 0, let 
(b-s) 2 ifs<b; 
h(s) = 0 otherwise. (22) 
In all four frames of figure 5 there are three training vectors. Training vectors 1 and 2 
are held fixed while 3 is rotated to become increasingly inconsistent with the others. In 
frames (i) and (ii) F is separable. The training set in frame (iii) lies just on the border 
between separability and non-separability, and the one in frame (iv) is in the interior of 
857 
Figure 5: The transition between seperability and non-seperability 
the non-separable regime. Regardless of the position of vector 3, the global minimum 
of the error function is the only minimum. 
In frames (i) and (ii), the error function is zero on the shaded region and the shaded 
region is contained in W. As we move training vector number 3 towards its position in 
frame (iii), the situation remains the same except the shaded region moves arbitrarily 
far from the origin. At frame (iii) there is a discontinuity; the region on which the 
error function is at its global minimum is now the one-dimensional ray indicated by 
the shading. Once training vector 3 has moved into the interior of the non-separable 
regime, the region on which the error function has its global minimum is a point closer 
to training vectors 1 and 2 than to 3 (as indicated by the "x" in frame (iv)). 
If all the training vectors can be satisfied, the algorithm does so; otherwise, it tries to 
satisfy as many as possible, and there is a discontinuity between the two regimes. We 
summarize this by saying that it finds a consensus if possible, otherwise it devises a 
compromise. 
Hidden Layers 
For networks with hidden units, it is probably impossible to prove anything like propo- 
sition 1. The reason is that even though property 2 assures that the top layer of weights 
858 
gets a non-vanishing error signal for misclassified inputs, the lower layers might still get 
a vanishingly weak signal if the units above them are operating in the saturated regime. 
We believe it is nevertheless a good idea to use a well formed error function when 
training such networks. Based upon a probabilistic interpretation of the output of the 
network, Baum and Wilczek have suggested using an entropy error function (we thank 
J.J. Hopfield and D.W. Tank for bringing this to our attention). Their error function 
is well formed. Levin, Solla, and Fleisher report simulations in which switching to the 
entropy error function from the delta-rule introduced an order of magnitude speed-up 
of learning for a network with hidden units. 
Multiple Categories 
Often one wants to classify a given input vector into one of many categories. One popular 
way of implementing multiple categories in a feed-forward network is the following. Let 
the network have one output unit for each category. Denote by oy(w) the output of 
the j-th output unit when input v is presented to the network having weights w. The 
network is considered to have classified v as being in the k-th category if 
o(w) > o}'(w) for all j  k. (23) 
The way such a network is usually trained is the generalized delta-rule (Rumelhart, 
McClelland, et al.). Specifically, denote by c(v) the desired classification of v and let 
v / b if j = c(v); (24) 
b :- -b otherwise, 
for some target b > 0. One then uses the error function 
:= - 
This formulation has several bothersome aspects. For one, the error function is not will 
formed. Secondly, the error function is trying to adjust the outputs, but what we really 
care about is the differences between the outputs. A symptom of this is the fact that 
the change made to the weights of the connections to any output unit does not depend 
on any of the weights of the connections to any of the other output units. 
To remedy this and also the other defects of the delta rule we have been discussing, we 
suggest the following. For each v and j, define the relative coordinate 
:= (26) 
859 
What we really want is all the fi to be positive, so use the error function 
(27) 
v 
for some well formed h. In the simulations we have run, this does not always help, but 
sometimes it helps quite a bit. 
We have one further suggestion. Property 2 of a well formed error function (and the 
fact that derivatives are continuous) means that the algorithm will not be completely 
satisfied with positive/; it will try to make them greater than zero by some non-zero 
margin. That is a good thing, because the training vectors are only representatives of 
the vectors one wants the network to correctly classify. Margins are critically important 
for obtaining robust performance on input vectors not in the training set. The problem 
is that the margin is expressed in meaningless units; it makes no sense to use the same 
numerical margin for an output unit which varies a lot as is used for an output unit 
which varies only a little. We suggest, therefore, that for each j and v, keep a running 
estimate of cr(w), the variance of fi(w), and replace/(w) in equation (27) by 
(28) 
Of course, when beginning the gradient descent, it is difficult to have a meaningful 
estimate of cry(w) because w is changing so much, but as the algorithm begins to 
converge, your estimate can become increasingly meaningful. 
References 
1. David Rumelhart, James McClelland, and the PDP Research Group, Parallel Dis- 
tributed Processing, MIT Press, 1986 
2. Richard Duda and Peter Hart, Pattern Classification and Scene Analysis, John 
Wiley & Sons, 1973. 
3. Marvin Minsky and Seymour Papeft, "On Percepttons", Draft, 1987. 
4. Eric Baum and Frank Wilczek, these proceedings. 
5. Esther Levin, Sara A. Solla, and Michael Fleisher, private communications. 
