760 
A NOVEL NET THAT LEARNS 
SEQUENTIAL DECISION PROCESS 
G.Z. SUN, Y.C. LEE and H.H. CHEN 
Department of Physics and Astronomy 
and 
Institute for Advanced Computer Studies 
UNIVERSITY OF MARYLAND,COLLEGE PARK,MD 20742 
ABSTRACT 
We propose a new scheme to construct neural networks to classify pat- 
terns. The new scheme has several novel features: 
We focus attention on the important attributes of patterns in ranking 
order. Extract the most important ones first and the less important 
ones later. 
2. In training we use the information as a measure instead of the error 
function. 
3. A multi-perceptron-like architecture is formed auomatically. Decision 
is made according to the tree structure of learned attributes. 
This new scheme is expected to self-organize and perform well in large scale 
problems. 
 American Institute of Physics 1988 
761 
I INTRODUCTION 
It is well known that two-layered perceptron with binary connections but no 
hidden units is unsuitable as a classifier due to its limited power [1]. It cannot 
solve even the simple exclusive-or problem. Two extensions have been pro- 
osed to remedy this problem. The first is to use higher order connections 
]. It has been demonstrated that high order connections could in many 
cases solve the problem with speed and high accuracy [3], [4]. The repre- 
sentations in general are more local than distributive. The main drawback 
is however the combinatorial explosion of the number of high-order terms. 
Some kind of heuristic judgement has to be made in the choice of these terms 
to be represented in the network. 
A second proposal is the multi-layered binary network with hidden units 
5]. These hidden units function as features extracted from the bottom input 
ayer to facilitate the classification of patterns by the output units. In order 
to train the weights, learning algorithms have been proposed that back- 
propagate the errors from the visible output layer to the hidden layers for 
eventual adaptation to the desired values. The multi-layered networks enjoy 
great popularity in their flexibility. 
However, there are also problems in implementing the multi-layered nets. 
Firs fly, there is the problem of allocating the resources. Namely, how many 
hidden units would be optimal for a particular problem. If we allocate too 
many, it is not only wasteful but also could negatively affect the performance 
of the network. Since too many hidden units implies too many free param- 
eters to fit specifically the training patterns. Their ability to generalize to 
noval test patterns would be adversely affected. On the other hand, if too 
few hidden units were allocated then the network would not have the power 
even to represent the trainig set. How could one judge beforehand how many 
are needed in solving a problem? This is similar to the problem encountered 
in the high order net in its choice of high order terms to be represented. 
Secondly, there is also the problem of scaling up the network. Since the 
network represents a parallel or coorperative process of the whole system, 
each added unit would interact with every other units. This would become 
a serious problem when the size of our patterns becomes large. 
Thirdly, there is no sequential communication among the patterns in the 
conventional network. To accomplish a cognitive function we would need 
the patterns to interact and communicate with each other as the human 
reasoning does. It is difficult to envision such an interacton in current systems 
which are basically input-output mappings. 
2 THE NEW SCHEME 
In this paper, we would like to propose a scheme that constructs a network 
taking advantages of both the parallel and the sequential processes. 
We note that in order to classify patterns, one has to extract the intrinsic 
features, which we call attributes. For a complex pattern set, there may 
be a large number of attributes. But differnt attributes may have different 
762 
ranking of importance. Instead of extracing them all simultaneously it may 
be wiser to extract them sequentially in order of its importance [6], [7]. Here 
the importance of an attribute is determined by its ability to partition the 
pattern set into sub-categories. A measure of this ability of a processing unit 
should be based on the extracted information. For simplicity, let us assume 
that there are only two categories so that the units have only binary output 
values 1 and 0 ( but the input patterns may have analog representations). We 
call these units, including their connection weights to the input layer, nodes. 
For given connection weights, the patterns that are classified by a node as 
in category i may have their true classifications either i or O. Similarly, the 
patterns that are classified by a node as in category 0 may also have their 
true classifications either I or O. As a result, four groups of patterns are 
formed: (1,1), (0,0), (1,0), (0,1). We then need to judge on the efficiency of 
the node by its ability to split these patterns optimally. To do this we shall 
construct the impurity fuctions for the node. Before splitting, the impurity 
of the input patterns reaching the node is given by 
Ib -' _pb logpb _ po b logpo b (1) 
where P} = N/N is the probability of being truely classified as in category 
1, and P0  - No/N is the probability of being truely classified as in category 
O. After splitting, the patterns are channelled into two branches, the impurity 
becomes 
I. =-P  P(j, 1)logP(j, 1)- P  P(j,O)logP(j,O) (2) 
j=O,1 j=O,l 
where P -- Ni/N is the probability of being classified by the node as in 
category 1, P = NobiN is the probability of being classified by the node as 
in category O, and P(j, i) is the probability of a pattern, which should be in 
category j, but is classified by the node as in category i. The difference 
AI -- h -/' (3) 
represents the decrease of the impurity at the node after splitting. It is the 
quantity that we seek to optimize at each node. The logarithm in the im- 
purity function come from the information entropy of Shannon and Weaver. 
For all practical purpose, we found the. optimization of (3) the same as max- 
imizing the entropy [6] 
S N No 2 (_)2]+No Noo 2 .No)2 
- (00) 
(4) 
where Ni is the number of training patterns classified by the node as in 
category i, Nij is the number of training patterns with true classification in 
category i but classified by the node as in category j. Later we shall call the 
terms in the first bracket Sa and the second S2. Obviously, we have 
Ni = Noi + N i , i -- O, 1 
763 
After we trained the first unit, the training patterns were split into two 
branches by the unit. If the classificaton in either one of these two branches 
is pure enough, or equivalently either one of S and $2 is fairly close to 1, 
then we would terminate that branch ( or branches ) as a leaf of the decision 
tree, and classify the patterns as such. On the other hand, if either branch is 
not pure enough, we add additional node to split the pattern set further. The 
subsequent unit is trained with only those patterns channeled through this 
branch. These operations are repeated until all the branches are terminated 
as leaves. 
3 LEARNING ALGORITHM 
We used the stochastic gradient descent method to learn the weights of each 
node. The training set for each node axe those patterns being channeled to 
this node. As stated in the previous section, we seek to maximize the entropy 
function S. The learning of the weights is therefore conducted through 
Where r/is the learning rate. 
following equation 
os 
Using analog units 
we have 
OS 
,'xws = owj (5) 
The gradient of S can be calculated from the 
1 No:  . ann N; ONo 
N [(1 + (1 
_ 
No: o. ONto No ONoo 
(1 - 2-o)  + (1 - 2o ) ] 
Or _. 
1 
I + exp(-- Z.i WjI) 
O0 r 
_ or(1 - O")q 
(6) 
(7) 
(8) 
Furthermore, let A r = 1 or 0 being the true answer for the input pattern r, 
then 
N 
Nis = r__ [iAr + (1-i)(1- Ar) ] [jO r + (1-j)(1- Or)] (9) 
Substituting these into equation (5), we get 
[ N,, N,o N o N'] O"(1 - Or)I; (10) 
AWj=2r i 2A"(  No)+ No 2 ] 
In applying the formula (10),instead of calculating the whole summation at 
once, we update the weights for each pattern individually. Meanwhile we 
update Nij in accord with equation (9). 
764 
Figure 1: The given classification tree, where 0,02 and Oa are chosen to be 
all zeros in the numerical example. 
4 AN EXAMPLE 
To illustrate our method, we construct an example which is itself a decision 
tree. Assuming there are three hidden variables al, a2, aa, a pattern is given 
by a ten-dimensional vector Ia, I2, ..., Ira, constructed from the three hidden 
variables as follows 
I1 -- al-{-a3 I6 -- 2a3 
12 = 2a -- a2 b -- a3 - al 
I3 -- a3 -- 2a2 Is = 2al + 3a3 
I4 = a + 2a2 + 3a3 I9 = 4a3 -- 3a 
I$ = 5a--4a4 Lo = 2a-{-2a2-{-2a3. 
A given pattern is classified as either 1 (yes) or 0 (no) according to the 
corresponding values of the hidden variables a, a2, aa. The actual decision 
is derived from the decision tree in Fig. 1. 
In order to learn this classification tree, we construct a training set of 5000 
patterns generated by randomly chosen values a, a2, aa in the interval -1 to 
+1. We randomly choose the initial weights for each node, and terminate 
765 
S= 0.79 
S, = 0.60 
S2 = 0.87' 
(2519/55) 
S :0.65 
'Sz = O. 88 
 (1617/il4) 
S = 0.85 
S= 0.75 
n4 
S: 0.90/ 
(92/5) -/ 
S=0.96 
Figure 2: The learned classification tree structure 
a branch as a leaf whenever the branch entropy is greater than 0.80. The 
entropy is started at S = 0.65, and terminated at its maximum value S = 
0.79 for the first node. The two branches of this node have the entropy 
fuction valued at Sa = 0.61, S2 = 0.87 respectively. This corrosponds to 
2446 patterns channeled to the first branch and 2554 to the second. Since 
S2 > 0.80 we terminate the second branch. Among 2554 patterns channeled 
to the second branch there are 2519 patterns with true classification as no and 
35 /es which are considered as errors. After completing the whole training 
process, there are totally four nodes automatically introduced. The final 
result is shown in a tree structure in Fig.2. 
The total errors classified by the learned tree are 3.4 % of the 5000 trainig 
patterns. After trainig we have tested the result using 10000 novel patterns, 
the error among which is 3.2 %. 
5 SUMMARY 
We propose here a new scheme to construct neural network that can au- 
tomatically learn the attributes sequentially to facilitate the classification 
of patterns according to the ranking importance of each attribute. This 
scheme uses information as a measure of the performance of each unit. It is 
766 
self-organized into a presumably optimal structure for a specific task. The 
sequential learning procedure focuses attention of the network to the most 
important attribute first and then branches out' to the less important at- 
tributes. This strategy of searching for attributes would alleviate the scale 
up problem forced by the overall parallel back-propagation scheme. It also 
avoids the problem of resource allocation encountered in the high-order net 
and the multi-layered net. In the example we showed the performance of the 
new method is satisfactory. We expect much better performance in problems 
that demand large size of units. 
6 acknowledgement 
This work is partially supported by AFOSR under the grant 87-0388. 
References 
[1] M. Minsky and S. Papert, Perceptton, MIT Press Cambridge, Ma(1969). 
[2] 
Y.C. Lee, G. Doolen, H.H. Chen, G.Z. Sun, T. Maxwell, H.Y. Lee and 
C.L. Giles, Machine Learning Using A High Order Connection Netwe- 
ork, Physics D22,776-306 (1986). 
[3] 
H.H. Chen, Y.C. Lee, G.Z. Sun, H.Y. Lee, T. Maxwell and C.L. Giles, 
High Order Connection Model For Associate Memory, AIP Proceedings 
Vol.151,p.86, Ed. John Denker (1986). 
[4] 
T. Maxwell, C.L. Giles, Y.C. Lee and H.H. Chen, Nonlinear Dynamics 
of Artificial Neural System, AIP Proceedings Vol.151,p.299, Ed. John 
Denker(1986). 
[5] D. Rummenlhart and J. McClelland, Parallel Distributive Processing, 
MIT Press(1986). 
[6] L. Breiman, J. Friedman, R. Olshen, C.J. Stone, Classification and Re- 
gression Trees,Wadsworth Belmont, California(1984). 
[7] J.R. Quinlan, Machine Learning, Vol.1 No.1(1986). 
