174 
A Neural Network C1A-sifier Based on Coding Theory 
Tzi-Dar Chiueh and Rodney Goodman 
California Institute of Technology, Pasadena, California 91125 
ABSTRACT
The new neural network classifier we propose transforms the 
classification problem into the coding theory problem of decoding a noisy 
codeword. An input vector in the feature space is transformed into an internal 
representation which is a codeword in the code space, and then error correction 
decoded in this space to classify the input feature vector to its class. Two classes 
of codes which give high performance are the Hadamard matrix code and the 
maximal length sequence code. We show that the number of classes stored in an 
N-neuron system is linear in N and significantly more than that obtainable by 
using the Hopfield type memory as a classifier. 
I. INTRODUCTION 
Associative recall using neural networks has recently received a great deal 
of attention. Hopfield in his papers [1,2] describes a mechanism which iterates 
through a feedback loop and stabilizes at the memory element that is nearest the 
input, provided that not many memory vectors are stored in the machine. He has 
also shown that the number of memories that can be stored in an N-neuron 
system is about 0.15N for N between 30 and 100. McEliece et al. in their work [3] 
showed that for synchronous operation of the Hopfield memory about N/(21ogN) 
data vectors can be stored reliably when N is large. Abu-Mostafa [4] has predicted 
that the upper bound for the number of data vectors in an N-neuron Hopfield 
machine is N. We believe that one should be able to devise a machine with M, the 
number of data vectors, linear in N and larger than the 0.15N achieved by the 
Hopfield method. 
N 
Feature Space = B = {-1 ,1 } 
Ae 
eC 
B. 
N L 
L 
Code Space = B = 
Figure 1 (a) Classification problems versus (b) Error control decoding problems 
In this paper we are specifically concerned with the problem of 
classification as in pattern recognition. We propose a new method of building a 
neural network classifier, based on the well established techniques of error 
control coding. Consider a typical classification problem (Fig. l(a)), in which one 
is given apriori a set of classes, C(a), a =1 ..... M. Associated with each class is a 
feature vector which labels the class ( the exemplar of the class ), i.e. it is the 
@ American Institute of Physics 1988 
175 
most representative point in the class region. The input is classified into the 
class with the nearest exemplar to the input. Hence for each class there is a 
region in the N-dimensional binary feature space BN --- [1,-1)N, in which every 
vector will be classified to the corresponding class. 
A similar problem is that of decoding a codeword in an error correcting 
code as shown in Fig. l(b). In this case codewords are constructed by design and 
are usually at least dmin apart. The received corrupted codeword is the input to 
the decoder, which then finds the nearest codeword to the input. In principle 
then, ff the distance between codewords is greater than 2t + 1, it is possible to 
decode (or classify) a noisy codeword (feature vector) into the correct codeword 
(exemplar) provided that the Hamming distance between the noisy codeword and 
the correct codeword is no more than t. Note that there is no guarantee that the 
exemplars are uniformly distributed in BN, consequently the attraction radius 
(the maximum number of errors that can occur in any given feature vector such 
that the vector can still be correctly classified) will depend on the minimum 
distance between exemplars. 
Many solutions to the minimum Hamming distance classification have 
been proposed, the one commonly used is derived from the idea of matched filters 
in communication theory. Lippmann [5] proposed a two-stage neural network 
that solves this classification problem by first correlating the input with all 
exemplars and then picking the maximum by a "winner-take-all" circuit or a 
network composed of two-input comparators. In Figure 2, fl,f2 ..... fN are the N 
input bits, and Sl,S2 .... SM are the matching scores(similarity) of f with the M 
exemplars. The second block picks the maximum of Sl,S2 ..... SM and produces the 
index of the exemplar with the largest score. The main disadvantage of such a 
classifier is the complexity of the maximum-picking circuit, for example a 
"winner-take-all" net needs connection weights of large dynamic range and 
graded-response neurons, whilst the comparator maximum net demands M-1 
comparators organized in log2M stages. 
fl 
S. c(CO+ 
g= e  
fN 
M 
A 
X 
I 
M 
U 
M 
class(f ) 
f= d(a+ ) e 
Featu re Code 
Space Space 
ODER 
DEC 
Fig. 2 A matched filter type classifier Fig. 3 Structure of the proposed classifier 
Our main idea is thus to transform every vector in the feature space to a 
vector in some code space in such a way that every exemplar corresponds to a 
codeword in that code. The code should preferably (but not necessarily) have the 
property that codewords are uniformly distributed in the code space, that is, the 
Hamming distance between every pair of codewords is the same. With this 
transformation, we turn the problem of classification into the coding problem of 
decoding a noisy codeword. We then do error correction decoding on the vector in 
the code space to obtain the index of the noisy codeword and hence classify the 
original feature vector, as shown in Figure 3. 
This paper develops the construction of such a classification machine as 
follows. First we consider the problem of transforming the input vectors from the 
feature space to the code space. We describe two hetero-associative memories for 
doing this, the first method uses an outer product matrix technique similar to 
176 
that of Hopfield's, and the second method generates its matrix by the 
pseudoinverse technique[6,7]. Given that we have transformed the problem of 
associative recall, or classification, into the problem of decoding a noisy 
codeword, we next consider suitable codes for our machine. We require the 
codewords in this code to have the property of orthogonality or 
pseudo-orthogonality, that is, the ratio of the cross-correlation to the 
auto-correlation of the codewords is small. We show two classes of such good 
codes for this particular decoding problem i.e. the Hadamard matrix codes, and 
the maximal length sequence codes[8]. We next formulate the complete decoding 
algorithm, and describe the overall structure of the classifier in terms of a two 
layer neural network. The first layer performs the mapping operation on the 
input, and the second one decodes its output to produce the index of the class to 
which the input belongs. 
The second part of the paper is concerned with the performance of the 
classifier. We first analyze the performance of this new classifier by finding the 
relation between the maximum number of classes that can be stored and the 
classification error rate. We show (when using a transform based on the outer 
product method) that for negligible misclassification rate and large N, a not very 
tight lower bound on M, the number of stored classes, is 0.22N. We then present 
comprehensive simulation results that confirm and exceed our theoretical 
expectations. The simulation results compare our method with the Hopfield 
model for both the outer product and pseudo-inverse method, and for both the 
analog and hard limited connection matrices. In all cases our classifier exceeds 
the performance of the Hopfield memory in terms of the number of classes that 
can be reliably recovered. 
II. TRANSFORM TECHNIQUF_ 
Our objective is to build a machine that can discriminate among input 
vectors and classify each one of them into the appropriate class. Suppose 
d(a) e BN is the exemplar of the corresponding class C(C0, ct = 1,2 ..... M. Given the 
input f, we want the machine to be able to identify the class whose exemplar is 
closest to f, that is, we want to calculate the following function, 
class(f) = a 
iff If-d(COI <lf-d(l 
where   denotes Hamming distance in BN. 
We approach the problem by seeking a transform , that maps each 
exemplar d(a) in BN to the corresponding codeword w(a) in BL. And an input 
feature vector f = d() + e is thus mapped to a noisy codeword g = w() + e' where e 
is the error added to the exemplar, and e' is the corresponding error pattern in the 
code space. We then do error correction decoding on g to get the index of the 
corresponding codeword. Note that e' may not have the same Hamming weight as 
e, that is, the transformation , may either generate more errors or eliminate 
errors that are present in the original input feature vector. We require , to 
satisfy the following equation, 
a=o, 1 ..... M-1 
and , will be implemented using a single-layer feedfonvard network. 
177 
Thus we first construct a matrix according to the sets of d(a)'s and w(a)'s, call it T, 
and define  as 
where sgn is the threshold operator that maps a vector in RL to BL and R is the 
field of real numbers. 
Let D be an N x M matrix whose ath column is d(a) and W be an L x M 
matrix whose [th column is w([). The two possible methods of constructing the 
matrix for  are as follows: 
Scheme A [outer product mth0cl) [3,6]: In this scheme the matrix T is 
defined as the sum of outer products of all exemplar-codeword pairs, i.e. 
M-1 
T (A)ij = ,, wi(a)' dj(a) 
a=0 
or equivalently, 
T(A) = WDt 
Scheme B (pseudo-inverse method) [6,7]  We want to find a matrix T(B) 
satisfying the following equation, 
In general D is not a square matrix, moreover D may be singular, so D-1 
may not exist. To circumvent this difficulty, we calculate the pseudo-inverse 
(denoted DS) of the matrix D instead of its real inverse, let DS-- (DtD)-lDt. T(B) can 
be formulated as, 
T(B) = W D = W (Dr D)-IDt 
IlL CODES 
The codes we are looking for should preferably have the property that its 
codewords be distributed uniformly in BL, that is, the distance between each two 
codewords must be the same and as large as possible. We thus seek classes of 
equidistant codes. Two such classes are the Hadamard matrix codes, and the 
maximal length sequence codes. 
First der'me the word pseudo-orthogonal. 
Definition: Let w(a) = (w0(a),Wl(a) ....... wL_l(a)) E BL be the ath codeword of 
code C, where cc = 1,2 ..... M. Code C is said to be pseudo-orthogonal iff 
L-1 
(w(a),w()) =  
i=0 
w(a) w() 
where E << L 
where (,) denotes inner product of two vectors. 
Hadamard Matrices: An orthogonal code of length L whose L codewords are 
rows or columns of an L x L Hadamard matrix. In this case e = 0 and the 
distance between any two codewords is L/2. It is conjectured that there exist such 
codes for all L which are multiples of 4, thus providing a large class of codes[8). 
178 
Maximal Length Sequence Codes: There exists a famfiy of maximal length 
sequence (also called pseudo-random or PN sequence) codes[8], generated by shift 
registers, that satisfy pseudo-orthogonality with e = -1. Suppose g (x) is a 
primitive polynomial over GF (2) of degree D, and let L = 2D - 1, and if 
f(x) = 1/g (x) = Z Ck' xk 
k=0 
then co,c 1 ....... is a periodic sequence of period L ( since g (x) I x L _ 1). If code C is 
made up of the L cyclic shifts of 
c = (1-2c0,1-2Cl .... 1-2CL_l) 
then code C satisfies pseudo-orthogonality with e = - 1. One then easily sees that 
the minimum distance of this code is (L - 1)/2 which gives a correcting power of 
approximately L/4 errors for large L. 
IV. OVERAI.I. CLASSIFIER STRUCTURE 
We shall now describe the overall classifier structure, essentially it 
consists of the mapping  followed by the error correction decoder for the 
maximal length sequence code or Hadamard matrix code. The decoder operates 
by correlating the input vector with every codeword and then thresholding the 
result at (L + e)/2. The rationale of this algorithm is as follows, since the distance 
between every two codewords in this code is exactly (L - e)/2 bits, the decoder 
should be able to correct any error pattern with less than (L - e)/4 errors if the 
threshold is set halfway between L and e i.e. (L + e )/2. 
Suppose the input vector to the decoder is g = w(a) + e and e has Hamming 
weight s (i.e. s nonzero components) then we have 
(g,w(C0) = L- 2s 
(g,w{5) < 2s+ e 
where 
From the above equation, if g is less than (L- e)/4 errors away from w(a) 
{i.e. s < {L - e)/4 ) then ( g, w(a)) will be more than {L + e)/2 and (g, w([)) wfil be 
less than (L + e)/2, for all [ g= a. As a result, we arrive at the following decoding 
algorithm, 
decode{ = sgn{Wtg -{(L+ /2)J) 
where J = [ 1 1 ..... 1 ]t, which is an M x 1 vector. 
In the case when e = -1 and less than (L+l)/4 errors in the input, the output 
will be a vector in BM ----- {1,-1}M with only one component positive (+1), the index 
of which is the index of the class that the input vector belongs. However if there 
are more than (L+1)/4 errors, the output can be either the all negative(- 1) vector 
(decoder failure) or another vector with one positive component{decoder error). 
The function class can now be defined as the composition of g and decode, 
the overall structure of the new classifier is depicted in Figure 4. It can be viewed 
as a two-layer neural network with L hidden units and M output neurons. The 
first layer is for mapping the input feature vector to a noisy codeword in the code 
space ( the "internal representation" ) while the second one decodes the first's 
output and produces the index of the class to which the input belongs. 
179 
fl 
 
 
 
fN-1 
fN 
Figure 4 
or (B) W t 
T (A) T 
g 
gL 
hi 
 
 
 
Overall architecture of the new neural network classifier 
V. PERFORMANCE ANALYSIS 
From the previous section, we know that our classifier will make an error 
only if the transformed vector in the code space, which is the input to the decoder, 
has no less than (L - e)/4 errors. We now proceed to find the error rate for this 
classifier in the case when the input is one of the exemplars (i.e. no error), say 
f = d() and an outer product connection matrix for . Following the approach of 
McEliece et. al.[3], we have 
N-1 M-1 
(d()),= ( Z Z wi(a) dj(a)dj() ) 
j=o (I=o 
sgn( Nw( + 
N-1 M-1 
j=o a=o 
Assume without loss of generality that wi() = - 1, and ff 
wl(a) dj(a) dj() ) 
then 
N-1 M-1 
X ---- Z Z wi(a) dj(a)dj() -> N 
j=O a=0 
( d{))i  wi{) 
Notice that we assumed all d{a)'s are random, namely each component of 
any d(a) is the outcome of a Bernoulli trial, accordingly, X is the sum of N(M-1) 
independent identically distributed random variables with mean 0 and variance 
1. In the asymptotic case, when N and M are both very large, X can be 
approximated by a normal distribution with mean 0, variance NM. Thus 
P 
--- Pt{ ( d(j))i  wi(O) } 
1 t2/2 
where 9(x) = - e 
dt 
180 
Next we calculate the mtsclassfftcatlon rate of the new classifier as follows 
(assuming e << L), 
L 
PE = =L/4 (L)pk(1-p)L-k 
k j k 
where L J is the integer floor. Since in general it is not possible to express the 
summation explicitly, we use the Chernoff method to bound Pe from above. 
Multiplying each term in the summation by a number larger than unity 
( et(k - L/4) with t > 0 ) and summing from k = 0 instead of k = LL/4J, 
L 
Pe < Z (L)pk(1.p)L-ket(k-L/4) = e-Lt/4(l_p+pet)L 
k=0 k 
Differentiating the RIdS of the above equation w.r.t. t and set it to 0, we find 
the optimal to as eto = (1-p)/3p. The condition that to > 0 implies that p < 1/4, 
and since we are dealing with the case where p is small, it is automatically 
satisfied. Substituting the optimal to, we obtain 
Pe < c L .pL/4. (1_p)3Iy4 where c = 4/(33/4 ) = 1.7547654 
From the expression for Pe, we can estimate M, the number of classes that 
can be classffied with negligible misclasstfication rate, in the following way, 
suppose Pe = / where / << land p << 1, then 
/4/L< C 4 'P .{1_p)3  p = Q( N > c -4  (l-p)-3 . 4/L 
For small x we have Q-I(z) - /21og(l/z) and since 6 is a fixed value, as L 
approaches infinity, we have 
M > N = N 
81ogc 4.5 
From the above lower bound for M, one easily see that this new machine is able to 
classify a constant times N classes, which is better than the number of memory 
items a Hopfield model can store i.e. N/(21ogN). Although the analysis is done 
assuming N approaches infinity, the simulation results in the next section show 
that when N is moderately large (e.g. 63) the above lower bound applies. 
VI. SIMULATION RESULTS AND A CHARACTER RECOGNITION EXAMPI E 
We have simulated both the Hopfield model and our new machine(using 
maximal length sequence codes) for L = N = 31, 63 and for the following four cases 
respectively. 
(i) connection matrix generated by outer product method 
(ii) connection matrix generated by Dseudo-inverse mth0d 
(iii) connection matrix generated by outer t)roduct method, the components of the 
connection matrix are hard limited. 
(iv) connection matrix generated by pseudo-inverse method, the components of 
the connection matrix are hard limited. 
181 
For each case and each choice of N, the program fixes M and the number of 
errors in the input vector, then randomly generates 50 sets of M exemplars and 
computes the connection matrix for each machine. For each machine it 
randomly picks an exemplar and adds noise to it by randomly complementing 
the specified number of bits to generate 20 trial input vectors, it then simulates 
the machine and checks whether or not the input is classified to the nearest class 
and reports the percentage of success for each machine. 
The simulation results are shown in Figure 5, in each graph the horizontal 
axis is M and the vertical axis is the attraction radius. The data we show are 
obtained by collecting only those cases when the success rate is more than 98%, 
that is for fixed M what is the largest attraction radius (number of bits in error of 
the input vector) that has a success rate of more than 98%. Here we use the 
attraction radius of -1 to denote that for this particular M, with the input being 
an exemplar, the success rate is less than 98/6 in that machine. 
-e- Hopfield Model o- New Classifier(OP) -- New Classffier(PI) [ 
.o  o,[. N=31  1 
  : .t Binary Connection Matrix   
,  , ,t'-. , ,.. 
'IV '" 
M 
(a) 
N=31 
Analog Connection Matrix 
M 
 23 
="  x Binary Connection Matrix 
.   . 
 13' N=63 
i Analog Connection Matrix 
3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 
(c) (d) 
Figure 5 Simulation results of the Hopfield memory and the new classifier 
182 
I-*- Hopfield Model - New Classifier(OP,L=63) - New Classifier(OP,L=31) I 
Figure 6 Performance of the new classifier using codes of different lengths 
In all cases our classifier exceeds the performance of the Hopfield model 
in terms of the number of classes that can be reliably recovered. For example, 
consider the case of N = 63 and a hard limited connection matrix for both the new 
classifier and the Hopfield model, we find that for an attraction radius of zero, 
that Is, no error in the input vector, the Hopfield model has a classification 
capacity of approximately 5, while our new model can store 47. Also, for an 
attraction radius of 8, that is, an average of N/8 errors in the Input vector, the 
Hopfield model can reliably store 4 classes while our new model stores 27 
classes. Another simulation (Ftg. 6) using a shorter code (L = 31 instead of L = 63) 
reveals that by shortening the code, the performance of the classifier degrades 
only slightly. We therefore conjecture that it is possible to use traditional error 
correcting codes (e.g. BCH code) as internal representations, however, by going to 
a higher rate code, one Is trading minimum distance of the code (error tolerance) 
for complexity (number of hidden units), which implies possibly poorer 
performance of the classifier. 
We also notice that the superiority of the pseudoinverse method over the 
outer product method appears only when the connection matrices are hard 
limited. The reason for this is that the pseudoinverse method is best for 
decorrelating the dependency among exemplars, yet the exemplars in this 
simulation are generated randomly and are presumably independent, 
consequently one can not see the advantage of pseudoInverse method. For 
correlated exemplars, we expect the pseudoinverse method to be clearly better 
(see next example). 
Next we present an example of applying this classifier to recognizing 
characters. Each character is represented by a 9 x 7 pixel array, the input Is 
generated by flipping every pixel with 0.1 and 0.2 probability. The input is then 
passed to five machines: Hopfield memory, the new classfflei  with either the 
pseudoinverse method or outer product method, and L = 7 or L = 31. Figure 7 and 8 
show the results of all 5 machines for 0.1 and 0.2 pixel flipping probability 
respectively, a blank output means that the classifier refuses to make a decision. 
First note that the L = 7 case is not necessarily worse than the L = 31 case, this 
confirms the earlier conjecture that fewer hidden units (shorter code) only 
degrades performance slightly. Also one easily sees that the pseudoinverse 
method is better than the outer product method because of the correlation 
between exemplars. Both methods outperform the Hopfield memory since the 
latter mixes exemplars that are to be remembered and produces a blend of 
exemplars rather than the exemplars themselves, accordingly it cannot classify 
the input without mistakes. 
183 
Figure 7 The character recognition 
example with 10% pixel reverse 
probabfiity (a) input (b) correct 
output {c) Hopfield Model (d)-(g) new 
classifier (d) OP, L = 7 (e)OP, L = 31 
(i) PI, L= 7 (g) PI, L= 31 
Figure 8 The character recognition 
example with 20% pixel reverse 
probability (a) input (b) correct 
output (c) Hopfield Model (d)-(g) new 
classifier (d) OP, L = 7 (e)OP, L = 31 
(0 PI, L= 7 {g} PI, L = 31 
VII. CONCLUSION 
In this paper we have presented a new neural network classifier design 
based on coding theory techniques. The classifier uses codewords from an error 
correcting code as its internal representations. Two classes of codes which give 
high performance are the Hadamard matrix codes and the maximal length 
sequence codes. In performance terms we have shown that the new machine is 
significantly better than using the Hopfield model as a classifier. We should also 
note that when comparing the new classifier with the Hopfield model, the 
increased performance of the new classifier does not entail extra complexity, 
since it needs only L + M hard limiter neurons and L(N + M) connection weights 
versus N neurons and N2 weights in a Hopfield memory. 
In conclusion we believe that our model forms the basis of a fast, practical 
method of classification with an efficiency greater than other previous neural 
network techniques. 
References
[1] J.J. Hopfield, Proc. Nat. Acacl. ScL USA, Vol. 79, pp. 2554-2558 (1982). 
[2] J.J. Hopfield, Proc. Nat. Acad. ScL USA, Vol. 81, pp. 3088-3092 (1984). 
[3] R.J. McEliece, et. al,/EEE Tran. on Information Theory, Vol. IT-33, 
pp. 461-482 (1987). 
[4] Y. S. Abu-Mostafa and J. St. Jacques, IEEE Tran. on Information Theory , 
Vol. IT-31, pp. 461-464 (1985). 
[5] R. Lippmann, IEEEASSPMagazine, Vol. 4, No. 2, pp. 4-22 (Aprfi 1987). 
[6] T. Kohonen, Associative Memory- A System-Theoretical Approach 
(Springer-Verlag, Berlin Heidelberg, 1977). 
[7] S.S. Venkatesh,Linear Map with Point Rules, Ph.D Thesis, Caltech, 1987. 
[8] E. R- Berlekamp, Algebraic Coding Theory , Aegean Park Press, 1984. 
