338 
The Connectivity Analysis of Simple Association 
How Many Connections Do You Need? 
Dan Hammerstrom * 
Oregon Graduate Center, Beaverton, OR 97006 
ABSTRACT 
The efficient realization, using current silicon technology, of Very Large Connection 
Networks (VLCN) with more than a billion connections requires that these networks exhibit 
a high degree of communication locMity. Real neural networks exhibi't significant locality, 
yet most connectionist/neurM network models have little. In this paper, the connectivity 
requirements of a simple associative network are analyzed using communication theory. 
Several techniques based on communication theory are presented that improve the robust- 
ness of the network in the face of sparse, local interconnect structures. Also discussed are 
some potential problems when information is distributed too widely. 
INTRODUCTION 
Connectionist/neural network researchers are learning to program networks that exhi- 
bit a broad range of cognitive behavior. Unfortunately, existing computer systems are lim- 
ited in their ability to emulate such networks efficiently. The cost of emulating a network, 
whether with special purpose, highly parallel, silicon-based architectures, or with traditional 
parallel architectures, is directly proportional to the number of connections in the network. 
This number tends to increase geometrically as the number of nodes increases. Even with 
large, massively parallel architectures, connections take time and silicon area. Many exist- 
ing neural network models scale poorly in learning time and connections, precluding large 
implementations. 
The connectivity 'costs of a network are directly related to its locality. A network 
exhibits locality of communication 1 if most of its processing elements connect to other physi- 
cally adjacent processing elements in any reasonable mapping of the elements onto a planar 
surface. There is much evidence that real neural networks exhibit locality 2 . In this paper, 
a technique is presented for analyzing the effects of locality on the process of association. 
These networks use a complex node similar to the higher-order learning units of Maxwell et 
al. 3 
NETWORK MODEL 
The network model used in this paper is now defined (see Figure 1). 
Definition 1: A recursive neural network, called a c-graph is 
F( V,E, C), where: 
 
a graph structure, 
There is a set of CNs (network nodes), V, whose outputs can take a range of positive 
real values, vi, between 0 and 1. There are N nodes in the set. 
There is a set of codohs, E, that can take a range of positive real values, eii (for 
codon ] of node i), between 0 and 1. There are N codons dedicated to each CN (the 
output of each codon is only used by its local CN), so there are a total of N N codohs 
in the network. The fan-in or order of a codon is re. It is assumed that f is the 
same for each codon, and N is the same for each CN. 
*This work was supported in part by the Semiconductor Research Corporation contract no. 86-10-097, and 
jointly by the Office of Naval Research and Air Force Office of Scientific Research, ONR contract no. N00014 87 K 
0259. 
 American Institute of Physics 1988 
339 
f i codon 
v 
Figure 1 - A CN 
 ciik 6 C is a set of connections of CNs to codons, l<i,k_<N, and i<_j<_N, ciik can 
take two values {0,1} indicating the existence of a connection from CN k to codon j 
of CN i. [] 
Definition : The value of CN i is 
v, = F 0+e// (1) 
The function, F, is a continuous non-linear, monotonic function, such as the sigmoid func- 
tion. [] 
Definition 3: Define a mapping, D(i,j,T)--*', where T is an input vector to F and ' is 
the f element input vector of codon j of CN i. That is, ' has as its elements those ele- 
ments of x} of  where cii}=l , V k. [] 
The D function indicates the subset of  seen by codon j of CN i. Different input vec- 
tors may map to the same codon vectors, e.g., D(i,j,T)--* and D(i,j,z--., where TT. 
Definition 6: The codon values ei are determined as follows. Let -rn) be input vector 
m of the M learned input vectors for CN i. For codon eii of CN i, let Tq be the set of f- 
dimensional vectors such that Tii(m)ETii , and D(i,j,-m))-.Tq(m). That is, each vector, 
Tq(m) in Tq consists of those subvectors of -rn) that are in codon ij's receptive field. 
The variable I indexes the L(i,]) vectors of Tii. The number of distinct vectors in 
may be less than the total number of learned vectors (L(i,j)_<M). Though the -rn) are 
distinct, the subsets, Tii(rn), need not be, since there is a possible many to one mapping of 
the T vectors onto each vector 
Let X 1 be the subset of vectors where vi=l (CN i is supposed to output a 1), and X  be 
those vectors where vl--O , then define 
n.(/) = size_of{D(i,j,x-m))st vi=q} (2) 
for q=0,1, and t m that map to this I. That is, %.(l) is the number of ' vectors that map 
340 
into 'q(!) where v-0 and ni.(/) is the number of Y vectors that map into '/j(l), where vi---1. 
The compression of a codon for a vector'q(l) then is defined as 
= (s) 
(HCq(!)O when both n , n=0.) The output of codon ij, q, is the maximum-likelyhd 
decoding 
= (4) 
ere HC indicates the likelyhd of vi=l when  vector Y that mps to g is input, nd g 
is that vector g) where min[d(g),] V l, D(i,j,)y, nd Y is the current input vec- 
tor.  other words, g is that vector (of the set of subset learned vectors that codon ij 
receives) that is closest (using distance measure d) to  (the subset input vector). n 
The output of  eodon is the "most-likely" output ccording to its inputs. For exm- 
pie, when there is no code compreion t  codon, ff=l, if the "closest" (in ter of some 
measure of vector distance, e.g. Hmming distance) subvector in the receptive field of the 
eodon belongs to  learned vector where the CN is to output  1. The codons described here 
re ve similar to those propped by Marr 4 and implement nearest-neighbor classification. 
It is umed that eodon function is determined stticlly prior to network operation, that 
is, the desired ctegofies hve lredy been learned. 
To measure performance, network epacity is used. 
Definition 5: The input noie, fl, is the verge d between n input vector nd the 
eldest (nimum d) learned vector, where d is  measure of the "difference" between two 
vectors - for bit vectors this cn be Hmming distance. The output nois, flo, is the verge 
distance between network output nd the learned output vector ssocited with the closest 
learned input vector. The information gain, G, is just 
Definition 6: The capacity of a network is the maximum number of learned vectors such 
that the information gain, Gi, is strictly positive (0). [] 
COMMUNICATION ANALOGY 
Consider a single connection network node, or CN. (The remainder of this paper will 
be restricted to a single CN.) Assume that the CN output value space is restricted to two 
values, 0 and 1. Therefore, the CN must decide whether the input it sees belongs to the 
class of "0" codes, those codes for which it remains off, or the class of "1" codes, those codes 
for which it becomes active. The inputs it sees in its receptive field constitute a subset of 
the input vectors (the D(...) function) to the network. It is also assumed that the CN is an 
ideal 1-NN (Nearest Neighbor) classifier or feature detector. That is, given a particular set 
of learned vectors, the CN will classify an arbitrary input according to the class of the 
nearest (using d h as a measure of distance) learned vector. This situation is equivalent to 
the case where a single CN has a single codon whose receptive field size is equivalent to that 
of the CN. 
Imagine a sender who wishes to send one bit of information over a noisy channel. The 
sender has a probabilistic encoder that choses a code word (learned vector) according to 
some probability distribution. The receiver knows this code set, though it has no knowledge 
of which bit is being sent. Noise is added to the code word during its transmission over the 
341 
channel, which is analogous to applying an input vector to a network's inputs, where the 
vector lies within some learned vector's region. The "noise" is represented by the distance 
(dh) between the input vector and the associated learned vector. 
The code word sent over the channel consists of those bits that are seen in the recep- 
tive field of the CN being modeled. In the associative mapping of input vectors to output 
vectors, each CN must respond with the appropriate output (0 or 1) for the associated 
learned output vector. Therefore, a CN is a decoder that estimates in which class the 
received code word belongs. This is a classic block encoding problem, where increasing the 
field size is equivalent to increasing code length. As the receptive field size increases, the 
performance of the decoder improves in the presence of noise. Using communication theory 
then, the trade-off between interconnection costs as they relate to field size and the func- 
tionality of a node as it relates to the correctness of its decision making process (output 
errors) can be characterized. 
As the receptive field size of a node increases, so does the redundancy of the input, 
though this is dependent on the particular codes being used for the learned vectors, since 
there are situations where increasing the field size provides no additional information. 
There is a point of diminishing returns, where each additional bit provides ever less reduc- 
tion in output error. Another factor is that interconnection costs increase exponentially 
with field size. The result of these two trends is a cost performance measure that has a sin- 
gle global maximum value. In other words, given a set of learned vectors and their proba- 
bilities, and a set of interconnection costs, a "best" receptive field size can be determined, 
beyond which, increasing connectivity brings diminishing returns. 
SINGLE CODON, WITH NO CODE COMPRESSION 
A single neural element with a single codon and with no code compression can be 
modelled exactly as a communication channel (see Figure 2). Each network node is assumed 
to have a single codon whose receptive field size is equal to that of the receptive field size of 
the node. 
noisy  
sender receiver 
encoder transmitter receiver 
decoder 
CN 
Figure 2 - A Transmission Channel 
342 
The operation of the channel is as follows. A bit is input into the channel encoder, 
which selects a random code of length N and transmits that code over the channel The 
receiver then, using nearest neighbor classification, decides if the original message was either 
a0oral. 
Let M be the number of code words used by the encoder. The rate* then indicates the 
density of the code space. 
Definition 7: The rate, R, of a communication channel is 
1oM 
R -- N (6) 
The block length, N, corresponds directly to the receptive field size of the codon, i.e., 
N=f. The derivations in later sections use a related measure: 
Definition 8: The code utilization, b, is the number of learned vectors assigned to a par- 
ticular code or 
M 
b ----- 2N (7) 
b can be written in terms of R 
b = 2N(R-) (S) 
As b approaches 1, code compression increases. b is essentially unbounded, since M may be 
significantly larger than 2 N. [] 
The decode error (information loss) due to code compression is a random variable that 
depends on the compression rate and the a priori probabilities, therefore, it will be different 
with different learned vector sets and codons within a set. As the average code utilization 
for all codons approaches 1, code compression occurs more often and codon decode error is 
unavoidable. 
Let  be the vector output of the encoder, and the input to the channel, where each 
element of // is either a 1 or 0. Let  be the vector output of the channel, and the input to 
the decoder, where each element is either a 1 or a 0. The Noisy Channel Coding Theorem is 
now presented for a general case, where the individual M input codes are to be dis- 
tinguished. The result is then extended to a CN, where, even though M input codes are 
used, the CN need only distinguish those codes where it must output a 1 from those where it 
must output a 0. The theorem is from Gallager (5.6.1) 5 . Random codes are assumed 
throughout. 
Theorem i: Let a discrete memoryless channel have transition probabilities P(j/k) 
and, for any positive integer N and positive number R, consider the ensemble of (N,R) 
block codes in which each letter of each code word is independently selected according to 
the probability assignment Q(k). Then, for each message m, lm[eR], and all p, 
0_p_l, the ensemble average probability of decoding error using maximum-likelyhood 
decoding satisfies 
where 
*In the definitions given here and the theorems below, the notation of Ga]]ager 5 is used. Many of the 
definitions and theorems are also from Gallager. 
343 
k P 
Q() (Y/ 
(lO) 
These results are now adjusted for our special case. 
Theorem 2: For a single CN, the average channel error rate for random code vectors is 
P_<2q(1-q )t.m (11) 
where q=Q(k) V k is the probability of an input vector bit being a 1. 121 
These results cover a wide range of models. A more easily computable expression can 
be derived by recognizing some of the restrictions inherent in the CN model. First, assume 
that all channel code bits are equally likely, that is,  k, Q(k)--q, that the error model is 
the Binary Symmetric Channel (BSC), and that the errors are identically distributed and 
independent -- that is, each bit has the same probability, e, of being in error, independent 
of the code word and the bit position in the code word. 
A simplified version of the above theorem can be derived. Maximizing p gives the 
tightest bounds: 
Pc, < 0.5 max?(p) (12 
-- 0_p< 
where (letting codon input be the block length, N=f) 
lz(p) _ exp{-f [Eo(p)-pR] } (13) 
The minimum value of this expression is obtained when p=l (for q=0.5): 
E o = --log 2 0.5N/-+0.5 (14) 
SINGLE-CODON WITH CODE COMPRESSION 
Unfortunately, the implementation complexity of a codon grows exponentially with the 
size of the codon, which limits its practical size. An alternative is to approximate single 
codon function of a single CN with many smaller, overlapped codohs. The goal is to main- 
tain performance and reduce implementation costs, thus improving the cost/performance of 
the decoding process. As codohs get smaller, the receptive field size becomes smaller relative 
to the number of CNs in the network. When this happens there is codon compression, or 
vector aliasing, that introduces its own errors into the decoding process due to information 
loss. Networks can overcome this error by using multiple redundant codohs (with overlap- 
ping receptive fields) that tend to correct the compression error. 
Compression occurs when two code words requiring different decoder output share the 
same representation (within the receptive field of the codon). The following theorem gives 
the probability of incorrect codon output with and without compression error. 
Theorem 3: For a BSC model where q=0.5, the codon receptive field is re, the code util- 
ization is b, and the channel bits are selected randomly and independently, the probability 
of a codon decoding error when b>l is approximately 
Pc,,, --< (1-e)/:r- [1-(1-6)1'] 0.5 (15) 
where the expected compression error per eodon is approximated by 
344 
 = 0.5- 2'X/bq(1-q) 
nd from equations 13-14, when b<l 
Proof is given in Hmmerstrom  . [] 
As b grows,  approaches 0.5 sympoically. Thus, he performance of a single codon 
degrades rapidly in the presence of even small amounts of compreion. 
MULTIPLE CODONS WITH CODE COMPRESSION 
The use of multiple small codons is more efficient thn a few large codons, but there 
are some fundamental performance constraints. When a codon is split into two or more 
smaller codohs (and the original receptive field is subdivided accordingly), there are several 
effects to be considered. First, the error rate of each new codon increases due to a decrease 
in receptive field size (the codoh'S block code length). The second effec is ha the code 
uilizaion, b, will increase for each codon, since he same number of learned vectors is 
mapped ino a smaller receptive field. This change al increases he error rae per codon 
due o code compreion. In fc, as he individual codon receptive fields ge smaller, 
significan code compreion occurs. For higher-order inpu codes, here is an added error 
tha occurs when he order of he individual codohs is decreased (since random codes are 
being umed, his effec is no considered here). The hird effec is he ma action of 
large numbers of codons. Even hough individual codohs may be in error, if he majority 
are correct, hen he CN will have correc output. This effec decreases he oal error rae. 
ume ha ech CN hs more han one codon, c 1. The union of he receptive fields 
for hese codohs is the receptive field for he CN wih no no restrictions on he degree of 
overlap of he vrious codon receptive fields within or between CNs. For a CN wih a large 
number of codohs, he codon overlap will generally be random and uniformly distributed. 
 assume ha he ransion errors seen by differen receptive fields are independent. 
Now consider wha happens o a codon's compreion error rae (ignoring ransmission 
error for he ime being) when a codon is replaced by wo or more smaller codohs covering 
he same receptive field. This replacemen process cn continue until here are only 1- 
codohs, which, incidenMly, is analogous o mos curren neural models. For  multiple 
codon CN, aume ha each codon voes a I or 0. The summation uni hen otals his 
information and outputs a I if he majority of codohs voe for a 1, ec. 
Theorem 4: The probability of a CN error due o compreion error is 
f 
V(-) 
where  is given in equation 16 and q=0.5. 
P incorporates the wo effects of moving o multiple smaller codohs and adding more 
codohs. Using equation 17 gives he oal error probability (per bi), Po: 
Pon= P,+P-P,P (19) 
Prf i in Hmmerrom  .  
345 
For networks that perform association as defined in this paper, the connection weights 
rapidly approach a single uniform value as the size of the network grows. In information 
theoretic terms, the information content of those weights approaches zero as the compres- 
sion increases. Why then do simple non-conjunctive networks (1-codon equivalent) work at 
all? In the next section I define connectivity cost constraints and show that the answer to 
the first question is that the general associative structures defined here do not scale cost- 
effectively and more importantly that there are limits to the degree of distribution of infor- 
mation. 
CONNECTIVITY COSTS 
It is much easier to assess costs if some implementation medium is assumed. I have 
chosen standard silicon, which is a two dimensional surface where CN's and codohs take up 
surface area according to their receptive field sizes. In addition, there is area devoted to 
the metal lines that interconnect the CNs. A specific VLSI technology need not be assumed, 
since the comparisons are relative, thus keeping CNs, codohs, and metal in the proper pro- 
portions, according to a standard metal width, m (which also includes the inter-metal 
pitch). For the analyses performed here, it is assumed that m e levels of metal are possible. 
In the previous section I established the relationship of network performance, in terms 
of the transmission error rate, , and the network capacity, M. In this section I present an 
implementation cost, which is total silicon area, A. This figure can then be used to derive a 
cost/performance figure that can be used to compare such factors as codon size and recep- 
tive field size. There are two components to the total area: AcN , the area of a CN, and 
AMi , the area of the metal interconnect between CNs. AcN consists of the silicon area 
requirements of the codons for all CNs. The metal area for local, intra-CN interconnect is 
considered to be much smaller than that of the codohs themselves and of that of the more 
global, inter-CN interconnect, and is not considered here. The area per CN is roughly 
AN-- cLmc(--, ) 2 (20) 
where me is the maximum number of vectors that each codon must distinguish, for b_l, 
me = 2 
Theorem 5: Assume a rectangular, unbounded* grid of CNs (all CNs are equi-distant 
from their four nearest neighbors), where each CN has a bounded receptive field of its 
cf 
nearest CNs, where nc v is the receptive field size for the CN, nc v - R ' where c is the 
number of codons, and R is the intra-CN redundancy, that is, the ratio of inputs to 
synapses (e.g., when R=I each CN input is used once at the CN, when R--2 each input is 
used on the average at two sites). The metal area required to support each CN's receptive 
field 
is (proof is giving by Hammerstrom ): 
m. 
AM! = '--"--+ 2 '+9nCN2 
The total area per CN, A, then is 
*Another implementation strategy is to place all CNs along a diagonal, which gives n 2 area. However, this 
technique only works for a bounded number of CNs and when dendritic computation can be spread over a large 
area, which limits the range of possible CN implementations. The theorem stated here covers an infinite plane of 
CNs each with a bounded receptive field. 
346 
A -- (AMi+AcN) = O(nN ) (22) 
Even with the assumption of maximum locality, the total metal interconnect area 
increases as the cube of the per CN receptive field size! 
SINGLE CN SIMULATION 
What do the bounds tell us about CN connectivity requirements? From simulations, 
increasing the CN's receptive field size improves the performance (increases capacity), but 
there is also an increasing cost, which increases faster than the performance! Another 
observation is that redundancy is quite effective as a means for increasing the effectiveness 
of a CN with constrained connectivity. (There are some limits to R, since it can reach a 
point where the intra-CN connectivity approaches that of inter-CN for some situations.) 
With a fixed ncv, increasing cost-effectiveness (A/m) is possible by increasing both order 
and redundancy. 
In order to verify the derived bounds, I also wrote a discrete event simulation of a CN, 
where a random set of learned vectors were chosen and the CN's codohs were programmed 
according to the model presented earlier. Learned vectors were chosen randomly and sub- 
jected to random noise, e. The CN then attempted to categorize these inputs into two 
major groups (CN output = I and CN output = 0). For the most part the analytic bounds 
agreed with the simulation, though they tended to be optimistic in slightly underestimating 
the error. These differences can be easily explained by the simplifying assumptions that 
were made to make the analytic bounds mathematically tractable. 
DISTRIBUTED S. LOCALIZED 
Throughout this paper, it has been tacitly assumed that representations are distributed 
across a number of CNs, and that any single CN participates in a number of representa- 
tions. In a local representation each CN represents a single concept or feature. It is the dis- 
tribution of representation that makes the CN's decode job difficult, since it is the cause of 
the code compression problem. 
There has been much debate in the connectionist/neuromodelling community as to the 
advantages and disadvantages of each approach; the interested reader is referred to Hin- 
ton ? , Baum et al. s, and Ballard 9 . Some of the results derived here are relevant to this 
debate. As the distribution of representation increases, the compression per CN increases 
accordingly. It was shown above that the mean error in a codoh's response quickly 
approaches 0.5, independent of the input noise. This result also holds at the CN level. For 
each individual CN, this error can be offset by adding more codohs, but this is expensive 
and tends to obviate one of the arguments in favor of distributed representations, that is, 
the multi-use advantage, where fewer CNs are needed because of more complex, redundant 
encodings. As the degree of distribution increases, the required connectivity and the code 
compression increases, so the added information that each codon adds to its CN's decoding 
process goes to zero (equivalent to all weights approaching a uniform value). 
SUMMARY AND CONCLUSIONS 
In this paper a single CN (node) performance model was developed that was based on 
Communication Theory. Likewise, an implementation cost model was derived. 
The communication mode] introduced the codon as a higher-order decoding element 
and showed that for small codons (much less than total CN fan-in, or convergence) code 
compression, or vector aliasing, within the codon's receptive field is a severe problem for 
347 
large networks. As code compression increases, the information added by any individual 
codon to the CN's decoding task rapidly approaches zero. 
The cost model showed that for 2-dimensional silicon, the area required for inter-node 
metal connectivity grows as the cube of a CN's fan-in. 
The combination of these two trends indicates that past a certain point, which is 
highly dependent on the probability structure of the learned vector space, increasing the 
fan-in of a CN (as is done, for example, when the distribution of representation is increased) 
yields diminishing returns in terms of total cost-performance. Though the rate of diminish- 
ing returns can be decreased by the use of redundant, higher-order connections. 
The next step is to apply these techniques to ensembles of nodes (CNs) operating in a 
competitive learning or feature extraction environment. 
[3] 
[4] 
[5] 
[9] 
REFERENCES 
J. Bailey, "A VLSI Interconnect Structure for Neural Networks," Ph.D. 
Dissertation, Department of Computer Science/Engineering, OGC. In Preparation. 
V. B. Mountcastle, "An Organizing Principle for Cerebral Function: The Unit 
Module and the Distributed System," in The Mindful Brain, MIT Press, Cambridge, 
M_A, 1977. 
T. Maxwell, C. L. Giles, Y. C. Lee and H. H. Chen, "Transformation Invariance 
Using High Order Correlations in Neural Net Architectures," Proceedings 
International Conf. on Systems, Man, and Cybernetics, 1986. 
D. Marr, "A Theory for Cerebral Neocortex," Proc. Roy. Soc. London, vol. 
176(1970), pp. 161-234. 
R. G. Gallager, Information Theory and Reliable Communication, John Wiley and 
Sons, New York, 1968. 
D. Hammerstrom, "A Connectivity Analysis of Recursice, Auto-Associative 
Connection Networks," Tech. Report CS/E-86-009, Dept. of Computer 
Science/Engineering, Oregon Graduate Center, Beaverton, Oregon, August 1986. 
G. E. Hinton, "Distributed Representations," Technical Report CMU-CS-84-157, 
Computer Science Dept., Carnegie-Mellon University, Pittsburgh, PA 15213, 1984. 
E. B. Baum, J. Moody and F. Wilczek, "Internal Representations for Associative 
Memory," Technical Report NSF-ITP-86-138, Institute for Theoretical Physics, 
Santa Barbara, CA, 1986. 
D. H. Ballard, "Cortical Connections and Parallel Processing: Structure and 
Function," Technical Report 133, Computer Science Department, Rochester, NY, 
January 1985. 
