A Real Time Clustering CMOS 
Neural Engine 
T. Serrano-Gotarredona, B. Linares-Barranco, and J. L. Huertas 
Dept. of Analog Design, National Microelectronics Center (CNM), Ed. CICA, Av. Reina 
Mercedes s/n, 41012 Sevilla, SPAIN. Phone: (34)-5-4239923, Fax: (34)-5-4624506, 
E-mail: bernabe@cnm. us. es 
Abstract 
We describe an analog VLSI implementation of the ART1 algorithm 
(Carpenter, 1987). A prototype chip has been fabricated in a standard low 
cost 1.5gm double-metal single-poly CMOS process. It has a die area of 
lcrn 2 and is mounted in a 120-pins PGA package. The chip realizes a 
modified version of the original ART1 architecture. Such modification 
has been shown to preserve all computational properties of the original 
algorithm (Serrano, 1994a), while being more appropriate for VLSI 
realizations. The chip implements an ART1 network with 100 F1 nodes 
and 18 F2 nodes. It can therefore cluster 100 binary pixels input patterns 
into up to 18 different categories. Modular expansibility of the system is 
possible by assembling an NxM array of chips without any extra 
interfacing circuitry, resulting in an F1 layer with 100xN nodes, and an 
F2 layer with 18xM nodes. Pattern classification is performed in less 
than 1.8gs, which means an equivalent computing power of 2.2x109 
connections and connection-updates per second. Although internally the 
chip is analog in nature, it interfaces to the outside world through digital 
signals, thus having a true asynchrounous digital behavior. Experimental 
chip test results are available, which have been obtained through test 
equipments for digital chips. 
1 INTRODUCTION 
The original ART1 algorithm (Carpenter, 1987) proposed in 1987 is a massively parallel 
architecture for a self-organizing neural binary-pattern recognition machine. In response 
to arbitrary orderings of arbitrarily many and complex binary input patterns, ART1 is 
756 T. Serrano-Gotarredona, B. Linares-Barranco, J. L. Huertas 
Initialize weights: 
zij= 1 
I = (11, 12, '"IN) 
Winner-Take-All: 
yj = 1 if 
yj = 0 ifj:J 
I 
Update weights: I 
Zj -' I(zj 
new old 
Fig. 1: Modified Fast Learning or Type-3 ART1 
implementation algorithm 
capable of learning, in an unsupervised way, stable recognition codes. The ART1 
architecture is described by a set of Short Term Memory (STM) and another set of Long 
Term Memory (LTM) time domain nonlinear differential equations. It is valid to assume 
that the STM equations settle much faster (instantaneously) than the LTM equations, so 
that the STM differential equations can be substituted by nonlinear algebraic equations 
that describe the steady-state of the STM differential equations. Furthermore, in the 
fast-learning mode (Carpenter, 1987), the LTM differential equations are as well 
substituted by their corresponding steady-state nonlinear algebraic equations. This way, 
the ART1 architecture can be behaviorally modelled by the sequential application of 
nonlinear algebraic equations. Three different levels of ART1 implementations (both in 
software and in hardware) can therefore be distinguished: 
Type-I: Full Model Implementation: both STM and LTM time-domain differential 
equations are realized. This implementation is the most expensive (both in software 
and in hardware), and requires a large amount of computational power. 
Type-2: STM steady-state Implementation: only the LTM time-domain differential 
equations are implemented. The STM behavior is governed by nonlinear algebraic 
equations. This implementation requires less resources than the previous one. 
However, a proper sequencing of STM events has to be introduced artificially, which 
is architecturally implicit in the Type-1 implementation. 
Type-3: Fast Learning Implementation: STM and LTM is implemented with algebraic 
equations. This implementation is computationally the less expensive one. In this case 
an artificial sequencing of STM and LTM events has to be done. 
The implementation presented in this paper realizes a modified version of the original 
ART1 Type-3 algorithm, more suitable for VLSI implementations. Such modified ART1 
system has been shown to preserve all computational properties of the original ART1 
architecture (Serrano, 1994a). The flow diagram that describes the modified ART1 
architecture is shown in Fig. 1. Note that there is only one binary-valued weight template 
.. ), instead of the two weight templates (one binary-valued and the other real-valued) of 
 original ART1. For a more detailed discussion of the modified ART1 algorithm refer to 
(Serrano, 1994a, 1994b). 
In the next Section we will provide an analog current-mode based circuit that implements 
in hardware the flow diagram of Fig. 1. Note that, although internally this circuit is analog 
in nature, from its input and output signals point of view it is a true asynchronous digital 
A Real Time Clustering CMOS Neural Engine 75 7 
circuit, easy to interface with any conventional digital machine. Finally, in Section 3 we 
will provide experimental results measured from the chip using a digital data acquisition 
test equipment. 
2 CIRCUIT DESCRIPTION 
The ART1 chip reported in this paper has an F1 layer with 100 neurons and an F2 layer 
with 18 neurons. This means that it can handle binary input patterns of 100 pixels each, 
and cluster them into up to 18 different categories according to a digitally adjustable 
vigilance parameter p. The circuit architecture of the chip is shown in Fig. 2(a). It consists 
of an array of 18x100 synapses, a lx100 array of "vigilance synapses", a unity gain 
18-outputs current mirror, an adjustable gain 18-outputs current mirror (with p=0.0, 0.1, ... 
0.9)1, 18 current-comparator-controlled switches and an 18-input-currents 
Winner-Take-All (WTA) (Serrano, 1994b). The inputs to the circuit are the 100 binary 
digital input voltages li, and the outputs of the circuit are the 18 digital output voltages yj. 
External control signals allow to change parameters p, L A , L, and Lt. Also, extra 
circuitry has been added for reading the internal weights zij while the system is learning. 
Each row of synapses generates two currents, 
100 100 
5 = LA  zijli-LB  ziJ + LM 
i=1 i=1 
00 (1) 
Vj = L A  zijl i 
i=1 
while the row of the "vigilance synapses" generates the current 
100 
Vp = m A  [i (2) 
i=1 
Each of the current comparators compares the current V. versus p V , and allows current 
J P . 
T. to reach the WTA only if p V <Vj This way competition and mgilance occur 
./ .p- 
smultaneously and in parallel, speeding up significantly the search process. 
Fig. 2(b) shows the content of a synapse in the 18x100 array. It consists of three current 
sources with switches, two digital AND gates and a flip-flop. Each synapse receives two 
input voltages I, and y,, and two global control voltages q, (to enable/disable learning) 
and reset (to iitializ all weights zij to '1'). Each syntpse generates two currents 
L_l.z..- L-z.. and L_l.z.. ,which will be summed up for all the synapses in the same row to 
1 t tJ 1 tJ 1 t tJ 
generate the currents T. and V.. If learning is enabled (q. = 1 ) the value of z.. will 
change to l.z.. if y. = 1 The vigilance synapses" consist each of a current-source of 
t tj j ' 
value L A with a switch controlled by the input voltage I i . The current comparators are 
those proposed in (Dominguez-Castro, 1992), the WTA used is reported in (Lazzaro, 
1989), and the digitally adjustable current mirror is based on (Loh, 1989), while its 
continuous gain fine tuning mechanism has been taken from (Adams, 1991). 
1. An additional pin of the chip can fine-tune p between 0.9 and 1.0. 
758 T. Serrano-Gotarredona, B. Linares-Barranco, J. L. Huertas 
I 1 12 I10 0 
(a) (b) 
t_--:-  ...... _-. ........ 
q*l reset I i 
Fig. 2: (a) System Diagram of Current-Mode ART1 Chip, (b) Circuit Diagram of Synapse 
A 
 10[ (600) 
(3600) 
Fig. 3: Tree based current-mirror scheme for matched current sources 
The circuit has been designed in such a way that the WTA operates with a precision 
around 1.5% (-6 bits). This means that all L A and L B current sources have to match 
within an error of less than that. From a circuit implementation point of view this is not 
easy to achieve, since there are 5500 current sources spread over a die area of lcm 2. 
Typical mismatch between reasonable size MOS transistors inside such an area extension 
can be expected to be above 10% (Pelgrom, 1989). To overcome this problem we 
implemented a tree-based current mirror scheme as is shown in Fig. 3. Starting from a 
unique current reference, and using high-precision 10(or less)-outputs current mirrors 
(each yielding a precision around 0.2%), only up to four cascades are needed. This way, 
the current mismatch attained at the synapse current sources was around 1% for currents 
between LAIn = 5gA and LAIn = 10gA. This is shown in Fig. 4, where the measured dc 
output current-error (in %) versus input current of the tree based configuration for 18 of 
the 3600 L A synapse sources is depicted. 
A Real Time Clustering CMOS Neural Engine 759 
Fig. 4: Measured current mirror cascade missmatch (1%/div) for L A for currents below 10gA 
3 EXPERIMENTAL RESULTS 
Fig. 5 shows a microphotograph of a prototype chip fabricated in a standard digital 
double-metal, single-poly 1.5gm low cost CMOS process. The chip die area is 1 cm 2, and 
it is mounted in a 120-pins PGA package. Fig. 6 shows a typical training sequence 
accomplished by the chip and obtained experimentally using a test equipment for digital 
chips. The only task performed by the test equipment was to provide the input data 
patterns I (first column in Fig. 6), detect which of the output nodes became '1' (pattern 
with a vertical bar to its right), and extract the learned weights. Each 10x 10 square in Fig. 
6 represents either a 100-pixels input vector I, or one row of 100-pixels synaptic weights 
z.-- (z.., z., z .... ) . Each row of squares in Fig 6 represents the input pattern (first 
j j z '" _uuj ' 
square) an the 18 vectors zj after learning has been performed for this input pattern. The 
sequence shown in Fig. 6 hts been obtained for p = 0.7, L A = 10gA, L B = 9.5gA, and 
L M = 950gA. Only two iterations of input patterns presentations were necessary, in this 
case, for the system to learn and self-organize in response to these 18 input patterns. 
The last row in Fig. 6 shows the final learned templates. Fig. 7 shows final learned 
templates for different values of p. The numbers below each square indicate the input 
patterns that have been clustered into each zj category. 
Delay time measurements have been performed for the feedforward action of the chip 
(establishment of currents T., V., and V , and their competitions until the WTA settles), 
J J p 
and for the updating of weights. The feedforward delay is pattern and bias currents (L A , 
L, L M) dependent, but has been measured to be always below 1.6gs. The learning time 
is constant and is around 180ns. Therefore, throughput time is less than 1.8gs. A digital 
neuroprocessor able to perform a connections/s, b connection-updates/s, and with a 
dedicated WTA section with a c seconds delay, must satisfy 
760 T. Serrano-Gotarredona, B. Linares-Barranco, J. L. Huertas 
Fig. 5: Microphotograph of ART1 chip 
3700 100 
--+ +c = 1.Sgs (3) 
to meet the performance of our lrototype chip. If a = b and c = lOOns, the equivalent 
speed would be a = b = 2.2 x 10' connections and connection-updates per second. 
4 CONCLUSIONS 
A high speed analog current-mode categorizer chip has been built using a standard low 
cost digital CMOS process. The high performance of the chip is achieved thanks to a 
simplification of the original ART1 algorithm. The simplifications introduced are such that 
all the original computational properties are preserved. Experimental chip test results are 
provided. 
A Real Time Clustering CMOS Neural Engine 761 
Fig. 6: Test sequence obtained experimentally for p=0.7, LA= 10gA, LB=9.5gA, and 
LM=950gA 
762 T. Serrano-Gotarredona, B. Linares-Barranco, J. L. Huertas 
Fig. 7: Categorization of the input patterns for LA=3.2tA, L]=3.01.LA, LM=4001.LA, and 
different values of p 
References 
W. J. Adams and J. Ramfmez-Angulo. (1991 ) "Extended Transconductance Adjustment/Lineari sation 
Technique," Electronics Letters, vol. 27, No. 10, pp. 842-844, May 1991. 
G. A. Carpenter and S. Grossberg. (1987) "A Massively Parallel Architecture for a Self-Organizing 
Neural Pattern Recognition Machine," Computer Vision, Graphics, and Image Processing, vol. 37, 
pp. 54-115, 1987. 
R. Domfnguez-Castro, A. Roddguez-Vquez, F. Medeiro, and J. L. Huertas. (1992) "High 
Resolution CMOS Current Comparators," Proc. of the 1992 European Solid-State Circuits 
Conference (ESSCIRC'92), pp. 242-245, 1992. 
J. Lazzaro, R. Ryckebusch, M. A. Mahowald, and C. Mead. (1989) "Winner-Take-All Networks of 
O(n) Complexity," in Advances in Neural Information Processing Systems, vol. 1, D. S. Touretzky 
(Ed.), Los Altos, CA: Morgan Kaufmann, 1989, pp. 703-711. 
K. Loh, D. L. Hiser, W. J. Adams, and R. L. Geiger. (1989) "A Robust Digitally Programmable and 
Reconfigurable Monolithic Filter Structure," Proc. of the 1989 Int. Symp. on Circuits and Systems 
(ISCAS'89), Portland, Oregon, vol. 1, pp. 110-113, 1989. 
M. J. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers. (1989) "Matching Properties of MOS 
Transistors," IEEE Journal of Solid-State Circuits, vol. 24, No. 5, pp. 1433-1440, October 1989. 
T. Serrano-Gotarredona and B. Linares-Barranco. (1994a) "A Modified ART1 Algorithm more 
suitable for VLSI Implementations," submitted for publication (journal paper). 
T. Serrano-Gotarredona and B. Linares-Barranco. (1994b) "A Real-Time Clustering Microchip 
Neural Engine," submitted for publication (journal paper). 
