A generative model for attractor dynamics 
Richard S. Zemel 
Department of Psychology 
University of Arizona 
Tucson, AZ 85721 
zemel@u. arizona. edu 
Michael C. Mozer 
Department of Computer Science 
University of Colorado 
Boulder, CO 80309-0430 
mozer@colorado. edu 
Abstract 
Attractor networks, which map an input space to a discrete out- 
put space, are useful for pattern completion. However, designing 
a net to have a given set of attractors is notoriously tricky; training 
procedures are CPU intensive and often produce spurious attrac- 
tors and ill-conditioned attractor basins. These difficulties occur 
because each connection in the network participates in the encod- 
ing of multiple attractors. We describe an alternative formulation 
of attractor networks in which the encoding of knowledge is local, 
not distributed. Although localist attractor networks have similar 
dynamics to their distributed counterparts, they are much easier 
to work with and interpret. We propose a statistical formulation of 
localist attractor net dynamics, which yields a convergence proof 
and a mathematical interpretation of model parameters. 
Attractor networks map an input space, usually continuous, to a sparse output 
space composed of a discrete set of alternatives. Attractor networks have a long 
history in neural network research. 
Attractor networks are often used for pattern completion, which involves filling in 
missing, noisy; or incorrect features in an input pattern. The initial state of the 
attractor net is typically determined by the input pattern. Over time, the state is 
drawn to one of a predefined set of statesrathe attractors. Attractor net dynam- 
ics can be described by a state trajectory (Figure la). An attractor net is generally 
implemented by a set of visible units whose activity represents the instantaneous 
state, and optionally, a set of hidden units that assist in the computation. Attractor 
dynamics arise from interactions among the units. In most formulations of attrac- 
tor nets, e,3 the dynamics can be characterized by gradient descent in an energy 
landscape, allowing one to partition the output space into attractor basins. Instead 
of homogeneous attractor basins, it is often desirable to sculpt basins that depend 
on the recent history of the network and the arrangement of attractors in the space. 
In psychological models of human cognition, for example, priming is fundamental: 
after the model visits an attractor, it should be faster to fall into the same attractor 
in the near future, i.e., the attractor basin should be broadened. l, 6 
Another property of attractor nets is key to explaining behavioral data in psycho- 
logical and neurobiological models: the gang effect, in which the strength of an 
attractor is influenced by other attractors in its neighborhood. Figure lb illustrates 
the gang effect: the proximity of the two rightmost attractors creates a deeper at- 
tractor basin, so that if the input starts at the origin it will get pulled to the right. 
Generaave Model for ,4ttractor Dynamics 81 
Figure 1: (a) A two-dimensional space can be carved into three regions (dashed 
lines) by an attractor net. The dynamics of the net cause an input pattern (the X) 
to be mapped to one of the attractors (the O's). The solid line shows the tempo- 
ral trajectory of the network state. (b) the actual energy landscape for a localist 
attractor net as a function of ,, when the input is fixed at the origin and there are 
three attractors, w = ((-1, 0), (1, 0), (1,-.4)), with a uniform prior. The shapes of 
attractor basins are influenced by the proximity of attractors to one another (the 
gang effect). The origin of the space (depicted by a point) is equidistant from the 
attractor on the left and the attractor on the upper right, yet the origid dearly lies 
in the basin of the right attractors. 
This effect is an emergent property of the distribution of attractors, and is the basis 
for interesting dynamics; it produces the mutuall reinforcing or inhibitory influ- 
Y 
ence of similar items in domains such as semantics, 9 memory, ' 2 and olfaction. 4 
Training an attractor net is notoriously tricky. Training procedures are CPU inten- 
sive and often produce spurious attractors and ill-conditioned attractor basins. 5.   
Indeed, we are aware of no existing procedure that can robustly translate an arbi- 
trary specification of an attractor landscape into a set of weights. These difficulties 
are due to the fact that each connection partialpates in the specification of multiple 
attractors; thus, knowledge in the net is distributed over connections. 
We describe an alternative attractor network model in which knowledge is local- 
ized, hence the name localist attractor network. The model has many virtues, includ- 
ing: a trivial procedure for wiring up the architechn given an attractor landscape; 
eliminating spurious attractors; achieving gang effects; providing a dear mathe- 
matical interpretation of the model parameters, which clarifies how the parameters 
control the qualitative behavior of the model (e.g., the magnitude of gang effects); 
and proofs of convergence and stability. 
A localist attractor net consists of a set of n state units and m attractor units. Pa- 
rameters associated with an attractor unit i encode the location of the attractor, 
denoted wi, and its "pull"or strength, denoted rri, which influence the shape of 
the attractor basin. Its activity at time t, qi (t), reflects the normalized distance from 
the attractor center to the current state, y(t), weighted by the attractor strength: 
qi(t) = ti#(y(t),wi,(t)) (1) 
y'j rd#(y(t), wj, r(t)) 
g(y,w,r) = exp(-ly-wl2/2r 2) (2) 
Thus, the attractors form a layer of normalized radial-basis-function units. 
The input to the net, , serves as the initial value of the state, and thereafter the 
state is pulled toward attractors in proportion to their activity. A straightforward 
82 R. S. Zemel and M. C. Mozer 
expression of this behavior is: 
y(t + 1) = a(t) + (1 - a(t)) y qi(t)wi. 
i 
(3) 
where a(1) = 1 on the first update and a(t) = 0 fort > 1. More generally, however, 
one might want to gradually reduce a over time, allowing for a persistent effect of 
the external input on the asymptotic state. The variables tr(t) and a(t) are not free 
parameters of the model, but can be derived from the formalism we present below. 
The localist attractor net is motivated by a generative model of the input based on 
the attractor distribution, and the network dynamics corresponds to a search for 
a maximum likelihood interpretation of the observation. In the following section, 
we derive this result, and then present simulation studies of the architecture. 
1 A MAXIMUM LIKELIHOOD FORMULATION 
The starting point for the statistical formulation of a localist attractor network is 
a mixture of Gaussians model. A standard mixture of Gaussians consists of m 
Gaussian density functions in n dimensions. Each Gaussian is parameterized by 
a mean, a covariance matrix, and a mixture coefficient. The mixture model is gen- 
erative, i.e., it is considered to have produced a set of observations. Each obser- 
vation is generated by selecting a Gaussian based on the mixture coefficients and 
then stochastically selecting a point from the corresponding density function. The 
model parameters are adjusted to maximize the likelihood of a set of observations. 
The Expectation-Maximization (EM) algorithm provides an efficient procedure for 
estimating the parameters.The Expectation step calculates the posterior probabil- 
ity qi of each Gaussian for each observation, and the Maximization step calculates 
the new parameters based on the previous values and the set of qi. 
The mixture of Gaussians model can provide an interpretation for a localist attrac- 
tor network, in an unorthodox way. Each Gaussian corresponds to an attractor, and 
an observation corresponds to the state. Now, however, instead of fixing the ob- 
servation and adjusting the Gaussians, we fix the Gaussians and adjust the obser- 
vation. If there is a single observation, and a = 0 and all Gaussians have uniform 
spread tr, then Equation 1 corresponds to the Expectation step, and Equation 3 to 
the Maximization step in this unusual mixture model. 
Unfortunately, this simple characterization of the localist attractor network does 
not produce the desired behavior. Many situations produce partial solutions, in 
which the observation does not end up at an attractor. For example, if two unidi- 
mensional Gaussians overlap significantly, the most likely value for the observa- 
tion is midway between them rather than at the mean of either Gaussian. 
We therefore extend this mixture-of-Gaussians formulation to better characterize 
the localist attractor network. As in the simple model, each of the m attractors is 
a Gaussian generator, the mean of which is a location in the n-dimensional state 
space. The input to the net, , is considered to have been generated by a stochastic 
selection of one of the attractors, followed by the addition of zero-mean Gaussian 
noise with variance specified by the attractor. Given a particular observation , 
the an attractor's posterior probability is the normalized Gaussian probability of 
, weighted by its mixing proportion. This posterior distribution for the attractors 
corresponds to a distribution in state space that is a weighted sum of Gaussians. 
We then consider the attractor network as encoding this distribution over states 
implied by the attractor posterior probabilities. At any one time, however, the 
attractor network can only represent a single position in state space, rather than 
,4 Generative Model for ,4ttractor Dynamics 83 
the entire distribution over states. This restriction is appropriate when the state is 
an n-dimensional point represented by the pattern of activity over n state units. 
To accommodate this restriction, we change the standard mixture of Gaussians 
generative model by interjecting an intermediate level between the attractors and 
the observation. The first generative level consists of the discrete attractors, the 
second is the state space, and the third is the observation. Each observation is 
considered to have been generated by moving down this hierarchy: 
1. select an attractor x = i from the set of attractors 
2. select a state (i.e., a pattern of activity across the state units) based on the 
preferred location of that attractor: y = wi + A/'y 
3. select an observation z = yG + 
The observation z produced by a particular state y depends on the generative 
weight matrix G. In the networks we consider here, the observation and state 
spaces are identical, so G is the identity matrix, but the formulation allows for z 
to lie in some other space. A/'y and ;V'z describe the zero-mean, spherical Gaussian 
noise introduced at the two levels, with deviations try and try, respectively. 
In comparison with the 2-level Gaussian mixture model described above, this 3- 
level model is more complicated but more standard: the observation  is pre- 
served as stable data, and rather than the model manipulating the data here it 
can be viewed as iteratively manipulating an internal representation that fits the 
observation and attractor structure. The attractor dynamics correspond to an iter- 
ative search through state space to find the most likely single state that: (a) was 
generated by the mixture of Gaussian attractors, and (b) in turn generated the ob- 
servation. 
Under this model, one could fit an observation  by finding the posterior distribu- 
tion over the hidden states (X and Y) given the observation: 
p(X = i, Y = ylZ = ) = P(1Y, i)p(Y, i) = p(ly)trip(yli) (4) 
P() fy P([Y) '4 'ip(y[i)dy 
where the conditional distributions are Gaussian: p(Y = yl X = i) - (y[wi, try) 
and p([Y - y) = 6(1Y, trz). Evaluating the distribution in Equation 4 is 
tractable, because the palition function is a sum of a set of Gaussian integrals. 
Due to the restriction that the network cannot represent the entire distribution, we 
do not directly evaluate this distribution but instead adopt a mean-field approach, 
in which we approximate the posterior by another distribution Q(X, Y[). Based 
on this approximation, the network dynamics can be seen as minimizing an objec- 
tive function that describes an upper bound on the negative log probability of the 
observation given the model and mean-field parameters. 
In this approach, one can choose any form of Q to estimate the posterior distri- 
bution, but a better estimate allows the network to approach a maximum like- 
lihood solution? We select a simple posterior: (X, Y) = qid(Y - '), where 
qi -- (X - i) iS the responsibility assigned to attractor i, and : is the estimate of 
the state that accounts for the observation. The delta function over Y is motivated 
by the restriction that the explanation of an input consists of a single state. 
Given this posterior distribution, the objective for the network is to minimize the 
free energy F, described here for a particular input example : 
Q(X=i,Y=y) dy 
F(q,$,[) = E Q(X=i'Y=y)lnp(,X=i,Y=y ) 
- q, ln - lnp([$,)- q, lnp(S,]i) 
7ri 
i i 
84 R. S. Zemel and M. C. Mozer 
where rci is the prior probability (mixture coefficient) associated with attractor i. 
These priors are parameters of the generative model, as are rv, rz, and w. 
F(q, Srl)=Eqilnq-d-i + l-Srl'+v qilSr-wi['+nln(rvrz ) (5) 
i i . 
Given an observation, a good set of mean-field parameters can be determined by 
alternating between updating the generative parameters and the mean-field pa- 
rameters. The update procedure is guaranteed to converge to a m'mimum of F, as 
long as the updates are done asynchronously and each update minimizes F with 
respect to a parameter. 8 The update equations for the mean-field parameters are: 
2 
- (6) 
qi - j wjp(lj) (7) 
In our simulations, we hold most of the parameters of the generative model con- 
stant, such as the priors r, the weights w, and the generative noise in the observa- 
tion, az. The only aspect that changes is the generative noise in the state, r v, which 
is a single parameter shared by all attractors: 
2_ 1 
% - 2 '" qil: - wi (8) 
i 
The updates of Equations 6-8 can be in any order. We typically initialize the state 
, to  at time 0, and then cyclically update the qi, O'y, then ,. 
This generarive model avoids the problem of spurious attractors described above 
for the standard Gaussian mixture model. Intuition into how the model avoids 
spurious attractors can be gained by inspecting the update equations. These equa- 
tions effectively tie together two processes: moving , closer to some wi than the 
others, and increasing the corresponding responsibility qi. As these two processes 
evolve together, they act to descrease the noise ry, which accentuates the pull of 
the attractor. Thus stable points that do not correspond to the attractors are rare. 
2 SIMULATION STUDIES 
To create an attractor net, we specify the parameters (rri, wi) associated with the 
attractors based on the desired structure of the energy landscape (e.g., Figure lb). 
The only remaining free parameter, r, plays an important role in determining 
how responsive the system is to the external input. 
We have conducted several simulation studies to explore properties of localist at- 
tractor networks. Systematic investigations with a 200-dimensional state space and 
200 attractors, randomly placed at comers of the 200-D hypercube, have demon- 
strated that spurious responses are exceedingly rare unless more than 85% of an 
input's features are distorted (Figure 2), and that manipulating parameters such 
as noise and prior probabilities has the predicted effects. We have also conducted 
studies of localist attractor networks in the domain of visual images of faces. These 
simulations have shown that gang effects arise when there is structure among the 
attractors. For example, when the attractor set consists of a single view of several 
different faces, and multiple views of one face, then an input that is a morphed 
face--a linear combination of one of the single-view faces and one view of the gang 
face---will end up in the gang attractor even when the initial weighting assigned 
to the gang face was less than 40%. 
A Generative Model for Attractor Dynamics 85 
o 
/ I '"7 
85 90 95 100 
% Missing features 
95 100 
% Missing features 
Figure 2: The input must be severely corrupted before the net makes spurious (final 
state not at an attractor) or adulterous (final state at a neighbor of the generating 
attractor) responses. (a) The percentage of spurious responses increases as r, is 
increased. (b) The percentage of adulterous responses increases as rz is decreased. 
To test the architecture on a larger, structured problem, we modeled the domain 
of three-letter English words. The idea is to use the attractor network as a content 
addressable memory which might, for example, be queried to retrieve a word with 
P in the third position and any letter but A in the second position, a word such as 
HIP. The attractors consist of the 423 three-letter English words, from ACE to ZOO. 
The state space of the attractor network has one dimension for each of the 26 letters 
of the English alphabet in each of the 3 positions, for a total of 78 dimensions. We 
can refer to a given dimension by the letter and position it encodes, e.g., Pa denotes 
the dimension corresponding to the letter P in the third position of the word. The 
attractors are at the comers of a [-1, +1] TM hypercube. The attractor for a word such 
as HIP is located at the state having value -1 on all dimensions except for Hi, 12, 
and Pa which have value +1. The external input specifies a state that constrains 
the solution. For example, one might specify "P in the third position"by setting 
the external input to +1 on dimension Pa and to -1 on dimensions aa, for all letters 
a other than E One might specify the absence of a constraint in a particular letter 
position, p, by setting the external input to 0 on dimensions ap, for all letters a. 
The network's task is to settle on a state corresponding to one of the words, 
given soft constraints on the letters. The interactive-activation model of word 
perception 7 performs a similar computation, and our implementation exhibits the 
key qualitative properties of their model. If the external input specifies a word, 
of course the attractor net will select that word. Interesting queries are those in 
which the external input underconstrains or overconstrains the solution. We il- 
lustrate with one example of the network's behavior, in which the external input 
specifies D, E2, and Ga. Because DEG is a nonword, no attractor exists for that 
state. The closest attractors share two letters with DEG, e.g., PEG, BEG, DEN, and 
DOG. Figure 3 shows the effect of gangs on the selection of a response, BEG. 
3 CONCLUSION 
Localist attractor networks offer an attractive altemative to standard attractor net- 
works, in that their dynamics are easy to control and adapt. We described a sta- 
tistical formulation of a type of localist attractor, and showed that it provides a 
Lyapunov function for the system as well as a mathematical interpretation for the 
network parameters. The dynamics of this system are derived not from intuitive 
arguments but from this formal mathematical model. Simulation studies show 
that the architecture achieves gang effects, and spurious attractors are rare. This 
approach is ineffident if the attractors have compositional structure, but for many 
applications of pattern recognition or associative memory, the number of items 
86 R. S. Zemel and M. C. Mozer 
Iteration 1 
.:-'{ &:!: -i::;-:-: ,;j!i '..:a.,-: -:E? , x:.-f, L},-Z} .i;{; I1= .!E" i'...Z-i'. ::{.; i : '- ':-.. '.,  ": -"-:' ;' .;. " 
Idon 2 
Ion 3 
8 
Idon 4 
Idon 5 
8 
Figure 3: Simulation of the 3-letter word attractor network, queried with DEG. 
Each frame shows the relative activity of attractor units at various points in pro- 
cessing. Activity in each frame is normalized such that the most active unit is 
printed in black ink; the lighter the ink color, the less active the unit. Only attractor 
units sharing at least one letter with DEG are shown. The selection, BEG, is a prod- 
uct of a gang effect. The gangs in this example are formed by words sharing two 
letters. The most common word beginnings are PE- (7 instances) and DI- (6); the 
most common word endings are -AG (10) and -ET (10); the most common first-last 
pairings are B-G (5) and D-G (3). One of these gangs supports B1, two support E2, 
and three support Ga, hence BEG is selected. 
being stored is small. The approach is especially useful in cases where attractor 
locations are known, and the key focus of the network is the mutual influence of 
the attractors, as in many cognitive modelling studies. 
References 
[1] Becker, S., Moscovitch, IvL, Behrmann, M., & Joordens, S. (1997).Long-term semantic primin A computational 
account and empirical evidence. Journal of Experimental Psychology: Learning, Memory, & Cognition, 23(5), 1059- 
1082. 
[2] Golden, IL (1988). Probabilistic characterization of neural model computations. In D. Z. Anderson (Ed.), Neural 
Infvnnation Processing Systems (pp. 310-316).American Institute of Physics. 
[3] Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. 
Proc__!_ings of the National Academy of Sciences, 79, 2554-2558. 
[4] Kay, L.IvL, Lancaslea; L.R., & Freeman W.J. (1996). R_eafference and attractors in the olfactory system during 
odor recognition. Int J Neural Systems, 7(4), 489-95. 
[5] Mathis, D. (1997). A computational theory of consciousness in cognition. Unpublished Doctoral Dissertation. Boul- 
der, CO: Department of Computer Science, Univemity of Colorado. 
[6] Mathis, D., & Mozer, M. C. (1996). Conscious and unconscious perception: A computational theory. In G. Cot- 
trell (Ed.), Procer'dings of the Eighteenth Annual Conference of the Cognitive Sc/ence Soc/ety (pp. 324-328). Erlbaum. 
[7] McClelland, J. L. & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter percep- 
tion: Part I. An account of basic findings. PsychologicalReview, 88, 375-407. 
[8] Neal, IL IvL & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other 
variants. In IvL I. Jordan (Ed.), Lmrning in GraphicaIModels. Kluwer Academic Press. 
[9] McRae, K., de Sa, V. R., & Seldenberg, IvL S. (1997) On the nature and scope of featural repreooentations of word 
meaning. Journal of Experimental Psychology: General, 126(2), 99-130. 
[10] Redish, A.D. & Touretzky, D. S. (1998). The role of the hippocampus in solving the Morris water maze. Neural 
Computation, 10(1), 73-111. 
[11] Rodrigues, N. C., & Fontanari, J. E (1997). Multivalley structure of attractor neural networks. Journal of Physics 
A (Mathematicaland General), 30, 7945-7951. 
[12] Samsonovich, A. & McNaughton, B. L. (1997) Pathintegration and cognitive mapping in a continuous attractor 
neural network model. Journal of Neuroscience, 17(15),5900-5920. 
[13] Saul, L.I<., Jaakkola, T., & Jordan, IvLI. (1996). Mean field theory for sigmoid belief networks. Journal of AI 
Research, 4, 61-76. 
PART II 
NEUROSCIENCE 
