Probabilistic Interpretation of Population 
Codes 
Richard S. Zemel 
zemelu. arizona. edu 
Peter Dayan 
dayanai. mir. edu 
Abstract 
Alexandre Pouget 
alexsalk. edu 
We present a theoretical framework for population codes which 
generalizes naturally to the important case where the population 
provides information about a whole probability distribution over 
an underlying quantity rather than just a single value. We use 
the framework to analyze two existing models, and to suggest and 
evaluate a third model for encoding such probability distributions. 
1 Introduction 
Population codes, where information is represented in the activities of whole pop- 
ulations of units, are ubiquitous in the brain. There has been substantial work on 
how animals should and/or actually do extract information about the underlying 
encoded quantity. 5' 3,1, 9, 2 With the exception of Anderson,  this work has con- 
centrated on the case of extracting a single value for this quantity. We study ways 
of characterizing the joint activity of a population as coding a whole probability 
distribution over the underlying quantity. 
Two examples motivate this paper: place cells in the hippocampus of freely moving 
rats that fire when the animal is at a particular part of an environment, s and cells in 
area MT of monkeys firing to a random moving dot stimulus. 7 Treating the activity 
of such populations of cells as reporting a single value of their underlying variables 
is inadequate if there is (a) insufficient information to be sure (eg if a rat can be 
uncertain as to whether it is in place xA or xz then perhaps place cells for both 
locations should fire; or (b) if multiple values underlie the input, as in the whole 
distribution of moving random dots in the motion display. Our aim is to capture the 
computational power of representing a probability distribution over the underlying 
parameters. 6 
RSZ is at University of Arizona, Tucson, AZ 85721; PD is at MIT, Cambridge, MA 
02139; AP is at Georgetown University, Washington, DC 20007. This work was funded by 
McDonnell-Pew, NIH, AFOSR and startup funds from all three institutions. 
Probabilistic Interpretation of Population Codes 677 
In this paper, we provide a general statistical framework for population codes, use 
it to understand existing methods for coding probability distributions and also to 
generate a novel method. We evaluate the methods on some example tasks. 
2 Population Code Interpretations 
The starting point for almost all work on neural population codes is the neurophys- 
iological finding that many neurons respond to particular variable(s) underlying a 
stimulus according to a unimodal tuning function such as a Gaussian. This char- 
acterizes cells near the sensory periphery and also cells that report the results of 
more complex processing, including receiving information from groups of cells that 
themselves have these tuning properties (in MT, for instance). Following Zemel 
& Hinton's 3 analysis, we distinguish two spaces: the explicit space which consists 
of the activities r = {ri} of the cells in the population, and a (typically low di- 
mensional) implicit space which contains the underlying information A' that the 
population encodes in which they are tuned. All processing on the basis of the 
activities r has to be referred to the implicit space, but it itself plays no explicit 
role in determining activities. 
Figure i illustrates our framework. At the top is the measured activities of a popu- 
lation of cells. There are two key operations. Encoding: What is the relationship 
between the activities r of the cells and the underlying quantity in the world A' 
that is represented? Decoding: What information about the quantity A' can be 
extracted from the activities? Since neurons are generally noisy, it is often con- 
venient to characterize encoding (operations A and B) in a probabilistic way, by 
specifying 7 >[rlA' ]. The simplest models make a further assumption of conditional in- 
dependence of the different units given the underlying quantity >[rlPC] - Hi >[ri[,] 
although others characterize the degree of correlation between the units. If the en- 
coding model is true, then a Bayesian decoding model specifies that the information 
r carries about A' can be characterized precisely as: P[A'lr] or P[rlA']P[A'], where 
P[rt'] is the prior distribution about A' and the constant of proportionality is set 
so that fpc P[A'lr]dA' = 1. Note that starting with a deterministic quantity A' in 
the world, encoding in the firing rates r, and decoding it (operation C) results in 
a probability distribution over A'. This uncertainty arises from the stochasticity 
represented by P[rlA' ]. Given a loss function, we could then go on to extract a 
single value from this distribution (operation D). 
We attack the common assumption that A' is a single value of some variable x, eg 
the single position of a rat in an environment, or the single coherent direction of 
motion of a set of dots in a direction discrimination task. This does not capture 
the subtleties of certain experiments, such as those in which rats can be made to be 
uncertain about their position, or in which one direction of motion predominates yet 
there are several simultaneous motion directions. 7 Here, the natural characterization 
of A' is actually a whole probability distribution P[xlco] over the value of the variable 
x (perhaps plus extra information about the number of dots), where co represents 
all the available information. We can now cast two existing classes of proposals for 
population codes in terms of this framework. 
The Poisson Model 
Under the Poisson encoding model, the quantity A' encoded is indeed one particular 
value which we will call x, and the activities of the individual units are independent, 
678 R. S. Zemel, P Dayan and A. Pouget 
encode 
B 
Figure 1: Left: encoding maps 3:' from the world through tuning functions (A) into mean activ- 
ities (B), leading to Top: observed activities r. We assume complete knowledge of the variables 
governing systematic changes to the activities of the cells. Here 3:' is a single value x* in the space 
of underlying variables. Right: decoding extracts 7[3:']r] (C); a single value can be picked (D) 
from this distribution given a loss function. 
with the terms 7v[rilx] = e-/'(x)(fi(x)) TM/ri!. The activity ri could, for example, be 
the number of spikes the cell emits in a fixed time interval following the stimulus 
onset. A typical form for the tuning function/i(x) is Gaussian fi(x) oc e -(x-x)2/2a2 
about a preferred value xi for cell i. The Poisson decoding model is: 3'11'9'12 
log 7V[xlr] =/C-  fi(x)+  ri log fi(x) 
i 
(1) 
where/C is a constant with respect to x. 
Although simple, the Poisson model makes the the assumption criticized above, 
that A' is just a single value x. We argued for a characterization of the quantity A' 
in the world that the activities of the cells encode as now T[xlw]. We describe below 
a method of encoding that takes exactly this definition of A'. However, wouldn't 
7>[xl r] from Equation i be good enough? Not if fi(x) are Gaussian, since 
log [xlr]-']G l l"i-ri)("irixi) 2 
- x ' 
completing the square, implying that 7>[x[r] is Gaussian, and therefore inevitably 
unimodal. Worse, the width of this distribution goes down with i ri, making it, 
in most practical cases, a close approximation to a delta function. 
The KDE Model 
Anderson x, 2 set out to represent whole probability distributions over x rather than 
just single values. Activities r represent distribution 75r(x) through a linear com- 
bination of basis functions pi(x), ie 75r(x) = Y-i r'opi(x) where r are normalized 
such that 75r(x) is a probability distribution. The kernel functions Pi(x) are not 
Probabilistic Interpretation of Population Codes 679 
the tuning functions fi(x) of the cells that would commonly be measured in an 
experiment. They need have no neural instantiation; instead, they form part of the 
interpretive structure for the population code. If the Pi (x) are probability distribu- 
tions, and so are positive, then the range of spatial frequencies in 7)[xl w] that they 
can reproduce in 75r(x) is likely to be severely limited. 
In terms of our framework, the KDE model specifies the method of decoding, and 
makes encoding its corollary. Evaluating KDE requires some choice of encoding - 
representing 7)[xlco] by 7 r(x) through appropriate r. One way to encode is to use 
the Kullback-Leibler divergence as a measure of the discrepancy between 7 ) [xlco ] and 
Y-i ribi(x) and use the expectation-maximization (EM) algorithm to fit the {ri}, 
treating them as mixing proportions in a mixture model. 4 This relies on {Pi(x)} be- 
ing probability distributions themselves. The projection method  is a one-shot linear 
filtering based alternative using the/22 distance. ri are computed as a projection 
of 7>[xl] onto tuning functions fi(x) that are calculated from p/(x). 
ri = x 7[xlc]fi(x)dx fi(x)= ' AXPj(x) A u = fx bi(x)bj(x)dx 
J 
(2) 
fi(x) are likely to need regularizing,  particularly if the Pi (x) overlap substantially. 
3 The Extended Poisson Model 
The KDE model is likely to have difficulty capturing in 7  (x) probability distribu- 
tions 7)[x[co] that include high frequencies, such as delta functions. Conversely, the 
standard Poisson model decodes almost any pattern of activities r into something 
that rapidly approaches a delta function as the activities increase. Is there any 
middle ground? 
We extend the standard Poisson encoding model to allow the recorded activities r 
to depend on general 7)[x[co], having Poisson statistics with mean: 
(s) 
This equation is identical to that for the KDE model (Equation 2), except that 
variability is built into the Poisson statistics, and decoding is now required to be 
the Bayesian inverse of encoding. Note that since ri depends stochastically on 
P[xlco], the full Bayesian inverse will specify a distribution P[P[xlco]lr ] over possible 
distributions. We summarize this by an approximation to its most likely member-- 
we perform an approximate form of maximum likelihood, not in the value of x, but 
in distributions over x. We approximate P[xlco] as a piece-wise constant histogram 
which takes the value qbj in (xj, xj+], and fi(x) by a piece-wise constant histogram 
that take the values fij in (x/, xj+ ]. Generally, the maximum a posteriori estimate 
for {qbj} can be shown to be derived by maximizing: 
[ 
L({.}) = ri log j. jfij j. (-(5j+ (4) 
i ' ' 
where e is the variance of a smoothness prior. We use a form of EM to maximize 
the likelihood and adopt the crude 'approximation of averaging neighboring values 
680 R. S. Zemel, P. Dayan and A. Pouget 
Operation Extended Poisson KDE (Projection) KDE (EM) 
Encode 
A 0 = f. ,(x)i(x)dx 
Wcod f() to .  f'() = ,',,() f() = ,,() 
'(x)  = f '(x)L(x)dx  i ifo r = rd  ri 
Likelihood L = log P [{i}l{r,}] = , r, log  L = fx P[xlw] log '(x)dx 
[xll 
P(x) 
Table 1: A summary of the key operations with respect to the framework of the 
interpretation methods compared here. h D is a rounding operator to ensure integer 
firing rates, and bi(x) - Af(xi, a) are the kernel functions for the KDE method. 
of j on successive iterations. By comparison with the linear decoding of the KDE 
method, Equation 4 offers a non-linear way of combining a set of activities {ri} to 
give a probability distribution 7r(x) over the underlying variable x. The computa- 
tional complexities of Equation 4 are irrelevant, since decoding is only an implicit 
operation that the system need never actually perform. 
4 Comparing the Models 
We illustrate the various models by showing the faithfulness with which they can 
represent two bimodal distributions. We used a = 0.3 for the kernel functions 
(KDE) and the tuning functions (extended Poisson model) and used 50 units whose 
xi were spaced evenly in the range x -- [-10, 10]. Table i summarizes the three 
methods. 
Figure 2a shows the decoded version of a mixture of two broad Gaussians 
1/2Af[-2,1] + 1/2Af[2,1]. Figure 2b shows the same for a mixture of two nar- 
row Gaussians Af[-2, .2] + Af[2, .2]. All the models work well for representing 
the broad Gaussians; both forms of the KDE model have difficulty with the nar- 
row Gaussians. The EM version of KDE puts all its weight on the nearest kernel 
functions, and so is too broad; the projection version 'rings' in its attempt to rep- 
resent the narrow components of the distributions. The extended Poisson model 
reconstructs with greater fidelity. 
5 Discussion 
Informally, we have examined the consequences of the seemingly obvious step of 
saying that if a rat, for instance, is uncertain about whether it is at one of two places, 
then place cells representing both places could be activated. The complications 
Probabilistic Interpretation of Population Codes 681 
O.2 
-I0 -S 0 S IQ 
X 
0.6 
.o.2 
.io 
-I0 -5 0 
-tO -5 0 
-io 
5 I0 -I0 . 
o 5 
x 
Figure 2: a) (upper) All three methods provide a good fit to the bimodal Gaussian 
distribution when its variance is sufficiently large (r - 1.0). b) (lower) The KDE 
model has difficulty when r = 0.2. 
come because the structure of the interpretation changes - for instance, one can 
no longer think of maximum likelihood methods to extract a single value from the 
code directly. 
One main fruit of our resulting framework is a method for encoding and decoding 
probability distributions that is the natural extension of the (provably inadequate) 
standard Poisson model for encoding and decoding single values. Cells have Pois- 
son statistics about a mean determined by the integral of the whole probability 
distribution, weighted by the tuning function of the cell. We suggested a particular 
decoding model, based on an approximation to maximum likelihood decoding to a 
discretized version of the whole probability distribution, and showed that it recon- 
structs broad, narrow and multimodal distributions more accurately than either the 
standard Poisson model or the kernel density model. Stochasticity is built into our 
method, since the units are supposed to have Poisson statistics, and it is therefore 
also quite robust to noise. The decoding method is not biologically plausible, but 
provides a quantitative lower bound to the faithfulness with which a set of activities 
can code a distribution. 
Stages of processing subsequent to a population code might either extract a single 
value from it to control behavior, or integrate it with information represented in 
other population codes to form a combined population code. Both operations must 
be performed through standard neural operations such as taking non-linear weighted 
sums and possibly products of the activities. We are interested in how much in- 
formation is preserved by such operations, as measured against the non-biological 
682 R. S. Zemel, P Dayan and A. Pouget 
standard of our decoding method. Modeling extraction requires modeling the loss 
function - there is some empirical evidence about this from a motion experiment 
in which electrical stimulation of MT cells was pitted against input from a moving 
stimulus. 1 However, much works remains to be done. 
Integrating two or more population codes to generate the output in the form of 
another population code was stressed by Hinton, 6 who noted that it directly relates 
to the notion of generalized Hough transforms. We axe presently studying how a 
system can learn to perform this combination, using the EM-based decoder to gen- 
erate targets. One special concern for combination is how to understand noise. For 
instance, the visual system can be behaviorally extraordinarily sensitive - detecting 
just a handful of photons. However, the outputs of real cells at various stages in 
the system are apparently quite noisy, with Poisson statistics. If noise is added at 
every stage of processing and combination, then the final population code will not 
be very faithful to the input. There is much current research on the issue of the 
creation and elimination of noise in cortical synapses and neurons. 
A last issue that we have not treated here is certainty or magnitude. Hinton's 6 idea 
of using the sum total activity of a population to code the certainty in the existence 
of the quantity they represent is attractive, provided that there is some independent 
way of knowing what the scale is for this total. We have used this scaling idea in 
both the KDE and the extended Poisson models. In fact, we can go one stage 
further, and interpret greater activity still as representing information about the 
existence of multiple objects or multiple motions. However, this treatment seems 
less appropriate for the place cell system -- the rat is presumably always certain 
that it is somewhere. There it is plausible that the absolute level of activity could 
be coding something different, such as the familiarity of a location. 
An entire collection of cells is a terrible thing to waste on representing just a single 
value of some quantity. Representing a whole probability distribution, at least with 
some fidelity, is not more difficult, provided that the interpretation of the encoding 
and decoding are clear. We suggest some steps in this direction. 
References 
[1] Anderson, CH (1994). International Journal of Modern Physics C, 5, 135-137. 
[2] Anderson, CH & Van Essen, DC (1994). In Computational Intelligence Imitating Life, 213-222. 
New York: IEEE Press. 
[3] Baldi, P & Heiligenberg, W (1988). Biological Cybernetics, 59, 313-318. 
[4] Dempster, AP, Laird, NM & Rubin, DB (1997). Proceedings of the Royal Statistical Society, B 
39, 1-38. 
[5] Georgopoulos, AP, Schwartz, AB & Kettner, RE (1986). Science, 243, 1416-1419. 
[6] Hinton, GE (1992). Scientific American, 267(3) 105-109. 
[7] Newsome, WT, Britten, KH & Movshon, JA (1989). Nature, 341, 52-54. 
[8] O'Keefe, J & Dostrovsky, J (1971). Brain Research, 34, 171-175. 
[9] Salinas, E & Abbott, LF (1994). Journal of Computational Neuroscience, 1, 89-107. 
[10] Salzman, CD & Newsome, WT (1994). Science, 264, 231-237. 
[11] Seung, HS & Sompolinsky, H (1993). Proceedings of the National Academy of Sciences, USA, 
90, 10749-10753. 
[12] Snippe, HP (1996). Neural Computation, 8, 29-37. 
[13] Zemel, RS & Hinton, GE (1995). Neural Computation, 7, 549-564. 
PART V 
IMPLEMENTATION 
