Divisive Normalization, 
Networks and Ideal 
Line Attractor 
Observers 
Sophie Deneve  Alexandre Pouget , and P.E. Latham 2 
1Georgetown Institute for Computational and Cognitive Sciences, 
Georgetown University, Washington, DC 20007-2197 
2Dpt of Neurobiology, UCLA, Los Angeles, CA 90095-1763, U.S.A. 
Abstract 
Gain control by divisive inhibition, a.k.a. divisive normalization, 
has been proposed to be a general mechanism throughout the vi- 
sual cortex. We explore in this study the statistical properties 
of this normalization in the presence of noise. Using simulations, 
we show that divisive normalization is a close approximation to a 
maximum likelihood estimator, which, in the context of population 
coding, is the same as an ideal observer. We also demonstrate ana- 
lytically that this is a general property of a large class of nonlinear 
recurrent networks with line attractors. Our work suggests that 
divisive normalization plays a critical role in noise filtering, and 
that every cortical layer may be an ideal observer of the activity in 
the preceding layer. 
Information processing in the cortex is often formalized as a sequence of a linear 
stages followed by a nonlinearity. In the visual cortex, the nonlinearity is best de- 
scribed by squaring combined with a divisive pooling of local activities. The divisive 
part of the nonlinearity has been extensively studied by Heeger and colleagues [1], 
and several authors have explored the role of this normalization in the computation 
of high order visual features such as orientation of edges or first and second order 
motion[4]. We show in this paper that divisive normalization can also play a role in 
noise filtering. More specifically, we demonstrate through simulations that networks 
implementing this normalization come close to performing maximum likelihood es- 
timation. We then demonstrate analytically that the ability to perform maximum 
likelihood estimation, and thus efficiently extract information from a population of 
noisy neurons, is a property exhibited by a large class of networks. 
Maximum likelihood estimation is a framework commonly used in the theory of 
ideal observers. A recent example comes from the work of Itti et al., 1998, who have 
shown that it is possible to account for the behavior of human subjects in simple 
discrimination tasks. Their model comprised two distinct stages: 1) a network 
Divisive Normalization, Line Attractor Networks and Ideal Observers 105 
which models the noisy response of neurons with tuning curves to orientation and 
spatial frequency combined with divisive normalization, and 2) an ideal observer (a 
maximum likelihood estimator) to read out the population activity of the network. 
Our work suggests that there is no need to distinguish between these two stages, 
since, as we will show, divisive normalization comes close to providing a maximum 
likelihood estimation. More generally, we propose that there may not be any part 
of the cortex that acts as an ideal observer for patterns of activity in sensory areas 
but, instead, that each cortical layer acts as an ideal observer of the activity in the 
preceding layer. 
I The network 
Our network is a simplified model of a cortical hypercolumn for spatial frequency 
and orientation. It consists of a two dimensional array of units in which each unit 
is indexed by its preferred orientation, Oi, and spatial frequency, j. 
1.1 LGN model 
Units in the cortical layer are assumed to receive direct inputs from the lateral 
geniculate nucleus (LGN). Here we do not model explicitly the LGN, but focus 
instead on the pooled LGN input onto each cortical unit. The input to each unit 
is denoted aij. We distinguish between the mean pooled LGN input, fij(O,,k), as 
a function of orientation, 0, and spatial frequency, X, and the noise distribution 
around this mean, P(aijlO, ). 
In response to a stimulus of orientation, 0, spatial frequency, X, and contrast, C, 
the mean LGN input onto unit ij is a circular Gaussian with a small amount of 
spontaneous activity, t,: 
fij(O,,k)-- KCexp (cos(A- Aj)- 1 cos(O-Oi)- 1) 
+ + 
(1) 
where K is a constant. Note that spatial frequency is treated as a periodic variable; 
this was done for convenience only and should have negligible effects on our results 
as long as we keep  far from 27rn, n an integer. 
On any given trial the LGN input to cortical unit i j, aij, is sampled from a Gaussian 
noise distribution with variance aj' 
p(aijlO,,k) = 1 ( [aij -- fij(O"k)]2 ) 
In our simulations, the variance of the noise was either kept fixed (a/2j = a 2) or set 
to the mean activity (a - fij(O, X)). The latter is more consistent with the noise 
that has been measureff experimentally in the cortex. We show in figure 1-A an 
example of a noisy LGN pattern of activity. 
1.2 Cortical Model: Divisive Normalization 
Activities in the cortical layer are updated over time according to: 
106 S. Deneve, A. Pouget and P E. Latham 
A. CORTEX 
LGN t 0 
e o 
o. 0.2 o.3 0.4 o o. o.? o. o.s 
Contrast 
Figure 1: A- LGN input (bottom) and stable hill in the cortical network after 
relaxation (top). The position of the stable hill can be used to estimate orientation 
() and spatial frequency (). B- Inverse of the variance of the network estimate for 
orientation using Gaussian noise with variance equal to the mean as a function of 
contrast and number of iterations (0, dashed; 1, diamond; 2, circle; and 3, square). 
The continuous curve corresponds to the theoretical upper bound on the inverse 
of the variance (i.e. an ideal observer). C- Gain curve for contrast for the cortical 
units after 1, 2 and 3 iterations. 
+ 1) 2 (3) 
uij(t + 1): y Wij,klOkl(t), Oij(t + 1) = $ + I Ekt Ul(t + 1) 2' 
kl 
where {Wij,kl} are the filtering weights, Oij(t) is the activity of unit ij at time t, 
S is a constant, and/ is what we call the divisive inhibition weight. The filtering 
weights implement a two dimensional Gaussian filter: 
Wij,k! = Wi-k,j-I -- Kwexp (cos[2z-(i- k)/P]- 1 cos[2z-(j- l)/P]- 1) 
2 + or2 
O'wO w X 
(4) 
where Kw is a constant, awO and ax control the width of the filtering weights, and 
there are p2 units. 
On each iteration the activity is filtered by the weights, squared, and then nor- 
malized by the total local activity. Divisive normalization per se only involves the 
squaring and division by local activity. We have added the filtering weights to ob- 
tain a local pooling of activity between cells with similar preferred orientations and 
spatial frequencies. This pooling can easily be implemented with cortical lateral 
connections and it is reasonable to think that such a pooling takes place in the 
cortex. 
Divisive Normalization, Line Attractor Networks and Ideal Observers 107 
2 Simulation Results 
Our simulations consist of iterating equation 3 with initial conditions determined by 
the presentation orientation and spatial frequency. The initial conditions are chosen 
as follows: For a given presentation angle, 00, and spatial frequency, 0, determine 
the mean cortical activity, fij(0o,o), via equation 1. Then generate the actual 
cortical activity, {aij }, by sampling from the distribution given in equation 2. This 
serves as our set of initial conditions: oij (t = O) = aij. 
Iterating equation 3 with the above initial conditions, we found that for very low 
contrast the activity of all cortical units decayed to zero. Above some contrast 
threshold, however, the activities converged to a smooth stable hill (see figure 1-A 
for an example with parameters awO = ax = a0 = ax = 1/x/, K = 74, C = 1, 
/: 0.01). The width of the hill is controlled by the width of the filtering weights. 
Its peak, on the other hand, depends on the orientation and spatial frequency of the 
LGN input, 00 and 0. The peak can thus be used to estimate these quantities (see 
figure l-A). To compute the position of the final hill, we used a population vector 
estimator [3] although any unbiased estimator would work as well. In all cases we 
looked at, the network produced an unbiased estimate of 00 and 0. 
In our simulations we adjusted ao and awX so that the stable hill had the same 
profile as the mean LGN input (equation 1). As a result, the tuning curves of the 
cortical units match the tuning curves specified by the pooled LGN input. For this 
case, we found that the estimate obtained from the network has a variance close 
to the theoretical minimum, known as the Cramr-Rao bound [3]. For Gaussian 
noise of fixed variance, the variance of the estimate was 16.6% above this bound, 
compared to 3833% for the population vector applied directly to the LGN input. 
In a 1D network (orientation alone), these numbers go to 12.9% for the network 
versus 613% for population vector. For Gaussian noise with variance proportional 
to the mean, the network was 8.8% above the bound, compared to 722% for the 
population vector applied directly to the input. These numbers are respectively 9% 
and 108% for the 1-D network. The network is therefore a close approximation to 
a maximum likelihood estimator, i.e., it is close to being an ideal observer of the 
LGN activity with respect to orientation and spatial frequency. 
As long as the contrast, C, was superthreshold, large variations in contrast did not 
affect our results (figure l-B). However, the tuning of the network units to contrast 
after reaching the stable state was found to follow a step function whereas, for real 
neurons, the curves are better described by a sigmoid [2]. Improved agreement 
with experiment was achieved by taking only 2-3 iterations, at which point the 
performance of the network is close to optimal (figure l-B) and the tuning curves to 
contrast are more realistic and closer to sigmoids (figure 1-C). Therefore, reaching 
a stable state is not required for optimal performance, and in fact leads to contrast 
tuning curves that are inconsistent with experiment. 
3 Mathematical Analysis 
We first prove that line attractor networks with sufficiently small noise are close 
approximations to a maximum likelihood estimator. We then show how this result 
applies to our simulations with divisive normalization. 
108 S. Deneve, A. Pouget and P. E. Latham 
3.1 General Case: Line Attractor Networks 
Let o, be the activity vector (denoted by bold type) at discrete time, n, for a set 
of P interconnected units. We consider a one dimensional network, i.e., only one 
feature is encoded; generalization to multidimensional networks is straightforward. 
A generic mapping for this network may be written 
On+l -'- a(on) (5) 
where H is a nonlinear function. We assume that this mapping admits a line 
attractor, which we denote G(O), for which G(O) = H(G(O)) where O is a continuous 
variable. 1 Let the initial state of the network be a function of the presentation 
parameter, O0, plus noise, 
o0 = F(00) + N 
(6) 
where F(00) is the function used to generate the data (in our simulations this 
would correspond to the mean LGN input, equation 1). Iterating the mapping, 
equation 5, leads eventually to a point on the line attractor. Consequently, as 
n -> c, o, -> G(0). The parameter 0 provides an estimate of 00. 
To determine how well the network does we need to find (50 _--  - 0o as a function 
of the noise, N, then average over the noise to compute the mean and variance of 
(50. Because the mapping, equation 5, is nonlinear, this cannot be done exactly. For 
small noise, however, we can take a perturbative approach and expand around a 
point on the attractor. For line attractors there is no general method for choosing 
which point on the attractor to expand around. Our approach will be to expand 
around an arbitrary point, G(0), and choose 0 by requiring that the quadratic terms 
be finite. Keeping terms up to quadratic order, equation 6 may be written 
o, - G(0) +(5o,. (7) 
I n-1 
(5o, = J'-(500 +  E (J'' (5Oo)  H". (J'. (5Oo), (8) 
where J(0) -- [0((0)H(G(0))] r is the Jacobian (the subscript T means transpose), 
H" is the Hessian of H evaluated at G(0) and a "." represents the standard dot 
product. 
Because the mapping, equation 5, admits a line attractor, J has one eigenvalue 
equal to 1 and all others less than 1. Denote the eigenvector with eigenvalue 1 as 
v and its adjoint vt: J  v -- v and jT. v t = v t ' It is not hard to show that v = 
c0G(9), up to a multiplicative constant. Since J has an eigenvalue equal to 1, to 
avoid the quadratic term in Eq. 8 approaching infinity as n -  we require that 
lim J  (500 = 0. (9) 
The line attractor is, in fact, an idealization; for P units the attractors associated with 
equation 5 consists of P isolated points. However, for P large, the attractors are spaced 
closely enough that they may be considered a line. 
Divisive Normalization, Line Attractor Networks and Ideal Observers 109 
This equations has an important consequence: it implies that, to linear order, 
lim,_ 5o = 0 (see equation 8), which in turn implies that o = G(0) which, 
finally, implies that 0 - . Consequently we can find the network estimator of 00, 
, by computing 0. We now turn to that task. 
It is straightforward to show that J = vv t. Combining this expression for J with 
equation 9, using equation 7 to express 500 in terms of o0 and G(0), and, finally 
using equation 6 to express o0 in terms of the initial mean activity, F(00), and the 
noise, IN, we find that 
v* (0). [F(0o) - G(0) + IN] = 0. 
Using 0o: 0 - 60 and expanding F(0o) to first order in 60 then yields 
(10) 
v i (0). [N + F(0) - G(0)] 
60 = (11) 
vt(0)  F,(0) 
As long as v* is orthogonal to F(0) - G(0), {60) = 0 and the estimator is unbiased. 
This must be checked on a case by case basis, but for the circularly symmetric 
networks we considered orthogonality is satisfied. 
We can now calculate the variance of the network estimate, {60} 2. Assuming v*  
[F(0) - G(0)]: 0, equation 11 implies that 
vt. R. vt 
(50)2 = [vt. F'] 2 ' (12) 
where a prime denotes a derivative with respect to 0 and R is the covariance matrix 
of the noise, R = (ININ). The network is equivalent to maximum likelihood when 
this variance is equal to the Cramr-Rao bound [3], (50) R. If the noise, IN, is 
Gaussian with a covariance matrix independent of 0, this bound is equal to: 
1 
<50) R -- F t. R_i. Ft. (13) 
For independent Gaussian noise of fixed variance, a 2, and zero covariance, the 
variance of the network estimate, equation 12, becomes a2/(IF'12 cos 2/) where/ 
is the angle between v t and F'. The Cramr-Rao bound, on the other hand, is 
equal to a2/IF'12. These expressions differ only by cos 2/, which is I if F c vt. In 
addition, it is close to I for networks that have identical input and output tunin 
curves, F(0) = G(0), and the Jacobian, J, is nearly symmetric, so that v m v' 
(recall that v - G). If these last two conditions are satisfied, the network comes 
close to being a maximum likelihood estimator. 
3.2 Application to Divisive Normalization 
Divisive normalization is a particular example of the general case considered above. 
For simplicity, in our simulations we chose the input and output tuning curves to 
be equal (F = G in the above notation), which lead to a value of 0.87 for cos2/ 
(evaluated numerically). This predicted a variance 15% above the Cramr-Rao 
110 S. Deneve, A. Pouget and P E. Latham 
bound for independent Gaussian noise with fixed variance, consistent with the 16570 
we obtained in our simulations. The network also handles fairly well other noise 
distributions, such as Gaussian noise with variance proportional to the mean, as 
illustrated by our simulations. 
4 Conclusions 
We have recently shown that a subclass of line attractor networks can be used as 
maximum likelihood estimators[3]. This paper extend this conclusion to a much 
wider class of networks, namely, any network that admits a line (or, by straightfor- 
ward extension of the above analysis, a higher dimensional) attractor. This is true 
in particular for networks using divisive normalization, a normalization which is 
thought to match quite closely the nonlinearity found in the primary visual cortex 
and MT. 
Although our analysis relies on the existence of an attractor, this is not a require- 
ment for obtaining near optimal noise filtering. As we have seen, 2-3 iterations 
are enough to achieve asymptotic performance (except at contrasts barely above 
threshold). What matters most is that our network implement a sequence of low 
pass filtering to filter out the noise, followed by a square nonlinearity to compensate 
for the widening of the tuning curve due to the low pass filter, and a normalization 
to weaken contrast dependence. It is likely that this process would still clean up 
noise efficiently in the first 2-3 iterations even if activity decayed to zero eventually, 
that is to say, even if the hills of activity were not stable states. This would allow us 
to apply our approach to other types of networks, including those lacking circular 
symmetry and networks with continuously clamped inputs. 
To conclude, we propose that each cortical layer may read out the activity in the 
preceding layer in an optimal way thanks to the nonlinear pooling properties of 
divisive normalization, and, as a result, may behave like an ideal observer. It is 
therefore possible that the ability to read out neuronal codes in the sensory cortices 
in an optimal way may not be confined to a few areas like the parietal or frontal 
cortex, but may instead be a general property of every cortical layer. 
References 
[1] D. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuro- 
science, 9:181-197, 1992. 
[2] L. Itti, C. Koch, and J. Braun. A quantitative model for human spatial vi- 
sion threshold on the basis of non-linear interactions among spatial filters. In 
R. Lippman, J. Moody, and D. Touretzky, editors, Advances in Neural Infor- 
mation Processing Systems, volume 11. Morgan-Kaufmann, San Mateo, 1998. 
[3] A. Pouget, K. Zhang, S. Deneve, and P. Latham. Statistically efficient estimation 
using population coding. Neural Computation, 10:373-401, 1998. 
[4] E. Simoncelli and D. Heeger. A model of neuronal responses in visual area MT. 
Vision Research, 38(5):743-761, 1998. 
