Edges are the 'Independent Components' of 
Natural Scenes. 
Anthony J. Bell and Terrence J. Sejnowski 
Computational Neurobiology Laboratory 
The Salk Institute 
10010 N. Torrey Pines Road 
La Jolla, California 92037 
tony@salk.edu, terry@salk.edu 
Abstract 
Field (1994) has suggested that neurons with line and edge selectivities 
found in primary visual cortex of cats and monkeys form a sparse, dis- 
tributed representation of natural scenes, and Barlow (1989) has reasoned 
that such responses should emerge from an unsupervised learning algorithm 
that attempts to find a factorial code of independent visual features. We 
show here that non-linear 'infomax', when applied to an ensemble of nat- 
ural scenes, produces sets of visual filters that are localised and oriented. 
Some of these filters are Gabor-like and resemble those produced by the 
sparseness-maximisation network of Olshausen & Field (1996). In addition, 
the outputs of these filters are as independent as possible, since the info- 
max network is able to perform Independent Components Analysis (ICA). 
We compare the resulting ICA filters and their associated basis functions, 
with other decorrelating filters produced by Principal Components Analysis 
(PCA) and zero-phase whitening filters (ZCA). The ICA filters have more 
sparsely distributed (kurtotic) outputs on natural scenes. They also resem- 
ble the receptive fields of simple cells in visual cortex, which suggests that 
these neurons form an information-theoretic co-ordinate system for images. 
I Introduction. 
Both the classic experiments of Hubel & Wiesel [8] on neurons in visual cortex, and sev- 
eral decades of theorising about feature detection in vision, have left open the question 
most succinctly phrased by Barlow "Why do we have edge detectors?" That is: are there 
any coding principles which would predict the formation of localised, oriented receptive 
832 A. J. Bell and T. J. Sejnowski 
fields? Barlow's answer is that edges are suspicious coincidences in an image. Formalised 
information-theoretically, this means that our visual cortical feature detectors might be the 
end result of a redundancy reduction process [4, 2], in which the activation of each feature 
detector is supposed to be as statistically independent from the others as possible. Such a 
'factorial code' potentially involves dependencies of all orders, but most studies [9, 10, 2] 
(and many others) have used only the second-order statistics required for decorrelating the 
outputs of a set of feature detectors. Yet there are multiple decorrelating solutions, includ- 
ing the 'global' unphysiological Fourier filters produced by PCA, so further constraints are 
required. 
Field [7] has argued for the importance of sparse, or 'Minimum Entropy', coding [4], in which 
each feature detector is activated as rarely as possible. Olshausen & Field demonstrated [12] 
that such a sparseness criterion could be used to self-organise localised, oriented receptive 
fields. 
Here we present similar results using a direct information-theoretic criterion which maximises 
the joint entropy of a non-linearly transformed output feature vector [5]. Under certain 
conditions, this process will perform Independent Component Analysis (or ICA) which is 
equivalent to Barlow's redundancy reduction problem. Since our ICA algorithm, applied to 
natural scenes, does produce local edge filters, Barlow's reasoning is vindicated. Our ICA 
filters are more sparsely distributed than those of other decorrelating filters, thus supporting 
some of the arguments of Field (1994) and helping to explain the results of Olshausen's 
network from an information-theoretic point of view. 
2 Blind separation of natural images. 
A perceptual system is exposed to a series of small image patches, drawn from an ensemble 
of larger images. In the linear image synthesis model [12], each image patch, represented 
by the vector x, has been formed by the linear combination of N basis functions. The basis 
functions form the columns of a fixed matrix, A. The weighting of this linear combination 
(which varies with each image) is given by a vector, s. Each component of this vector 
has its own associated basis function, and represents an underlying 'cause' of the image. 
Thus: x-As. The goal of a perceptual system, in this simplified framework, is to linearly 
transform the images, x, with a matrix of filters, W, so that the resulting vector: u-- Wx, 
recovers the underlying causes, s, possibly in a different order, and rescaled. For the sake 
of argument, we will define the ordering and scaling of the causes so that W - A -1. But 
what should determine their form? If we choose decorrelation, so that (uu T) - I, then the 
solution for W must satisfy: 
wTw-- (xxT) -1 (1) 
There are several ways to constrain the solution to this: 
(1) Principal Components Analysis Wl (PCA), is the Orthogonal (global) solution 
[WW T = I]. The PCA solution to Eq.(1) is Wl = D-E T, where D is the diagonal 
matrix of eigenvalues, and E is the matrix who's columns are the eigenvectors. The fil- 
ters (rows of Wl) are orthogonal, are thus the same as the PCA basis functions, and are 
typically global Fourier filters, ordered according to the amplitude spectrum of the image. 
Example PCA filters are shown in Fig. la. 
(2) Zero-phase Components Analysis Wz (ZCA), is the Symmetrical (local) solution 
[WW T = W2]. The ZCA solution to Eq.(1) is Wz = (xxT) -1/2 (matrix square root). 
ZCA is the polar opposite of PCA. It produces local (centre-surround type) whitening fil- 
Edges are the "Independent Components" of Natural Scenes 833 
ters, which are ordered according to the phase spectrum of the image. That is, each filter 
whitens a given pixel in the image, preserving the spatial arrangement of the image and 
flattening its frequency (amplitude) spectrum. Wz is related to the transforms described 
by Atick & Redlich [3]. Example ZCA filters and basis functions are shown in Fig.lb. 
(3) Independent Components Analysis W (ICA), is the Factorised (semi-local) solution 
[fu(U) - I-Ii fu, (ui)]. Please see [5] for full references. The 'infomax' solution we describe 
here is related to the approaches in [5, 1, 6]. 
As we will show, in Section 5, ICA on natural images produces decorrelating filters which 
are sensitive to both phase (locality) and spatial frequency information, just as in transforms 
involving oriented Gabor functions or wavelets. Example ICA filters are shown in Fig.ld 
and their corresponding basis functions are shown in Fig.le. 
3 An ICA algorithm. 
It is important to recognise two differences between finding an ICA solution, Wt, and other 
decorrelation methods. (1) there may be no ICA solution, and (2) a given ICA algorithm 
may not find the solution even if it exists, since there are approximations involved. In 
these senses, ICA is different from PCA and ZCA, and cannot be calculated analytically, for 
example, from second order statistics (the covariance matrix), except in the gaussian case. 
The approach Which we developed in [5] (see there for further references to ICA) was to 
maximise by stochastic gradient ascent the joint entropy, H[g(u)], of the linear transform 
squashed by a sigmoidal function, g. When the non-linear function is the same (up to scal- 
ing and shifting) as the cumulative density functions (c.d.f.s) of the underlying independent 
components, it can be shown (Nadal & Parga 1995) that such a non-linear 'infomax' pro- 
cedure also minimises the mutual information between the ui, exactly what is required for 
ICA. In most cases, however, we must pick a non-linearity, g, without any detailed knowledge 
of the probability density functions (p.d.f.s) of the underlying independent components. In 
cases where the p.d.f.s are super-gaussian (meaning they are peakier and longer-tailed than 
a gaussian, having kurtosis greater than 0), we have repeatedly observed, using the logistic 
or tanh nonlinearities, that maximisation of H[g(u)] still leads to ICA solutions, when they 
exist, as with our experiments on speech signal separation [5]. Although the infomax algo- 
rithm is described here as an ICA algorithm, a fuller understanding needs to be developed 
of under exactly what conditions it may fail to converge to an ICA solution. 
The basic infomax algorithm changes weights according to the entropy gradient. Defin- 
ing Yi -' g(ui) to be the sigmoidally transformed output variables, the stochastic gradient 
learning rule is: 
AW cr OH(y) = E = E[W -T + $rx T] (2) 
ow ow j 
In this, E denotes expected value, y = [g(ul)... g(uN)] T, and IJI is the absolute value of the 
determinant of the Jacobian matrix: J = det [Oyi/Oxj]ij , and  = [ ... N] T, the elements 
of which depend on the nonlinearity according to: i = O/Oyi(Oyi/Oui). 
Amari, Cichocki & Yang [1] have proposed a modification of this rule which utilises the 
natural gradient rather than the absolute gradient of H(y). The natural gradient exists for 
objective functions which are functions of matrices, as in this case, and is the same as the 
relative gradient concept developed by Cardoso & Laheld (1996). It amounts to multiplying 
834 A. J. Bell and T. J. Sejnowski 
the absolute gradient by wTw, giving, in our case, the following altered version of Eq.(2): 
0H(y) W"W = (I + u')W (3) 
This rule has the twin advantages over Eq.(2) of avoiding the matrix inverse, and of con- 
verging several orders of magnitude more quickly, for data, x, that is not prewhitened. The 
speedup is explained by the fact that convergence is no longer dependent on the condi- 
tioning of the underlying basis function matrix, A. Writing Eq.(3) for one weight gives 
AWij (X Wij q- i Ek WkjUk. This rule is 'almost local' requiring a backwards pass. 
5 
7 
11 
15 
22 
37 
6O 
63 
89 
109 
144 
(a) (b) 
PCA ZCA 
(c) (d) 
W ICA 
(e) 
A 
Figure 1: Selected decorrelating filters 
and their basis functions extracted from 
the natural scene data. Each type of 
decorrelating filter yielded 144 12x12 fil- 
ters, of which we only display a sub- 
set here. Each column contains filters 
or basis functions of a particular type, 
and each of the rows has a number re- 
lating to which row of the filter or basis 
function matrix is displayed. (a) PCA 
(Wp): The 1st, 5th, 7th etc Principal 
Components, showing increasing spatial 
frequency. There is no need to show basis 
functions and filters separately here since 
for PCA, they are the same thing. (b) 
ZCA (Wz): The first 6 entries in this 
column show the one-pixel wide centre- 
surround filter which whitens while pre- 
serving the phase spectrum. All are iden- 
tical, but shifted. The lower 6 entries 
(37, 60 show the basis functions instead, 
which are the columns of the inverse of 
the Wz matrix. (c) W: the weights 
learnt by the ICA network trained on 
Wz-whitened data, showing (in descend- 
ing order) the DC filter, localised oriented 
filters, and localised checkerboard filters. 
(d) W/: The corresponding ICA filters, 
calculated according to W/ = WWz, 
looking like whitened versions of the W- 
filters. (e) A: the corresponding basis 
functions, or columns of W 1. These 
are the patterns which optimally stimu- 
late their corresponding ICA filters, while 
not stimulating any other /CA filter, so 
that WA -- I. 
Edges are the "Independent Components" of Natural Scenes 835 
4 Methods. 
We took four natural scenes involving trees, leafs etc., and converted them to greyscale 
values between 0 and 255. A training set, {x}, was then generated of 17,595 12x12 samples 
from the images. This was 'sphered' by subtracting the mean and multiplying by twice 
the local symmetrical (zero-phase) whitening filter: {x} +- 2Wz({x} - {x}). This removes 
both first and second order statistics from the data, and makes the covariance matrix of x 
equal to 4I. This is an appropriately scaled starting point for further training since infomax 
(Eq.(3)) on raw data, with the logistic function, Yi = (1 + -u,)-l, produces a u-vector 
which approximately satisfies {uu T) = 4I. Therefore, by prewhitening x in this way, we can 
ensure that the subsequent transformation, u -- Wx, to be learnt should approximate an 
orthonormal matrix (rotation without scaling), roughly satisfying the relation WTW - I. 
The matrix, W, is then initialised to the identity matrix, and trained using the logistic 
function version of Eq.(3), in which i = I - 2yi. Thirty sweeps through the data were 
performed, at the end of each of which, the order of the data vectors was permuted to avoid 
cyclical behaviour in the learning. In each sweep, the weights were updated in batches of 50 
presentations. The learning rate (proportionality constant in Eq.(3)) followed 21 sweeps at 
0.001, and 3 sweeps at each of 0.0005, 0.0002 and 0.0001, taking 2 hours running MATLAB 
on a Sparc-20 machine, though a reasonable result for 12x12 filters can be achieved in 30 
minutes. To verify that the result was not affected by the starting condition of W - I, the 
training was repeated with several randomly initialised weight matrices, and also on data 
that was not prewhitened. The results were qualitatively similar, though convergence was 
much slower. 
The full ICA transform from the raw image was calculated as the product of the sphering 
(ZCA) matrix and the learnt matrix: W - WWz. The basis function matrix, A, was 
calculated as W . A PCA matrix, Wp, was calculated. The original (unsphered) data 
was then transformed by all three decorrelating transforms, and for each the kurtosis of each 
of the 144 filters was calculated. Then the mean kurtosis for each filter type (ICA, PCA, 
ZCA) was calculated, averaging over all filters and input data. 
5 Results. 
The filters and basis functions resulting from training on natural scenes are displayed in 
Fig. 1 and Fig.2. Fig. 1 displays example filters and basis functions of each type. The PCA 
filters, Fig. la, are spatially global and ordered in frequency. The ZCA filters and basis 
functions are spatially local and ordered in phase. The ICA filters, whether trained on the 
ZCA-whitened images, Fig. lc, or the original images, Fig.ld, are semi-local filters, most with 
a specific orientation preference. The basis functions, Fig. le, calculated from the Fig.ld ICA 
filters, are not local, and naturally have the spatial frequency characteristics of the original 
images. Basis functions calculated from Fig. ld (as with PCA filters) are the same as the 
corresponding filters since the matrix W (as with W,) is orthogonal. 
In order to show the full variety of ICA filters, Fig.2 shows, with lower resolution, all 144 
filters in the matrix W, in reverse order of the vector-lengths of the filters, so that the 
filters corresponding to higher-variance independent components appear at the top. The 
general result is that ICA filters are localised and mostly oriented. Unlike the basis functions 
displayed in Olshausen & Field (1996), they do not cover a broad range of spatial frequencies. 
However, the appropriate comparison to make is between the ICA basis functions, and the 
basis functions in Olshausen & Field's Figure 4. The ICA basis functions in our Fig. le are 
836 A. J. Bell and T. J. Sejnowski 
oriented, but not localised and therefore it is difficult to observe any multiscale properties. 
However, when we ran the ICA algorithm on Olshausen's images, which were preprocessed 
with a whitening/lowpass filter, our algorithm yielded basis functions which were localised 
multiscale Gabor patches qualitively similar to those in Olshausen's Figure 4. Part of the 
difference in our results is therefore attributable to different preprocessing techniques. 
The distributions (image histograms) produced by PCA, ZCA and ICA are generally double- 
exponential (e -lu' I), or 'sparse', meaning peaky with a long tail, when compared to a gaus- 
sian, as predicted by Field [7]. The log histograms are seen to be roughly linear across 
5 orders of magnitude. The histogram for the ICA filters, however, departs slightly from 
linearity, being more peaked, and having a longer tail than the ZCA and PCA histograms. 
This spreading of the tail signals the greater sparseness of the outputs of the ICA filters, 
and this is reflected in a calculated kurtosis measure of 10.04 for ICA, compared to 3.74 for 
PCA, and 4.5 for ZCA. 
In conclusion, these simulations show that the filters found by the ICA algorithm of Eq.(3) 
with a logistic non-linearity are localised, oriented, and produce outputs distributions of 
very high sparseness. It is notable that this is achieved through an information theoretic 
learning rule which (1) has no noise model, (2) is sensitive to higher-order statistics (spatial 
coincidences), (3) is non-Hebbian (it is closer to anti-Hebbian) and (4) is simple enough to 
be almost locally implementable. Many other levels of higher-order invariance (translation, 
rotation, scaling, lighting) exist in natural scenes. It will be interesting to see if information- 
theoretic techniques can be extended to address these invariances. 
Acknowledgements 
This work emerged through many extremely useful discussions with Bruno Olshausen and 
David Field. We are very grateful to them, and also to Paul Viola and Barak Pearlmutter. 
The work was supported by the Howard Hughes Medical Institute. 
References 
[1] Amari S. Cichocki A. & Yang H.H. 1996. A new learning algorithm for blind signal 
separation, Advances in Neural Information Processing Systems 8, MIT press. 
[2] Atick J.J. 1992. Could information theory provide an ecological theory of sensory pro- 
cessing? Network 3, 213-251 
[3] Atick J.J. & Redlich A.N. 1993. Convergent algorithm for sensory receptive field devel- 
opment, Neural Computation 5, 45-60 
[4] Barlow H.B. 1989. Unsupervised learning, Neural Computation 1,295-311 
[5] Bell A.J. & Sejnowski T.J. 1995. An information maximization approach to blind sep- 
aration and blind deconvolution, Neural Computation, 7, 1129-1159 
[6] Cardoso J-F. & Laheld B. 1996. Equivariant adaptive source separation, IEEE Trans. 
on Signal Proc., Dec.1996. 
[7] Field D.J. 1994. What is the goal of sensory coding? Neural Computation 6, 559-601 
[8] Hubel D.H. & Wiesel T.N. 1968. Receptive fields and functional architecture of monkey 
striate cortex, J. Physiol., 195:215-244 
Edges are the "Independent Components" of Natural Scenes 837 
[9] Linsker R. 1988. Self-organization in a perceptual network. Computer, 21,105-117 
[10] Miller K.D. 1988. Correlation-based models of neural development, in Neuroscience and 
Connectionist Theory, M. Gluck & D. Rumelhart, eds., 267-353, L.Erlbaum, NJ 
[11] Nadal J-P. & Parga N. 1994. Non-linear neurons in the low noise limit: a factorial code 
maximises information transfer. Network, 5, 565-581. 
[12] Olshausen B.A. & Field D.J. 1996. Natural image statistics and efficient coding, Net- 
work: Computation in Neural Systems, 7, 2. 
Figure 2: The matrix of 144 filters obtained by training on ZCA-whitened natural images. 
Each filter is a row of the matrix W, and they are ordered left-to-right, top-to-bottom in 
reverse order of the length of the filter vectors. In a rough characterisation, and more-or-less 
in order of appearance, the filters consist of one DC filter (top left), 106 oriented filters (of 
which 35 were diagonal, 37 were vertical and 34 horizontal), and 37 localised checkerboard 
patterns. The diagonal filters are longer than the vertical and horizontal due to the bias 
induced by having square, rather than circular, receptive fields. 
