Maximum Likelihood Blind Source 
Separation: A ContextsSensitive 
Generalization of ICA 
Barak A. Pearlmutter 
Computer Science Dept, FEC 313 
University of New Mexico 
Albuquerque, NM 87131 
bap@cs.unm.edu 
Lucas C. Parra 
Siemens Corporate Research 
755 College Road East 
Princeton, NJ 08540-6632 
lucas@scr.siemens.com 
Abstract 
In the square linear blind source separation problem, one must find 
a linear unmixing operator which can detangle the result xi(t) of 
mixing n unknown independent sources $i(t) through an unknown 
n x n mixing matrix A(t) of causal linear filters: xi = j aij * $j. 
We cast the problem as one of maximum likelihood density estima- 
tion, and in that framework introduce an algorithm that searches 
for independent components using both temporal and spatial cues. 
We call the resulting algorithm "Contextual ICA," after the (Bell 
and Sejnowski 1995) Infomax algorithm, which we show to be a 
special case of cICA. Because cICA can make use of the temporal 
structure of its input, it is able separate in a number of situations 
where standard methods cannot, including sources with low kur- 
tosis, colored Gaussian sources, and sources which have Gaussian 
histograms. 
I The Blind Source Separation Problem 
Consider a set of n indepent sources si(t),..., sn(t). We are given n linearly dis- 
torted sensor reading which combine these sources, xi - -,j aijsj, where aij is a 
filter between source j and sensor i, as shown in figure la. This can be expressed 
as 
Xi(t) ----   aji(T)$j(t- T) --  aji , $j 
j r=O j 
614 B. A. Pearlmutter and L. C. Parra 
x I x 2 
4 
t)[Y{(t-1) .... ;wO))] Yt 
x I x 2 
 . . x n  . o x n 
Figure 1: The left diagram shows a generative model of data production for blind 
source separation problem. The cICA algorithm fits the reparametrized generative 
model on the right to the data. Since (unless the mixing process is singular) both 
diagrams give linear maps between the sources and the sensors, they are mathe- 
matically equivalent. However, (a) makes the transformation from s to x explicit, 
while (b) makes the transformation from x to y, the estimated sources, explicit. 
or, in matrix notation, x(t) - -.r__0 A(r)s(t - r) = A  s. The square linear blind 
source separation problem is to recover s from x. There is an inherent ambiguity 
' = bi * si where bi(r) is some 
in this, for if we define a new set of sources s ' by s i 
invertable filter, then the various si are independent, and constitute just as good a 
. Similarly the 
solution to the problem as the true si, since xi = -.j(aij * b -) * sj 
sources could be arbitrarily permuted. 
Surprisingly, up to permutation of the sources and linear filtering of the individual 
sources, the problem is well posed--assuming that the sources sj are not Gaussian. 
The reason for this is that only with a correct separation are the recovered sources 
truly statistically independent, and this fact serves as a sufficient constraint. Under 
the assumptions we have made, x and further assuming that the linear transforma- 
tion A is invertible, we will speak of recovering yi(t) -- j wji * xj where these 
Yi are a filtered and permuted version of the original unknown si. For clarity of 
exposition, will often refer to "the" solution and refer to the yi as "the" recovered 
sources, rather than refering to an point in the manifold of solutions and a set of 
consistent recovered sources. 
2 Maximum likelihood density estimation 
Following Pham, Garrat, and Jutten (1992) and Belouchrani and Cardoso (1995), 
we cast the BSS problem as one of maximum likelihood density estimation. In the 
MLE framework, one begins with a probabilistic model of the data production pro- 
cess. This probabilistic model is parametrized by a vector of modifiable parameters 
w, and it therefore assigns a w-dependent probability density p(x0, xx,...; w) to a 
each possible dataset x0, xi, .... The task is then to find a w which maximizes this 
probability. 
There are a number of approaches to performing this maximization. Here we apply 
Without these assumptions, for instance in the presence of noise, even a linear mixing 
process leads to an optimal unmixing process that is highly nonlinear. 
Maximum Likelihood Blind Source Separation: Contextual ICA 615 
the stochastic gradient method, in which a single stochastic sample x is chosen from 
the dataset and -dlogp(x; w)/dw is used as a stochastic estimate of the gradient 
of the negative likelihood t-dlogp(x(t); w)/dw. 
2.1 The likelihood of the data 
The model of data production we consider is shown in figure la. In that model, the 
sensor readings x are an explicit linear function of the underlying sources s. 
In this model of the data production, there are two stages. In the first stage, the 
sources independently produce signals. These signals are time-dependent, and the 
probability density of source i producing value sj(t) at time t is fj(sj(t)lsj(t - 
1), sj (t - 2),...). Although this source model could be of almost any differentiable 
form, we used a generalized autoregressive model described in appendix A. For 
expository purposes, we can consider using a simple AR model, so we model sj(t) - 
bj(1)sj(t - 1) + bj(2)sj(t - 2) +... + bj(T)sj(t- T) + rj, where rj is an lid random 
variable, perhaps with a complicated density. 
It is important to distinguish two different, although related, linear filters. When 
the source models are simple AR models, there are two types of linear convolutions 
being performed. The first is in the way each source produces its signal: as a linear 
function of its recent history plus a white driving term, which could be expressed 
as a moving average model, a convolution with a white driving term, sj - bj  rj. 
The second is in the way the sources are mixed: linear functions of the output of 
each source are added, xi - j aji * $j -- Ej (aji * bj) * rj. Thus, with AR sources, 
the source convolution could be folded into the convolutions of the linear mixing 
process. 
If we were to estimate values for the free parameters of this model, i.e. to estimate 
the filters, then the task of recovering the estimated sources from the sensor output 
would require inverting the linear A ---- (aij), as well as some technique to guarantee 
its non-singularity. Such a model is shown in figure la. Instead, we parameterize 
the model by W -- A -x, an estimated unmixing matrix, as shown in figure lb. 
In this indirect representation, s is an explicit linear function of x, and therefore 
x is only an implicit linear function of s. This parameterization of the model is 
equally convenient for assigning probabilities to samples x, and is therefore suitable 
for MLE. Its advantage is that because the transformation from sensors to sources 
is estimated explicitly, the sources can be recovered directly from the data and the 
estimated model, without invertion. Note that in this inverse parameterization, the 
estimated mixture process is stored in inverse form. The source-specific models fi 
are kept in forward form. Each source-specific model i has a vector of parameters, 
which we denote w (i) . 
We are now in a position to calculate the likelihood of the data. For simplicity we use 
a matrix W of real numbers rather than FIR filters. Generalizing this derivation to 
a matrix of filters is straightforward, following the same techniques used by Lambert 
(1996), Torkkola (1996), A. Bell (1997), but space precludes a derivation here. 
The individual generative source models give 
p(y(t)ly(t- 1), y(t- 2),...) - H fi(yi(t)lyi(t- 1),yi(t- 2),...) 
i 
(1) 
616 B. A. Pearlmutter and L. C. Parra 
where the probability densities fi are each parameterized by vectors w (i). Using 
these equations, we would like to express the likelihood of x(t) in closed form, 
given the history x(t - 1),x(t -2), .... Since the history is known, we therefore 
also know the history of the recovered sources, y(t - 1),y(t - 2), .... This means 
that we can calculate the density p(y(t)ly(t - 1),...). Using this, we can express 
the density of x(t) and expand  = log/5(x;w) - loglW I + -.j logfj(yj(t)lyj(t- 
1),yj(t - 2),... ;w (j)) There are two sorts of parameters which we must take the 
derivative with respect to: the matrix W and the source parameters w(J). The 
source parameters do not influence our recovered sources, and therefore have a 
simple form 
A 
dG dfj(yj;wj)/dwj 
dwj fj(yj;wj) 
However, a change to the matrix W changes y, which introduces a few extra terms. 
Note that dlog IWI/dW -- W -T, the transpose inverse. Next, since y = Wx, we 
see that dyj/dW = (01x10) T, a matrix of zeros except for the vector x in row j. 
Now we note that dfj(.)/dW term has two logical components: the first from the 
effect of changing W upon yj (t), and the second from the effect of changing W upon 
yj (t - 1), yj (t - 2), .... (This second is called the "recurrent term", and such terms 
are frequently dropped for convenience. As shown in figure 3, dropping this term 
here is not a reasonable approximation.) 
dfj(yj(t)lyj(t- 1),...;wj) _ Ofj 
dW Oyj(t) 
Ofj dyj(t- r) 
dyj(t) + E Oyj(t r) dW 
dW r - 
Note that the expression dye(t-r) is the only matrix, and it is zero except for the 
dW 
jth row, which is x(t- r). The expression Ofj/Oyj(t) we shall denote fj(.), and the 
expression OfjOyj(t - r) we shall denote f()(.). We then have 
where (expr(j))j denotes the column vector whose elements are expr(1),..., expr(n). 
2.2 The natural gradient 
Following Amari, Cichocki, and Yang (1996), we follow a pseudogradient. Instead of 
using equation 2, we post-multiply this quantity by wTw. Since this is a positive- 
definite matrix, it does not affect the stochastic gradient convergence criteria, and 
the resulting quantity simplifies in a fashion that neatly eliminates the costly matrix 
inversion otherwise required. Convergence is also accelerated. 
3 Experiments 
We conducted a number of experiments to test the efficacy of the cICA algorithm. 
The first, shown in figure 2, was a toy problem involving a set of processed de- 
liberately constructed to be difficult for conventional source separation algorithms. 
In the second experiment, shown in figure 3, ten real sources were digitally mixed 
with an instantaneous matrix and separation performance was measured as a funci- 
ton of varying model complexity parameters. These sources have are available for 
benchmarking purposes in http://www.cs.unm.edu/~bap/demos.html. 
Maximum Likelihood Blind Source Separation: Contextual ICA 617 
-10 
-2 -1 5 -1 -0.5 0 0.5 1 1 2 
Figure 2: cICA using a history of one time step and a mixture of five logistic densities 
for each source was applied to 5,000 samples of a mixture of two one-dimensional 
uniform distributions each filtered by convolution with a decaying exponential of 
time constant of 99.5. Shown is a scatterplot of the data input to the algorithm, 
along with the true source axes (left), the estimated residual probability density 
(center), and a scatterplot of the residuals of the data transformed into the estimated 
source space coordinates (right). The product of the true mixing matrix and the 
estimated unmixing matrix deviates from a scaling and permutation matrix by 
about 3%. 
Truncated Gradient Full Gradient Noise Model 
100 100 100 
10 10 10 
I I 
0 5 10 15 20 0 5 10 15 20 I 2 
number of AR filter taps number of AR filter taps number of logistics 
Figure 3: The performance of cICA as a function of model complexity and gradient 
accuracy. In all simulations, ten five-second clips taken digitally from ten audio CD 
were digitally mixed through a random ten-by-ten instantanious mixing matrix. The 
signal to noise ratio of each original source as expressed in the recovered sources is 
plotted. In (a) and (b), AR source models with a logistic noise term were used, and 
the number of taps of the AR model was varied. (This reduces to Bell-Sejnowski 
infomax when the number of taps is zero.) Is (a), the recurrent term of the gradient 
was left out, while in (b) the recurrent term was included. Clearly the recurrent 
term is important. In (c), a degenerate AR model with zero taps was used, but the 
noise term was a mixture of logistics, and the number of logistics was varied. 
4 Discussion 
The Infomax algorithm (Baram and Roth 1994) used for source separation (Bell and 
Sejnowski 1995) is a special case of the above algorithm in which (a) the mixing is 
not convolutional, so W(1) = W(2) - ... - 0, and (b) the sources are assumed to 
be iid, and therefore the distributions fi(y(t)) are not history sensitive. Further, 
the form of the fi is restricted to a very special distribution: the logistic density, 
618 B. A. Pearlmutter and L. C. Parra 
the derivative of the sigmoidal function 1/(1 + exp -). Although ICA has enjoyed 
a variety of applications (Makeig et al. 1996; Bell and Sejnowski 1996b; Baram and 
Roth 1995; Bell and Sejnowski 1996a), there are a number of sources which it cannot 
separate. These include all sources with Gaussian histograms (e.g. colored gaussian 
sources, or even speech to run through the right sort of slight nonlinearity), and 
sources with low kurtosis. As shown in the experiments above, these are of more 
than theoretical interest. 
If we simplify our model to use ordinary AR models for the sources, with gaussian 
noise terms of fixed variance, it is possible to derive a closed-form expression for 
W (Hagai Attias, personal communication). It may be that for many sources of 
practical interest, trading away this model accuracy for speed will be fruitful. 
4.1 Weakened assumptions 
It seems clear that, in general, separating when there are fewer microphones than 
sources requires a strong bayesian prior, and even given perfect knowledge of the 
mixture process and perfect source models, inverting the mixing process will be 
computationally burdensome. However, when there are more microphones than 
sources, there is an opportunity to improve the performance of the system in the 
presence of noise. This seems straightforward to integrate into our framework. 
Similarly, fast-timescale microphone nonlinearities are easily incorporated into this 
maximum likelihood approach. 
The structure of this problem would seem to lend itself to EM. Certainly the individ- 
ual source models can be easily optimized using EM, assuming that they themselves 
are of suitable form. 
References 
A. Bell, T.-W. L. (1997). Blind separation of delayed and convolved sources. In 
Advances in Neural Information Processing Systems 9. MIT Press. In this 
volume. 
Amari, S., Cichocki, A., and Yang, H. H. (1996). A new learning algorithm for blind 
signal separation. In Advances in Neural Information Processing Systems 8. 
MIT Press. 
Baram, Y. and Roth, Z. (1994). Density Shaping by Neural Networks with Ap- 
plication to Classification, Estimation and Forecasting. Tech. rep. CIS-94- 
20, Center for Intelligent Systems, Technion, Israel Institute for Technology, 
Haifa. 
Baram, Y. and Roth, Z. (1995). Forecasting by Density Shaping Using Neural 
Networks. In Computational Intelligence for Financial Engineering New York 
City. IEEE Press. 
Bell, 
A. J. and Sejnowski, T. J. (1995). An Information-Maximization Approach 
to Blind Separation and Blind Deconvolution. Neural Computation, 7(6), 
1129-1159. 
Bell, A. J. and Sejnowski, T. J. (1996a). The Independent Components of Natural 
Scenes. Vision Research. Submitted. 
Maximum Likelihood Blind Source Separation: Contextual ICA 619 
Bell, A. J. and Sejnowski, T. J. (1996b). Learning the higher-order structure of a 
natural sound. Network: Computation in Neural Systems. In press. 
Belouchrani, A. and Cardoso, J.-F. (1995). Maximum likelihood source separation 
by the expectation-maximization technique: Deterministic and stochastic im- 
plementation. In Proceedings of 1995 International Symposium on Non-Linear 
Theory and Applications, pp. 49-53 Las Vegas, NV. In press. 
Lambert, R. H. (1996). Multichannel Blind Deconvolution: FIR Matrix Algebra and 
Separation of Multipath Mixtures. Ph.D. thesis, USC. 
Makeig, S., Anllo-Vento, L., Jung, T.-P., Bell, A. J., Sejnowski, T. J., and Hillyard, 
S. A. (1996). Independent component analysis of event-related potentials 
during selective attention. Society for Neuroscience Abstracts, 22. 
Pearlmutter, B. A. and Parra, L. C. (1996). A Context-Sensitive Generaliza- 
tion of ICA. In International Conference on Neural Information Processing 
Hong Kong. Springer-Verlag. Url ftp://ftp.cnl.salk.edu/pub/bap/iconip-96- 
cica.ps.gz. 
Pham, D., Garrat, P., and Jutten, C. (1992). Separation of a mixture of indepen- 
dent sources through a maximum likelihood approach. In European Signal 
Processing Conference, pp. 771-774. 
Torkkola, K. (1996). Blind separation of convolved sources based on information 
maximization. In Neural Networks for Signal Processing VI Kyoto, Japan. 
IEEE Press. In press. 
A Fixed mixture AR models 
The fj (uj; wj) we used were a mixture AR processes driven by logistic noise terms, 
as in Pearlmutter and Parra (1996). Each source model was 
f(u(t)lu(t - 1), uj(t - 2),... ;wj) - E rnjk h((u)(t) - ajk )lrj )lrj 
k 
(3) 
where rjk is a scale parameter for logistic density k of source j and is an element 
of w j, and the mixing coefficients mjk are elements of wj and are constrained by 
-]-k mjk ---- 1. The component means ajk are taken to be linear functions of the 
recent values of that source, 
(4) 
where the linear prediction coefficients ajk(r) and bias bjk are elements of wj. 
The derivatives of these are straightforward; see Pearlmutter and Parra (1996) for 
details. One complication is to note that, after each weight update, the mixing 
coefficients must be normalized, mjk -- mjk / -k' mjk,. 
