A New Learning Algorithm for Blind 
Signal Separation 
S. Amari* 
University of Tokyo 
Bunkyo-ku, Tokyo 113, JAPAN 
amari @ sat. t. u- tokyo. ac.j p 
A. Cichocki 
Lab. for Artificial Brain Systems 
FRP, RIKEN 
Wako-Shi, Saitama, 351-01, JAPAN 
cia@kamo.riken.go.jp 
H. I-I. Yang 
Lab. for Information Representation 
FRP, RIKEN 
Wako-Shi, Saitama, 351-01, JAPAN 
hhy@koala.riken.go.jp 
Abstract 
A new on-line learning algorithm which minimizes a statistical de- 
pendency among outputs is derived for blind separation of mixed 
signals. The dependency is measured by the average mutual in- 
formation (MI) of the outputs. The source signals and the mixing 
matrix are unknown except for the number of the sources. The 
Gram-Charlier expansion instead of the Edgeworth expansion is 
used in evaluating the MI. The natural gradient approach is used 
to minimize the MI. A novel activation function is proposed for the 
on-line learning algorithm which has an equivariant property and 
is easily implemented on a neural network like model. The validity 
of the new learning algorithm are verified by computer simulations. 
I INTRODUCTION 
The problem of blind signal separation arises in many areas such as speech recog- 
nition, data communication, sensor signal processing, and medical science. Several 
neural network algorithms [3, 5, 7] have been proposed for solving this problem. 
The performance of these algorithms is usually affected by the selection of the ac- 
tivation functions for the formal neurons in the networks. However, all activation 
*Lab. for Information Representation, FRP, R/KEN, Wako-shi, Saitama, JAPAN 
758 S. AMARI, A. CICHOCKI, H. H. YANG 
functions attempted are monotonic and the selections of the activation functions 
are ad hoc. How should the activation function be determined to minimize the MI? 
Is it necessary to use monotonic activation functions for blind signal separation? In 
this paper, we shall answer these questions and give an on-line learning algorithm 
which uses a non-monotonic activation function selected by the independent com- 
ponent analysis (ICA) [7]. Moreover, we shall show a rigorous way to derive the 
learning algorithm which has the equivariant property, i.e., the performance of the 
algorithm is independent of the scaling paxmeters in the noiseless case. 
2 PROBLEM 
Let us consider unknown source signals $i(t),i ---- 1,..., n which are mutually in- 
dependent. It is assumed that the sources s'(t) are stationary processes and each 
source has moments of any order with a zero mean. The model for the sensor output 
is 
x(t) = As(t) 
where A E /'x' is an unknown non-singular mixing matrix, s(t) 
[s(t), .-. ,sn(t)] T and x(t) -[x(t),...,xn(t)] T. 
Without knowing the source signals and the mixing matrix, we want to recover the 
original signals from the observations x(t) by the following linear transform: 
y(t) = Wx(t) 
where y(t) = [yX(t),.-. ,yn(t)]T and W  l:l nXn is a de-mixing matrix. 
It is impossible to obtain the original sources $i(t) because they are not identifiable 
in the statistical sense. However, except for a permutation of indices, it is possible 
to obtain i$i(t) where the constants ci are indefinite nonzero scalar factors. The 
source signals are identifiable in this sense. So our goal is to find the matrix W such 
that [y,-.., yn] coincides with a permutation of [sX, --. , s n] except for the scalar 
factors. The solution W is the matrix which finds all independent components in 
the outputs. An on-line learning algorithm for W is needed which performs the 
ICA. It is possible to find such a learning algorithm which minimizes the dependency 
among the outputs. The algorithm in [6] is based on the Edgeworth expansion[8] for 
evaluating the marginal negentropy. Both the Gram-Charlier expansion[8] and the 
Edgeworth expansion[8] can be used to approximate probability density functions. 
We shall use the Gram-Charlier expansion instead of the Edgeworth expansion for 
evaluating the marginal entropy. We shall explain the reason in section 3. 
3 INDEPENDENCE OF SIGNALS 
The mathematical framework for the ICA is formulated in [6]. The basic idea of the 
ICA is to minimize the dependency among the output components. The dependency 
is measured by the Kullback-Leibler divergence between the joint and the product 
of the marginal distributions of the outputs: 
D(W) = / p(y) log 
p(y) 
i_[n=x pa(ya) dy (1) 
where pa(y) is the marginal probability density function (pdf). Note the Kullback- 
Leibler divergence has some invariant properties from the differential-geometrical 
point of view[1]. 
A New Learning Algorithm for Blind Signal Separation 759 
It is easy to relate the Kullback-Leibler divergence D(W) to the average MI of y: 
n 
D(W) = -H(y) q- Z H(y*) (2) 
a----1 
where 
H(y) = - f p(y)logp(y)dy, 
H(y a) = -fpa(y")logp,(y")dy  is the marginal entropy. 
The minimization of the Kullback-Leibler divergence leads to an ICA algorithm for 
estimating W in [6] where the Edgeworth expansion is used to evaluate the negen- 
tropy. We use the truncated Gram-Charlier expansion to evaluate the Kullback- 
Leibler divergence. The Edgeworth expansion has some advantages over the Gram- 
Charlier expansion only for some special distributions. In the case of the Gamma 
distribution or the distribution of a random variable which is the sum of iid random 
variables, the coefficients of the Edgeworth expansion decrease uniformly. However, 
there is no such advantage for the mixed output ya in general cases. 
To calculate each H(y ) in (2), we shall apply the Gram-Charlier expansion to 
approximate the pdf p(y). Since E[y] = E[WAs] - 0, we have Ely ] - 0. To 
simplify the calculations for the entropy H(y ) to be carried out later, we assume 
rn - 1. We use the following truncated Gram-Charlier expansion to approximate 
the pdf p,(y"): 
p,(y')  e(y'){l+ . sty ) +  4tY )I (3) 
where n] = m], n = rn- 3, rn = E[(y) i] is the k-th order moment of y, 
(y) =  ,' 
v;e : , and Hi(y) are Chebyshev-Hermite polynomials defined by the 
identity 
dis(y) 
(-1) i y = Hi(y)e(y). 
We prefer the Gram-Charlier expansion to the Edgeworth expansion because the 
former clearly shows how n] and n affect the approximation of the pdf. The last 
term in (3) characterizes non-Gaussian distributions. To apply (3) to calculate 
H(y'), we need the following integrals: 
f 1 
- a(y)H2(y) log a(y)dy =  (4) 
f a(y)(H2(y))2H4(y)dy = 24. (5) 
These integrals can be obtained easily from the following results for the moments 
of a Gaussian random variable N(0,1): 
f y21+la(y)dy = O, f y21a(y)dy = 1.3'" (2k - 1). (6) 
By using the expansion 
y2 
_ _ q- O(y 3) 
log(1 + y)  y 2 
and taking account of the orthogonality relations of the Chebyshev-Hermite poly- 
nomials and (4)-(5), the entropy H(y ') is expanded as 
I log(2,re) (])2 () 5    I  3 
H(Ya) "'  2.3! 2-4! +  (t3) t4 q- ]--(/4)  (7) 
760 S. AMARI, A. CICHOCKI, H. H. YANG 
It is easy to calculate 
I log(2re). 
- c(y) log c(y)dy -  
From y - Wx, we have H(y) - H(x) + log [det(W)[. Applying (7) and the above 
expressions to (2), we have 
n [a2 )2 
n log(2re) - ;-"[ L% L (n 
D(W)  -H(x) - log [det(W)l + 
5 a  a I ()]. (8) 
-- (n3) n4 16 
4 A NEW LEARNING ALGORITHM 
To obtain the gradient descent algorithm to update W recursively, we need to 
calculate OD 
0--- where w is the (a,k) element of W in the a-th row and k-th column. 
Let cof(w) be the cofactor of w in W. It is not difficult to derive the followings: 
O,oglat(W)l _ cof(w) _ (W_T)  
ow - det(W) 
o; _ a[(y)xk] 
0w -- 
0; __ 4E[(y)3xk ] 
Ow - 
where (w-T) denotes the (a,k) element of (wT) -. From (8), we obtain 
OD 
  __(w-T) qt_ f(n,n)E[(y.)2x] + g(n,n)E[(y.)3x] 
where 
() 
f(y,z) = -ly + yz, g(y,z) =   2 3 2 
2 -z+y +z . 
From (9), we obtain the gradient descent algorithm to update W recursively: 
d,.,, _ O D 
at - -o(t) Ow 
= - a ")[(y")" *] g(] )[(y")x*])(10) 
v(t)((w-) ( , - , 
where q(t) is a learning rate function. Replacing the expectation values in (10) by 
their instantaneous values, we have the stochastic gradient descent gorithm: 
at = q(t){(w-T)- f(n,n)(Y") x --g(n,n)(Y")ax)  (11) 
We need to use the following adaptive Mgorithm to compute n and n in (11): 
dn] 
dt = -(t)(n-(Ya)S) 
dn_ )4 3) (12) 
at - -(t)(l - (f + 
where p(t) is another leaning rate function. 
The performance of the algorithm (11) relies on the estimation of the third and 
fourth order cumults performed by the Mgorithm (12). Replacing the moments 
A New Learning Algorithm for Blind Signal Separation 761 
of the random variables in (11) by their instantaneous values, we obtain the following 
algorithm which is a direct but coarse implementation of (11): 
= - f(y")x 
where the activation function f(y) is defined by 
 259 147 4 2y3 
f(y) = yX + _y _ _y _ _ yS + . 
(13) 
(14) 
Note the activation function f(y) is an odd function, not a monotonic function. 
The equation (13) can be written in a matrix form: 
dW = (t)(W_  _ f(y)x) ' (15) 
dt 
This equation can be further simplified as following by substituting xW  = y: 
dW 
at = (t)(I- f(y)yT)W - (16) 
where f(y) = (f(y),...,f(yn))T. The above equation is based on the gradient 
descent algorithm (10) with the following matrix form: 
dW OD (17) 
at = -r(t) OW' 
From information geometry perspective[I], since the mixing matrix A is non- 
singular we had better replace the above algorithm by the following natural gradient 
descent algorithm: 
dW t OD  
M -- -/( )-W W. (18) 
Applying the previous approximation of the gradient oD 
o--W to (18), we obtain the 
following algorithm: 
dW 
dt= rl(t)(I- f(y)yT)W (19) 
which has the same "equivariant" property as the algorithms developed in [4, 5]. 
Although the on-line learning algorithms (16) and (19) look similar to those in 
[3, 7] and [5] respectively, the selection of the activation function in this paper is 
rational, not ad hoc. The activation function (14) is determined by the ICA. It is 
a non-monotonic activation function different from those used in [3, 5, 7]. 
There is a simple way to justify the stability of the algorithm (19). Let Vec(-) 
denote an operator on a matrix which cascades the columns of the matrix from the 
left to the right and forms a column vector. Note this operator has the following 
property: 
Vec(ABC) = (C   A)Vec(B). (20) 
Both the gradient descent algorithm and the natural gradient descent algorithm are 
special cases of the following general gradient descent algorithm: 
dVec(W) = _rl(t) P OD (21) 
dt 0Vec(W) 
where P is a symmetric and positive definite matrix. It is trivial that (21) becomes 
(17) when P = I. When P = wTw  I, applying (20) to (21), we obtain 
_ OD - -(t)Vec( OD WW) 
dVec(W) _ _ri(t)(WTW  I) OVec(W)  
dt 
762 S. AMARI, A. CICHOCKI, H. H. YANG 
and this equation implies (18). So the natural gradient descent algorithm updates 
W(t) in the direction of decreasing the dependency D(W). The information geom- 
etry theory[1] explains why the natural gradient descent algorithm should be used 
to minimize the MI. 
Another on-line learning algorithm for blind separation using recurrent network was 
proposed in [2]. For this algorithm, the activation function (14) also works well. 
In practice, other activation functions such as those proposed in [2]-[6] may also be 
used in (19). However, the performance of the algorithm for such functions usually 
depends on the distributions of the sources. The activation function (14) works for 
relatively general cases in which the pdf of each source can be approximated by the 
truncated Gram-Charlier expansion. 
5 SIMULATION 
In order to check the validity and performance of the new on-line learning algorithm 
(19), we simulate it on the computer using synthetic source signals and a random 
mixing matrix. The extensive computer simulations have fully confirmed the theory 
and the validity of the algorithm (19). Due to the limit of space we present here 
only one illustrative example. 
Example: 
Assume that the following three unknown sources are mixed by a random mixing 
matrix A: 
Is  (t), s2(t), s3(t)] - In(t), 0.1sin(4OOt)cos(aOt), O.01sign[sin(5OOt + 9cos(40t))] 
where n(t) is a noise source uniformly distributed in the range [-1, +1], and s2(t) 
and s3(t) are two deterministic source signals. The elements of the mixing matrix 
A are randomly chosen in [-1, +1]. The learning rate is exponentially decreasing 
to zero as r/(t) = 250exp(-5t). 
A simulation result is shown in Figure 1. The first three signals denoted by Xl, 
X2 and X3 represent mixing (sensor) signals: xX(t), x(t) and x3(t). The last 
three signals denoted by O1, 02 and 03 represent the output signals: y(t), y(t), 
and y3(t). By using the proposed learning algorithm, the neural network is able 
to extract the deterministic signals from the observations after approximately 500 
milliseconds. 
The performance index E1 is defined by 
n n n n 
E = Z(Z ]Pij] 1) + Z( Z [pij[ -1) 
maxk IPik] maxk IPj] 
i----1 j----1 j----1 i=1 
where P -- (Pij ) -- WA. 
6 CONCLUSION 
The major contribution of this paper the rigorous derivation of the effective blind 
separation algorithm with equivariant property based on the minimization of the 
MI of the outputs. The ICA is a general principle to design algorithms for blind 
signal separation. The most difficulties in applying this principle are to evaluate 
the MI of the outputs and to find a working algorithm which decreases the MI. 
Different from the work in [6], we use the Gram-Charlier expansion instead of the 
Edgeworth expansion to calculate the marginal entropy in evaluating the MI. Using 
A New Learning Algorithm for Blind Signal Separation 763 
the natural gradient method to minimize the MI, we have found an on-line learning 
algorithm to find a de-mixing matrix. The algorithm has equivariant property and 
can be easily implemented on a neural network like model. Our approach provides 
a rational selection of the activation function for the formal neurons in the network. 
The algorithm has been simulated for separating unknown source signals mixed by 
a random mixing matrix. Our theory and the validity of the new learning algorithm 
are verified by the simulations. 
2 01 
-2 T 
i nnng  ,  mnnnnnnnnm   hnnnnnqnm  
02 
0 
0.2 03 
Figure 1: The mixed and separated signals, and the performance index 
Acknowledgment 
We would like to thank Dr. Xiao Yah SU for the proof-reading of the manuscript. 
References 
[1] S.-I. Amari. Differential-Geometrical Methods in Statistics, Lecture Notes in 
Statistics vo1.8. Springer, 1985. 
[2] S. Amari, A. Cichocki, and H. H. Yang. Recurrent neural networks for blind sep- 
aration of sources. In Proceedings 1995 International Symposium on Nonlinear 
Theory and Applications, volume I, pages 37-42, December 1995. 
[3] A. J. Bell and T. J. Sejnowski. An information-maximisation approach to blind 
separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995. 
[4] J.-F. Cardoso and Beate Laheld. Equivariant adaptive source separation. To 
appear in IEEE Trans. on Signal Processing, 1996. 
[5] A. Cichocki, R. Unbehauen, L. Moszczyfiski, and E. Rummert. A new on-line 
adaptive learning algorithm for blind separation of source signals. In ISANN9, 
pages 406-411, Taiwan, December 1994. 
[6] P. Comon. Independent component analysis, a new concept? Signal Processing, 
36:287-314, 1994. 
[7] C. Jutten and J. Herault. Blind separation of sources, part i: An adaptive 
algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991. 
[8] A. Stuart and J. K. Ord. Kendall's Advanced Theory of Statistics. Edward 
Arnold, 1994. 
