Multiplicative Updating Rule 
for Blind Separation Derived 
from the Method of Scoring 
Howard Hua Yang 
Department of Computer Science 
Oregon Graduate Institute 
PO Box 91000, Portland, OR 97291, USA 
hyang@cse.ogi.edu 
Abstract 
For blind source separation, when the Fisher information matrix is 
used as the Riemannian metric tensor for the parameter space, the 
steepest descent algorithm to maximize the likelihood function in 
this Riemannian parameter space becomes the serial updating rule 
with equivariant property. This algorithm can be further simplified 
by using the asymptotic form of the Fisher information matrix 
around the equilibrium. 
I Introduction 
The relative gradient was introduced by (Cardoso and Laheld, 1996) to design 
multiplicative updating algorithms with equivariant property for blind separation 
problems. The idea is to calculate differentials by using a relative increment instead 
of an absolute increment in the parameter space. This idea has been extended to 
compute the relative Hessian by (Pham, 1996). 
For a matrix function f = f(W), the relative gradient is defined by 
Of,, r 
(1) 
From the differential of f(W) based on the relative gradient, the following learning 
rule is given by (Cardoso and Laheld, 1996) to maximize the function f: 
dW Of. ,. 
 = ,9fw = '5-W  (2) 
Also motivated by designing blind separation algorithms with equivariant property, 
MuItipIicative Updating RuIe for BIind Separation 697 
the natural gradient defined by 
v f = ow wTw (3) 
was introduced in (Amari et al, 1996) which yields the same learning rule (2). The 
geometrical meaning of the natural gradient is given by (Amari, 1996). More details 
about the natural gradient can be found in (Yang and Amari, 1997) and (Amari, 
1997). 
The framework of the natural gradient learning was proposed by (Amari, 1997). In 
this framework, the ordinary gradient descent learning algorithm in the Euclidean 
space is not optimal in minimizing a function defined in a Pdemannian space. The 
ordinary gradient should be replaced by the natural gradient which is defined by 
operating the inverse of the metric tensor in the Pdemannian space on the ordinary 
gradient. Let w denote a parameter vector. It is proved by (Amari, 1997) that if 
C(w) is a loss function defined on a Pdemannian space {w} with a metric tensor G, 
the negative natural gradient of C(w) namely, -G- oc is the steepest descent 
' 
direction to decrease this function in the Riemannian space. Therefore, the steepest 
descent algorithm in this Pdemannian space has the following form: 
dw _ OC 
'dr - -oG- Ow' 
If the Fisher information matrix is used as the metric tensor for the Riemannian 
space and C(w) is replaced by the negative log-likelihood function, the above learn- 
ing rule becomes the method of scoring ( Kay, 1993) which is the focus of this paper. 
Both the relative gradient  and the natural gradient  were proposed in order to 
design the multiplicative updating algorithms with the equivariant property. The 
former is due to a multiplicative increment in calculating differential while the latter 
is due to an increment based on a nonholonomic basis (Amari, 1997). Neither V 
nor V depends on the data model. The Fisher information matrix is a special 
and important choice for the Riemannian metric tensor for statistical estimation 
problems. It depends on the data model. Operating the inverse of the Fisher 
information matrix on the ordinary gradient, we have another gradient operator. It 
is called a natural gradient induced by the Fisher information matrix. 
In this paper, we show how to derive a multiplicative updating algorithm from 
the method of scoring. This approach is different from those based on the relative 
gradient and the natural gradient defined by (3). 
2 Fisher Information Matrix For Blind Separation 
Consider a linear mixing system: 
X--AS 
where A e nxn, x = (x,...,xn) :r and s = (s,...,sn) :r. 
are independent with a factorized joint pdf: 
The likelihood function is 
r(8/= II 
i----1 
r(A-lx) 
p(x;A) = [AI 
Assume that sources 
698 H. H. Yang 
where IAI - Idet(A)l. Let W = A -1 and y = Wx ( a demixing system), then we 
have the log-likelihood function 
L(W) = Elogri(yi) + log [W[. 
i=1 
It is easy to obtain 
o 
Owi--' = ri(yi) x1 + WT (4) 
where W T is the (i,j) entry in W -T = (w-l) T. Writing (4) in a matrix form, 
we have 
OL = W_r _ (y)xr = (I-,I,(y)yr)W-r= r(y)w -r 
ow 
where (y) = (b(y),... c)n(y,,)) r, c)i(yi) = r and F(y) = I-(I,(y)y r. 
' -- ri (Yl) 
The maximum likelihood algorithm based on the ordinary gradient o is 
dW 
dt= (I- (y)yT)W-T ---- vF(y)W -r 
(5) 
which has the high computational complexity due to the matrix inverse W -. The 
maximum likelihood algorithm based on the natural gradient of matrix functions is 
_ = - (U)Ur)W. (6) 
dt 
The same algorithm is obtained from aW = rlLW by using the relative gradient. 
An apparent reason for using this algorithm is to avoid the matrix inverse W -1. 
Another good reason for using it is due to the fact that the matrix W driven by 
(6) never becomes singular if the initial matrix W is not singular. This is proved 
by (Yang and Amari, 1997). In fact, this property holds for any learning rule of the 
following type: 
dW 
= (7) 
Let < U, V >= Tr(UrV) denote the inner product of U and V  nx. When 
W(t) is driven by the equation (7), we have 
alW I  aW 
a, --< , a, >-< IWl( W-I)T aw 
-- ' dt > 
= Tr(IWIW-1H(y)W)- (H(y))IWI. 
Therefore, 
Iw(t)l-Iw(0)lexp{ Tr(H(u(r)))dr} (8) 
which is non-singular when the initial matrix W(0) is non-singular. 
The matrix function F(y) is also called an estimating function. At the equilibrium 
of the system (6), it satisfies the zero condition E[F(y)] = 0, i.e., 
E[efii(yi)Yj] = 5ij (9) 
where 5i I = I if i = j and 0 otherwise. 
To calculate the Fisher information matrix, we need a vector form of the equation 
(5). Let Vec(.) denote an operator on a matrix which cascades the columns of the 
Multiplicative Updating Rule for Blind Separation 699 
matrix from the left to the right and forms a column vector. This operator has the 
following property: 
Vec(ABC) = (C '  A)Vec(B) (10) 
where  denotes the Kroneckr product. Applying this property, we first rewrite 
(5) as 
OL OL 
0Vec(W) = Vec(-) -- (W -1  I)Vec(F(y)), (11) 
and then obtain the Fisher information matrix 
E' 0 ( 
G = [0Ve-W) 0Vec(W) 
= (w -1  X)E[Vec(F(u))Vecr(F(y))](W -r  .r). (12) 
The inverse of G is 
G -1 = (W T  I)D-I(w  I) 
where D = E[Vec(F(y))Vec r(F(y))]. 
(13) 
3 Natural Gradient Induced By Fisher Information Matrix 
Define a Riemaanian space 
V = {Vec(W); W e Gl(n)} 
in which the Fisher information matrix G is used as its metric. Here, Gl(n) is the 
space of all the n x n invertible matrices. 
Let C(W) be a matrix function to be minimized. It is shown by (Amari, 1997) that 
the steepest descent direction in the Riemannian space 12 is --6 -1 OC 
0 e' 
Let us define the natural gradient in 12 by 
re(w) = (W T  I)D-I(w  I) OC (14) 
0Vec(W) 
which is called the natural gradient induced by the Fisher information matrix. The 
time complexity of computing the natural gradient in the space 12 is high since 
inverting the matrix D of n 2 x n 2 is needed. 
Using the natural gradient in 12 to maximize the likelihood function L(W) or the 
method of scoring, from (11) and (14) we have the following learning rule 
Vec(d-t ) = rl(W T  I)D-1Vec(F(y)) (15) 
We shall prove that the above learning rule has the equivariant property. 
Denote Vec -1 the inverse of the operator Vec. Let matrices B and A be of n 2 x n 2 
and n x n, respectively. Denote B(i, .) the i-th row of B and Bi = Vec-l(B(i, ')), 
i = 1,..-, n 2. Define an operator B* as a mapping from nxn to nxn: 
< B1,A> ... < Bn2_n+l,A> ] 
B.A= ......... 
< Bn, A> '" <Bn2,A> 
where < .,. > is the inner product in nxn. With the operation *, we have 
BVec(A) =  = Vec(Vec-( i  )) = Vec(B. A), 
< Bn2,A > < Bn,A > 
BVec(A) = Vec(B * A). 
Applying the above relation, we first rewrite the equation (15) as 
Vec(d- -) = rl(W T  I)Vec(D -i * F(y)), 
then applying (10) to the above equation we obtain 
dW 
dt = rl(D-t * F(y))W' 
(16) 
Theorem I For the blind separation problem, the maximum likelihood algorithm 
based on the natural gradient induced by the Fisher information matrix or the method 
of scoring has the form (16) which is a multiplicative updating rule with the equiv- 
ariant property. 
To implement the algorithm (16), we estimate D by sample average. Let fij (y) be 
the (i, j) entry in F(y). A general form for the entries in D is 
which depends on the source pdfs ri(si). When the source pdfs are unknown, in 
practice we choose ri (si) as our prior assumptions about the source pdfs. To simplify 
the algorithm (16), we replace D by its asymptotic form at the solution points 
a = (ci%(1),'" ,CnSa(n)) T where (a(1),..-,a(n)) is a permutation of (1,... ,n). 
Regarding the structure of the asymptotic D, we have the following theorem: 
Theorem 2 Assume that the pdfs of the sources si are even functions. 
Then at the solution point a = (cisao),... , cn%(n)) T, D is a diagonal matrix and 
its n 2 diagonal entries have two forms, namely, 
E[fij(a)fij(a)] = piAj, for i  j and 
2] 
where Ii = E[(ai)], Ai = E[a] and ui = E[(ai)a]- 1. 
have 
D = diag(Vec(H)) 
whe 
More concisely, we 
(17) 
- diag(/A,... ,Pn,n) + diag(yl,'",yn) 
The proof of Theorem 2 is given in Appendix 1. 
Let H = (hij)nxn. Since all/i, Ai, and vi are positive, and so are all hij. We define 
Then from (17), we have 
1 
D -1 = diag(VeC(H) ). 
The results in Theorem 2 enable us to simphfy the algorithm (16) to obtain a low 
complexity learning rule. Since D -1 is a diagonal matrix, for any n x n matrix A 
we have 
1 
D-1Vec(A) = Vec(  A) (18) 
Multiplicative Updating Rule for Blind Separation 701 
where  denotes the componentwise multiplication of two matrices of the same 
dimension. Applying (18) to the learning rule (15), we obtain the following learning 
rule 
Vec( ) = i(W T  I) Vec( /./  F(y) ). 
Again, applying (10) to the above equation we have the following learning rule 
dW 1 
 - I(H  F(y))W. (19) 
Like the learning rule (16), the algorithm (19) is also multiplicative; but unlike (16), 
there is no need to inverse the n 2 x n 2 matrix in (19). The computation of  is 
straightforward by computing the reciprocals of the entries in H. 
(/i, ,Xi, ui) are 3n unknowns in G. Let us impose the following constraint 
= (20) 
Under this constraint, the number of unknowns in G is 2n, and D can be written 
D = Dx  D, (21) 
where Dx = diag(,X,. ., ,Xn) and Du = diag(/,...,/n). 
From (14), using (21) we have the natural gradient descent rule in the Riemannian 
space 12 
_ a (22) 
Vec(W) _ _,W TDw  D)aVec(W) ' 
dt 
Applying the property (10), we rewrite the above equation in a matrix form 
dW OC W T W. (23) 
Since/i and ,Xi are unknown, Du and Dx are replaced by the identity matrix in 
practice. Therefore, the algorithm (2) is an approximation of the algorithm (23). 
Taking C = -L(W) as the negative likelihood function and applying the expres- 
sion (5), we have the following maximum likelihood algorithm based on the natural 
gradient in 
dW 
at = (I - w. (24) 
Again, replacing Du and D by the identity matrix we obtain the maximum like- 
lihood algorithm (6) based on the relative gradient or natural gradient of matrix 
functions. 
In the context of the blind separation, the source pdfs are unknown. The prior 
assumption ri(si) used to define the functions qbi(yi) may not match the true pdfs 
of the sources. However, the algorithm (24) is generally robust to the mismatch 
between the true pdfs and the pdfs employed by the algorithm if the mismatch is 
not too large. See (Cardoso, 1997) and ( Pham, 1996) for example. 
4 Conclusion 
In the context of blind separation, when the Fisher information matrix is used as 
the Riemannian metric tensor for the parameter space, maximizing the likelihood 
function in this Riemannian space based on the steepest descent method is the 
method of scoring. This method yields a multiplicative updating rule with the 
equivariant property. It is further simplified by using the asymptotic form of the 
Fisher information matrix around the equilibrium. 
5 Appendix 
Appendix 1 Proof of Theorem : 
By definition fij (Y) = 50- cfi(Yi)yj. At the equilibrium a = (c s,0),... , cns,(n) 
we have E[cfi(ai)aj] = 0 for i  j and E[cfi(ai)ai] = 1. So E[fij(a)] = 0. Since 
the source pdfs are even functions, we have E[ai] -- 0 and E[qbi(ai)] = 0. Applying 
these equalities, it is not difficult to verify that 
E[fij(a)fkt(a)] = O, for (i,j)  (k,/). 
So, D is a diagonal matrix and 
E[fii(a)fii(a)] = El(1- cfi(ai)ai) 2] = E[cf(ai)a] - 1, 
E[fij(a) fij(a)] ---- E[qbi2(ai)a] - !uiAj 
for i 
Q.E.D. 
(25) 
References 
[1] S. Amari. Natural gradient works efficiently in learning. Accepted by Neural 
Computation, 1997. 
[2] S. Amari. Neural learning in structured parameter spaces - natural Riemannian 
gradient. In Advances in Neural Information Processing Systems, 9, ed. M. C. 
Mozer, M. I. Jordan and T. Petsche, The MIT Press: Cambridge, MA., pages 
127-133, 1997. 
[3] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind 
signal separation. In Advances in Neural Information Processing Systems, 8, 
eds. David S. Touretzky, Michael C. Mozer and Michael E. Hasselmo, MIT 
Press: Cambridge, MA., pages 757-763, 1996. 
[4] J.-F. Cardoso. Infomax and maximum likelihood for blind source separation. 
IEEE Signal Processing Letters, April 1997. 
[5] J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. IEEE 
Trans. on Signal Processing, 44(12):3017-3030, December 1996. 
[6] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. 
PTR Prentice Hall, Englewood Cliffs, 1993. 
[7] D. T. Pham. Blind separation of instantaneous mixture of sources via an ica. 
IEEE Trans. on Signal Processing, 44(11):2768-2779, November 1996. 
[8] H. H. Yang and S. Amari. Adaptive on-line learning algorithms for blind sep- 
aration: Maximum entropy and minimum mutual information. Neural Compu- 
tation, 9(7):1457-1482, 1997. 
