Symplectic Nonlinear Component 
Analysis 
Lucas C. Parra 
Siemens Corporate Research 
755 College Road East, Princeton, NJ 08540 
lucas@scr.siemens.com 
Abstract 
Statistically independent features can be extracted by finding a fac- 
torial representation of a signal distribution. Principal Component 
Analysis (PCA) accomplishes this for linear correlated and Gaus- 
sian distributed signals. Independent Component Analysis (ICA), 
formalized by Comon (1994), extracts features in the case of lin- 
ear statistical dependent but not necessarily Gaussian distributed 
signals. Nonlinear Component Analysis finally should find a facto- 
rial representation for nonlinear statistical dependent distributed 
signals. This paper proposes for this task a novel feed-forward, 
information conserving, nonlinear map - the explicit symplectic 
transformations. It also solves the problem of non-Gaussian output 
distributions by considering single coordinate higher order statis- 
tics. 
I Introduction 
In previous papers Deco and Brauer (1994) and Parra, Deco, and Miesbach (1995) 
suggest volume conserving transformations and factorization as the key elements 
for a nonlinear version of Independent Component Analysis. As a general class 
of volume conserving transformations Parra et al. (1995) propose the symplectic 
transformation. It was defined by an implicit nonlinear equation, which leads to a 
complex relaxation procedure for the function recall. In this paper an explicit form 
of the symplectic map is proposed, overcoming thus the computational problems. 
438 L. C. PARRA 
In order to correctly measure the factorization criterion for non-Gaussian output 
distributions, higher order statistics has to be considered. Comon (1994) includes 
in the linear case higher order cumulants of the output distribution. Deco and 
Brauer (1994) consider multi-variate, higher order moments and use them in the 
case of nonlinear volume conserving transformations. But the calculation of multi- 
coordinate higher moments is computational expensive. 
The factorization criterion for statistical independence can be expressed in terms of 
minimal mutual information. Considering only volume conserving transformations 
allows to concentrate on single coordinate statistics, which leads to an important 
reduction of computational complexity. So far, this approach (Deco & Schiirman, 
1994; Parra et al., 1995) has been restricted to second order statistic. The present 
paper discusses the use of higher order cumulants for the estimation of the single 
coordinate output distributions. The single coordinate entropies measured by the 
proposed technique match the entropies of the sampled data more accurately. This 
leads in turns to better factorization results. 
2 Statistical Independence 
More general than decorrelation used in PCA the goal is to extract statistical 
independent features from a signal distribution p(x). We look for a determinis- 
tic transformation on n: y = f(x) which generates a factoffal representation 
P(Y) = rli p(Yi), or at least a representation where the individual coordinates p(Yi) 
of the output variable y are "as factoffal as possible". This can be accomplished 
by minimizing the mutual information MI[p(y)]. 
0 _ MI[p(y)] = E H[p(yi)]- Hip(y)], (1) 
i----1 
since MI[p(y)] - 0 holds if p(y) is factorial. The mutual information can be used 
as a measure of "independence". The entropies H in the definition (1) are defined 
as usual by Hip(y)] - - f_ p(y) in p(y) dy. 
As in linear PCA we select volume conserving transformations, but now without 
restricting ourselves to linearity. In the noise-free case of reversible transformations 
volume conservation implies conservation of entropy from the input x to the output 
y, i.e. H[p(y)] = H[p(x)] = const (see Papoulis, 1991). The minimizationof mutual 
information (1) reduces then to the minimization of the single coordinate output 
entropies H[p(yi)]. This substantially simplifies the complexity of the problem, 
since no multi-coordinate statistics is required. 
2.1 Measuring the Entropy with Cumulants 
With an upper bound minimization criterion the task of measuring entropies can 
be avoided (Parra et al., 1995): 
1 
lln(2e) + lna. 
H[p(yi)] _   
(2) 
Symplectic Nonlinear Component Analysis 439 
0.8 
0'7 t 
0.6 
0.4 
0.3 
0.2 
Edgeworth approxmatn to second and forth order 
0.1 
0 
'0.1.4 
Yl 
dQ(y)/dy 
Xl X 2 
1 
Figure 1: LEFT: Doted line: exponential distribution with additive Gaussian noise 
sampled with 1000 data points. (noise-variance/decay-constant = 0.2). Dashed 
line: Gaussian approximation equivalent to the Edgeworth approximation to second 
order. Solid line: Edgeworth approximation including terms up to fourth order. 
RIGHT: Structure of the volume conserving explicit symplectic map. 
The minimization of the individual output coordinate entropies H[p(yi)] simplifies 
to the minimization of output variances (ri. For the validity of that approach it is 
crucial that the map y = f(x) transforms the arbitrary input distribution p(x) into 
a Gaussian output distribution. But volume conserving and continuous maps can 
not transform arbitrary distributions into Gaussians. To overcome this problem one 
includes statistics - higher than second order - to the optimization criterion. 
Comon (1994) suggests to use the Edgeworth expansion of a probability distribu- 
tion. This leads to an analytic expression of the entropy in terms of measurable 
higher order cumulants. Edgeworth expands the multiplicative correction to the 
best Gaussian approximation of the distribution in the orthonormal basis of Her- 
mite polynomials ha(y). The expansion coefficients are basically given by the cu- 
mulants ca of distribution p(y). The Edgeworth expansions reads for a zero-mean 
distribution with variance (r 2, (see Kendall & Stuart, 1969) 
p(y) 
f(y) 
(3) 
Note, that by truncating this expansion at a certain order, we obtain an approx- 
imation Papp(Y), which is not strictly positive. Figure 1, left shows a sampled 
exponential distribution with additive Gaussian noise. 
By cutting expansion (3) at fourth order, and further expanding the logarithm in 
definition of entropy up to sixth order, Comon (1994) approximates the entropy by, 
440 L. C. PARRA 
i c i c4 2 7 c a lc c4 
1 ln(2re) + In o- -- (4) 
H[p(y)app] ,  12 r 6 48 r 8 48 r u + 8 r 6 r 4 
We suggest to use this expression to minimize the single coordinate entropies in the 
definition of the mutual information (1). 
2.2 Measuring the Entropy by Estimating an Approximation 
Note that (4) could only be obtained by truncating the expansion (3). It is there- 
fore limited to fourth order statistic, which might be not enough for a satisfactory 
approximation. Besides, the additional approximation of the logarithm is accurate 
only for small corrections to the best Gaussian approximation, i.e. for f(y)  1. 
For distributions with non-Gaussian tails the correction terms might be rather large 
and even negative as noted above. We therefore suggest alternatively, to measure 
the entropy by estimating the logarithm of the approximated distribution in Papp (Y) 
with the given data points y and using Edgeworth approximation (3) for Papp (Y), 
i N 
--- E In Papp (Y) - const + In (r - -- E in f(y) 
Hip(y)] , N u_-i N ,--1 
(5) 
Furthermore, we suggest to correct the truncated expansion Papp by setting 
lapp (Y) -+ 0 for all lapp (Y) < 0. For the entropy measurement (5) there is in 
principle no limitation to any specific order. 
In table 1 the different measures of entropy are compared. The values in the row 
labeled 'partition' are measured by counting the numbers n(i) of data points falling 
in equidistant intervals i of width Ay and summing-p(i)Ay lnp(i) over all intervals, 
with p(i)Ay = n(i)/N. This gives good results compared to the theoretical values 
only because of the relatively large sampling size. These values are presented here 
in order to have an reliable estimate for the case of the exponential distribution, 
where cumulant methods tend to fail. 
The results for the exponential distribution show the difficulty of the measurement 
proposed by Comon, whereas the estimation measurement given by equation (5) is 
stable even when considering (for this case) unreliable 5th and 6th order cumulants. 
The results for the symmetric-triangular and uniform distribution demonstrate the 
insensibility of the Gaussian upper bound for the example of figure 2. A uniform 
squared distribution is rotated by an angle c. On the abscissa and ordinate a 
triangular or uniform distribution are observed for the different angles c: II/4 
or c = 0 respectively. The approximation of the single coordinate entropies with 
a Gaussian measure is in both cases the same. Whereas measurements including 
higher order statistics correctly detect minimal entropy (by fixed total information) 
for the uniform distribution at c = 0. 
3 Explicit Symplectic Transformation 
Different ways of realizing a volume conserving transformation that guarantees 
H[p(x)] = H[p(x)] have been proposed (Deco & Schiirman, 1994; Parra et al., 
Symplectic Nonlinear Component Analysis 441 
Measured entropy of Gauss uniform triangular exponential 
sampled distributions symmetric + Gauss noise 
partition 1.35 q- .02 .024 q- .006 .14 q- .02 1.31 q- .03 
Gaussian upper bound (2) 1.415 q- .02 .18 q- .016 .18 q- .02 1.53 q- .04 
Comon, eq. (4) 1.414 q- .02 .14 q- .015 .17 q- .02 3.0 q- 2.5 
Estimate (5) - 4th order 1.414 q- .02 .13 q- .015 .17 q- .02 1.39 q- .05 
Estimate (5) - 6th order 1.414 q- .02 .092 q- .001 .16 q- .02 1.3 q- .5 
theoretical value 1.419 .0 .153 
Table 1: Entropy values for different distributions sampled with N -- 1000 data 
points and the different estimation methods explained in the text. The standard 
deviations are obtained by multiple repetition of the experiment. 
1995). A general class of volume conserving transformations are the symplectic 
maps (Abraham & Marsden, 1978). An interesting and for our purpose important 
fact is that any symplectic transformation can be expressed in terms of a scalar 
function. And in turn any scalar function defines a symplectic map. In (Parra 
et al., 1995) a non-reflecting symplectic transformation has been presented. But 
its implicit definition results in the need of solving a nonlinear equation for each 
data point. This leads to time consuming computations which limit in practice the 
applications to low dimensional problems (n_< 10). In this work reflecting symplec- 
tic transformations with an explicit definition are used to define a "feed-forward" 
volume conserving maps. The input and output space is divided in two partitions 
x: (x,x2) and y = (y,y2), with x,x2,y,y2 C 3 '/. 
0P(x2) 0Q(y) 
y =xx , Y2=x2+ (6) 
Ox2 Oyl 
The structure of this symplectic map is represented in figure 1, right. Two scalar 
functions P: 3 '/2 - 3 and Q: 3 '/2 - 3 can be chosen arbitrarily. Note that 
for quadratic functions equation (6) represents a linear transformation. In order 
to have a general transformation we introduce for each of these scalar functions a 
3-layer perceptron with nonlinear hidden units and a single linear output unit: 
P(x2)=w2.g(W2x2) , Q(y)=w.g(Wy). 
(7) 
The scalar functions P and Q are parameterized by the network parameters 
w,w2 C /' and W, W2 G /' x //2. The hidden-unit, nonlinear activation 
function g applies to each component of the vectors Wy and W2x2 respectively. 
Because of the structure of equation (6) the output coordinates y depend only addi- 
tively on the input coordinates x. To obtain a more general nonlinear dependence 
a second symplectic layer has to be added. 
To obtain factorial distributions the parameters of the map have to be trained. 
The approximations of the single coordinate entropies (4) or (5) are inserted in the 
mutual information optimization criterion (1). These approximations are expressed 
through moments in terms of the measured output data points. Therefore, the 
442 L. C. PARRA 
0.8 
0.6 
0.4 
0.2 
0 
-0.2 
-0.4 
-OA 
-0A 
-0.8 
Figure 2: Sampled 2-dimensional squared uniform distribution rotated by '/4. Solid 
lines represent the directions found by any of the higher order techniques explained 
in the text. Dashed lines represent directions calculated by linear PCA. (This result 
is arbitrary and varies with noise). 
gradient of these expressions with respect to parameters of the map can be computed 
in principle. For that matter different kinds of averages need to be computed. 
Even though, the computational complexity is not substantially increased compared 
with the efficient minimum variances criterion (2), the complexity of the algorithm 
increases considerably. Therefore, we applied an optimization algorithm that does 
not require any gradient information. The simple stochastic and parallel update 
algorithm ALOPEX (Unnikrishnan & Venugopal, 1994) was used. 
4 Experiments 
As explained above, finding the correct statistical independent directions of a ro- 
tated two dimensional uniform distribution causes problems for techniques which 
include only second order statistic. The statistical independent coordinates are sim- 
ply the axes parallel to the edges of the distribution (see figure 2). A rotation i.e. 
a linear transformation suffices for this task. The covariance matrix of the data is 
diagonal for any rotation of the squared distribution and, hence, does not provide 
any information about the correct orientation of the square. It is well known, that 
PCA fails to find in the case of non-Gaussian distributions the statistical indepen- 
dent coordinates. Similarly the Gaussian upper bound technique (2)is not capable 
to minimize the mutual information in this case. Instead, with any one of the higher 
order criteria explained in the previous section one finds the appropriate coordinates 
for any linearly transformed multi-dimensional uniform distribution. This has been 
observed empirically for a series of setups. The symplectic map was restricted in 
this experiments to linearity by using square scalar functions. 
The second example shows that the proposed technique in fact finds nonlinear 
relations between the input coordinates. An one-dimensional signal distributed 
according to the distribution of figure 1 was nonlinearly transformed into a two- 
Symplectic Nonlinear Component Analysis 443 
Figure 3: Symplectic map trained with 4th and 2nd order statistics corresponding 
to the equations (5) and (2) respectively. Left: input distribution. The line at 
the center of the distribution gives the nonlinear transformed noiseless signal dis- 
tributed according to the distribution shown in figure 1. Center and Right: Output 
distribution of the symplectic map corresponding to the 4th order (right) and 2nd 
order (center) criterion. 
dimensional signal and corrupted with additive noise, leading to the distribution 
shown in figure 3, left. The task of finding statistical independent coordinates has 
been tackled by an explicit symplectic transformation with. n = 2 and m = 6. 
On figure 3 the different results for the optimization according to the Gaussian 
upper bound criterion (2) and the approximated entropy criterion (5) are shown. 
Obviously considering higher order statistics in fact improves the result by finding 
the better representation of the nonlinear dependency. 
Reference 
Abraham, R., & Marsden, J. (1978). Foundations of Mechanics The Benjamin- 
Cummings Publishing Company, Inc., London. 
Comon, P. (1994). Independent component analysis, A new concep Signal Pro- 
cessing, 36, 287-314. 
Deco, G., & Brauer, W. (1994). Higher Order Statistical Decorrelation by Volume 
Concerving Nonlinear Maps. Neural Networks, ? submitted. 
Deco, G., & Schfirman, B. (1994). Learning Time Series Evolution by Unsupervised 
Extraction of Correlations. Physical Review E, ? submitted. 
Kendall, M. G., & Stuart, A. (1969). The Advanced Theory of Statistics (3 edition)., 
Vol. 1. Charles Griffin and Company Limited, London. 
Papoulis, A. (1991). Probability, Random Variables, and Stochastic Processes. Third 
Edition, McGraw-Hill, New York. 
Parra, L., Deco, G., & Miesbach, S. (1995). Redundancy reduction with 
information-preserving nonlinear maps. Network, 6(1), 61-72. 
Unnikrishnan, K., P., & Venugopal, K., P. (1994). Alopex: A Correlation-Based 
Learning Algorithm for Feedforward and Recurrent Neural Networks. Neural 
Computation, 6(3), 469-490. 
