Unsupervised and supervised clustering: 
the mutual information between 
parameters and observations 
Didier Herschkowitz Jean-Pierre Nadal 
Laboratoire de Physique Statistique de l'E.N.S.*
Ecole Normale Supérieure
24, rue Lhomond - 75231 Paris cedex 05, France
herschko@lps.ens.fr    nadal@lps.ens.fr
http://www.lps.ens.fr/~risc/rescomp
Abstract 
Recent works in parameter estimation and neural coding have
demonstrated that optimal performance is related to the mutual
information between parameters and data. We consider the mutual
information in the case where the dependency on the parameter (a
vector θ) of the conditional p.d.f. of each observation (a vector
ξ) is through the scalar product θ·ξ only. We derive bounds and
asymptotic behaviour for the mutual information and compare with
results obtained on the same model with the "replica technique".
1 INTRODUCTION 
In this contribution we consider an unsupervised clustering task. Recent results
on neural coding and parameter estimation (supervised and unsupervised learning
tasks) show that the mutual information between data and parameters (equivalently,
between neural activities and stimulus) is a relevant tool for deriving optimal
performance (Clarke and Barron, 1990; Nadal and Parga, 1994; Opper and Kinzel,
1995; Haussler and Opper, 1995; Opper and Haussler, 1995; Rissanen, 1996; Brunel
and Nadal, 1998).
* Laboratory associated with C.N.R.S. (U.R.A. 1306), ENS, and Universities Paris VI
and Paris VII.
Mutual Information between Parameters and Observations 233 
With this tool we analyze a particular case which has been studied extensively
with the "replica technique" in the framework of statistical mechanics (Watkin and
Nadal, 1994; Reimann and Van den Broeck, 1996; Buhot and Gordon, 1998). After
introducing the model in the next section, we consider the mutual information
between the patterns and the parameter. We derive a bound on it which is of
interest for not too large p. We show how the "free energy" associated with Gibbs
learning is related to the mutual information. We then compare the exact results
with replica calculations. We show that the asymptotic behaviour (p >> N) of
the mutual information is in agreement with the exact result, which is known to be
related to the Fisher information (Clarke and Barron, 1990; Rissanen, 1996; Brunel
and Nadal, 1998). However, for moderate values of α = p/N, we can eliminate false
solutions of the replica calculation. Finally, we give bounds related to the mutual
information between the parameter and its estimators, and discuss common features
of parameter estimation and neural coding.
2 THE MODEL 
We consider the problem where a direction θ (a unit vector) of dimension N has
to be found based on the observation of p patterns. The probability distribution
of the patterns is uniform except in the unknown symmetry-breaking direction θ.
Various instances of this problem have been studied recently within the statistical
mechanics framework, making use of the replica technique (Watkin and Nadal, 1994;
Reimann and Van den Broeck, 1996; Buhot and Gordon, 1998). More specifically,
it is assumed that a set of patterns D = {ξ^μ, μ = 1, ..., p} is generated by p independent
samplings from a non-uniform probability distribution, where θ = {θ_1, ..., θ_N}
represents the symmetry-breaking orientation. The probability is written in the
form:

P(ξ|θ) = (2π)^(-N/2) exp( -ξ²/2 - V(λ) )    (1)

where N is the dimension of the space, λ = θ·ξ is the overlap and V(λ) characterizes
the structure of the data in the breaking direction. As justified within the Bayesian
and Statistical Physics frameworks, one has to consider a prior distribution on the
parameter space, P(θ), e.g. the uniform distribution on the sphere.
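As a purely illustrative aside (not part of the original paper), sampling from the density (1) is easy once the overlap density φ(λ)e^(-V(λ)) can be sampled: draw an isotropic Gaussian vector and overwrite its component along θ with the sampled overlap. The sketch below assumes the two-cluster overlap density used later in section 4, with the values ρ = 1.2, σ = 0.5 of fig.2; all function names are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patterns(theta, p, rho=1.2, sigma=0.5):
    """Draw p patterns xi from the model (1): components orthogonal to
    theta are standard Gaussian, while the overlap lambda = theta.xi is
    drawn from the two-cluster mixture (1/2) sum_{eps=+-1} N(eps*rho, sigma^2)."""
    N = theta.size
    xi = rng.standard_normal((p, N))                       # isotropic part
    lam = rng.choice([-1.0, 1.0], size=p) * rho \
          + sigma * rng.standard_normal(p)                 # tilted overlaps
    # overwrite the component along theta with the sampled overlap
    xi += (lam - xi @ theta)[:, None] * theta[None, :]
    return xi

N = 50
theta = rng.standard_normal(N)
theta /= np.linalg.norm(theta)                             # unit direction
xi = sample_patterns(theta, p=20000)
overlaps = xi @ theta
# second moment of the overlap should approach rho^2 + sigma^2 = 1.69
print("mean overlap^2:", float(np.mean(overlaps**2)))
```

Only the overlap statistics carry information about θ; all directions orthogonal to θ remain pure noise, which is what makes the scalar-product structure of (1) special.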
The mutual information I(D|θ) between the data and θ is defined by

I(D,θ) = ∫ dθ dD P(θ) P(D|θ) ln( P(D|θ) / P(D) )    (2)

It can be rewritten:

I(D|θ)/N = -α ⟨V(λ)⟩ - (1/N) ⟨⟨ln Z⟩⟩    (3)

where

Z = ∫ dθ' P(θ') exp( -Σ_{μ=1}^{p} V(λ'^μ) ),   λ'^μ = θ'·ξ^μ    (4)

In the statistical physics literature -ln Z is a "free energy". The brackets ⟨⟨..⟩⟩
stand for the average over the pattern distribution, and ⟨..⟩ is the average over
the resulting overlap distribution. We will consider properties valid for any N and
any p, others for p >> N; the replica calculations are valid for N and p large
at any given value of α = p/N.
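For completeness (the paper states the rewriting without detail), (3) follows in one line from substituting (1) into (2): the Gaussian prefactors of P(D|θ) and P(D) cancel, leaving only the V terms and the free energy. A derivation sketch:

```latex
% Substitute (1) into (2), with D = \{\xi^\mu\}_{\mu=1}^{p} and \lambda^\mu = \theta\cdot\xi^\mu:
\ln P(D|\theta) = \sum_{\mu=1}^{p}\Big[-\tfrac{N}{2}\ln 2\pi
                  - \tfrac{1}{2}(\xi^\mu)^2 - V(\lambda^\mu)\Big],
\qquad
P(D) = \int d\theta'\,P(\theta')\,P(D|\theta')
     = \Big(\prod_{\mu}\tfrac{e^{-(\xi^\mu)^2/2}}{(2\pi)^{N/2}}\Big)\, Z(D).
% Hence \ln\frac{P(D|\theta)}{P(D)} = -\sum_\mu V(\lambda^\mu) - \ln Z(D),
% and averaging over P(D,\theta) gives
I(D|\theta) = -\,p\,\langle\langle V(\lambda)\rangle\rangle
              - \langle\langle \ln Z \rangle\rangle ,
% which is (3) after dividing by N, with \alpha = p/N.
```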
3 LINEAR BOUND 
The mutual information, a positive quantity, cannot grow faster than linearly in the
amount of data, p. We derive the simple linear bound

I(D|θ) ≤ -p ⟨V(λ)⟩    (5)

We prove the inequality for the case ⟨λ⟩ = 0. The extension to the case ⟨λ⟩ ≠ 0
is straightforward. The mutual information can be written as I = H(D) - H(D|θ).
The calculation of H(D|θ) is straightforward:

H(D|θ) = (pN/2) ln(2πe) + (p/2)(⟨λ²⟩ - 1) + p ⟨V⟩    (6)
Now, the entropy of the data H(D) = -∫ dD P(D) ln P(D) is lower or equal to the
entropy of a Gaussian distribution with the same variance. We thus calculate the
covariance matrix of the data

⟨⟨ξ_i ξ_j⟩⟩ = δ_ij + (⟨λ²⟩ - 1) \overline{θ_i θ_j}    (7)

where \overline{(.)} denotes the average over the parameter distribution. We then have

H(D) ≤ (pN/2) ln(2πe) + (p/2) Σ_{i=1}^{N} ln( 1 + (⟨λ²⟩ - 1) γ_i )    (8)

where the γ_i are the eigenvalues of the matrix \overline{θ_i θ_j}. Using Σ_i γ_i = \overline{θ²} = 1
and the property ln(1 + x) ≤ x, we obtain

H(D) ≤ (pN/2) ln(2πe) + (p/2)(⟨λ²⟩ - 1)    (9)
Putting (9) and (6) together, we find the inequality (5). From this and (3) it
follows also

p ⟨V⟩ ≤ -⟨⟨ln Z⟩⟩ ≤ 0    (10)
4 REPLICA CALCULATIONS 
In the limit N - oc with c finite, the free energy becomes self-averaging, that 
is equal to its average, and its calculation can be performed by standard replica 
technique. This calculation is the same as calculations related to Gibbs learning, 
done in (Reimann and van den Broeck, 1996, Buhot and Gordon, 1998), but the 
interpretation of the order parameters is different. Assuming replica symmetry, we 
reproduce in fig.2 results from (Buhot and Gordon, 1998) for the behaviour with c 
of Q which is the typical overlap between two directions compatible with the data. 
The overlap distribution P() was chosen to get patterns distributed according to 
two clusters along the symmetry-breaking direction 
P(A) = i , (A - ep) 2 
2ax/'  exp( ) (11) 
In fig.2 and fig.1 we show the corresponding behaviour of the average free energy 
and of the mutual information. 
4.1 Discussion 
Up to O 1 Q : 0 and the mutual information is in a purely linear phase I(O[D) _ 
' N -- 
-a < V() >. This correspond to a regime where the data have no correlations. 
For a _> ax, the replica calculation admits up to three differents solutions. In view of 
the fact that the mutual information can never decrease with a and that the average 
free energy can not be positive, it follows that only two behaviours are acceptable. 
In the first, Q leaves the solution Q = 0 at al, and follows the lower branch until a3 
where it jumps to the upper branch. This is the stable way. The second possibility 
is that Q = 0 until a2 where it directly jumps to the upper branch. In (Buhot and 
Gordon, 1998), it has been suggested that one can reach the upper branch, well 
before a3. Here we have thus shown that it is only possible from a2. It remains 
also the possibility of a replica symetry breacking phase in this range of a. 
In the limit a - c the replica calculus gives for the behaviour of the mutual 
information 
N dl?() )2 >) (12) 
(o10) < ( 
The r.h.s can be shown to be equal to half the logarithm of the determinant of the 
Fisher information matrix, which is the exact asymptotic behaviour (Clarke and 
Barron, 1990; Brunel and Nadal, 1998). It can be shown that this behaviour for 
p >> N implies that the best possible estimator based on the data will saturate 
the Cramer-Rao bound (see e.g. Blahut, 1988). It has already been noted that the 
asymptotics performance in estimating the direction, as computed by the replica 
technique, saturate this bound (Van den Broeck, 1997). What we have check here 
is that this manifests itself in the behaviour of the mutual information for large c. 
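To make the asymptotics concrete (our illustration, again with the fig.2 values ρ = 1.2, σ = 0.5), the Fisher term ⟨(dV/dλ)²⟩ can be evaluated numerically from V(λ) = -ln(P(λ)/φ(λ)), where P(λ) is the two-cluster overlap density (11) and φ the standard normal density. The variable names and the numerical scheme are ours, not the paper's.

```python
import numpy as np

RHO, SIGMA = 1.2, 0.5
lam = np.linspace(-6.0, 6.0, 400001)
dlam = lam[1] - lam[0]

def gauss(m, s):
    return np.exp(-(lam - m)**2 / (2*s**2)) / (s*np.sqrt(2*np.pi))

q = 0.5*(gauss(RHO, SIGMA) + gauss(-RHO, SIGMA))   # overlap density, eq. (11)
phi = gauss(0.0, 1.0)                              # standard normal density
V = -np.log(q / phi)                               # tilt appearing in the model (1)
dV = np.gradient(V, lam)                           # numerical dV/dlambda

fisher = float(np.sum(q * dV**2) * dlam)           # <(dV/dlambda)^2> under q
print("<(V')^2> =", fisher)
for alpha in (2.0, 10.0, 100.0):
    # r.h.s. of (12): asymptotic mutual information per dimension
    print(alpha, 0.5*np.log(1.0 + alpha*fisher))
```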
4.2 Bounds for specific estimators 
Given the data D, one wants to find an estimate J of the parameter. The amount
of information I(D|θ) limits the performance of the estimator. Indeed, one has
I(J|θ) ≤ I(D|θ). This basic relationship allows one to derive interesting bounds based
on the choice of particular estimators. We consider first Gibbs learning, which
consists in sampling a direction J from the 'a posteriori' probability P(J|D) =
P(D|J)P(J)/P(D). In this particular case, the differential entropies of the estimator
J and of the parameter θ are equal, H(J) = H(θ). If 1 - Q_g² is the variance of the
Gibbs estimator one gets, for a Gaussian prior on θ, the relations

-(N/2) ln(1 - Q_g²) ≤ I_Gibbs(J|θ) ≤ I(D|θ)    (13)

These relations, together with the linear bound (5), allow us to bound the order
parameter Q_g for small α, where this bound is of interest.
The Bayes estimator consists in taking for J the center of mass of the 'a posteriori'
probability. In the limit α → ∞, this distribution becomes Gaussian, centered at its
most probable value. We can thus assume P_Bayes(J|θ) to be Gaussian with mean
Q_b θ and variance 1 - Q_b²; then the first inequality in (13) (with Q_g replaced by
Q_b and Gibbs by Bayes) is an equality. Then, using the Cramér-Rao bound on the
variance of the estimator, that is (1 - Q_b²)/Q_b² ≥ (α ⟨(dV/dλ)²⟩)⁻¹, one can
bound the mutual information for the Bayes estimator:

I_Bayes(J|θ) ≤ (N/2) ln( 1 + α ⟨(dV(λ)/dλ)²⟩ )    (14)
These different quantities are shown on fig.1. 
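The chain of bounds in this section can be turned into numbers. The sketch below is ours; the two constants are rough numerical estimates for the two-cluster model with ρ = 1.2, σ = 0.5, not values taken from the paper.

```python
import math

MEAN_V = -0.85   # rough numerical estimate of <V(lambda)> for the two-cluster model
FISHER = 3.1     # rough numerical estimate of <(dV/dlambda)^2>

def qg_bound(alpha):
    """Small-alpha bound on Q_g from (13) and the linear bound (5):
    -(1/2) ln(1 - Qg^2) <= -alpha*<V>  =>  Qg^2 <= 1 - exp(2*alpha*<V>)."""
    return math.sqrt(1.0 - math.exp(2.0*alpha*MEAN_V))

def bayes_bound(alpha):
    """Per-dimension upper bound (14) on the Bayes-estimator information,
    obtained from Cramer-Rao: (1 - Qb^2)/Qb^2 >= 1/(alpha*FISHER)."""
    return 0.5*math.log(1.0 + alpha*FISHER)

for alpha in (0.1, 0.5, 2.0):
    print(alpha, qg_bound(alpha), bayes_bound(alpha))
```

Both bounds grow with α, as they must: more patterns can only tighten the alignment an estimator may achieve.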
5 CONCLUSION 
We have studied the mutual information between data and parameter in a problem
of unsupervised clustering: we derived bounds and asymptotic behaviour, and com-
pared these results with replica calculations. Most of the results concerning the
behaviour of the mutual information, observed for this particular clustering task,
are "universal", in that they will be qualitatively the same for any problem which
can be formulated as either a parameter estimation task or a neural coding/signal
processing task. In particular, there is a linear regime for a small enough amount of
data (number of coding cells), up to a maximal value related to the VC dimension
of the system. For large data size, the behaviour is logarithmic, that is I ≃ N ln p
(Nadal and Parga, 1994; Opper and Haussler, 1995) or I ≃ (N/2) ln p (Clarke and
Barron, 1990; Opper and Haussler, 1995; Brunel and Nadal, 1998), depending on the
smoothness of the model. A more detailed review with more such universal features,
exact bounds and relations between unsupervised and supervised learning will be
presented elsewhere (Nadal and Herschkowitz, to appear in Phys. Rev. E).
Acknowledgements 
We thank Arnaud Buhot and Mirta Gordon for stimulating discussions. This work 
has been partly supported by the French contract DGA 96 2557A/DSP. 
References 
[Bla88] R. E. Blahut. Principles and Practice of Information Theory. Addison-Wesley, Cambridge MA, 1988.
[BG98] A. Buhot and M. Gordon. Phase transitions in optimal unsupervised learning. Phys. Rev. E, 57(3):3326-3333, 1998.
[BN98] N. Brunel and J.-P. Nadal. Mutual information, Fisher information, and population coding. Neural Computation, to appear, 1998.
[CB90] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Trans. on Information Theory, 36(3):453-471, 1990.
[HO95] D. Haussler and M. Opper. General bounds on the mutual information between a parameter and n conditionally independent observations. In VIIIth Ann. Workshop on Computational Learning Theory (COLT95), pages 402-411, Santa Cruz, 1995 (ACM, New York).
[OH95] M. Opper and D. Haussler. Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75:3772-3775, 1995.
[NP94] J.-P. Nadal and N. Parga. Duality between learning machines: a bridge between supervised and unsupervised learning. Neural Computation, 6:489-506, 1994.
[OK95] M. Opper and W. Kinzel. Statistical mechanics of generalization. In E. Domany, J. L. van Hemmen and K. Schulten, editors, Physics of Neural Networks, pages 151-. Springer, 1995.
[Ris96] J. Rissanen. Fisher information and stochastic complexity. IEEE Trans. on Information Theory, 42(1):40-47, 1996.
[RVdB96] P. Reimann and C. Van den Broeck. Learning by examples from a nonuniform distribution. Phys. Rev. E, 53(4):3989-3998, 1996.
[VdB97] C. Van den Broeck. In Proceedings of the TANG workshop (Hong Kong, May 26-28, 1997).
[WN94] T. Watkin and J.-P. Nadal. Optimal unsupervised learning. J. Phys. A: Math. and Gen., 27:1899-1915, 1994.
[Figure 1 appears here: curves of -(p/N)⟨V⟩ (linear bound), the replica information, (1/2)ln(1 + (p/N)⟨(V')²⟩), -(1/2)ln(1 - Q_b²) and -(1/2)ln(1 - Q_g²), plotted against α = p/N.]
Figure 1: The dashed line is the linear bound on the mutual information I(D|θ)/N. The
latter, calculated with the replica technique, saturates the bound for α ≤ α₁, and
is the (lower) solid line for α > α₁. The special structure of fig.2 is not visible here
due to the graph scale. The curve -(1/2) ln(1 - Q_g²) is a lower bound on the mutual
information between the Gibbs estimator and θ (it would be equal to this bound
if the conditional probability distribution of the estimator were Gaussian with mean
Q_g θ and variance 1 - Q_g²). Shown also is the analogous curve -(1/2) ln(1 - Q_b²) for
the Bayes estimator. In the limit α → ∞ these two latter Gaussian curves and
the replica information I(D|θ) all converge toward the exact asymptotic behaviour,
which can be expressed as (1/2) ln(1 + α ⟨(dV(λ)/dλ)²⟩) (upper solid line). This latter
expression is, for any p, an upper bound for the two Gaussian curves.
[Figure 2 appears here: upper panel, the average free energy -⟨⟨ln Z⟩⟩/N versus α near α₃; lower panel, the optimal learning curve Q_b(α) together with the Cramér-Rao bound, for α between 2.0 and 2.6.]
Figure 2: In the lower figure, the optimal learning curve Q_b(α) for ρ = 1.2 and
σ = 0.5, as computed in (Buhot and Gordon, 1998) under the replica symmetric
ansatz. We have plotted the Cramér-Rao bound for this quantity. In the upper figure,
the average free energy -⟨⟨ln Z⟩⟩/N. All the part above zero has to be rejected.
α₁ = 2.10, α₂ = 2.515 and α₃ = 2.527.
