Model selection in clustering by uniform 
convergence bounds* 
Joachim M. Buhmann nd Marcus Held 
Institut fiir Informatik III, 
RSmerstrafie 164, D-53117 Bonn, Germany 
{jb,held)@cs.uni-bonn.de 
Abstract 
Unsupervised learning algorithms are designed to extract struc- 
ture from data samples. Reliable and robust inference requires a 
guarantee that extracted structures are typical for the data source, 
i.e., similar structures have to be inferred from a second sample 
set of the same data source. The overfitting phenomenon in max- 
imum entropy based annealing algorithms is exemplarily studied 
for a class of histogram clustering models. Bernstein's inequality 
for large deviations is used to determine the maximally achievable 
approximation quality parameterized by a minimal temperature. 
Monte Carlo simulations support the proposed model selection cri- 
terion by finite temperature annealing. 
I Introduction 
Learning algorithms are designed to extract structure from data. Two classes of 
algorithms have been widely discussed in the literature - supervised and unsuper- 
vised learning. The distinction between the two classes depends on supervision or 
teacher information which is either available to the learning algorithm or missing. 
This paper applies statistical learning theory to the problem of unsupervised learn- 
ing. In particular, error bounds as a protection against over fitting are derived for 
the recently developed Asymmetric Clustering Model (ACM) for co-occurrence 
data [6]. These theoretical results show that the continuation method "determin- 
istic annealing" yields robustness of the learning results in the sense of statistical 
learning theory. The computational temperature of annealing algorithms plays the 
role of a control parameter which regulates the complexity of the learning machine. 
Let us assume that a hypothesis class 7/ of loss functions h(x; c) is given. These 
loss functions measure the quality of structures in data. The complexity of 7/ is 
controlled by coarsening, i.e., we define a 'y-cover of 7/. Informally, the inference 
principle advocated by us performs learning by two inference steps: (i) determine 
the optimal approximation level 7 for consistent learning (in terms of large risk devi- 
ations); (ii) given the optimal approximation level 7, average over all hypotheses in 
an appropriate neighborhood of the empirical minimizer. The result of the inference 
*This work has been supported by the German Israel Foundation for Science and Re- 
search Development (GIF) under grant #1-0403-001.06/95. 
Model Selection in Clustering by Uniform Convergence Bounds 217 
procedure is not a single hypothesis but a set of hypotheses. This set is represented 
either by an average of loss functions or, alternatively, by a typical member of this 
set. This induction approach is named Empirical Risk Approximation (ERA) [2]. 
The reader should note that the learning algorithm has to return an average struc- 
ture which is typical in a -,-cover sense but it is not supposed to return the hypothesis 
with minimal empirical risk as in Vapnik's "Empirical Risk Minimization" (ERM) 
induction principle for classification and regression [9]. The loss function with mini- 
mal empirical risk is usually a structure with maximal complexity, e.g., in clustering 
the ERM principle will necessarily yield a solution with the maximal number of clus- 
ters. The ERM principle, therefore, is not suitable as a model selection principle to 
determine the number of clusters which are stable under sample fluctuations. The 
ERA principle with its approximation accuracy q, solves this problem by controlling 
the effective complexity of the hypothesis class. 
In spirit, this approach is similar to the Gibbs-algorithm presented for example in 
[3]. The Gibbs-algorithm samples a random hypothesis from the version space to 
predict the label of the l + lth data point xt+. The version space is defined as 
the set of hypotheses which are consistent with the first l given data points. In our 
approach we use an alternative definition of consistency, where all hypothesis in an 
appropriate neighborhood of the empirical minimizer define the version space (see 
also [4]). Averaging over this neighborhood yields a structure with risk equivalent to 
the expected risk obtained by random sampling from this set of hypotheses. There 
exists also a tight methodological relationship to [7] and [4] where learning curves 
for the learning of two class classifiers are derived using techniques from statistical 
mechanics. 
2 The Empirical Risk Approximation Principle 
The data samples Z = (zr  f, 1 _ r _ l) which have to be analyzed by the unsu- 
pervised learning algorithm are elements of a suitable object (resp. feature) space 
f. The samples are distributed according to a measure/z which is not assumed to 
be known for the analysis.  
A mathematically precise statement of the ERA principle requires several defini- 
tions which formalize the notion of searching for structure in the data. The qual- 
ity of structures extracted from the data set Z is evaluated by the empirical risk 
1 
7(c; Z) :-- ? '-r=l h(zr; c) of a structure a given the training set Z. The function 
h(z; a) is known as loss function in statistics. It measures the costs for processing a 
generic datum z with model a. Each value a  A parameterizes an individual loss 
function with A denoting the set of possible parameters. The loss function which 
minimizes the empirical risk is denoted by d  := argminae^ 7(a; Z). 
The relevant quality measure for learning is the expected risk T(c) := 
fa h(z; a)d/z(z). The optimal structure to be inferred from the data is a  :- 
argminae^ T(a). The distribution /z is assumed to decay sufficiently fast with 
bounded rth moments E, (lb(z; a) - T(a)l < r!w-2V. (h(z; a)), a e A 
(r  2). E, (.) and V, (.) denote expectation and variance of a random vari- 
able, respectively. - is a distribution dependent constant. 
ERA requires the learning algorithm to determine a set hypotheses on the basis 
of the finest consistently learnable cover of the hypothesis class. Given a learning 
accuracy 7 a subset of parameters A v = (el,... ,alAl-1) U (&) can be defined 
such that the hypothesis class 7/ is covered by the function balls with index sets 
By(a) := {a'' fa Ih(z;a')- h(z;a)[ dp(z) _< 7}, i.e. A C [JaeA B(a). The em- 
 Knowledge of covering numbers is required in the following analysis which is a weaker 
type of information than complete knowledge of the probability measure/ (see also [5]). 
218 J. M. Buhmann andM. HeM 
pirical minimizer & has been added to the cover to simplify bounding arguments. 
Large deviation theory is used to determine the approximation accuracy "/for learn- 
ing a hypothesis from the hypothesis class 7/. The expected risk of the empirical 
minimizer exceeds the global minimum of the expected risk T(c ) by ca* with a 
probability bounded by Bernstein's inequality [8] 
P {T(& ) -T(a ) > 
< P(sup17(a)-T(a)I> 1 
_ 2[A[exp-8+4,r(e_7/a.r) 
The complexity lay[ of the coarsened hypothesis class has to be small enough to 
guarantee with high confidence small e-deviations? This large deviation inequality 
weighs two competing effects in the learning problem, i.e. the probability of a 
large deviation exponentially decreases with growing sample size l, whereas a large 
deviation becomes increasingly likely with growing cardinality of the "/-cover of the 
hypothesis class. According to (1) the sample complexity l0 (% e, 5) is defined by 
t0 2 
lg IAvl- 8 + 4r(e-f/a 
2 
+1og =0. (2) 
With probability 1 - 5 the deviation of the empirical risk from the expected risk is 
bounded by  (ePt7 T -- "/) --: "/'PP. Averaging over a set of functions which exceed 
the empirical minimizer by no more than 2"/'pp in empirical risk yields an average 
hypothesis corresponding to the statistically significant structure in the data, i.e., 
7(a) - 7(& ) <_ T(a) +"/'P - (7(&) -"/') _< 2"/'p since T(a ) < T(& ) by 
definition. The key task in the following remains to calculate the minimal precision 
e("/) as a function of the approximation "/and to bound from above the cardinality 
[Av[ of the "/-cover for specific learning problems. 
3 Asymmetric clustering model 
The asymmetric clustering model was developed for the analysis resp. grouping of 
objects characterized by co-occurrence of objects and certain feature values [6]. 
Application domains for this explorerive data analysis approach are for example 
texture segmentation, statistical language modeling or document retrieval. 
Denote by f - A' x y the product space of objects xi  A',I _< i _< n and 
features yj  y, 1 _< j _< f. The xi  3:' are characterized by observations 
Z = {zr} = {(xi(0,yj(0),r = 1,... ,l}. The sufficient statistics of how often 
the object-feature pair (xi, yj) occurs in the data set Z is measured by the set 
of frequencies {rli j  number of observations (xi, y j)/total number of observations}. 
Derived measurements are the frequency of observing object xi, i.e. r/i = Y]-=I r/ij 
and the frequency of observing feature yj given object xi, i.e. r/li = rli/rli. The 
asymmetric clustering model defines a generarive model of a finite mixture of com- 
ponent probability distributions in feature space with cluster-conditional distri- 
butions q = (qjlv), 1 _< j _< f, 1 _<  _< k (see [6]). We introduce indicator 
variables Mir  {0, 1} for the membership of object xi in cluster   {1,..., k}. 
Y].,x Mi = I i  i _< i <_ n enforces the uniqueness constraint for assignments. 
2The maximal standard deviation a* := supe^ /V {h(z; a)} defines the scale to 
measure deviations of the empirical risk from the expected risk (see [2]). 
Model Selection in Clustering by Uniform Convergence Bounds 219 
Using these variables the observed data Z are distributed according to the genera- 
tive model over A' x y: 
i k 
P {xi'YjlM'q) - n .=x Mi"qjl"' (3) 
For the analysis of the unknown data source -- characterized (at least approxima- 
tively) by the empirical data Z -- a structure a - (M, q) with M  {0, 1) xk has 
to be inferred. The aim of an ACM analysis is to group the objects xi as coded by 
the unknown indicator variables Uiv and to estimate for each cluster  a prototyp- 
ical feature distribution qjlv 
k 
Using the loss function h(xi,yj; a) = logn - '-v=x Mir logqjlv the maximiza- 
tion of the likelihood can be formulated as minimization of the empirical risk: 
7(a; Z) = y.in=x y.= rlijh(xi, yj; a), where the essential quantity to be minimized 
is the expected risk: 7(a) = in y.= ptrue {xi,Yj} h(xi,Yj; c 0. Using the max- 
imum entropy principle the following annealing equations are derived [6]: 
ZiLl (Miv)r]iJ __ Z____ 1 (Miv)r]i 
qjlv = Y.in__x {Mir) Y.=, {May) rljli' (4) 
exp [ '.f=l ljli log qjlv ] 
(M/v) = (5) 
The critical temperature: Due to the limited precision of the observed data it 
is natural to study histogram clustering as a learning problem with the hypothesis 
 2  1)A 
class 7/= {-Y.v M,v logqjlv 'M,v  {0,1} A Y.vM,v = 1A qYlv  {?, 7," , 
Y-y qlv = 1}. The limited number of observations results in a limited precision of 
the frequencies jli. The value qJl- = 0 h been excluded since it causes infinite 
expected risk for pte {yj ix i } > 0. The size of the regulized hypothesis class 
can be upper bounded by the cardinality of the complete hypothesis cls divided 
by the minimal cardinality of a 7-function ball centered at a function of the 7-cover 
A,, i.e. IA, I < I1/me 
The cdinity of a hnction ball with radius ff can be appromated by adopting 
techniques from asymptotic anysis [1] (0 (x) = (o 
I(a)l : o .- (yy Ix) log (6) 
 {ql} i,j n j[(i) 
= ' dql"  dO. 
and the entropy $ is given by 
3(q,Q,&) = 7&-ZvvZ,q, lv1)+ 
i Zi log Z. exp 
k 
The auxiliary variables Q = (Q.).= are Lagrge pameters to enforce the nor- 
malizations j qJl. = 1. Choosing qjla = qj[(i)(i)  a, we obtain  approxi- 
mation of the inteal. The reader should note that a saddlepoint appromation in 
220 J. M. Buhmann and M. HeM 
the usual sense is only applicable for the parameter & but will fail for the q, Q param- 
eters since the integrand is maximal at the non-differentiability point of the absolute 
value function. We, therefore, expand q (q, Q, &) up to linear terms (.2 (q - ) and 
integrate piece-wise. 
I l the following saddle 
Using the abbreviation niv := j ptrue {yj [xi } log Os(,) 
point approximation for the integral over  is obtained: 
1 Z- Z k _ exp(-ia) 
7 =  _ ,=lPi"l" with Pi - ,exp(_iu). (8) 
The entropy  evaluated at q = q yields in combination with the Laplace approxi- 
mation [1] an estimate for the cdinality of the 7-cover 
1 
logjam[ = n(logk-)+Zi,pnipPip(ZPini-nip) (9) 
where the second term results from the second order term of the Taylor expansion 
ound the saddle point. serting this complexity in equation (2) yields an equation 
which determines the required number of samples l0 for a fixed precision e d 
confidence . This equation defines & function relationship between the precision 
e and the approximation quity 7 for fixed sample size l0 and confidence . Under 
this sumption the precision e depends on 7 in a non-monotone fhion, i.e. 
e = a  + 2loC + + rC , (10) 
using the abbreviation C = log Avl + log . The minimum of the function e (7) 
defines a compromise between uncertainty originating from empirical fluctuations 
and the loss of precision due to the approximation by a 7-cover. Differentiating 
with respect to 7 d setting the result to zero (de(f)/d 7 = 0) yields  upper 
bound for the inverse temperature: 
1 /0 ( lo+Cr2 ) - 
 -< a* 2n r + x/21oC + r2C 2 (11) 
Analogous to estimates of k-means, phase-transitions occur in ACM while lowering 
the temperature. The mixture model for the data at hand can be partitioned 
into more and more components, revealing finer and finer details of the generation 
process. The critical &opt defines the resolution limit below which details can not 
be resolved in a reliable fashion on the basis of the sample size 10. 
Given the inverse temperature & the effective cardinality of the hypothesis class 
can be upper bounded via the solution of the fix point equation (8). On the other 
hand this cardinality defines with (11) and the sample size 10 an upper bound on &. 
Iterating these two steps we finally obtain an upper bound for the critical inverse 
temperature given a sample size 10. 
Empirical Results: 
For the evaluation of the derived theoretical result a series of Monte-Carlo exper- 
iments on artificial data has been performed for the asymmetric clustering model. 
Given the number of objects n = 30, the number of groups k = 5 and the size of the 
histograms f = 15 the generative model for this experiments was created randomly 
and is summarized in fig. 1. From this generative model sample sets of arbitrary 
size can be generated and the true distributions ptrue {yjlxi } can be calculated. 
In figure 2a,b the predicted temperatures are compared to the empirically observed 
critical temperatures, which have been estimated on the basis of 2000 different sam- 
ples of randomly generated co-occurrence data for each 10. The expected risk (solid) 
Model Selection in Clustering by Uniform Convergence Bounds 221 
' qJl' 
I {0.11, 0.01, 0.11, 0.07, 0.08, 0.04, 0.06, O, 0.13, 0.07, 0.08, 0.1, O, 0.11, 0.03} 
2 {0.18, 0.1, 0.09, 0.02, 0.05, 0.09, 0.08, 0.03, 0.06, 0.07, 0.03, 0.02, 0.07, 0.06, 0.05} 
3 {0.17, 0.05, 0.05, 0.06, 0.06, 0.05, 0.03, 0.11, 0.09, 0, 0.02, 0.1, 0.03, 0.07, 0.11} 
{0.15, 0.07, 0.1, 0.03, 0.09, 0.03, 0.04, 0.05, 0.06, 0.05, 0.08, 0.04, 0.08, 0.09, 0.04} 
5 {0.09, 0.09, 0.07, 0.1, 0.07, 0.06, 0.06, 0.11, 0.07, 0.07, 0.1, 0.02, 0.07, 0.02, 0) 
re(i) ---- (5,3,2,5,2,2,5,4,2,2,2,4,1,5,3,5,3,4,1,2,2,3,1,1,2,5,5,2,2,1) 
Figure 1: Generatire ACM model for the Monte-Carlo experiments. 
and empirical risk (dashed) of these 2000 inferred models are averaged. Overfitting 
sets in when the expected risk rises as a function of the inverse temperature :. 
Figure 2c indicates that on average the minimal expected risk is assumed when the 
effective number is smaller than or equal 5, i.e. the number of clusters of the true 
generatire model. Predicting the right computational temperature, therefore, also 
enables the data analyst to solve the cluster validation problem for the asymmetric 
clustering model. Especially for 10 = 800 the sample fluctuations do not permit the 
estimate of five clusters and the minimal computational temperature prevents such 
an inference result. On the other hand for l0 = 1600 and 10 = 2000 the minimal 
temperature prevents the algorithm to infer too many clusters, which would be an 
instance of overfitting. 
As an interesting point one should note that for an infinite number of observations 
the critical inverse temperature reaches a finite positive value and not more than 
the five effective clusters are extracted. At this point we conclude, that for the case 
of histogram clustering the Empirical Risk Approximation solves for realizable rules 
the problem of model validation, i.e. choosing the right number of clusters. 
Figure 2d summarizes predictions of the critical temperature on the basis of the 
empirical distribution ij rather than the true distribution ptrue (xi, yj). The em- 
pirical distribution has been generated by a training sample set with  of eq. (11) 
being used as a plug-in estimator. The histogram depicts the predicted inverse 
temperature for 10 = 1200. The average of these plug-in estimators is equal to 
the predicted temperature for the true distribution. The estimates of & are biased 
towards too small inverse temperatures due to correlations between the parameter 
estimates and the stopping criterion. It is still an open question and focus of ongo- 
ing work to rigorously bound the variance of this plug-in estimator. 
Empirically we observe a reduction of the variance of the expected risk occurring 
at the predicted temperature for higher sample sizes 10. 
4 Conclusions 
The two conditions that the empirical risk has to uniformly converge towards the 
expected risk and that all loss functions within an 27PP-range of the global empirical 
risk minimum have to be considered in the inference process limits the complexity 
of the underlying hypothesis class for a given number of samples. The maximum 
entropy method which has been widely employed in deterministic annealing proce- 
dures for optimization problems is substantiated by our analysis. Solutions with 
too many clusters clearly overfit the data and do not generalize. The condition that 
the hypothesis class should only be divided in function balls of size q, forces us to 
stop the stochastic search at the lower bound of the computational temperature. 
Another important result of this investigation is the fact that choosing the right 
stopping temperature for the annealing process not only avoids over fitting but also 
solves the cluster validation problem in the realizable case of ACM. A possible 
inference of too many clusters using the empirical risk functional is suppressed. 
222 J. M. Buhmann and M. Held 
8O 
72 
a) ' ,' 1:800exp 
.. Io800emp 
/ .... Io1200emP i 
78 // Io1200emp i 
76 -- ? / 
% 
70 
68 I I I I I 
5 10 15 20 26 30 35 
inverse temperature [ 
80 b) ' ' I !  Io1600exp' ' 
I I -- Io16P 
I I Io2000emp 
78 io2OOOem p . 
72 
68  
0 5 10 15 20 25 30 
inverse temperature [ 
10 i ...... .---  786 
9 :: , ,? ,.- 
 i  i ,-' 7648 
 7 i  ," .' 
. i ' .. 
' 6 ! . : 7436 
,i/.'z _ o=SOO 
0 6 12 18 24 30 
inveme temperature  
35 
:).25 
:).2 
:).15 
:).05 
[1] N. G. De Bruijn. Asymptotic Methods in Analysis. North-Holland Publishing Co., 
(repr. Dover), Amsterdam, 1958, (1981). 
[2] J. M. Buhmann. Empirical risk approximation. Technical Report IAI-TR 98-3, Institut 
fiir Informatik III, Universitit Bonn, 1998. 
[3] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian 
learning using information theory and the VC dimension. Machine Learning, 14(1):83- 
113, 1994. 
[4] D. Haussler, M. Kearns, H.S. Seung, and N. Tishby. Rigorous learning curve bounds 
from statistical mechanics. Machine Learning, 25:195-236, 1997. 
[5] D. Haussler and M. Opper. Mutual information, metric entropy and cumulative relative 
entropy risk. Annals of Statistics, December 1996. 
[6] T. Hofmann, J. Puzicha, and M.I. Jordan. Learning from dyadic data. In M. S. Kearns, 
S. A. Solla, and D. A. Cobh, editors, Advances in Neural Information Processing Sys- 
tems 11. MIT Press, 1999. to appear. 
[7] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from 
examples. Physical Review A, 45(8):6056-6091, April 1992. 
[8] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. 
Springer-Verlag, New York, Berlin, Heidelberg, 1996. 
[9] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998. 
References 
Figure 2: Comparison between the theoretically derived upper bound on  and 
the observed critical temperatures (minimum of the expected risk rs.  curve). 
Depicted are the plots for 10 = 800, 1200, 1600, 2000. Vertical lines indicate the 
predicted critical temperatures. The average effective number of clusters is drawn 
in part c. In part d the distribution of the plug-in estimates is shown for 10 -- 1200. 
