Convergence Properties of the K-Means 
Algorithms 
Lon Bottou 
Neuristique, 
28 rue des Petites Ecuries, 
75010 Paris, France 
leonneurist ique. fr 
Yoshua Bengio* 
Dept. I.R.O. 
Universit de Montrdal 
Montreal, Qc H3C-3J7, Canada 
bengioyiro. umontreal. ca 
Abstract 
This paper studies the convergence properties of the well known 
K-Means clustering algorithm. The K-Means algorithm can be de- 
scribed either as a gradient descent algorithm or by slightly extend- 
ing the mathematics of the EM algorithm to this hard threshold 
case. We show that the K-Means algorithm actually minimizes the 
quantization error using the very fast Newton algorithm. 
I INTRODUCTION 
K-Means is a popular clustering algorithm used in many applications, including the 
initialization of more computationally expensive algorithms (Gaussian mixtures, 
Radial Basis Functions, Learning Vector Quantization and some Hidden Markov 
Models). The practice of this initialization procedure often gives the frustrating 
feeling that K-Means performs most of the task in a small fraction of the overall 
time. This motivated us to better understand this convergence speed. 
A second reason lies in the traditional debate between hard threshold (e.g. K- 
Means, Viterbi Training) and soft threshold (e.g. Gaussian Mixtures, Baum Welch) 
algorithms (Nowlan, 1991). Soft threshold algorithms are often preferred because 
they have an elegant probabilistic framework and a general optimization algorithm 
named EM (expectation-maximization) (Dempster, Laird and Rubin, 1977). In the 
case of a gaussian mixture, the EM algorithm has recently been shown to approxi- 
mate the Newton optimization algorithm (Xu and Jordan, 1994). We prove in this 
*also, AT&T Bell Labs, Holmdel, NJ 07733 
586 l_on Bottou, Yoshua Bengio 
paper that the corresponding hard threshold algorithm, K-Means, minimizes the 
quantization error using exactly the Newton algorithm. 
In the next section, we derive the K-Means algorithm as a gradient descent pro- 
cedure. Section 3 extends the mathematics of the EM algorithm to the case of 
K-Means. This second derivation of K-Means provides us with proper values for 
the learning rates. In section 4 we show that this choice of learning rates opti- 
mally rescales the parameter space using Newton's method. Finally, in section 5 
we present and discuss a few experimental results comparing various versions of the 
K-Means algorithm. The 5 clustering algorithms presented here were chosen for a 
good coverage of the algorithms related to K-Means, but this paper does not have 
the ambition of presenting a literature survey on the subject. 
2 K-MEANS AS A GRADIENT DESCENT 
Given a set of P examples (xi), the K-Means algorithm computes k prototypes 
w = (w) which minimize the quantization error, i.e., the average distance between 
each pattern and the closest prototype: 
1 nn(xi - w)' 
i i 
(1) 
Writing si(w) for the subscript of the closest prototype to example xi, we have 
1 
E(w) = (2) 
i 
2.1 GRADIENT DESCENT ALGORITHM 
We can then derive a gradient descent algorithm for the quantization error: 
OE(w) This leads to the following batch update equation (updating pro- 
AW -- --t Ow 
totypes after presenting all the examples)' 
-wn) ifk=si(w) 
Aw: = y. ;t(xi otherwise. (3) 
$ 
We can also derive a corresponding online algorithm which updates the prototypes 
after the presentation of each pattern xi: 
0/,(zi, 
Aw = --t Ow , i.e., 
AWL={ ;t(xi-- w) ifk=si(w) 
otherwise. 
(4) 
The proper value of the learning rate et remain to be specified in both batch and 
online algorithms. Convergence proofs for both algorithms (Bottou, 1991) exist 
for decreasing values of the learning rates satisfying the conditions  et = c and 
 et 2 < c. Following (Kohonen, 189), we could choose et = eo/t. We prove 
however in this paper that there exist a much better choice of learning rates. 
Convergence Properties of the K-Means Algorithms 587 
3 K-MEANS AS AN EM STYLE ALGORITHM 
3.1 EM STYLE ALGORITHM 
mizes Q(w , w) where w is the previous set of prototypes. 
compute the explicit solution of this minimization problem. 
OQ(w t, w)/Owt, = 0 yields: 
The following derivation of K-Means is similar to the derivation of (MacQueen, 
1967). We insist however on the identity between this derivation and the mathe- 
matics of EM (Liporace, 1976) (Dempster, Laird and Rubin, 1977). 
Although K-Means does not fit in a probabilistic framework, this similarity holds 
for a very deep reason: The semi-ring of probabilies (+, +, x) and the idempo- 
tent semi-ring of hard-threshold scores (, Min, +) share the most significant al- 
gebraic properties (Bacceli, Cohen and Olsder, 1992). This assertion completely 
describes the similarities and the potential differences between soft-threshold and 
hard-threshold algorithms. A complete discussion however stands outside the scope 
of this paper. 
The principle of EM is to introduce additional "hidden" variables whose knowledge 
would make the optimization problem easier. Since these hidden variables are un- 
known, we maximize an auxiliary function which averages over the values of the 
hidden variables given the values of the parameters at the previous iteration. In 
our case, the hidden variables are.the assignments si(w) of the patterns to the pro- 
totypes. Instead of considering the expected value over the distribution on these 
hidden variables, we just consider the values of the hidden variables that minimize 
the cost, given the previous values of the parameters: 
1 
 W t 2 
Q(w', w) der  (x, -- ,() 
i 
The next step consists then in finding a new set of prototypes w t which mini- 
We can analytically 
Solving the equation 
1 
w ;vk  
i: =s(w) 
(5) 
where Nk is the number of examples assigned to prototype w. 
consists in repeatedly replacing w by w t using update equation (6) until convergence. 
Since si(w ) is by definition the best assignment of patterns xi to the prototypes 
w, we have the following inequality: 
1 
(w,) - Q(w',w) =  (, ' , ' - w' , 
- w ( ) - (zi 8() <_0 
i 
Using this result, the identity E(w) = Q(w, w) and the definition of w t, we can 
derive the following inequality: 
(w') - (w) = (w') - Q(w', w) + Q(w', w) - Q(w, w) 
_ Q(w', w) - Q(w, w) _ o 
The algorithm 
Each iteration of the algorithm thus decreases the otherwise positive quantization 
error E (equation 1) until the error reaches a fixed point where condition w * = w* 
is verified (unicity of the minimum of Q(., w*)). Since the assignment functions 
si(w) are discrete, there is an open neighborhood of w* on which the assignments 
are constant. According to their definition, functions E(.) and Q(., w*) are equal 
on this neighborhood. Being the minimum of function Q(., w*), the fixed point w* 
of this algorithm is also a local minimum of the quantization error E. E2 
588 Ldon Bottou, Yoshua Bengio 
3.2 BATCH K-MEANS 
The above algorithm (5) can be rewritten in a form similar to that of the batch 
gradient descent algorithm (3). 
if k = s(zi, w) 
otherwise. (6) 
This algorithm is thus equivalent to a batch gradient descent with a specific, pro- 
1 
totype dependent, learning rate N-' 
3.3 ONLINE K-MEANS 
The online version of our EM style update equation (5) is based on the computation 
of the mean pt of the examples xl,   ', xt with the following recursive formula: 
1 (Zt+l -- 
1 (t t t - Xt+l) : t t - t- 
tt+l -- t+l 
Let us introduce new variables n which count the number of examples so far 
assigned to prototype w. We can then rewrite (5) as an online update applied 
after the presentation of each pattern xi: 
1 ifk = s(xi,w) 
An = 0 otherwise. 
{ l(xi-w,) ifk=s(xi,w) 
Aw = k otherwise. (7) 
This algorithm is equivalent to an online gradient descent (4) with a specific, proto- 
type dependent, learning rate h-' Unlike in the batch case, the pattern assignments 
s(xi, w) are thus changing after each pattern presentation. Before applying this al- 
gorithm, we must of course set n to zero and w to some initial value. Various 
methods have been proposed including initializing w with the first k patterns. 
3.4 CONVERGENCE 
General convergence proofs for the batch and online gradient descent (Bottou, 1991; 
Driancourt, 1994) directly apply for all four algorithms. Although the derivatives 
are undefined on a few points, these theorems prove that the algorithms almost 
surely converge to a local minimum because the local variations of the loss function 
are conveniently bounded (semi-differentiability). Unlike previous results, the above 
convergence proofs allow for non-linearity, non-differentiability (on a few points) 
(Bottou, 1991), and replacing learning rates by a positive definite matrix (Drian- 
court, 1994). 
4 K-MEANS AS A NEWTON OPTIMIZATION 
We prove in this section that Batch K-Means (6) applies the Newton algorithm. 
4.1 THE HESSIAN OF K-MEANS 
Let us compute the Hessian H of the K-Means cost function (2). This matrix 
contains the second derivatives of the cost E(w) with respect to each pair of pa- 
rameters. Since E(w) is a sum of terms L(xi, w), we can decompose H as the sum 
Convergence Properties of the K-Means Algorithms 589 
of matrices Hi for each term of the cost function: 
1 
L(xi, w) = min (xi - wk) 2. 
Furthermore, the Hi can be decomposed in blocks corresponding to each pair of 
prototypes. Since L(xi, w) depends only on the closest prototype to pattern xi, all 
these blocks are zero except block (si(w), si(w)) which is the identity matrix. Sum- 
ming the partial Hessian matrices Hi thus gives a diagonal matrix whose diagonal 
elements are the counts of examples N assigned to each prototype. 
H __ 
o 
N2I 
0 0 
We can thus write the Newton update of the parameters as follows: 
Aw= -H -OE(w) 
0w 
which can be exactly rewritten as the batch EM style algorithm (6) presented earlier: 
(xi-w) ifk=s(xi,w) 
Aw = Z  otherwise. (8) 
4.2 CONVERGENCE SPEED 
When optimizing a quodratic function, Newton's algorithm requires only one step. 
In the case of a non quadratic function, Newton's algorithm is superlinear if we 
can bound the variations of the second derivatives. Standard theorems that bound 
this variation using the third derivative are not useful for K-Means because the 
gradient of the cost function is discontinuous. We could notice that the variations 
of the second derivatives are however nicely bounded and derive similar proofs for 
K-Means. 
For the sake of brevity however, we are just giving here an intuitive argument: we 
can make the cost function indefinitely differentiable by rounding up the angles 
around the non differentiable points. We can even restrict this cost function change 
to an arbitrary small region of the space. The iterations of K-Means will avoid 
this region with a probability arbitrarily close to 1. In practice, we obtain thus a 
superlinear convergence 
Batch K-Means thus searches for the optimal prototypes at Newton speed. Once 
it comes close enough to the optimal prototypes (i.e. the pattern assignment is 
optimal and the cost function becomes quadratic), K-Means jumps to the optimum 
and terminates. 
Online K-Means benefits of these optimal learning rates because they remove the 
usual conditioning problems of the optimization. However, the stochastic noise 
induced by the online procedure limits the final convergence of the algorithm. Final 
convergence speed is thus essentially determined by the schedule of the learning 
rates. 
Online K-Means also benefits from the redundancies of the training set. It converges 
significantly faster than batch K-Means during the first training epochs (Darken 
590 Ldon Bottou, Yoshua Bengio 
KM Cos 
2500 
2&00 
2300 
2200 
2100 
2000 
1900 
1800 
1700 
1600 
1500 
1&00 
1300 
1200 
1100 
1000 
900 
800 
700 
600 
5OO 
&00 
3OO 
2OO 
iO0 
MCos 
-lOO 
I 2 3 4, 5 6 7 8 9 lO 11 12 13 14 15 16 17 18 19 20 
-8O 
-60 
-40 
-20 
Figure 1: Et- E, versus t. black circles: online K-Means; black squares: batch 
K-Means; empty circles: online gradient; empty squares: batch gradient; no mark: 
EM+Gaussian mixture 
and Moody, 1991). After going through the first few patterns (depending of the 
amount of redundancy), online K-Means indeed improves the prototypes as much 
as a complete batch K-Means epoch. Other researchers have compared batch and 
online algorithms for neural networks, with similar conclusions (Bengio, 1991). 
5 EXPERIMENTS 
Experiments have been carried out with Fisher's iris data set, which is composed of 
150 points in a four dimensional space representing physical measurements on var- 
ious species of iris flowers. Codebooks of six prototypes have been computed using 
both batch and online K-Means with the proper learning rates (6) and (7). These 
results are compared with those obtained using both gradient descent algorithms 
(3)' and (4) using learning rate et = 0.03/t that we have found optimal. Results are 
also compared with likelihood maximization with the EM algorithm, applied to a 
mixture of six Gaussians, with fixed and uniform mixture weights, and fixed unit 
variance. Inputs were scaled down empirically so that the average cluster variance 
was around unity. Thus only the cluster positions were learned, as for the K-Means 
algorithms. 
Each run of an algorithm consists in (a) selecting a random initial set of prototypes, 
(b) running the algorithm during 20 epochs and recording the error measure Et after 
each epoch, (c) running the batch K-Means algorithm I during 40 more epochs in 
order to locate the local minimum Eo, corresponding to the current initialization 
of the algorithm. For the four K-Means algorithms, Et is the quantization error 
(equation 1). For the Gaussian mixture trained with EM, the cost Et is the negative 
except for the case of the mixture of Gaussians, in which the EM algorithm was applied 
Convergence Properties of the K-Means Algorithms 591 
1 
0.9 
0.8 
0.7 
0.6 
o.5 
o.& 
o., 
o.1 
I a :3 ,I 5 6 '7 8 9 10 12 la 13 14, 15 16 17 18 19 
Figure 2: Et+-Eoo 
s,-soo versus t. black circles: online K-Means; black squares: batch 
K-Means; empty circles: online gradient; empty squares: batch gradient; no mark: 
EM+Gaussian mixture 
logarithm of the likelihood of the data given the model. 
Twenty trials were run for each algorithm. Using more than twenty runs did not 
improve the standard deviation of the averaged measures because various initializa- 
tions lead to very different local minima. The value Eve of the quantization error 
on the local minima ranges between 3300 and 5800. This variability is caused by 
the different initializations and not by the different algorithms. The average values 
of Eve for each algorithm indeed fall in a very small range (4050 to 4080). 
Figure 1 shows the average value of the residual error Et - Eve during the first 
20 epochs. Online K-Means (black circles) outperforms all other algorithms during 
the first five epochs and stabilizes on a level related to the stochastic noise of the 
online procedure. Batch K-Means (black squares) initially converges more slowly 
but outperforms all other methods after 5 epochs. All 20 runs converged before the 
15th epoch. Both gradients algorithms display poor convergence because they do 
not benefit of the Newton effect. Again, the online version (white circles) starts 
faster then the batch version (white square) but is outperformed in the long run. 
The negative logarithm of the Gaussian mixture is shown on the curve with no 
point marks, and the scale is displayed on the right of Figure 1. 
Figure 2 show the final convergence properties of all five algorithms. The evolutions 
of the ratio (Et+ - Eve)/(Et - Eve) characterize the relative improvement of the 
residual error after each iteration. All algorithms exhibit the same behavior after a 
few epochs except batch K-Means (black squares). The fast convergence of this ratio 
to zero demonstrates the final convergence of batch K-Means. The EM algorithm 
displays a better behavior than all the other algorithms except batch K-Means. 
Clearly, however, its relative improvement ratio doesn't display the fast convergence 
behavior of batch K-Means. 
592 Ldon Bottou, Yoshua Bengio 
The online K-Means curve crosses the batch K-Means curve during the second 
epoch, suggesting that it is better to run the online algorithm (7) during one epoch 
and then switch to the batch algorithm (6). 
6 CONCLUSION 
We have shown with theoretical arguments and simple experiments that a well 
implemented K-Means algorithm minimizes the quantization error using Newton's 
algorithm. The EM style derivation of K-Means shows that the mathematics of EM 
are valid well outside the framework of probabilistic models. Moreover the provable 
convergence properties of the hard threshold K-Means algorithm are superior to 
those of the EM algorithm for an equivalent soft threshold mixture of Gaussians. 
Extending these results to other hard threshold algorithms (e.g. Viterbi Training) 
is an interesting open question. 
References 
Bacceli, F., Cohen, G., and Olsder, G. J. (1992). Synchronization and Linearity. 
Wiley. 
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence 
Recognition. PhD thesis, McGill University, (Computer Sciehce), Montreal, 
Qc., Canada. 
Bottou, L. (1991). Une approche thorique de l'apprentissage connexioniste; ap- 
plications & la reconnaissance de la parole. PhD thesis, Universit4 de Paris 
XI. 
Darken, C. and Moody, J. (1991). Note on learning rate schedules for stochastic 
optimization. In Lippman, R. P., Moody, R., and Touretzky, D. S., editors, 
Advances in Neural Information Processing Systems 3, pages 832-838, Denver, 
CO. Morgan Kaufmann, Palo Alto. 
Dempster, A. P., Laird, N.M., and Rubin, D. B. (1977). Maximum-likelihood from 
incomplete data via the EM algorithm. Journal of Royal Statistical Society B, 
39:1-38. 
Driancourt, X. (1994). Optimisation par descente de gradient stochastique .... PhD 
thesis, Universit de Paris XI, 91405 Orsay cedex, France. 
Kohonen, T. (1989). Self-Organization and Associative Memory. Springer-Verlag, 
Berlin, 3 edition. 
Liporace, L. A. (1976). PTAH on continuous multivariate functions of Markov 
chains. Technical Report 80193, Institute for Defense Analysis, Communication 
Research Department. 
MacQueen, J. (1967). Some methods for classification and analysis of multivariate 
observations. In Proceedings of the Fifth Berkeley Symposium on Mathematics, 
Statistics and Probability, Vol. 1, pages 281-296. 
Nowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algo- 
rithms based on Fitting Statistical Mixtures. CMU-CS-91-126, School of Com- 
puter Science, Carnegie Mellon University, Pittsburgh, PA. 
Xu, L. and Jordan, M. (1994). Theoretical and experimental studies of conver- 
gence properties of the em algorithm for unsupervised learning based on finite 
mixtures. Presented at the Neural Networks for Computing Conference. 
