Agglomerative Information Bottleneck 
Noam Slonim Naftali Tishby*
Institute of Computer Science and 
Center for Neural Computation 
The Hebrew University 
Jerusalem, 91904 Israel 
emaih {nomm,tishby}cs .huj i. ac. il 
Abstract 
We introduce a novel distributional clustering algorithm that maximizes
the mutual information per cluster between data and given categories.
This algorithm can be considered as a bottom-up hard version of the
recently introduced "Information Bottleneck Method". The algorithm is
compared with the top-down soft version of the information bottleneck
method and a relationship between the hard and soft results is
established. We demonstrate the algorithm on the 20 Newsgroups data set.
For a subset of two newsgroups we achieve compression by three orders of
magnitude, losing only 10% of the original mutual information.
1 Introduction 
The problem of self-organization of the members of a set X based on the similarity 
of the conditional distributions of the members of another set, Y, $\{p(y|x)\}$, was first
introduced in [8] and was termed "distributional clustering". 
This question was recently shown in [9] to be a special case of a much more fun- 
damental problem: What are the features of the variable X that are relevant for 
the prediction of another, relevance, variable Y ? This general problem was shown 
to have a natural information theoretic formulation: Find a compressed
representation of the variable X, denoted $\tilde{X}$, such that the mutual
information between $\tilde{X}$ and Y, $I(\tilde{X};Y)$, is as high as possible,
under a constraint on the mutual information between X and $\tilde{X}$,
$I(X;\tilde{X})$. Surprisingly, this variational problem yields exact
self-consistent equations for the conditional distributions $p(y|\tilde{x})$,
$p(x|\tilde{x})$, and $p(\tilde{x})$. This constrained information optimization
problem was called in [9] the Information Bottleneck Method.
The original approach to the solution of the resulting equations, used already in [8], 
was based on an analogy with the "deterministic annealing" approach to clustering 
(see [7]). This is a top-down hierarchical algorithm that starts from a single cluster 
and undergoes a cascade of cluster splits which are determined stochastically (as 
phase transitions) into a "soft" (fuzzy) tree of clusters. 
In this paper we propose an alternative approach to the information bottleneck 
problem, based on a greedy bottom-up merging. It has several advantages over the 
top-down method. It is fully deterministic, yielding (initially) "hard clusters", for 
any desired number of clusters. It gives higher mutual information per-cluster than 
the deterministic annealing algorithm and it can be considered as the hard (zero 
temperature) limit of deterministic annealing, for any prescribed number of clusters. 
Furthermore, using the bottleneck self-consistent equations one can "soften" the 
resulting hard clusters and recover the deterministic annealing solutions without 
the need to identify the cluster splits, which is rather tricky. The main
disadvantage of this method is computational, since it starts from the limit
of one cluster per member of the set X.
1.1 The information bottleneck method 
The mutual information between the random variables X and Y is the symmetric 
functional of their joint distribution, 
$$
I(X;Y) = \sum_{x \in X, y \in Y} p(x,y) \log\left(\frac{p(x,y)}{p(x)p(y)}\right)
       = \sum_{x \in X, y \in Y} p(x)\, p(y|x) \log\left(\frac{p(y|x)}{p(y)}\right) .
\tag{1}
$$
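As an illustration of Eq. (1), the following is a minimal sketch that evaluates the mutual information of a small joint distribution. The function name `mutual_information`, the toy matrix, and the choice of base-2 logarithms (bits) are assumptions made for this example; the paper does not fix the base of the logarithm.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as rows joint[x][y] = p(x, y)."""
    p_x = [sum(row) for row in joint]            # marginal p(x)
    p_y = [sum(col) for col in zip(*joint)]      # marginal p(y)
    total = 0.0
    for x, row in enumerate(joint):
        for y, p_xy in enumerate(row):
            if p_xy > 0.0:                       # terms with p(x, y) = 0 contribute 0
                total += p_xy * math.log2(p_xy / (p_x[x] * p_y[y]))
    return total

# Hypothetical toy joint distribution with |X| = 3 and |Y| = 2 (not from the paper).
toy = [[0.30, 0.05],
       [0.25, 0.05],
       [0.05, 0.30]]
print(mutual_information(toy))
```

Because the functional is symmetric, passing the transposed matrix gives the same value.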
The objective of the information bottleneck method is to extract a compact
representation of the variable X, denoted here by $\tilde{X}$, with minimal loss
of mutual information to another, relevance, variable Y. More specifically, we
want to find a (possibly stochastic) map, $p(\tilde{x}|x)$, that minimizes the
(lossy) coding length of X via $\tilde{X}$, $I(X;\tilde{X})$, under a constraint on
the mutual information to the relevance variable, $I(\tilde{X};Y)$. In other
words, we want to find an efficient representation of the variable X,
$\tilde{X}$, such that the predictions of Y from X through $\tilde{X}$ will be as
close as possible to the direct prediction of Y from X.
As shown in [9], by introducing a positive Lagrange multiplier $\beta$ to
enforce the mutual information constraint, the problem amounts to minimization
of the Lagrangian:
$$
\mathcal{L}[p(\tilde{x}|x)] = I(X;\tilde{X}) - \beta I(\tilde{X};Y) ,
\tag{2}
$$
with respect to $p(\tilde{x}|x)$, subject to the Markov condition
$\tilde{X} \to X \to Y$ and normalization.
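This minimization yields directly the following self-consistent equations for
the map $p(\tilde{x}|x)$, as well as for $p(y|\tilde{x})$ and $p(\tilde{x})$:
$$
\begin{cases}
p(\tilde{x}|x) = \dfrac{p(\tilde{x})}{Z(\beta,x)} \exp\left(-\beta D_{KL}[p(y|x)\,\|\,p(y|\tilde{x})]\right) \\
p(y|\tilde{x}) = \dfrac{1}{p(\tilde{x})} \sum_x p(y|x)\, p(\tilde{x}|x)\, p(x) \\
p(\tilde{x}) = \sum_x p(\tilde{x}|x)\, p(x)
\end{cases}
\tag{3}
$$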
where $Z(\beta,x)$ is a normalization function. The functional
$D_{KL}[p\|q] \equiv \sum_y p(y) \log \frac{p(y)}{q(y)}$ is the Kullback-Leibler
divergence [3], which emerges here from the variational principle. These
equations can be solved by iterations that are proved to converge for any
finite value of $\beta$ (see [9]). The Lagrange multiplier $\beta$ has the
natural interpretation of inverse temperature, which suggests deterministic
annealing [7] to explore the hierarchy of solutions in $\tilde{X}$, an approach
taken already in [8].
The variational principle, Eq. (2), also determines the shape of the annealing
process, since by changing $\beta$ the mutual informations
$I_X \equiv I(X;\tilde{X})$ and $I_Y \equiv I(Y;\tilde{X})$ vary such that
$$
\frac{\delta I_Y}{\delta I_X} = \beta^{-1} .
\tag{4}
$$
Thus the optimal curve, which is analogous to the rate distortion function in
information theory [3], follows a strictly concave curve in the $(I_X, I_Y)$
plane, called the information plane. Deterministic annealing, at fixed number
of clusters, follows such a concave curve as well, but this curve is suboptimal
beyond a certain critical value of $\beta$.
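The iterative solution of the self-consistent equations, Eqs. (3), can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the index `t` standing for $\tilde{x}$, the fixed iteration count, and the use of natural logarithms are all assumptions, and the sketch assumes a strictly positive joint distribution and a strictly positive initial map so that no division by zero occurs.

```python
import math

def kl_divergence(p, q):
    """D_KL[p || q] in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def bottleneck_iterations(joint, init_map, beta, n_iter=100):
    """Iterate Eqs. (3), starting from init_map[x][t] = p(t|x), t standing for x-tilde."""
    n_x, n_y, n_t = len(joint), len(joint[0]), len(init_map[0])
    p_x = [sum(row) for row in joint]
    p_y_x = [[v / p_x[x] for v in joint[x]] for x in range(n_x)]
    p_t_x = [row[:] for row in init_map]
    for _ in range(n_iter):
        # p(t) = sum_x p(t|x) p(x)
        p_t = [sum(p_t_x[x][t] * p_x[x] for x in range(n_x)) for t in range(n_t)]
        # p(y|t) = (1 / p(t)) sum_x p(y|x) p(t|x) p(x)
        p_y_t = [[sum(p_y_x[x][y] * p_t_x[x][t] * p_x[x] for x in range(n_x)) / p_t[t]
                  for y in range(n_y)] for t in range(n_t)]
        # p(t|x) = p(t) exp(-beta * D_KL[p(y|x) || p(y|t)]) / Z(beta, x)
        for x in range(n_x):
            w = [p_t[t] * math.exp(-beta * kl_divergence(p_y_x[x], p_y_t[t]))
                 for t in range(n_t)]
            z = sum(w)                           # normalization Z(beta, x)
            p_t_x[x] = [v / z for v in w]
    return p_t_x

# Hypothetical toy joint distribution and soft initial map (not from the paper).
toy = [[0.30, 0.05], [0.25, 0.05], [0.05, 0.30]]
start = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
print(bottleneck_iterations(toy, start, beta=5.0))
```

At $\beta = 0$ the exponent vanishes, so every row of the map collapses to the same distribution $p(\tilde{x})$, which is the fully compressed end of the information plane.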
Another interpretation of the bottleneck principle comes from the relation between 
the mutual information and Bayes classification error. This error is bounded above 
and below (see [6]) by an important information theoretic measure of the class
conditional distributions $p(x|y_i)$, called the Jensen-Shannon divergence. This measure
plays an important role in our context. 
The Jensen-Shannon divergence of M class distributions, $p_i(x)$, each with a
prior $\pi_i$, $1 \le i \le M$, is defined as [6, 4]
$$
JS_\Pi[p_1, p_2, \ldots, p_M] \equiv H\Big[\sum_{i=1}^{M} \pi_i p_i(x)\Big] - \sum_{i=1}^{M} \pi_i H[p_i(x)] ,
\tag{5}
$$
where $H[p(x)]$ is Shannon's entropy, $H[p(x)] = -\sum_x p(x) \log p(x)$. The
concavity of the entropy and Jensen's inequality guarantee the non-negativity
of the JS-divergence.
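Eq. (5) can be evaluated directly from its definition. The sketch below is illustrative only; the function names and the use of base-2 logarithms are assumptions of this example.

```python
import math

def entropy(p):
    """Shannon entropy H[p] in bits; zero-probability entries contribute 0."""
    return -sum(v * math.log2(v) for v in p if v > 0.0)

def js_divergence(dists, priors):
    """JS divergence of Eq. (5): H[sum_i pi_i p_i] - sum_i pi_i H[p_i]."""
    n = len(dists[0])
    mixture = [sum(w * d[j] for w, d in zip(priors, dists)) for j in range(n)]
    return entropy(mixture) - sum(w * entropy(d) for w, d in zip(priors, dists))

# Two distributions with disjoint support and equal priors are maximally
# separated; two identical distributions have zero divergence.
print(js_divergence([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]))
print(js_divergence([[0.3, 0.7], [0.3, 0.7]], [0.5, 0.5]))
```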
1.2 The hard clustering limit 
For any finite cardinality of the representation $|\tilde{X}| \equiv m$ the
limit $\beta \to \infty$ of Eqs. (3) induces a hard partition of X into m
disjoint subsets. In this limit each member $x \in X$ belongs only to the
subset $\tilde{x} \in \tilde{X}$ for which $p(y|\tilde{x})$ has the smallest
$D_{KL}[p(y|x)\|p(y|\tilde{x})]$, and the probabilistic map $p(\tilde{x}|x)$
obtains the limit values 0 and 1 only.
In this paper we focus on a bottom-up agglomerative algorithm for generating
"good" hard partitions of X. We denote an m-partition of X, i.e. $\tilde{X}$
with cardinality m, also by $Z_m = \{z_1, z_2, \ldots, z_m\}$, in which case
$p(\tilde{x}) = p(z_i)$. We say that $Z_m$ is an optimal m-partition (not
necessarily unique) of X if for every other m-partition of X, $Z'_m$,
$I(Z_m;Y) \ge I(Z'_m;Y)$. Starting from the trivial N-partition, with $N = |X|$,
we seek a sequence of merges into coarser and coarser partitions that are as close 
as possible to optimal. 
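It is easy to verify that in the $\beta \to \infty$ limit Eqs. (3) for the
m-partition distributions are simplified as follows. Let
$\tilde{x} \equiv z = \{x_1, x_2, \ldots, x_{|z|}\}$, $x_i \in X$, denote a
specific component (i.e. cluster) of the partition $Z_m$, then
$$
\begin{cases}
p(z|x) = \begin{cases} 1 & \text{if } x \in z \\ 0 & \text{otherwise} \end{cases} & \forall x \in X \\
p(y|z) = \dfrac{1}{p(z)} \sum_{i=1}^{|z|} p(x_i, y) & \forall y \in Y \\
p(z) = \sum_{i=1}^{|z|} p(x_i)
\end{cases}
\tag{6}
$$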
Using these distributions one can easily evaluate the mutual information between 
Zm and Y, I(Zm; Y), and between Zm and X, I(Zm; X), using Eq.(1). 
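A minimal sketch of Eq. (6) and of the two information values follows. The function names and the representation of a partition as lists of row indices are assumptions of this example. For a hard partition Z is a deterministic function of X, so $I(Z_m;X)$ reduces to the entropy $H(Z_m)$, which is what the sketch computes.

```python
import math

def partition_distributions(joint, partition):
    """Eq. (6): p(z) and p(y|z) for a hard partition given as lists of x indices."""
    p_z, p_y_given_z = [], []
    for cluster in partition:
        p_zy = [sum(joint[x][y] for x in cluster) for y in range(len(joint[0]))]
        pz = sum(p_zy)                                   # p(z) = sum_i p(x_i)
        p_z.append(pz)
        p_y_given_z.append([v / pz for v in p_zy])       # p(y|z)
    return p_z, p_y_given_z

def information_about_y(p_z, p_y_given_z):
    """I(Z;Y) in bits, using the second form of Eq. (1)."""
    p_y = [sum(pz * cond[y] for pz, cond in zip(p_z, p_y_given_z))
           for y in range(len(p_y_given_z[0]))]
    return sum(pz * c * math.log2(c / p_y[y])
               for pz, cond in zip(p_z, p_y_given_z)
               for y, c in enumerate(cond) if c > 0.0)

def information_about_x(p_z):
    """I(Z;X) in bits; equals H(Z) because the map p(z|x) is deterministic."""
    return -sum(v * math.log2(v) for v in p_z if v > 0.0)

# Hypothetical toy joint distribution and a 2-partition of its three rows.
toy = [[0.30, 0.05], [0.25, 0.05], [0.05, 0.30]]
p_z, p_y_given_z = partition_distributions(toy, [[0, 1], [2]])
print(information_about_y(p_z, p_y_given_z), information_about_x(p_z))
```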
Once any hard partition, or hard clustering, is obtained one can apply "reverse
annealing" and "soften" the clusters by decreasing $\beta$ in the self-consistent
equations, Eqs. (3). Using this procedure we in fact recover the stochastic map, $p(\tilde{x}|x)$, from
the hard partition without the need to identify the cluster splits. We demonstrate 
this reverse deterministic annealing procedure in the last section. 
1.3 Relation to other work 
A similar agglomerative procedure, without the information theoretic framework 
and analysis, was recently used in [1] for text categorization on the 20 newsgroup 
corpus. Another approach that stems from the distributional clustering
algorithm was given in [5] for clustering dyadic data. An earlier application
of mutual information to semantic clustering of words appeared in [2].
2 The agglomerative information bottleneck algorithm 
The algorithm starts with the trivial partition into $N = |X|$ clusters or
components, where each component contains exactly one element of X. At each
step we merge several components of the current partition into a single new
component in a way that locally minimizes the loss of mutual information
$I(\tilde{X};Y) = I(Z_m;Y)$.
Let Z, be the current m-partition of X and Zm denote the new -partition 
of X after the merge of several components of Zm. Obviously,  < m. Let 
{zi,z2,...,zk} C_ Zm denote the set of components to be merged, and 2k  Zm 
the new component that is generated by the merge, so h = m - k + 1. 
To evaluate the reduction in the mutual information I(Zm; Y) due to this merge one 
needs the distributions that define the new h-partition, which are determined as 
follows. For every z e Zm,z  , its probability distributions (p(z),p(y[z),p(zlx)) 
remains equal to its distributions in Zm. For the new component, 2  Zm, we 
define, 
= 5-.i= p(zi) 
() -i= p(zi,y) Vy e Y (7) 
1 ifxziforsomel<i<k VxX 
p(2lx)- 0 otherwise 
It is easy to verify that Zm is indeed a valid h-partition with proper probability 
distributions. 
Using the same notation, for every merge we define the additional quantities:
• The merge prior distribution: defined by $\Pi_k \equiv (\pi_1, \pi_2, \ldots, \pi_k)$,
where $\pi_i$ is the prior probability of $z_i$ in the merged subset, i.e.
$\pi_i \equiv p(z_i)/p(\bar{z}_k)$, with $\bar{z}_k$ the merged component.
• The Y-information decrease: the decrease in the mutual information
$I(\tilde{X};Y)$ due to a single merge,
$\delta I_Y(z_1, \ldots, z_k) \equiv I(Z_m;Y) - I(\bar{Z}_{\bar{m}};Y)$.
• The X-information decrease: the decrease in the mutual information
$I(\tilde{X};X)$ due to a single merge,
$\delta I_X(z_1, z_2, \ldots, z_k) \equiv I(Z_m;X) - I(\bar{Z}_{\bar{m}};X)$.
Our algorithm is a greedy procedure, where in each step we perform "the best
possible merge", i.e. merge the components $\{z_1, \ldots, z_k\}$ of the current
m-partition which minimize $\delta I_Y(z_1, \ldots, z_k)$. Since
$\delta I_Y(z_1, \ldots, z_k)$ can only increase with k (corollary 2), for a
greedy procedure it is enough to check only the possible merging of pairs of
components of the current m-partition. Another advantage of merging only pairs
is that in this way we go through all the possible cardinalities of
$Z = \tilde{X}$, from N to 1.
For a given m-partition $Z_m = \{z_1, z_2, \ldots, z_m\}$ there are
$\frac{m(m-1)}{2}$ possible pairs to merge. To find "the best possible merge"
one must evaluate the reduction of information
$\delta I_Y(z_i, z_j) = I(Z_m;Y) - I(Z_{m-1};Y)$ for every pair in $Z_m$, which
is $O(m \cdot |Y|)$ operations for every pair. However, using proposition 1 we
know that
$\delta I_Y(z_i, z_j) = (p(z_i) + p(z_j)) \cdot JS_{\Pi_2}[p(Y|z_i), p(Y|z_j)]$,
so the reduction in the mutual information due to the merge of $z_i$ and $z_j$
can be evaluated directly (looking only at this pair) in $O(|Y|)$ operations, a
reduction of a factor of m in time complexity (for every merge).
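The pairwise identity above can be checked numerically. The sketch below compares the information lost by an explicit merge with the local Jensen-Shannon expression; the three-cluster partition, the function names, and the use of bits are hypothetical choices for this example.

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(v * math.log2(v) for v in p if v > 0.0)

def information_about_y(p_z, p_y_given_z):
    """I(Z;Y) = H(Y) - sum_z p(z) H(Y|z), in bits."""
    p_y = [sum(pz * cond[y] for pz, cond in zip(p_z, p_y_given_z))
           for y in range(len(p_y_given_z[0]))]
    return entropy(p_y) - sum(pz * entropy(c) for pz, c in zip(p_z, p_y_given_z))

def merge_cost(pz_i, py_i, pz_j, py_j):
    """(p(z_i) + p(z_j)) * JS[p(Y|z_i), p(Y|z_j)] with priors pi = p(z)/p(merged)."""
    pz = pz_i + pz_j
    pi_i, pi_j = pz_i / pz, pz_j / pz
    mixture = [pi_i * a + pi_j * b for a, b in zip(py_i, py_j)]
    js = entropy(mixture) - pi_i * entropy(py_i) - pi_j * entropy(py_j)
    return pz * js

# Hypothetical 3-cluster partition: p(z) and p(y|z) for each cluster.
p_z = [0.5, 0.3, 0.2]
p_y_given_z = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]

before = information_about_y(p_z, p_y_given_z)
merged = [(0.5 * a + 0.3 * b) / 0.8 for a, b in zip(p_y_given_z[0], p_y_given_z[1])]
after = information_about_y([0.8, 0.2], [merged, p_y_given_z[2]])
direct = before - after                                   # needs the whole partition
local = merge_cost(0.5, p_y_given_z[0], 0.3, p_y_given_z[1])  # needs only the pair
print(direct, local)
```

The local form touches only the two merged components, which is where the factor-of-m saving comes from.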
Input: Empirical probability matrix $p(x,y)$, $N = |X|$, $M = |Y|$
Output: $Z_m$: m-partition of X into m clusters, for every $1 \le m \le N$
Initialization:
• Construct $Z \equiv X$
  - For i = 1...N
    * $z_i = \{x_i\}$
    * $p(z_i) = p(x_i)$
    * $p(y|z_i) = p(y|x_i)$ for every $y \in Y$
    * $p(z_i|x_j) = 1$ if j = i and 0 otherwise
  - $Z = \{z_1, \ldots, z_N\}$
• For every i, j = 1...N, i < j, calculate
  $d_{i,j} = (p(x_i) + p(x_j)) \cdot JS_{\Pi_2}[p(y|x_i), p(y|x_j)]$
  (every $d_{i,j}$ points to the corresponding couple in Z)
Loop:
• For t = 1...(N - 1)
  - Find $\{\alpha, \beta\} = \arg\min_{i,j} \{d_{i,j}\}$
    (if there are several minima choose arbitrarily between them)
  - Merge $\{z_\alpha, z_\beta\} \Rightarrow \bar{z}$:
    * $p(\bar{z}) = p(z_\alpha) + p(z_\beta)$
    * $p(y|\bar{z}) = \frac{1}{p(\bar{z})} (p(z_\alpha, y) + p(z_\beta, y))$ for every $y \in Y$
    * $p(\bar{z}|x) = 1$ if $x \in z_\alpha \cup z_\beta$ and 0 otherwise, for every $x \in X$
  - Update $Z = \{Z - \{z_\alpha, z_\beta\}\} \cup \{\bar{z}\}$
    (Z is now a new (N - t)-partition of X with N - t clusters)
  - Update $d_{i,j}$ costs and pointers w.r.t. $\bar{z}$
    (only for couples that contained $z_\alpha$ or $z_\beta$).
• End For
Figure 1: Pseudo-code of the algorithm. 
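The pseudo-code of Figure 1 can be turned into a short runnable sketch. This is one possible reading under stated assumptions, not the authors' implementation: the function names, the dictionary-based bookkeeping of the pair costs, the integer cluster identifiers, and the toy input are all hypothetical, and every row of the input is assumed to have non-zero probability.

```python
import math
from itertools import combinations

def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0.0)

def merge_cost(pz_a, py_a, pz_b, py_b):
    """d = (p(z_a) + p(z_b)) * JS[p(y|z_a), p(y|z_b)] with priors p(z)/p(merged)."""
    pz = pz_a + pz_b
    mixture = [(pz_a * u + pz_b * v) / pz for u, v in zip(py_a, py_b)]
    js = entropy(mixture) - (pz_a * entropy(py_a) + pz_b * entropy(py_b)) / pz
    return pz * js

def agglomerative_ib(joint):
    """Return the hierarchy of partitions, from N singletons down to one cluster."""
    n = len(joint)
    members = {i: [i] for i in range(n)}                     # z_i = {x_i}
    p_z = {i: sum(joint[i]) for i in range(n)}               # p(z_i) = p(x_i)
    p_y = {i: [v / p_z[i] for v in joint[i]] for i in range(n)}
    cost = {(i, j): merge_cost(p_z[i], p_y[i], p_z[j], p_y[j])
            for i, j in combinations(range(n), 2)}
    hierarchy = [sorted(members.values())]
    next_id = n
    while len(members) > 1:
        a, b = min(cost, key=cost.get)                       # ties broken arbitrarily
        pz = p_z[a] + p_z[b]
        merged_py = [(p_z[a] * u + p_z[b] * v) / pz for u, v in zip(p_y[a], p_y[b])]
        merged_members = sorted(members[a] + members[b])
        for c in (a, b):
            del members[c], p_z[c], p_y[c]
        # Drop costs that involve the merged components, then add costs
        # between the new component and every remaining one.
        cost = {pair: d for pair, d in cost.items() if a not in pair and b not in pair}
        for c in members:
            cost[(c, next_id)] = merge_cost(p_z[c], p_y[c], pz, merged_py)
        members[next_id], p_z[next_id], p_y[next_id] = merged_members, pz, merged_py
        next_id += 1
        hierarchy.append(sorted(members.values()))
    return hierarchy

# Hypothetical toy input: rows 0 and 1 have similar p(y|x), as do rows 2 and 3.
toy = [[0.20, 0.05], [0.18, 0.07], [0.05, 0.20], [0.06, 0.19]]
for partition in agglomerative_ib(toy):
    print(len(partition), partition)
```

Finding the minimum by scanning all pair costs keeps the sketch short; a priority queue would be a natural replacement for larger N.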
3 Discussion 
The algorithm is non-parametric: it is a simple greedy procedure that depends
only on the input empirical joint distribution of X and Y. The output of the
algorithm is the hierarchy of all m-partitions $Z_m$ of X for
m = N, (N - 1), ..., 2, 1. Moreover, unlike most other clustering heuristics,
it has a built-in measure of efficiency even for sub-optimal solutions, namely,
the mutual information $I(Z_m;Y)$, which bounds the Bayes classification error.
The quality measure of the obtained $Z_m$ partition is the fraction of the
mutual information between X and Y that $Z_m$ captures. This is given by the
curve $\frac{I(Z_m;Y)}{I(X;Y)}$ vs. $m = |Z_m|$. We found that empirically this
curve was concave. If this is always true, the decrease in the mutual
information at every step, given by
$\delta(m) \equiv \frac{I(Z_m;Y) - I(Z_{m-1};Y)}{I(X;Y)}$, can only increase with
decreasing m. Therefore, if at some point $\delta(m)$ becomes relatively high it
is an indication that we have reached a value of m with a "meaningful"
partition or clusters. Further
merging results in substantial loss of information and thus significant reduction in 
the performance of the clusters as features. However, since the computational cost 
of the final (low m) part of the procedure is very low we can just as well complete 
the merging to a single cluster. 
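The quality curve and the per-step decrease $\delta(m)$ described above can be computed from the sequence of $I(Z_m;Y)$ values. The sketch below is illustrative; the function names are assumptions and the numbers are made-up placeholders, not results from the paper.

```python
def normalized_curve(info_by_m, total_info):
    """Map m -> I(Z_m;Y) / I(X;Y)."""
    return {m: value / total_info for m, value in info_by_m.items()}

def information_drops(info_by_m, total_info):
    """Map m -> delta(m) = (I(Z_m;Y) - I(Z_{m-1};Y)) / I(X;Y) where both values exist."""
    return {m: (info_by_m[m] - info_by_m[m - 1]) / total_info
            for m in info_by_m if m - 1 in info_by_m}

# Placeholder values of I(Z_m;Y) for m = 4, 3, 2, 1 (hypothetical, in bits).
info_by_m = {4: 0.40, 3: 0.39, 2: 0.36, 1: 0.0}
total_info = info_by_m[4]          # the trivial N-partition retains all of I(X;Y)
print(normalized_curve(info_by_m, total_info))
print(information_drops(info_by_m, total_info))
```

In this made-up example the drop at m = 2 dominates the others, which is the kind of jump the text interprets as a sign of a "meaningful" number of clusters.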
Figure 2: On the left, the results of the agglomerative algorithm are shown in the
"information plane", normalized $I(Z;Y)$ vs. normalized $I(Z;X)$, for the NG1000 dataset.
It is compared to the soft version of the information bottleneck via "reverse annealing"
for $|Z| = 2, 5, 10, 20, 100$ (the smooth curves on the left). For $|Z| = 20, 100$ the
annealing curve is connected to the starting point by a dotted line. In this plane the
hard algorithm is clearly inferior to the soft one.
On the right-hand side, $I(Z_m;Y)$ of the agglomerative algorithm is plotted vs. the
cardinality of the partition m for three subsets of the newsgroup dataset. To compare the
performance over the different data cardinalities we normalize $I(Z_m;Y)$ by the value of
$I(Z_{50};Y)$, thus forcing all three curves to start (and end) at the same points. The
predictive information on the newsgroup for NG1000 and NG100 is very similar, while for
the dichotomy dataset, 2ng, a much better prediction is possible at the same $|Z|$, as can
be expected for dichotomies. The inset presents the full curve of the normalized $I(Z;Y)$
vs. $|Z|$ for NG100 data for comparison. In this plane the hard partitions are superior to
the soft ones.
4 Application 
To evaluate the ideas and the algorithm we apply it to several subsets of the
20Newsgroups dataset, collected by Ken Lang using 20,000 articles evenly
distributed among 20 UseNet discussion groups (see [1]). We replaced every
digit by a single character and every non-alphanumeric character by another.
Following this pre-processing, the first dataset contained the 530 strings that
appeared more than 1000 times in the data. This dataset is referred to as
NG1000. Similarly, all the strings that appeared more than 100 times constitute
the NG100 dataset, which contains 5148 different strings. To also evaluate
dichotomy data we used a corpus consisting of only two discussion groups out of
the 20Newsgroups with similar topics: alt.atheism and talk.religion.misc. Using
the same pre-processing, and removing strings that occur fewer than 10 times,
the resulting "lexicon" contained 5765 different strings. We refer to this
dataset as 2ng.
We plot the results of our algorithm on these three data sets in two different
planes. First, the normalized information $\frac{I(Z;Y)}{I(X;Y)}$ vs. the size of
the partition of X (number of clusters), $|Z|$. The greedy procedure directly
tries to maximize $I(Z;Y)$ for a given $|Z|$, as can be seen by the strong
concavity of these curves (figure 2, right). Indeed the procedure is able to
maintain a high percentage of the relevant mutual information of the original
data, while reducing the dimensionality of the "features", $|Z|$, by several
orders of magnitude.
On the right-hand side of figure 2 we present a comparison between the
efficiency of the procedure for the three datasets. The two-class data,
consisting of 5765 different strings, is compressed by two orders of magnitude,
into 50 clusters, almost without losing any of the mutual information about the
news groups (the decrease in $I(\tilde{X};Y)$ is about 0.1%). Compression by
three orders of magnitude, into 6 clusters, maintains about 90% of the original
mutual information.
Similar results, even though less striking, are obtained when Y contains all 20
newsgroups. The NG100 dataset was compressed from 5148 strings to 515 clusters,
keeping 86% of the mutual information, and into 50 clusters keeping about 70% 
of the information. About the same compression efficiency was obtained for the 
NG1000 dataset. 
The relationship between the soft and hard clustering is demonstrated in the
information plane, i.e., the normalized mutual information values,
$\frac{I(Z;Y)}{I(X;Y)}$ vs. $\frac{I(Z;X)}{H(X)}$. In this plane, the soft
procedure is optimal since it is a direct maximization of $I(Z;Y)$ while
constraining $I(Z;X)$. While the hard partition is suboptimal in this plane, as
confirmed empirically, it provides an excellent starting point for reverse
annealing. In figure 2 we present the results of the agglomerative procedure
for NG1000 in the information plane, together with the reverse annealing for
different values of $|Z|$. As predicted by the theory, the annealing curves
merge at various critical values of $\beta$ into the globally optimal curve,
which corresponds to the "rate distortion function" for the information
bottleneck problem. With the reverse annealing ("heating") procedure there is
no need to identify the cluster splits as required in the original annealing
("cooling") procedure. As can be seen, the "phase diagram" is much better
recovered by this procedure, suggesting a combination of agglomerative
clustering and reverse annealing as the ultimate algorithm for this problem.
References 
[1] L. D. Baker and A. K. McCallum. Distributional Clustering of Words for Text
Classification. In ACM SIGIR 98, 1998.
[2] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. DellaPietra, and J. C. Lai.
Class-based n-gram models of natural language. In Computational Linguistics,
18(4):467-479, 1992.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley &
Sons, New York, 1991.
[4] R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian
sequences. In Advances in Neural Information Processing (NIPS'97), 1998.
[5] T. Hofmann, J. Puzicha, and M. Jordan. Learning from dyadic data. In Advances
in Neural Information Processing (NIPS'98), 1999.
[6] J. Lin. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on
Information Theory, 37(1):145-151, 1991.
[7] K. Rose. Deterministic Annealing for Clustering, Compression, Classification,
Regression, and Related Optimization Problems. Proceedings of the IEEE,
86(11):2210-2239, 1998.
[8] F. C. Pereira, N. Tishby, and L. Lee. Distributional clustering of English
words. In 30th Annual Meeting of the Association for Computational Linguistics,
Columbus, Ohio, pages 183-190, 1993.
[9] N. Tishby, W. Bialek, and F. C. Pereira. The information bottleneck method:
Extracting relevant information from concurrent data. Yet unpublished
manuscript, NEC Research Institute TR, 1998.
