Unsupervised On-Line Learning of 
Decision Trees for Hierarchical Data 
Analysis 
Marcus Held and Joachim M. Buhmann 
Rheinische Friedrich-Wilhelms-Universit/t 
Institut ffir Informatik III, RSmerstrafie 164 
D-53117 Bonn, Germany 
emaih (held, jb).cs.uni-bonn.de 
WWW: http://www-dbv. cs. uni-bonn. de 
Abstract 
An adaptive on-line algorithm is proposed to estimate hierarchical 
data structures for non-stationary data sources. The approach 
is based on the principle of minimum cross entropy to derive a 
decision tree for data clustering and it employs a metalearning idea 
(learning to learn) to adapt to changes in data characteristics. Its 
efficiency is demonstrated by grouping non-stationary artifical data 
and by hierarchical segmentation of LANDSAT images. 
i Introduction 
Unsupervised learning addresses the problem to detect structure inherent in un- 
labeled and unclassified data. The simplest, but not necessarily the best ap- 
proach for extracting a grouping structure is to represent a set of data samples 
2d- xi lRli- 1,...,N t by aset of prototypes J?- y lRlc- 1,...,K1, 
K  N. The encoding usually is represented by an assignment matrix M - (Mi), 
where Mi, - I if and only if xi belongs to cluster c, and Mi, - 0 otherwise. Accord- 
1 iN__l Micl)(xi, yc) 
ing to this encoding scheme, the cost function 7/(M, J?) -  
measures the quality of a data partition, i.e., optimal assignments and prototypes 
(M, )])opt = arg minM,y 7/(M, J?) minimize the inhomogeneity of clusters w.r.t. a 
given distance measure/). For reasons of simplicity we restrict the presentation 
to the'sum-offsquared-error criterion/)(x, y) = IIx - yll 2 in this paper. To fa- 
cilitate this minimization a deterministic annealing approach was proposed in [5] 
which maps the discrete optimization problem, i.e. how to determine the data as- 
signments, via the Maximum Entropy Principle [2] to a continuous parameter es- 
Unsupervised On-line Learning of Decision Trees for Data Analysis 515 
timation problem. Deterministic annealing introduces a Lagrange multiplier/ to 
control the approximation of 7/(M, Y) in a probabilistic sense. Equivalently to 
maximize the entropy at fixed expected K-means costs we minimize the free energy 
i /N=i In (y.uK__ 1 exp (--//)(xi, ya))) w.r.t. the prototypes ya. The assign- 
r= 
ments Mi, are treated as random variables yielding a fuzzy centroid rule 
(1) 
where the expected assignments {Mi,) are given by Gibbs distributions 
exp(-l)(xi,ya)) (2) 
{Mia) = K ' 
Eu= exp (-fid (xi,ya)) 
For a more detailed discussion of the DA approach to data clustering cf. [1, 3, 5]. 
In addition to assigning data to clusters (1,2), hierarchical clustering provides the 
partitioning of data space with a tree structure. Each data sample x is sequentially 
assigned to a nested structure of partitions which hierarchically cover the data 
space lRa. This sequence of special decisions is encoded by decision rules which are 
attached to nodes along a path in the tree (see also fig. 1). 
Therefore, learning a decision tree requires to determine a tree topology, the accom- 
panying assignments, the inner node labels $ and the prototypes y at the leaves. 
The search of such a hierarchical partition of the data space should be guided by 
an optimization criterion, i.e., minimal distortion costs. 
This problem is solvable by a two-stage approach, which on the one hand minimizes 
the distortion costs at the leaves given the tree structure and on the other hand 
optimizes the tree structure given the leaf induced partition of lRa. This approach, 
due to Miller & Rose [3], is summarized in section 2. The extensions for adaptive on- 
line learning and experimental results are described in sections 3 and 4, respectively. 
x 
s s 
s s 
'x / \ 
b c d e f 
partition 
of data space 
Figure 1: Right: Topology of a decision tree. Left: Induced partitioning of the 
data space (positions of the letters also indicate the positions of the prototypes). 
Decisions are made according to the nearest neighbor rule. 
2 Unsupervised Learning of Decision Trees 
Deterministic annealing of hierarchical clustering treats the assignments of data to 
inner nodes of the tree in a probabilistic way analogous to the expected assignments 
of data to leaf prototypes. Based on the maximum entropy principle, the probability 
q.U. that data point xi reaches inner node sj is recursively defined by (see [3])' 
H U. H exp(--"/T)(xi,sj)) (3) 
/,root :-- 1, "J -- i'parent(j)7ri'J' 7ri'J =  exp (--77) (xi, sk))' 
kEsiblings(j) 
516 M. Held and J. M. Buhmann 
where the Lagrange multiplier - controls the fuzziness of all the transitions ri,j. 
On the other hand, given the tree topology and the prototypes at the leaves, the 
maximum entropy principle naturally recommends an ideal probability ,a at leaf 
y,, resp. at an inner node s j, 
exp (-fT) (xi, ya) ) and (I)Ii,j - )  'i,k' (4) 
,a = y]. exp(--fD(xi,yt)) 
/EY kEdescendants(j) 
We apply the principle of minimum cross entropy for the calculation of the proto- 
types at the leaves given a priori the probabilities for the parents of the leaves. Min- 
imization of the cross entropy with fixed expected costs (Hx,) =  (Mi,)T)(xi, y,) 
for the data point xi yields the expression 
min Z ({(Mia)} H ) 
II{i,parent()/tf} -- min y.(Mi.)In .H , (5) 
{(Mi)) {(Mi)) ,parent(.) 
where 27 denotes the Kullback-Leibler divergence and K defines the degree of the 
inner nodes. The tilted distribution 
(Mi() -- i,parent(c,) exp (-fiT) (xi, y.)) (6) 
Et  /H, parent(t ) exp (--fT)(xi,Yt))' 
generalizes the probabilistic assignments (2). In the case of Euclidinn distances 
we again obtain the centroid formula (1) as the minimum of the free energy 
' -- -- E//V=i In [Y-aeY /H, parent(-)exp(--fT)(xi,y,))]. Constraints induced by 
the tree structure are incorporated in the assignments (6). For the optimization 
of the hierarchy, Miller and Rose in a second step propose the minimization of the 
distance between the hierarchical probabilities .. and the ideal probabilities !,., 
the distance being measured by the Kullback-Lei)ler divergence 
N I 
(I)i, j 
min  27 ({!,j}ll{.H,j}) -- min E E ,J In .H.' (7) 
ff'$ %$ sj parent(y) i----1 
s parent(y) 
Equation (7) describes the minimization of the sum of cross entropies between the 
probability densities !. and .. over the parents of the leaves. Calculating the 
gradients for the inner hodes sj and the Lagrange multiplier ff we receive 
N N 
0 Z ---- --2y (i- Sj) (I)I --(I)i,parent(j)7ri, j '-- --2E A 1 (i,sj) (8) 
i,j , 
08j i=1 i=1 
N N 
I 
0 27 __ T)(i, Sj) {ii,j _i,parentrj)7ri,j} ;__ EA2(i, Sj) ' (9) 
07 
i=! j$ i=1 jE$ 
The first gradient is a weighted average of the difference vectors (xi - s j), where the 
weights measure the mismatch between the probability ,j and the probability in- 
duced by the transition ri,j. The second gradient (9) measures the scale - T) (xi, s j) 
- on which the transition probabilities are defined, and weights them with the mis- 
match between the ideal probabilities. This procedure yields an algorithm which 
starts at a small value f with a complete tree and identical test vectors attached 
to all nodes. The prototypes at the leaves are optimized according to (6) and the 
centroid rule (1), and the hierarchy is optimized by (8) and (9). After convergence 
one increases f and optimizes the hierarchy and the prototypes at the leaves again. 
The increment of f leads to phase transitions where test vectors separate from each 
other and the formerly completely degenerated tree evolves its structure. For a 
detailed description of this algorithm see [3]. 
Unsupervised On-line Learning of Decision Trees for Data Analysis 517 
3 On-Line Learning of Decision Trees 
Learning of decision trees is refined in this paper to deal with unbalanced trees 
and on-line learning of trees. Updating identical nodes according to the gradients 
(9) with assignments (6) weighs parameters of unbalanced tree structures in an 
unsatisfactory way. A detailed analysis reveals that degenerated test vectors, i.e., 
test vectors with identical components, still contribute to the assignments and to 
the evolution of % This artefact is overcome by using dynamic tree topologies 
instead of a predefined topology with indistinguishable test vectors. On the other 
hand, the development of an on-line algorithm makes it possible to process huge 
data sets and non-stationary data. For this setting there exists the need of on-line 
learning rules for the prototypes at the leaves, the test vectors at the inner nodes 
and the parameters - and/. Unbalanced trees also require rules for splitting and 
merging nodes. 
Following Buhmann and Kfihnel [1] we use an expansion of order O(1/n) of (1) to 
estimate the prototypes for the Nth datapoint 
y  ya N- + rk, p.N_M (N--y]V--1), (10) 
where p  p- q- 1/M ((M ) - p-) denotes the probability of the occurence 
of class ct. The parameters M and 77, are introduced in order to take the possible 
non-stationarity of the data source into account. M denotes the size of the data 
window, and r/, is a node specific learning rate. 
Adaptation of the inner nodes and of the parameter ? is performed by stochastic 
approximation using the gradients (8) and (9) 
S 7 :- S7 --1 q-77j"yN-1A 1 (XN, S7-1), (11) 
N 1:N--1 __ 77.7 E A2 (XN 8'N-l) (12) 
' J ' 
For an appropriate choice of the learning rates 77, the learning to learn approach of 
Murata et al. [4] suggests the learning algorithm 
w/V = w/V-1 _ 77/v-if (xv, w/V-). (13) 
The flow f in parameter space determines the change of w/v- given a new datapoint 
x/v. Murata et al. derive the following update scheme for the learning rate: 
/.N = (1 __ )/.N--1 q-f (XN,WN-1) , (14) 
77v = rff-1 + ulrff- (u2[[rVl[_ r/N-i), (15) 
where u, u2 and 5 are control parameters to balance the tradeoff between accuracy 
and convergence rate. r N denotes the leaky average of the flow at time N. 
The adaptation of  has to observe the necessary condition for a phase transition 
 > crit -= 1/25nx, 5nx being the largest eigenvalue of the covariance matrix [3] 
M M 
E. - E (xi - y.) (xi - y.)t (Mi.)/E(Mi.} ' (16) 
i----1 i=1 
Rules for splitting and merging nodes of the tree are introduced to deal with un- 
balanced trees and non-stationary data. Simple rules measure the distortion costs 
at the prototypes of the leaves. According to these costs the leaf with highest 
518 M. Held and J. M. Buhmann 
distortion costs is split. The merging criterion combines neighboring leaves with 
minimal distance in a greedy fashion. The parameter M (10), the typical time 
scale for changes in the data distribution is used to fix the time between splitting 
resp. merging nodes and the update of fl. Therefore, M controls the time scale for 
changes of the tree topology. The learning parameters for the learning to learn rules 
(13)-(15) are chosen empirically and are kept constant for all experiments. 
4 Experiments 
The first experiment demonstrates how a drifting two dimensional data source can 
be tracked. This data source is generated by a fixed tree augmented with transition 
probabilities at the edges and with Gaussians at the leaves. By descending the tree 
structure this enerates an i.i.d. random variable X  IR 2, which is rotated around 
the origin of IR  to obtain a random variable T(N) = R(w, N)X. R is an orthogonal 
matrix, N denotes the number of the actual data point and w denotes the angular 
velocity, M = 500. Figure 2 shows 45 degree snapshots of the learning of this non- 
stationary data source. We start to take these snapshots after the algorithm has 
developed its final tree topology (after  8000 datapoints). Apart from fluctuations 
of the test vectors at the leaves, the whole tree structure is stable while tracking 
the rotating data source. 
Additional experiments with higher dimensional data sources confirm the robustness 
of the algorithm w.r.t. the dimension of the data space, i. e. similiar tracking 
performances for different dimensions are observed, where differences are explained 
as differences in the data sources (figure 3). This performance is measured by the 
variance of the mean of the distances between the data source trajectory and the 
trajectories of the test vectors at the nodes of the tree. 
2) 
5) 6) 7) 
Figure 2:45 degree snapshots of the learning of a data source which rotates with a 
velocity w = 2/30000 (360 degree per 30000 data samples). 
A second experiment demonstrates the learning of a switching data source. The 
results confirm a good performance concerning the restructuring of the tree (see 
figure 4). In this experiment the algorithm learns a given data source and after 
10000 data points we switch to a different source. 
As a real-world example of on-line learning of huge data sources the algorithm is 
applied to the hierarchical clustering of 6-dimensional LANDSAT data. The heat 
Unsupervised On-line Learning of Decision Trees for Data Analysis 519 
-0.5 
-1 
-2 
-2.5 
-3 
-3.5 
..4 
-4.5 
-5 
0 
-  2dim -- 
. ..... -'-'.. 4dim .... 
' - -.:' --:.:... ' 12dim ..... 
10000 20000 30000 40000 50000 60000 70000 80000 90000 
N 
Figure 3: Tracking performance for different dimensions. As data sources we use 
d-dimensional Gaussians which are attached to a unit sphere. To the components 
of every random sample X we add sin(wN) in order to introduce non stationarity. 
The first 8000 samples are used for the development of the tree topology. 
A 
defg nohi 
Figure 4: Learning a switching data source: top: a) the partition of the data space 
after 10000 data samples given the first source, b) the restructured partition after 
additional 2500 samples. Below: accompanying tree topologies. 
channel has been discarded because of its reduced resolution. In a preprocessing step 
all channels are rescaled to unit variance, which alternatively could be established 
by using a Mahalanobis distance. Note that the decision tree which clusters this 
data supplies us with a hierarchical segmentation of the corresponding LANDSAT 
image. A tree of 16 leaves has been learned on a training set of 128 x 128 data 
samples, and it has been applied to a test set of 128 x 128 LANDSAT pixels. The 
training is established by 15 sequential runs through the test set, where after each 
M = 16384 run a split of one node is carried out. The resulting empirical errors 
(0.49 training distortion and 0.55 test distortion) differ only slightly from the errors 
obtained by the LBG algorithm applied to the whole training set (0.42 training 
distortion and 0.52 test distortion). This difference is due to the fact that not 
every data point is assigned to the nearest leaf prototype by a decision tree induced 
partition. The segmentation of the test image is depicted in figure 5. 
5 Conclusion 
This paper presents a method for unsupervised on-line learning of decision trees. 
We overcome the shortcomings of the original decision tree approach and extend 
520 M. Held and J. M. Buhmann 
Tmmt Xamm &i Xame 
lve 
Figure 5: Hierarchical segmentation of the test image. The root represents the 
original image, i.e., the gray scale version of the three color channels. 
it to the realm of on-line learning of huge data sets and of adaptive learning of 
non-stationary data. Our experiments demonstrate that the approach is capable of 
tracking gradually changing or switching environments. Furthermore, the method 
has been successfully applied to the hierarchical segmentation of LANDSAT images. 
Future work will address active data selection issues to significantly reduce the 
uncertainty of the most likely tree parameters and the learning questions related to 
different tree topologies. 
Acknowledgement: This work has been supported by the German Israel Foundation 
for Science and Research Development (GIF) under grant #I-0403-001.06/95 and by the 
Federal Ministry for Education, Science and Technology (BMBF #01 M 3021 A/4). 
References 
[1] J. M. Buhmann and H. Kfihnel. Vector quantization with complexity costs. IEEE 
Transactions on Information Theory, 39(4):1133-1145, July 1993. 
[2] T.M. Cover and J. Thomas. Elements of Information Theory. Wiley  Sons, 1991. 
[3] D. Miller and K. Rose. Hierarchical unsupervised learning with growing via phase 
transitions. Neural Computation, 8:425-450, February 1996. 
[4] N. Murata, K.-R. Mfiller, A. Ziehe, and S. Amari. Adaptive on-line learning in changing 
environments. In M.C. Mozer, M.I. Jordan, and T. Petsche, editors, Advances in Neural 
Information Processing Systems, number 9, pages 599-605. MIT Press, 1997. 
[5] K. Rose, E. Gurewitz, and G.C. Fox. A deterministic annealing approach to clustering. 
Pattern Recognition Letters, 11(9):589-594, September 1990. 
