Learning from Dyadic Data 
Thomas Hofmann*, Jan Puzicha +, Michael I. Jordan* 
* Center for Biological and Computational Learning, M.I.T 
Cambridge, MA, {holmann, jordan}@ai.mit.edu 
+ Institut ffir Informatik III, Universitat Bonn, Germany, jan@cs.uni-bonn.de 
Abstract 
Dyadzc data refers to a domain with two finite sets of objects in 
which observations are made for dyads, i.e., pairs with one element 
from either set. This type of data arises naturally in many ap- 
plication ranging from computational linguistics and information 
retrieval to preference analysis and computer vision. In this paper, 
we present a systematic, domain-independent framework of learn- 
ing from dyadic data by statistical mixture models. Our approach 
covers different models with fiat and hierarchical latent class struc- 
tures. We propose an annealed version of the standard EM algo- 
rithm for model fitting which is empirically evaluated on a variety 
of data sets from different domains. 
I Introduction 
Over the past decade learning from data has become a highly active field of re- 
search distributed over many disciplines like pattern recognition, neural compu- 
tation, statistics, machine learning, and data mining. Most domain-independent 
learning architectures as well as the underlying theories of learning have been fo- 
cusing on a feature-based data representation by vectors in an Euclidean space. For 
this restricted case substantial progress has been achieved. However, a variety of 
important problems does not fit into this setting and far less advances have been 
made for data types based on different representations. 
In this paper, we will present a general framework for unsupervised learning from 
dladic data. The notion dyadic refers to a domain with two (abstract) sets of ob- 
jects, X = {x,..., xN} and 3/= {y,..., YM} in which observations $ are made for 
dlads (xi, y). In the simplest case - on which we focus - an elementary observation 
consists just of (x, y) itself, i.e., a co-occurrence of x and y, while other cases 
may also provide a scalar value wi (strength of preference or association). Some ex- 
emplary application areas are: (i) Computational linguistics with the corpus-based 
statistical analysis of word co-occurrences with applications in language modeling, 
word clustering, word sense disambiguation, and thesaurus construction. (it) Text- 
based znformatzon retrieval, where , may correspond to a document collection, 3/ 
Learning from Dyadic Data 467 
to keywords, and (xi, y) would represent the occurrence of a term y. in a document 
:i. (iii) Modeling of preference and consumption behavior by identifying X with in- 
dividuals and 3/' with objects or stimuli as in collaborative filtering. (iv) Computer 
vston, in particular in the context of image segmentation, where X corresponds to 
image locations, 32 to discretized or categorical feature values, and a dyad (xi, y) 
represents a fea[ure yk observed at a par[icular location 
2 Mixture Models for Dyadic Data 
Across different domains there are at least two tasks which play a fundamental role 
in unsupervised learning from dyadic data: (i) probabilistic modeling, i.e., learning 
a joint or conditional probability model over  x 32, and (it) structure discovery, e.g., 
identifying clusters and data hierarchies. The key problem in probabilistic modeling 
is the data sparseress: How can probabilities for rarely observed or even unobserved 
co-occurrences be reliably estimated? As an answer we propose a model-based ap- 
proach and formulate latent class or mixture models. The latter have the further 
advantage to offer a unifying method for probabilistic modeling and structure dis- 
covery. There are at least three (four, if both variants in (it) are counted) different 
ways of defining latent class models: 
i. The most direct way is to introduce an (unobserved) mapping c:  x 32 - 
{c,..., CK} that partitions  x 32 into I: classes. This type of model is 
called aspect-based and the pre-image c-l(c) is referred to as an aspect. 
it. Alternatively, a class can be defined as a subset of one of the spaces  (or 32 
by symmetry, yielding a different model), i.e., c :X - {el,..., CK} which 
induces a unique partitioning on  x 32 by c(x, y) -- c(:c). This model is 
referred to as one-szded clustering and C-i(Cc) __ ,J is called a cluster. 
iii. If latent classes are defined for both sets, c  X -. {c,...,c}} and c  
32 - {cY,...,el}, respectively, this induces a mapping c which is a Ix:. L- 
partitioning of  x 32. This model is called two-sided clustering. 
2.1 Aspect Model for Dyadic Data 
In order to specify an aspect model we make the assumption that all co-occurrences 
in the sample set $ are i.i.d. and that xi and y are conditionally independent gqven 
the class. With parameters P(x Ic), P(y Ic) for the class-conditional distributions 
and prior probabilities P(c) the complete data probability can be written as 
P($, c)-  [P(ci)P(milci)(ylci)] (') , (1) 
i,k 
where n(i, y) are the empirical counts for dyads in  and ci  c(i, y). By 
summing over the latent variables c the usual mixture formulation is obtained 
P(S) :  P(xi,y) ('), where P(xi,y) :  P(c)P(xilc)P(ylc) (2) 
i,k  
Following the standard Expectation Maximization approach for maximum likelihood 
,stimation [Dempster ctal.. 1977], the E-step equations for the class posterior prob- 
abilities are given by 1 
: . (3) 
1in the case of multiple observations of dyads it has been assumed that each observation 
may have a different latent class. If only one latent class variable is introduced for each 
dyad, slightly different equations are obtMned. 
468 T. Hofmann, J. Puzicha and M. I. Jordan 
! two 0.18 went 0.10 Ihave 0.381 shalt 0.18 the 0.95 I he 0.51 I I<.> 0.521 Ithee 0.041 
: llh9:ia!'l seven 0.10 go 0.08 I hath 0.221 hast 0.08 his 0.006 I god 0.081 I <:> 0.161 I me 0.03 I 
! t-_, I,-,  three 0.10 come 0.04 I had 0.11 I wilt 0.08 myO.005 llordO.051 I<,> 0.141 lhimO.031 
: x.. I '-'Ct t four 0.06 came 0.04 I hast 0.091 art 0.07 our 0.003 land 0.041 I<;> 0.071 I it 0.02 I 
'! five 0.06 brought 0.03 I be 0.021 if 0.05 thyO.003 0.0 I I< O I l YouO.021 
! . years 0.11 up 0.40 ! not 0.04 I thou 0.85 lord 0.09 [hath 0.14 ! land 0.331 I<?> 0.2_7 I 
: i/ :housand 0.1 down 0.17 Idone o.o41 not 0.01 :hildren 0.0 shall 0.071 I for 0.081 I<, > o.zl 
! -- -  - x hundred 0.1 forth 0.15 Iiven 0.03l also 0.004 son 0.02 Isaid 0.051 I but 0.071 I <.> 0.121 
:z' klC:Ct, days 0.07 out O.09 ade 0.03] indeed 0.003 land 0.02 I is 0.04 I [then 0.051 I<:> 0.061 
:: .................... .c.u.?...0:.0.. ....... !:..0:.0} ...... !?..:.0.:07.!... :.o):.t.9:0..0  .. 2.?1...o:.o.2....i.w...o:.l..j..s.o..o:.o.z..i...[<.?...o:.! 
Figure 1' Some aspects of the Bible (bigrams). 
It is straightforward to derive the M-step re-estimation formulae 
P(c)cr yn(xi,y)P{ci:c), P(xilc)crEn(xi,y)P{ci:c}, (4) 
i,k k 
and an analogous equation for P(ylca). By re-parameterization the aspect model 
can also be characterized by a cross-entropy criterion. Moreover, formal equiva- 
lence to the aggregate Markov model, independently proposed for language model- 
ing in [Saul, Peteira, 1997], has been established (cf. [Hofmann, Puzicha, 1998] for 
details). 
2.2 One-Sided Clustering Model 
The complete data model proposed for the one-sided clustering model is 
P(,-q,c)= P(c)P(,-qlc)= (HP(c(xi))) (H[P(xi)P(Y'C(Xi))](:')) , (5) 
i i,k 
where we have made the assumption that observations (xi, y) for a particular xi 
are conditionally independent given c(xi). This effectively defines the mixture 
?($) - II ?(&), P(&): Y P(c) II [?()P(ylc)]'v) , (6) 
i a k 
where & are all observations involving zi. Notice that co-occurrences in & are not 
independent (as they are in the aspect model), but get coupled by the (shared) 
latent variable c(zi). As before, it is straightforward to derive an EM algorithm 
with update equations 
P{c(x,)-c}crP(c) 1-IP(ylc) '("), P(yl%)cr n(x,,y)P{c(x,)--c} (?) 
k i 
and P(c) cr -i P{c(xi)=ca}, P(xi) cr yj n(xi,yj). The one-sided clustering 
model is similar to the distributional clustering model [Pereira et al., 1993], how- 
ever, there are two important differences: (i) the number of likelihood contributions 
in (7) scales with the number of observations - a fact which follows from Bayes' rule 
- and (ii) mixing proportions are .missing in the original distributional clustering 
model. The one-sided clustering model corresponds to an unsupervised version of 
the naive Bayes' classifier, if we interpret 32 as a feature space for objects 
There are also ways to weaken the conditional independence assumption, e.g., by 
utilizing a mixture of tree dependency models [Meila, Jordan, 1998]. 
2.3 Two-Sided Clustering Model 
The latent variable structure of the two-sided clustering model significantly reduces 
the degrees of freedom in the specification of the class conditional distribution. We 
Learning from Dyadic Data 469 
Figure 2: Exemplary segmentation results on Aerial by one-sided clustering. 
propose the following complete data model 
P($,c) = H P(c(xi))P(c(y})) [P(xi)P(y})r,(,),4yk)] "("yk) 
i,k 
(8) 
where %g,, are cluster association parameters. In this model the latent variables 
in the 2' and 32 space are coupled by the x-parameters. Therefore, there exists 
no simple mixture model representation for P($). Skipping some of the technical 
details (cf. [Hofmann, Puzicha, 1998]) we obtain P(xi) c  n(xi, y}), P(y) c 
i n(xi, Yk) and the M-step equations 
<'*' = [E, P{c(,) = cg} E, -(,, y,)] [E, P{c(,) = 4} E,-(x,, ,)] 
(9) 
as well as P(c,) = -]i P{c(xi) = c} and P(cY,) = E} P{c(x}) = %Y}. To preserve 
tractability for the remaining problem of computing the posterior probabilities in 
the E-step, we apply a factorial approximation (mean field approximation), i.e., 
P{c(zi) = c A c(y) = cg}  P{c(zi) = c}P{c(y}) = %Y}. This results in the 
following coupled approximation equations for the marginal posterior probabilities 
P{c(xi)=c}crP(c)exp[ 
(10) 
and a similar equation for P{c(y}) = cry }. The resulting approximate EM algorithm 
performs updates according to the sequence (cS-post., r, cY-post., r). Intuitively 
the (probabilistic) clustering in one set is optimized in alternation for a given clus- 
tering in the other space and vice versa. The two-sided clustering model can also 
be shown to maximize a mutual information criterion [Hofmann, Puzicha, 1998]. 
2.4 Discussion: Aspects and Clusters 
To better understand the differences of the presented models it is elucidating to 
systematically compare the conditional probabilities P(ca ]x) and P(ca lye): 
P(clzi) 
P(col/) 
Aspect One-sided One-sided Two-sided 
Model 2' Clustering 3; Clustering Clustering 
As can be seen from the above table, probabilities P(calxi) and P(cly,) correspond 
to posterior probabilities of latent variables if clusters are defined in the 2'- and 
32-space, respectively. Otherwise, they are computed from model parameters. This 
is a crucial difference as, for example, the posterior probabilities are approaching 
470 T. Hofmann, J. Puzicha and M. I. Jordan 
12345678901234567890123456789012 
Figure 3' Two-sided clustering of LOB: r matrix and most probable words. 
Boolean values in the infinite data limit and P(y, lx) = 
are converging to one of the class-conditional distributions. Yet, in the aspect model 
P(YklXi) '- a P(calxi)P(ylca) and P(ca[xi) or P(Cc)P(xilCa) are typically not 
peaking more sharply with an increasing number of observations. In the aspect 
model, conditionals P(yklxi) are inherently a weighted sum of the 'prototypical' 
distributions P(yk [ca). Cluster models in turn ultimately look for the 'best' class- 
conditional and weights are only indirectly induced by the posterior uncertainty. 
3 The Cluster-Abstraction Model 
The models discussed in Section 2 all define a non-hierarchical, 'flat' latent class 
structure. However, for structure discovery it is important to find hierarchical data 
organizations. There are well-known architectures like the Hierarchical Mixtures 
of Experts [Jordan, Jacobs, 1994] which fit hierarchical models. Yet, in the case 
of dyadic data there is an alternative possibility to define a hierarchical model. 
The Cluster-Abstraction Model (CAM) is a clustering model (e.g., in X) where 
the conditionals P(yi[ca) are itself xi-specific aspect mixtures, P(ylca, xg) : 
with a latent aspect mapping a. To obtain a hierarchi- 
cal organization, clusters ca are identified with the terminal nodes of a hierarchy 
(e.g., a complete binary tree) and aspects a with inner and terminal nodes. As 
a compatibility constraint it is imposed that P(a[ca, xi) = 0 whenever the node 
corresponding to a is not on the path to the terminal node ca. Intuitively, con- 
ditioned on a 'horizontal' clustering c all observations (zi, yk)  Si for a particular 
xi have to be generated from one of the 'vertical' abstraction levels on the path to 
c(xi). Since different clusters share aspects according to their topological relation, 
this favors a meaningful hierarchical organization of clusters. Moreover, aspects at 
inner nodes do not simply represent averages over clusters in their subtree as they 
are forced to explicitly represent what is common to all subsequent clusters. 
Skipping the technical details, the E-step is given by 
P{a(xi,y}) = alc(xi) = ca} or P(alca,xi)P(y}la) (11) 
P{c(xi) = Ca} or P(ca) II-][P(alca,xi)P(yi]a)] n(a:"y) (12) 
and the M-step formulae are P(y:la) or Y']-i P{a(xi,y) = a}n(xi,y), P(ca) or 
Y]i P{c(xi) = ca}, and P(alca,xi ) or y']. P{a(xi,y) = alc(xi ) : ca}n(xi,y). 
Learning from Dyadic Data 471 
..................................... network ................................ 
network .... 
............. algorithm 
function learn perform 
weight ,. train network i 
learn \ problem process 
.: example ",, 
net method 
%rate error 
continu learn 
arbitrari perform 
runt taon ! algocithn 
number  backprolOg 
machah ! update 
energl model 
data 
model 
classif control 
train neural 
target design 
featur inform 
classaft network 
ima model 
charact ! no nlinear 
clasifi ! statist 
featur : opUm 
!  architectur :' ', boundari algorithm,,' 
neural 
function 
result 
gener ,: time ', 
: model ", 
.: synapti,. 
activ neural 
spike learn 
respons algorithm 
burst dyn,amlc 
,,' spike problurt  
,"' ,"  Inhlblt 
" :,' circuit ceU 
..'  mtial xten 
canter oscillatri 
posty napt coupl 
....... neuron " 
. dynam,ic process learn , 
author :' algorltln 
inform :' rule 
/ network 
: map 
visual memori 
imag associ 
motion cpapc 
cell pattern 
pr9cess learn 
red'pt pattern 
field memori 
orient node 
object rnJe 
(eitur optira 
recotr s7.e 
jrronn hebbllm 
shape invsri 
enforc line [...l 
region rotat 
oscil transform 
Figure 4: Parts of the top levels of a hierarchical clustering solution for the Neural 
document collection, aspects are represented by their 5 most probable word stems. 
4 Annealed Expectation Maximization 
Annealed EM is a generalization of EM based on the idea of deterministic anneal- 
ing [Rose et al., 1990] that has been successfully applied as a heuristic optimization 
technique to many clustering and mixture problems. Annealing reduces the sensitiv- 
ity to local maxima, but, even more importantly in this context, it may also improve 
the generalization performance compared to maximum likelihood estimation. 2 The 
key idea in annealed EM is to introduce an (inverse temperature) parameter 
and to replace the negative (averaged) complete data log-likelihood by a substitute 
known as the free energy (both are in fact equivalent at /: 1). This effectively 
results in a simple modification of the E-step by taking the likelihood contribution 
in Bayes' rule to the power of/3. In order to determine the optimal value for fi we 
used an additional validation set in a cross validation procedure. 
5 Results and Conclusions 
In our experiments we have utilized the following real-world data sets: (i) Cranfield: 
a standard test collection from information retrieval (N = 1400, M = 4898), (ii) 
Pen: adjective-noun co-occurrences from the Penn Treebank corpus (N = 6931, 
M: 4995) and the LOB corpus (N = 5448, M = 6052), (iii) Neural: a document 
collection with abstracts of journal papers on neural networks (N-- 1278, M = 6065), 
(iv) B,ble: word bigrams from the bible edition of the Gutenberg project (N = M- 
12858), (v) Aerial: Textured aerial images for segmentation (N = 128x128, M = 192). 
In Fig. 1 we have visualized an aspect model fitted to the Bible bigram data. Notice 
that although V -- 32 the role of the preceding and the subsequent words in bigrams 
is quite different. Segmentation results obtained on Aerial applying the one-sided 
clustering model are depicted in Fig. 2. A multi-scale Gabor filter bank (3 octaves, 
4 orientations) was utilized as an image representation (cf. [Hofmann et al., 1998]). 
In Fig. 3 a two-sided clustering solution of LOB is shown. Fig. 4 shows the top 
levels of the hierarchy found by the Cluster-Abstraction Model in Neural. The 
inner node distributions provide resolution-specific descriptors for the documents 
in the corresponding subtree which can be utilized, e.g., in interactive browsing 
for information retrieval. Fig. 5 shows typical test set perplexity curves of the 
2Moreover, the tree topology for the CAM is heuristically grown via phase transitions. 
472 T. Hofinann, J. Puzicha and M. I. Jordan 
(a) 
(b) 
(c) 
Figure 5: 
ing model 
K 
Perplexity curves for annealed EM (aspect (a), (b) and one-sided cluster- 
Aspect X-cluster CAM X/Y-cluster 
(c)) on the Bible and Cran data. 
Aspect X-cluster CAM X/y-cluster 
Cran Penn 
1 
8 
16 
32 
64 
128 
639 
0.73 312 0.08 352 0.13 322 0.55 394 
0.72 255 0.07 302 0.10 268 0.51 335 
0.71 205 0.07 254 0.08 226 0.46 286 
0.69 182 0.07 223 0.07 204 0.44 272 
0.68 166 0.06 231 0.06 179 0.40 241 
685 
0.88 482 0.09 527 0.18 511 0.67 615 
0.72 255 0.07 302 0.10 268 0.51 335 
0.83 386 0.07 452 0.12 438 0.53 506 
0.79 360 0.06 527 0.11 422 0.48 477 
0.78 353 0.04 663 0.10 410 0.45 462 
Table 1: Perplexity results for different models on the Cran (predicting words condi- 
tioned on documents) and Penn data (predicting nouns conditioned on adjectives). 
annealed EM algorithm for the aspect and clustering model (7 > = e - where 1 is 
the per-observation log-likelihood). At/ = 1 (standard EM) overfitting is clearly 
visible, an effect that vanishes with decreasing/. Annealed learning performs also 
better than standard EM with early stopping. Tab. 1 systematically summarizes 
perplexity results for different models and data sets. 
In conclusion mixture models for dyadic data have shown a broad application po- 
tential. Annealing yields a substantial improvement in generalization performance 
compared to standard EM, in particular for clustering models, and also outper- 
forms a complexity control via K. In terms of perplexity, the aspect model has 
the best performance. Detailed performance studies and comparisons with other 
state-of-the-art techniques will appear in forthcoming papers. 
References 
[Dempster et al., 1977] Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum like- 
lihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, 39, 1-38. 
[Hofmann, Puzicha, 1998] Hofmann, T., Puzicha, J. 1998. Statistical models for co- 
occurrence data. Tech. rept. Artifical Intelligence Laboratory Memo 1625, M.I.T. 
[Hofmann et al., 1998] Hofmann, T., Puzicha, J., Buhmann, J.M. (1998). Unsupervised 
texture segmentation in a deterministic annealing framework. IEEE Transactions on 
Pattern Analysis and Machine Intelligence, 20(8), 803-818. 
[Jordan, Jacobs, 1994] Jordan, M.I., Jacobs, R.A. (1994). Hierarchical mixtures of experts 
and the EM algorithm. Neural Computation, 6(2), 181-214. 
[Meila, Jordan, 1998] Meila, M., Jordan, M. I. 1998. Estimating Dependency Structure 
as a Hidden Variable. In: Advances in Neural Information Processing Systems 10. 
[Peteira et al., 1993] Peteira, F.C.N., Tishby, N.Z., Lee, L. 1993. Distributional clustering 
of English words. Pages 183-190 of: Proceedings of the ACL. 
[Rose et al., 1990] Rose, K., Gurewitz, E., Fox, G. (1990). Statistical mechanics and phase 
transitions in clustering. Physical Review Letters, 65(8), 945-948. 
[Saul, Pereira, 1997] Saul, L., Pereira, F. 1997. Aggregate and mixed-order Markov mod- 
els for statistical language processing. In: Proceedings of the 2nd International Confer- 
ence on Empirical Methods in Natural Language Processing. 
