Maximum entropy discrimination 
Tommi Jaakkola 
MIT AI Lab 
545 Technology Sq. 
Cambridge, MA 02139 
tommiai. mir. edu 
Marina Meila 
MIT AI Lab 
545 Technology Sq. 
Cambridge, MA 02139 
mmp ai. mir. edu 
Tony Jebara 
MIT Media Lab 
20 Ames St. 
Cambridge, MA 02139 
jebaramedia. mir. edu 
Abstract 
We present a general framework for discriminative estimation based 
on the maximum entropy principle and its extensions. All calcula- 
tions involve distributions over structures and/or parameters rather 
than specific settings and reduce to relative entropy projections. 
This holds even when the data is not separable within the chosen 
parametric class, in the context of anomaly detection rather than 
classification, or when the labels in the training set are uncertain or 
incomplete. Support vector machines are naturally subsumed un- 
der this class and we provide several extensions. We are also able 
to estimate exactly and efficiently discriminative distributions over 
tree structures of class-conditional models within this flamework. 
Preliminary experimental results are indicative of the potential in 
these techniques. 
I Introduction 
Effective discrimination is essential in many application areas. Employing gener- 
ative probability models such as mixture models in this context is attractive but 
the criterion (e.g., maximum likelihood) used for parameter/structure estimation 
is suboptimal. Support vector machines (SVMs) are, for example, more robust 
techniques as they are specifically designed for discrimination [9]. 
Our approach towards general discriminative training is based on the well known 
maximum entropy principle (e.g., [3]). This enables an appropriate training of both 
ordinary and structural parameters of the model (cf. [5, 7]). The approach is not 
limited to probability models and extends, e.g., SVMs. 
2 Maximum entropy classification 
Consider a two-class classification problem  where labels y  {-1, 1} are assigned 
The extension to a multi-class is straightforward[4]. The formulation also admits an 
easy extension to regression problems, analogously to SVMs. 
Maximum Entropy Discrimination 471 
to examples X  X. Given two generative probability distributions P(XlOv) with 
parameters Or, one for each class, the corresponding decision rule follows the sign 
of the discriminant function: 
v(xlo) 
c(xlo) = log + b 
(1) 
where 0 = {O,O_,b} and b is a bias term, usually expressed as a log-ratio b = 
log p/(1 -p). The class-conditional distributions may come from different families 
of distributions or the parametric discriminant function could be specified directly 
without any reference to models. The parameters 0v may also include the model 
structure (see later sections). 
The parameters 0 = {0, 0_, b} should be chosen to maximize classification accu- 
racy. We consider here the more general problem of finding a distribution P(O) 
over parameters and using a convex combination of discriminant functions, i.e., 
f P(O)(XlO)dO in the decision rule. The search for the optimal P(O) can be for- 
malized as a maximum entropy (ME) estimation problem. Given a set of training 
examples {X,... ,XT} and corresponding labels {y,... ,yT} we find a distribu- 
tion P(O) that maximizes the entropy H(P) subject to the classification constraints 
f P(O) [Yt (XtlO)] dO _> 7 for all t. Here 7 > 0 specifies a desired classification 
margin. The solution is unique (if it exists) since H(P) is concave and the linear 
constraints specify a convex region. Note that the preference towards high entropy 
distributions (fewer assumptions) applies only within the admissible set of distribu- 
tions 7v consistent with the constraints. See [2] for related work. 
We will extend this basic idea in a number of ways. The ME formulation assumes, 
for example, that the training examples can be separated with the specified mar- 
gin. We may also have a reason to prefer some parameter values over others and 
would therefore like to incorporate a prior distribution P0(O). Other extensions 
and generalizations will be discussed later in the paper. 
A more complete formulation is based on the following minimum relative entropy 
principle: 
Definition 1 Let {Xt, Yt} be the training examples and labels, (X]O) a parametric 
discriminant function, and 7 = [7,...,fit] a set of margin variables. Assuming a 
prior distribution P0(O,7), we find the discriminative minimum relative entropy 
(MRE) distribution P(O, 7) by minimizing D(PIIPo ) subject to 
f r(o, 7) [Yt (XtlO) - ?t] dOd7 _ 0 
(2) 
for all t. Here  = sign ( f P(O) (XlO ) dO ) specifies the decision rule for any 
new example X. 
The margin constraints and the preference towards large margin solutions are encod- 
ed in the prior P0 ("7). Allowing negative margin values with non-zero probabilities 
also guarantees that the admissible set 7 > consisting of distributions P(, 7) con- 
sistent with the constraints, is never empty. Even when the examples cannot be 
separated by any discriminant function in the parametric class (e.g., linear), we 
get a valid solution. The miss-classification penalties follow from P0(7) as well. 
472 T. Jaakkola, M. Meila and T. debara 
a) , c) 
.----""'"  j Po 
 D(P,P 0 ) 
/ 
o 05 
Figure 1: a) Minimum relative entropy (MRE) projection from the prior distribution 
to the admissible set. b) The margin prior Po(%). c) The potential terms in the 
MRE formulation (solid line) and in SVMs (dashed line). c = 5 in this case. 
Suppose P0(O, 7) = P0(O)P0(7) and Po(7) = 1-It Po(7t), where 
Po(7t) = ce -c0-v) for 7t _< 1, 
(3) 
This is shown in Figure lb. The penalty for margins smaller than 1 - 1/c (the prior 
mean of 7t) is given by the relative entropy distance between P(7) and P0(7). This 
is similar but not identical to the use of slack variables in support vector machines. 
Other choices of the prior are discussed in [4]. 
The MRE solution can be viewed as a relative entropy projection from the prior 
distribution P0(O,7) to the admissible set P. Figure la illustrates this view. From 
the point of view of regularization theory, the prior probability P0 specifies the 
entropic regularization used in this approach. 
Theorem 1 The solution to the MRE problem has the following general form [1] 
__1 P0(O, 7) e -], [ y,c(XlO)-v (4) 
P(o, 7) = 
where is the noalization constant (paition function) and  = (,..., T) 
defines a set of non-negative Lagrange multipliers, one for each classification con- 
straint.  are set by finding the unique maximum of the following jointly concave 
objective function: J() = -logZ() 
The solution is sparse, i.e., only a few Lagrange multipliers will be non-zero. This 
arises because many of the classification constraints become irrelevant once the 
constraints are enforced for a small subset of examples. Sparsity leads to immediate 
but weak generalization guarantees expressed in terms of the number of non-zero 
Lagrange multipliers [4]. Practical leave-one-out cross-validation estimates can be 
also derived. 
2.1 Practical realization of the MRE solution 
We now turn to finding the MRE solution. To begin with, we note that any disjoint 
factorization of the prior P0(O,7), where the corresponding parameters appear in 
distinct additive components in ytl(Xt, O) -7t, leads to a disjoint factorization of 
the MRE solution P(O, 7). For example, {O \ b, b, 7} provides such a factorization. 
As a result of this factorization, the bias term could be eliminated by imposing 
additional constraints on the Lagrange multipliers [4]. This is analogous to the 
handling of the bias term in support vector machines [9]. 
We consider now a few specific realizations such as support vector machines and a 
class of graphical models. 
Maximum Entropy Discrimination 473 
2.1.1 Support vector machines 
It is well known that the log-likelihood ratio of two Gaussian distributions with equal 
covariance matrices yields a linear decision rule. With a few additional assumptions, 
the MRE formulation gives support vector machines: 
Theorem 2 Assuming (X, O) = oTx -- b and P0(O, 7) = Po(O)Po(b)Po(7) where 
Po(O) is N(O,I), Po(b) approaches a non-informative prior, and Po(7) is given by 
eq. (3) then the Lagrange multipliers .k are obtained by maximizing J(.k) subject to 
0 <_ At <_ c and Y';.t ,ktYt ---- O, where 
I (xtTxt,) 
J(A) = E[ St + log(1 - At/c)] -  E AtAt, ytyt, 
t t,t' 
(5) 
The only difference between our J(A) and the (dual) optimization problem for 
SVMs is the additional potential term log(1 -At/c). This highlights the effect 
of the different miss-classification penalties, which in our case come from the MRE 
projection. Figure lb shows, however, that the additional potential term does not 
always carry a huge effect (for c = 5). Moreover, in the separable case, letting 
c -+ c, the two methods coincide. The decision rules are formally identical. 
We now consider the case where the discriminant function (X, )) corresponds to 
the log-likelihood ratio of two Gaussians with different (and adjustable) covariance 
matrices. The parameters ) in this case are both the means and the covariances. 
The prior P0()) must be the conjugate Normal-Wishart to obtain closed form 
integrals 2 for the partition function, Z. Here, P()i, 
a density over means and covariances. 
The prior distribution has the form P0 ()x) = Af(mx; m0, V/k) ZW(V; kVo, k) with 
parameters (k, m0, V0) that can be specified manually or one may let k -+ 0 to get 
a non-informative prior. Integrating over the parameters and the margin, we get 
Z = Z v x Z1 x Z-l, where 
a N? d/2 ISxl d F( (N, + 1- j)/2) 
1-I j= 1 
(6) 
N1 _A Zt Wt, .'1 __A__ Zt 't' xt, Sl -- Zt wtXtX? - NiX1X1 T. Here, wt is a scalar 
weight given by wt = u(yt)+Ytt. For Z_, the weights are set to wt = u(-yt)-Yt,kt; 
u(.) is the step function. Given Z, updating  is done by maximizing J(). The 
resulting marginal MRE distribution over the parameters (normalized by Z x Z_i) 
is a Normal-Wishart distribution itself, P(O) = Af(ml; '1, V1 IN1 ) Z]//(V1; S1, N1 ) 
with the final A values. Predicting the label for a new example X involves taking 
expectations of the discriminant function under a Normal-Wishart. This is 
N1 
ElD(el) [log P(X101)] = constant - -(X - )Ts{- (X - ) 
(7) 
We thus obtain discriminative quadratic decision boundaries. These extend the 
linear boundaries without (explicitly) resorting to kernels. More generally, the 
covariance estimation in this framework adaptively modifies the kernel. 
2This can be done more generally for conjugate priors in the exponential family. 
474 T. Jaakkola, M. Meila and T. Jebara 
2.1.2 Graphical models 
We consider here graphical models with no hidden variables. The ME (or MRE) 
distribution is in this case a distribution over both structures and parameters. Find- 
ing the distribution over parameters can be done in closed form for conjugate priors 
when the observations are complete. The distribution over structures is, in general, 
intractable. A notable exception is a tree model that we discuss in the forthcoming. 
A tree graphical model is a graphical model for which the structure is a tree. This 
model has the property that its log-likelihood can be expressed as a sum of local 
terms [8] 
log(X,1o) = + Wv(X,O) (8) 
u uvE 
The discriminant function consisting of the log-likelihood ratio of a pair of tree 
models (depending on the edge sets E, E-l, and parameters 0, 0_) can be also 
expressed in this form. 
We consider here the ME distribution over tree structures for fixed parameters 3. 
The treatment of the general case (i.e. including the parameters) is a direct exten- 
sion of this result. The ME distribution over the edge sets E1 and E-1 factorizes 
with components 
C uv [ t'O4'l)'"Eu hu(Xt,04-1)] H wl (9) 
P(Ei) = Z+ - Z+uveZ. 
where Z+, h +, W + are functions of the same Lagrange multipliers A. To com- 
pletely define the distribution we need to find A that optimize J(A) in Theorem 1; 
for classification we also need to compute averages with respect to P(E+i). For 
these, it suffices to obtain an expression of the partition function(s) Zx. 
P is a discrete distribution over all possible tree structures for n variables (there 
are n '-2 trees). However, a remarkable graph theory result, called the Matrix Tree 
Theorem [10], enables us to perform all necessary summations in closed form in 
polynomial time. On the basis of this result, we find 
Theorem3 
The normalization constant Z of a distribution of the form (9) is 
Z = h.E H Wuv = h. IQ(W)[, where (10) 
E uv6E 
-Wuv uv (11) 
Q,v(W) = 
2v,=Wv,v u=v 
This shows that summing over the distribution of all trees, when this distribution 
factors according to the trees' edges, can be done in closed form by computing the 
value of a determinant in time O(n3). Since we obtain a closed form expression, 
optimization of the Lagrange multipliers and evaluating the resulting classification 
rule are also tractable. 
Figure 2a provides a comparison of the discriminative tree approach and a maximum 
likelihood tree estimation method on a DNA splice junction problem. 
aEach tree relies on a different set of n - 1 pairwise node marginals. In our experiments 
the class-conditional pairwise marginals were obtained directly from data. 
Maximum Entropy Discrimination 475 
o 
0.8 
Figure 2: ROC curves based on independent test sets. a) Tree estimation: discrim- 
inative (solid) and ML (dashed) trees. b) Anomaly detection: MRE (solid) and 
Bayes (dashed). c) Partially labeled case: 100% labeled (solid), 10% labeled + 90% 
unlabeled (dashed), and 10% labeled + 0% unlabeled training examples (dotted). 
3 Extensions 
Anomaly detection: In anomaly detection we are given a set of training ex- 
amples representing only one class, the "typical" examples. We attempt to cap- 
ture regularities among the examples to be able to recognize unlikely members 
of this class. Estimating a probability distribution P(X]O) on the basis of the 
training set {X,..., Xr} via the ML (or analogous) criterion is not appropriate; 
there is no reason to further increase the probability of those examples that are al- 
ready well captured by the model. A more relevant measure involves the level sets 
X v = {X E : logP(X[O) _> 7} which are used in deciding the class membership 
in any case. We estimate the parameters 0 to optimize an appropriate level set. 
Definition 2 Given a probability model P(XlO), 0 O, a set of training examples 
{X,..., XT}, a set of margin variables 7 = [7,..., 7iT], and a prior distribution 
Po(O, 7) we find the MRE distribution P(O, 7) such that minimizes D(PlIPo) subject 
to the constraints f P(0, 7) [ log P(XtlO) - 7t ] dOd7 _> 0 for all t= 1,... ,T. 
Note that this again a MRE projection whose solution can be obtained as before. 
The choice of P0 (7) in P0 (0, 7) = Po (O)Po (7) is not as straightforward as before since 
each margin 7t needs to be close to achievable log-probabilities. We can nevertheless 
find a reasonable choice by relating the prior mean of 7t to some c-percentile of the 
training set log-probabilities generated through ML or other estimation criterion. 
Denote the resulting value by l and define the prior Po(7t) as Po(%) = c e 
for 7t <_ l. In this case the prior mean of 7t is l, - 1/c. 
Figure 2b shows in the context of a simple product distribution that this choice of 
prior together with the MRE flamework leads to a real improvement over standard 
(Bayesian) approach. We believe, however, that the effect will be more striking 
for sophisticated models such as HMMs that may otherwise easily capture spurious 
regularities in the data. An extension of this formalism to latent variable models is 
provided in [4]. 
Uncertain or incompletely labeled examples: Examples with uncertain la- 
bels are hard to deal with in any (probabilistic or not) discriminative classification 
method. Uncertain labels can be, however, handled within the maximum entropy 
formalism: let y = {y,...,yT} be a set of binary variables corresponding to the 
labels for the training examples. We can define a prior uncertainty over the labels 
by specifying P0(y); for simplicity, we can take this to be a product distribution 
476 T. Jaakkola, M. Meila and T. Jebara 
Po(Y) = I-It Pt,o(yt) where a different level of uncertainty can be assigned to each 
example. Consequently, we find the minimum relative entropy projection from the 
prior distribution P0(O, 7, Y) = Po(O)Po(7)Po(y) to the admissible set of distribu- 
tions (no longer a function of the labels) that are consistent with the constraints: 
Evf,, P(O,%y)[ytE(Xt,O)-7t] dOd7 _ 0 for all t = 1,...,T. The MRE 
principle differs from transduction [9], provides a soft rather than hard assignment 
of unlabeled examples, and is fundamentally driven by large margin classification. 
The MRE solution is not, however, often feasible to obtain in practice. We can 
nevertheless formulate an efficient mean field approach in this context [4]. Figure 
2c demonstrates that even the approximate method is able to reap most of the ben- 
efit from unlabeled examples (compare, e.g., [6]). The results are for a DNA splice 
junction classification problem. For more details see [4]. 
4 Discussion 
We have presented a general approach to discriminative training of model param- 
eters, structures, or parametric discriminant functions. The formalism is based on 
the minimum relative entropy principle reducing all calculations to relative entropy 
projections. The idea naturally extends beyond standard classification and cov- 
ers anomaly detection, classification with partially labeled examples, and feature 
selection. 
References 
[1] Cover and Thomas (1991). Elements of information theory. John Wiley & Sons. 
[2] Kivinen J. and Warmuth M. (1999). Boosting as Entropy Projection. Proceed- 
ings of the 12th Annual Conference on Computational Learning Theory. 
[3] Levin and Tribus (eds.) (1978). The maximum entropy formalism. Proceedings 
of the Maximum entropy formalism conference, MIT. 
[4] Jaakkola T., Meil& M. and Jebara T. (1999). Maximum entropy discrimination. 
MIT AITR-1668, http://www. ai. mir. edu/~tommi/papers. html. 
[5] Jaakkola T. and Haussler D. (1998). Exploiting generafive models in discrimi- 
native classifiers. NIPS 11. 
[6] Joachims, T. (1999). Transductive inference for text classification using support 
vector machines. International conference on Machine Learning. 
[7] Jebara T. and Pentland A. (1998). Maximum conditional likelihood via bound 
maximization and the CEM algorithm. NIPS 11. 
[8] Meil& M. and Jordan M. (1998). Estimating dependency structure as a hidden 
variable. NIPS 11. 
[9] Vapnik V. (1998). Statistical learning theory. John Wiley & Sons. 
[10] West D. (1996). Introduction to graph theory. Prentice Hall. 
