A variational principle for 
model-based morphing 
Lawrence K. Saul* and Michael I. Jordan 
Center for Biological and Computational Learning 
Massachusetts Institute of Technology 
79 Amherst Street, E10-034D 
Cambridge, MA 02139 
Abstract 
Given a multidimensional data set and a model of its density, 
we consider how to define the optimal interpolation between two 
points. This is done by assigning a cost to each path through space, 
based on two competing goals--one to interpolate through regions 
of high density, the other to minimize arc length. From this path 
functional, we derive the Euler-Lagrange equations for extremal 
motion; given two points, the desired interpolation is found by solv- 
ing a boundary value problem. We show that this interpolation can 
be done efficiently, in high dimensions, for Gaussian, Dirichlet, and 
mixture models. 
1 Introduction
The problem of non-linear interpolation arises frequently in image, speech, and 
signal processing. Consider the following two examples: (i) given two profiles of the 
same face, connect them by a smooth animation of intermediate poses[1]; (ii) given a
telephone signal masked by intermittent noise, fill in the missing speech. Both these 
examples may be viewed as instances of the same abstract problem. In qualitative 
terms, we can state the problem as follows[2]: given a multidimensional data set, 
and two points from this set, find a smooth adjoining path that is consistent with 
available models of the data. We will refer to this as the problem of model-based 
morphing. 
In this paper, we examine this problem as it arises from statistical models of
multidimensional data. Specifically, our focus is on models that have been derived from
*Current address: AT&T Labs, 600 Mountain Ave 2D-439, Murray Hill, NJ 07974
268 L. K. Saul and M. L Jordan 
some form of density estimation. Though there exists a large body of work on 
the use of statistical models for regression and classification, there has been com- 
paratively little work on the other types of operations that these models support. 
Non-linear morphing is an example of such an operation, one that has important 
applications to video email[3], low-bandwidth teleconferencing[4], and audiovisual 
speech recognition [2]. 
A common way to describe multidimensional data is some form of mixture modeling. 
Mixture models represent the data as a collection of two or more clusters; thus, they 
are well-suited to handling complicated (multimodal) data sets. Roughly speaking, 
for these models the problem of interpolation can be divided into two tasks--how 
to interpolate between points in the same cluster, and how to interpolate between 
points in different clusters. Our paper will therefore be organized along these lines. 
Previous studies of morphing have exploited the properties of radial basis function 
networks[1] and locally linear models[2]. We have been influenced by both these
works, especially in the abstract formulation of the problem. New features of our 
approach include: the fundamental role played by the density, the treatment of non- 
Gaussian models, the use of a continuous variational principle, and the description 
of the interpolant by a differential equation. 
2 Intracluster interpolation 
Let Q = {q^(1), q^(2), ..., q^(|Q|)} denote a set of multidimensional data points, and let
P(q) denote a model of the distribution from which these points were generated. 
Given two points, our problem is to find a smooth adjoining path that respects the 
statistical model of the data. In particular, the desired interpolant should not pass 
through regions of space that the modeled density P(q) assigns low probability. 
2.1 Clusters nd metrics 
To develop these ideas further, we begin by considering a special class of models-- 
namely, those that represent clusters. We say that P(q) models a data cluster 
if P(q) has a unique (global) maximum; in turn, we identify the location of this 
maximum, q*, as the prototype. 
Let us now consider the geometry of the space inhabited by the data. To endow this 
space with a geometric structure, we must define a metric, g(q), that provides a 
measure of the distance between two nearby points: 
    \mathcal{D}[q, q + dq] = \Big[ \sum_{\alpha\beta} g_{\alpha\beta}(q)\, dq_\alpha\, dq_\beta \Big]^{1/2} + O(|dq|^2).    (1)
Intuitively speaking, the metric should reflect the fact that as one moves away from 
the center of the cluster, the density of the data dies off more quickly in some 
directions than in others. A natural choice for the metric, one that meets the above 
criteria, is the negative Hessian of the log-likelihood: 
(2 
ga/(q) ---- [ln P(q)]. (2) 
OqOq 
A Variational Principle for Model-based Morphing 
This metric is positive-definite if ln P(q) is concave; this will be true for all the
examples we discuss.
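As a quick numerical check of eq. (2), the sketch below estimates the negative Hessian of ln P by central finite differences for a two-dimensional Gaussian and compares it with the inverse covariance matrix. The matrix M, the evaluation point, the step size h, and the function names are illustrative choices, not taken from the paper.

```python
import math

# Illustrative inverse covariance matrix (an assumption, not from the paper).
M = [[2.0, 0.5],
     [0.5, 1.0]]

def log_p(x):
    """Log-density of a zero-mean Gaussian, up to an additive constant."""
    quad = sum(x[a] * M[a][b] * x[b] for a in range(2) for b in range(2))
    return -0.5 * quad

def metric(log_density, q, h=1e-3):
    """Negative Hessian of log_density at q, by central finite differences."""
    n = len(q)
    g = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            def shifted(da, db):
                p = list(q)
                p[a] += da
                p[b] += db
                return log_density(p)
            second = (shifted(h, h) - shifted(h, -h)
                      - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
            g[a][b] = -second
    return g

g = metric(log_p, [0.3, -0.7])

# Positive-definiteness test for a symmetric 2x2 matrix (Sylvester's criterion).
is_positive_definite = g[0][0] > 0 and g[0][0] * g[1][1] - g[0][1] * g[1][0] > 0
```

For a Gaussian the log-density is quadratic, so the estimated metric should agree with M at every point, up to rounding error.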
2.2 From densities to paths 
The problem of model-based interpolation is to balance two competing goals-- 
one to interpolate through regions of high density, the other to avoid excessive 
deformations. Using the metric in eq. (1), we can now assign a cost (or penalty) to 
each path based on these competing goals. 
Consider the path parameterized by q(t). We begin by dividing the path into 
segments, each of which is traversed in some small time interval, dt. We assign a
value to each segment by 
    \phi(t) = \left\{ \frac{P(q(t))}{P(q^*)}\, e^{-\ell} \right\}^{\mathcal{D}[q(t),\, q(t+dt)]},    (3)
where ℓ ≥ 0. For reasons that will become clear shortly, we refer to ℓ as the
line tension. The value assigned to each segment depends on two terms: a ratio
of probabilities, P(q(t))/P(q*), which favors points near the prototype, and the
constant multiplier, e^{-ℓ}. Both these terms are upper bounded by unity, and hence
so is their product. The value of the segment also decays with its length, as a result
of the exponent, D[q(t), q(t + dt)].
We derive a path functional by piecing these segments together, multiplying their
individual contributions, and taking the continuum limit. A value for the entire
path is obtained from the product:
    e^{-S[q(t)]} = \prod_t \phi(t).    (4)
Taking the logarithm of both sides, and considering the limit dt → 0, we obtain the
path functional
    S[q(t)] = \int dt \left\{ -\ln\left[ \frac{P(q(t))}{P(q^*)} \right] + \ell \right\} \Big[ \sum_{\alpha\beta} g_{\alpha\beta}(q)\, \dot{q}_\alpha\, \dot{q}_\beta \Big]^{1/2},    (5)
where dl -- [q] is the tangent vector to the path at time t. The terms in this 
functional balance the two competing goals for non-linear interpolation. The first 
favors paths that interpolate through regions of high density, while the second favors 
paths with small arc lengths; both are computed under the metric induced by the 
modeled density. The line tension ℓ determines the cost per unit arc length and
modulates the competition between the two terms. Note that the value of the 
functional does not depend on the rate at which the path is traversed. 
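The continuum limit behind eq. (5) can be mimicked numerically by summing segment costs along a polyline. The sketch below does this for a zero-mean Gaussian with identity covariance (an illustrative choice, for which the metric is the identity and -ln[P(x)/P(0)] = |x|^2/2), evaluating the density term at each segment midpoint. It then samples the same straight line at two different rates to illustrate that the value does not depend on how fast the path is traversed. The endpoints, the tension, and all names are hypothetical.

```python
import math

def path_cost(points, tension):
    """Discretized path functional for a unit-covariance Gaussian at the origin.

    Each segment contributes {-ln[P(x)/P(0)] + tension} * (segment length),
    with the density term evaluated at the segment midpoint.
    """
    total = 0.0
    for p, q in zip(points[:-1], points[1:]):
        mid = [(a + b) / 2.0 for a, b in zip(p, q)]
        neg_log_ratio = 0.5 * sum(m * m for m in mid)
        length = math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))
        total += (neg_log_ratio + tension) * length
    return total

def straight_line(start, end, times):
    """Sample the segment from start to end at the given times in [0, 1]."""
    return [[s + t * (e - s) for s, e in zip(start, end)] for t in times]

start, end = [-1.0, 1.0], [1.0, 1.0]
n = 2000
uniform = [i / n for i in range(n + 1)]
warped = [(i / n) ** 2 for i in range(n + 1)]   # same curve, uneven speed

cost_uniform = path_cost(straight_line(start, end, uniform), tension=1.0)
cost_warped = path_cost(straight_line(start, end, warped), tension=1.0)
```

For this particular line the integral can be done by hand and equals 10/3, so both discretizations should land close to that value.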
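To minimize this functional, we use the following result from the calculus of
variations. Let L(q, q̇) denote the integrand of eq. (5), such that S[q(t)] =
∫ dt L(q, q̇). Then the path which minimizes this functional obeys the
Euler-Lagrange equations[5]:
    \frac{d}{dt} \left[ \frac{\partial \mathcal{L}}{\partial \dot{q}_\alpha} \right] - \frac{\partial \mathcal{L}}{\partial q_\alpha} = 0.    (6)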
We define the model-based interpolant between two points as the path which mini- 
mizes this functional; it is found by solving the associated boundary value problem. 
The function L(q, q̇) is known as the Lagrangian. In the next sections, we present
eq. (5) for two distributions of interest--the multivariate Gaussian and the Dirichlet. 
2.3 Gaussian cloud 
The simplest model of multidimensional data is the multivariate Gaussian. In this 
case, the data is modeled by 
    P(x) = \frac{|M|^{1/2}}{(2\pi)^{N/2}} \exp\left\{ -\frac{1}{2} x^T M x \right\},    (7)
where M is the inverse covariance matrix and N is the dimensionality. Without 
loss of generality, wehave chosen the coordinate system so that the mean of the 
data coincides with the origin. For the Gaussian, the mean also defines the location 
of the prototype; moreover, from eq. (2), the metric induced by this model is just 
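the inverse covariance matrix. From eq. (5), we obtain the path functional:
    S[x(t)] = \int dt \left\{ \frac{1}{2} x^T M x + \ell \right\} \left[ \dot{x}^T M \dot{x} \right]^{1/2}.    (8)
To find a model-based interpolant, we seek the path that minimizes this functional.
Because the functional is parameterization-invariant, it suffices to consider paths
that are traversed at a constant (unit) rate: ẋ^T M ẋ = 1. From eq. (6), we find that
the optimal path with this parameterization satisfies:
    \left\{ \frac{1}{2} x^T M x + \ell \right\} \ddot{x} + (x^T M \dot{x})\, \dot{x} = x.    (9)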
This is a set of coupled non-linear equations for the components of x(t). However, 
note that at any moment in time, the acceleration, ẍ, can be expressed as a linear
combination of the position, x, and the velocity, ẋ. It follows that the motion of x
lies in a plane; in particular, it lies in the plane spanned by the initial conditions,
x and ẋ, at time t = 0. This enables one to solve the boundary value problem
efficiently, even in very high dimensions. 
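The planar property can be checked numerically. The sketch below assumes that, under the unit-rate parameterization, the Euler-Lagrange equations for the Gaussian take the form ẍ = [x - (x^T M ẋ) ẋ] / (x^T M x / 2 + ℓ), which is our reading of eq. (9). It integrates this initial value problem with a classical Runge-Kutta step and tracks how far x(t) strays from the plane spanned by x(0) and ẋ(0). The matrix M, the initial conditions, the step size, and all function names are hypothetical. It does not solve the boundary value problem itself.

```python
import math

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def acceleration(x, v, M, tension):
    """Acceleration from the assumed form of eq. (9); a combination of x and v."""
    Mx = matvec(M, x)
    f = 0.5 * dot(x, Mx) + tension
    c = dot(v, Mx)                      # x^T M v, since M is symmetric
    return [(xi - c * vi) / f for xi, vi in zip(x, v)]

def rk4_step(x, v, dt, M, tension):
    """One classical Runge-Kutta step for the system x' = v, v' = a(x, v)."""
    def add(u, w, s):
        return [ui + s * wi for ui, wi in zip(u, w)]
    k1x, k1v = v, acceleration(x, v, M, tension)
    x2, v2 = add(x, k1x, dt / 2), add(v, k1v, dt / 2)
    k2x, k2v = v2, acceleration(x2, v2, M, tension)
    x3, v3 = add(x, k2x, dt / 2), add(v, k2v, dt / 2)
    k3x, k3v = v3, acceleration(x3, v3, M, tension)
    x4, v4 = add(x, k3x, dt), add(v, k3v, dt)
    k4x, k4v = v4, acceleration(x4, v4, M, tension)
    x_new = [xi + dt / 6 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1x, k2x, k3x, k4x)]
    v_new = [vi + dt / 6 * (a + 2 * b + 2 * c + d)
             for vi, a, b, c, d in zip(v, k1v, k2v, k3v, k4v)]
    return x_new, v_new

# Illustrative 3-D example (all values are assumptions).
M = [[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.5]]
x0 = [1.0, 0.5, -0.5]
v0 = [0.2, 1.0, 0.4]
scale = math.sqrt(dot(v0, matvec(M, v0)))
v0 = [vi / scale for vi in v0]          # enforce v^T M v = 1 at t = 0

# Normal to the plane spanned by x0 and v0 (3-D cross product).
normal = [x0[1] * v0[2] - x0[2] * v0[1],
          x0[2] * v0[0] - x0[0] * v0[2],
          x0[0] * v0[1] - x0[1] * v0[0]]

x, v = x0, v0
max_out_of_plane = 0.0
for _ in range(1000):
    x, v = rk4_step(x, v, 1e-3, M, tension=1.0)
    max_out_of_plane = max(max_out_of_plane, abs(dot(normal, x)))
speed = dot(v, matvec(M, v))            # should remain close to 1
```

Because the motion stays in a fixed plane, a boundary value solver only has to search over the initial direction within the plane containing the origin and the two endpoints, whatever the dimensionality N.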
Figure 1a shows some solutions to this boundary value problem for different values
of the line tension, ℓ. Note how the paths bend toward the origin, with the degree
of curvature determined by ℓ.
2.4 Dirichlet simplex 
For many types of data, the multivariate Gaussian distribution is not the most 
appropriate model. Suppose that the data points are vectors of positive numbers 
whose elements sum to one. In particular, we say that w is a probability vector 
if w = (w_1, w_2, ..., w_N) ∈ R^N, w_α > 0 for all α, and Σ_α w_α = 1. Clearly, the
multivariate Gaussian is not suited to data of this form, since no matter what the 
mean and covariance matrix, it cannot assign zero probability to vectors outside 
the simplex. Instead, a more natural model is the Dirichlet distribution: 
    P(w) = \Gamma(\theta) \prod_\alpha \frac{w_\alpha^{\theta_\alpha - 1}}{\Gamma(\theta_\alpha)},    (10)
where θ_α > 0 for all α, and θ ≡ Σ_α θ_α. Here, Γ(·) is the gamma function, and
θ_α are parameters that determine the statistics of P(w). Note that P(w) = 0 for
vectors that are not probability vectors; in particular, the simplex constraints on w 
are implicit assumptions of the model. 
We can rewrite the Dirichlet distribution in a more revealing form as follows. First, 
let w* denote the probability vector with elements w*_α ≡ θ_α/θ. Then, making a
change of variables from w to ln w, we have:
    P(\ln w) = \frac{1}{Z_\theta} \exp\left\{ -\theta\, \mathrm{KL}(w^* \| w) \right\},    (11)
where Z_θ is a normalization factor that depends on θ_α (but not w), and the quantity
in the exponent is θ times the Kullback-Leibler (KL) divergence,
    \mathrm{KL}(w^* \| w) = \sum_\alpha w^*_\alpha \ln\left[ \frac{w^*_\alpha}{w_\alpha} \right].    (12)
The KL divergence measures the mismatch between w and w*, with KL(w*||w) = 0
if and only if w = w*. Since KL(w*||w) has no other minima besides the one at w*,
we shall say that P(ln w) models a data cluster in the variable ln w.
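The rewriting in eq. (11) can be checked numerically. The sketch below takes the unnormalized log-density of ln w to be Σ_α θ_α ln w_α (the Dirichlet exponent θ_α - 1 plus one power of w_α from the change of variables) and compares its difference from the value at w* with -θ KL(w*||w). The parameter values and the test vector are illustrative assumptions.

```python
import math

theta_vec = [2.0, 3.0, 5.0]                 # illustrative Dirichlet parameters
theta = sum(theta_vec)
w_star = [t / theta for t in theta_vec]     # prototype w*

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) between probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def log_density_lnw(w):
    """Unnormalized log-density of ln w: sum over a of theta_a * ln w_a.

    The extra factor of w_a relative to the density of w comes from the
    Jacobian of the change of variables w -> ln w.
    """
    return sum(t * math.log(wi) for t, wi in zip(theta_vec, w))

w = [0.5, 0.2, 0.3]
lhs = log_density_lnw(w) - log_density_lnw(w_star)
rhs = -theta * kl(w_star, w)
```

The two quantities agree for any probability vector w, which is the sense in which the density peaks at w* and falls off with the KL divergence.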
The metric induced by this modeled density is computed by following the prescrip- 
tion of eq. (2). For two nearby points inside the simplex, w and w + dw, the result 
of this prescription is that the squared distance is given by 
    ds^2 = \theta \sum_\alpha \frac{dw_\alpha^2}{w_\alpha}.    (13)
Up to a multiplicative factor of 2θ, eq. (13) measures the infinitesimal KL divergence
between w and w + dw. This is a natural metric for vectors whose elements can 
be interpreted as probabilities. 
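The relation between eq. (13) and the KL divergence can be illustrated with a small displacement on the simplex. The sketch below compares ds^2 with 2θ KL(w||w + dw); the value of θ, the point w, and the displacement dw are illustrative assumptions.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) between probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

theta = 10.0                                  # illustrative value of theta
w = [0.5, 0.2, 0.3]
dw = [1e-4, -3e-4, 2e-4]                      # sums to zero, so w + dw stays on the simplex
w_shifted = [wi + di for wi, di in zip(w, dw)]

ds2 = theta * sum(di * di / wi for wi, di in zip(w, dw))   # eq. (13)
two_theta_kl = 2.0 * theta * kl(w, w_shifted)
relative_gap = abs(ds2 - two_theta_kl) / ds2
```

The gap between the two quantities is of third order in dw, so the relative gap shrinks in proportion to the size of the displacement.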
The functional for non-linear interpolation is found by substituting the modeled 
density and the induced metric into eq. (5). For the Dirichlet distribution, this gives: 
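    S[w(t)] = \int dt \left\{ \theta\, \mathrm{KL}(w^* \| w) + \ell \right\} \Big[ \theta \sum_\alpha \frac{\dot{w}_\alpha^2}{w_\alpha} \Big]^{1/2}.    (14)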
Our problem is to find the path that minimizes this functional. Because the func- 
tional is parameterization-invariant, it again suffices to consider paths that are 
traversed at a constant rate, or Σ_α ẇ_α^2/w_α = 1. In addition to this, however, we
must also enforce the constraint that w remains inside the simplex; this is done 
by introducing a Lagrange multiplier. Following this procedure, we find that the 
optimal path is described by: 
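    \left\{ \theta\, \mathrm{KL}(w^* \| w) + \ell \right\} \left[ \ddot{w}_\alpha - \frac{\dot{w}_\alpha^2}{2 w_\alpha} + \frac{w_\alpha}{2} \right] + \theta\, \frac{d}{dt}\left[ \mathrm{KL}(w^* \| w) \right] \dot{w}_\alpha = \theta\, (w_\alpha - w^*_\alpha).    (15)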
Given two endpoints, this differential equation defines a boundary value problem 
for the optimal path. Unlike before, however, in this case the motion of w is not 
confined to a plane. Hence, the boundary value problem for eq. (15) does not 
collapse to one dimension, as does its Gaussian counterpart, eq. (9). 
To remedy this situation, we have developed an efficient approximation that finds 
a near-optimal interpolant, in lieu of the optimal one. This is done in two steps: 
first, by solving eq. (15) exactly in the limit ℓ → ∞; second, by using this limiting
solution, w_∞(t), to find the lowest-cost path that can be expressed as the convex
combination:
    w(t) = m(t)\, w^* + [1 - m(t)]\, w_\infty(t).    (16)
The lowest-cost path of this form is found by substituting eq. (16) into the Dirichlet
functional, eq. (14), and solving the Euler-Lagrange equations for m(t). The motivation
for eq. (16) is that for finite ℓ, we expect the optimal interpolant to deviate
from w_∞(t) and bend toward the prototype at w*. In practice, this approximation
works very well, and by collapsing the boundary value problem to one dimension, it
allows cheap computation of the Dirichlet interpolants. Some paths from eq. (16),
as well as the ℓ → ∞ paths on which they are based, are shown in figure 1b. These
paths were computed for the twelve dimensional simplex (N = 12), then projected
onto the w_1w_2-plane.
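The two-step construction can be sketched as follows. The sketch assumes that in the limit ℓ → ∞ the cost reduces to arc length under the metric of eq. (13), whose shortest paths are great-circle arcs in the coordinates sqrt(w_α); this is a standard property of that metric, not something stated in the text. The mixing profile m(t) = a sin(πt) is a hypothetical one-parameter family used only for illustration, whereas the paper solves the Euler-Lagrange equations for m(t). The endpoints, θ, ℓ, and all names are assumptions.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) between probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def geodesic(w0, w1, t):
    """Point at fraction t of the great-circle arc between sqrt(w0) and sqrt(w1)."""
    u0 = [math.sqrt(a) for a in w0]
    u1 = [math.sqrt(b) for b in w1]
    angle = math.acos(min(1.0, sum(a * b for a, b in zip(u0, u1))))
    s = math.sin(angle)                      # assumes w0 != w1, so s > 0
    c0 = math.sin((1.0 - t) * angle) / s
    c1 = math.sin(t * angle) / s
    return [(c0 * a + c1 * b) ** 2 for a, b in zip(u0, u1)]

def path(w0, w1, w_star, amplitude, n):
    """Convex combination of w* and the limiting path, with m(t) = amplitude * sin(pi t)."""
    pts = []
    for i in range(n + 1):
        t = i / n
        m = amplitude * math.sin(math.pi * t)
        w_inf = geodesic(w0, w1, t)
        pts.append([m * s + (1.0 - m) * g for s, g in zip(w_star, w_inf)])
    return pts

def cost(points, w_star, theta, tension):
    """Discretized Dirichlet functional: {theta * KL(w*||w) + tension} * ds per segment."""
    total = 0.0
    for p, q in zip(points[:-1], points[1:]):
        mid = [(a + b) / 2.0 for a, b in zip(p, q)]
        ds = math.sqrt(theta * sum((b - a) ** 2 / m for a, b, m in zip(p, q, mid)))
        total += (theta * kl(w_star, mid) + tension) * ds
    return total

w_star = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]   # prototype for equal theta_a
w0, w1 = [0.8, 0.1, 0.1], [0.1, 0.8, 0.1]    # illustrative endpoints
amplitudes = [0.05 * k for k in range(20)]   # amplitude 0 is the limiting path
costs = [cost(path(w0, w1, w_star, a, 200), w_star, theta=30.0, tension=1.0)
         for a in amplitudes]
best_cost = min(costs)
best_amplitude = amplitudes[costs.index(best_cost)]
```

Scanning the amplitude picks out the lowest-cost member of this restricted family. The point of the construction is that the search runs over a scalar function of time rather than over all N components of w(t).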
3 Intercluster interpolation 
The Gaussian and Dirichlet distributions of the previous section are clearly inadequate
for modeling multimodal data sets. In this section, we extend the
variational principle to mixture models, which describe the data as a collection of 
k ≥ 2 clusters. In particular, suppose the data is modeled by
    P(q) = \sum_{z=1}^{k} \pi_z\, P(q|z).    (17)
Here, we have assumed that the conditional densities P(q|z) model data clusters as
defined in section 2.1, and the coefficients π_z = P(z) define prior probabilities for
the latent variable, z ∈ {1, 2, ..., k}.
The crucial step for mixture models is to develop the appropriate generalization of 
eq. (5). To this end, let L_z(q, q̇) denote the Lagrangian derived from the conditional
density, P(q|z), and ℓ_z the line tension that appears in this Lagrangian. We now
combine these Lagrangians into a single functional:
    S[q(t), z(t)] = \int dt\, \mathcal{L}_{z(t)}(q, \dot{q}).    (18)
Note that eq. (18) is a functional of two arguments, not one. For mixture models, 
which define a joint density P(q, z) = π_z P(q|z), our goal is to find the optimal path
in the joint space q ⊗ z. Here, z(t) is a piecewise-constant function of time that
assigns a discrete label to each point along the path; in other words, it provides a
temporal segmentation of the path, q(t). The purpose of z(t) in eq. (18) is to select
which Lagrangian is used to compute the contribution from the interval [t, t + dt].
To respect the weighting of the mixture components in eq. (17), we set the line tensions
according to ℓ_z = -ln π_z. Thus, components with higher weights have lower line tensions.
[Figure 1 image: three panels with horizontal axes x_1 (a), w_1 (b), and x_1 (c).]
Figure 1: Model-based morphs for (a) Gaussian distribution; (b) Dirichlet distribution;
(c) mixture of Gaussians. The prototypes are shown as asterisks; ℓ denotes the
line tension. Figure 1c shows the convergence of the iterative algorithm; n denotes
the number of iterations.
As before, we define the model-based interpolant as the path q(t) that minimizes 
eq. (18). In this case, however, both q(t) and z(t) must be simultaneously optimized 
to recover this path. We have implemented an iterative scheme to perform this 
optimization, one that alternately (i) estimates the segmentation z(t), (ii) computes 
the model-based interpolant within each cluster based on this segmentation, and 
(iii) reestimates the points (along the cluster boundaries) where z(t) changes value. 
In short, the strategy is to optimize z(t) for fixed q(t), then optimize q(t) for fixed
z(t).
Figure 1c shows how this algorithm operates on a simple mixture of Gaussians. In
this example, the covariance matrices were set equal to the identity matrix, and
the means of the Gaussians were distributed along a circle in the x_1x_2-plane. Note
that with each iteration, the interpolant converges more closely to the path that 
traverses this circle. The effect is similar to the manifold-snake algorithm of Bregler 
and Omohundro[2]. 
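One way to picture the segmentation step of the iterative scheme: with q(t) held fixed, each short segment of the path can be assigned to the mixture component whose Lagrangian gives it the lowest cost. The sketch below does this for a mixture of unit-covariance Gaussians, with line tensions set to -ln π_z. The greedy per-segment rule, the means, and the weights are assumptions made for illustration; the paper does not spell out how z(t) is estimated.

```python
import math

means = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # hypothetical cluster means
weights = [0.5, 0.3, 0.2]                         # hypothetical priors pi_z
tensions = [-math.log(p) for p in weights]        # line tensions, -ln pi_z

def segment_cost(p, q, z):
    """Cost of the segment p -> q under component z (unit covariance)."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    neg_log_ratio = 0.5 * sum((m - mu) ** 2 for m, mu in zip(mid, means[z]))
    length = math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))
    return (neg_log_ratio + tensions[z]) * length

def segment_path(points):
    """Label each segment with its lowest-cost component (greedy estimate of z(t))."""
    labels = []
    for p, q in zip(points[:-1], points[1:]):
        costs = [segment_cost(p, q, z) for z in range(len(means))]
        labels.append(costs.index(min(costs)))
    return labels

# Straight line from the first mean to the last, sampled at 21 points.
n = 20
line = [[1.0 - 2.0 * i / n, 0.0] for i in range(n + 1)]
labels = segment_path(line)
```

The resulting labels are piecewise constant along the path, and the indices where they change mark the candidate cluster boundaries that a later step would re-estimate.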
4 Discussion 
In this paper we have proposed a variational principle for model-based interpolation. 
Our framework handles Gaussian, Dirichlet, and mixture models, and the resulting 
algorithms scale well to high dimensions. Future work will concentrate on the 
application to real images. 
References 
[1] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. of IEEE, vol 
78:9 (1990). 
[2] C. Bregler and S. Omohundro. Nonlinear image interpolation using manifold learning.
In G. Tesauro, D. Touretzky, and T. Leen (eds.). Advances in Neural Information 
Processing Systems 7, 973-980. MIT Press, Cambridge, MA (1995). 
[3] T. Ezzat. Example based analysis and synthesis for images of faces. MIT EECS M.S. 
thesis (1996). 
[4] D. Beymer, A. Shashua, and T. Poggio. Example based image analysis and synthesis.
AI Memo 1161, MIT (1993). 
[5] H. Goldstein. Classical Mechanics. Addison-Wesley, London (1980). 
