Recovering Perspective Pose with a Dual 
Step EM Algorithm 
Andrew D.J. Cross and Edwin R. Hancock, 
Department of Computer Science, 
University of York, 
York, Y01 5DD, UK. 
Abstract 
This paper describes a new approach to extracting 3D perspective 
structure from 2D point-sets. The novel feature is to unify the 
tasks of estimating transformation geometry and identifying point- 
correspondence matches. Unification is realised by constructing a 
mixture model over the bi-partite graph representing the correspon- 
dence match and by effecting optimisation using the EM algorithm. 
According to our EM framework the probabilities of structural cor- 
respondence gate contributions to the expected likelihood function 
used to estimate maximum likelihood perspective pose parameters. 
This provides a means of rejecting structural outliers. 
I Introduction 
The estimation of transformational geometry is key to many problems of computer 
vision and robotics [10]. Broadly speaking the aim is to recover a matrix representa- 
tion of the transformation between image and world co-ordinate systems. In order 
to estimate the matrix requires a set of correspondence matches between features 
in the two co-ordinate systems [11]. Posed in this way there is a basic chicken- 
and-egg problem. Before good correspondences can be estimated, there need to be 
reasonable bounds on the transformational geometry. Yet this geometry is, after 
all, the ultimate goal of computation. This problem is usually overcome by invoking 
constraints to bootstrap the estimation of feasible correspondence matches [5, 8]. 
One of the most popular ideas is to use the epipolar constraint to prune the space of 
potential correspondences [5]. One of the drawbacks of this pruning strategy is that 
residual outliers may lead to ill-conditioned or singular parameter matrices [11]. 
Recovering Perspective Pose with a Dual Step EM Algorithm 781 
The aim in this paper is to pose the two problems of estimating transformation 
geometry and locating correspondence matches using an architecture that is rem- 
iniscent of the hierarchical mixture of experts algorithm [6]. Specifically, we use 
a bi-partite graph to represent the current configuration of correspondence match. 
This graphical structure provides an architecture that can be used to gate con- 
tributions to the likelihood function for the geometric parameters using structural 
constraints. Correspondence matches and transformation parameters are estimated 
by applying the EM algorithm to the gated likelihood function. In this way we 
arrive at dual maximisation steps. Maximum likelihood parameters are found by 
miniraising the structurally gated squared residuals between features in the two 
images being matched. Correspondence matches are updated so as to maximise the 
a posterJori probability of the observed structural configuration on the bi-partite 
association graph. 
We provide a practical illustration in the domain of computer vision which is aimed 
at matching images of floppy discs under severe perspective foreshortening. How- 
ever, it is important to stress that the idea of using a graphical model to provide 
structural constraints on parameter estimation is a task of generic importance. Al- 
though the EM algorithm has been used to extract arlne and Euclidean parameters 
from point-sets [4] or line-sets [9], there has been no attempt to impose structural 
constraints of the correspondence matches. Viewed from the perspective of graph- 
ical template matching [1, 7] our EM algorithm allows an explicit deformational 
model to be imposed on a set of feature points. Since the method delivers statisti- 
cal estimates for both the transformation parameters and their associated covariance 
matrix it offers significant advantages in terms of its adaptive capabilities. 
2 Perspective Geometry 
Our basic aim is to recover the perspective transformation parameters which bring 
a set of model or fiducial points into correspondence with their counterparts in a 
set of image data. Each point in the image data is represented by an augmented 
vector of co-ordinates _w i = (xi, yi, 1) T where i is the point index. The available set 
of image points is denoted by w = (_wi,Vi C Z)) where Z) is the point index-set. 
The fiducial points constituting the model are similarly represented by the set of 
augmented co-ordinate vectors z = (z_j,Vj  .A/t). Here ,k4 is the index-set for the 
model feature-points and the z_j represent the corresponding image co-ordinates. 
Perspective geometry is distinguished from the simpler Euclidean (translation, ro- 
tation and scaling) and arline (the addition of shear) cases by the presence of signifi- 
cant foreshortening. We represent the perspective transformation by the parameter 
matrix 
1,1 1,2 1,3 
(,) (,) (,) (,) 
 W2,1 2,2 2,3 
() (,) (,) 
3,1 3,2 3,3 
Using homogeneous co-ordinates, the transformation between model and data is 
z() (n) (n) 
J - (z..-'r)-(n)zJ ' where (n) 1) T is a column-vector formed 
' 3,1 , '3,2, 
--$ 
from the elements in bottom row of the transformation matrix. 
782 A. D. J. Cross and E. R. Hancock 
3 Relational Constraints 
One of our goals in this paper is to exploit structural constraints to improve the 
recovery of perspective parameters from sets of feature points. We abstract the 
process as bi-partite graph matching. Because of its well documented robustness to 
noise and change of viewpoint, we adopt the Delaunay triangulation as our basic 
representation of image structure [3]. We establish Delaunay triangulations on the 
data and the model, by seeding Voronoi tessellations from the feature-points. 
Tlie process of Delaunay triangulation generates relational graphs from the two 
sets of point-features. More formally, the point-sets are the nodes of a data graph 
GD -- {Z),ED} and a model graph GM -- {.A//,EM}. Here ED C-- Z) x Z) and 
EM C- M x  are the edge-sets of the data and model graphs. Key to our matching 
process is the idea of using the edge-structure of Delaunay graphs to constrain the 
correspondence matches between the two point-sets. This correspondence matching 
is denoted by the function f  AA -4 D from the nodes of the data-graph to those 
of the model graph. According to this notation the statement f(")(i) = j indicates 
that there is a match between the node i C D of the model-graph to the node j 
of the model graph at iteration n of the algorithm. We use the binary indicator 
s(n) { 1 if f(n)(i) = j (2) 
i,j = 0 otherwise 
to represent the configuration of correspondence matches. 
We exploit the structure of the Delaunay graphs to compute the consistency of 
match using the Bayesian framework for relational graph-matching recently reported 
by Wilson and Hancock [12]. Suffice to say that consistency of a configuration of 
matches residing on the neighbourhood Ri = itA {k; (i, k)  ED} of the node 
i in the data-graph and its counterpart Sj = j U {1; (j, l)  Era} for the node 
j in the model-graph is gauged by Hamming distance. The Hamming distance 
H(i, j) counts the number of matches on the data-graph neighbourhood Ri that 
are inconsistently matched onto the model-graph neighbourhood S. According to 
Wilson and Hancock [12] the structural probability for the correspondence match 
f(i) = j at iteration n of the algorithm is given by 
exp -/3H(i,j)] 
f,?) = (3) 
] 
s exp -/H(i,j) 
In the above expression, the Hamming distance is given by H(i,j) = 
Y](k,t)an,.s (1 (") where the symbol  denotes the composition of the data-graph 
-- k.l } 
relation Ri and the model-graph relation S. The exponential constant/ = In 
P 
is related to the uniform probabihty of structural matching errors P. This proba- 
bility is set to reflect the overlap of the two point-sets. In the work reported here 
we set P =   - vl 
4 The EM Algorithm 
Our aim is to extract perspective pose parameters and correspondences matches 
from the two point-sets using the EM algorithm. According to the original work 
Recovering Perspective Pose with a Dual Step EM Algorithrn 783 
of Dempster, Laird and Rubin [2] the expected likelihood function is computed 
by weighting the current log-probability density by the a posterJori measurement 
probabilities computed from the preceding maximum likelihood parameters. Jordan 
and Jacobs [6] augment the process with a graphical model which effectively gates 
contributions to the expected log-likelihood function. Here we provide a variant of 
this idea in which the bi-partite graph representing the correspondences matches 
gate the log-likelihood function for the perspective pose parameters. 
4.1 Mixture Model 
Our basic aim is to jointly maximize the data-likelihood p(wlz , f, ) over the space 
of correspondence matches f and the matrix of perspective parameters . To 
commence our development, we assume observational independence and factorise 
the conditional measurement density over the set of data-items 
P(WlZ' f' ) - H P(W-ilz' f' ) (4) 
In order to apply the apparatus of the EM algorithm to maximising p(w]z, f, 
with respect to f and , we must establish a mixture model over the space of 
correspondence matches. Accordingly, we apply Bayes theorem to expand over the 
space of match indicator variables. In other words, 
P(W-i[z'f')- E P(W-i'si,j Iz'f') (5) 
s,f 
In order to develop a tractable likelihood function, we apply the chain rule of condi- 
tional probability. In addition, we use the indicator variables to control the switch- 
ing of the conditional measurement densities via exponentiation. In other words we 
assume p(w_ ilsi,j, zj, 
With this simplification,the mixture model for the correspondence matching process 
leads to the following expression for the expected likelihood function 
Q(f(n+l),(n+)lf(n ) (n))_ E E P(si'jlw'z'f(n) (n))s(n)lnp(w-ilzJ (n+)) 
' ' i,j ' 
(6) 
To further simplify matters we make a mean-field approximation and replace 
E (') 5! .) In this way 
by its average value, i.e. we make use of the fact that (oi,J J = -,,3  
the structural matching probabilities gate contributions to the expected likelihood 
function. This mean-field approximation alleviates problems associated with local 
optima which are likely to occur if the likelihood function is discretised by gating 
with 8i, j . 
4.2 Expectation 
Using the Bayes rule, we can re-write the a posterJori measurement probabilities in 
terms of the components of the conditional measurement densities appearing in the 
mixture model in equation (5) 
P(si,j]w, z, f('), ((nq-1)) _. 
,)p(_wil_zj, (I)(n)) 
(,) 
-'j' e i,j' p(w- il-zJ ' , (n) ) 
(7) 
784 A. D. J. Cross and E. R. Hancock 
In order to proceed with the development of a point registration process we require 
a model for the conditional measurement densities, i.e. p(w_ilz_j, (n)). Here we 
assume that the required model can be specified in terms of a multivariate Gaussian 
distribution. The random variables appearing in these distributions are the error 
residuals for the position predictions of the jth model line delivered by the current 
estimated transformation parameters. Accordingly we write 
1 [ 1 _(n)xT,_ 1 n))] 
p(_wilz_j, (n)) _ (270 2 a  exp -(_w i - z ] z (_w i - _z. 
(s) 
In the above expression E is the variance-covariance matrix for the vector of error- 
_(n) between the components of the predicted mea- 
residuals ei,j( ) = _wi - j 
t and their counterparts in the data, i.e. _w i Formally, the 
surement vectors _zj . 
matrix is related to the expectation of the outer-product of the error-residuals i.e. 
4.3 Maximisation 
The maximisation step of our matching algorithm is based on two coupled update 
processes. The first of these aims to locate maximum a posteriori probability corre- 
spondence matches. The second class of update operation is concerned with locating 
maximum likelihood transformation parameters. We effect the coupling by allowing 
information flow between the two processes. Correspondences located by maximum 
a posteriori graph-matching are used to constrain the recovery of maximum likeli- 
hood transformation parameters. A posterJori measurement probabilities computed 
from the updated transformation parameters are used to refine the correspondence 
matches. 
In terms of the indicator variables matches the configuration of maximum a poste- 
riori probability correspondence matches is updated as follows 
exp [-fi Yk,t)en, oSj (1 - (n)x] 
f(n+l)(i) __ argmaxP(zjlw_i, (n)) [ (n)] (9) 
The mimum likelihood transformation parameters satisfy the condition 
_ _(n))rE-l(w i 
(+1)-argn  P(j]i,())(,)(i-j _ -j ) (10) 
In the case of perspective geometry where we have used homogeneous co-ordinates 
the saddle-point equations are not readily amenable in a closed-form linear fash- 
ion. Instead, we solve the non-linear maximisation problem using the Levenberg- 
Marquardt technique. This non-linear optimisation technique offers a compromise 
between the steepest gradient and inverse Hessian methods. The former is used 
when close to the optimum while the latter is used far from it. 
5 Experiments 
The real-world evaluation of our matching method is concerned with recognising 
planer objects in different 3D poses. The object used in this study is a 3.5 inch 
Recovering Perspective Pose with a Dual Step EM Algorithrn 785 
floppy disk which is placed on a desktop. The scene is viewed with a low-quality SGI 
IndyCam. The feature points used to triangulate the object are corners. Since the 
imaging process is not accurately modelled by a perspective transformation under 
pin-hole optics, the example provides a challenging test of our matching process. 
Our experiments are illustrated in Figure 1. The first two columns show the views 
under match. In the first example (the upper row of Figure 1) we are concerned 
with matching when there is a significant difference in perspective forshortening. In 
the example shown in the lower row of Figure 1, there is a rotation of the object 
in addition to the foreshortening. The images in the third column are the initial 
matching configurations. Here the perspective parameter matrix has been selected 
at random. The fourth column in Figure I shows the final matching configuration 
after the EM algorithm has converged. In both cases the final registration is accu- 
rate. The algorithm appears to be capable of recovering good matches even when 
the initial pose estimate is poor. 
Figure 1: Images Under Match, Initial and Final Configurations. 
We now turn to measuring the sensitivity of our method. In order to illustrate 
the benefits offered by the structural gating process, we compare its performance 
with a conventional least-squares parameter estimation process. Figure 2 shows a 
comparison of the two algorithms for a problem involving a point-set of 20 nodes. 
Here we show the RMS error as a function of the number of points which have 
correct correspondence matches. The break-even point occurs when 8 nodes are 
initially matched correctly and there are 12 errors. Once the number of initially 
correct correspondences exceeds 8 then the EM method consistently outperforms 
the least-squares estimation. 
6 Conclusions 
Our main contributions in this paper are twofold. The theoretical contribution has 
been to develop a mixture model that allows a graphical structure to to constrain the 
estimation of maximum likelihood model parameters. The second contribution is a 
practical one, and involves the application of the mixture model to the estimation 
of perspective pose parameters. There are a number of ways in which the ideas 
developed in this paper can be extended. For instance, the framework is readily 
extensible to the recognition of more complex non-planar objects. 
786 A. D. J. Cross and E. R. Hancock 
1 
08 
06 
04 
o2 
o 
o 
Figure 2: Structural Sensitivity. 
References 
[1] Y. Amit and A. Kong, "Graphical Templates for Model Registration", IEEE PAMI, 
18, pp. 225-236, 1996. 
[2] A.P. Dempster, Laird N.M. and Rubin D.B., "Maximum-likelihood from incomplete 
data via the EM algorithm", J. Royal Statistical Soc. Ser. B (methodological),39, pp 
1-38, 1977. 
[3] O.D. Faugeras, E. Le Bras-Mehlman and J-D. Boissonnat, "Representing Stereo Data 
with the Delaunay Triangulation", Artificial Intelligence, 44, pp. 41-87, 1990. 
[4] S. Gold, Rangarajan A. and Mjolsness E., "Learning with pre-knowledge: Clustering 
with point and graph-matching distance measures", Neural Computation, 8, pp. 787- 
804, 1996. 
[5] R.I. Hartley, "Projective Reconstruction and Invariants from Multiple Images", IEEE 
PAMI, 16, pp. 1036--1041, 1994. 
[6] M.I. Jordan and R.A. Jacobs, "Hierarchical Mixtures of Experts and the EM Algo- 
rithm", Neural Computation, 6, pp. 181-214, 1994. 
[7] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. von der Maalsburg, R.P. Wurtz 
and W.Konen, "Distortion-invariant object-recognition in a dynamic link architec- 
ture", IEEE 7kansactions on Computers, 42, pp. 300-311, 1993 
[8] D.P. McReynolds and D.G. Lowe, "Rigidity Checking of 3D Point Correspondences 
under Perspective Projection", IEEE PAMI, 18 , pp. 1174-1185, 1996. 
[9] S. Moss and E.R. Hancock, "Registering Incomplete Radar Images with the EM Al- 
gorithm", Image and Vision Computing, 15, 637-648, 1997. 
[10] D. Oberkampf, D.F. DeMenthon and L.S. Davis, "Iterative Pose Estimation using 
Coplanar Feature Points", Computer Vision and Image Understanding, 63, pp. 495- 
511, 1996. 
[11] P. Torr, A. Zisserman and S.J. Maybank, "Robust Detection of Degenerate Configura- 
tions for the Fundamental Matrix", Proceedings of the Fifth International Conference 
on Computer Vision, pp. 1037-1042, 1995. 
[12] R.C. Wilson and E.R. Hancock, "Structural Matching by Discrete Relaxation", IEEE 
PAMI, 19, pp.634-648, 1997. 
