Improving Convergence in Hierarchical 
Matching Networks for Object 
Recognition 
Joachim Utans* Gene Gindi t 
Department of Electrical Engineering 
Yale University 
P.O. Box 2157 Yale Station 
New Haven, CT 06520 
Abstract 
We are interested in the use of analog neural networks for recog- 
nizing visual objects. Objects are described by the set of parts 
they are composed of and their structural relationship. Struc- 
tural models are stored in a database and the recognition prob- 
lem reduces to matching data to models in a structurally consis- 
tent way. The object recognition problem is in general very diffi- 
cult in that it involves coupled problems of grouping, segmentation 
and matching. We limit the problem here to the simultaneous la- 
belling of the parts of a single object and the determination of 
analog parameters. This coupled problem reduces to a weighted 
match problem in which an optimizing neural network must min- 
imize E(M,p) = Y'ai M, iW, i(p), where the {M,i} are binary 
match variables for data parts i to model parts c and {W,i(p)} 
are weights dependent on parameters p. In this work we show that 
by first solving for estimates  without solving for Mi, we may 
obtain good initial parameter estimates that yield better solutions 
for M and p. 
*Current address: International Computer Science Institute, 1947 Center Street, 
Suite 600, Berkeley, CA 94704, utans@icsi.berkeley.edu 
t Current address: SUNY Stony Brook, Department of Electrical Engineering, Stony 
Brook, NY 11784 
401 
402 Utans and Gindi 
Figure 1: Stored Model for a 3-Level Compositional Hierarchy (compare Figure 3). 
I Recognition via Stochastic Forward Models 
The Frameville object recognition system introduced by Mjolsness et al [5, 6, 1] 
makes use of a compositional hierarchy to represent stored models. The recognition 
problem is formulated as the minimization of an objective function. Mjolsness [3, 4] 
has proposed to derive the objective function describing the recognition problem 
in a principled way from a stochastic model that describes the objects the system 
is designed to recognize (stochastic visual grammar). The description mirrors the 
data representation as a compositional hierarchy, at each stage the description of 
the object becomes more detailed as parts are added. 
The stochastic model assigns a probability distribution at each stage of that process. 
Thus at each level of the hierarchy a more detailed description of parts in terms of 
their subparts is given by specifying a probability distribution for the coordinates of 
the subparts. Explicitly specifying these distributions allows for finer control over 
individual part descriptions than the rather general parameter error terms used 
before [1, 8]. The goal is to derive a joint probability distribution for an instance 
of an object and its parts as it appears in the scene. This gives the probability of 
observing such an object prior to the arrival of the data. Given an observed image, 
the recognition problem can be stated as a Bayesian inference problem that the 
neural network solves. 
1.! 3-Level Stochastic Model 
For example, consider the model shown in Figure 1 and 3. The object and its parts 
are represented as line segments (sticks), the parameters were p = (x, y, 1, O) T with 
x, y denoting position, l the length of a stick and 0 its orientation. The model 
considers only a rigid translation of an object in the image. 
Only one model is stored. From a central position p = (x, y, 1, 0), itself chosen 
from a uniform density, the Nfi parts at the first level are placed. Their structural 
relationships is stored as coordinates ufi in an object-centered coordinate frame, 
i.e. relative to p. While placing the parts, Gaussian distributed noise with mean 0 
and is added to the position coordinates to capture the notion of natural variation 
of the object's shape. The variance is coordinate specific, but we assume the same 
distribution for the x and y coordinates, (r; (r is the variance for the length 
Improving Convergence in Hierarchical Matching Networks for Object Recognition 403 
component and o'o for the relative angle. In addition, here we assume for simplicity 
that all parts are independently distributed. Each of the parts/3 is composed of sub- 
parts. For simplicity of notation, we assume that each part/3 is composed from the 
same number of subparts N, (note that the index 7 in Figure 2 here corresponds 
to the double index/3rn to keep track of which part/3 subpart/3m belongs to on the 
model side, i.e. the index/3m denotes the mth sub-part of part/3). The next step 
models the unordering of parts in the image via a permutation matrix M, chosen 
with probability P(M), by which their identity is lost. If this step were omitted, 
the recognition problem would reduce to the problem of estimating part parameters 
because the parts would already be labeled. 
From the grammar we compute the final joint probability distribution (all constant 
terms are collected in a constant C): 
P(M,{pam},{p},p)= 
1 
Cexp ( ( 2er(x-(x+u))2 
1 
1 (l B _ (l + ui)) 2 2er  
2i 
exp 
1 
--(Y-(Y+Uy)) 2 
2o' 
--(Yc - (Y + Umy))  
(1) 
1.2 Frameville Architecture for Part Labelling within a single Object 
The stochastic forward model for the part labelling problem with only a single object 
present in the scene translates into a reduced Frameville architecture as depicted in 
Figure 2. The compositional hierarchy parallels the steps in the stochastic model 
as parts are added at each level. Match variables appear only at the lowest level, 
corresponding to the permutation step of the grammar. Parts in the image must 
be matched to model parts and parts found to belong to the stored object must be 
grouped together. 
The single match neuron Moji at the highest level can be set to unity since we assume 
we know the object's identity and only a single object is present. Similarly, all terms 
inaij from the first to the second level can be set to unity for the correct grouping 
since the grouping is known at this point from the forward model description. In 
addition, at the intermediate (second) level, we may set all Maj = 1 for /3 = j 
and Mj = 0 otherwise with no loss of generality. These mid-level frames may 
be matched ahead of time, but their parameters must be computed from data. 
Introducing a part permutation at the intermediate levels thus is redundant. Given 
this, an additional simplification ina grouping variables at the lowest (third) level 
is possible. Since parts are pre-matched at all but the lowest level, inajk can be 
expressed in terms of the part match M,k as inaji = M. yilNA.yMj and explicitly 
representing inaji as variables is not necessary. 
The input to the system are the {pt}, recognition involves finding the parameters 
404 Utans and Gindi 
Myk 
Model Data 
Fram x 
I 
Figure 2: Frameville Architecture for the Stochastic Model. The 3-level grammar leads to a reduced 
"Frameville" style network architecture: a single model is stored on the model side and only one instance 
of the model is present in the input data. The ovals on the model side represent the object, its parts 
and subparts (compare Figure 1); the arcs INA represent their structural relationship. On the data side, 
the triangles represent parameter vectors (or frames) describing an instance of the object in the scene. 
At the lowest level the Pk represent the input data, parameters at higher levels in the hierarchy must be 
computed by the network (represented as bold triangles). ina represents the grouping of parts on the 
data side (see text). The horizontal lines represent assignments from frames on the data side to nodes 
on the model side. At the intermediate level, frames are prematched to the corresponding parts on the 
model side; match variables are necessary only at the lowest level (represented as bold lines with circles). 
p and {pj) as well as the labelling of parts M. Thus, from Bayes Theorem 
P({pk)lM, p, {pj))P(M, p, {pj)) 
P(M,p, {pj)l{p/)) = 
P((Pk)) 
c P(M,p, {pj), {Pk)) (2) 
and recognition reduces to finding the most probable values for p, {pj) and M 
given the data: 
arg max P(M,p, {pj), {p)) (3) 
M,p,(p) 
Solving the inference problem involves finding the MAP estimate and is is equivalent 
to minimizing the exponent in equation (1) with respect to M, p and {pj). 
2 Bootstrap: Coarse Scale Hints to Initialize the Network 
2.1 Compositional Hierarchy and Scale Space 
In some labelling approaches found in the vision literature, an object is first labelled 
at the coarse, low resolution, level and approximate parameters are found. In this 
top-down approach the information at the higher, more abstract, levels is used 
Improving Convergence in Hierarchical Matching Networks for Object Recognition 405 
Arm 
spatial 
scale 
III 
Lower Arm 
abstraction 
Figure 3: Compositional Hierarchy vs. Scale Space Hierarchy. A compositional hierarchy can represent a 
scale space hierarchy. At successive levels in the hierarchy, more and more detail is added to the object. 
to select initial values for the parts at the next lower level of abstraction. The 
segmentation and labelling at this next lowest level is thus not done blindly; rather 
it is strongly influenced contextually by the results at the level above. 
In fact, in very general terms such a scheme was described by Marr and Nishihara [2]. 
They advocate in essence a hierarchical model base in which a shape is first matched 
to the highest levels, and defaults in terms of relative object-based parameters of 
parts at the next level are recalled from memory. These defaults then serve as initial 
values in an unspecified segmentation algorithm that derives part parameters; this 
step is repeated recursively until the lowest level is reached. 
Note that the highest level of abstractions correspond to the coarsest levels of spatial 
scale. There is nothing in the design of the model base that demands this, but invari- 
ably, elements at the top of a compositional hierarchy are of coarser scale since they 
must both include the many subparts below, and summarize this inclusion with 
relatively few parameters. Figure 3 illustrates the correspondence between these 
representations. In this sense, the compositional hierarchy as applied to shapes 
includes a notion of scale, but there is no "scale-space" operation of intentionally 
blurring data. The notion of Scale Space as utilized here thus differs from the 
application of the method to low-level computations in the visual domain where 
auxiliary coarse scale representations are computed explicitly. The object represen- 
tations in the Frameville system as described earlier combines both, bottom-up and 
top-down elements. If the top-down aspects of the scheme described by Marr and 
Nishihara [2] could be incorporated into the Frameville architecture, then our pre- 
vious simulation results [8] suggest that much better performance can be expected 
from the neural network. Two problems must be addressed: (1) How do we obtain, 
from the observed raw data alone, a coarse estimate of the slot parameters at the 
highest level and (2) given these crude estimates how do we utilize them to recall 
default settings for the segmentation one level below? 
406 Utans and Gindi 
C 
Myk 
Rk 
Model Dala 
I 
Bootstrap 
Figure 4: Bootstrap computation for a network from a 3-level grammar. Analog frame variables at the 
top and intermediate level are initialized from data by a bootstrap computation (bold lines indicate the 
flow of information) 
2.2 Initialization of Coarse Scale Parameters 
We propose to aid convergence by supplying initial values for the analog variables p 
and {pj }; these must be computed from data without making use of the labelling. 
In general, it is not possible to solve for the analog parameters without knowledge 
of the correct permutation matrix M. However, for the purpose of obtaining an 
approximation b one can derive a new objective function that does not depend on 
M and the parameters {pj} by integrating over the {pj} and summing over all 
possible permutation matrices M: 
P(P, {Pk}) =  / d{pj}P(p, {pj}, {pk}, M) (4) 
{M}IM is a 
permutation 
This formulation leads to an Elastic Net type network [9, 7]. However, this imple- 
mentation of a separate network for the bootstrap computations is expensive. 
Here we use simpler computation where the coarse scale parameters are estimated 
by computing sample averages, corresponding to finding the solution for the Elastic 
Net in the high temperature limit [7]. For the position x we find, after integrating 
over the {xj }, 
and similarly for y. 
Since the assignment 
1 m 
of subparts k on the data side 
to subparts tim on the model side is not known at this point, the first term in 
equations (5) cannot be evaluated. After approximating the actual variance with 
Improving Convergence in Hierarchical Matching Networks for Object Recognition 407 
an average variance, these equations reduce to 
i i 1 
In terms of the objective function this translates into assuming that here the error 
terms for all parts are weighted equally. Since these weights would depend on the 
actual part match, this just corresponds to our ignorance regarding identity of the 
parts. This approximation assumes that the variances do not differ by a large 
amount, otherwise the approximation f will not be close to the true values. Since 
the model can be designed such that the part primitives used at the lowest level 
of the grammar are not highly specialized as would be the case for abstractions 
at higher levels of the model, the approximation proved sufficient for the problems 
studied here. 
The neural network can be used to perform the calculation. The Elastic Net for- 
mulation assigns approximately equal weights to all possible assignments at high 
temperatures. Thus, this behavior can be expressed in the original network with 
match variables by choosing M3mk = 1/(NSNm) V i,j. This leads to the following 
two-pass bootstrap computation. Using this specific choice for M only the analog 
variables need to be updated to compute the coarse scale estimates. The network 
with constant M is just the neural network implementation for computing  from 
equation (6). After these have converged,  can be used to compute j =  + u. 
Thus, the parameters for intermediate levels can by hypothesized from the coarse 
scale estimate  by adding the known transformation (recall that for intermediate 
levels, the part identity is preserved and no permutation steps takes place (see Fig- 
ure 2)). Then the network is restarred with random values for the match variables 
to compute the correct labelling and the correct parameters. 
2.3 Simulation Results 
The bootstrap procedure has been implemented for a 3-level hierarchical model. The 
model describes a "gingerbread man" as shown in Figure 3. The incorrect solutions 
observed did not, in the vast majority of cases, violate the permutation matrix 
constraint, i.e. the assignment was unique. However, even though the assignment 
is unique, parts where not always assigned correctly. Most commonly, the identity 
of neighboring parts was interchanged, in particular for cases with large variance. 
The advantage of using the bootstrap initialization is clear from Figure 5. For 
the simulation, cr = 2cr; the noise variance was identical for all parts. The net- 
work computed the solution reliably for large noise variances. In such cases the 
performance of the network without initialization deteriorates rapidly. Only one 
set of 10 experiments was used for the graph but in all simulations performed, 
the network with initialization consistently outperformed the network without ini- 
tialization. Figure 5(right) shows the time measured in the number of iterations 
necessary for the network to converge; it is almost unaffected by the increase in the 
noise variance. This is because the initial values derived from data are still close 
to the final solution. While in some cases, the random starting point happens to 
be close to the correct solution and the network without initialization converges 
rapidly, Figure 5 reflect the typical behavior and demonstrate the advantage of 
computing approximate initial values. 
408 Utans and Gindi 
lOO 
4o 
Success Rate 
300 
0 
1oo 
Convergence Speed 
: 
: 
;- ...... -4---, .... 
Figure 5: lesults Comparing the Network without and with Initialization (solid line). 
Left: The success rate indicates the rate at which the network converged to the correct solutions. a 
denotes the noise variance at the intermediate level of the model and a22 the noise variance at the lowest 
level. Only one set of 10 experiments was used for the graph but in all simulations performed, the 
network with initialization consistently outperformed the network without initialization. 
Right: The graph shows the average time it takes for the network to converge (as measured by the 
number of iterations) averaged over 10 experiments. Only simulations where the network converged to 
the correct solution are used to compute the average time for convergence. The stopping criterion used 
required all the match neurons to assume values M,j > 0.95 or M, 3 < 0.05. The error bars denote the 
standard deviation. 
Acknowledgements 
This work was supported in part by AFOSR grant AFOSR 90-0224. We thank 
E. Mjolsness and A. Rangarajan for many helpful discussions. 
References 
[1] G. Gindi, E. Mjolsness, and P. Anandan. Neural networks for model based recogni- 
tion. In Neural Networks: Concepts, Applications and Implementations, pages 144-173. 
Prentice-Hall, 1991. 
[2] David Marl Vision. W. H. Freeman and Co., New York, 1982. 
[3] E. Mjolsness. Bayesian inference on visual grammars by neural nets that optimize. 
Technical Report YALEU-DCS-TR-854, Yale University, Dept. of Computer Science, 
1991. 
[4] E. M)?lsness. Visual grammars and their neural nets. In R.P. Lippmann J.E. Moody, 
S.J. nanson, editor, Advances in Neural Information Processing Systems . Morgan 
Kaufmann Publishers, San Mateo, CA, 1992. 
[5] Eric Mjolsness, Gene Gindi, and P. Anandan. Optimization in model matching and 
perceptual organization: A first look. Research report yMeu/dcs/rr-634, Yale Univer- 
sity, Department of Computer Science, 1988. 
[6] Eric Mjolsness, Gene R. Gindi, and P. Anandan. Optimization in model matching and 
perceptuaJ organization. Neural Computation, vol. 1, no. 2, 1989. 
[7] Joachim Utans. Neural Networks for Object Recognition within Compositional Hierar- 
chies. PhD thesis, Department of Electrical Engineering, YaJe University, New Haven, 
CT 06520, 1992. 
[8] Joachim Utans, Gene R. Gindi, Eric Mjolsness, and P. Anandan. Neural networks 
for object recognition within compositional hierarchies: Initial experiments. Techni- 
caJ report 8903, Yale University, Center for Systems Science, Department Electrical 
Engineering, 1989. 
[9] A. L. Yuille. Generalized deformable models, statistical physics, and matching prob- 
lems. Neural Computation, 2(2):1-24, 1990. 
