A Neural Network Approach for 
Three- Dimensional Object Recognition 
Volker Tresp 
Siemens AG, Central Research and Development 
Otto-Hahn-Ring 6, D-8000 München 83 
Germany 
Abstract 
The model-based neural vision system presented here determines the po- 
sition and identity of three-dimensional objects. Two stereo images of 
a scene are described in terms of shape primitives (line segments derived 
from edges in the scenes) and their relational structure. A recurrent neural 
matching network solves the correspondence problem by assigning corre- 
sponding line segments in right and left stereo images. A 3-D relational 
scene description is then generated and matched by a second neural net- 
work against models in a model base. The quality of the solutions and 
the convergence speed were both improved by using mean field approxi- 
mations. 
1 INTRODUCTION 
Many machine vision systems and, to a large extent, also the human visual sys- 
tem, are model based. The scenes are described in terms of shape primitives and 
their relational structure, and the vision system tries to find a match between the 
scene descriptions and 'familiar' objects in a model base. In many situations, such 
as robotics applications, the problem is intrinsically 3-D. Different approaches are 
possible. Poggio and Edelman (1990) describe a neural network that treats the 
object recognition problem as a multivariate approximation problem. A certain 
number of 2-D views of the object are used to train a neural network to produce 
the standard view of that object. After training, new perspective views can be 
recognized. 
In the approach presented here, the vision system tries to capture the true 3-D 
structure of the scene. Two stereo views of a scene are used to generate a 3-D 
description of the scene which is then matched against models in a model base. The 
stereo correspondence problem and the model matching problem are solved by two 
recurrent neural networks with very similar architectures. A neuron is assigned to 
every possible match between primitives in the left and right images or, respectively, 
the scene and the model base. The networks are designed to find the best matches 
by obeying certain uniqueness constraints. 
Figure 1: Match of primitive (model and scene panels; line segment primitives with lengths of 3 cm and 5 cm and an angle of 30 degrees). 
Figure 2: Definitions of $r$, $q$, and $\phi$ (left), with $r_{ij} = \log(l_i/l_j)$. The function $\mu$ (right). 
The networks are robust against the uncertainties in the descriptions of both the 
stereo images and the 3-D scene (shadow lines, missing lines). Since a partial match 
is sufficient for a successful model identification, opaque and partially occluded 
objects can be recognized. 
2 THE NETWORK ARCHITECTURE 
Here, a general model matching task is considered. The activity of a match neuron 
$m_{\alpha i}$ (Figure 1) represents the certainty of a match between a primitive $p_\alpha$ in the 
model base and $p_i$ in the scene description. The interactions between neurons can be 
derived from the network's energy function, where the fixed points of the network 
correspond to the minima of the energy function. The first term in the energy 
function evaluates the match between the primitives 
$$E_P = -\sum_{\alpha, i} g_{\alpha i}\, m_{\alpha i}. \qquad (1)$$
The function $g_{\alpha i}$ is zero if the type of primitive $p_\alpha$ is not equal to the type of 
primitive $p_i$. If both types are identical, $g_{\alpha i}$ evaluates the agreement between 
parameters $p_\alpha(k)$ and $p_i(k)$ which describe properties of the primitives. Here, 
$g_{\alpha i} = \sum_k \mu(|p_\alpha(k) - p_i(k)|/\sigma_k)$ is maximum if the parameters of $p_\alpha$ and $p_i$ match 
(Figures 1 and 2). 
The evaluation of the match between the relations of primitives in the scene and 
data base is performed by the energy term (Mjolsness, Gindi and Anandan, 1989) 
$$E_S = -\frac{1}{2} \sum_{\alpha, \beta, i, j} \chi_{\alpha\beta, ij}\, m_{\alpha i}\, m_{\beta j}. \qquad (2)$$
The function $\chi_{\alpha\beta, ij} = \sum_k \mu(|p_{\alpha\beta}(k) - p_{ij}(k)|/\sigma_k)$ is maximum if the relation between 
$p_\alpha$ and $p_\beta$ matches the relation between $p_i$ and $p_j$. 
The constraint that a primitive in the scene should only match to one or no primitive 
in the model base (column constraint) is implemented by the additional (penalty) 
energy term (Utans et al., 1989, Tresp and Gindi, 1990) 
$$E_C = \sum_i \Big[ \Big( \sum_\alpha m_{\alpha i} - 1 \Big)^2 \sum_\alpha m_{\alpha i} \Big]. \qquad (3)$$
$E_C$ is equal to zero only if in all columns, the sum over the activations of all neurons 
is equal to one or zero, and positive otherwise. 
2.1 DYNAMIC EQUATIONS AND MEAN FIELD THEORY 
2.1.1 MFA1 
The neural network should make binary decisions, match or no match, but binary 
recurrent networks easily get stuck in local minima. Bad local minima can 
be avoided by using an annealing strategy, but annealing is time-consuming when 
simulated on a digital computer. Using a mean field approximation, one can obtain 
deterministic equations while retaining some of the advantages of the annealing 
process (Peterson and Söderberg, 1989). The network is interpreted as a system of 
interacting units in thermal contact with a heat reservoir of temperature T. Such 
a system minimizes the free energy $F = E - TS$ where $S$ is the entropy of the 
system. At T = 0 the energy E is minimized. The mean value $v_{\alpha i} = \langle m_{\alpha i} \rangle$ of 
a neuron becomes $v_{\alpha i} = 1/(1 + e^{-u_{\alpha i}/T})$ with $u_{\alpha i} = -\partial E/\partial v_{\alpha i}$. These equations 
can be updated synchronously, asynchronously, or solved iteratively by moving only 
a small distance from the old value of $u_{\alpha i}$ in the direction of the new mean field. 
At high temperatures T, the system is in the trivial solution $v_{\alpha i} = 1/2\ \forall \alpha, i$ and 
the activations of all neurons are in the linear region of the sigmoid function. The 
system can be described by linearized equations. The magnitudes of all eigenvalues 
of the corresponding transfer matrix are less than 1. At a critical temperature $T_c$, 
the magnitude of at least one of the eigenvalues becomes greater than one and the 
trivial solution becomes unstable. $T_c$ and favorable weights for the different terms 
in the energy function can be found by an eigenvalue analysis of the linearized 
equations (Peterson and Söderberg, 1989). 
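The iterative scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the original implementation: the energy contains only the primitive term and the column penalty (the relational term is omitted), the penalty is assumed to have the form $(s - 1)^2 s$ per column with $s$ the column sum, and the cooling schedule, step size and weight `w_col` are arbitrary choices.

```python
import math


def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_field(g, v, w_col):
    """u[a][i] = -dE/dv[a][i] for E = E_P + w_col * E_C (relational term omitted)."""
    n_rows, n_cols = len(g), len(g[0])
    u = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_cols):
        s = sum(v[a][i] for a in range(n_rows))
        penalty_grad = (s - 1.0) * (3.0 * s - 1.0)  # d/ds of (s - 1)^2 * s
        for a in range(n_rows):
            u[a][i] = g[a][i] - w_col * penalty_grad
    return u


def anneal_mfa1(g, w_col=1.0, t_start=2.0, t_end=0.05,
                cooling=0.9, sweeps=50, step=0.1):
    """Damped synchronous mean field iteration with a geometric cooling schedule."""
    n_rows, n_cols = len(g), len(g[0])
    u = [[0.0] * n_cols for _ in range(n_rows)]
    v = [[0.5] * n_cols for _ in range(n_rows)]  # trivial high-temperature solution
    t = t_start
    while t > t_end:
        for _ in range(sweeps):
            field = mean_field(g, v, w_col)
            for a in range(n_rows):
                for i in range(n_cols):
                    # Move u only a small distance toward the new mean field.
                    u[a][i] += step * (field[a][i] - u[a][i])
                    v[a][i] = sigmoid(u[a][i] / t)
        t *= cooling
    return v
```

For a compatibility matrix such as `[[1.0, 0.0], [0.0, 1.0]]`, the activities on the diagonal approach one while the off-diagonal activities are suppressed by the column penalty, although they do not reach exactly zero at the final temperature.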
2.1.2 MFA2 
The column constraint is satisfied by states with exactly one neuron or no neuron 
'on' in every column. If only these states are considered in the derivation of the 
mean field equations, one can obtain another set of mean field equations, 
$v_{\alpha i} = e^{u_{\alpha i}/T} / (1 + \sum_\beta e^{u_{\beta i}/T})$ with $u_{\alpha i} = -\partial E/\partial v_{\alpha i}$. 
The column constraint term (Equation 3) drops out of the energy function and 
the energy surface is simplified. The high temperature fixed point corresponds to 
$v_{\alpha i} = 1/(N + 1)\ \forall \alpha, i$ where N is the number of rows. 
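A minimal sketch of this second update rule, assuming the form $v_{\alpha i} = e^{u_{\alpha i}/T} / (1 + \sum_\beta e^{u_{\beta i}/T})$, is given below. The function name and the nested-list layout u[alpha][i] are illustrative assumptions. The constant 1 in the denominator is treated as an implicit "no match" state with zero input, and exponents are shifted for numerical stability.

```python
import math


def mfa2_update(u, t):
    """Per-column normalized update: exp(u/T) / (1 + sum over the column of exp(u/T))."""
    n_rows, n_cols = len(u), len(u[0])
    v = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_cols):
        col = [u[a][i] / t for a in range(n_rows)]
        # Shift exponents so the largest one (including the implicit 0) becomes 0.
        shift = max(0.0, max(col))
        exps = [math.exp(c - shift) for c in col]
        denom = math.exp(-shift) + sum(exps)
        for a in range(n_rows):
            v[a][i] = exps[a] / denom
    return v
```

With all inputs equal to zero, every activity equals 1/(N + 1), which reproduces the high temperature fixed point stated above; by construction each column sum stays below one, so no separate column penalty is needed.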
3 THE CORRESPONDENCE PROBLEM 
To olve the correspondence problem, corresponding lines in left and right images 
have to be identified. A good aumption is that the appearance of an object in 
the left image is a distortion and shifted version of the appearance of the object in 
the right image with approximately the same scale and orientation. The machinery 
just developed can be applied if the left image is interpreted az the scene and the 
right image az the model. 
Figure 3 shows two stereo images of a simple cene and the segmentation of left and 
right images into line segment which are the only primitives in this application. 
Lines correspond to the edges, structure and contours of the objects and shadow 
lines. The length of a line segment (1) = Ii is the descriptive parameter attached 
to each line segment pi. Relations between line segments are only considered if they 
are in a local neighborhood: Xo,,Id is equal to zero if not both a) 10 is attached to 
line segment p and b) line segment Pt is attached to line segment pj. Otherwise, 
= - + - + - where = is 
the angle between line segments, pj (2) = rij the logarithm of the ratio of their 
lengths and pj(3) = gig the attachment point (Shumaker etal., 1989) (Figure 2). 
Bets, we have two uniquene constraints: only at mot one neuron should be 
active in each column or each row. The row constraint is enforced by an energy 
term equivalent to E: Ea 
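To illustrate the relational parameters, the sketch below computes the angle and the logarithm of the length ratio for a pair of 2-D line segments and combines them into a compatibility value. Several parts are assumptions made only for illustration: the shape of $\mu$ (a Gaussian here), the tolerance values, the omission of the attachment point term, and all function names. The angle also depends on the order in which the endpoints of each segment are given.

```python
import math


def segment_vector(seg):
    """Direction vector of a segment given as ((x0, y0), (x1, y1))."""
    (x0, y0), (x1, y1) = seg
    return x1 - x0, y1 - y0


def relation_params(seg_a, seg_b):
    """Return (phi, r): angle between the segments and log of their length ratio."""
    ax, ay = segment_vector(seg_a)
    bx, by = segment_vector(seg_b)
    len_a = math.hypot(ax, ay)
    len_b = math.hypot(bx, by)
    cos_phi = (ax * bx + ay * by) / (len_a * len_b)
    phi = math.acos(max(-1.0, min(1.0, cos_phi)))  # clamp against rounding
    r = math.log(len_a / len_b)
    return phi, r


def mu(x):
    """Assumed tolerance function: 1 for perfect agreement, decaying with distance."""
    return math.exp(-x * x)


def chi(rel_model, rel_scene, sigma_phi=0.3, sigma_r=0.3):
    """Compatibility of two relations (attachment point term omitted)."""
    return (mu(abs(rel_model[0] - rel_scene[0]) / sigma_phi)
            + mu(abs(rel_model[1] - rel_scene[1]) / sigma_r))
```

Because both parameters are unchanged by translation and uniform scaling, a shifted or rescaled copy of a segment pair receives the maximum compatibility, while a pair with a different angle scores lower.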
4 DESCRIPTION OF THE 3-D OBJECT STRUCTURE 
From the last section, we know which endpoints in the left image correspond to 
endpoints in the right image. If D is the separation of both (in parallel mounted) 
cameras, f the focal length of the cameras, and $x_l, y_l, x_r, y_r$ the coordinates of a 
particular point in left and right images, the 3-D position of the point in camera 
coordinates $x, y, z$ becomes $z = Df/(x_r - x_l)$, $y = z y_r/f$, $x = z x_r/f + D/2$. This 
information is used to generate the 3-D description of the visible portion of the 
objects in the scene. 
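The triangulation step can be written as a short function. This sketch assumes the formulas $z = Df/(x_r - x_l)$, $y = z y_r/f$ and $x = z x_r/f + D/2$; the sign of the disparity depends on the image coordinate convention, which is not specified here, and the function and parameter names are hypothetical.

```python
def triangulate(x_l, y_l, x_r, y_r, baseline, focal):
    """3-D camera coordinates of a point seen at (x_l, y_l) and (x_r, y_r).

    baseline is the camera separation D, focal the focal length f.
    y_l is unused because parallel cameras give y_l == y_r in the ideal case.
    """
    disparity = x_r - x_l
    if disparity == 0:
        raise ValueError("zero disparity: point is at infinity")
    z = baseline * focal / disparity
    y = z * y_r / focal
    x = z * x_r / focal + baseline / 2.0
    return x, y, z
```

For example, with D = 2, f = 1, x_l = 0, x_r = 0.5 and y_r = 0.25, the depth is z = 2 / 0.5 = 4, giving y = 1 and x = 3.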
Knowing the true -D position of the endpoints of the line segments, the system 
concludes that the chair and the wardrobe are two distinct and spatgaily separated 
objects and that line segments 12 and 13 in the right image and 12 in the left image 
are not connected to either the chair or the wardrobe. On the other hnd, it is not 
310 Tresp 
Figure 3: Stereo images of a scene and segmented images. 
:-:  -'..:.:.. :.. . 
.:::: .:.. .'  : .'". 
'.::' . - :' ]:.  ..:. 
.  '.::. ............ . .... 
. . . :::.:-:.' 
... :' ::.: .....'.'.'.".'.'.'.'-'..::.:.:. ?.,. ' 
...". ::: : ...: :S .': a. :"  
. .:.:..: .:....:...... .. ..:.. 
... . ..-.......:., .:........-.'- .. 
.... . g. .'.'...............: . ......>.:.:.:: -, 
::.- ..< :: .... g,. '::>.: 
........ ..... y -.. . .:..- .:. :g. ,:: .. :.:.. 
........ , ....::...-..- 
.. ..... . .... ..,................:.:...:.:.,:.: 
The stereo matching 
network matched all line segments that are present in both images correctly. 
obvious that the shadow lines under the wardrobe are not part of the wardrobe. 
5 MATCHING OBJECTS AND MODELS 
The scene description now must be matched with stored models describing the 
complete 3-D structures of the models in the data base. The model description 
might be constructed by either explicitly measuring the dimensions of the models 
or by incrementally assembling the 3-D structure from several stereo views of the 
models. Descriptive parameters are the (true 3-D) lengths of line segments $l$, the 
(true 3-D) angles $\phi$ between line segments and the (true 3-D) attachment points 
$q$. The knowledge about the 3-D structure allows a segmentation of the scene into 
different objects, and the row constraint is only applied to neurons relating to the 
same object O in the scene: $E_{R'} = \sum_O \sum_\alpha [((\sum_{i \in O} m_{\alpha i}) - 1)^2 \sum_{i \in O} m_{\alpha i}]$. 
Figure 4 shows the network after convergence. Except for the occluded leg, all line 
segments belonging to the chair could be matched correctly. All non-occluded line 
segments of the wardrobe could be matched correctly except for its left front leg. 
The shadow lines in the image did not find a match. 
Figure 4: 3-D matching network. 
6 3-D POSITION 
In many applications, one is also interested in determining the positions of the 
recognized objects in camera coordinates. In general, the transformation between 
an object in a standard frame of reference $X_0 = (x_0, y_0, z_0)$ and the transformed 
frame of reference $X_s = (x_s, y_s, z_s)$ can be described by $X_s = R X_0$, where R is a 
4 x 4 matrix describing a rotation followed by a translation. R can be calculated if 
$X_0$ and $X_s$ are known for at least 4 points using, for example, the pseudoinverse or 
an ADALINE. Knowing the coefficients of R, the object position can be calculated. 
If an ADALINE is used, the error after convergence is a measure of the consistency 
of the transformation. A large error can be used as an indication that either a 
wrong model was matched, or certain primitives were misclassified. 
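The ADALINE variant can be sketched with the delta rule applied to each row of R. This is an illustrative sketch only: it assumes that points are extended to homogeneous coordinates (x, y, z, 1) so that a 4 x 4 matrix can express the translation, and the learning rate, epoch count and function names are arbitrary choices rather than values from the original work.

```python
def fit_transform_adaline(points_model, points_scene, rate=0.1, epochs=2000):
    """Fit a 4x4 matrix R with R * (x, y, z, 1) ~ (x', y', z', 1) by the delta rule."""
    r = [[0.0] * 4 for _ in range(4)]
    src = [list(p) + [1.0] for p in points_model]
    dst = [list(p) + [1.0] for p in points_scene]
    for _ in range(epochs):
        for x, target in zip(src, dst):
            for row in range(4):
                out = sum(r[row][k] * x[k] for k in range(4))
                err = target[row] - out
                for k in range(4):
                    r[row][k] += rate * err * x[k]  # Widrow-Hoff update
    return r


def residual(r, points_model, points_scene):
    """Mean squared position error of the fitted transformation."""
    total = 0.0
    for p, q in zip(points_model, points_scene):
        x = list(p) + [1.0]
        for row in range(3):
            out = sum(r[row][k] * x[k] for k in range(4))
            total += (q[row] - out) ** 2
    return total / len(points_model)
```

Applied to matched point pairs that are related by a true rigid motion, the residual falls to nearly zero; a residual that stays large would suggest a wrong model or misclassified primitives, in line with the consistency check described above. The matched points need to be in general position (not all coplanar) for R to be determined uniquely.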
7 DISCUSSION 
Both MFA1 and MFA2 were used in the experiments. The same solutions were 
found in general, but due to the simpler energy surface, MFA2 allowed greater time 
steps and therefore converged 5 to 10 times faster. 
For more complex scenes, a hierarchical system could be considered. In the first 
step, simple objects such as squares, rectangles, and circles would be identified. 
These would then form the primitives in a second stage which would then recognize 
complete objects. It might be possible to combine these two matching nets 
into one hierarchical net similar to the networks described by Mjolsness, Gindi and 
Anandan (1989). 
Acknowledgements 
I would like to acknowledge the contributions of Gene Gindi, Eric Mjolsness and 
Joachim Utans of Yale University to the design of the matching network. I thank 
Christian Evers for helping me to acquire the images. 
leferences 
Eric Mjolsne, Gene Gindi, P. Anndan. Neural Optimization in Model Matching 
and Perceptual Organization. Neural Computation 1, pp. 218-209, 1989. 
Carsten Peteron, Be SSderberg. A new method for mapping optimization problenm 
onto neural networks. International Journal of Neural Sitstems, Vol. 1, No. 1, pp. 
3-22, 1989. 
T. Pegtie, S. Edelman. A Network That Learns to Recognize Three-Dimensional 
Objects. Nature, No. 255, pp. 2(}3-266, iIanuary 1990. 
Grant Shumaker, Gene Gindi, Eric Mjolsnesa, P. Anndan. Stickville: 
Net for Object Recognition via Graph Matching. Tech. Report No. 
University, 1989. 
A Neural 
8908, Yale 
Volker Tresp, Gene Gindi. Invariant Object Recognition by Inexact Subgraph 
Matching with Applications in Industrial Part Recognition. International Neural 
Network Conference, Paris, pp. 95-98, 1990. 
Joachim Utans, Gene Gindi, Eric Mjolsne, P. Anndan. Neural Networks for Object 
Recognition within Compositional Hierarchies, Initial Experiments. Tech. Report 
No. 8903, Yale University, 1989. 
