JPMAX: Learning to Recognize Moving 
Objects as a Model-fitting Problem 
Suzanna Becker 
Department of Psychology, McMaster University 
Hamilton, Ont. L8S 4K1 
Abstract 
Unsupervised learning procedures have been successful at low-level 
feature extraction and preprocessing of raw sensor data. So far, 
however, they have had limited success in learning higher-order 
representations, e.g., of objects in visual images. A promising ap- 
proach is to maximize some measure of agreement between the 
outputs of two groups of units which receive inputs physically sep- 
arated in space, time or modality, as in (Becker and Hinton, 1992; 
Becker, 1993; de Sa, 1993). Using the same approach, a much sim- 
pler learning procedure is proposed here which discovers features 
in a single-layer network consisting of several populations of units, 
and can be applied to multi-layer networks trained one layer at 
a time. When trained with this algorithm on image sequences of 
moving geometric objects a two-layer network can learn to perform 
accurate position-invariant object classification. 
1 LEARNING COHERENT CLASSIFICATIONS 
A powerful constraint in sensory data is coherence over time, in space, and across 
different sensory modalities. An unsupervised learning procedure which can capital- 
ize on these constraints may be able to explain much of perceptual self-organization 
in the mammalian brain. The problem is to derive an appropriate cost function for 
unsupervised learning which will capture coherence constraints in sensory signals; 
we would also like it to be applicable to multi-layer nets to train hidden as well 
as output layers. Our ultimate goal is for the network to discover natural object 
classes based on these coherence assumptions. 
934 $uzanna Becker 
1.1 PREVIOUS WORK 
Successive images in continuous visual input are usually views of the same object; 
thus, although the image pixels may change considerably from frame to frame, 
the image usually can be described by a small set of consistent object descriptors, 
or lower-level feature descriptors. We refer to this type of continuity as temporal 
coherence. This sort of structure is ubiquitous in sensory signals, from vision as 
well as other senses, and can be used by a neural network to derive temporally 
coherent classifications. This idea has been used, for example, in temporal versions 
of the Hebbian learning rule to associate items over time (Weinshall, Edelman and 
Bfilthoff, 1990; FSldigk, 1991). To capitalize on temporal coherence for higher-order 
feature extraction and classification, we need a more powerful learning principle. 
A promising approach is to maximize some measure of agreement between the out- 
puts of two groups of units which receive inputs physically separated in space, time 
or modality, as in (Becker and Hinton, 1992; Becker, 1993; de Sa, 1993). This forces 
the units to extract features which are coherent across the different input sources. 
Becker and Hintoh's (1992) Imax algorithm maximizes the mutual information be- 
tween the outputs of two modules, a and , connected to different parts of the 
input, a and b. Becker (1993) extended this idea to the problem of classifying tem- 
porally varying patterns by applying the discrete case of the mutual information 
cost function to the outputs of a single module at successive time steps, a(t) and 
y-(t + 1). However, the success of this method relied upon the back-propagation of 
derivatives to train the hidden layer and it was found to be extremely susceptible 
to local optima. de Sa's method (1993) is closely related, and minimizes the prob- 
ability of disagreement between output classifications, a(t) and y-(t), produced by 
two modules having different inputs, e.g., from different sensory modalities. The 
success of this method hinges upon bootstrapping the first layer by initializing the 
weights to randomly selected training patterns, so this method too is susceptible to 
the problem of local optima. If we had a more flexible cost function that could be 
applied to a multi-layer network, first to each hidden layer in turn, and finally to 
theoutput layer for classification, so that the two layers could discover genuinely 
different structure, we might be able to overcome the problem of getting trapped 
in local optima, yielding a more powerful and efficient learning procedure. 
We can analyze the optimal solutions for both de Sa's and Becker's cost functions 
(see Figure 1 a) and see that both cost functions are maximized by having perfect 
one-to-one agreement between the two groups of units over all cases, using a one-of-n 
encoding, i.e., having only a single output unit on for each case. A major limitation 
of these methods is that they strive for perfect classifications by the units. While 
this is desirable at the top layer of a network, it is an unsuitable goal for training 
intermediate layers to detect low-level features. For example, features like oriented 
edges would not be perfect predictors across spatially or temporally nearby image 
patches in images of translating and rotating objects. Instead, we might expect that 
an oriented edge at one location would predict a small range of similar orientations 
at nearby locations. So we would prefer a cost function whose optimal solution was 
more like those shown in Figure 1 b) or c). This would allow a feature i in group a 
to agree with any of several nearby features, e.g. i - 1, i, or i q- 1 in group b. 
JPMAX 935 
a) b) c) 
Figure 1: Three possible joint distributions for the probability that the ith and jth 
units in two sets of m classification units are both on. White is high density, black 
is low density. The optimal joint distribution for Becker's and de Sa's algorithms 
is a matrix with all its density either in the diagonal as in a), or any subset of the 
diagonal entries for de Sa's method, or a permutation of the diagonal matrix for 
Becker's algorithm. Alternative distributions are shown in b) and c). 
1.2 THE JPMAX ALGORITHM 
One way to achieve an arbitrary configuration of agreement over time between two 
groups of units (as in Figure I b) or c)) is to treat the desired configuration as a 
prior joint probability distribution over their outputs. We can obtain the actual 
distribution by observing the temporal correlations between pairs of units' outputs 
in the two groups over an ensemble of patterns. We can then optimize the actual 
distribution to fit the prior. We now derive two different cost functions which 
achieve this result. Interestingly, they result in very similar learning rules. 
Suppose we have two groups of m units as shown in Figure 2 a), receiving inputs, x- 
and , from the same or nearby parts of the input image. Let Ca(t) and Cb(t) be 
the classifications of the two input patches produced by the network at time step t; 
the outputs of the two groups of units, Ya( ) and yb(t), represent these classification 
eneti(t) 
probabilities: 
yai(t) - P(Ca(t) = i) - Ejenet(t) 
eneti (t) 
ybi(t) -- P(Cb(t) -- i) -- Ejenet(t ) (1) 
(the usual "softmax" output function) where netai(t) and netj(t) are the weighted 
net inputs to units. We could now observe the expected joint probability distribu- 
tion qij - E[yai(t)yj(t 4. 1)It- E[P(Ca(t)-i,Cb(t 4- 1)- J)]t by computing the 
temporal covariances between the classification probabilities, averaged over the en- 
semble of training patterns; this joint probability is an m2-valued random variable. 
Given the above statistics, one possible cost function we could minimize is the 
- log probability of the observed temporal covariance between the two sets of units' 
outputs under some prior distribution (e.g. Figure I b) or c)). If we knew the 
actual frequency counts for each (joint) classification k = kn,..., km, k2,..., kmm, 
936 $uzanna Becker 
a) a' b) 
III I III Illl Jill 
Illlllllllllllll 
111llll[11111lll 
IIII III IIIll IIII 
I III III I II 
I111111 
At 
( (t) C A (t) 
Figure 2: a) Two groups of 15 units receive inputs from a 2D retina. The groups 
are able to observe each other's outputs across lateral links with unit time delays. 
b) A second layer of two groups of 3 units is added to the architecture in a). 
rather than just the observed joint probabilities, qij = E [k_], then given our prior 
model, pxi,... ,Pmm, we could compute the probability of the observations under a 
multinomial distribution: 
P() -- l-[i,j kij! .. Pij (2) 
Using the de Moivre-Laplace approximation leads to the following: 
1 (kij - npij)2 ) 
- (3) 
exp   npij 
P()  V/(27rn) mJ-1 ni,jpij i,j 
Taking the derivative of the - log probability wrt kij leads to a very simple learning 
rule which depends only on the observed probabilities qij and priors Pij: 
0 - log P() npij - kij 
Ok o 
npij 
Pij -- qij 
Pij 
(4) 
G(p,q) = -  Z (pij logpij - pij logqij) 
i j 
(5) 
To obtain the final weight update rule, we just multiply this by o_. One problem 
n Owl ' 
with the above formulation is that the priors Pij must not be too close to zero for 
the de Moivre-Laplace approximation to hold. In practice, this cost function works 
well if we simply ignore the derivative terms where the priors are zero. 
An alternative cost function (as suggested by Peter Dayan) which works equally well 
is the Kullback-Liebler divergence or G-error between the desired joint probabilities 
Pij and the observed probabilities qij: 
JPMAX 93 7 
Figure 3:10 of the 1500 training patterns: geometric objects centered in 36 possible 
locations on a 1-by-1 pixel grid. Object location varied randomly between patterns. 
The derivative of G wrt qij is: 
OG _ pij (6) 
Oqij qij 
subject to -.ij qij -- I (enforced by the softmax output function). Note the simi- 
larity between the learning rules given by equations 4 and 6. 
2 EXPERIMENTS 
The network shown in Figure 2 a) was trained to minimize equation 5 on an en- 
semble of pattern trajectories of circles, squares and triangles (see Figure 3) for five 
runs starting from different random initial weights, using a gradient-based learning 
method. For ten successive frames, the same object would appear, but with the cen- 
tre varying randomly within the central six-by-six patch of pixels. In the last two 
frames, another randomly selected object would appear, so that object trajectories 
overlapped by two frames. These images are meant to be a crude approximation 
to what a moving observer would see while watching multiple moving objects in a 
scene; at any given time a single object would be approximately centered on the 
retina but its exact location would always be jittering from moment to moment. 
In these simulations, the prior distribution for the temporal covariances between 
the two groups of units' outputs was a block-diagonal configuration as in Figure 
I c), but with three five-by-five blocks along the diagonal. Our choice of a block- 
diagonal prior distribution with three blocks encodes the constraint that units in a 
given block in one group should try to agree with units in the same block in the 
other group; so each group should discover three classes of features. The number of 
units within a block was varied in preliminary experiments, and five units was found 
to be a reasonable number to capture all instances of each object class (although 
the performance of the algorithm seemed to be robust with respect to the number of 
units per block). The learning took about 3000 iterations of steepest descent with 
938 Suzanna Becker 
Figure 4: Weights learned by one of the two groups of 15 units in the first layer. 
White weights are positive and black are negative. 
momentum to converge, but after about 1000 iterations only very small weight 
changes were made. 
Weights learned on a typical run for one of the two groups of fifteen units are shown 
in Figure 4. The units' weights are displayed in three rows corresponding to units 
in the three blocks in the block-diagonal joint prior matrix. Units within the same 
block each learned different instances of the same pattern class. For example, on 
this run units in the first block learned to detect circles in specific positions. Units 
in the second block tended to learn combinations of either horizontal or vertical 
lines, or sometimes mixtures of the two. In the third block, units learned blurred, 
roughly triangular shape detectors, which for this training set were adequate to 
respond specifically to triangles. In all five runs the network converged to equivalent 
solutions (only the groups' particular shape preferences varied across runs). 
Varying the number of units per block from three to five (i.e. three three-by-three 
blocks versus three five-by-five blocks of units) produced similar results, except 
that with fewer units per block, each unit tended to capture multiple instances of 
a particular object class in different positions. 
A second layer of two groups of three units was added to the network, as shown in 
Figure 2 b). While keeping the first layer of weights frozen, this network was trained 
using exactly the same cost function as the first layer for about 30 iterations using 
a gradient-based learning method. This time the prior joint distribution for the 
two classifications was a three-by-three matrix with 80% of the density along the 
diagonal and 20% evenly distributed across the remainder of the distribution. Units 
in this layer learned to discriminate fairly well between the three object classes, 
as shown in Figure 5 a). On a test set with the ambiguous patterns removed 
(i.e., patterns containing multiple objects), units in the second layer achieved very 
JPMAX 939 
a) 
b) 
Layer 2 Unit Responses on Training Set 
I Circles 
mac Sqes 
O. 4O 
0.20 
UnR 1 UnR 2 UnR 3 UnR 4 Unit 5 UnR  
Leyer 2 Unit Responses on Test Set (no overlaps) 
 0.80 
040 
it 1 UnR 2 UnR  it 4 Unit 5 Unit  
Figure 5: Response probabilities for the six output units to each of the three shapes. 
accurate object discrimination as shown in Figure 5 b). 
On ambiguous patterns containing multiple objects, the network's performance was 
disappointing. The output units would sometimes produce the "correct" response, 
i.e., all the units representing the shapes present in the image would be partially 
active. Most often, however, only one of the correct shapes would be detected, and 
occasionally the network's response indicated the wrong shape altogether. It was 
hoped. that the diagonally dominant prior mixed with a uniform density would allow 
units to occasionally disagree, and they would therefore be able to represent cases 
of multiple objects. It may have helped to use a similar prior for the hidden layer; 
however, this would increase the complexity of the learning considerably. 
3 DISCUSSION 
We have shown that the algorithm can learn 2D-translation-invariant shape recog- 
nition, but it should handle equally well other types of transformations, such as 
rotation, scaling or even non-linear transformations. In principle, the algorithm 
should be applicable to real moving images; this is currently being investigated. Al- 
though we have focused here on the temporal coherence constraint, the algorithm 
could be applied equally well using other types of coherence, such as coherence 
940 $uzanna Becker 
across space or across different sensory modalities. 
Note that the units in the first layer of the network did not learn anything about 
the geometric transformations between translated versions of the same object; they 
simply learned to associate different views together. In this respect, the represen- 
tation learned at the hidden layer is similar to that predicted by the "privileged 
views" theory of viewpoint-invariant object recognition advocated by Weinshall et 
al. (1990) (and others). Their algorithm learns a similar representation in a single 
layer of competing units with temporal Hebbian learning applied to the lateral con- 
nections between these units. However, the algorithm proposed here goes further 
in that it can be applied to subsequent stages of learning to discover higher-order 
object classes. 
Yuille et al. (1994) have previously proposed an algorithm based on similar prin- 
ciples, which also involves maximizing the log probability of the network outputs 
under a prior; in one special case it is equivalent to Becker and Hinton's Ima algo- 
rithm. The algorithm proposed here differs substantially, in that we are dealing with 
the ensemble-averaged joint probabilities of two populations of units, and fitting this 
quantity to a prior; further, Yuille et al's scheme employs back-propagation. 
One challenge for future work is to train a network with smaller receptive fields 
for the first layer units, on images of objects with common lowqevel features, such 
as squares and rectangles. At least three layers of weights would be required to 
solve this task: units in the first layer would have to learn local object parts such as 
corners, while units in the next layer could group parts into viewpoint-specific whole 
objects and in the top layer viewpoint-invariance, in principle, could be achieved. 
Acknowledgements 
Helpful comments from Geoff Hinton, Peter Dayan, Tony Plate and Chris Williams 
are gratefully acknowledged. 
References 
Becker, S. (1993). Learning to categorize objects using temporal coherence. In 
Advances in Neural Information Processing Systems 5, pages 361-368. Morgan 
Kaufmann. 
Becker, S. and Hinton, G. E. (1992). A self-organizing neural network that dis- 
covers surfaces in random-dot stereograms. Nature, 355:161-163. 
de Sa, V. R. (1993). Minimizing disagreement for self-supervised classification. 
In Proceedings of the 1993 Connectionist Models Summer School, pages 300-307. 
Lawrence Erlbaum associates. 
FSldik, P. (1991). Learning invariance from transformation sequences. Neural 
Computation, 3 (2) :194-200. 
Weinshall, D., Edelman, S., and Biilthoff, H. H. (1990). A self-organizing 
multiple-view representation of 3D objects. In Advances in Neural Information 
Processing Systems 2, pages 274-282. Morgan Kaufmann. 
Yuille, A. L., Stelios, M. S., and Xu, L. (1994). Bayesian Self-Organization. 
Technical Report No. 92-10, Harvard Robotics Laboratory. 
