A Constructive Learning Algorithm for Discriminant Tangent Models

Diego Sona, Alessandro Sperduti, Antonina Starita
Dipartimento di Informatica, Università di Pisa
Corso Italia, 40, 56125 Pisa, Italy
email: {sona,perso,starita}@di.unipi.it
Abstract 
To reduce the computational complexity of classification systems
using tangent distance, Hastie et al. (HSS) developed an algorithm
to devise rich models for representing large subsets of the data,
which automatically computes the "best" associated tangent subspace.
Schwenk & Milgram proposed a discriminant modular classification
system (Diabolo) based on several autoassociative multilayer
perceptrons which use tangent distance as the error reconstruction
measure.
We propose a gradient based constructive learning algorithm for
building a tangent subspace model with discriminant capabilities
which combines several of the advantages of both HSS and Diabolo:
the devised tangent models have discriminant capabilities; space
requirements are improved with respect to HSS, since our algorithm
is discriminant and thus needs fewer prototype models; the dimension
of the tangent subspace is determined automatically by the
constructive algorithm; and our algorithm is able to learn new
transformations.
1 Introduction
Tangent distance is a well known technique used for transformation invariant pat- 
tern recognition. State-of-the-art accuracy can be achieved on an isolated hand- 
written character task using tangent distance as the classification metric within a 
nearest neighbor algorithm [SCD93]. However, this approach has a quite high com- 
putational complexity, owing to the inefficient search and large number of Euclidean 
and tangent distances that need to be calculated. Different researchers have shown 
how such time complexity can be reduced [Sim94, SS95] at the cost of increased 
space complexity. 
A different approach to the problem was used by Hastie et al. [HSS95] and Schwenk
& Milgram [SM95b, SM95a]. Both of them used learning algorithms for reducing
the classification time and space requirements, while trying to preserve the same 
accuracy. Hastie et al. [HSS95] developed rich models for representing large subsets 
of the prototypes. These models are learned from a training set through a Singular 
Value Decomposition based algorithm which minimizes the average 2-sided tangent 
distance from a subset of the training images. A nice feature of this algorithm 
is that it computes automatically the "best" tangent subspace associated with 
the prototypes. Schwenk & Milgram [SM95b] proposed a modular classification 
system (Diabolo) based on several autoassociative multilayer perceptrons which use 
tangent distance as the error reconstruction measure. This original model was then 
improved by adding discriminant capabilities to the system [SM95a]. 
Comparing the Hastie et al. algorithm (HSS) with the discriminant version of Diabolo,
we observe that: Diabolo seems to require less memory than HSS, however, learning 
is faster in HSS; Diabolo is discriminant while HSS is not; the number of hidden 
units to be used in Diabolo's autoassociators must be decided heuristically through 
a trial and error procedure, while the dimension of the tangent subspaces in HSS 
can be controlled more easily; Diabolo uses predefined transformations, while HSS 
is able to learn new transformations (like style transformations). 
In this paper, we introduce the tangent distance neuron (TD-neuron), which
implements the 1-sided version of the tangent distance, and we devise a gradient
based constructive learning algorithm for building a tangent subspace model with
discriminant capabilities. In this way, we are able to combine the advantages of
both HSS and Diabolo: the model has discriminant capabilities; learning is just
a bit slower than in HSS; space requirements are improved with respect to HSS,
since the TD-neuron is discriminant and thus needs fewer prototype models; the
dimension of the tangent subspace is determined automatically by the constructive
algorithm; and the TD-neuron is able to learn new transformations.
2 Tangent Distance 
In several pattern recognition problems Euclidean distance fails to give a satis- 
factory solution since it is unable to account for invariant transformations of the 
patterns. Simard et al. [SCD93] suggested dealing with this problem by generating 
a parameterized 7-dimensional manifold for each image, where each parameter ac- 
counts for one such invariance. The underlying idea consists in approximating the 
considered transformations locally through a linear model. 
For the sake of exposition, consider rotation. Given a digitized image $X_i$ of a
pattern $i$, the rotation operation can be approximated by $\hat{X}_i(\theta) = X_i + T_{X_i}\theta$,
where $\theta$ is the rotation angle, and $T_{X_i}$ is the tangent vector to the rotation curve
generated by the rotation operator for $X_i$. The tangent vector $T_{X_i}$ can easily be
computed by finite difference. Now, instead of measuring the distance between two
images as $D(X_i, X_j) = \|X_i - X_j\|$ for any norm $\|\cdot\|$, Simard et al. proposed using
the tangent distance $D_T(X_i, X_j) = \min_{\theta_i, \theta_j} \|\hat{X}_i(\theta_i) - \hat{X}_j(\theta_j)\|$.
If $k$ types of transformations are considered, there will be $k$ different tangent vectors
per pattern. If $\|\cdot\|$ is the Euclidean norm, computing the tangent distance is a
simple least-squares problem. A solution for this problem¹ can be found in Simard
et al. [SCD93], where the authors used $D_T$ to drive a 1-NN classification rule.
1A special caqe of tangent distance, i.e., the one sided tangent distance 
ol--sided( ', -- , . 
. . X) mine I[Xi()- Xjll can be computed more efficiently [SS95] 
Figure 1: Geometric interpretation of equation (1). Note that $\mathrm{net} = (D_T^{1\text{-sided}})^2$.
Unfortunately, 1-NN is expensive. To reduce the complexity of the above approach, 
Hastie et al. [HSS95] proposed an algorithm for the generation of rich models 
representing large subsets of patterns. This algorithm computes for each class a 
prototype (the centroid), and an associated subspace (described by the tangent 
vectors), such that the total tangent distance of the centroid with respect to the 
prototypes in the training set is minimised. Note that the associated subspace is 
not predefined as in the case of standard tangent distance, but is computed on the 
basis of the training set. 
3 Tangent Distance Neuron 
In this section we define the Tangent Distance neuron (TD-neuron), which is the 
computational model studied in this paper. A TD-neuron is characterized by a set 
of $n + 1$ vectors of the same dimension as the input vectors (in our case, images).
One of these vectors, $W$, is used as reference vector (centroid), while the remaining
vectors, $T_i$ ($i = 1, \ldots, n$), are used as tangent vectors. Moreover, the set of tangent
vectors constitutes an ortho-normal basis.
Given an input vector $I$, the input net of the TD-neuron is computed as the square
of the 1-sided tangent distance between $I$ and the tangent model $\{W, T_1, \ldots, T_n\}$
(see Figure 1):

$$\mathrm{net} = \left(D_T^{1\text{-sided}}(I, W)\right)^2 = \|d\|^2 - \sum_{i=1}^{n} \gamma_i^2, \qquad (1)$$

where we have used the fact that the tangent vectors constitute an ortho-normal
basis. For the sake of notation, $d$ denotes the difference between the input pattern
and the centroid, and the projection of $d$ over the $i$-th tangent vector is denoted by
$\gamma_i$. Note that, by definition, net is non-negative.
The output $o$ of the TD-neuron is then computed by transforming the net through
a nonlinear monotone function $f$. In our experiments, we have used the following
function:

$$o = f(\mathrm{net}) = \frac{\alpha}{\alpha + \mathrm{net}}, \qquad (2)$$

where $\alpha$ controls the steepness of the function. Note that $o$ is positive, since net is
always non-negative, and lies within the range $(0, 1]$.
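Equations (1-2) amount to subtracting the energy of the projection onto the tangent subspace from the squared distance to the centroid, then squashing the result. A minimal NumPy sketch of this forward pass (the function and variable names are ours):

```python
import numpy as np

def td_neuron_forward(I, W, T, alpha):
    """Forward pass of a TD-neuron.

    I     : input vector (d,)
    W     : centroid (d,)
    T     : (n, d) ortho-normal tangent vectors (rows)
    alpha : steepness parameter of the squashing function.

    net is the squared 1-sided tangent distance (eq. 1);
    o = alpha / (alpha + net) is the output (eq. 2), in (0, 1].
    """
    d = I - W                       # difference from the centroid
    gamma = T @ d                   # projections on the tangent vectors
    net = d @ d - gamma @ gamma     # squared distance to the subspace
    o = alpha / (alpha + net)
    return o, net, gamma, d
```

An input lying on the tangent model (e.g. the centroid itself) yields net = 0 and o = 1; inputs far from the subspace drive o toward 0.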
4 Learning 
The TD-neuron can be trained to discriminate between patterns belonging to two
different classes through a gradient descent technique. Thus, given a training set
$\{(I_1, t_1), \ldots, (I_N, t_N)\}$, where $t_i \in \{0, 1\}$ is the $i$-th desired output, and $N$ is
the total number of patterns in the training set, we can define the error function as

$$E = \frac{1}{2} \sum_{k=1}^{N} (t_k - o_k)^2, \qquad (3)$$

where $o_k$ is the output of the TD-neuron for the $k$-th input pattern.
Using equations (1-2), it is trivial to compute the changes for the tangent vectors,
the centroid, and $\alpha$:

$$\Delta T_i = -\eta \frac{\partial E}{\partial T_i} = \frac{2\eta}{\alpha} \sum_{k=1}^{N} (t_k - o_k)\, o_k^2\, \gamma_{ik}\, d_k, \qquad (4)$$

$$\Delta W = -\eta \frac{\partial E}{\partial W} = \frac{2\eta}{\alpha} \sum_{k=1}^{N} (t_k - o_k)\, o_k^2 \left( d_k - \sum_{i=1}^{n} \gamma_{ik} T_i \right), \qquad (5)$$

$$\Delta \alpha = -\eta_\alpha \frac{\partial E}{\partial \alpha} = \frac{\eta_\alpha}{\alpha} \sum_{k=1}^{N} (t_k - o_k)\, o_k\, (1 - o_k), \qquad (6)$$

where $\eta$ and $\eta_\alpha$ are learning parameters, $d_k = I_k - W$, and $\gamma_{ik} = T_i \cdot d_k$.
The learning algorithm initializes the centroid $W$ to the average of the patterns with
target 1, i.e., $W = \frac{1}{N_1} \sum_{k : t_k = 1} I_k$, where $N_1$ is the number of patterns with target
equal to 1, and the tangent vectors to random vectors with small modulus. Then
$\alpha$, the centroid $W$, and the tangent vectors $T_i$ are changed according to equations
(4-6). Moreover, since the tangent vectors must constitute an ortho-normal basis,
after each epoch of training the vectors $T_i$ are ortho-normalized.
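One batch epoch of this procedure can be sketched as follows. This is our reading of equations (4-6), not the authors' code; the ortho-normalization after the epoch is implemented here with a QR decomposition, a standard numerically stable equivalent of Gram-Schmidt:

```python
import numpy as np

def train_epoch(patterns, targets, W, T, alpha, eta, eta_alpha):
    """One gradient epoch for a TD-neuron (eqs. 4-6), followed by
    ortho-normalization of the tangent vectors. Sketch; names are ours."""
    dT = np.zeros_like(T)
    dW = np.zeros_like(W)
    d_alpha = 0.0
    for I, t in zip(patterns, targets):
        d = I - W                            # d_k in eqs. (4-5)
        gamma = T @ d                        # gamma_ik, projections on T_i
        net = d @ d - gamma @ gamma          # eq. (1)
        o = alpha / (alpha + net)            # eq. (2)
        delta = (t - o) * o * o / alpha      # common factor of eqs. (4-5)
        dT += 2 * eta * delta * np.outer(gamma, d)            # eq. (4)
        dW += 2 * eta * delta * (d - T.T @ gamma)             # eq. (5)
        d_alpha += eta_alpha * (t - o) * o * (1 - o) / alpha  # eq. (6)
    W = W + dW
    T = T + dT
    alpha = alpha + d_alpha
    # re-ortho-normalize the tangent vectors after the epoch
    Q, _ = np.linalg.qr(T.T)
    T = Q.T
    return W, T, alpha
```

The centroid initialization is simply `W = patterns[targets == 1].mean(axis=0)`, matching the average over target-1 patterns described above.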
5 The Constructive Algorithm 
Before training the TD-neuron using equations (4-6), we have to set the tangent 
subspace dimension. The same problem is present in HSS and Diabolo (i.e., number 
of hidden units). To solve this problem we have developed a constructive algorithm 
which adds tangent vectors one by one according to the computational needs. 
The key idea is based on the observation that a typical run of the learning algorithm 
described in Section 4 leads to the sequential convergence of the vectors according to 
their relative importance. This means that the centroid converges first, while the
tangent vectors all remain random vectors. Then one of the tangent vectors
converges to the most relevant transformation (while the remaining tangent
vectors are still immature), and so on, until all the tangent vectors converge,
one by one, to less and less relevant transformations.
This behavior suggests starting the training using only the centroid (i.e., without 
tangent vectors) and allow it to converge. Then, as in other constructive algorithms, 
the centroid is frozen and one random tangent vector $T_1$ is added. Learning is
resumed until changes in $T_1$ become irrelevant. During learning, however, $T_1$ is
normalized after each epoch. At convergence, $T_1$ is frozen, a new random tangent
vector $T_2$ is added, and learning is resumed. New tangent vectors are iteratively
added until changes in the classification accuracy become irrelevant.
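The schedule above can be sketched as a generic loop. The `train_step` and `evaluate` callables below are illustrative assumptions of this sketch (not an API from the paper): `train_step` runs learning epochs updating only the non-frozen components, and `evaluate` returns classification accuracy on held-out data.

```python
import numpy as np

def constructive_schedule(train_step, evaluate, dim, max_tangents, tol=1e-3):
    """Sketch of the constructive algorithm: train the centroid alone,
    freeze it, then add and train one tangent vector at a time until
    accuracy stops improving by more than tol."""
    rng = np.random.default_rng(0)
    W = None
    T = np.empty((0, dim))
    # Phase 1: centroid only (no tangent vectors).
    W, T = train_step(W, T, freeze_below=0)
    best = evaluate(W, T)
    # Phase 2: add tangent vectors one by one; earlier ones stay frozen.
    for n in range(max_tangents):
        t_new = rng.normal(size=(1, dim)) * 0.01   # small random vector
        T = np.vstack([T, t_new])
        W, T = train_step(W, T, freeze_below=n)    # only T[n] is free
        acc = evaluate(W, T)
        if acc - best < tol:                       # no useful improvement
            T = T[:-1]                             # discard the last vector
            break
        best = acc
    return W, T
```

The freeze-then-grow structure mirrors the observed convergence order: each newly added vector is free to capture the most relevant remaining transformation.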
                 HSS                  TD-neuron
 # Tan.    % Cor    % Err      % Cor    % Rej    % Err
   0         --       --       73.78     7.24    18.98
   1       78.74    21.26      72.06    10.48    17.46
   2       79.10    20.90      77.99     8.05    13.96
   3       79.94    20.06      81.14     7.17    11.69
   4       81.47    18.53      82.68     6.84    10.48
   5       76.87    23.13      84.25     5.63    10.12
   6       71.29    28.71      85.21     5.14     9.65
   7         --       --       86.16     4.76     9.08
   8         --       --       86.37     4.89     8.74

Table 1: The results obtained by the HSS algorithm and the TD-neuron.
6 Results 
We have tested our constructive algorithm versus the HSS algorithm (which uses 
the 2-sided tangent distance) on 10587 binary digits from the NIST-3 dataset. The 
binary 128x128 digits were transformed into a 64-grey level 16x16 format by a 
simple local counting procedure. No other pre-processing transformation 
was performed. The training set consisted of 3000 randomly chosen digits, while 
the remaining digits were used in the test set. A single tangent model for each
class of digit was computed using both algorithms. The classification of the test 
digits was performed using the label of the closest model for HSS and the output 
of the TD-neurons for our system. The TD-neurons used a rejection criterion with 
parameters adapted during training. 
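The "simple local counting procedure" plausibly amounts to counting "on" pixels in non-overlapping 8x8 blocks, mapping each 128x128 binary image to a 16x16 grid of counts in [0, 64] (close to the 64 grey levels reported). This is our guess at such a procedure, not the authors' code:

```python
import numpy as np

def local_count_downsample(img, block=8):
    """Downsample a binary image by counting 'on' pixels in each
    block x block cell; 128x128 -> 16x16 with counts in [0, block**2].
    A sketch of one plausible 'local counting' procedure."""
    h, w = img.shape
    assert h % block == 0 and w % block == 0
    # Split into (h/block, block, w/block, block) cells and sum each cell.
    return img.reshape(h // block, block, w // block, block).sum(axis=(1, 3))
```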
In Table 1 we have reported the performances on the test set of both HSS and our 
system. Different numbers of tangent vectors were tested for both of them. From the 
results it is clear that the models generated by HSS reach a peak in performance with 
4 tangent vectors and then a sharp degradation of the generalization is observed 
by adding more tangent vectors. On the contrary, the TD-neurons are able to
steadily increase the performance with an increasing number of tangent vectors.
The improvement in the performance, however, seems to saturate when using many 
tangent vectors. Table 2 presents the confusion matrix obtained by the TD-neurons 
with 8 tangent vectors. 
For comparison, we display some of the tangent models computed by HSS and 
by our algorithm in Figure 2. Note how tangent models developed by the HSS 
algorithm tend to be more blurred than the ones developed by our algorithm. This 
is due to the lack of discriminant capabilities of the HSS algorithm, and it is the
main cause of the degradation in performance observed when using more than 4 
tangent vectors. 
It must be pointed out that, for a fixed number of tangent vectors, the HSS algo- 
rithm is faster than ours, because it needs only a fraction of the training examples 
(only one class). However, our algorithm is remarkably more efficient when a family
of tangent models with an increasing number of tangent vectors must be generated².
Moreover, since a TD-neuron uses the one sided tangent distance, it is faster in com- 
puting the output. 
7 Conclusion 
We introduced the tangent distance neuron (TD-neuron), which implements the 
1-sided version of the tangent distance and gave a constructive learning algorithm 
for building a tangent subspace with discriminant capabilities. As stated in the in- 
²The tangent model computed by HSS depends on the number of tangent vectors.
References
