Tangent Prop - A formalism for specifying selected invariances in an adaptive network 
Patrice Simard 
AT&T Bell Laboratories 
101 Crawford Corner Rd 
Holmdel, NJ 07733 
Bernard Victorri 
Universit de Caen 
Caen 14032 Cedex 
France 
Yann Le Cun 
AT&T Bell Laboratories 
101 Crawford Corner Rd 
Holmdel, NJ 07733 
John Denker 
AT&T Bell Laboratories 
101 Crawford Corner Rd 
Holmdel, NJ 07733 
Abstract 
In many machine learning applications, one has access, not only to training 
data, but also to some high-level a priori knowledge about the desired be- 
havior of the system. For example, it is known in advance that the output 
of a character recognizer should be invariant with respect to small spa- 
tial distortions of the input images (translations, rotations, scale changes, 
etcetera). 
We have implemented a scheme that allows a network to learn the deriva- 
tive of its outputs with respect to distortion operators of our choosing. 
This not only reduces the learning time and the amount of training data, 
but also provides a powerful language for specifying what generalizations 
we wish the network to perform. 
1 INTRODUCTION 
In machine learning, one very often knows more about the function to be learned 
than just the training data. An interesting case is when certain directional deriva- 
tives of the desired function are known at certain points. For example, an image 
Figure 1: Top: Small rotations of an original digital image of the digit "3" (center). 
Middle: Representation of the effect of the rotation in the input vector space 
(assuming there are only 3 pixels). Bottom: Images obtained by moving along the 
tangent to the transformation curve for the same original digital image (middle). 
recognition system might need to be invariant with respect to small distortions of 
the input image such as translations, rotations, scalings, etc.; a speech recognition 
system might need to be invariant to time distortions or pitch shifts. In other 
words, the derivative of the system's output should be equal to zero when the input 
is transformed in certain ways. 
Given a large amount of training data and unlimited training time, the system 
could learn these invariances from the data alone, but this is often infeasible. The 
limitation on data can be overcome by training the system with additional data 
obtained by distorting (translating, rotating, etc.) the original patterns (Baird, 
1990). The top of Fig. 1 shows artificial data generated by rotating a digital image of 
the digit "3" (with the original in the center). This procedure, called the "distortion 
model", has two drawbacks. First, the user must choose the magnitude of distortion 
and how many instances should be generated. Second, and more importantly, the 
distorted data is highly correlated with the original data. This makes traditional 
learning algorithms such as backpropagation very inefficient. The distorted data 
carries only a very small incremental amount of information, since the distorted 
patterns are not very different from the original ones. It may not be possible to 
adjust the learning system so that learning the invariances proceeds at a reasonable 
rate while learning the original points is non-divergent. 
The key idea in this paper is that it is possible to directly learn the effect on 
the output of distorting the input, independently from learning the undistorted 
Figure 2: Learning a given function (solid line) from a limited set of examples (x1 
to x4). The fitted curves are shown as dotted lines. Left: The only constraint is that 
the fitted curve goes through the examples. Right: The fitted curve not only 
goes through each example, but its derivatives evaluated at the examples also agree 
with the derivatives of the given function. 
patterns. When a pattern P is transformed (e.g. rotated) with a transformation 
s that depends on one parameter α (e.g. the angle of the rotation), the set of all 
the transformed patterns S(P) = {s(α, P) | α} is a one-dimensional curve in the 
vector space of the inputs (see Fig. 1). In certain cases, such as rotations of digital 
images, this curve must be made continuous using smoothing techniques, as will be 
shown below. When the set of transformations is parameterized by n parameters 
α_i (rotation, translation, scaling, etc.), S(P) is a manifold of at most n dimensions. 
The patterns in S(P) that are obtained through small transformations of P, i.e. 
the part of S(P) that is close to P, can be approximated by a plane tangent to 
the manifold S(P) at point P. Small transformations of P can be obtained by 
adding to P a linear combination of vectors that span the tangent plane (tangent 
vectors). The images at the bottom of Fig. 1 were obtained by that procedure. 
More importantly, the tangent vectors can be used to specify high order constraints 
on the function to be learned, as explained below. 
To illustrate the method, consider the problem of learning a single-valued function 
F from a limited set of examples. Fig. 2 (left) represents a simple case where the 
desired function F (solid line) is to be approximated by a function G (dotted line) 
from four examples {(x_i, F(x_i))}_{i=1,2,3,4}. As exemplified in the picture, the fitted 
function G largely disagrees with the desired function F between the examples. If 
the functions F and G are assumed to be differentiable (which is generally the case), 
the approximation G can be greatly improved by requiring that G's derivatives 
evaluated at the points {xi} are equal to the derivatives of F at the same points 
(Fig. 2 right). This result can be extended to multidimensional inputs. In this case, 
we can impose the equality of the derivatives of F and G in certain directions, not 
necessarily in all directions of the input space. 
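The value-plus-derivative idea of Fig. 2 can be reproduced with ordinary least squares. Below is a minimal numpy sketch; the target function (sin), its derivative, and the four sample points are our own illustrative choices, not from the paper. Fitting a degree-7 polynomial to four values and four slopes pins down both the function and its derivative at the examples (Hermite interpolation).

```python
import numpy as np

# Illustrative target function and its derivative (our choice, not from the paper).
F, dF = np.sin, np.cos

xs  = np.array([0.5, 2.0, 3.5, 5.0])   # four example points x1..x4
deg = 7                                # 8 coefficients for 8 constraints

# Linear system for polynomial p(x) = sum_k c_k x^k:
# rows enforce p(xi) = F(xi) and p'(xi) = F'(xi).
V  = np.vander(xs, deg + 1, increasing=True)   # value constraints
dV = np.zeros_like(V)
for k in range(1, deg + 1):
    dV[:, k] = k * xs ** (k - 1)               # derivative constraints
c = np.linalg.solve(np.vstack([V, dV]), np.concatenate([F(xs), dF(xs)]))

p  = np.polynomial.Polynomial(c)   # fitted curve
dp = p.deriv()                     # its derivative matches dF at each example
```

With only the four value constraints (top system alone, degree 3), the fit is free to swing between the examples; adding the slope constraints removes most of that freedom.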
Such constraints find immediate use in traditional learning problems. It is often the 
case that a priori knowledge is available on how the desired function varies with 
Figure 3: How to compute a tangent vector for a given transformation (in this case 
a rotation). 
respect to some transformations of the input. It is straightforward to derive the 
corresponding constraint on the directional derivatives of the fitted function G in 
the directions of the transformations (previously named tangent vectors). Typical 
examples can be found in pattern recognition where the desired classification func- 
tion is known to be invariant with respect to some transformation of the input such 
as translation, rotation, scaling, etc.; in other words, the directional derivatives of 
the classification function in the directions of these transformations are zero. 
2 IMPLEMENTATION 
The implementation can be divided into two parts. The first part consists in com- 
puting the tangent vectors. This part is independent from the learning algorithm 
used subsequently. The second part consists in modifying the learning algorithm 
(for instance backprop) to incorporate the information about the tangent vectors. 
Part I: Let x be an input pattern and s be a transformation operator acting 
on the input space and depending on a parameter α. If s is a rotation operator 
for instance, then s(α, x) denotes the input x rotated by the angle α. We will 
require that the transformation operator s be differentiable with respect to α and 
x, and that s(0, x) = x. The tangent vector is by definition ∂s(α, x)/∂α. It can be 
approximated by a finite difference, as shown in Fig. 3. In the figure, the input space 
is a 16 by 16 pixel image and the patterns are images of handwritten digits. The 
transformations considered are rotations of the digit images. The tangent vector 
is obtained in two steps. First the image is rotated by an infinitesimal amount α. 
This is done by computing the rotated coordinates of each pixel and interpolating 
the gray level values at the new coordinates. This operation can be advantageously 
combined with some smoothing using a convolution. A convolution with a Gaussian 
provides an efficient interpolation scheme in O(nm) multiply-adds, where n and m 
are the (Gaussian) kernel and image sizes respectively. The next step is to subtract 
(pixel by pixel) the rotated image from the original image and to divide the result 
by the scalar α (see Fig. 3). If k types of transformations are considered, there 
will be k different tangent vectors per pattern. For most algorithms, these do not 
require any storage space since they can be generated as needed from the original 
pattern at negligible cost. 
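The two-step recipe above (rotate by a small α with Gaussian smoothing, subtract, divide by α) can be sketched in numpy. Everything here is an illustrative reimplementation, not the authors' code: the 16 by 16 "digit" is a crude synthetic bar, and the kernel width and step size are arbitrary choices.

```python
import numpy as np

def gaussian_smooth(img, sigma=0.9):
    """Separable Gaussian convolution (two 1-D passes, O(nm) per axis)."""
    r = int(3 * sigma)
    u = np.arange(-r, r + 1)
    k = np.exp(-u ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), 1, out)

def rotate(img, angle):
    """Rotate about the image center with bilinear interpolation."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    c, s = np.cos(angle), np.sin(angle)
    # source coordinates: inverse rotation of each target pixel
    sy = cy + (ys - cy) * c - (xs - cx) * s
    sx = cx + (ys - cy) * s + (xs - cx) * c
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    fy = np.clip(sy - y0, 0, 1)
    fx = np.clip(sx - x0, 0, 1)
    return (img[y0, x0] * (1 - fy) * (1 - fx) + img[y0 + 1, x0] * fy * (1 - fx)
            + img[y0, x0 + 1] * (1 - fy) * fx + img[y0 + 1, x0 + 1] * fy * fx)

def rotation_tangent(img, eps=1e-3, sigma=0.9):
    """Finite-difference tangent vector: (s(eps, x) - s(0, x)) / eps."""
    sm = gaussian_smooth(img, sigma)
    return (rotate(sm, eps) - sm) / eps

img = np.zeros((16, 16))
img[4:12, 7:9] = 1.0            # a crude vertical bar standing in for a digit
t = rotation_tangent(img)
# A small rotation is then approximated by gaussian_smooth(img) + alpha * t.
```

The smoothing matters: a raw binary image is piecewise constant along the rotation curve, so its finite difference is dominated by aliasing.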
Part II: Tangent prop is an extension of the backpropagation algorithm, allowing 
it to learn directional derivatives. Other algorithms such as radial basis functions 
can be extended in a similar fashion. 
To implement our idea, we will modify the usual weight-update rule: 
\[ \Delta w = -\eta \frac{\partial E}{\partial w} \quad\text{is replaced with}\quad \Delta w = -\eta \frac{\partial (E + \mu E_r)}{\partial w} \tag{1} \]
where η is the learning rate, E the usual objective function, E_r an additional objec- 
tive function (a regularizer) that measures the discrepancy between the actual and 
desired directional derivatives in the directions of some selected transformations, 
and μ is a weighting coefficient. 
Let x be an input pattern, and y = G(x) the input-output function of the network. 
The regularizer E_r is of the form 
\[ E_r = \sum_{x \in \text{training set}} E_r(x) \]
where E_r(x) is 
\[ E_r(x) = \sum_i \left\| K_i(x) - \left.\frac{\partial G(s_i(\alpha, x))}{\partial \alpha}\right|_{\alpha=0} \right\|^2 \tag{2} \]
Here, K_i(x) is the desired directional derivative of G in the direction induced by 
transformation s_i applied to pattern x. The second term inside the norm is the 
actual directional derivative, which can be rewritten as 
\[ \left.\frac{\partial G(s_i(\alpha, x))}{\partial \alpha}\right|_{\alpha=0} = G'(x) \cdot \left.\frac{\partial s_i(\alpha, x)}{\partial \alpha}\right|_{\alpha=0} \tag{3} \]
where G'(x) is the Jacobian of G for pattern x, and ∂s_i(α, x)/∂α is the tangent 
vector associated to transformation s_i as described in Part I. Multiplying the tangent 
vector by the Jacobian involves one forward propagation through a "linearized" 
version of the network. In the special case where local invariance with respect to 
the s_i's is desired, K_i(x) is simply set to 0. 
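The extra forward pass through the linearized network can be sketched in a few lines of numpy. The two-layer network below (tanh hidden layer, linear output, random weights) and all dimensions are illustrative choices of ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network G: tanh hidden layer, linear output (sizes illustrative).
W1 = rng.standard_normal((8, 16)) * 0.3
W2 = rng.standard_normal((4, 8)) * 0.3

def forward(x):
    a1 = W1 @ x
    x1 = np.tanh(a1)
    return a1, x1, W2 @ x1

def tangent_forward(x, t):
    """One forward pass through the linearized network: returns G'(x) . t."""
    a1, x1, _ = forward(x)
    alpha1 = W1 @ t                  # tangent pre-activation
    xi1 = (1 - x1 ** 2) * alpha1     # sigma'(a1) * alpha1  (tanh' = 1 - tanh^2)
    return W2 @ xi1                  # output layer is linear

x = rng.standard_normal(16)          # input pattern
t = rng.standard_normal(16)          # tangent vector of some transformation s_i
jvp = tangent_forward(x, t)

# Local invariance (K_i = 0): penalize the squared directional derivative.
Er = np.sum(jvp ** 2)
```

Note that the tangent pass reuses the activations a1 of the regular pass, so the marginal cost is roughly one extra forward propagation per tangent vector.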
Composition of transformations: The theory of Lie groups (Gilmore, 1974) 
ensures that compositions of local (small) transformations s_i correspond to linear 
combinations of the corresponding tangent vectors (the local transformations 
have the structure of a Lie algebra). Consequently, if E_r(x) = 0 holds, the network 
derivative in the direction of a linear combination of the tangent vectors is equal 
to the same linear combination of the desired derivatives. In other words, if the 
network is successfully trained to be locally invariant with respect to, say, horizontal 
and vertical translations, it will be invariant with respect to compositions 
thereof. 
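The composition property is easy to check numerically, since the directional derivative G'(x)·t is linear in t. A toy sketch (the one-layer tanh network, its weights, and the two tangent vectors are our own arbitrary choices; derivatives are taken by central differences):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((5, 9)) * 0.3    # toy one-layer network, our choice

def G(x):
    return np.tanh(W @ x)

def jvp(x, t, eps=1e-6):
    """Directional derivative G'(x) . t by central differences."""
    return (G(x + eps * t) - G(x - eps * t)) / (2 * eps)

x  = rng.standard_normal(9)
t1 = rng.standard_normal(9)    # tangent vector of one transformation
t2 = rng.standard_normal(9)    # tangent vector of another

# The derivative along a linear combination of tangent vectors equals the same
# linear combination of the individual derivatives, so driving each to zero
# also enforces invariance along their compositions.
combined = jvp(x, 0.7 * t1 - 1.3 * t2)
linear   = 0.7 * jvp(x, t1) - 1.3 * jvp(x, t2)
```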
We have derived and implemented an efficient algorithm, "tangent prop", for per- 
forming the weight update (Eq. 1). It is analogous to ordinary backpropagation, 
Figure 4: Forward propagated variables (a, x, α, ξ) and backward propagated vari- 
ables (b, y, β, ψ) in the regular network (roman symbols) and the Jacobian (lin- 
earized) network (Greek symbols). 
but in addition to propagating neuron activations, it also propagates the tangent 
vectors. The equations can be easily derived from Fig. 4. 
Forward propagation: 
\[ a_i^l = \sum_j w_{ij}^l\, x_j^{l-1}, \qquad x_i^l = \sigma(a_i^l) \tag{4} \]
Tangent forward propagation: 
\[ \alpha_i^l = \sum_j w_{ij}^l\, \xi_j^{l-1}, \qquad \xi_i^l = \sigma'(a_i^l)\, \alpha_i^l \tag{5} \]
Tangent gradient backpropagation: 
\[ \psi_i^l = \sum_k w_{ki}^{l+1}\, \beta_k^{l+1}, \qquad \beta_i^l = \sigma'(a_i^l)\, \psi_i^l \]
Gradient backpropagation: 
\[ y_i^l = \sum_k w_{ki}^{l+1}\, b_k^{l+1}, \qquad b_i^l = \sigma'(a_i^l)\, y_i^l \tag{6} \]
Weight update: 
\[ \frac{\partial\, [E(W, V_p) + \mu E_r(W, V_p, T_p)]}{\partial w_{ij}^l} = b_i^l\, x_j^{l-1} + \mu\, \beta_i^l\, \xi_j^{l-1} \tag{7} \]
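As a concrete sanity check of the scheme, the gradient of E + μE_r for a small network can be derived by hand and verified against finite differences. The sketch below is our own derivation, not the authors' code: a tanh hidden layer, a linear output layer, local invariance (K = 0), and ½‖·‖² norms for both objectives; all sizes and constants are illustrative. Note that backpropagating E_r exactly brings in second-derivative (σ'') terms, because σ'(a) in the linearized network itself depends on the weights.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((6, 10)) * 0.4   # hidden weights (sizes illustrative)
W2 = rng.standard_normal((3, 6)) * 0.4    # output weights

x = rng.standard_normal(10)   # input pattern
d = rng.standard_normal(3)    # desired output
t = rng.standard_normal(10)   # tangent vector; local invariance, so K = 0
mu = 1.0                      # regularization weight

def losses(W1, W2):
    """E = 1/2 ||G(x) - d||^2  and  Er = 1/2 ||G'(x) . t||^2."""
    u = np.tanh(W1 @ x)                   # hidden activations
    xi1 = (1 - u ** 2) * (W1 @ t)         # tangent forward through hidden layer
    return 0.5 * np.sum((W2 @ u - d) ** 2), 0.5 * np.sum((W2 @ xi1) ** 2)

def gradients(W1, W2):
    a1 = W1 @ x
    u = np.tanh(a1)
    sp = 1 - u ** 2                       # sigma'(a1)
    spp = -2 * u * sp                     # sigma''(a1)
    alpha1 = W1 @ t
    xi1 = sp * alpha1
    xi2 = W2 @ xi1                        # directional derivative G'(x) . t
    # Ordinary backprop for E.
    b2 = W2 @ u - d
    gE_W2 = np.outer(b2, u)
    gE_W1 = np.outer((W2.T @ b2) * sp, x)
    # Backprop for Er through the linearized network; the sigma'' term
    # accounts for the dependence of sigma'(a1) on the weights.
    psi1 = W2.T @ xi2
    gR_W2 = np.outer(xi2, xi1)
    gR_W1 = np.outer(psi1 * sp, t) + np.outer(psi1 * spp * alpha1, x)
    return gE_W1 + mu * gR_W1, gE_W2 + mu * gR_W2

g1, g2 = gradients(W1, W2)
eta = 0.05
W1_new, W2_new = W1 - eta * g1, W2 - eta * g2   # Delta w = -eta d(E + mu Er)/dw
```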
Figure 5: Generalization performance curve as a function of the training set size for 
the tangent prop and the backprop algorithms 
The regularization parameter μ is tremendously important, because it determines 
the tradeoff between minimizing the usual objective function and minimizing the 
directional derivative error. 
3 RESULTS 
Two experiments illustrate the advantages of tangent prop. The first experiment 
is a classification task, using a small (linearly separable) set of 480 binarized hand- 
written digits. The training sets consist of 10, 20, 40, 80, 160 or 320 patterns, and 
the test set contains the remaining 160 patterns. The patterns are smoothed 
using a Gaussian kernel with a standard deviation of one half pixel. For each of the 
training set patterns, the tangent vectors for horizontal and vertical translation 
are computed. The network has two hidden layers with locally connected shared 
weights, and one output layer with 10 units (5194 connections, 1060 free parameters) 
(Le Cun, 1989). The generalization performance as a function of the training 
set size for traditional backprop and tangent prop are compared in Fig. 5. We have 
conducted additional experiments in which we implemented not only translations 
but also rotations, expansions and hyperbolic deformations. This set of 6 gener- 
ators is a basis for all linear transformations of coordinates for two dimensional 
images. It is straightforward to implement other generators including gray-level- 
shifting, "smooth" segmentation, local continuous coordinate transformations and 
independent image segment transformations. 
The next experiment is designed to show that in applications where data is highly 
Figure 6: Comparison of the distortion model (left column) and tangent prop (right 
column). The top row gives the learning curves (error versus number of sweeps 
through the training set). The bottom row gives the final input-output function of 
the network; the dashed line is the result for unadorned backprop. 
correlated, tangent prop yields a large speed advantage. Since the distortion model 
implies adding lots of highly correlated data, the advantage of tangent prop over 
the distortion model becomes clear. 
The task is to approximate a function that has plateaus at three locations. We want 
to enforce local invariance near each of the training points (Fig. 6, bottom). The 
network has one input unit, 20 hidden units and one output unit. Two strategies are 
possible: either generate a small set of training points covering each of the plateaus 
(open squares on Fig. 6 bottom), or generate one training point for each plateau 
(closed squares), and enforce local invariance around them (by setting the desired 
derivative to 0). The training set of the former method is used to measure the 
performance of both methods. All parameters were adjusted for approximately 
optimal performance in all cases. The learning curves for both models are shown in 
Fig. 6 (top). Each sweep through the training set for tangent prop is a little faster 
since it requires only 6 forward propagations, while it requires 9 in the distortion 
model. As can be seen, stable performance is achieved after 1300 sweeps for 
tangent prop, versus 8000 for the distortion model. The overall speedup is therefore 
about 10. 
Tangent prop in this example can take advantage of a very large regularization term. 
The distortion model is at a disadvantage because the only parameter that effec- 
tively controls the amount of regularization is the magnitude of the distortions, and 
this cannot be increased to large values because the right answer is only invariant 
under small distortions. 
4 CONCLUSIONS 
When a priori information about invariances exists, this information must be made 
available to the adaptive system. There are several ways of doing this, including the 
distortion model and tangent prop. The latter may be much more efficient in some 
applications, and it permits separate control of the emphasis and learning rate for 
the invariances, relative to the original training data points. Training a system to 
have zero derivatives in some directions is a powerful tool to express invariances to 
transformations of our choosing. Tests of this procedure on large-scale applications 
(handwritten zipcode recognition) are in progress. 
References 
Baird, H. S. (1990). Document Image Defect Models. In IAPR 1990 Workshop on 
Syntactic and Structural Pattern Recognition, pages 38-46, Murray Hill, NJ. 
Gilmore, R. (1974). Lie Groups, Lie Algebras and some of their Applications. Wiley, 
New York. 
Le Cun, Y. (1989). Generalization and Network Design Strategies. In Pfeifer, R., 
Schreter, Z., Fogelman, F., and Steels, L., editors, Connectionism in Perspec- 
tive, Zurich, Switzerland. Elsevier. An extended version was published as a 
technical report of the University of Toronto. 
