Minkowski-r Back-Propagation: Learning in Connectionist 
Models with Non-Euclidian Error Signals 
Stephen José Hanson and David J. Burr 
Bell Communications Research 
Morristown, New Jersey 07960 
Abstract 
Many connectionist learning models are implemented using a gradient descent 
in a least squares error function of the output and teacher signal. The present model 
generalizes, in particular, back-propagation [1] by using Minkowski-r power metrics. 
For small r's a "city-block" error metric is approximated and for large r's the 
"maximum" or "suprcmum" metric is approached, while for r=2 the standard back- 
propagation model results. An implementation of Minkowski-r back-propagation is 
described, and several experiments are done which show that different values of r 
may be desirable for various purposes. Different r values may be appropriate for the 
reduction of the effects of outliers (noise), modeling the input space with more 
compact clusters, or modeling the statistics of a particular domain more naturally or 
in a way that may be more perceptually or psychologically meaningful (e.g. speech or 
vision). 
1. Introduction 
The recent resurgence of connectionist models can be traced to their ability to 
do complex modeling of an input domain. It can be shown that neural-like networks 
containing a single hidden layer of non-linear activation units can learn to do a 
piece-wise linear partitioning of a feature space [2]. One result of such a partitioning 
is a complex gradient surface on which decisions about new input stimuli will be made. The generalization, categorization and clustering properties of the network are 
therefore determined by this mapping of input stimuli to this gradient surface in the 
output space. This gradient surface is a function of the conditional probability 
distributions of the output vectors given the input feature vectors as well as a function 
of the error relating the teacher signal and output. 
Presently many of the models have been implemented using least squares error. 
In this paper we describe a new model of gradient descent back-propagation [1] using Minkowski-r power error metrics. For small r's a "city-block" error measure (r=1) is 
approximated and for larger r's a "maximum" or supremum error measure is 
approached, while the standard case of Euclidian back-propagation is a special case 
with r=2. First we derive the general case and then discuss some of the implications 
of varying the power in the general metric. 
2. Derivation of Minkowski-r Back-propagation 
The standard back-propagation is derived by minimizing least squares error as 
a function of connection weights within a completely connected layered network. 
The error for the Euclidian case is (for a single input-output pair), 
E = \frac{1}{2} \sum_{i} (y_i - \hat{y}_i)^2 \qquad (1)
where y_i is the activation of a unit and \hat{y}_i represents an independent teacher signal. 
The activation of a unit (y_i) is typically computed by normalizing the input from other units (x_i) over the interval (0,1) while compressing the high and low end of this range. 
A common function used for this normalization is the logistic, 
y_i = \frac{1}{1 + e^{-x_i}} \qquad (2)
The input to a unit (x_i) is found by summing products of the weights and corresponding activations from other units, 
x_i = \sum_{h} w_{ih} y_h \qquad (3)
where y_h represents units in the fan-in of unit i and w_{ih} represents the strength of the connection between unit i and unit h. 
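As a concrete illustration of equations 1-3, the following is a minimal sketch in Python/NumPy; the function and variable names (and the choice of array shapes) are ours, not part of the original formulation.

import numpy as np

def logistic(x):
    # Equation 2: squash the summed input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def forward(y_h, w):
    # Equation 3: x_i = sum over h of w_ih * y_h, followed by the logistic of equation 2
    x = w @ y_h
    return logistic(x)

def euclidian_error(y, y_hat):
    # Equation 1: least squares error for a single input-output pair
    return 0.5 * np.sum((y - y_hat) ** 2)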
A gradient for the Euclidian or standard back-propagation case can be found by taking the partial derivative of the error with respect to each weight, and can be expressed as the three-term differential, 
\frac{\partial E}{\partial w_{ih}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial x_i} \frac{\partial x_i}{\partial w_{ih}} \qquad (4)
which from the equations before turns out to be, 
\frac{\partial E}{\partial w_{ih}} = (y_i - \hat{y}_i)\, y_i (1 - y_i)\, y_h \qquad (5)
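Written as code (continuing the NumPy sketch above), equation 5 is simply an outer product of the per-unit output deltas with the fan-in activations; the function name is ours.

def euclidian_gradient(y, y_hat, y_h):
    # Equation 5: dE/dw_ih = (y_i - y_hat_i) * y_i * (1 - y_i) * y_h
    return np.outer((y - y_hat) * y * (1.0 - y), y_h)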
Generalizing the error for Minkowski-r power metrics (see Figure 1 for the 
family of curves), 
E = \frac{1}{r} \sum_{i} |y_i - \hat{y}_i|^{\,r} \qquad (6)
Figure 1: Minkowski-r Family 
Using equations 2-4 above with equation 6 we can easily find an expression for the 
gradient in the general Minkowski-r case, 
\frac{\partial E}{\partial w_{ih}} = |y_i - \hat{y}_i|^{\,r-1}\, y_i (1 - y_i)\, y_h\, \mathrm{sgn}(y_i - \hat{y}_i) \qquad (7)
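In the same notation, a sketch of equations 6 and 7; here y is the vector of output activations, y_hat the teacher signal and y_h the fan-in activations, and again the function names are ours.

def minkowski_error(y, y_hat, r):
    # Equation 6: Minkowski-r power error; r=2 recovers the least squares case
    return np.sum(np.abs(y - y_hat) ** r) / r

def minkowski_gradient(y, y_hat, y_h, r):
    # Equation 7: dE/dw_ih = |y_i - y_hat_i|^(r-1) * y_i*(1-y_i) * y_h * sgn(y_i - y_hat_i)
    diff = y - y_hat
    delta = (np.abs(diff) ** (r - 1)) * y * (1.0 - y) * np.sign(diff)
    return np.outer(delta, y_h)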
This gradient is used in the weight update rule proposed by Rumelhart, Hinton and Williams [1], 
w_{ih}(n+1) = -\alpha \frac{\partial E}{\partial w_{ih}} + w_{ih}(n) \qquad (8)
Since the gradient computed for the hidden layer is a function of the gradient for the 
output, the hidden layer weight updating proceeds in the same way as in the 
Euclidian case [1], simply substituting this new Minkowski-r gradient. 
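Putting equations 2-8 together, a hedged sketch of one complete Minkowski-r training step for a single-hidden-layer network, reusing the logistic helper above. The explicit bias terms, the omission of the momentum ("smoothing") term, and the name lr for the learning rate are our simplifications, not the authors' exact implementation.

def train_step(x_in, y_hat, w_hid, b_hid, w_out, b_out, r, lr=0.05):
    # Forward pass (equations 2-3)
    h = logistic(w_hid @ x_in + b_hid)   # hidden activations
    y = logistic(w_out @ h + b_out)      # output activations

    # Output-layer delta from the Minkowski-r gradient (equation 7)
    diff = y - y_hat
    delta_out = (np.abs(diff) ** (r - 1)) * y * (1.0 - y) * np.sign(diff)

    # Hidden-layer delta: the usual back-propagated error, with the
    # Minkowski-r delta substituted for the squared-error delta
    delta_hid = (w_out.T @ delta_out) * h * (1.0 - h)

    # Weight updates (equation 8), written here as plain gradient descent
    w_out -= lr * np.outer(delta_out, h)
    b_out -= lr * delta_out
    w_hid -= lr * np.outer(delta_hid, x_in)
    b_hid -= lr * delta_hid
    return w_hid, b_hid, w_out, b_out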
It is also possible to define a gradient over r such that a minimum in error 
would be sought. Such a gradient was suggested by White [3, see also 4] for 
maximum likelihood estimation of r, and can be shown to be, 
\frac{\partial \log(E)}{\partial r} = \left(1 - \frac{1}{r}\right)\frac{1}{r} + \frac{1}{r^2}\log(r) + \frac{1}{r^2}\,\Psi\!\left(\frac{1}{r}\right) + \frac{1}{r^2}\sum_{i}|y_i - \hat{y}_i|^{\,r} - \frac{1}{r}\sum_{i}|y_i - \hat{y}_i|^{\,r}\log|y_i - \hat{y}_i| \qquad (9)
An approximation of this gradient (using the last term of equation 9) has been 
implemented and investigated for simple problems and shown to be fairly robust in 
recovering similar r values. However, it is important that the r update rule change more slowly than the weight update rule. In the simulations we ran, r was changed once for every 10 times the weight values were changed. This rate might be expected to vary 
with the problem and rate of convergence. Local minima may be expected in larger 
problems while seeking an optimal r. It may be more informative for the moment to 
examine different classes of problems with fixed r and consider the specific rationale 
for those classes of problems. 
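A sketch of the r-update schedule described above, using only the last term of equation 9 as the approximate gradient over r; the step size r_lr, the small constant guarding the logarithm, and the lower bound keeping r in the metric range are our assumptions.

def approx_r_gradient(y, y_hat, r, eps=1e-12):
    # Last term of equation 9: -(1/r) * sum_i |y_i - y_hat_i|^r * log|y_i - y_hat_i|
    # (eps keeps the logarithm finite when an output exactly matches its target)
    err = np.abs(y - y_hat) + eps
    return np.sum(-(1.0 / r) * (err ** r) * np.log(err))

def maybe_update_r(step, r, y, y_hat, r_lr=0.01, every=10):
    # Update r only once for every `every` weight updates, so that r changes
    # more slowly than the weights
    if step % every == 0:
        r -= r_lr * approx_r_gradient(y, y_hat, r)
        r = max(r, 1.0)  # keep r at or above the city-block case (our choice)
    return r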
3. Variations in r 
Various r values may be useful for various aspects of representing information 
in the feature domain. Changing r basically results in a reweighting of errors from 
output bits.¹ Small r's give less weight to large deviations and tend to reduce the 
influence of outlier points in the feature space during learning. In fact, it can be 
shown that if the distributions of feature vectors are non-gaussian, then the r=2 case 
¹ It is possible to entertain r values that are negative, which would give the largest weight to small errors close to zero and the smallest weight to very large errors. Values of r less than 1 are generally non-metric, i.e. they violate at least one of the metric axioms. For example, r<0 violates the triangle inequality. For some problems this may make sense and the need for a metric error weighting may be unnecessary. These issues are not explored in this paper. 
will not be a maximum likelihood estimator of the weights [5]. The city-block case, r=1, in fact, arises if the underlying conditional probability distributions are Laplace 
[5]. More generally, r's less than two will tend to model non-gaussian distributions 
where the tails of the distributions are more pronounced than in the gaussian. Better 
estimators can be shown to exist for general noise reduction and have been studied in 
the area of robust estimation procedures [5] of which the Minkowski-r metric is only 
one possible case to consider. 
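To make this reweighting concrete, a small numeric illustration (the error values below are arbitrary, with the last one playing the role of an outlier):

errors = np.array([0.05, 0.10, 0.30, 0.90])   # arbitrary per-output errors; 0.90 is the "outlier"
for r in (1.0, 1.5, 2.0, 3.0):
    contrib = np.abs(errors) ** r / r          # per-output contribution to equation 6
    print(r, np.round(contrib / contrib.sum(), 3))

The relative share of the total error contributed by the large deviation grows as r increases, which is why values of r below two are the natural choice when heavy-tailed (non-gaussian) noise is expected.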
r<2. It is generally recommended that r=1.5 may be optimal for many noise 
reduction problems [6]. However, noise reduction may also be expected to vary with 
the problem and nature of the noise. One example we have looked at involves the 
recovery of an arbitrary 3-dimensional smooth surface, as shown in Figure 2a, after the addition of random noise. This surface was generated from a gaussian curve in two dimensions. Uniform random noise equal to the width (standard deviation) of the 
surface shape was added point-wise to the surface producing the noise plus surface 
shape shown in Figure 2b. 
Figure 2: Shape surface (2a), Shape plus noise surface (2b) and recovered Shape 
surface (2c) 
The shape in Figure 2a was used as target points for Minkowski-r back-propagation² 
and recovered with some distortion of the slope of the shape near the peak of the 
² All simulation runs, unless otherwise stated, used the same learning rate (.05) and smoothing value (.9) and a stopping criterion defined in terms of absolute mean deviation. The number of iterations needed to meet the stopping criterion varied considerably as r was changed (see below). 
surface (see Figure 2c). Next the noise plus shape surface was used as target points 
for the learning procedure with r=2. The shape shown in Figure 3a was recovered, 
however, with considerable distortion around the base and peak. The value of r was 
reduced to 1.5 (Figure 3b) and then finally to 1.2 (Figure 3c) before shape distortions 
were eliminated. Although the major properties of the shape of the surface were recovered, the scale seems distorted (however, it is easily restored by renormalization into the (0,1) range). 
Figure 3: Shape surface recovered with r=2 (3a), r=1.5 (3b) and r=1.2 (3c) 
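A hedged sketch of how the noisy target surface for this experiment might be generated; the grid resolution and the width of the gaussian bump are illustrative assumptions, while the noise amplitude, equal to the width of the surface, follows the description above.

def make_noisy_surface(n=20, sigma=0.15, seed=0):
    # A smooth gaussian bump over the unit square is the target surface (Figure 2a)
    rng = np.random.default_rng(seed)
    u = np.linspace(0.0, 1.0, n)
    X, Y = np.meshgrid(u, u)
    Z = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / (2.0 * sigma ** 2))
    # Uniform random noise with amplitude equal to the width of the bump (Figure 2b)
    Z_noisy = Z + rng.uniform(-sigma, sigma, size=Z.shape)
    return X, Y, Z, Z_noisy

Each (x, y) grid point then serves as an input pattern and the corresponding noisy height as its target; training with r near 1.2-1.5 rather than r=2 is what reduced the distortions reported above.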
r>2. Large r's tend to weight large deviations. When noise is not possible in 
the feature space (as in an arbitrary boolean problem) or where the token clusters are 
compact and isolated then simpler (in the sense of the number and placement of 
partition planes) generalization surfaces may be created with larger r values. For 
example, in the simple XOR problem, the main effect of increasing r is to pull the 
decision boundaries closer into the non-zero targets (compare high activation regions 
in Figure 4a and 4b). 
In this particular problem clearly such compression of the target regions does not 
constitute simpler decision surfaces. However, if more hidden units are used than are 
needed for pattern class separation, then increasing r during training will tend to 
reduce the number of cuts in the space to the minimum needed. This seems to be 
primarily due to the sensitivity of the hyper-plane placement in the feature space to 
the geometry of the targets. 
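For concreteness, a small sketch of an XOR run built on the train_step helper above; the number of hidden units, iteration count, learning rate and random initialization are illustrative assumptions, not the settings of the original simulations.

def run_xor(r, hidden=2, steps=50000, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    T = np.array([[0.], [1.], [1.], [0.]])
    w_hid = rng.uniform(-0.5, 0.5, (hidden, 2)); b_hid = np.zeros(hidden)
    w_out = rng.uniform(-0.5, 0.5, (1, hidden)); b_out = np.zeros(1)
    for step in range(steps):
        i = step % 4  # cycle through the four input-output pairs
        w_hid, b_hid, w_out, b_out = train_step(
            X[i], T[i], w_hid, b_hid, w_out, b_out, r, lr)
    return w_hid, b_hid, w_out, b_out

Plotting the trained network's output activation over the unit square for r=2 and for a larger r should show the qualitative effect described above, with the high-activation regions pulled in toward the non-zero targets.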
A more complex case illustrating the same idea comes from an example 
suggested by Minsky & Papert [7] called "the mesh". This type of pattern 
recognition problem is also, like XOR, a non-linearly separable problem. An optimal 
Figure 4: XOR solved with r=2 (4a) and r=4 (4b) 
solution involves only three cuts in feature space to separate the two "meshed" clusters (see Figure 5a). 
Figure 5: Mesh problem with minimum cut solution (5a) and Performance Surface (5b) 
Typical solutions for r=2 in this case tend to use a large number of hidden units to 
separate the two sets of exemplars (see Figure 5b for a performance surface). For 
example, in Figure 6a notice that a typical (based on several runs) Euclidian back- 
prop starting with 16 hidden units has found a solution involving five decision 
boundaries (lines shown in the plane also representing hidden units) while the r=3 
case used primarily three decision boundaries and placed a number of other 
boundaries redundantly near the center of the meshed region (see Figure 6b) where 
there is maximum uncertainty about the cluster identification. 
Figure 6: Mesh solved with r=2 (6a) and r=3 (6b) 
Speech Recognition. A final case in which large r's may be appropriate is data 
that has been previously processed with a transformation that produced compact 
regions requiring separation in the feature space. One example we have looked at 
involves spoken digit recognition. The first 10 cepstral coefficients of spoken digits 
("one" through "ten") were used for input to a network. In this case an advantage is 
shown for larger r's with smaller training set sizes. Shown in Figure 7 are transfer 
data for 50 spoken digits replicated in ten different runs per point (bars show standard 
error of the mean). Transfer shows a training set size effect for both r=2 and r=3, 
however for the larger r value at smaller training set sizes (10 and 20) note that 
transfer is enhanced. 
We speculate that this may be due to the larger r back-propagation creating discrimination 
regions that are better able to capture the compactness of the clusters inherent in a 
small number of training points. 
4. Convergence Properties 
It should be generally noted that as r increases, convergence time tends to grow 
roughly linearly (although this may be problem dependent). Consequently, 
decreasing r can significantly improve convergence, without much change to the 
nature of the solution. Further, if noise is present, decreasing r may reduce its effect 
dramatically. Note finally that the gradient for the Minkowski-r back-propagation is 
nonlinear and therefore more complex for implementing learning procedures. 
[Plot: transfer performance versus training set size for r=2 and r=3; 10 replications of 50 transfer points.] 
Figure 7: Digit Recognition Set Size Effect 
5. Summary and Conclusion 
A new procedure which is a variation on the Back-propagation algorithm is 
derived and simulated in a number of different problem domains. Noise in the target 
domain may be reduced by using power values less than 2 and the sensitivity of 
partition planes to the geometry of the problem may be increased with increasing 
power values. Other types of objective functions should be explored for their 
potential consequences on network resources and ensuing pattern recognition 
capabilities. 
References 
1. Rumelhart, D. E., Hinton, G. E., & Williams, R., Learning Internal Representations by Error Propagation, Nature, 1986. 
2. Burr, D. J. & Hanson, S. J., Knowledge Representation in Connectionist Networks, Bellcore Technical Report. 
3. White, H., Personal Communication, 1987. 
4. White, H., Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models, Unpublished Manuscript, 1987. 
5. Mosteller, F. & Tukey, J. W., Robust Estimation Procedures, Addison-Wesley, 1980. 
6. Tukey, J. W., Personal Communication, 1987. 
7. Minsky, M. & Papert, S., Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969. 
