Extensions of a Theory of Networks for 
Approximation and Learning: Outliers and 
Negative Examples 
Federico Girosi 
AI Lab. M.I.T. 
Cainbridge, MA 02139 
Tolnaso Poggio 
A1 Lab. M.I.T. 
Cainbridge, MA 02139 
Abstract 
Bruno Caprile 
I.R.S.T. 
Povo, Italy, 38050 
Learning an input-output mapping fi'om a set of examples can be regarded 
as synthesizing an approxilnation of a lnulti-dilnensional fimction. Front 
this point of view, this form of learning is closely related to regula. rization 
theory, and we have previously shown (Poggio and Girosi, 1990a, 1990b) 
the equivalence between regularization and a. class of three-layer networks 
that we call regularization networks. In this note, we extend the theory 
by introducing ways of dealing with two aspects of learning: learning in 
presence of unreliable examples or outliers, and learning fi'om positive and 
negative examples. 
1 Introduction 
In previous papers (Poggio and Girosi, 1990a, 1990b) we have shown the equivalence 
between certain regularization techniques and a class of three-layer networks - that 
we call regularization networks - which are related to the Radial Basis Functions 
interpolation method (Powell, 1987). In this note we indicate how it is possible 
to extend our theory of learning in order to deal with 1) occurence of unreliable 
examples, 2) negative examples. Both problems are also interesting from the point 
of view of classical approximation theory: 
1. discounting "bad" examples corresponds to disca.rding, in the approximation 
of a function, data points that are outliers. 
2. learning by using negative examples - in addition to positive ones - corresponds 
to approximating a function considering not only points which the filnction 
750 
Extensions of a Theory of Networks for Approximation and Learning 751 
ought to be close to, but also points - or regions - that the function must 
avoid. 
2 Unreliable data 
Suppose that a set g = {(xi, yi) G R" x R}N=x of data has been obtained by randomly 
sampling a function f, defined in R '*, in presence of noise, in a way that we can 
write 
Yi -- f(xi)+ei, i= 1,...,N 
where ei are independent random variables. We are interested in recovering an 
estimate of the function f from the set of data g. Taking a. probabilistic approach, 
we can regard the function f a.s the realization of a ra.ndom field with specified 
prior probability distribution. Consequently, the data g and the function f are lion 
independent random variables, and, by using Bayes rule, it is possible to express 
the conditional probability 7'[fig] of the finction f, given the exalnples g, in terlns 
of the prior probability of f, 7'[f], and the conditiolml probability of g given f, 
7'[glf]: 
7'[fig] c 7'[glf] 7'If]. (1) 
A common choice (Marroquin ct al., 1987) ibr the prior probability distribution 
7'If] is 
7'[f] o<, -xll.tlt 
(2) 
where P is a differential operator (the so called stabilizer), II '[I is the L 2 norrn, and A 
is a positive real number. This form of probability distribution assignes significant 
probability only to those functions for which the terln liPf I] 2 is "small", that is to 
functions that do not vary too "quickly" in their dolnain. 
If the noise is Gaussian, the probability 7'[glf] can be written as: 
2N N 
7'[glf] = II 
i=1 
(3) 
where fii =  
k-,, ' and ai is the variance of the noise related to the i-th data point. 
The values of the variances are usually assumed to be equal to some known value 
a, that reflects the accuracy of the measurement apparatus. However, in lnany 
cases we do not have access to such an inibrmation, and weaker assumptions have 
to be made. A fairly natural and general one consists in regarding the variances of 
the noise, as well as the function f, as randoln variables. Of course, some a priori 
knowledge about these variables, represented by an appropriate prior probability 
distribution, is needed. Let us denote by/ the set of random variables {/i}N=l. By 
752 Girosi, Poggio, and Caprile 
means of Bayes rule we can compute the joint probability of the variables f and/. 
Assuming that the field f and the set/3 are conditionally independent we obtain: 
72[f,/31(J] o< 72[1f,/3] 72[f] 72[] 
(4) 
where 72[/3] is the prior probability of the set of variances /3 and 72tall,/3] is the 
same as n eq. (3). Given the posterior probability (4) we are mainly interested in 
computing an estimate of f. Thus what we really need to compute is the marginal 
posterior probability of f, Pm[f], that is obtained integrating equation (4) over the 
variables 
Pm[f] ,'x ]i 72[f,,1,.] H d/3 (5) 
i=1 
A simple way to obtain an estilnate of the function f fi'om the probability distribu- 
tion (5) consists in computing the so called MAP (Maxilnum A PosterJori) estilnate, 
that is the function that maxinfizes the posterior probability Pm[f]. The problem 
of recovering the function f from the set of data g, with partial ilfformation about 
the amount of Gaussian noise affecting the data, is therefore equivalent to solving 
an appropriate variational problem. The specific fol'm of the functional that has to 
be maximized - or minimized - depends on the probability distributions 72[f] and 
72[/3]. 
Here we consider the following situation: we have knowledge that a given percentage, 
(1 - e) of the data is characterized by a Gaussian uoise distribution of variance 
1 ---- (2/31) -, whereas for the rest of the data the variance of the noise is a very 
large number rr2 = (2/32)- (we will call these data. "outliel's"). This situation 
yields the following probability distributiou: 
N 
72[/3] = rl[( _ ,),(/3 _/3,) +, ,(/3 _ &)]. 
i=l 
In this case, choosing T'[f] as in eq. (2), we ca,, show that P,,.[f] cre -mDq 
(6) 
where 
N 
H[I]=  v(e.) + ,XllPfll; 
Here V represents the effective potential 
(7) 
V(x)-- 1 x2-- In ( 
depicted in fig. (1) for different values of fl.,. 
Extensions of a Theory of Networks for Approximation and Learning 753 
v(x) 
6.50 1 
6.OO 
3,.50 
3.00 -- 
2.50 
1.50 
1.OO 
0.50, .. 
X 
-4.OO -2.00 0.00 ZOO 4.00 
Figure 1: The effective potenl;ial V(x) lb,' e = 0.1, 1 '-- 3.0 and three different. 
values of/32 0.1, 0.03, 0.001 
The MAP estimate is therefore obtained minimizing the fu,ctional (7). The first 
term enforces closeness to the data., while the second term enforces smoothness of 
the solution, the trade off between these two opposite tendencies being controlled 
by the parameter ,X. Looking a.t fig. (1) we notice that, in the limit of/Y2 -* 
0, the effective potential V is quadratic if the absolute value of its argument is 
smaller than a threshold, and constant otherwise (fig. 1). Therefore, data points 
are taken in account when the interpolatio, error is smaller than a threshold, and 
their contribution neglected otherwise. 
If/3 = /32 = /0, that is if the distribution of the variables/3i is a delta function 
centered on some value/, the effective potential V(x) = /x 2 is oblained. There- 
fore, this method becomes equivalent to the so called "regularization technique" 
(Tikhonov and Arsenin, 1977) tha.t has been extensively used to solve ill-posed 
problems, of which the one we have jnst. outlined is a particular example (Poggio 
and Girosi, 1990a, 1990b). Suitable choices of distribution/9[/3] result in other ef- 
fective potentials (for exmnple the potential V(x) = x/cx2 + x 2 can be obtained), 
and the corresponding estimators turn out to be similar to the well known robust 
smoothing splines (Eubank, 1988). 
The functional (7), with the choice expressed by eq. (2), admits a simple physical in- 
terpretation. Let us consider for simplicity a function defined on a one-dimensional 
lattice. The value of the function f(xi) at site i is regarded as the position of a 
particle that can move only in the vertical direction. The particle is attracted - 
according to a spring-like potential V - towards the data point and the neighboring 
754 Girosi, Poggio, and Caprile 
particles as well. The natural trend of tile systeln will be to minimize its total 
energy which, in this scheme, is expressed by the functional (7): the first term is 
associated to the springs connecting the particle to the data point, and the secolid 
one, being associated to the the springs connecting neighboring particles, enforces 
the smoothness of the final configuration. Notice that the potential energy of the 
springs connecting the particle to the data point is not quadratic, as for the "stan- 
dard" springs, resulting this in a. non-linear relationship between the force and the 
elongation. The potential energy becomes constant when the elongation is larger 
than a fixed threshold, and the force (which is proportional to the first derivative of 
the potential energy) goes to zero. In this sense we can say that the springs "break" 
when we try to stretch them too much (Geiger and Girosi, 1990). 
3 Negative examples 
In many situations, further informa.tion about a function may consist in knowing 
that its value at some given point has to be far from a given value (which, in this 
context, can be considered as a "negative example"). We shall account for the 
presence of negative examples by adding to the fillCtional (7) a quadratic repulsive 
term for each negative example (for a related trick, see Kass et al., 1987). How- 
ever, the introduction of such a "repulsive spring" may make the filnctional (7) 
unbounded from below, because tile repulsive terlns tend to push the value of the 
function up to infinity. The silnplest way to prevent this occurency is either to 
allow the spring constant to decrease with the increasilg elongation, or, in the ex- 
treme case, to break at some point. Hence, we can use the same model of nonlinear 
spring of the previous section, and just reverse the sign of the associated poten- 
tial. If {(t,y) e R" x /}//x_ 1 is the set of negative examples, and if we define 
A = ya - f(ta) the functional (7) becolnes: 
N K 
H[I]-- V(/X,)- XllPfll 2 
i=1 c=l 
4 Solution of the variational problem 
An exhaustive discussion of the solution of the variational probleln associated to 
the functional (7) cannot be given here. We reibl' the reader to the papers of Poggio 
and Girosi (1990a, 1990b) and Girosi, Poggio and Caprile (1990), and just sketch 
the form of the solution. In both cases of unreliable and negative data., it can be 
shown that the solution of the variatiolal problem always has the form 
N k 
i=1 i=1 
(9) 
where G is the Green's function of the operator/3p (/3 denoting the adjoint oper- 
ator of P), and {i(X)}/k=l is a, basis of functions for the null space of P (usually 
polynomials oflow degree) and a.nd 
{(i}i=l are coefficients to be COlnputed. 
Extensions of a Theory of Networks for Approximation and Learning 755 
Substituting the expansion (9)in the functional (7), the function H*(e, c) = H[f*] 
is defined. The vectors c and c can then be found by minimizing the function 
We shall finally notice that the solution (9) has a silnple interpretation in Letins 
of feedforward networks with one layer of hidden units, of the same class of the 
regularization networks introduced in previous papers (Poggio and Girosi, 1990a, 
1990b). The only difference between these networks and the regularization networks 
previously introduced consists in the function that has to be minimized in order to 
find the weights of the network. 
5 Experimental Results 
In this section we report two examples of the application of these techniques to very 
simple one-dimensional problems. 
5.1 Unreliable data 
The data set consisted of seven examples, randomly taken, within the interval 
[-1,1], from the graph of f(x) = cos(x). In order to create an outlier in the 
data set, the value of the fourth point has been substituted with the value 1.5. The 
Green's function of the problem was a. Gaussian of variance r = 0.3, the partmeter 
e was set to 0.1, the value of the regularization parameter ,X was 10 -", and the 
ptrameters/31 and/32 were set respectively to 10.0 and 0.003. With this choice of 
the parameters the effective potential was approximately constant for values of its 
argument larger than 1. In figure (2a) we show the result that is obtained after 
only 10 iterations of gradient descent: the spring of the outlier breaks, and it does 
not influence the solution any more. The "hole" that the solution shows nearby the 
outlier is a combined effect of the fact that the variance of the Gaussian Green's 
function is small ( = 0.3), and of the lack of data next to the outlier itself. 
5.2 Negative examples 
Again data to be approximated ctme from a. random sampling of the function 
f(x) = cos(x), in the interval I-l, l]. The fourth data point was selected as the 
negative example, and the parameters were set in a way that its spring would break 
when the elongation exceeded the value 1. In figure (2b) we show a result obtained 
with 500 iterations of a stochastic gradient descent algorithm, with a Gaussian 
Green's function of variance rr = 0.4. 
Acknowledgements We thank Cesare Furlanello for useful discussions and for a crit- 
ical reading of the manuscript. 
756 Girosi, Poggio, and Caprile 
Figure 2: (a) Approximation ill presence of all outlier (tile data point whose value 
is 1.,5). (b) Approximation in presence of a negative exalnple. 
References 
[1] R.L. Eubank. Spline Smoothing and Nonparametric Rcgressiot, volulne 90 of 
Statistics: Teztbooks and Monographs. Marcel Dekker, Inc., New York, 1988. 
[2] D. Geiger and F. Girosi. Parallel and deterlninistic algorithms for MRFs: sur- 
face reconstruction and integration. In O. Faugeras, editor, Lecture Notes in 
Computer Science, Vol. 27: Computer Vision - ECCV 90. Springer-Verlag, 
Berlin, 1990. 
[3] F. Girosi, T. Poggio, and B. Caprile. Extensions of a theory of networks for 
approximation and learning: outliers and negative examples. A.I. Memo 1220, 
Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1990. 
[4] M. Ks, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. In 
Proceedings of the First International Confercncc on Computer Vision, LolldOll, 
1987. IEEE Computer Society Press, Washington, D.C. 
[5] J. L. Marroquin, S. Mitter, and T. Poggio. Probabilistic solution of ill-posed 
problems in computational vision. J. Amer. Sial. Asoc., 82:76-89, 1987. 
[6] T. Poggio and F. Girosi. A theory of networks for learning. Science, 247:978-982, 
1990a. 
[7] T. Poggio and F. Girosi. Networks for approxima.ion and learning. Proceedings 
or,he IEEE, 78(9), September 1990b. 
[8] M. J. D. Powell. Radial basis timer. ions for multivariable interpolation: a. re- 
view. In J. C. Mon and M. G. Cox, editors, Algorithms for Approimal'on. 
Clarendon Press, Oxford, 1987. 
[9] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. W. II. 
Winston, Whingmn, D.C., 1977. 
