From Regularization Operators 
to Support Vector Kernels 
Alexander J. Smola 
GMD FIRST 
Rudower Chaussee 5 
12489 Berlin, Germany 
smola@first.gmd.de 
Bernhard Schiilkopf 
Max-Planck-Institut far biologische Kybernetik 
Spemannstrage 38 
72076 Ttibingen, Germany 
bs@mpik~tueb.mpg.de 
Abstract 
We derive the correspondence between regularization operators used in 
Regularization Networks and Hilbert Schmidt Kernels appearing in Sup- 
port Vector Machines. More specifically, we prove that the Green's Func- 
tions associated with regularization operators are suitable Support Vector 
Kernels with equivalent regularization properties. As a by-product we 
show that a large number of Radial Basis Functions namely condition- 
ally positive definite functions may be used as Support Vector kernels. 
1 INTRODUCTION 
Support Vector (SV) Machines for pattern recognition, regression estimation and operator 
inversion exploit the idea of transforming into a high dimensional feature space where 
they perform a linear algorithm. Instead of evaluating this map explicitly, one uses Hilbert 
Schmidt Kernels k(x, y) which correspond to dot products of the mapped data in high 
dimensional space, i.e. 
k(x,y) = 
with I,  ' --> ' denoting the map into feature space. Mostly, this map and many of 
its properties are unknown. Even worse, so far no general rule was available which kernel 
should be used, or why mapping into a very high dimensional space often provides good 
results, seemingly defying the curse of dimensionality. We will show that each kernel 
k(x, y) corresponds to a regularization operator/5, the link being that k is the Green's 
function of/5,/5 (with/5* denoting the adjoint operator of/5). For the sake of simplicity 
we shall only discuss the case of regression -- our considerations, however, also hold true 
foi the other cases mentioned above. 
We start by briefly reviewing the concept of SV Machines (section 2) and of Regularization 
Networks (section 3). Section 4 contains the main result stating the equivalence of both 
344 A. J. Smola and B. Sch61kopf 
methods. In section 5, we show some applications of this finding to known SV machines. 
Section 6 introduces a new class of possible SV kernels, and, finally, section 7 concludes 
the paper with a discussion. 
2 SUPPORT VECTOR MACHINES 
The SV algorithm for regression estimation, as described in [Vapnik, 1995] and [Vapnik 
et al., i 997], exploits the idea of computing a linear function in high dimensional feature 
space )r (furnished with a dot product) and thereby computing a nonlinear function in the 
space of the input data 1 '. The functions take the form f(x) = (w. I,(x)) + b with 
I,: l ' --> Y'and w  Y'. 
In order to infer f from a training set ( (xi, Yi) I i = 1,..., g, xi C 1 ', yi C 1}, one tries 
to minimize the empirical risk functional -Re,,v[f] together with a complexity term IIll 2, 
thereby enforcing flatness in feature space, i.e. to minimize 
Area[f]- emp[f] + ll11 
:  E (f(xi),ti) - llll 
i=1 
(2) 
with c(f(xi),Yi) being the cost function determining how deviations of f(xi) from the 
target values yi should be penalized, and A being a regularization constant. As shown in 
[Vapnik, 1995] for the case of e-insensitive cost functions, 
c(f(x),y)= { f(x)-y,-e for ,f(x)-yl_>e 
otherwise ' (3) 
(2) can be minimized by solving a quadratic programming problem formulated in terms 
of dot products in f. It turns out that the solution can be expressed in terms of Support 
Vectors, w = -i=1 i(I(xi) , and therefore 
i=1 i----1 
(4) 
where k(xi, x) is a kernel function computing a dot product in feature space (a concept 
introduced by Aizerman et al. [1964]). The coefficients cti can be found by solving a 
quadratic programming problem (with tij '= k(Xi, Xj) and cti =/3i -/3?): 
minimize 
subject to 
(5) 
Note that (3) is not the only possible choice of cost functions resulting in a quadratic pro- 
gramming problem (in fact quadratic parts and infinities are admissible, too). For a detailed 
discussion see [Smola and Sch61kopf, 1998]. Also note that any continuous symmetric 
function k(x, y)  L2  L2 may be used as an admissible Hilbert-Schmidt kernel if it 
satisfies Mercer's condition 
/ / k(x, y)g(x)g(y)dxdy _> 0 for all g  L2(] n). (6) 
3 REGULARIZATION NETWORKS 
Here again we start with minimizing the empirical risk functional Remp[f] plus a regu- 
larization term 1115fll 2 defined by a regularization operator t 5 in the sense of Arsenin and 
From Regularization Operators to Support Vector Kernels 345 
Tikhonov [1977]. Similar to (2), we minimize 
1 
-re9[f] ---- e,,,p q- 1115fll 2 --   c(f(xi),yi) + 11Pfll 2. (7) 
i=1 
Using an expansion of f in terms of some symmetric function k(xi, xj) (note here, that k 
need not fulfil Mercer's condition), 
f(x) = y ik(x,, x) + b, (8) 
i 
and the cost function defined in (3), this leads to a quadratic programming problem similar 
to the one for SVs: by computing Wolfe's dual (for details of the calculations see [Smola 
and Sch61kopf, 1998]), and using 
Dij :- ((Pk)(xi, .). (Pk)(xj,.)) (9) 
((f  g) denotes the dot product of the functions f and g in Hilbert Space, i.e. 
f f(x)g(x)dx), we get G = D - K(/ -/*), with/3i,/37 being the solution of 
minimize 
subject to 
i , , 
 E (/i -- /3i)(/3; -- flj)(KD-1I()ij - E(/3i -/3i)Yi- (/3 +/3i)e 
i,j=l i-1 
i----1 
(o) 
Unfortunately this setting of the problem does not preserve sparsity in terms of the coef- 
ficients, as a potentially sparse decomposition in terms of/3i and/3 is spoiled by D-Xlf, 
which in general is not diagonal (the expansion (4) on the other hand does typically have 
many vanishing coefficients). 
4 THE EQUIVALENCE OF BOTH METHODS 
Comparing (5) with (10) leads to the question if and under which condition the two methods 
might be equivalent and therefore also under which conditions regularization networks 
might lead to sparse decompositions (i.e. only a few of the expansion coefficients in f 
would differ from zero). A sufficient condition is D =/( (thus KD-XK = K), i.e. 
k(xi, xj): ((P)(x, .). (Pk)(xj,.)) (1 ) 
Our goal now is twofold: 
 Given a regularization operator/5, find a kernel k such that a SV machine using k 
will not only enforce flatness in feature space, but also correspond to minimizing 
a regularized risk functional with/5 as regularization operator. 
 Given a Hilbert Schmidt kernel k, find a regularization operator 15 such that a SV 
machine using this kernel can be viewed as a Regularization Network using/5. 
These two problems can be solved by employing the concept of Green's functions as de- 
scribed in [Girosi et al., 1993]. These functions had been introduced in the context of 
solving differential equations. For our purpose, it is sufficient to know that the Green's 
functions Gx, (x) of/5*/5 satisfy 
(/5*PG,,)(x): 6x,(X). (12) 
Here, 6x (x) is the 6-distribution (not to be confused with the Kronecker symbol 6i) which 
has the property that (f  6x) -- f(xi). Moreover we require for all xi the projection of 
Gx (x) onto the null space of/5,/5 to be zero. The relationship between kernels and 
regularization operators is formalized in the following proposition. 
346 A. J. Smola and B. SchOlkopf 
'Proposition 1 
Let 16 be a regularization operator, and G be the Green's function of 16'16. Then G is a 
Hilbert Schmidt-Kernel such that D = K. SV machines using G minimize risk functional 
(7) with 13 as regularization operator. 
Proof: Substituting (12)into Gx (xi) = (Gx (.)' 8x (.)) yields 
axi(X,): ((16ax,)(.). Ox,(Xj), (13) 
hence G(xi, xj) := Gx,(XS) is symmetric and satisfies (11). Thus the SV optimization 
problem (5) is equivalent to the regularization network counterpart (10). Furthermore G is 
an admissible positive kernel, as it can be written as a dot product in Hilbert Space, namely 
G(xi,xj) = ((I)(xi). (I)(xj)) with (I) 'xi, > (16Gxi)(.). (14) 
In the following we will exploit this relationship in both ways: to compute Green's func- 
tions for a given regularization operator/3 and to infer the regularization operator from a 
given kernel k. 
5 TRANSLATION INVARIANT KERNELS 
Let us now more specifically consider regularization operators 16 that may be written as 
multiplications in Fourier space [Girosi et al., 1993] 
with ./(co) denoting the Fourier transform of f(x), and P(co) =/>(-co) real valued, non- 
negative and converging uniformly to 0 for Icol  o and f -- supp[P(co)]. Small values 
of/>(co) correspond to a strong attenuation of the corresponding frequencies. 
For regularization operators defined in Fourier Space by (15) it can be shown by exploiting 
P(co) = P(-co) = P(co) that 
G(xi,x) = I jf ei(x'-x)P(co)&v (16) 
(27r)n/2 
is a corresponding Green's function satisfying translational invariance, i.e. G(xi, xj) -- 
G(xi - x5), and (2(co) = P(co). For the proof, one only has to show that G satisfies (11). 
This provides us with an efficient tool for analyzing SV kernels and the types of capacity 
control they exhibit. 
Example 1 (Bq-splines) 
Vapnik et al. [1997]propose to use Bq-splines as building blocks for kernels, i.e. 
k(X) : H Bq(Xi) (17) 
i=1 
with x C I 7. For the sake of simplicity, we consider the case n -- 1. Recalling the 
definition 
Bq = q+l 1[-o.,o.] 
(18) 
( denotes the convolution and l x the indicator function on X), we can utilize the above 
result and the Fourier-Plancherel identity to construct the Fourier representation of the 
corresponding regularization operator. Up to a multiplicative constant, it equals 
P(co) = c(co) = sinc(q+)(). (19) 
From Regularization Operators to Support Vector Kernels 347 
This shows that only B-splines of odd order are admissible, as the even ones have negative' 
parts in the Fourier spectrum (which would result in an amplification of the corresponding 
frequency components). The zeros in [c stem from the fact that Bt has only compact support 
[-(k + 1)/2, (k + 1)/2]. By using this kernel we trade reduced computational complexity in 
calculating f (we only have to take points with Ilxi - Kill _< c from some limited neighbor- 
hood determined by c into. account)for a possibly worse performance of the regularization 
operator as it completely removes frequencies w v with [c(wv) = O. 
Example 2 (Dirichlet kernels) 
In [Vapnik et al., 1997], a class of kernels generating Fourier expansions was introduced, 
k(x) = sin(2N + 1)x/2 (20) 
sin x/2 
(As in example 1 we consider x C I 1 to avoid tedious notation.) By construction, this 
1 N 5(w -- i). A regularization operator with these 
kernel corresponds to P():  Yi=-N 
properties, however, may not be desirable as it only damps a fthire number of frequencies 
and leaves all other frequencies unchanged which can lead to overfitting (Fig. 1). 
-lO -5 o 5 '-o 15 -15 15 
Figure 1' Left: Interpolation with a Dirichlet Kernel of order N -- 10. One can clearly ob- 
serve the overfitting (dashed line: interpolation, solid line: original data points, connected 
by lines). Right: Interpolation of the same data with a Gaussian Kernel of width o -2 -- 1. 
Example 3 (Gaussian kernels) 
Following the exposition of Yuille and Grzywacz [1988] as described in [Girosi et al., 
1993], one can see that for 
o.2m 
with i02m = A m and 102m+1 = 7 A m, A being the Laplucian and 7 the Gradient opera- 
tor, we get Gaussians kernels 
k(x): exp ('{x"2) (22) 
2o.2  
Moreover, we can provide an equivalent representation of 15 in terms of its Fourier prop- 
erties, i.e. 
2 ) up to a multiplicative constant. Training a SV machine 
with Gaussian RBF kernels [Sch6lkopf et al., 1997] corresponds to minimizing the specific 
cost function with a regularization operatorof type (21). This also explains the good perfor- 
mance of SV machines in this case, as it is by no means obvious that choosing aftat function 
in high dimensional space will correspond to a simple function in low dimensional space, 
as showed in example 2. Gaussian kernels tend to yield good performance under general 
smoothness assumptions and should be considered especially if no additional knowledge 
of the data is available. 
348 A. J. Smola and B. Scholkopf 
6 A NEW CLASS OF SUPPORT VECTOR KERNELS 
We will follow the lines of Madych and Nelson [ 1990] as pointed out by Girosi et ai. [ 1993]. 
Our main statement is that conditionally positive definite functions (c.p.d.) generate admis- 
sible SV kernels. This is very useful as the property of being c.p.d. often is easier to verify 
than Mercer's condition, especially when combined with the results of Schoenberg and 
Micchelli on the connection between c.p.d. and completely monotonic functions [Schoen- 
berg, 1938, Micchelli, 1986]. Moreover c.p.d. functions lead to a class of SV kernels that 
do not necessarily satisfy Mercer's condition. 
Definition 1 (Conditionally positive definite functions) 
A continuous function h, defined on [0, cx>), is said to be conditionally positive definite 
(c.p.d.) of order rrt on I ' if for any' distinct points x,..., xe C 1 n and scalars ci , . . . , ct 
the quadratic form  ' 
Yi,j=i cicjh(llxi - xjll) is nonnegative provided that Yi: cip(xi) - 
O for all polynomials p on I  of degree lower than rrt. 
Proposition 2 (c.p.d. functions and admissible kernels) 
Define II. the space of polynomials of degree lower than rn on I '. Every c.p.d. function 
h of order rn generates an admissible Kernel for SV expansions on the space of functions 
f orthogonal to II, by setting k(xi, x.) :: h(llx i - gill2). 
Proof: In [Dyn, 1991] and [Madych and Nelson, 1990] it was shown that c.p.d. functions 
h getterate semi-norms II.11h by 
Ilf}l '- / dxidxh(llxi - xll)f(xi)f(x). (23) 
Provided that the projection of f onto the space of polynomials of degree lower than rrt is 
zero. For these functions, this, however, also defines a dot product in some feature space. 
Hence they can be used as SV kernels. 
Only c.p.d. functions of order m up to 2 are of practical interest for SV methods (for 
details see [Smola and Sch61kopf, 1998]). Consequently, we may use kernels like the ones 
proposed in [Girosi et al., 1993] as SV kernels: 
k(x, y) --  -llx-yll2 Gaussian, (rrt - 0); (24) 
(x, y) -- _v/llx _ yll2 + 2 multiquadric, (rrt - 1) (25) 
 inverse multiquadric, (rrt = 0) (26) 
k(x, y) = v/llx_yl12+c2 
k(x, y) - Ilx - yll 2 in Ilx - yll thin plate splines, (rrt -- 2) (27) 
7 DISCUSSION 
We have pointed out a connection between SV kernels and regularization operators. As one 
of the possible implications of this result, we hope that it will deepen our understanding of 
SV machines and of why they have been found to exhibit high generalization ability. In 
Sec. 5, we have given examples where only the translation into the regularization frame- 
work provided insight in why certain kernels are preferable to others. Capacity control is 
one of the strengths of SV machines; however, this does not mean that the structure of the 
learning machine, i.e. the choice of a suitable kernel for a given task, should be disregarded. 
On the contrary, the rather general class of admissible SV kernels should be seen as another 
strength, provided that we have a means of choosing the right kernel. The newly established 
link to regularization theory can thus be seen as a tool for constructing the structure con- 
sisting of sets of functions in which the SV machine (approximately) performs structural 
From Regularization Operators to Support Vector Kernels 349 
risk minimization (e.g. [Vapnik, 1995]). For a treatment of SV kernels in a Reproducing 
Kernel Hiibert Space context see [Girosi, 1997]. 
Finally one should leverage the theoretical results achieved for regularization operators 
for a better understanding of SVs (and vice versa). By doing so this theory might serve 
as a bridge for connecting two (so far) separate threads of machine learning. A trivial 
example for such a connection would be a Bayesian interpretation of SV machines. In 
this case the choice of a special kernel can be regarded as a prior on the hypothesis space 
with P[f] cr exp(-,Xllfl[2). m more subtle reasoning probably will be necessary for 
understanding the capacity bounds [Vapnik, 1995] from a Regularization Network point 
of view. Future work will include an analysis of the family of polynomial kernels, which 
perform very well in Pattern Classification [Sch61kopf et al., 1995]. 
Acknowledgements 
AS is supported by a grant of the DFG (# Ja 379/51). BS is supported by the Studienstiftung 
des deutschen Volkes. The authors thank Chris Burges, Federico Girosi, Leo van Hemmen, 
Klaus-Robert Mtiller and Vladimir Vapnik for helpful discussions and comments. 
References 
M. A. Aizerman, E. M. Braverman, and L. I. Rozono6r. Theoretical foundations of the po- 
tential function method in pattern recognition learning. Automation and Remote Control, 
25:821-837, 1964. 
N. Dyn. Interpolation and approximation by radial and related functions. In C.K. Chui, 
L.L. Schumaker, and D.J. Ward, editors, Approximation Theory, VI, pages 211-234. 
Academic Press, New York, 1991. 
F. Girosi. An equivalence between sparse approximation and support vector machines. A.I. 
Memo No. 1606, MIT, i 997. 
F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regular- 
ization to radial, tensor and additive splines. A.I. Memo No. 1430, MIT, 1993. 
W.R. Madych and S.A. Nelson. Multivariate interpolation and conditionally positive defi- 
nite functions. II. Mathematics of Computation, 54( 189):211-230, 1990. 
C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally posi- 
tive definite functions. Constructive Approximation, 2:11-22, 1986. 
I.J. Schoenberg. Metric spaces and completely monotone functions. Ann. of Math., 39: 
811-841, 1938. 
B. Sch61kopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. 
Fayyad and R. Uthurusamy, editors, Proc. KDD 1, Menlo Park, 1995. AAAI Press. 
B. Sch61kopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Con- 
paring support vector machines with gaussian kernels to radial basis function classifiers. 
IEEE Trans. Sign. Processing, 45:2758 -2765, 1997. 
A. J. Smola and B. Sch61kopf. On a kernel-based method for pattern recognition, re- 
gression, approximation and operator inversion. Algorithmica, 1998. see also GMD 
Technical Report 1997-1064, URL: http://svm.first.gmd.de/papers.html. 
V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995. 
V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, 
regression estimation, and signal processing. In NIPS 9, San Mateo, CA, 1997. 
A. Yuille and N. Grzywacz. The motion coherence theory. In Proceedings of the Interna- 
tional Conference on Computer Vision, pages 344-354, Washington, D.C., 1988. IEEE 
Coinputer Society Press. 
