Model Selection for Support Vector Machines 
Olivier Chapelle*,t, Vladimir Vapnik* 
* AT&T Research Labs, Red Bank, NJ 
t LIP6, Paris, France 
{ chapelle, vlad} @research. att. com 
Abstract 
New functionals for parameter (model) selection of Support Vector Ma- 
chines are introduced based on the concepts of the span of support vec- 
tors and rescaling of the feature space. It is shown that using these func- 
tionals, one can both predict the best choice of parameters of the model 
and the relative quality of performance for any value of parameter. 
1 Introduction 
Support Vector Machines (SVMs) implement the following idea: they map input vectors 
into a high dimensional feature space, where a maximal margin hyperplane is constructed 
[6]. It was shown that when training data are separable, the error rate for SVMs can be 
characterized by 
h: R2/M 2, (1) 
where/i is the radius of the smallest sphere containing the training data and M is the mar- 
gin (the distance between the hyperplane and the closest training vector in feature space). 
This functional estimates the VC dimension of hyperplanes separating data with a given 
margin M. 
To perform the mapping and to calculate/i and M in the SVM technique, one uses a 
positive definite kernel K(x, x') which specifies an inner product in feature space. An 
example of such a kernel is the Radial Basis Function (RBF), 
K(x, x') - e -IIx-x'112/22 
This kernel has a free parameter a and more generally, most kernels require some param- 
eters to be set. When treating noisy data with SVMs, another parameter, penalizing the 
training errors, also needs to be set. The problem of choosing the values of these parame- 
ters which minimize the expectation of test error is called the model selection problem. 
It was shown that the parameter of the kernel that minimizes functional (1) provides a good 
choice for the model: the minimum for this functional coincides with the minimum of the 
test error [ 1 ]. However, the shapes of these curves can be different. 
In this article we introduce refined functionals that not only specify the best choice of 
parameters (both the parameter of the kernel and the parameter penalizing training error), 
but also produce curves which better reflect the actual error rate. 
Model Selection for Support Vector Machines 231 
The paper is organized as follows. Section 2 describes the basics of SVMs, section 3 
introduces a new functional based on the concept of the span of support vectors, section 4 
considers the idea of rescaling data in feature space and section 5 discusses experiments of 
model selection with these functionals. 
2 Support Vector Learning 
We introduce some standard notation for SVMs; for a complete description, see [6]. Let 
(xi, yi)<i< be a set of training examples, xi 6 11 r which belong to a class labeled by 
yi 6 {-1, 1 }. The decision function given by a SVM is  
, (2) 
0 
where the coefficients a i are obtained by maximizing the following functional  
W(a) = E ai-3 E aiajYiYjK(xi,xj) 
i:1 i,j=l 
(3) 
under constraints 
Eaiyi = 0 andO _< ai _< C i = 1,...,. 
i=1 
(7 is a constant which controls the tradeoff between the complexity of the decision function 
and the number of training examples misclassified. SVM are linear maximal margin clas- 
sifiers in a high-dimensional feature space where the data are mapped through a non-linear 
function ,I,(x) such that ,I,(xi). ,I,(xj) - K(xi,xj). 
The points xi with ai > 0 are called support vectors. We distinguish between those with 
0 < ai < (7 and those with ai = (7. We call them respectively support vectors of the first 
and second category. 
3 Prediction using the span of support vectors 
The results introduced in this section are based on the leave-one-out cross-validation esti- 
mate. This procedure is usually used to estimate the probability of test error of a learning 
algorithm. 
3.1 The leave-one-out procedure 
The leave-one-out procedure consists of removing from the training data one element, con- 
structing the decision rule on the basis of the remaining training data and then testing the 
removed element. In this fashion one tests all  elements of the training data (using g dif- 
ferent decision rules). Let us denote the number of errors in the leave-one-out procedure 
by E(Xl, y, ..., x, y). It is known [6] that the the leave-one-out procedure gives an al- 
most unbiased estimate of the probability of test error: the expectation of test error for the 
machine trained on  - 1 examples is equal to the expectation of (x, y, ..., x, y). 
We now provide an analysis of the number of errors made by the leave-one-out procedure. 
For this purpose, we introduce a new concept, called the span of support vectors [7]. 
232 O. Chapelle and V. N. Vapnik 
3.2 Span of support vectors 
Since the results presented in this section do not depend on the feature space, we will 
consider without any loss of generality, linear SVMs, i.e. K(xi, x/) = xi  xj. 
Suppose that c  = (a, ..., a) is the solution of the optimization problem (3). 
For any fixed support vector xp we define the set Ap as constrained linear combinations of 
the support vectors of the first category (xi)ip  
Ap = )  )iXi, )i = 1, 0 _< ot i q-yiypOtp) i _< C . (4) 
{i4p/ o<,, <c) 
i=1, ip 
Note that hi can be less than 0. 
We also define the quantity $p, which we call the span of the support vector xp as the 
minimum distance between xp and this set (see figure 1) 
$p2 = d2(xp,Ap) = min (x - x) 2. 
xA, 
(5) 
Figure 1' Three support vectors with al = a2 = a3/2. The set Ai is the semi-opened 
dashed line. 
It was shown in [7] that the set A v is not empty and that S v = d(xp, Ap) _< Dsv, where 
Dsv is the diameter of the smallest sphere containing the support vectors. 
Intuitively, the smaller $ -- d(xp, A) is, the less likely the leave-one-out procedure is to 
make an error on the vector xp. Formally, the following theorem holds  
Theorem 1 [7] If in the leave-one-out procedure a support vector xp corresponding to 
0 < a < C is recognized incorrectly, then the following inequality holds 
o> 1 
ap _ $p max(D, 1/x/-)' 
This theorem implies that in the separable case (C = ), the number of errors 
made by the leave-one-out procedure is bounded as follows  (x,y,...,xe,ye) <_ 
maxv$pD = maxv$vD/M 2, because av  = 1/M 2 [6]. This is already an 
Ep Op 
improvement compared to functional (1), since $p _< Dsv. But depending on the geome- 
try of the support vectors the value of the span $p can be much less than the diameter Dsv 
of the support vectors and can even be equal to zero. 
We can go further under the assumption that the set of support vectors does not change 
during the leave-one-out procedure, which leads us to the following theorem  
Model Selection for Support Vector Machines 233 
Theorem 2 If the sets of support vectors of first and second categories remain the same 
during the leave-one-out procedure, then for any support vector Xp, the following equality 
holds: 
yp[f(xp) fp(xp)] o 2 
where fo and fP are the decision function (2) given by the SVM trained respectively on the 
whole training set and after the point xp has been removed. 
The proof of the theorem follows the one of Theorem 1 in [7]. 
The assumption that the set of support vectors does not change during the leave-one-out 
procedure is obviously not satisfied in most cases. Nevertheless, the proportion of points 
which violate this assumption is usually small compared to the number of support vec- 
tors. In this case, Theorem 2 provides a good approximation of the result of the leave-one 
procedure, as pointed out by the experiments (see Section 5.1, figure 2). 
As already noticed in [ 1], the larger ap is, the more "important" in the decision function the 
support vector xp is. Thus, it is not surprising that removing a point xp causes a change in 
the decision function proportional to its Lagrange multiplier ap. The same kind of result as 
Theorem 2 has also been derived in [2], where for SVMs without threshold, the following 
inequality has been derived' yp(f(xp) - fP(xp)) _< avK(xp,xp). The span $p takes 
into account the geometry of the support vectors in order to get a precise notion of how 
"important" is a given point. 
The previous theorem enables us to compute the number of errors made by the leave-one- 
out procedure: 
Corollary 1 Under the assumption of Theorem 2, the test error prediction given by the 
leave-one-out procedure is 
l(x Yl, x,ye) lCard{p/ o 2 
= ap$; > ypf(xp)} (6) 
t ----   ... __ 
Note that points which are not support vectors are correctly classified by the leave-one-out 
procedure. Therefore tt defines the number of errors of the leave-one-out procedure on the 
entire training set. 
Under the assumption in Theorem 2, the box constraints in the definition of Ap (4) can 
be removed. Moreover, if we consider only hyperplanes passing through the origin, the 
constraint  hi = 1 can also be removed. Therefore, under those assumptions, the com- 
putation of the span $p is an unconstrained minimization of a quadratic form and can be 
done analytically. For support vectors of the first category, this leads to the closed form 
2 --1 
S -- 1/(Ksv)pp, where Ksv is the matrix of dot products between support vectors of 
the first category. A similar result has also been obtained in [3]. 
In Section 5, we use the span-rule (6) for model selection in both separable and non- 
separable cases. 
4 Rescaling 
As we already mentioned, functional (1) bounds the VC dimension of a linear margin clas- 
sifier. This bound is tight when the data almost "fills" the surface of the sphere enclosing 
the training data, but when the data lie on a flat ellipsoid, this bound is poor since the radius 
of the sphere takes into account only the components with the largest deviations. The idea 
we present here is to make a rescaling of our data in feature space such that the radius of the 
sphere stays constant but the margin increases, and then apply this bound to our rescaled 
data and hyperplane. 
234 O. Chapelle and V. N. Vapnik 
Let us first consider linear SVMs, i.e. without any mapping in a high dimensional space. 
The rescaling can be achieved by computing the covariance matrix of our data and rescaling 
according to its eigenvalues. Suppose our data are centered and let (qol,..., qo,) be the 
normalized eigenvectors of the covariance matrix of our data. We can then compute the 
smallest enclosing box containing our data, centered at the origin and whose edges are 
parallels to (qol,..., qo,). This box is an approximation of the smallest enclosing ellipsoid. 
The length of the edge in the direction qo; is/; = max/Ixi  qo; I. The rescaling consists 
of the following diagonal transformation: 
D: x > Dx = 
k 
Let us consider i = D-1xi and  = Dw. The decision function is not changed under 
this transformation since   i - w  xi and the data i fill a box of side length 1. Thus, 
in functional (1), we replace/{2 by 1 and 1/M 2 by 2. Since we rescaled our data in a 
box, we actually estimated the radius of the enclosing ball using the oo-norm instead of 
the classical 2-norm. Further theoretical works needs to be done to justify this change of 
norm. 
In the non-linear case, note that even if we map our data in a high dimensional feature space, 
they lie in the linear subspace spanned by these data. Thus, if the number of training data  
is not too large, we can work in this subspace of dimension at most . For this purpose, one 
can use the tools of kernel PCA [5]: if A is the matrix of normalized eigenvectors of the 
Gram matrix Kij = K(xi, xj) and (,i) the eigenvalues, the dot product xi. qo is replaced 
by x/A and w. qo becomes x/i AikYiOti' ThUS, we can still achieve the diagonal 
transformation A and finally functional (1) becomes 
E A miax Ak (E AikYa)2' 
k i 
5 Experiments 
To check these new methods, we performed two series of experiments. One concerns the 
choice of a, the width of the RBF kernel, on a linearly separable database, the postal 
database. This dataset consists of 7291 handwritten digit of size 16x16 with a test set 
of 2007 examples. Following [4], we split the training set in 23 subsets of 317 training 
examples. Our task consists of separating digit 0 to 4 from 5 to 9. Error bars in figures 2a 
and 3 are standard deviations over the 23 trials. In another experiment, we try to choose 
the optimal value of C in a noisy database, the breast-cancer database 1. The dataset has 
been split randomly 100 times into a training set containing 200 examples and a test set 
containing 77 examples. 
Section 5.1 describes experiments of model selection using the span-rule (6), both in the 
separable case and in the non-separable one, while Section 5.2 shows VC bounds for model 
selection in the separable case both with and without rescaling. 
5.1 Model selection using the span-rule 
In this section, we use the prediction of test error derived from the span-rule (6) for model 
selection. Figure 2a shows the test error and the prediction given by the span for differ- 
ent values of the width cr of the RBF kernel on the postal database. Figure 2b plots the 
same functions for different values of C on the breast-cancer database. We can see that 
the method predicts the correct value of the minimum. Moreover, the prediction is very 
accurate and the curves are almost identical. 
Available from http: //horn. first. Imd. de/,-raetsch/data/breast-cancer 
Model Selection for Support Vector Machines 235 
40 
35 
30 
25 
2c 
I.u 
15 
1(2 
- - -' Test err)r 
T -- Span pred ct on 
',, 
Log sigma 
(a) choice of cr in the postal database 
36 
34 
32 
30 
28 
I.u 
26 
24 
2O 
TTT. 
I  ' 
Log C 
'--- Tst erro) j 
-- Span pred ction  
+ 
10 12 
(b) choice of C' in the breast-cancer database 
Figure 2: Test error and its prediction using the span-rule (6). 
The computation of the span-rule (6) involves computing the span $ (5) for every support 
vector. Note, however, that we are interested in the inequality $v 2 < /vf(xv)/a, rather 
than the exact value of the span $v. Thus, while minimizing Sv = d(xv, Av), if we find a 
point x*  Av such that d(xv,x*) 2 <_ /vf(xv)/ct, we can stop the minimization because 
this point will be correctly classified by the leave-one-out procedure. 
It turned out in the experiments that the time required to compute the span was not pro- 
hibitive, since it is was about the same than the training time. 
There is a noteworthy extension in the application of the span concept. If we denote by 
0 one hyperparameter of the kernel and if the derivative OK(x,,x) is computable, then it 
o0 
is possible to compute analytically o E a$-I(x) which is the derivative of an upper 
00 ' 
bound of the number of errors made by the leave-one-out procedure (see Theorem 2). This 
provides us a more powerful technique in model selection. Indeed, our initial approach 
was to choose the value of the width cr of the RBF kernel according to the minimum of 
the span-rule. In our case, there was only hyperparamter so it was possible to try different 
values of a. But, if we have several hyperparameters, for example one cr per component, 
- E (" 
K(x, x') = e ' , it is not possible to do an exhaustive search on all the possible 
values of of the hyperparameters. Nevertheless, the previous remark enables us to find their 
optimal value by a classical gradient descent approach. 
Preliminary results seem to show that using this approach with the previously mentioned 
kernel improve the test error significantely. 
5.2 VC dimension with rescaling 
In this section, we perform model selection on the postal database using functional (1) and 
its rescaled version. Figure 3a shows the values of the classical bound 2/M2 for different 
values of a. This bound predicts the correct value for the minimum, but does not reflect the 
actual test error. This is easily understandable since for large values of or, the data in input 
space tend to be mapped in a very flat ellipsoid in feature space, a fact which is not taken 
into account [4]. Figure 3b shows that by performing a rescaling of our data, we manage 
to have a much tighter bound and this curve reflects the actual test error, given in figure 2a. 
236 O. Chapelle and V. N. Vapnik 
18000 
16000 
14000 
12000 
10000 
.ooo I 
6000[ 
4000[ 
[ ' VC Dirensi_[ 
I 
- -2 0 2 4 
Log sigma 
(a) without rescaling 
120 
100 
80 
._ 
7o 60 
40 
20 
 '1 V( Dimension with rscalincj 
01 
Log sigma 
(b) with rescaling 
Figure 3: Bound on the VC dimension for different values of or on the postal database. The 
shape of the curve with rescaling is very similar to the test error on figure 2. 
6 Conclusion 
In this paper, we introduced two new techniques of model selection for SVMs. One is based 
on the span, the other is based on rescaling of the data in feature space. We demonstrated 
that using these techniques, one can both predict optimal values for the parameters of the 
model and evaluate relative performances for different values of the parameters. These 
functionals can also lead to new learning techniques as they establish that generalization 
ability is not only due to margin. 
Acknowledgments 
The authors would like to thank Jason Weston and Patrick Haffner for helpfull discussions 
and comments. 
References 
[l] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and 
Knowledge Discovery, 2(2): 121-167, 1998. 
[2] T S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the 
1999 Conference on AI and Statistics, 1999. 
[3] M. Opper and O. Winther. Gaussian process classification and SVM: Mean field results and 
leave-one-out estimator. In Advances in Large Margin Classifiers. MIT Press, 1999. to appear. 
[4] B. SchOlkopf, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Kernel-dependent Support 
Vector error bounds. In Ninth International Conference on Artificial Neural Networks, pp. 304 - 
309 
[5] B. SchOlkopf, A. Smola, and K.-R. MQller. Kernel principal component analysis. In Artifi- 
cial Neural Networks -- ICANN'97, pages 583 - 588, Berlin, 1997. Springer Lecture Notes in 
Computer Science, Vol. 1327. 
[6] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. 
[7] V. Vapnik and O. Chapelle. Bounds on error expectation for SVM. Neural Computation, 1999. 
Submitted. 
