A Geometric Interpretation of v-SVM 
Classifiers 
David J. Crisp 
Centre for Sensor Signal and 
Information Processing, 
Deptartment of Electrical Engineering, 
University of Adelaide, South Australia 
dcrisp @eleceng. adelaide. edu. au 
Christopher J.C. Burges 
Advanced Technologies, 
Bell Laboratories, 
Lucent Technologies 
Holmdel, New Jersey 
burges@lucent. com 
Abstract 
We show that the recently proposed variant of the Support Vector 
machine (SVM) algorithm, known as y-SVM, can be interpreted 
as a maximal separation between subsets of the convex hulls of the 
data, which we call soft convex hulls. The soft convex hulls are 
controlled by choice of the parameter y. If the intersection of the 
convex hulls is empty, the hyperplane is positioned halfway between 
them such that the distance between convex hulls, measured along 
the normal, is maximized; and if it is not, the hyperplane's normal 
is similarly determined by the soft convex hulls, but its position 
(perpendicular distance from the origin) is adjusted to minimize 
the error sum. The proposed geometric interpretation of y-SVM 
also leads to necessary and sufficient conditions for the existence of 
a choice of y for which the y-SVM solution is nontrivial. 
I Introduction 
Recently, Sch51kopf et al. [1] introduced a new class of SVM algorithms, called 
y-SVM, for both regression estimation and pattern recognition. The basic idea is to 
remove the user-chosen error penalty factor C that appears in SVM algorithms by 
introducing a new variable p which, in the pattern recognition case, adds another 
degree of freedom to the margin. For a given normal to the separating hyperplane, 
the size of the margin increases linearly with p. It turns out that by adding p to 
the primal objective function with coefficient -y, y _ 0, the variable C can be 
absorbed, and the behaviour of the resulting SVM - the number of margin errors 
and number of support vectors - can to some extent be controlled by setting y. 
Moreover, the decision function produced by y-SVM can also be produced by the 
original SVM algorithm with a suitable choice of C. 
In this paper we show that y-SVM, for the pattern recognition case, has a clear 
geometric interpretation, which also leads to necessary and sufficient conditions for 
the existence of a nontrivial solution to the y-SVM problem. All our considerations 
apply to feature space, after the mapping of the data induced by some kernel. We 
adopt the usual notation: w is the normal to the separating hyperplane, the mapped 
Geometric Interpretation of u-SVM Classifiers 24.5 
data is denoted by xi 6 N, i = 1,.-., l, with corresponding labels yi  {+l}, b, p 
are scalars, and Ei, i = 1,-- -, l are positive scalar slack variables. 
2 v-SVM Classifiers 
The v-SVM formulation, as given in [1], is as follows: minimize 
F' 1 1 , 
- 11w'11  - p' + ?  E, 
with respect to w', b', p', E'i, subject to: 
(1) 
y(w'.x + b') _ p' - El, El _ 0, p' _ 0. (.) 
Here v is a user-chosen parameter between 0 and 1. The decision function (whose 
sign determines the label given to a test point x) is then: 
/'(x) = o'.  + b'. (s) 
1 
The Wolfe dual of this problem is: maximize F b = - ]ij aiajyiyxi  x5 subject 
to 
O_oti _ 1, _otiYi=O, Eoti_Y 
i i 
(4) 
with w' given by w' = '-i aiyixi. Sch51kopf et al. [1] show that v is an upper 
bound on the fraction of margin errors x, a lower bound on the fraction of support 
vectors, and that both of these quantities approach v asymptotically. 
Note that the point w' = b' = p = E = 0 is feasible, and that at this point, F' = 0. 
Thus any solution of interest must have F' _ 0. Furthermore, if vp' = 0, the 
optimal solution is at w' = b' = p = E = 02. Thus we can assume that vp' > 0 (and 
therefore u > 0) always. Given this, the constraint p' _ 0 is in fact redundant: a 
negative value of p' cannot appear in a solution (to the problem with this constraint 
removed) since the above (feasible) solution (with p' = 0) gives a lower value for 
F'. Thus below we replace the constraints (2) by 
y(w'.x + ') _ p' - El, El _ o. 
(5) 
2.1 A Reparameterization of v-SVM 
We reparameterize the primal problem by dividing the objective function F' by 
v2/2, the constraints (5) by , and by making the following substitutions: 
2 w' ' p' El 
,= , = -, o=-, p=-, E =-. (6) 
XA margin error xi is defined to be any point for which i > 0 (see [1]). 
Sin fact we can prove that, even if the optimal solution is not unique, the global 
solutions still all have w - 0: see Burges and Crisp, "Uniqueness of the SVM Solution" in 
this volume. 
246 D. J. Crisp and C. J. C. Burges 
This gives the equivalent formulation: minimize 
F = Ilwll 2 - 2p +   e, 
i 
with respect to w, b, p, i, subject to: 
(7) 
y(w.x + b) _> p- ,  ) 0. (8) 
If we use as decision function f(x) -- f(X)/l, the formulation is exactly equivalent, 
although both primal and dual appear different. The dual problem is now: minimize 
1 
FD =  Z aiajyiyjxi . xj (9) 
i,j 
with respect to the ai, subject to: 
Z ciYi = O' Z ai = 2' O _< ai <_ l a (10) 
i i 
with w given by w =  Y]-i iYiXi. In the following, we will refer to the reparam- 
eterized version of ,-SVM given above as p-SVM, although we emphasize that it 
describes the same problem. 
3 A Geometric Interpretation of .SVM 
In the separable case, it is clear that the optimal separating hyperplane is just that 
hyperplane which bisects the shortest vector joining the convex hulls of the positive 
and negative polarity points a. We now show that this geometric interpretation can 
be extended to the case of ,-SVM for both separable and nonseparable cases. 
3.1 The Separable Case 
We start by giving the analysis for the separable case. The convex hulls of the two 
classes are 
and 
H+ = {  aixi 
i: yl --'- I 
i:yi=--I 
Z ai=l, ai_>0} (11) 
i:yi=+l 
Z ai=l, ai>_0}. (12) 
i:yl =-- 1 
Finding the two closest points can be written as the following optimization problem: 
m2n E tixi-- Z li3i (13) 
i: Yl -----]- 1 i: Yl ---- -- 1 
3See, for example, K. Bennett, 1997, in http://www.rpi.edu/Dennek/svmtalk.ps (also, 
to appear). 
A Geometric Interpretation ofu-SVM Classifiers 247 
subject to: 
 ai=l, ' ai=l, a>_O (14) 
i:yi=+l i:yl------1 
Taking the decision boundary f(x) = w. x +  = 0 to be the perpendicular bisector 
of the line segment joining the two closest points means that at the solution, 
and  = -w.p, where 
1 
w-- ( Z oqxi- Z aixi) (15) 
i:yi----+l i:yi------1 
Thus w lies along the line segment (and is half its size) and p is the midpoint of the 
line segment. By rescaling the objective function and using the class labels yi -- +1 
we can rewrite this as4: 
subject to 
1 
mjn [[w[[ 2 =  ZaiaJYiYJXi .xj (17) 
ij 
Y]. aiyi = 0, E ai = 2, ai _> 0. 
i i 
1 
The associated decision function is f(x) = w. x +  where w =  Y']-i otiYiXi, 
I 1 
P --  Ei Otiggi and  = -w.p = - ']4j otiYiOtj ggi ' ggJ' 
(18) 
3.2 The Connection with v-SVM 
Consider now the two sets of points defined by: 
H+,={ Z aixi Z ai=l, 
i:yi=+l i:yi=+l 
0 <_ ai <_ } (19) 
and 
H-,={ Z aixi 
i:yi=--I 
Z ai = 1, 
i:yl------1 
o < < (20) 
We have the following simple proposition: 
Proposition 1: H+. C H+ and H_. C H_, and H+. and H_. are both convex 
sets. Furthermore, the positions of the points H+. and H_. with respect to the xi 
do not depend on the choice of origin. 
Proof.' Clearly, since the ai defined in H+. is a subset of the ai defined in H+, 
H+. C H+, similarly for H_. Now consider two points in H+. defined by ax, 
Then all points on the line joining these two points can be written as Y]-i:u,=+x ((1 - 
,)Otli -{- ,Ot2i)ggi, 0 _  _ 1. Since ai and a2i both satisfy 0 _ ai _ p, so does 
(1-A)axi+Aa2i, and since also '.i:y=+x (1-A)axi+Aa2i = 1, the set H+. is convex. 
4That one can rescale the objective function without changing the constraints follows 
from uniqueness of the solution. See also Burges and Crisp, "Uniqueness of the SVM 
Solution" in this volume. 
248 D. . Crisp and C. J. C. Burges 
The argument for H_ is similar. Finally, suppose that every xi is translated by 
xo, i.e. xi -- xi + xo Vi. Then since -'-i:y,--+ ai - 1, every point in H+ is also 
translated by the same amount, similarly for H_. I2 
The problem of finding the optimal separating hyperplane between the convex sets 
H+ and H_ then becomes: 
subject to 
1 
main Hw[[ 2 -   oqajyiyjxi .xj (21) 
ij 
(22) 
i i 
Since Eqs. (21) and (22) are identical to (9) and (10), we see that the ,-SVM 
algorithm is in fact finding the optimal separating hyperplane between the convex 
sets H+ and H_. We note that the convex sets H+ and H_ are not simply 
uniformly scaled versions of H+ and H_. An example is shown in Figure 1. 
1/3 
xl 
x3 
! 
p.=!/3 
5112 
!/6 ' 
x2 
xl 
!/3 ! 
x3 
! 
--5/! 2 
xl 
1/6 5/12 ! 
x3 
1/2 ! 
x2 
! 
/4 
!/4 
xl 
x3 
!/4 3/4 ! 
Figure 1: The soft convex hull for the vertices of a right isosceles triangle, for 
various p. Note how the shape changes as the set grows and is constrained by the 
boundaries of the encapsulating convex hull. For/z  , the set is empty. 
Below, we will refer to the formulation given in this section as the soft convex hull 
formulation, and the sets of points defined in Eqs. (19) and (20) as soft convex 
hulls. 
3.3 Comparing the Offsets and Margin Widths 
The natural value of the offset  in the soft convex hull approach,  = -w .p, arose 
by asking that the separating hyperplane lie halfway between the closest extremities 
of the two soft convex hulls. Different choices of b just amount to hyperplanes with 
the same normal but at different perpendicular distances from the origin. This 
value of b will not in general be the same as that for which the cost term in Eq. (7) 
is minimized. We can compare the two values as follows. The KKT conditions for 
the p-SVM formulation are 
- = 0 
a(y(w . + b) - p + = 0 
Multiplying (24) by Yi, summing over i and using (23) gives 
(23) 
(24) 
A Geometn'c Interpretation ofu-SVM Classifiers 249 
(25) 
Thus the separating hyperplane found in the/z-SVM algorithm sits a perpendicular 
distance 121111 >-]4 yiil away from that found in the soft convex hull formulation. 
For the given w, this choice of b results in the lowest value of the cost,/z ]-i i- 
The soft convex hull approach suggests taking 5 = w  w, since this is the value 
Ill takes at the points >-.y,=+x aixi and >-.y,=_x aixi. Again, we can use the KKT 
conditions to compare this with p. Summing (24) over i and using (23) gives 
p = + f,. (26) 
i 
Since 5 = w- w, this again shows that if p = 0 then w = i = O, and, by (25), b - O. 
3.4 The Primal for the Soft Convex Hull Formulation 
By substituting (25) and (26) into the/z-SVM primal formulation (7) and (8) we 
obtain the primal formulation for the soft convex hull problem: minimize 
P- I1112- 
with respect to w, , 5, , subject to: 
(27) 
yi(w ' xi + ) > 5- i + I z E I + YiYj j, 
- 2 
o. (28) 
It is straightforward to check that the dual is exactly (9) and (10). Moreover, by 
summing the relevant KKT conditions, as above, we see that  - -w.p and 5 - w.w. 
Note that in this formulation the variables i retain their meaning according to (8). 
4 Choosing v 
In this section we establish some results on the choices for v, using the /z-SVM 
formulation. First, note that --i aiyi = 0 and '-i ci = 2 implies -]4:y,=+ ci - 
-]-i:v,=- ci = 1. Then ci _> 0 gives ci <_ 1, Vi. Thus choosing/z  1, which 
corresponds to choosing v ( 2/l, results in the same solution of the dual (and hence 
the same normal w) as choosing/z = 1. (Note that different values of/z  I can 
still result in different values of the other primal variables, e.g. b). 
The equalities 5-]-i:v,=+ ci = 5-]-i:w=- ci = 1 also show that if/z < 2/l then the 
feasible region for the dual is empty and hence the problem is insoluble. This 
corresponds to the requirement y < 1. However, we can improve upon this. Let l+ 
(l_) be the number of positive (negative) polarity points, so that l+ + l_ = l. Let 
lmin ---- min{/+, l_ }. Then the minimal value of/z which still results in a nonempty 
feasible region is IZmin -- 1/lmin. This gives the condition v <_ 21min/l. 
We define a "nontrivial" solution of the problem to be any solution with w  0. 
The following proposition gives conditions for the existence of nontrivial solutions. 
250 D. J. Crisp and C. J. C. Burges 
Proposition 2: A value of v exists which will result in a nontrivial solution to 
the v-SVM classification problem if and only if {H+: p = P,in} r3 {H_: t = 
min } = 0. 
Proof: Suppose that {H+: p = P,i} rn {H_: t = P,i}  0. Then for all 
allowable values of p (and hence u), the two convex hulls will intersect, since {H+: 
t = t,} C {H+: t _> t,i} and {H_: t = t,n} C {H_: t _> t,i}. If 
the two convex hulls intersect, then the solution is trivial, since by definition there 
then exist feasible points z such that z = Y'i:ui=+ OtiXi and z = i:i=_ oti2i , 
and hence 2w = EiotiYiXi = Ei:yi=+l tixi- Ei:yi=-i otixi = 0 (cf. (21), (22). 
Now suppose that {H+: p = P,i} r3 {H_: / = P,i} = 0. Then clearly a 
nontrivial solution exists, since the shortest distance between the two convex sets 
(H+: p = P,i) and (H_: p = P,in) is not zero, hence the corresponding 
w0. D 
Note that when l+ = l_, the condition amounts to the requirement that the centroid 
of the positive examples does not coincide with that of the negative examples. Note 
also that this shows that, given a data set, one can find a lower bound on u, by 
finding the largest p that satisfies H_u Cl H+u = 0. 
5 Discussion 
The soft convex hull interpretation suggests that an appropriate way to penalize 
positive polarity errors differently from negative is to replace the sum p -i i in (7) 
with p+ i:=+ i + P- -].i:=_, i. In fact one can go further and introduce a p 
for every train point. The p-SVM formulation makes this possibility explicit, which 
it is not in original u-SVM formulation. 
Note also that the fact that ,-SVM leads to values of b which differ from that which 
would place the optimal hyperplane halfway between the soft convex hulls suggests 
that there may be principled methods for choosing the best b for a given problem, 
other than that dictated by minimizing the sum of the i's. Indeed, originally, the 
sum of i's term arose in an attempt to approximate the number of errors on the 
train set [2]. The above reasoning in a sense separates the justification for w from 
that for b. For example, given w, a simple line search could be used to find that 
value of b which actually does minimize the number of errors on the train set. Other 
methods (for example, minimizing the estimated Bayes error [3]) may also prove 
useful. 
Acknowledgments 
C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Tech- 
nologies for their support. 
References 
[1] B. Sch51kopf and A. Smola and R. Williamson and P. Bartlett. New support vector 
algorithms, neurocolt2 nc2-tr-1998-031. Technical report, GMD First and Australian 
National University, 1998. 
[2] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 
1995. 
[3] C. J. C. Burges and B. Sch51kopf. Improving the accuracy and speed of support vector 
learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural 
Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press. 
