Uniqueness of the SVM Solution 
Christopher J.C. Burges 
Advanced Technologies, 
Bell Laboratories, 
Lucent Technologies 
Holmdel, New Jersey 
burges lucent. corn 
David J. Crisp 
Centre for Sensor Signal and 
Information Processing, 
Deptartment of Electrical Engineering, 
University of Adelaide, South Australia 
dcrisp eleceng. adelaide. edu. au 
Abstract 
We give necessary and sufficient conditions for uniqueness of the 
support vector solution for the problems of pattern recognition and 
regression estimation, for a general class of cost functions. We show 
that if the solution is not unique, all support vectors are necessarily 
at bound, and we give some simple examples of non-unique solu- 
tions. We note that uniqueness of the primal (dual) solution does 
not necessarily imply uniqueness of the dual (primal) solution. We 
show how to compute the threshold b when the solution is unique, 
but when all support vectors are at bound, in which case the usual 
method for determining b does not work. 
1 Introduction 
Support vector machines (SVMs) have attracted wide interest as a means to imple- 
ment structural risk minimization for the problems of classification and regression 
estimation. The fact that training an SVM amounts to solving a convex quadratic 
programming problem means that the solution found is global, and that if it is not 
unique, then the set of global solutions is itself convex; furthermore, if the objec- 
tive function is strictly convex, the solution is guaranteed to be unique [1] x. For 
quadratic programming problems, convexity of the objective function is equivalent 
to positive semi-definiteness of the Hessian, and strict convexity, to positive definite- 
ness [1]. For reference, we summarize the basic uniqueness result in the following 
theorem, the proof of which can be found in [1]: 
Theorem 1: The solution to a convex programming problem, for which the ob- 
jective function is strictly convex, is unique. Positive definiteness of the Hessian 
implies strict convexity of the objective function. 
Note that in general strict convexity of the objective function does not neccesarily 
imply positive definiteness of the Hessian. Furthermore, the solution can still be 
unique, even if the objective function is loosely convex (we will use the term "loosely 
convex" to mean convex but not strictly convex). Thus the question of uniqueness 
XThis is in contrast with the case of neural nets, where local minima of the objective 
function can occur. 
224 C. J. C. Burges and D. J. Crisp 
for a convex programming problem for which the objective function is loosely convex 
is one that must be examined on a case by case basis. In this paper we will give 
necessary and sufficient conditions for the support vector solution to be unique, 
even when the objective function is loosely convex, for both the clasification and 
regression cases, and for a general class of cost function. 
One of the central features of the support vector method is the implicit mapping 
 of the data z  Rn to some feature space 5, which is accomplished by replacing 
dot products between data points zi, zj, wherever they occur in the train and test 
algorithms, with a symmetric function K(zi, zj), which is itself an inner product in 
 [2]: K(zi, zj) -- ((I)(zi), (I)(zj)) -- (xi,xj), where we denote the mapped points in 
 by x - (z). In order for this to hold the kernel function K must satisfy Mercer's 
positivity condition [3]. The algorithms then amount to constructing an optimal 
separating hyperplane in 5, in the pattern recognition case, or fitting the data to a 
linear regression tube (with a suitable choice of loss function [4]) in the regression 
estimation case. Below, without loss of generality, we will work in the space 5, 
whose dimension we denote by dr. The conditions we will find for non-uniqueness 
of the solution will not depend explicitly on  or . 
Most approaches to solving the support vector training problem employ the Wolfe 
dual, which we describe below. By uniqueness of the primal (dual) solution, we 
mean uniqueness of the set of primal (dual) variables at the solution. Notice that 
strict convexity of the primal objective function does not imply strict convexity of 
the dual objective function. For example, for the optimal hyperplane problem (the 
problem of finding the maximal separating hyperplane in input space, for the case 
of separable data), the primal objective function is strictly convex, but the dual 
objective function will be loosely convex whenever the number of training points 
exceeds the dimension of the data in input space. In that case, the dual Hessian 
H will necessarily be positive semidefinite, since H (or a submatrix of H, for the 
cases in which the cost function also contributes to the (block-diagonal) Hessian) is a 
Gram matrix of the training data, and some rows of the matrix will then necessarily 
be linearly dependent [5] 2 . In the cases of support vector pattern recognition and 
regression estimation studied below, one of four cases can occur: (1) both primal 
and dual solutions are unique; (2) the primal solution is unique while the dual 
solution is not; (3) the dual is unique but the primal is not; (4) both solutions 
are not unique. Case (2) occurs when the unique primal solution has more than 
one expansion in terms of the dual variables. We will give an example of case (3) 
below. It is easy to construct trivial examples where case (1) holds, and based on 
the discussion below, it will be clear how to construct examples of (4). However, 
since the geometrical motivation and interpretation of SVMs rests on the primal 
variables, the theorems given below address uniqueness of the primal solution s . 
2 The Case of Pattern Recognition 
We consider a slightly generalized form of the problem given in [6], namely to 
minimize the objective function 
F = (1/2)[[w[[ 2 +  Cif (1) 
i 
aRecall that a Gram matrix is a matrix whose ij'th element has the form {xi,xj} for 
some inner product {, ), where xi is an element of a vector space, and that the rank of a 
Gram matrix is the maximum number of linearly independent vectors xi that appear in it 
[6]. 
aDue to space constraints some proofs and other details will be omitted. Complete 
details will be given elsewhere. 
Uniqueness of the SVM Solution 225 
with constants p  [1, cw), Ci > 0, subject to constraints: 
yi(w ' xi + b) _ 1 - i, i = 1,...,l (2) 
i k 0, i=l,---,l (3) 
where w is the vector of weights, b a scalar threshold, i are positive slack variables 
which are introduced to handle the case of nonseparable data, the Yi are the polar- 
ities of the training samples (Yi 6 (+1)), xi are the images of training samples in 
the space  by the mapping , the Ci determine how much errors are penalized 
(here we have allowed each pattern to have its own penalty), and the index i labels 
the I training patterns. The goal is then to find the values of the primal variables 
(w, b, i) that solve this problem. Most workers choose p = 1, since this results in 
a particularly simple dual formulation, but the problem is convex for any p _ 1. 
We will not go into further details on support vector classification algorithms them- 
selves here, but refer the interested reader to [3], [7]. Note that, at the solution, b 
is determined from w and i by the Karush Kuhn Tucker (KKT) conditions (see 
below), but we include it in the definition of a solution for convenience. 
Note that Theorem i gives an immediate proof that the solution to the optimal 
hyperplane problem is unique, since there the objective function is just (1/2)l[w[I 2, 
which is strictly convex, and the constraints (Eq. (2) with the  variables removed) 
are linear inequality constraints which therefore define a convex set 4. 
For the discussion below we will need the dual formulation of this problem, for the 
case p = 1. It takes the following form: minimize  Eij OtiOtjYiYj{Xi,Xj) -- -].i Oti 
subject to constraints: 
_Y 0, alP_0 (4) 
Ci = -i + m (5) 
) = o (6) 
i 
and where the solution takes the form w = -']-i otiYiXi, and the KKT conditions, 
which are satisfied at the solution, are rlii = O, oq(yi(w.xi + b) - 1 + i) = 0, where 
r/i are Lagrange multipliers to enforce positivity of the i, and ci are Lagrange 
multipliers to enforce the constraint (2). The r/i can be implicitly encapsulated 
in the condition 0 _ oi _ Ci, but we retain them to emphasize that the above 
equations imply that whenever i  0, we must have ci - Ci. Note that, for a 
given solution, a support vector is defined to be any point xi for which ci > 0. Now 
suppose we have some solution to the problem (1), (2), (3). Let Af denote the set 
{i: Yi = 1, w 'xi + b < 1}, Af2 the set {i: Yi = -1, w 'xi + b > -1}, JV'a the set 
(i: Yi = 1, w.xi + b = 1), Af4 the set (i: Yi = -1, w. xi + b = -1), JV'5 the set 
(i: yi = 1, w.xi +b  1), and Af6 the set (i: yi = -1, w.xi + b (-1). Then we 
have the following theorem: 
Theorem 2: The solution to the soft-margin problem, (1), (2) and (3), is unique 
for p > 1. For p = 1, the solution is not unique if and only if at least one of the 
following two conditions holds: 
c, = (7) 
Furthermore, whenever the solution is not unique, all solutions share the same w, 
and any support vector xi has Lagrange multiplier satisfying cq = Ci, and when (7) 
4This is of course not a new result: see for example [3]. 
226 C. J. C. Burges and D. J.. Crisp 
holds, then Afa contains no support vectors, and when (8) holds, then iV'4 contains 
no support vectors. 
Proof: For the case p > 1, the objective function F is strictly convex, since a 
sum of strictly convex functions is a strictly convex function, and since the function 
g(v) - v p, v  + is strictly convex for p > 1. Furthermore the constraints define 
a convex set, since any set of simultaneous linear inequality constraints defines a 
convex set. Hence by Theorem i the solution is unique. 
For the case p = 1, define z to be that dr +/-component vector with zi - wi, i - 
1,..., dF, and zi - i, i - d, + 1,...,d, + I. In terms of the variables z, the 
problem is still a convex programming problem, and hence has the property that 
any solution is a global solution. Suppose that we have two solutions, z and 
z2. Then we can form the family of solutions zt, where zt -- (1 - t)zx + tz2, and 
since the solutions are global, we have F(z) = F(z2) = F(zt). By expanding 
F(zt) - F(z) = 0 in terms of z and z2 and differentiating twice with respect to t 
we find that wx = w2. Now given w and b, the i are completely determined by the 
KKT conditions. Thus the solution is not unique if and only if b is not unique. 
Define 5  min {minieAr  i, minieAr6 (-1 - w.xi - b)}, and suppose that condition 
(7) holds. Then a different solution {w', b','} is given by w' = w, b' = b + 
and i = i - 5, Vi  Aft, i = i + 5, Vi  iV'2 uAf4, all other i = 0, since by 
construction F then remains the same, and the constraints (2), (3) are satisfied 
by the primed variables. Similarly, suppose that condition (8) holds. Define 5 -- 
min{minief 2 i,minie5(w .xi + b- 1)}. Then a different solution {w',b','} is 
given byw' = w, b' = b-5, and i = 
all other i = 0, since again by construction F is unchanged and the constraints 
are still met. Thus the given conditions are sufficient for the solution to be non- 
unique. To show necessity, assume that the solution is not unique: then by the 
above argument, the solutions must differ by their values of b. Given a particular 
solution b, suppose that b + 5, 5 > 0 is also a solution. Since the set of solutions is 
itself convex, then b + 5  will also correspond to a solution for all 5 : 0 _< 5  _< 5. 
Given some b  = b + 5 , we can use the KKT conditions to compute all the i, and 
we can choose 5' sufficiently small so that no i, i 6 iV'6 that was previously zero 
becomes nonzero. Then we find that in order that F remain the same, condition 
(7) must hold. If b - 5, 5 > 0 is a solution, similar reasoning shows that condition 
(8) must hold. To show the final statement of the theorem, we use the equality 
constraint (6), together with the fact that, from the KKT conditions, all support 
vectors xi with indices in Af UJV'2 satisfy ai = Ci. Substituting (6) in (7) then gives 
Y-a ai + Y-4 (Ci - ai) = 0 which implies the result, since all ai are non-negative. 
Similarly, substituting (6) in (8) gives Y-fa (el - cq) + y,f4 ci = 0 which again 
implies the result. Cl 
Corollary: For any solution which is not unique, letting $ denote the set of indices 
of the corresponding set of support vectors, then we must have Y-i$ CiYi - O. 
Furthermore, if the number of data points is finite, then for at least one of the 
family of solutions, all support vectors have corresponding i  0. 
Note that it follows from the corollary that if the Ci are chosen such that there 
exists no subset 7' of the train data such that Y-iT CiYi -' O, then the solution is 
guaranteed to be unique, even ifp = 1. Furthermore this can be done by choosing all 
the Ci very close to some central value C, although the resulting solution can depend 
sensitively on the values chosen (see the example immediately below). Finally, note 
that if all Ci are equal, the theorem shows that a necessary condition for the solution 
to be non-unique is that the negative and positive polarity support vectors be equal 
in number. 
Uniqueness of the SVM Solution 22 7 
A simple example of a non-unique solution, for the case p = 1, is given by a train set 
in one dimension with just two examples, (xl = 1, yx = 1} and (x2 = -1,y2 = -1, 
with C1 = C  C. It is straightforward to show analytically that for C _ 5, 
the solution is unique, with w = 1, ix =  = b = 0, and margin 5 equal to 2, 
 there is a family of solutions, with -1 + 2C < b < i - 2C and 
while for C < 5 _ _ 
x corresponds to 
 =1-b-2C, = l+b-2C, and margin l/C. ThecaseC<  
Case (3) in Section (1) (dual unique but primal not), since the dual variables are 
uniquely specified by c - C. Note also that this family of solutions also satisfies 
the condition that any solution is smoothly deformable into another solution [7]. 
If Cx  C, the solution becomes unique, and is quite different from the unique 
solution found when C2  C. When the C's are not equal, one can interpret 
what happens in terms of the mechanical analogy [8], with the central separating 
hyperplane sliding away from the point that exerts the higher force, until that point 
lies on the edge of the margin region. 
Note that if the solution is not unique, the possible values of b fall on an interval 
of the real line: in this case a suitable choice would be one that minimizes an 
estimate of the Bayes error, where the SVM output densities are modeled using a 
validation set 6. Alternatively, requiring continuity with the cases p  1, so that one 
would choose that value of b that would result by considering the family of solutions 
generated by different choices of p, and taking the limit from above of p - 1, would 
again result in a unique solution. 
3 The Case of Regression Estimation 7 
Here one has a set of I pairs (xl,y),(x,y),...,(x,y), (xi  .T',yi  , and 
the goal is to estimate the unknown functional dependence ] of the y on the x, 
where the function f is assumed to be related to the measurements {xi,Yi} by 
Yi ---- ](Xi)+hi, and where ni represents noise. For details we refer the reader to [3], 
[9]. Again we generalize the original formulation [10], as follows: for some choice of 
positive error penalties el, and for positive el, minimize 
i . 
F- Ilwll + +c; (9) 
with constant p  [1, ), subject to constrnts 
yi-w'xi-b  ei+i (10) 
w'xi+b-yi 5 ei+ (11) 
*)  0 (12) 
where we have adopted the notation *)  (i, ) [9]. This formulation results in 
an % insensitive" loss function, that is, there is no penalty (*) = 0) sociated 
with point xi if [Yi - w 'xi - bl  el. Now let fl, fl* be the Lagrange multipliers 
introduced to enforce the constraints (10), (11). The dual then gives 
= o 5 C,, o 5 5 (13) 
i i 
5The margin is defined to be the distance between the two hyperplanes corresponding 
to equality in Eq. (2), namely 2/11wll , and the margin region is defined to be the set of 
points between the two hyperplanes. 
6This method was used to estimate b under similar circumstances in [8]. 
The notation in this section only coincides with that used in section 2 where convenient. 
228 C. ,J.. C. Burges and D. d. Crisp 
which we will need below. For this formulation, we have the following 
Theorem 3: For a given solution, define f(Xi, Yi) ---- Yi -- W ' Xi -- b, and define 
to be the set of indices {i: f(xi, Yi) > el}, iV'2 the set {i: f(xi, Yi) -- el}, JV'a the set 
{i: f(xi,Yi) = -el}, and JV'a the set {i: f(xi,Yi) < -el}. Then the solution to (9) 
- (12) is unique for p > 1, and for p = I it is not unique if and only if at least one 
of the following two conditions holds: 
ieu 
(14) 
(15) 
Furthermore, whenever the solution is not unique, all solutions share the same w, 
and all support vectors are at bound (that is s, either/i = Ci or/ = C), and 
when (14) holds, then JV'a contains no support vectors, and when (15) holds, then 
iV'2 contains no support vectors. 
The theorem shows that in the non-unique case one will only be able to move the 
tube (and get another solution) if one does not change its normal w. A trivial 
example of a non-unique solution is when all the data fits inside the e-tube with 
room to spare, in which case for all the solutions, the normal to the e-tubes always 
lies along the y direction. Another example is when all Ci are equal, all data falls 
outside the tube, and there are the same number of points above the tube as below 
it. 
4 Computing b when all SVs are at Bound 
The threshold b in Eqs. (2), (10) and (11) is usually determined from that sub- 
set of the constraint equations which become equalities at the solution and for 
which the corresponding Lagrange multipliers are not at bound. However, it may 
be that at the solution, this subset is empty. In this section we consider the sit- 
uation where the solution is unique, where we have solved the optimization prob- 
lem and therefore know the values of all Lagrange multipliers, and hence know 
also w, and where we wish to find the unique value of b for this solution. Since 
the *) are known once b is fixed, we can find b by finding that value which 
both minimizes the cost term in the primal Lagrangian, and which satisfies all 
the constraint equations. Let us consider the pattern recognition case first. Let 
$+ (S_) denote the set of indices of positive (negative) polarity support vectors. 
Also let V+ (V_) denote the set of indices of positive (negative) vectors which are 
not support vectors. It is straightforward to show that if '-i$_ Ci > Y-ie$+ Ci, 
then b = max{maxims_ (-1 -w.xi), maxiv+(1 -w.xi)}, while if Y-i$_ Ci < 
Y-i$+ Ci, then b = min{mini$+(1-w.xi), miniv_(-1-w.xi)}. Further- 
more, if Y-i$_ Ci = Y-i$+ Ci, and if the solution is unique, then these two values 
coincide. 
In the regression case, let us denote by $ the set of indices of all sup- 
port vectors,  its complement, S the set of indices for which /i = Ci, 
and S2 the set of indices for which / = Ct, so that S = S U S2 (note 
S1 r"l S 2 -- 0). Then if Y-ies2 Ct > Y'-ies C, the desired value of b is 
b = max {maxies(Yi - w. xi + el), maxie$(yi - w- xi - el)} while if Y-ies2 C < 
Y-i& Ci, then b = min{mini$(yi-w.xi-ei), miniB(yi-w.xi +ei)}. 
SRecall that if ei > O, then 3i/[ = O. 
Uniqueness of the SVM Solution 229 
Again, if the solution is unique, and if also '-ies2 C = '-ies Ci, then these 
two values coincide. 
5 Discussion 
We have shown that non-uniqueness of the SVM solution will be the exception rather 
than the rule: it will occur only when one can rigidly parallel transport the margin 
region without changing the total cost. If non-unique solutions are encountered, 
other techniques for finding the threshold, such as minimizing the Bayes error arising 
from a model of the SVM posteriors [8], will be needed. The method of proof in the 
above theorems is straightforward, and should be extendable to similar algorithms, 
for example Mangasarian's Generalized SVM [11]. In fact one can extend this result 
to any problem whose objective function consists of a sum of strictly convex and 
loosely convex functions: for example, it follows immediately that for the case of the 
v-SVM pattern recognition and regression estimation algorithms [12], with arbitrary 
convex costs, the value of the normal w will always be unique. 
Acknowledgments 
C. Burges wishes to thank W. Keasler, V. Lawrence and C. Nohl of Lucent Tech- 
nologies for their support. 
References 
[1] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd 
edition, 1987. 
[2] B. E. Boser, I. M. Guyon, and V .Vapnik. A training algorithm for optimal margin 
classifiers. In Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, 
1992. ACM. 
[3] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1998. 
[4] A.J. Smola and B. SchSlkopf. On a kernel-based method for pattern recognition, 
regression, approximation and operator inversion. Algorithmica, 22:211 - 231, 1998. 
[5] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 
1985. 
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 
1995. 
[7] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data 
Mining and Knowledge Discovery, 2(2):121-167, 1998. 
[8] C. J. C. Burges and B. SchSlkopf. Improving the accuracy and speed of support vector 
learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural 
Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press. 
[9] A. Smola and B. SchSlkopf. A tutorial on support vector regression. Statistics and 
Computing, 1998. In press: also, COLT Technical Report TR-1998-030. 
[10] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approx- 
imation, regression estimation, and signal processing. Advances in Neural Information 
Processing Systems, 9:281-287, 1996. 
[11] O.L. Mangarasian. Generalized support vector machines, mathematical programming 
technical report 98-14. Technical report, University of Wisconsin, October 1998. 
[12] B. SchSlkopf, A. Smola, R. Williamson and P. Bartlett, New Support Vector Algo- 
rithms, NeuroCOLT2 NC2-Tl-1998-031, 1998. 
