Shrinking the Tube: 
A New Support Vector Regression Algorithm 
Bernhard Schiilkopl '*, Peter Bartlett*, Alex Smola,*, Robert Williamson* 
 GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany 
* FEIT/RSISE, Australian National University, Canberra 0200, Australia 
bs, smola @ first.gmd.de, Peter. Bartlett, Bob. Williamson @anu.edu.au 
Abstract 
A new algorithm for Support Vector regression is described. For a priori 
chosen , it automatically adjusts a flexible tube of minimal radius to the 
data such that at most a fraction  of the data points lie outside. More- 
over, it is shown how to use parametric tube shapes with non-constant 
radius. The algorithm is analysed theoretically and experimentally. 
1 INTRODUCTION 
Support Vector (SV) machines comprise a new class of learning algorithms, motivated by 
results of statistical learning theory (Vapnik, 1995). Originally developed for pattern recog- 
nition, they represent the decision boundary in terms of a typically small subset (Sch61kopf 
et al., 1995) of all training examples, called the Support Vectors. In order for this property 
to carry over to the case of SV Regression, Vapnik devised the so-called s-insensitive loss 
function I!t - f(x)[ = max{O, lY - f(x)[ - }, which does not penalize errors below 
some e > O, chosen a priori. His algorithm, which we will henceforth call e-SVR, seeks to 
estimate functions 
f(x) = (w-x) + b, w,x  11N,b  11, (1) 
based on data 
(x, y),..., (x, y)  N x , (2) 
by minimizing the regularized risk functional 
IIw[I 2/2 + c. R 
,top, (3) 
where U is a constant determining the trade-off between niinimizing training errors and 
minimizing the model complexity term Ilwll 2, and R 
ernp .- 
The parameter s can be useful if the desired accuracy of the approximation can be specified 
beforehand. In some cases, however, we just want the estimate to be as accurate as possible, 
without having to commit ourselves to a certain level of accuracy. 
We present a modification of the s-SVR algorithm which automatically minimizes s, thus 
adjusting the accuracy level to the data at hand. 
Shrinking the Tube.' A New Support Vector Regression Algorithm 331 
2 v-SV REGRESSION AND c-SV REGRESSION 
To estimate functions (1) from empirical data (2) we proceed as follows (Sch61kopf et al., 
1998a). At each point xi, we allow an error of e. Everything above e is captured in 
slack variables ,c(*) ((.) being a shorthand implying both the variables with and without 
asterisks), which are penalized in the objective function via a regularization constant (7, 
chosen a priori (Vapnik, 1995). The tube size s is traded off against model complexity and 
slack variables via a constant t,, > 0: 
minimize r(w,(*),) = I1w112/2 
subject to ((w  xi) +b)-Yi 
- ((w. + 
(*) > 0, e 
I )) 
< e + i i= 
<_ e + 
> O. 
(4) 
(5) 
(6) 
(7) 
Here and below, it is understood that i = 1,..., , and that bold face greek letters denote 
-dimensional vectors of the corresponding variables. Introducing a Lagrangian with mul- 
tipliers - (*) -(*) 
cr i , 'tli ,/3 >_ 0, we obtain the the Wolfe dual problem. Moreover, as Boser et al. 
(1992), we substitute a kernel k for the dot product, corresponding to a dot product in some 
feature space related to input space via a nonlinear map , 
k(x,y) = ((x)- (y)). (8) 
This leads to the v-SVR Optimization Problem: for v >_ 0, C > 0, 
maximize W(c (*)) = (ct? -- cti)Yi 2 
subject to 
 C 
(10) 0 < -(*)<- 
The regression estimate can be shown to take the form 
/(x): (ct? - cti)k(xi,x) + b, 
i=1 
i,j=l 
(9) 
(13) 
where b (and ) can be computed by taking into account that (5) and (6) (substitution of 
j (ctj - ctd)k(xd, x) for (w-x) is understood) become equalities with I*) = 0 for points 
 (*) C/E, respectively, due to the Karush-Kuhn-Tucker conditions (cf. Vapnik, 
with 0 < c i < 
1995). The latter moreover imply that in the kernel expansion (13), only those - (*) will 
cr i 
be nonzero that correspond to a constraint (5)/(6) which is precisely met. The respective 
patterns xi are referred to as Support Vectors. 
Before we give theoretical results explaining the significance of the parameter t,,, the fol- 
lowing observation concerning  is helpful. If t., > 1, then  - 0, since it does not pay to 
increase s (cf. (4)). If t,, < 1, it can still happen that e = 0, e.g. if the data are noise-free 
and can perfectly be interpolated with a low capacity model. The case s = 0, however, is 
not what we are interested in; it corresponds to plain Lt loss regression. Below, we will use 
the term errors to refer to training points lying outside of the tube, and the term fraction 
of errors/SVs to denote the relative numbers of errors/SVs, i.e. divided by . 
Proposition 1 Assume  > O. The following statements hold: 
(i) t,, is an upper bound on the fraction of errors. 
(ii) t,, is a lower bound on the fraction of SVs. 
332 B. Sch6lkopf, P. L. Bartlett, A.d. Smola and R. Williamson 
(iii) Suppose the data (2) were generated iid from a distribution P(x,j) = 
P(x)P(ylx) with/v(/[x) continuous. With probability 1, asymptotically, t., equals 
both the fraction of SVs and the fraction of errors. 
The first two statements of this proposition can be proven from the structure of the dual op- 
timization problem, with (12) playing a crucial role. Presently, we instead give a graphical 
proof based on the primal problem (Fig. 1). 
To understand the third statement, note that all errors are also SVs, but there can be SVs 
which are not errors: namely, if they lie exactly at the edge of the tube. Asymptotically, 
however, these SVs form a negligible fraction of the whole SV set, and the set of errors and 
the one of SVs essentially coincide. This is due to the fact that for a class of functions with 
well-behaved capacity (such as SV regression functions), and for a distribution satisfying 
the above continuity condition, the number of points that the tube edges f 4- s can pass 
through cannot asymptotically increase linearly with the sample size. Interestingly, the 
proof (Sch61kopf et al., 1998a) uses a uniform convergence argument similar in spirit to 
those used in statistical learning theory. 
Due to this proposition, 0 _< , < 1 can be used to control the number of errors (note that 
for t., _> 1, (11) implies (12), since ci  ct. = 0 for all i (Vapnik, 1995)). Moreover, since 
the constraint (10) implies that (12) is equivalent to -i -  < C///2, we conclude that 
Proposition 1 actually holds for the upper and the lower edge of the tube separately, with 
t.,/2 each. As an aside, note that by the same argument, the number of SVs at the two edges 
of the standard s-SVR tube asymptotically agree. 
Moreover, note that this bears on the robustness of t.,-SVR. At first glance, SVR seems all 
but robust: using the s-insensitive loss function, only the patterns outside of the s-tube con- 
tribute to the empirical risk term, whereas the patterns closest to the estimated regression 
have zero loss. This, however, does not mean that it is only the outliers that determine the 
regression. In fact, the contrary is the case: one can show that local movements of target 
values ti of points xi outside the tube do not influence the regression (Sch61kopf et al., 
1998c). Hence, t.,-SVR is a generalization of an estimator for the mean of a random vari- 
able which throws away the largest and smallest examples (a fraction of at most ,/2 of 
either category), and estimates the mean by taking the average of the two extremal ones of 
the remaining examples. This is close in spirit to robust estimators like the trimmed mean. 
Let us briefly discuss how the new algorithm relates to s-SVR (Vapnik, 1995). By rewriting 
(3) as a constrained optimization problem, and deriving a dual much like we did for t.,-SVR, 
Figure 1' Graphical depiction of the t.,-trick. Imag- 
ine increasing s, starting from 0. The first term in 
1   
t.,s+ ? Y-'i= (i +i ) (cf. (4)) will increase propor- 
tionally to t.,, while the second term will decrease 
proportionally to the fraction of points outside of 
the tube. Hence, s will grow as long as the latter 
fraction is larger than t.,. At the optimum, it there- 
fore must be _< t., (Proposition 1, (i)). Next, imag- 
ine decreasing s, starting from some large value. 
Again, the change in the first term is proportional 
to t.,, but this time, the change in the second term 
is proportional to the fraction of SVs (even points 
on the edge of the tube will contribute). Hence, s 
will shrink as long as the fraction of SVs is smaller 
than t.,, eventually leading to Proposition 1, (ii). 
Shrinla'ng the Tube.' ,4 New &tpport Vector Regession ,4lgorithm 333 
one arrives at the following quadratic program: maximize 
1 
W(c, cU) = -s E(ct;+cti)+E(ctT-cti)yi-  E (ct'-cti)(ctJ -ctj)k(xi'xd) (14) 
/=1 /=1 i,j=l 
subject to (10) and (11). Compared to (9), we have an additional term -c 
which makes it plausible that the constraint (12) is not needed. 
In the following sense, y-SVR includes c-SVR. Note that in the general case, using kernels, 
' is a vector in feature space. 
Proposition 2 If u-SVR leads to the solution_ g, ,, D, then c-SVR with c set a priori to , 
and the same value of C', has the solution ,, b. 
Proof If we minimize (4), then fix c and minimize only over the remaining variables, the 
solution does not change. I 
3 PARAMETRIC INSENSITIVITY MODELS 
We generalized c-SVR by considering the tube as not given but instead estimated it as a 
model parameter. What we have so far retained is the assumption that the c-insensitive zone 
has a tube (or slab) shape. We now go one step further and use parametric models of arbi- 
trary shape. Let {(*)} (here and below, q = 1, ..., p is understood) be a set of 2p positive 
functions on I/v . Consider the following quadratic program: for given v' *) t, (*) > 0, 
minimize 
(*)) -IIwll2/2 + c- (%% + + + (15) 
i:1 
subjectto ((w.xi)+b)-yi 5 qsq(q(xi)+i (16) 
A calculation analogous to Sec. 2 shows that the Wolfe dual consists of maximizing (9) 
  (*)r(*) 
subject to (10), (11), and, instead of (12), the modified constraints i=t (xi) < 
i gq -- 
U  v? ). In the experiments in Sec. 4, we use a simplified version of this optimization 
problem, where we drop the term vs from the objective function (15), and use sq and 
(q in (17). By this, we render the problem symmetric with respect to the two edges of the 
tube. In addition, we use p = 1. This leads to the same Wolfe dual, except for the last 
constraint, which becomes (cf. (12)) 
i:1 
The advantage of this setting is that since the same v is used for both sides of the tube, the 
computation of s, b is straightforward: for instance, by solving a linear system, using two 
conditions as those described following (13). Otherwise, general statements are harder to 
make: the linear system can have a zero determinant, depending on whether the functions 
 (*) C/f, are 
(?), evaluated on the xi with 0 < i < linearly dependent. The latter occurs, 
for instance, if we use constant functions ((*)  1. In this case, it is pointless to use 
 (,) 
two different values v, v*' for, the constraint (10) then implies that both sums 
' 
will be bounded by U  min{v, v* }. We conclude this section by giving, without proof, a 
generalization of Proposition 1, (iii), to the optimization problem with constraint (19): 
334 B. Sch6lkopf P. L. Bartlett, A.d. Smola and R. 'lliamson 
Proposition 3 Assume 6 > O. Suppose the data (2) were generated iid from a distribution 
?(x, y) = P(x)P(ylx) with P(vlx) continuous. With probability 1, asymptotically, the 
fractions of SVs and errors equal ,.(f (x) dP(x))-, where 1t 5 is the asymptotic distribu- 
tion of SVs over x. 
4 EXPERIMENTS AND DISCUSSION 
In the experiments, we used the optimizer LOQO (http://www. princeton.eduFrvdb/). This 
has the serendipitous advantage that the primal variables b and  can be recovered as the 
dual variables of the Wolfe dual (9) (i.e. the double dual variables) fed into the optimizer. 
In Fig. 2, the task was to estimate a regression of a noisy sinc function, given/ examples 
(3i,]i), with :i drawn uniformly from [-3,3], and ti = sin(7rci)/(7rci) + vi, with vi 
drawn from a Gaussian with zero mean and variance a 2. We used the default parameters 
 = 50, C' -- 100, a = 0.2, and the RBF kernel k(:, :') = 
Figure 3 gives an illustration of how one can make use of parametric insensitivity models as 
proposed in Sec. 3. Using the proper model, the estimate gets much better. In the parametric 
case, we used r, = 0.1 and (x) = sin2((2r/3)z), which, due to f (c)dP(c) = 1/2, 
corresponds to our standard choice t., - 0.2 in t.,-SVR (cf. Proposition 3). The experimental 
findings are consistent with the asymptotics predicted theoretically even if we assume a 
uniform distribution of SVs: for  -- 200, we got 0.24 and 0.19 for the fraction of SVs and 
errors, respectively. 
This method allows the incorporation of prior knowledge into the loss function. Although 
this approach at first glance seems fundamentally different from incorporating prior know- 
ledge directly into the kernel (Sch61kopf et al., 1998b), from the point of view of statistical 
Figure 2: Left: t.,-SV regression with t., = 0.2 (top) and t., = 0.8 (bottom). The larger 
allows more points to lie outside the tube (see Sec. 2). The algorithm automatically adjusts 
s to 0.22 (top) and 0.04 (bottom). Shown are the sinc function (dotted), the regression f 
and the tube f 4- . Middle: t.,-SV regression on data with noise a --- 0 (top) and a - 1 
(bottom). In both cases, t., = 0.2. The tube width automatically adjusts to the noise (top: 
s = 0, bottom: s - 1.19). Right: s-SV regression (Vapnik, 1995) on data with noise rr = 0 
(top) and a -- 1 (bottom). In both cases, s -- 0.2 -- this choice, which has to be specified 
a priori, is ideal for neither case: in the top figure, the regression estimate is biased; in the 
bottom figure, s does not match the external noise (cf. Smola et al., 1998). 
Shrinking the Tube.' A New Support Vector Regression Algorithm 335 
Figure 3' Toy example, using 
prior knowledge about an :- 
dependence of the noise. Additive 
noise (a -- 1) was multiplied by 
sin2((27r/3):). Left: the same 
function was used as  as a para- 
metric insensitivity tube (Sec. 3). 
Right: -SVR with standard tube. 
Table 1' Results for the Boston housing benchmark; top: v-SVR, bottom: s-SVR. MSE: 
Mean squared errors, STD: standard deviations thereof (100 trials), Errors: fraction of train- 
ing points outside the tube, SVs: fraction of training points which are SVs. 
 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
automatic s 2.6 1.7 1.2 0.8 0.6 0.3 0.0 0.0 0.0 0.0 
MSE 9.4 8.7 9.3 9.5 10.0 10.6 11.3 11.3 11.3 11.3 
STD 6.4 6.8 7.6 7.9 8.4 9.0 9.6 9.5 9.5 9.5 
Errors 0.0 0.1 0.2 0.2 0.3 0.4 0.5 0.5 0.5 0.5 
SVs 0.3 0.4 0.6 0.7 0.8 0.9 1.0 1.0 1.0 1.0 
s 0 1 I 2 3 4 5 6 7 8 9 I 10 
MSE 11.3 9.5 8.8 9.7 11.2 13.1 15.6 18.2 22.1 27.0 34.3 
STD 9.5 7.7 6.8 6.2 6.3 6.0 6.1 6.2 6.6 7.3 8.4 
Errors 0.5 0.2 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
SVs 1.0 0.6 0.4 0.3 0.2 0.1 0.1 0.1 0.1 0.1 0.1 
learning theory the two approaches are closely related: in both cases, the structure of the 
loss-function-induced class of functions (which is the object of interest for generalization 
error bounds) is customized; in the first case, by changing the loss function, in the second 
case, by changing the class of functions that the estimate is taken from. 
Empirical studies using s-SVR have reported excellent performance on the widely used 
Boston housing regression benchmark set (Stitson et al., 1999). Due to Proposition 2, 
the only difference between -SVR and standard s-SVR lies in the fact that different 
parameters, s vs. v, have to be specified a priori. Consequently, we are in this exper- 
iment only interested in these parameters and simply adjusted C and the width 2a 2 in 
k(x,y) = exp(-IIx - yl12/(2a2)) as Sch61kopf et al. (1997): we used 2a 2 = 0.3. N, 
where N = 13 is the input dimensionality, and C/ - 10  50 (i.e. the original value of 
10 was corrected since in the present case, the maximal t-value is 50). We performed 100 
runs, where each time the overall set of 50t5 examples was randomly split into a training 
set of  = 481 examples and a test set of 25 examples. Table 1 shows that in a wide range 
of  (note that only 0 <_  _< 1 makes sense), we obtained performances which are close to 
the best performances that can be achieved by selecting s a priori by looking at the test set. 
Finally, note that although we did not use validation techniques to select the optimal values 
for C and 2a 2, we obtained performance which are state of the art (Stitson et al. (1999) re- 
port an MSE of 7.15 for s-SVR using ANOVA kernels, and 11.7 for Bagging trees). Table 1 
moreover shows that  can be used to control the fraction of SVs/errors. 
Discussion. The theoretical and experimental analysis suggest that  provides a way to 
control an upper bound on the number of training errors which is tighter than the one used 
in the soft margin hyperplane (Vapnik, 1995). In many cases, this makes it a parameter 
which is more convenient than the one in s-SVR. Asymptotically, it directly controls the 
336 B. Sch6lkopf, P L. Bartlett, A. . Smola and R. Williamson 
number of Support Vectors, and the latter can be used to give a leave-one-out generalization 
bound (Vapnik, 1995). In addition,  characterizes the compression ratio: it suffices to train 
the algorithm only on the SVs, leading to the same solution (Sch61kopf et al., 1995). In 
s-SVR, the tube width s must be specified a priori; in -SVR, which generalizes the idea of 
the trimmed mean, it is computed automatically. Desirable properties of s-SVR, including 
the formulation as a definite quadratic program, and the sparse SV representation of the 
solution, are retained. We are optimistic that in many applications, -SVR will be more 
robust than s-SVR. Among these should be the reduced set algorithm of Osuna and Girosi 
(1999), which approximates the SV pattern recognition decision surface by s-SVR. Here, 
 should give a direct handle on the desired speed-up. 
One of the immediate questions that a w-approach to SV regression raises is whether a 
similar algorithm is possible for the case of pattern recognition. This question has recently 
been answered to the affirmative (Sch61kopf et al., 1998c). Since the pattern recognition 
algorithm (Vapnik, 1995) does not use s, the only parameter that we can dispose of by 
using  is the regularization constant C'. This leads to a dual optimization problem with a 
homogeneous quadratic form, and  lower bounding the sum of the Lagrange multipliers. 
Whether we could have abolished C' in the regression case, too, is an open problem. 
Acknowledgement This work was supported by the ARC and the DFG (# Ja 379/71). 
References 
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin 
classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on 
Computational Learning Theory, pages 144-152, Pittsburgh, PA, 1992. ACM Press. 
E. Osuna and F. Girosi. Reducing run-time complexity in support vector machines. In 
B. Sch61kopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods -- Support 
Vector Learning, pages 271 -283. MIT Press, Cambridge, MA, 1999. 
B. Sch61kopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In 
U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference 
on Knowledge Discovery & Data Mining. AAAI Press, Menlo Park, CA, 1995. 
B. Sch61kopf, P. Bartlett, A. Smola, and R. Williamson. Support vector regression with 
automatic accuracy control. In L. Niklasson, M. Bod6n, and T. Ziemke, editors, Pro- 
ceedings of the 8th International Conference on Artificial Neural Networks, Perspectives 
in Neural Computing, pages 111 - 116, Berlin, 1998a. Springer Verlag. 
B. Sch61kopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector 
kernels. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information 
Processing Systems 10, pages 640 - 646, Cambridge, MA, 1998b. MIT Press. 
B. Sch61kopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. 
1998c. NeuroColt2-TR 1998-031; cf. http:iwww.neurocolt.com 
B. Sch61kopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Com- 
paring support vector machines with gaussian kernels to radial basis function classifiers. 
IEEE Trans. Sign. Processing, 45:2758 - 2765, 1997. 
A. Smola, N. Murata, B. Sch61kopf, and K.-R. MOller. Asymptotically optimal choice of 
s-loss for support vector machines. In L. Niklasson, M. Bod6n, and T. Ziemke, editors, 
Proceedings of the 8th International Conference on Artificial Neural Networks, Perspec- 
tives in Neural Computing, pages 105 - 110, Berlin, 1998. Springer Verlag. 
M. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Support 
vector regression with ANOVA decomposition kernels. In B. Sch61kopf, C. Burges, 
and A. Smola, editors, Advances in Kernel Methods  Support Vector Learning, pages 
285 - 291. MIT Press, Cambridge, MA, 1999. 
V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995. 
