ν-Arc: Ensemble Learning in the Presence of Outliers
G. Ritsch t, B. Sch51kopf t, A. Smola*, 
K.-R. Miillert, T. Onoda**, and S. Mika* 
t GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany 
Microsoft Research, I Guildhall Street, Cambridge CB2 3NH, UK 
 Dep. of Engineering, ANU, Canberra ACT 0200, Australia 
t CRIEPI, 2-11-1, Iwado Kita, Komae-shi, Tokyo, Japan 
{raetsch, klaus, mika}first.gmd.de, bscmicrosoft. com, 
Alex. Smolaanu. edu. au, onodacriepi. denken. or. j p 
Abstract 
AdaBoost and other ensemble methods have successfully been applied to a number
of classification tasks, seemingly defying problems of overfitting. AdaBoost
performs gradient descent in an error function with respect to the margin,
asymptotically concentrating on the patterns which are hardest to learn. For
very noisy problems, however, this can be disadvantageous. Indeed, theoretical
analysis has shown that the margin distribution, as opposed to just the minimal
margin, plays a crucial role in understanding this phenomenon. Loosely
speaking, some outliers should be tolerated if this has the benefit of
substantially increasing the margin on the remaining points. We propose a new
boosting algorithm which allows for the possibility of a pre-specified fraction
of points to lie in the margin area or even on the wrong side of the decision
boundary.
1 Introduction 
Boosting and related ensemble learning methods have recently been used with
great success in applications such as optical character recognition
(e.g. [8, 16]).
The idea of a large minimum margin [17] explains the good generalization
performance of AdaBoost in the low noise regime. However, AdaBoost performs
worse on noisy tasks [10, 11], such as the iris and the breast cancer benchmark
data sets [1]. On the latter tasks, a large margin on all training points
cannot be achieved without adverse effects on the generalization error. This
experimental observation was supported by the study of [13], where the
generalization error of ensemble methods was bounded by the sum of the fraction
of training points which have a margin smaller than some value \rho, say, plus
a complexity term depending on the base hypotheses and \rho. While this bound
can only capture part of what is going on in practice, it nevertheless already
conveys the message that in some cases it pays to allow for some points which
have a small margin, or are misclassified, if this leads to a larger overall
margin on the remaining points.
To cope with this problem, it was mandatory to construct regularized variants of 
AdaBoost, which traded off the number of margin errors and the size of the margin 
[9, 11]. This goal, however, had so far been achieved in a heuristic way by
introducing regularization parameters which have no immediate interpretation
and which cannot be adjusted easily.
The present paper addresses this problem in two ways. Primarily, it makes an
algorithmic contribution to the problem of constructing regularized boosting
algorithms. However, compared to the previous efforts, it parameterizes the
above trade-off in a much more intuitive way: its only free parameter directly
determines the fraction of margin errors. This, in turn, is also appealing from
a theoretical point of view, since it involves a parameter which controls a
quantity that plays a crucial role in the generalization error bounds (cf. also
[9, 13]). Furthermore, it allows the user to roughly specify this parameter
once a reasonable estimate of the expected error (possibly from other studies)
can be obtained, thus reducing the training time.
2 Boosting and the Linear Programming Solution 
Before deriving a new algorithm, we briefly discuss the properties of the solution 
generated by standard AdaBoost and, closely related, Arc-GV [2], and show the 
relation to a linear programming (LP) solution over the class of base hypotheses G. 
Let \{g_t(x) : t = 1, \ldots, T\} be a sequence of hypotheses and
\alpha = (\alpha_1, \ldots, \alpha_T) their weights satisfying \alpha_t \ge 0.
The hypotheses g_t are elements of a hypothesis class
G = \{g : x \mapsto [-1, 1]\}, which is defined by a base learning algorithm.
The ensemble generates the label which is the weighted majority of the votes by
    sign(f(x)),  where  f(x) = \sum_{t=1}^{T} \frac{\alpha_t}{\|\alpha\|_1} g_t(x).    (1)
In order to express that f, and therefore also the margin \rho, depend on
\alpha, and for ease of notation, we define
    \rho(z, \alpha) := y f(x),  where z := (x, y) and f is defined as in (1).    (2)
Likewise we use the normalized margin:
    \rho(\alpha) := \min_{1 \le i \le m} \rho(z_i, \alpha).    (3)
Ensemble learning methods have to find both the hypotheses g_t \in G used for
the combination and their weights \alpha. In the following we will consider
only AdaBoost-type algorithms (including Arcing). For more details see
e.g. [4, 2]. The main idea of AdaBoost is to introduce weights w_t(z_i) on the
training patterns. They are used to control the importance of each single
pattern in learning a new hypothesis (i.e. while repeatedly running the base
algorithm). Training patterns that are difficult to learn (i.e. which are
misclassified repeatedly) become more important.
The minimization objective of AdaBoost can be expressed in terms of margins as
    G(\alpha) := \sum_{i=1}^{m} \exp\left( -\|\alpha\|_1 \, \rho(z_i, \alpha) \right).    (4)
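As a concrete illustration (not part of the original paper), the margins (2) and the objective (4) can be computed for a finite pool of base hypotheses; the names H, y and alpha below are our own, with H holding the (hypothetical) base-hypothesis outputs on the training set:

```python
import numpy as np

def margins(H, y, alpha):
    """Normalized margins rho(z_i, alpha) = y_i * f(x_i), cf. Eqs. (1)-(2).

    H:     (m, T) matrix with H[i, t] = g_t(x_i) in [-1, 1]
    y:     (m,) labels in {-1, +1}
    alpha: (T,) non-negative hypothesis weights
    """
    f = H @ (alpha / np.sum(alpha))   # weights normalized so that ||alpha||_1 = 1
    return y * f

def adaboost_objective(H, y, alpha):
    """G(alpha) = sum_i exp(-||alpha||_1 * rho(z_i, alpha)), cf. Eq. (4)."""
    rho = margins(H, y, alpha)
    return np.sum(np.exp(-np.sum(alpha) * rho))
```

Note that when all margins are positive, growing \|\alpha\|_1 drives the objective toward zero, which is why AdaBoost concentrates on the hardest patterns.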
In every iteration, AdaBoost tries to minimize this error by a stepwise maximization 
of the margin. It is widely believed that AdaBoost tries to maximize the smallest 
margin on the training set [2, 5, 3, 13, 11]. Strictly speaking, however, a general 
proof is missing. It would imply that AdaBoost asymptotically approximates (up
to scaling) the solution of the following linear programming problem over the
complete hypothesis set G (cf. [7], assuming a finite number of base
hypotheses):
    maximize    \rho
    subject to  \rho(z_i, \alpha) \ge \rho    for all 1 \le i \le m
                \alpha_t, \rho \ge 0          for all 1 \le t \le |G|    (5)
                \|\alpha\|_1 = 1
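For a finite hypothesis set, the LP (5) can be solved directly with an off-the-shelf solver. The following sketch uses scipy.optimize.linprog; the function and variable names are ours, not the paper's:

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_lp(H, y):
    """Solve LP (5): maximize rho s.t. y_i * (H alpha)_i >= rho,
    alpha_t >= 0, rho >= 0, ||alpha||_1 = 1. Returns (alpha, rho).

    H: (m, T) base-hypothesis outputs, y: (m,) labels in {-1, +1}.
    """
    m, T = H.shape
    M = y[:, None] * H                        # M[i, t] = y_i * g_t(x_i)
    c = np.zeros(T + 1)
    c[-1] = -1.0                              # linprog minimizes, so use -rho
    A_ub = np.hstack([-M, np.ones((m, 1))])   # rho - (M alpha)_i <= 0
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, T)), np.zeros((1, 1))])  # sum_t alpha_t = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (T + 1))
    return res.x[:T], res.x[T]
```

For two hypotheses with margins (1, 0.5) and (0.5, 1) on two patterns, the optimum mixes them equally and achieves \rho = 0.75.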
Since such a linear program cannot be solved exactly for an infinite hypothesis
set in general, it is interesting to analyze approximation algorithms for this
kind of problem.
Breiman [2] proposed a modification of AdaBoost, called Arc-GV, making it
possible to show the asymptotic convergence of \rho(\alpha^t) to the global
solution \rho_{lp}:
Theorem 1 (Breiman [2]). Choose \alpha_t in each iteration as
    \alpha_t := \operatorname{argmin}_{\alpha_t \in [0, 1]} \sum_i \exp\left[ -\|\alpha^t\|_1 \left( \rho(z_i, \alpha^t) - \rho(\alpha^{t-1}) \right) \right],    (6)
and assume that the base learner always finds the hypothesis g \in G which
minimizes the weighted training error with respect to the weights. Then
    \lim_{t \to \infty} \rho(\alpha^t) = \rho_{lp}.
Note that the algorithm above can be derived from the modified error function
    G_{gv}(\alpha^t) := \sum_i \exp\left[ -\|\alpha^t\|_1 \left( \rho(z_i, \alpha^t) - \rho(\alpha^{t-1}) \right) \right].    (7)
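A minimal sketch of the Arc-GV line search (6), under the simplifying assumption (ours, not the paper's) that the new hypothesis is mixed into the old ensemble by a convex combination with coefficient a in [0, 1], so the new margins are (1 - a) * rho_prev + a * u_new:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def arc_gv_step(rho_prev, u_new, rho_min_prev, norm1):
    """One Arc-GV line search, cf. Eq. (6), in simplified convex-mixing form.

    rho_prev:     (m,) normalized margins y_i * f_{t-1}(x_i) of the old ensemble
    u_new:        (m,) margins y_i * g_t(x_i) of the new base hypothesis
    rho_min_prev: min_i rho_prev[i], i.e. rho(alpha^{t-1})
    norm1:        current ||alpha||_1, scaling the exponential loss

    Returns the mixing coefficient a in [0, 1].
    """
    def criterion(a):
        rho = (1.0 - a) * rho_prev + a * u_new   # margins of the mixed ensemble
        return np.sum(np.exp(-norm1 * (rho - rho_min_prev)))
    res = minimize_scalar(criterion, bounds=(0.0, 1.0), method="bounded")
    return res.x
```

As a sanity check: if the new hypothesis improves every margin, the line search pushes a toward 1; if it worsens every margin, a stays near 0.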
The question one might ask now is whether to use AdaBoost or rather Arc-GV 
in practice. Does Arc-GV converge fast enough to benefit from its asymptotic 
properties? In [12] we conducted experiments to investigate this question. We 
empirically found that (a) AdaBoost has problems finding the optimal
combination if \rho_{lp} < 0, (b) Arc-GV's convergence does not depend on
\rho_{lp}, and (c) for \rho_{lp} > 0, AdaBoost usually converges to the
maximum margin solution slightly faster than Arc-GV. Observation (a) becomes
clear from (4): G(\alpha) will not converge to 0 and \|\alpha\|_1 remains
bounded. Thus the asymptotic case cannot be reached, whereas for Arc-GV the
optimum is always found.
Moreover, the number of iterations necessary to converge to a good solution seems to 
be reasonable, but for a near optimal solution the number of iterations is rather high. 
This implies that for real world hypothesis sets, the number of iterations needed 
to find an almost optimal solution can become prohibitive, but we conjecture that 
in practice a reasonably good approximation to the optimum is provided by both 
AdaBoost and Arc-GV. 
3 ν-Algorithms
For the LP-AdaBoost approach [7] it has been shown for noisy problems that the
generalization performance is usually not as good as that of AdaBoost
[7, 2, 11]. From Theorem 5 in [13] (cf. Theorem 3 on page 5) this fact becomes
clear, as the minimum of the right hand side of the inequality (cf. (13)) need
not necessarily be achieved with a maximum margin. We now propose an algorithm
to directly control the number of margin errors and therefore also the
contribution of both terms in the inequality separately (cf. Theorem 3). We
first consider a small hypothesis class and end up with a linear program,
ν-LP-AdaBoost. In subsection 3.2 we then combine this algorithm with the ideas
from section 2 and get a new algorithm, ν-Arc, which approximates the ν-LP
solution.
3.1 ν-LP-AdaBoost
Let us consider the case where we are given a (finite) set
G = \{g : x \mapsto [-1, 1]\} of T hypotheses. To find the coefficients \alpha
for the combined hypothesis f(x), we extend the LP-AdaBoost algorithm [7, 11]
by incorporating the parameter \nu [15] and solve the following linear
optimization problem:
    maximize    \rho - \frac{1}{\nu m} \sum_{i=1}^{m} \xi_i
    subject to  \rho(z_i, \alpha) \ge \rho - \xi_i    for all 1 \le i \le m
                \xi_i, \alpha_t, \rho \ge 0           for all 1 \le t \le T and 1 \le i \le m    (8)
                \|\alpha\|_1 = 1
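The ν-LP (8) is again a standard linear program once the slack variables are added to the variable vector. A sketch using scipy.optimize.linprog, with our own (not the authors') naming:

```python
import numpy as np
from scipy.optimize import linprog

def nu_lp_adaboost(H, y, nu):
    """Solve the nu-LP (8): maximize rho - (1/(nu*m)) * sum_i xi_i
    s.t. y_i * (H alpha)_i >= rho - xi_i, with xi_i, alpha_t, rho >= 0
    and ||alpha||_1 = 1. Returns (alpha, xi, rho).
    """
    m, T = H.shape
    M = y[:, None] * H                        # M[i, t] = y_i * g_t(x_i)
    n = T + m + 1                             # variables: alpha (T), xi (m), rho
    c = np.zeros(n)
    c[T:T + m] = 1.0 / (nu * m)               # minimize (1/(nu m)) sum_i xi_i ...
    c[-1] = -1.0                              # ... minus rho
    A_ub = np.hstack([-M, -np.eye(m), np.ones((m, 1))])  # rho - xi_i - (M alpha)_i <= 0
    b_ub = np.zeros(m)
    A_eq = np.zeros((1, n))
    A_eq[0, :T] = 1.0                         # sum_t alpha_t = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n)
    return res.x[:T], res.x[T:T + m], res.x[-1]
```

On separable data with small \nu, the slacks vanish and the solution coincides with the hard-margin LP (5).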
This algorithm does not force all margins to be beyond zero, and we get a soft
margin classification (cf. SVMs) with a regularization constant 1/(\nu m). The
following proposition shows that \nu has an immediate interpretation:
Proposition 2 (Rätsch et al. [12]). Suppose we run the algorithm given in (8)
on some data with the resulting optimal \rho > 0. Then
1. \nu upper bounds the fraction of margin errors.
2. 1 - \nu upper bounds the fraction of patterns with margin larger than \rho.
Since the slack variables \xi_i only enter the cost function linearly, their
absolute size is not important. Loosely speaking, this is due to the fact that
at the optimum of the primal objective function, only derivatives with respect
to the primal variables matter, and the derivative of a linear function is
constant.
In the case of SVMs [14], where the hypotheses can be thought of as vectors in
some feature space, this statement can be translated into a precise rule for
distorting training patterns without changing the solution: we can move them
locally orthogonal to a separating hyperplane. This yields a desirable
robustness property. Note that the algorithm essentially depends on the number
of outliers, not on the size of the error [15].
3.2 The ν-Arc Algorithm
Suppose we have a very large (but finite) base hypothesis class G. Then it is
difficult to solve (8), just as (5), directly. To this end, we propose a new
algorithm, ν-Arc, that approximates the solution of (8).
The optimal \rho for fixed margins \rho(z_i, \alpha) in (8) can be written as
    \rho_\nu(\alpha) := \operatorname{argmax}_{\rho \in [0, 1]} \left( \rho - \frac{1}{\nu m} \sum_{i=1}^{m} \left( \rho - \rho(z_i, \alpha) \right)_+ \right),    (9)
where (\cdot)_+ := \max(\cdot, 0). Setting
\xi_i := (\rho_\nu(\alpha) - \rho(z_i, \alpha))_+ and subtracting
\frac{1}{\nu m} \sum_{i=1}^{m} \xi_i from the resulting inequality on both
sides yields (for all 1 \le i \le m)
    \rho(z_i, \alpha) + \xi_i - \frac{1}{\nu m} \sum_{j=1}^{m} \xi_j \ge \rho_\nu(\alpha) - \frac{1}{\nu m} \sum_{j=1}^{m} \xi_j.    (10)
Two more substitutions are needed to transform the problem into one which can
be solved by the AdaBoost algorithm. In particular, we have to get rid of the
slack variables \xi_i again by absorbing them into quantities similar to
\rho(z_i, \alpha) and \rho(\alpha). This works as follows: on the right hand
side of (10) we have the objective function (cf. (8)) and on the left hand side
a term that depends nonlinearly on \alpha. Defining
    \tilde\rho(\alpha) := \rho_\nu(\alpha) - \frac{1}{\nu m} \sum_{i=1}^{m} \xi_i    and    \tilde\rho(z_i, \alpha) := \rho(z_i, \alpha) + \xi_i - \frac{1}{\nu m} \sum_{j=1}^{m} \xi_j,    (11)
which we substitute for \rho(\alpha) and \rho(z_i, \alpha) in (5),
respectively, we obtain a new optimization problem. Note that
\tilde\rho(\alpha) and \tilde\rho(z_i, \alpha) play the role of a corrected or
virtual margin. We obtain a nonlinear min-max problem
    maximize    \tilde\rho(\alpha)
    subject to  \tilde\rho(z_i, \alpha) \ge \tilde\rho(\alpha)    for all 1 \le i \le m    (12)
                \alpha_t \ge 0                                    for all 1 \le t \le T
which Arc-GV can solve approximately (cf. section 2). Hence, by replacing the
margin \rho(z, \alpha) by \tilde\rho(z, \alpha) in equation (4) and the other
formulas for Arc-GV (cf. [2]),
we obtain a new algorithm which we refer to as ν-Arc.
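The quantities in (9)-(11) are straightforward to compute for a given margin vector. The following sketch (our naming) exploits the fact that the piecewise-linear objective in (9) attains its maximum at one of the margin values or at the ends of [0, 1], so scanning those candidates suffices:

```python
import numpy as np

def rho_nu(rho, nu):
    """rho_nu(alpha): argmax over r in [0, 1] of
    r - (1/(nu*m)) * sum_i (r - rho_i)_+ , cf. Eq. (9).

    rho: (m,) margins rho(z_i, alpha). The objective is piecewise linear in r,
    so the maximum lies at a (clipped) margin value or an interval end.
    """
    m = len(rho)
    candidates = np.concatenate([[0.0, 1.0], np.clip(rho, 0.0, 1.0)])
    def objective(r):
        return r - np.sum(np.maximum(r - rho, 0.0)) / (nu * m)
    return max(candidates, key=objective)

def virtual_margins(rho, nu):
    """Virtual margins rho~(z_i, alpha) and rho~(alpha), cf. Eq. (11)."""
    r = rho_nu(rho, nu)
    xi = np.maximum(r - rho, 0.0)              # slacks (rho_nu - rho_i)_+
    shift = np.sum(xi) / (nu * len(rho))       # common term (1/(nu m)) sum_j xi_j
    return rho + xi - shift, r - shift
```

By construction, every virtual margin satisfies the constraint in (12), and the fraction of margins below \rho_\nu does not exceed \nu (cf. Proposition 2).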
We can now state interesting properties of ν-Arc by using Theorem 5 of [13],
which bounds the generalization error R(f) for ensemble methods. In our case
R_\rho(f) \le \nu by construction (i.e. the fraction of patterns with a margin
smaller than \rho; cf. Proposition 2), thus we get the following simple
reformulation of this bound:
Theorem 3. Let p(x, y) be a distribution over X \times [-1, 1], and let X be a
sample of m examples chosen iid according to p. Suppose the base-hypothesis
space G has VC dimension h, and let \delta > 0. Then with probability at least
1 - \delta over the random choice of the training set X, Y, every function f
generated by ν-Arc, where \nu \in (0, 1) and \rho_\nu > 0, satisfies the
following bound for some constant c:
    R(f) \le \nu + \sqrt{ \frac{c}{m} \left( \frac{h \log^2(m/h)}{\rho_\nu^2} + \log\frac{1}{\delta} \right) }.    (13)
So, for minimizing the right hand side, we can trade off between the first and
the second term by controlling the easily interpretable regularization
parameter \nu.
4 Experiments 
We show a set of toy experiments to illustrate the general behavior of ν-Arc.
As base hypothesis class G we use the RBF networks of [11], and as data a
two-class problem generated from several 2D Gauss blobs (cf. the banana shape
dataset from http://www.first.gmd.de/~data/banana.htm). We obtain the following
results:
- ν-Arc leads to approximately \nu m patterns that are effectively used in the
  training of the base learner: Figure 1 (left) shows the fraction of patterns
  that have high average weights during the learning process (i.e.
  \frac{1}{T} \sum_{t=1}^{T} w_t(z_i) > \frac{1}{2m}). We find that the number
  of the latter increases (almost) linearly with \nu. This follows from (11),
  as the (soft) margin of patterns with \rho(z_i, \alpha) < \rho_\nu is set to
  \rho_\nu and the weight of those patterns will be the same.
- The (estimated) test error, averaged over 10 training sets, exhibits a rather
  flat minimum in \nu (Figure 1, right). This indicates that, just as for
  ν-SVMs, where corresponding results have been obtained, \nu is a well-behaved
  parameter in the sense that a slight misadjustment is not harmful.
- ν-Arc leads to the fraction \nu of margin errors (cf. the dashed line in
  Figure 1), exactly as predicted in Proposition 2.
- Finally, a good value of \nu can already be inferred from prior knowledge of
  the expected error. Setting it to a value similar to the latter provides a
  good starting point for further optimization (cf. Theorem 3).
Note that for \nu = 1 we recover the Bagging algorithm (if we used bootstrap
samples), as the weights of all patterns will be the same (w_t(z_i) = 1/m for
all i = 1, \ldots, m) and also the hypothesis weights will be constant
(\alpha_t = 1/T for all t = 1, \ldots, T).
Finally, we present a small comparison on ten benchmark data sets obtained from
the UCI benchmark repository [1]
(cf. http://ida.first.gmd.de/~raetsch/data/benchmarks.html). We analyze the
performance of single RBF networks, AdaBoost, ν-Arc and RBF-SVMs. For AdaBoost
and ν-Arc we use RBF networks [11] as base hypotheses. The model parameters of
RBF (number of centers etc.), ν-Arc (\nu) and SVMs (\sigma, C) are optimized
using 5-fold cross-validation. More details on the experimental setup can
Figure 1: Toy experiment (\sigma = 0): the left graph shows the average
fraction of important patterns, the average fraction of margin errors, and the
average training error for different values of the regularization constant
\nu for ν-Arc. The right graph shows the corresponding generalization error.
In both cases, the parameter \nu allows us to reduce the test errors to values
much lower than for the hard margin algorithm (for \nu = 0 we recover
Arc-GV/AdaBoost, and for \nu = 1 we get Bagging).
be found in [11]. Table 1 shows the generalization error estimates (after
averaging over 100 realizations of the data sets) and the confidence intervals.
The results of the best classifier and the classifiers that are not
significantly worse are set in bold face. To test the significance, we used a
t-test (p = 80%). On eight out of the ten data sets, ν-Arc performs
significantly better than AdaBoost. This clearly shows the superior performance
of ν-Arc for noisy data sets and supports this soft margin approach for
AdaBoost. Furthermore, we find comparable performances for ν-Arc and SVMs. In
three cases the SVM performs better, and in two cases ν-Arc performs best.
Summarizing, AdaBoost is useful for low noise cases, where the classes are
separable. ν-Arc extends the applicability of boosting to problems that are
difficult to separate and should be applied if the data are noisy.
5 Conclusion 
We analyzed the AdaBoost algorithm and found that Arc-GV and AdaBoost are
efficient for approximating the solution of non-linear min-max problems over
huge hypothesis classes. We re-parameterized the LPReg-AdaBoost algorithm
(cf. [7, 11]) and introduced a new regularization constant \nu that controls
the fraction of patterns inside the margin area. The new parameter is highly
intuitive and has to be optimized only on a fixed interval [0, 1].
Using the fact that Arc-GV can approximately solve min-max problems, we found a
formulation of Arc-GV, called ν-Arc, that implements the ν-idea for boosting by
defining an appropriate soft margin. The present paper extends previous work on
regularizing boosting (DOOM [9], AdaBoost_Reg [11]) and shows the utility and
flexibility of the soft margin approach for AdaBoost.
            RBF            AB             ν-Arc          SVM
Banana      10.8 ± 0.06    12.3 ± 0.07    10.6 ± 0.05    11.5 ± 0.07
B.Cancer    27.6 ± 0.47    30.4 ± 0.47    25.8 ± 0.46    26.0 ± 0.47
Diabetes    24.3 ± 0.19    26.5 ± 0.23    23.7 ± 0.20    23.5 ± 0.17
German      24.7 ± 0.24    27.5 ± 0.25    24.4 ± 0.22    23.6 ± 0.21
Heart       17.6 ± 0.33    20.3 ± 0.34    16.5 ± 0.36    16.0 ± 0.33
Ringnorm     1.7 ± 0.02     1.9 ± 0.03     1.7 ± 0.02     1.7 ± 0.01
F.Sonar     34.4 ± 0.20    35.7 ± 0.18    34.4 ± 0.19    32.4 ± 0.18
Thyroid      4.5 ± 0.21     4.4 ± 0.22     4.4 ± 0.22     4.8 ± 0.22
Titanic     23.3 ± 0.13    22.6 ± 0.12    23.0 ± 0.14    22.4 ± 0.10
Waveform    10.7 ± 0.11    10.8 ± 0.06    10.0 ± 0.07     9.9 ± 0.04

Table 1: Generalization error estimates and confidence intervals. The best
classifiers for a particular data set are marked in bold face (see text).
We found empirically that the generalization performance of ν-Arc depends only
slightly on the choice of the regularization constant. This makes model
selection (e.g. via cross-validation) easier and faster.
Future work will study the detailed regularization properties of the
regularized versions of AdaBoost, in particular in comparison to ν-LP support
vector machines.
Acknowledgments: Partial funding from DFG grant (Ja 379/52) is gratefully 
acknowledged. This work was done while AS and BS were at GMD FIRST. 
References
[1] C. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning
databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[2] L. Breiman. Prediction games and arcing algorithms. Technical Report 504,
Statistics Department, University of California, December 1997.
[3] M. Frean and T. Downs. A simple cost function for boosting. Technical
report, Dept. of Computer Science and Electrical Eng., University of
Queensland, 1998.
[4] Y. Freund and R. E. Schapire. A decision-theoretic generalization of
on-line learning and an application to boosting. In Computational Learning
Theory: Eurocolt '95, pages 23-37. Springer-Verlag, 1995.
[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of
on-line learning and an application to boosting. Journal of Computer and
System Sciences, 55(1):119-139, 1997.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression:
a statistical view of boosting. Technical report, Stanford University, 1998.
[7] A. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin
of learned ensembles. In Proc. of the 15th Nat. Conf. on AI, pages 692-699,
1998.
[8] Y. LeCun, L. D. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker,
I. Guyon, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik. Learning
algorithms for classification: A comparison on handwritten digit recognition.
Neural Networks, pages 261-276, 1995.
[9] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through
explicit optimization of margins. Machine Learning, 1999. To appear.
[10] J. R. Quinlan. Boosting first-order learning (invited lecture). Lecture
Notes in Computer Science, 1160:143, 1996.
[11] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost.
Technical Report NC-TR-1998-021, Department of Computer Science, Royal
Holloway, University of London, Egham, UK, 1998. To appear in Machine
Learning.
[12] G. Rätsch, B. Schölkopf, A. Smola, S. Mika, T. Onoda, and K.-R. Müller.
Robust ensemble learning. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and
D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 207-219.
MIT Press, Cambridge, MA, 1999.
[13] R. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the
margin: A new explanation for the effectiveness of voting methods. Annals of
Statistics, 1998. (An earlier version appeared in: D. H. Fisher, Jr. (ed.),
Proc. ICML'97, Morgan Kaufmann.)
[14] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel
Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[15] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support
vector algorithms. Neural Computation, 12:1083-1121, 2000.
[16] H. Schwenk and Y. Bengio. Training methods for adaptive boosting of
neural networks. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors,
Advances in Neural Information Processing Systems, volume 10. The MIT Press,
1998.
[17] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag,
New York, 1995.
