Potential Boosters ? 
Nigel Duffy 
Department of Computer Science 
University of California 
Santa Cruz, CA 95064 
nigedufjfcse. ucsc. edu 
David Helmbold 
Department of Computer Science 
University of California 
Santa Cruz, CA 95064 
dphcse. ucsc. edu 
Abstract 
Recent interpretations of the Adaboost algorithm view it as per- 
forming a gradient descent on a potential function. Simply chang- 
ing the potential function allows one to create new algorithms re- 
lated to AdaBoost. However, these new algorithms are generally 
not known to have the formal boosting property. This paper ex- 
mines the question of which potential functions lead to new al- 
gorithms that are boosters. The two main results are general sets 
of conditions on the potential; one set implies that the resulting 
algorithm is a booster, while the other implies that the algorithm 
is not. These conditions are applied to previously studied potential 
functions, such as those used by LogitBoost and Doom II. 
I Introduction 
The first boosting algorithm appeared in Rob Schapire's thesis [1]. This algorithm 
was able to boost the performance of a weak PAC learner [2] so that the resulting 
algorithm satisfies the strong PAC learning [3] criteria. We will call any method 
that builds a strong PAC learning algorithm from a weak PAC learning algorithm 
a PAC boosting algorithm. Freund and Schapire later found an improved PAC 
boosting algorithm called AdaBoost [4], which also tends to improve the hypotheses 
generated by practical learning algorithms [5]. 
The AdaBoost algorithm takes a labeled training set and produces a master hy- 
pothesis by repeatedly calling a given learning method. The given learning method 
is used with different distributions on the training set to produce different base 
hypotheses. The master hypothesis returned by AdaBoost is a weighted vote of 
these base hypotheses. AdaBoost works iteratively, determining which examples 
are poorly classified by the current weighted vote and selecting a distribution on 
the training set to emphasize those examples. 
Recently, several researchers [6, 7, 8, 9, 10] have noticed that Adaboost is performing 
a constrained gradient descent on an exponential potential function of the margins 
of the examples. The margin of an example is yF(x) where y is the :51 valued label 
of the example x and F(x)   is the net weighted vote of master hypothesis F. 
Once Adaboost is seen this way it is clear that further algorithms may be derived 
by changing the potential function [6, 7, 9, 10]. 
Potential Boosters? 259 
The exponential potential used by AdaBoost has the property that the influence 
of a data point increases exponentially if it is repeatedly misclassified by the base 
hypotheses. This concentration on the "hard" examples allows AriaBoost to rapidly 
obtain a consistent hypothesis (assuming that the base hypotheses have certain 
properties). However, it also means that an incorrectly labeled or noisy example 
can quickly attract much of the distribution. It appears that this lack of noise- 
tolerance is one of AdaBoost's few drawbacks [tt]. Several researchers [7, 8, 9, t0] 
have proposed potential functions which do not concentrate as much on these "hard" 
examples. However, they generally do not show that the derived algorithms have 
the PAC boosting property. 
In this paper we return to the original motivation behind boosting algorithms and 
ask: "for which potential functions does gradient descent lead to PAC boosting 
algorithms" (i.e. boosters that create strong PAC learning alg. orithms from arbitrary 
weak PAC learners). We give necessary conditions that are met by some of the 
proposed potential functions (most notably the LogitBoost potential introduced by 
Friedman et al. [7]). Furthermore, we show that simple gradient descent on other 
proposed potential functions (such as the sigmoidal potential used by Mason et 
al. It0]) cannot convert arbitrary weak PAC learning algorithms into strong PAC 
learners. The aim of this work is to identify properties of potential functions required 
for PAC boosting, in order to guide the search for more effective potentials. 
Some potential functions have an additional tunable parameter [10] or change over 
time [12]. Our results do not yet apply to such dynamic potentials. 
2 PAC Boosting 
Here we define the notions of PAC learning  and boosting, and define the notation 
used throughout the paper. 
A concept C is a subset of the learning domain X. A random example of C is a pair 
(x  X,y  {-1, +1)) where x is drawn from some distribution on X and y = I if 
x  C and -1 otherwise. A concept class is a set of concepts. 
Definition 1 A (strong) PAC learner for concept class C has the property that for 
every distribution D on 2d, all concepts C E C, and all 0 < e, 6 < 1/2: with probabil- 
ity at least 1- 6 the algorithm outputs a hypothesis h where PD[h(x)  C(x)] _ e. 
The learning algorithm is given C, e, 6, and the ability to draw random examples of 
C (w.r.t. distribution D), and must run in time bounded by poly(1/e, 1/6). 
Definition 2 A weak PAC learner is similar to a strong PA C learner, except that 
it need only satisfy the conditions for a particular 0 < co, 6o < 1/2 pair, rather than 
for all e,6 pairs. 
Definition 3 A PAC boosting algorithm is a generic algorithm which can leverage 
any weak PA C learner to meet the strong PA C learning criteria. 
In the remainder of the paper we emphasize boosting the accuracy e as it is much 
easier to boost the confidence 6, see Haussler et al. [13] and Freund [14] for details. 
Furthermore, we emphasize boosting by re-sampling, where the strong PAC learner 
draws a large sample, and each iteration the weak learning algorithm is called with 
some distribution over this sample. 
XTo simplify the presentation we omit the instance space dimension and taxget repre- 
sentation length parameters. 
260 N. DuJ and D. Helmbold 
Throughout the paper we use the following notation. 
 m is the cardinality of the fixed sample ((xl,Yl),..., (xm,y,)). 
 ht(x) is the +1 valued weak hypothesis created at iteration t. 
 at is the weight or vote of ht in the master hypothesis, the a's may or may 
not be normalized so that tt,=l oft, = 1. 
t t 
 Ft(x) .t,=l(ott, ht,(x) , is the master hypothesis 2 at iter- 
ation t. 
 Ui,t -- Yi tt,=l ott, ht,(x) is the margin of xi after iteration t; the t sub- 
script is often omitted. Note that the margin is positive when the master 
Zt':l t" 
hypothesis is correct, and the normalized margin is Ui,t/ t 
 p(u) is the potential of an instance with margin u, and the total potential 
is Zim=l p(ui). 
 P D[ ],P$[ ], and Es[] are the probability with respect to the unknown 
distribution over the domain, and the probability aud expectations with 
respect to the uniform distribution over the sample, respectively. 
m U 
Our results apply to total potential functions of the form Y-i=l P(i) where p is 
positive and strictly decreasing. 
3 Leveraging Learners by Gradient Descent 
AdaBoost [4] has recently been interpreted as gradient descent independently by 
several groups [6, 7, 8, 9, 10. Under this interpretation AdaBoost is seen as minimiz- 
ing the total potential Y-i=l p(ui) = Y',im_-i exp(-ui) via feasible direction gradient 
descent. On each iteration t + 1, AdaBoost chooses the direction of steepest descent 
as the distribution on the sample, and calls the weak learner to obtain a new base 
hypothesis ht+. The weight ct+l of this new weak hypothesis is calculated to min- 
imize 3 the resulting potential '.im=l P(Ui,t+l ) -- im_-i exp(--(ui,t -{- ott+lYiht+l (xi) ) ). 
This gradient descent idea has been generalized to other potential functions [6, 7, 
10]. Duffy et al. [9] prove bounds for a similar gradient descent technique using a 
non-componentwise, non-monotonic potential function. 
Note that if the weak learner returns a good hypothesis ht (with training error at 
m 
most e < 1/2), then Y]-i=l Dt(xi)yiht(xi) ) I - 2e ) 0. We set r -- I - 2e, and 
assume that each base hypothesis produced satisfies Zim__l Dt(xi)yiht(xi) _ r. 
In this paper we consider this general gradient descent approach applied to various 
potentials -]-i1 p(ui). Note that each potential function p has two corresponding 
gradient descent algorithms (see [6]). The un-normalized algorithms (like AdaBoost) 
continually add in new weak hypotheses while preserving the old a's. The normal- 
ized algorithms re-scale the a's so that they always sum to 1. In general, we call 
such algorithms "leveraging algorithms", reserving the term "boosting" for those 
that actually have the PAC boosting property. 
4 Potentials that Don't Boost 
In this section we describe sufficient conditions on potential functions so that the 
corresponding leveraging algorithm does not have the PAC boosting property. We 
2The prediction of the master hypothesis on instance x is the sign of Ft(x). 
SOur current proofs require that the actual ct's be no greater than a constant (say 1). 
Therefore, this minimizing c may need to be reduced. 
Potential Boosters? 261 
apply these conditions to show that two potentials from the literature do not lead 
to boosting algorithms. 
Theorem 1 Let p(u) be a potential function for which: 
1) the derivative, p'(u), is increasing (-p'(u) decreasing) in +, and 
2) 3/ > 0 such that for all u  O, -/p'(u) _ -p'(-2u). 
Then neither the normalized nor the un-normalized leveraging algorithms 
sponding to potential p have the PA C boosting property. 
co rre - 
This theorem is proven by an adversary argument. Whenever the concept class is 
sufficiently rich 4, the adversary can keep a constant fraction of the sample from 
being correctly labeled by the master hypothesis. Thus as the error tolerance e goes 
to zero, the master hypotheses will not be sufficiently accurate. 
We now apply this theorem to two potential functions from the literature. 
Friedman et al. [7] describe a potential they call "Squared Error(p)" where the 
potentialatxiis (yi+ I er(x,) )2 
2 eF(x) -]-e_F(x ) This potential can be re-written 
1( 
asP$1(ui)-- 1 + 2u u' +  
 e-U k eU  e-U 
Corollary 1 Potential "Squared Error(p)" does not lead to a boosting algorithm. 
Proof: This potential satisfies the conditions of Theorem 1. It is strictly decreas- 
ing, and the second condition holds for/ = 2. 
Mason et al. [10] examine a normalized algorithm using the potential pa(u) ---- 
I -- tanh (Au). Their algorithm optimizes over choices of A via cross-validation, and 
uses weak learners with slightly different properties. However, we can plug this 
potential directly into the gradient descent framework and examine the resulting 
algorithms. 
Corollary 2 The DOOMII potential PD does not lead to a boosting algorithm for 
any .fixed A. 
Proof: The potential is strictly decreasing, and the second condition of Theorem 1 
holds for/ = 1. 
Our techniques show that potentials that are sigmoidal in nature do not lead to 
algorithms with the PAC boosting property. Since sigmoidal potentials are gen- 
erally better over-estimates of the 0, 1 loss than the potential used by AdaBoost, 
our results imply that boosting algorithms must use a potential with more subtle 
properties than simply upper bounding the 0, 1 loss. 
5 Potential Functions That Boost 
In this section we give sufficient conditions on a potential function for it's corre- 
sponding un-normalized algorithm to have the PAC boosting property. This result 
implies that AdaBoost [4] and LogitBoost [7] have the PAC boosting property (Al- 
though this was previously known for AdaBoost [4], we believe this is a new result 
for LogitBoost). 
4The VC-dimension 4 concept class consisting of pairs of intervals on the real line is 
sufficient for our adversary. 
262 N. DuJ] and D. Helmbold 
One set of conditions on the potential imply that it decreases roughly exponen- 
tially when the (un-normalized) margins are large. Once the margins are in this 
exponential region, ideas similar to those used in AdaBoost's analysis show that the 
minimum normalized margin quickly becomes bounded away from zero. This allows 
us to bound the generalization error using a theorem from Bartlett et al. [15]. 
A second set of conditions governs the behavior of the potential function before 
the un-normalized margins are large enough. These conditions imply that the total 
potential decreases by a constant factor each iteration. Therefore, too much time 
will not be spent before all the margins enter the exponential region. 
t 
The margin value bounding the exponential region is U, and once -]i=l p(ui) _ 
p(U), all margins p(ui) will remain in the exponential region. The following theorem 
gives conditions on p ensuring that = p(ui) quickly becomes less than p(U). 
Theorem 2 If the following conditions hold for p(u) and U: 
1. -p'(u) is strictly decreasing-and 0  p"(u) _ B, and 
2. 3q  0 such that p(u) _ -qp'(u) Vu  U, 
then im__ p(ui) _ p(U) aer Ti _ 
4Bq2m2p(O)ln(mp(O) ) 
p(V) 
p(U)2r 2 
iterations. 
The proof of this theorem approximates the new total potential by the old potential 
minus ct times a linear term, plus an error. By bounding the error as a function of 
ct and minimizing we demonstrate that some values of ct give a sufficient decrease 
in the total potential. 
Theorem 3 If the following conditions hold for p(u), U, q, and iteration Tx' 
3/? _ v such that -p'(u -t- v) _ p(u -t- v) _ -p'(u)/?-Vq whenever -1 _ 
v _ l and u ) U, 
2. im__x p(ui,T) _ p(U), 
3. -p(u) is strictly decreasing, and 
d. 3C > O, ff > 1 such that Cp(u) _ ff- Vu > U 
then P s[yFT(x) _ O] _ 
nentially in T - T if 0  
_70Txp(U) (frO q) (T-Tx) 
-ln(qv-r 2) 
ln(7) 
which decreases expo- 
The proof of this theorem is a generalization of the AdaBoost proof. 
Combining these two theorems, and the generalization bound from Theorem 2 of 
Bartlett et al. [15] gives the following result, where d is the VC dimension of the 
weak hypothesis class. 
Theorem 4 If for all edges 0  r  1/2 there exists Tx,r _ poly(m, l/r), Ur, 
and q satisfying the conditions of Theorem 3 such that p(U) _ poly(r) and 
qv/ - r  = l(r) ( 1 - poly(r), then in time poly(m, l/r) all examples have nor- 
Potential Boosters? 263 
malized margin at least 0 = In 2t(r) /ln(7) and 
(1( 
Pv[yFr(x) <_ 0] e O (ln(/(r)+ 1)-ln(21(r))) 2 +log(I/5) 
Choosing m appropriately makes the error rate sufficiently small so that the algo- 
rithm corresponding to p has the PA C boosting property. 
We now apply Theorem 4 to show that the AdaBoost and LogitBoost potentials 
lead to boosting algorithms. 
6 Some Boosting Potentials 
In this section we show as a direct consequence of our Theorem 4 that the potential 
functions for AdaBoost and LogitBoost lead to boosting algorithms. Note that 
the LogitBoost algorithm we analyze is not exactly the same as that described by 
Friedman et al. [7], their "weak learner" optimizes a square loss which appears to 
better fit the potential. First we re-derive the boosting property for AdaBoost. 
Corollary 3 AdaBoost's [16] potential boosts. 
Proof: To prove this we simply need to show that the potential p(u) = exp(-u) 
satisfies the conditions of Theorem 4. This is done by setting U = - ln(m), q = 1, 
3'-fi-e,C=l, andT =0. 
Corollary 4 The log-likelihood potential (as used in LogitBoost [7]) boosts. 
Proof: In this case p(u) = ln(1 + e -u) and -p'(u) = x+e-' We set 7 =/? = e, 
C-2, U=-ln(X/1-e2/2-1) andq=l+exp(-Ur)= lv/-27 Now The- 
x/T- e 2 
orem 2 shows that after T < poly(m, I/r) iterations the conditions of Theorem 4 
are satisfied. 
7 Conclusions 
In this paper we have examined leveraging weak learners using a gradient descent 
approach [9]. This approach is a direct generalization of the Adaboost [4, 16] algo- 
rithm, where Adaboost's exponential potential function is replaced by alternative 
potentials. We demonstrated properties of potentials that are sufficient to show 
that the resulting algorithms are PAC boosters, and other properties that imply 
that the resulting algorithms are not PAC boosters. We applied these results to 
several potential functions from the literature [7, 10, 16]. 
New insight can be gained from examining our criteria carefully. The conditions 
that show boosting leave tremendous freedom in the choice of potential function 
for values less than some U, perhaps this freedom can be used to choose potential 
functions which do not overly concentrate on noisy examples. 
There is still a significant gap between these two sets of properties, we are still a long 
way from classifying arbitrary potential functions as to their boosting properties. 
There are other classes of leveraging algorithms. One class looks at the distances 
between successive distributions [17, 18]. Another class changes their potential 
264 N. Duf and D. HelmboM 
over time [6, 8, 12, 14]. The criteria for boosting may change significantly with 
these different approaches. For example, Freund recently presented a boosting 
algorithm [12] that uses a time-varying sigmoidal potential. It would be interesting 
to adapt our techniques to such dynamic potentials. 
References 
[1] Robert E. Schapire. The Design and Analysis of Ecient Learning Algorithms. MIT 
Press, 1992. 
[2] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean 
formulae and finite automata. Journal of the ACM, 41(1):67-95, January 1994. 
[3] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, Novem- 
ber 1984. 
[4] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line 
learning and an application to boosting. Journal of Computer and System Sciences, 
55(1):119-139, August 1997. 
[5] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algo- 
rithms: Bagging, boosting and variants. Machine Learning, 36(1-2):105-39, 1999. 
[6] Leo Breiman. Arcing the edge. Technical Report 486, Department of Statistics, 
University of California, Berkeley., 1997. available at www.stat.berkeley. edu. 
[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: 
a statistical view of boosting. Technical report, Stanford University, 1998. 
[8] G. R/itsch, T. Onoda, and K.-R. Mfiller. Soft margins for AdaBoost. Machine Learn- 
ing, 2000. To appear. 
[9] Nigel Duffy and David P. Helmbold. A geometric approach to leveraging weak learn- 
ers. In Paul Fischer and Hans Ulrich Simon, editors, Computational Learning Theory: 
Jth European Conference (EuroCOLT '99), pages 18-33. Springer-Verlag, March 1999. 
[10] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms 
as gradient descent. To appear in NIPS 2000. 
[11] Thomas G. Dietterich. An experimental comparison of three methods for construct- 
ing ensembles of decision trees: Bagging, Boosting, and Randomization. Machine 
Learning. To appear. 
[12] Yoav Freund. An adaptive version of the boost-by-majority algorithm. In Proc. 12th 
Annu. Conf. on Cornput. Learning Theory, pages 102-113. ACM, 1999. 
[13] David Haussler, Michael Kearns, Nick Littlestone, and Manfred K. Warmuth. Equiva- 
lence of models for polynomial learnability. Information and Computation, 95(2):129- 
161, December 1991. 
[14] Y. Freund. Boosting a weak learning algorithm by majority. Information and Com- 
putation, 121(2):256-285, September 1995. 
[15] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the 
margin: A new explanation for the effectiveness of voting methods. The Annals of 
Statistics, 26(5):1651-1686, 1998. 
[16] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using 
confidence-rated predictions. Machine Learning, 37(3):297-336, December 1999. 
[17] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proc. 
12th Annu. Conf. on Cornput. Learning Theory, pages 134-144. ACM, 1999. 
[18] John Lafferty. Additive models, boosting, and inference for generalized divergences. 
In Proc. 12th Annu. Conf. on Cornput. Learning Theory, pages 125-133. ACM. 
