Direct Optimization of Margins Improves 
Generalization in Combined Classifiers 
Llew Mason,Peter Bartlett, Jonathan Baxter 
Department of Systems Engineering 
Australian National University, Canberra, ACT 0200, Australia 
{lmason, bartlett, jon}@syseng.anu.edu.au 
Abstract 
1oo 
80 
2o 
-I -0.8 -0.6 -0.4 -0.2 o 0.2 0.4 0.6 0.8 I 
Margin 
Cumulative training margin dis- 
tributions for AdaBoost versus 
our "Direct Optimization Of 
Margins" (DOOM) algorithm. 
The dark curve is AdaBoost, the 
light curve is DOOM. DOOM 
sacrifices significant training er- 
ror for improved test error (hori- 
zontal marks on margin= 0 line). 
i Introduction 
Many learning algorithms for pattern classification minimize some cost function of 
the training data, with the aim of minimizing error (the probability of misclassifying 
an example). One example of such a cost function is simply the classifier's error 
on the training data. Recent results have examined alternative cost functions that 
provide better error estimates in some cases. For example, results in [Bar98] show 
that the error of a sigmoid network classifier f(.) is no more than the sample average 
of the cost function sgn(0-yy(x)) (which takes value 1 when yy(x) is no more than 
0 and 0 otherwise) plus a complexity penalty term that scales as Ilwl11/O, where 
(x,y) E X x (+1) is a labelled training example, and ]IWII1 is the sum of the 
magnitudes of the output node weights. The quantity yf(x) is the margin of the 
real-valued function f, and reflects the extent to which f(x) agrees with the label y E 
{+l). By minimizing squared error, neural network learning algorithms implicitly 
maximize margins, which may explain their good generalization performance. 
More recently, Schapire et al [SFBL98] have shown a similar result for convex com- 
binations of classifiers, such as those produced by boosting algorithms. They show 
Direct Optimization of Margins Improves Generalization 289 
that, with high probability over m random examples, every convex combination of 
classifiers from some finite class H has error satisfying 
(1 (logmloglH [ 
Pr[yf(x) _< 0] _< Es [sgn(0 - yf(x))] + O  02 + log(I/5) (1) 
for all 0 > 0, where Es denotes the average over the sample S. 
One way to think of these results is as a technique for adjusting the effective com- 
plexity of the function class by adjusting 0. Large values of 0 correspond to low 
complexity and small values to high complexity. If the learning algorithm were to 
optimize the parametrized cost function Essgn(0 - yf(x)) for large values of 0, it 
would not be able to make fine distinctions between different functions in the class, 
and so the effective complexity of the class would be reduced. The second term in 
the error bounds (the regularization term involving the complexity parameter 0 and 
the size of the base hypothesis class H) would be correspondingly reduced. In both 
the neural network and boosting settings, the learning algorithms do not directly 
minimize these cost functions; we use different values of the complexity parameter 
in the cost functions only in explaining their generalization performance. 
In this paper, we address the question: what are suitable cost functions for con- 
vex combinations of classifiers? In the next section, we give general conditions on 
parametrized families of cost functions that ensure that they can be used to give er- 
ror bounds for convex combinations of classifiers. In the remainder of the paper, we 
investigate learning algorithms that choose the convex coefficients of a combined 
classifier by minimizing a suitable family of piecewise linear cost functions using 
gradient descent. Even when the base hypotheses are chosen by the AdaBoost al- 
gorithm, and we only use the new cost functions to adjust the convex coefficients, 
we obtained an improvement on the test error of AdaBoost in all but one of the 
UC Irvine data sets we used. Margin distribution plots show that in many cases the 
algorithm achieves these lower errors by sacrificing training error, in the interests 
of reducing the new cost function. 
2 Theory 
In this section, we derive an error bound that generalizes the result for convex 
combinations of classifiers described in the previous section. The result involves a 
family of margin cost functions (functions mapping from the interval [-1, 1] to ll + ), 
indexed by an integer-valued complexity parameter N, which measures the resolu- 
tion at which we examine the margins. The following definition gives conditions on 
the margin cost functions that relate the complexity N to the amount by which the 
margin cost function is larger than the function sgn(-yf(x)). The particular form 
of this definition is not important. In particular, the functions v are only used in 
the analysis in this section, and will not concern us later in the paper. 
Definition I A family {Cr  N E N} of margin cost functions is B-admissible for 
B >_ 0 if for all N  N there is an interval  C I of length no more than B and a 
function N ' [--1, 1] -->  that satisfies 
sgn(-c) _< EzQN,, (N(Z)) _< CN(O) 
for all   [-1, 1], where Ez..QN,(-) denotes the expectation when Z is chosen 
randomly as Z - (l/N) y./N= Zi with Zi  {-1,1} and Pr(Zi - 1) = (1 + a)/2. 
As an example, let Cv(c) = sgn(0 - c) + c, for 0 = 1/v/ and some constant c. 
This is a B-admissible family of margin cost functions, for suitably large B. (This is 
290 L. Mason, P L. Bartlett and J. Baxter 
exhibited by the functions N(a) = sgn(0/2 -- a) + c/2; the proof involves Chernoff 
bounds.) Clearly, for larger values of N, the cost functions CN are closer to the 
threshold function sgn(-c). Inequality (1) is implied by the following theorem. 
In this theorem, co(H) is the set of convex combinations of functions from H. A 
similar proof gives the same result with VCdim(H) in m replacing in 
Theorem 2 For any B-admissible family {CN : N  1I) of margin cost functions, 
any finite hypothesis class H and any distribution P on X x {-1, 1), with probability 
at least 1 -5 over a random sample $ of m labelled examples chosen according to 
P, every N and every f in co(H) satisfies 
B 2 
Pr[yf(x) _< 0] < Es[CN(yf(x))] + --m (NlnIHl + ln(N(N + 1)/5)). 
Proof Fix N and f  co(H), and suppose that f = Y']i oqhi for hi  H. Define 
CoN(H) = {(l/N)yY= hj 'hj H}, and notice that ICON(H)] < IH[ N. As in 
the proof of (1) in [SFBL98], we show using the probabilistic method that there is 
a function g in CON(H) that closely approximates f. Let Q be the distribution on 
CoN(H) corresponding to the average of N independent draws from {hi} according 
to the distribution {ci}, and let QN,, be the distribution given in Definition 1. Then 
for any fixed pair x, y, when g is chosen according to Q the distribution of yg(x) is 
QN,yf(x). Now, fix the function N implied by the B-admissibility condition. By 
the definition of B-admissibility, 
Egc2Er'[N(Yg(x))] = EpEzQN,s(,)[N(Z)] ) Epsgn(-yf(x)) = P[yf(x) _< 0]. 
Similarly, E$[CN(yf(x))] _> Egc2E$[N(yg(x))]. Hence, if Pr[yf(x) _< 0] - 
Es [UN(yf(x))] ) eN, then EQ [E [N(yg(x))] -- Es [N(yg(x))]] ) eN. Thus, 
Pr[3f  Pr[yf(x)5 0] Es[CN(yf(x))]+N] 
_ Pr[3g  coN(H)' Ep[N(yg(x))] _ E$[N(yg(x))] +eN] 
_ ]HI N exp(-2mev/B2), 
where the last inequality follows from the union bound and Hoeffding's inequality. 
Setting this probability to 5N = 5/(N(N + 1)), solving for eN, and summing over 
values of N completes the proof, since YNe 5N = 5. [] 
For the best bounds, we want N to satisfy EzQN, [N(Z)] _> sgn(--ct), but with 
the difference EzQ,, [N(Z)- sgn(-c)] as small as possible for c  [-1, 1]. 
One approach would be to minimize the expectation of this difference, for ct chosen 
uniformly in [-1, 1]. However, this yields a non-monotone solution for C(ct). 
Figure la illustrates an example of a monotone B-admissible family; it shows the 
cost functions CN(O) = EzQ,, N(Z), for N = 20, 50 and 200, where N(Ct) = 
sgn( v/21og N/N - o ) + 1/N. 
3 Algorithm 
We now consider how to select convex coefficients Wl,... , WT for a sequence of 
T 
{-1, 1} classifiers h,..., hT so that the combined classifier f(x) = Y't= wtht(x) 
has small error. In the experiments we used the hypotheses provided by AdaBoost. 
(The aim was to investigate how useful are the error estimates provided by the cost 
functions of the previous section.) 
If we take Theorem 2 at face value and ignore log terms, the best error bound is 
obtained if the weights Wl,..., wr and the complexity N are chosen to minimize 
Direct Optimization of Margins Improves Generalization 291 
1.2 
1 
o 0.6 
0.4 
0.2 
-' -0.5 0 0.5 
0. 
0. 
0. 
-1 -0.5 0 0.5 1 
Figure 1: (a) The margin cost functions CN(c), for N = 20, 50 and 200, compared to the 
function sgn(-c). Larger values of N correspond to closer approximations to sgn(-c). 
(b) Piecewise linear upper bounds on the functions CN(ct), and the function sgn(-ct). 
(I/m) y'4m__ Civ(Yif(xi)) + nV/-ff-/m, where n is a constant and {Cv} is a family of 
B-admissible cost functions. Although Theorem 2 provides an expression for the 
constant n, in practical problems this will almost certainly be an overestimate and 
so our penalty for even moderately complex models will be too great. To solve 
this problem, instead of optimizing the average cost of the margins plus a penalty 
term over all values of the parameter 0, we estimated the optimal value of 0 using 
a cross-validation set. That is, for fixed values of 0 in a discrete but fairly dense 
m 
set we selected weights optimizing the average cost 1 Yi= Co(yif(xi)) and then 
m 
chose the solution with smallest error on an independent cross-validation set. 
We considered the use of the cost functions plotted in Figure la, but the existence of 
fiat regions caused difficulties for gradient descent approaches. Instead we adopted 
a piecewise linear family of cost functions Co that are linear in the intervals [-1, 0], 
[0,0], and [0, 1], and pass through the points (-1,1.2), (0,0.1), (0,0.1), and (1,0), 
for 0 E (0, 1). The numbers were chosen to ensure the Co are upper bounds on 
the cost functions of Figure la (see Figure lb). Note that 0 plays the role of a 
complexity parameter, except that in this case smaller values of 0 correspond to 
higher complexity classes. 
Even with the restriction to piecewise linear cost functions, the problem of optimiz- 
i m 
ing  Yi= CO(Yif(xi)) is still hard. Fortunately, the nature of this cost function 
makes it possible to find successful heuristics (which is why we chose it). The algo- 
rithm we have devised to optimize the Co family of cost functions is called Direct 
Optimization Of Margins (DOOM). (The pseudo-code of the algorithm is given in 
the full version [MBB98].) DOOM is basically a form of gradient descent, with two 
complications: it takes account of the fact that the cost function is not differen- 
tiable at 0 and 0, and it ensures that the weight vector lies on the unit ball in 11. 
In order to avoid problems with local minima we actually allow the weight vector 
to lie within the/1-ball throughout optimization rather than on the/1-ball. If the 
weight vector reaches the surface of the/1-ball and the update direction points out 
of the/-ball, it is projected back to the surface of the/1-ball. 
I m 
Observe that the gradient of  Yi=I Co(yif(xi)) is a constant function of the 
weights w = (w,...,WT) provided no example (xi,Yi) "crosses" one of the dis- 
continuities at 0 or 0 (i.e. provided the margin yif(xi) does not cross 0 or 0). 
Hence, the central operation of DOOM is to step in the negative gradient direction 
until an example's margin hits one of the discontinuities (projecting where neces- 
sary to ensure the weight vector lies within the ll ball). At this point the gradient 
vector becomes multi-valued (generally two-valued but it can be more). Each of the 
possible gradient directions is then tested by taking a small step in that direction (a 
292 L. Mason, P L. Bartlett and J. Baxter 
random subset of the gradient directions is chosen if there are too many of them). 
If none of the directions lead to a decrease in the cost, the examples whose margins 
lie on discontinuities of the cost function are added to a constraint set E. In sub- 
sequent iterations the same stepping procedure above is followed except that the 
direction step is modified to ensure that the examples in E do not move (i.e. they 
remain on the discontinuity points of C0). That is, the weight vector moves within 
the subspace defined by the examples in E. If no progress is made in any iteration, 
the constraint set E is reset to zero. If still no progress is made the procedure 
terminates. 
4 Experiments 
We used the following two-class problems from the UC Irvine database [CBM98]: 
Cleveland Heart Disease, Credit Application, German, Glass, Ionosphere, King 
Rook vs King Pawn, Pima Indians Diabetes, Sonar, Tic-Tac-Toe, and Wisconsin 
Breast Cancer. For the sake of simplicity we did not consider multi-class prob- 
lems. Each data set was randomly separated into train, test and validation sets, 
with the test and validation sets being equal in size. This was repeated 10 times 
independently and the results were averaged. 
35 
 30 
E 25 
 15 
' 10 
 5 
0 
-5 
0 5 10 15 20 
AdaBoost Test Error (%) 
25 30 
Figure 2: Relative improvement of 
DOOM over AdaBoost for all exam- 
ined datasets. 
Each experiment consisted of the following steps. 
First, AdaBoost was run on the training data to 
produce a sequence of base classifiers and their 
corresponding weights. In all of the experiments 
the base classifiers were axis-orthogonal hyper- 
planes (also known as decision stumps); this 
choice ensured that the complexity of the class 
of base classifiers was constant. Boosting was 
halted when adding a new classifier failed to de- 
crease the error on the validation set. DOOM 
was then run on the classifiers produced by Ad- 
aBoost for a large range of 0 values and 1000 
random initial weight vectors for each value of 0. 
The weight vector (and 0 value) with minimum 
misclassification on the validation set was chosen 
as the final solution. 
In some cases the training sets were reduced in size to make overfitting more likely, 
so that complexity regularization with DOOM could have an effect. (The details 
are given in the full version [MBB98].) In three of the datasets (Credit Applica- 
tion, Wisconsin Breast Cancer and Pima Indians Diabetes), AdaBoost gained no 
advantage from using more than a single classifier. In these datasets, the number 
of classifiers was chosen so that the validation error was reasonably stable. 
A comparison between the test errors generated by AdaBoost and DOOM is shown 
in Figure 2. In only one data set did DOOM produce a classifier which performed 
worse than AdaBoost in terms of test error; for most data sbts DOOM's test error 
was a significant improvement over AdaBoost's. 
Figure 3 shows cumulative training margin distribution graphs for four of the 
datasets for both AdaBoost and DOOM (with optimal 0 chosen by cross-validation). 
For a given margin the value on the curve corresponds to the proportion of training 
examples with margin no mor, e than this value. The test errors for both algorithms 
are also shown for comparison, as short horizontal lines on the vertical axis. 
The margin distributions show that the value of the minimum training margin has 
no real impact on generalization performance. (See also [Bre97] and [GS98].) As 
Direct Optimization of Margins Improves Generalization 293 
40 
30 
' 20 
I0 
0 
80 
6O 
4O 
20 
Wisconsion Breast Cancer 
-I 
-0.8 -0.6 -0.4-0.2 0 0.2 0.4 0.6 0.8 
Mar m 
Ionos he 
-0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 I 
Margin 
IOO 
80 
20 
o 
IOO 
80 
6o 
4o 
20 
Credit Application 
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 
Margin 
So r  ] ....................... 
-I -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 
Margin 
Figure 3: Cumulative training margin distributions for four datasets. The dark curve is 
AdaBoost, the light curve is DOOM with 0 selected by cross-validation. The test errors 
for both algorithms are marked on the vertical axis at margin 0. 
can be seen in Figure 3 (Credit Application and Sonar data sets), the generaliza- 
tion performance of the combined classifier produced by DOOM can be as good 
as or better than that of the classifier produced by AdaBoost, despite having dra- 
matically worse minimum training margin. Conversely, Figure 3 (Ionosphere data 
set) shows that improved generalization performance can be associated with an 
improved minimum margin. 
The margin distributions also show that there is a balance to be found between 
training error and complexity (as measured by 0). DOOM is willing to sacrifice 
training error in order to reduce complexity and thereby obtain a better margin 
distribution. For instance, in Figure 3 (Sonar data set), DOOM's training error is 
over 20% while AdaBoost's is 0%, but DOOM's test error is 5% less than that of 
AdaBoost's. The reason for this success can be seen in Figure 4, which illustrates 
the changes in the cost function, training error, and test error as a function of 0. 
The optimal complexity for this data set is low (corresponding to a large optimal 
0). In this case, a reduction in complexity is more important to generalization error 
than a reduction in training error. 
5 Conclusion 
In this paper we have addressed the question: what are suitable cost functions for 
convex combinations of base hypotheses? For general families of cost functions that 
are functions of the margin of a sample, we proved (Theorem 2) that the error of 
a convex combination is no more than the sample average of the cost function plus 
a regularization term involving the complexity of the cost function and the size of 
the base hypothesis class. 
We constructed a piecewise linear family of cost functions satisfying the conditions of 
Theorem 2 and presented a heuristic algorithm (DOOM) for optimizing the sample 
294 L. Mason, P L. Bartlett and d. Baxter 
0.45 
0.40 
0.35 
0.30 
0.25 
0.20 
0.15 
0.10 
0.05 
0.00 
Figure 4: Sonar data set. Left: Plot of cost (4 i=, Co(yif(xi))) against 0 for AdaBoost 
and DOOM. Right: Plot of training and test error against 0. 
average of the cost. 
We ran experiments on several of the datasets in the UC Irvine database, in which 
AdaBoost was used to generate a set of base classifiers and then DOOM was used 
to find the optimal convex combination of those classifiers. In all but one case 
the convex combination generated by DOOM had lower test error than AdaBoost's 
combination. Margin distribution plots show that in many cases DOOM achieves 
these lower test errors by sacrificing training error, in the interests of reducing the 
new cost function. The margin plots also show that the size of the minimum margin 
is not relevant to generalization performance. 
Acknowledgments 
Thanks to Yoav Freund, Wee Sun Lee and Rob Schapire for helpful comments and 
suggestions. This research was supported in part by a grant from the Australian Re- 
search Council. Jonathan Baxter was supported by an Australian Research Council 
Fellowship and Llew Mason was supported by an Australian Postgraduate Award. 
References 
[Bar98] 
[Bre97] 
[CBM98] 
[GS98] 
[MBB98] 
[SFBL98] 
P. L. Bartlett. The sample complexity of pattern classification with neural 
networks: the size of the weights is more important than the size of 
the network. IEEE Transactions on Information Theory, 44(2):525-536, 
1998. 
L. Breiman. Prediction games and arcing algorithms. Technical Report 
504, Department of Statistics, University of California, Berkeley, 1997. 
E. Keogh C. Blake and C.J. Merz. UCI repository of machine learning 
databases, 1998. http://www.ics.uci.edu/-mlearn/MLRepository. html. 
A. Grove and D. Schuurmans. Boosting in the limit: Maximizing the 
margin of learned ensembles. In Proceedings of the Fifteenth National 
Conference on Artificial Intelligence, pages 692-699, 1998. 
L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through 
explicit optimization of margins. Technical report, Department of Sys- 
tems Engineering, Australian National University, 1998. (Available from 
http://syseng. anu.edu. au/lsg). 
R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting 
the margin: a new explanation for the effectiveness of voting methods. 
Annals of Statistics, (to appear), 1998. 
