Training Methods for Adaptive Boosting 
of Neural Networks 
Holger Schwenk 
Dept. IRO 
Université de Montréal
2920 Chemin de la Tour, 
Montreal, Qc, Canada, H3C 3J7 
schwenkiro. umontreal. ca 
Yoshua Bengio 
Dept. IRO 
Université de Montréal
and AT&T Laboratories, NJ 
bengioyiro. umontreal. ca 
Abstract 
"Boosting" is a general method for improving the performance of any 
learning algorithm that consistently generates classifiers which need to 
perform only slightly better than random guessing. A recently proposed 
and very promising boosting algorithm is AdaBoost [5]. It has been ap- 
plied with great success to several benchmark machine learning problems 
using rather simple learning algorithms [4] and decision trees [1, 2, 6].
In this paper we use AdaBoost to improve the performance of neural
networks. We compare training methods based on sampling the training 
set and weighting the cost function. Our system achieves about 1.4% 
error on a data base of online handwritten digits from more than 200 
writers. Adaptive boosting of a multi-layer network achieved 1.5% error 
on the UCI Letters and 8.1% error on the UCI satellite data set. 
1 Introduction 
AdaBoost [4, 5] (for Adaptive Boosting) constructs a composite classifier by sequentially 
training classifiers, while putting more and more emphasis on certain patterns. AdaBoost
has been applied to rather weak learning algorithms (with low capacity) [4] and to decision
trees [1, 2, 6], but, to the best of our knowledge, not yet to artificial neural
networks. These earlier experiments displayed rather intriguing generalization properties, such as
a continued decrease in generalization error after the training error reaches zero. Previous
workers also disagree on the reasons for the impressive generalization performance displayed
by AdaBoost on a large array of tasks. One issue raised by Breiman [1] and the authors of
AdaBoost [4] is whether some of this effect is due to a reduction in variance similar to the
one obtained from the Bagging algorithm.
In this paper we explore the application of AdaBoost to Diabolo (auto-associative) net- 
works and multi-layer neural networks (MLPs). In doing so, we also compare three dif- 
ferent versions of AdaBoost: (R) training each classifier with a fixed training set obtained 
by resampling with replacement from the original training set (as in [1]), (E) training by 
resampling after each epoch a new training set from the original training set, and (W) train- 
ing by directly weighting the cost function (here the squared error) of the neural network. 
Note that the second version (E) is a better approximation of the weighted cost function 
than the first one (R), in particular when many epochs are performed. If the variance re- 
duction induced by averaging the hypotheses from very different models explains a good 
part of the generalization performance of AdaBoost, then the weighted training version 
(W) should perform worse than the resampling versions, and the fixed sample version (R)
should perform better than the continuously resampled version (E).
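As a rough illustration of how the three variants differ, the following sketch uses NumPy. The function names (`sample_fixed`, `sample_per_epoch`, `weighted_squared_error`) and the toy data are hypothetical and only illustrate the idea; they are not the implementation used in the experiments.

```python
import numpy as np

def sample_fixed(D, rng):
    """(R) Draw one sample of indices according to D; it is reused for every epoch."""
    n = len(D)
    return rng.choice(n, size=n, replace=True, p=D)

def sample_per_epoch(D, n_epochs, rng):
    """(E) Draw a fresh sample of indices according to D before each epoch."""
    n = len(D)
    return [rng.choice(n, size=n, replace=True, p=D) for _ in range(n_epochs)]

def weighted_squared_error(outputs, targets, D):
    """(W) Squared error in which the term of pattern i is weighted by D[i]."""
    per_pattern = np.sum((outputs - targets) ** 2, axis=1)
    return float(np.dot(D, per_pattern))

rng = np.random.default_rng(0)
D = np.array([0.7, 0.1, 0.1, 0.1])           # toy distribution over 4 patterns
idx_R = sample_fixed(D, rng)                  # one index set for all epochs
idx_E = sample_per_epoch(D, n_epochs=3, rng=rng)  # one index set per epoch

outputs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]])
targets = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
cost_W = weighted_squared_error(outputs, targets, D)
```

With many epochs, the index sets drawn in version (E) cover the training set in proportion to D, which is why (E) approximates the weighted cost of (W) more closely than the single fixed sample of (R).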
2 AdaBoost 
AdaBoost combines the hypotheses generated by a set of classifiers trained one after the 
other. The $t$-th classifier is trained with more emphasis on certain patterns, using a cost
function weighted by a probability distribution $D_t$ over the training data ($D_t(i)$ is positive and
$\sum_i D_t(i) = 1$). Some learning algorithms don't permit training with respect to a weighted
cost function. In this case sampling with replacement (using the probability distribution
$D_t$) can be used to approximate a weighted cost function. Examples with high probability
would then occur more often than those with low probability, while some examples may 
not occur in the sample at all although their probability is not zero. This is particularly true 
in the simple resampling version (labeled "R" earlier), and unlikely when a new training 
set is resampled after each epoch ("E" version). Neural networks can be trained directly 
with respect to a distribution over the learning data by weighting the cost function (this is
the "W" version): the squared error on the $i$-th pattern is weighted by the probability $D_t(i)$.
The result of training the $t$-th classifier is a hypothesis $h_t: X \to Y$ where $Y = \{1, ..., k\}$ is
the space of labels, and $X$ is the space of input features. After the $t$-th round the weighted
error $\epsilon_t$ of the resulting classifier is calculated and the distribution $D_{t+1}$ is computed from
$D_t$, by increasing the probability of incorrectly labeled examples. The global decision $f$ is
obtained by weighted voting. Figure 1 (left) summarizes the basic AdaBoost algorithm. It 
converges (learns the training set) if each classifier yields a weighted error that is less than 
50%, i.e., better than chance in the 2-class case. There is also a multi-class version, called 
pseudoloss-AdaBoost, that can be used when the classifier computes confidence scores for 
each class. Due to lack of space, we give only the algorithm (see figure 1, right) and we 
refer the reader to the references for more details [4, 5]. 
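The bookkeeping of one pseudoloss-AdaBoost round (Figure 1, right) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function name `pseudo_loss_round`, the array layout (one row per example, one column per label, labels numbered from 0) and the toy scores are all hypothetical.

```python
import numpy as np

def pseudo_loss_round(h, y, D):
    """One round of pseudoloss-AdaBoost bookkeeping.

    h: (N, k) confidence scores h_t(x_i, y) in [0, 1]
    y: (N,) correct labels in {0, ..., k-1}
    D: (N, k) distribution over mislabels (i, y), zero at (i, y_i), summing to 1
    """
    N, k = h.shape
    rows = np.arange(N)
    correct = h[rows, y][:, None]            # h_t(x_i, y_i), shape (N, 1)
    mask = np.ones((N, k), dtype=bool)
    mask[rows, y] = False                    # the set B of (example, wrong label) pairs
    # pseudo-loss: 1/2 * sum over B of D(i, y) * (1 - h(x_i, y_i) + h(x_i, y))
    eps = 0.5 * np.sum(np.where(mask, D * (1.0 - correct + h), 0.0))
    beta = eps / (1.0 - eps)
    # update: D(i, y) * beta^(1/2 * (1 + h(x_i, y_i) - h(x_i, y))), then normalize
    D_new = np.where(mask, D * beta ** (0.5 * (1.0 + correct - h)), 0.0)
    return eps, beta, D_new / D_new.sum()

# Toy example: 2 examples, 3 classes, uniform initial distribution over B.
N, k = 2, 3
y = np.array([0, 2])
h = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.2, 0.6]])
D = np.full((N, k), 0.25)
D[np.arange(N), y] = 0.0
eps, beta, D_next = pseudo_loss_round(h, y, D)
```

In this toy case the second example is classified with less confidence than the first, so its mislabel pairs receive more weight in `D_next`.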
AdaBoost has very interesting theoretical properties, in particular it can be shown that the 
error of the composite classifier on the training data decreases exponentially fast to zero [5] 
as the number of combined classifiers is increased. More importantly, however, bounds 
on the generalization error of such a system have been formulated [7]. These are based 
on a notion of margin of classification, defined as the difference between the score of the 
correct class and the strongest score of a wrong class. In the case in which there are just 
two possible labels $\{-1, +1\}$, this is $y f(x)$, where $f$ is the composite classifier and $y$ the
correct label. Obviously, the classification is correct if the margin is positive. We now
state the theorem bounding the generalization error of AdaBoost [7] (and any classifier
obtained by a convex combination of a set of classifiers). Let $H$ be a set of hypotheses
(from which the $h_t$ are chosen), with VC-dimension $d$. Let $f$ be any convex combination
of hypotheses from $H$. Let $S$ be a sample of $N$ examples chosen independently at random
according to a distribution $D$. Then with probability at least $1 - \delta$ over the random choice
of the training set $S$ from $D$, the following bound is satisfied for all $\theta > 0$:

$$P_D[y f(x) \le 0] \le P_S[y f(x) \le \theta] + O\left( \frac{1}{\sqrt{N}} \sqrt{ \frac{d \log^2(N/d)}{\theta^2} + \log(1/\delta) } \right) \qquad (1)$$

Note that this bound is independent of the number of combined hypotheses and how they
are chosen from $H$. The distribution of the margins however plays an important role. It can
be shown that the AdaBoost algorithm is especially well suited to the task of maximizing
the number of training examples with large margin [7].

AdaBoost (left):
Input: sequence of $N$ examples $(x_1, y_1), ..., (x_N, y_N)$ with labels $y_i \in Y = \{1, ..., k\}$
Init: $D_1(i) = 1/N$ for all $i$
Repeat:
1. Train neural network with respect to distribution $D_t$ and obtain hypothesis $h_t: X \to Y$
2. Calculate the weighted error of $h_t$: $\epsilon_t = \sum_{i: h_t(x_i) \ne y_i} D_t(i)$; abort loop if $\epsilon_t > 1/2$
3. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$
4. Update distribution $D_t$: $D_{t+1}(i) = D_t(i) \beta_t^{\delta_i} / Z_t$, with $\delta_i = (h_t(x_i) = y_i)$ and $Z_t$ a normalization constant
Output: final hypothesis: $f(x) = \arg\max_{y \in Y} \sum_{t: h_t(x) = y} \log(1/\beta_t)$

Multi-class extension using confidence scores (right):
Init: let $B = \{(i, y): i \in \{1, ..., N\}, y \ne y_i\}$; $D_1(i, y) = 1/|B|$ for all $(i, y) \in B$
Repeat:
1. Train neural network with respect to distribution $D_t$ and obtain hypothesis $h_t: X \times Y \to [0, 1]$
2. Calculate the pseudo-loss of $h_t$: $\epsilon_t = \frac{1}{2} \sum_{(i, y) \in B} D_t(i, y) (1 - h_t(x_i, y_i) + h_t(x_i, y))$
3. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$
4. Update distribution $D_t$: $D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t} \beta_t^{\frac{1}{2}(1 + h_t(x_i, y_i) - h_t(x_i, y))}$, where $Z_t$ is a normalization constant
Output: final hypothesis: $f(x) = \arg\max_{y \in Y} \sum_t \left( \log \frac{1}{\beta_t} \right) h_t(x, y)$

Figure 1: AdaBoost algorithm (left), multi-class extension using confidence scores (right)
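The basic algorithm of Figure 1 (left) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code used for the experiments: `adaboost_m1` and `train_stump` are hypothetical names, labels are numbered from 0, and a weighted decision stump stands in for the neural network so that the example stays self-contained. The small floor on the weighted error is an added guard against a perfect classifier and is not part of Figure 1.

```python
import numpy as np

def adaboost_m1(X, y, train_fn, n_rounds, n_classes):
    """Basic AdaBoost. train_fn(X, y, D) is a hypothetical callback that trains
    one classifier with respect to D and returns a function inputs -> labels."""
    N = len(y)
    D = np.full(N, 1.0 / N)                       # Init: D_1(i) = 1/N
    hypotheses, vote_weights = [], []
    for _ in range(n_rounds):
        h = train_fn(X, y, D)
        wrong = h(X) != y
        eps = float(np.sum(D[wrong]))             # weighted error of h_t
        if eps > 0.5:                             # abort loop
            break
        eps = max(eps, 1e-10)                     # assumption: avoid log(1/0)
        beta = eps / (1.0 - eps)
        hypotheses.append(h)
        vote_weights.append(np.log(1.0 / beta))
        D = D * np.where(wrong, 1.0, beta)        # shrink correctly classified examples
        D = D / D.sum()                           # normalization by Z_t

    def f(X_new):
        votes = np.zeros((len(X_new), n_classes))
        for h, w in zip(hypotheses, vote_weights):
            votes[np.arange(len(X_new)), h(X_new)] += w
        return votes.argmax(axis=1)               # weighted vote
    return f, hypotheses, vote_weights

def train_stump(X, y, D):
    """Stand-in weak learner (assumption): best weighted threshold on one feature."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for left, right in ((0, 1), (1, 0)):
                pred = np.where(X[:, j] <= thr, left, right)
                err = np.sum(D[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, left, right)
    _, j, thr, left, right = best
    return lambda Z: np.where(Z[:, j] <= thr, left, right)

# Toy problem that no single stump can solve: class 1 is an interval.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 0])
f, hypotheses, vote_weights = adaboost_m1(X, y, train_stump, n_rounds=3, n_classes=2)
train_predictions = f(X)
```

On this toy problem each stump has a weighted error below 1/2, and the weighted vote of three stumps already labels all six training points correctly.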
3 The Diabolo Classifier 
Normally, neural networks used for classification are trained to map an input vector to an 
output vector that directly encodes the classes, usually by the so-called "1-out-of-N encoding".
An alternative approach with interesting properties is to use auto-associative neural
networks, also called autoencoders or Diabolo networks, to learn a model of each class. 
In the simplest case, each autoencoder network is trained only with examples of the cor- 
responding class, i.e., it learns to reconstruct all examples of one class at its output. The 
distance between the input vector and the reconstructed output vector expresses the likeli- 
hood that a particular example is part of the corresponding class. Therefore classification 
is done by choosing the best fitting model. Figure 2 summarizes the basic architecture. 
It shows also typical classification behavior for an online character recognition task. The 
input and output vectors are (x, y)-coordinate sequences of a character. The visual representation
in the figure is obtained by connecting these points. In this example the "1" is
correctly classified since the network for this class has the smallest reconstruction error. 
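The decision rule described above (one model per class, choose the model with the smallest reconstruction error) can be sketched as follows. This is only an illustration under strong assumptions: a linear bottleneck fitted by SVD stands in for each autoencoder network, plain squared Euclidean distance replaces any domain-specific distance, and the names `LinearAutoencoder` and `diabolo_classify` are hypothetical.

```python
import numpy as np

class LinearAutoencoder:
    """Stand-in for one Diabolo network (assumption: linear bottleneck fitted by SVD)."""

    def __init__(self, n_hidden):
        self.n_hidden = n_hidden

    def fit(self, X):
        # Train only on the examples of one class.
        self.mean = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = Vt[: self.n_hidden]
        return self

    def reconstruct(self, X):
        code = (X - self.mean) @ self.components.T     # bottleneck activations
        return self.mean + code @ self.components

def diabolo_classify(models, X):
    """For each row of X, choose the class whose model reconstructs it best."""
    errors = np.stack(
        [np.sum((X - m.reconstruct(X)) ** 2, axis=1) for m in models], axis=1
    )
    return errors.argmin(axis=1), errors

# Toy data: class 0 varies along the first axis, class 1 along the second.
X0 = np.array([[-1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
X1 = np.array([[5.0, -1.0, 0.0], [5.0, 0.0, 0.0], [5.0, 1.0, 0.0]])
models = [LinearAutoencoder(1).fit(X0), LinearAutoencoder(1).fit(X1)]

X_test = np.array([[0.5, 0.0, 0.0], [5.0, 0.3, 0.0]])
labels, errors = diabolo_classify(models, X_test)
```

Only one distance per class is computed at recognition time, which is the property that makes a more expensive, domain-specific distance affordable.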
Figure 2: Architecture of a Diabolo classifier. [Diagram: the input sequence (the character to classify) is passed to one autoencoder network per class (net 1, net 2, ..., net 7 are shown); a distance measure between the input and each output sequence gives one score per network, and a decision module selects the class.]

The Diabolo classifier uses a distributed representation of the models which is much more
compact than the enumeration of references often used by distance-based classifiers like
nearest-neighbor or RBF networks. Furthermore, only one distance
measure has to be calculated for each class to recognize. This allows knowledge to be
incorporated through a domain-specific distance measure at a very low computational cost.
In previous work [8], we have
shown that the well-known tangent distance [11] can be used in the objective function of the
autoencoders. This Diabolo classifier has achieved state-of-the-art results in handwritten
OCR [8, 9]. Recently, we have also extended the idea of a transformation invariant distance
measure to online character recognition [10]. One autoencoder alone, however, cannot
efficiently learn the model of a character if it is written in many different stroke orders and
directions. The architecture can be extended by using several autoencoders per class, each 
one specializing on a particular writing style (subclass). For the class "0", for instance, 
we would have one Diabolo network that learns a model for zeros written clockwise and 
another one for zeros written counterclockwise. The assignment of the training examples to 
the different subclass models should ideally be done in an unsupervised way. However, this 
can be quite difficult since the number of writing styles is not known in advance and usually 
the number of examples in each subclass varies a lot. Our training data base contains for 
instance 100 zeros written counterclockwise, but only 3 written clockwise (there are also 
some more examples written in other strange styles). Classical clustering algorithms would 
probably tend to ignore subclasses with very few examples since they aren't responsible 
for much of the error, but this may result in poor generalization behavior. Therefore, in 
previous work we have manually assigned the subclass labels [10]. Of course, this is not a 
generally satisfactory approach, and certainly infeasible when the training set is large. In 
the following, we will show that the emphasizing algorithm of AdaBoost can be used to 
train multiple Diabolo classifiers per class, performing a soft assignment of examples of 
the training set to each network. 
4 Results with Diabolo and MLP Classifiers
Experiments have been performed on three data sets: a data base of online handwritten 
digits, the UCI Letters database of offline machine-printed alphabetical characters and the
UCI satellite database that is generated from Landsat Multi-spectral Scanner image data. 
All data sets have a pre-defined training and test set. The Diabolo classifier was only 
applied to the online data set (since it takes advantage of the structure of the input features). 
The online data set was collected at Paris 6 University [10]. It is writer-independent (dif- 
ferent writers in training and test sets) and there are 203 writers, 1200 training examples 
and 830 test examples. Each writer gave only one example per class. Therefore, there are 
many different writing styles, with very different frequencies. We only applied a simple 
preprocessing: the characters were resampled to 11 points, centered and size normalized
to an (x, y)-coordinate sequence in $[-1, 1]^{22}$. Since the Diabolo classifier with tangent distance
[10] is invariant to small transformations we don't need to extract further features.
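A preprocessing step of this kind might look like the sketch below. The text only states that characters were resampled to 11 points, centered and size normalized; the details here are assumptions made for illustration: points are placed at equal arc-length intervals along the pen trajectory, centering uses the bounding box, the aspect ratio is preserved, and consecutive input points are assumed to be distinct. The function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(points, n_points=11):
    """Map a pen trajectory of shape (M, 2) to a vector in [-1, 1]^(2 * n_points)."""
    points = np.asarray(points, dtype=float)
    # Cumulative arc length along the trajectory.
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    # Assumption: resample at equal arc-length intervals by linear interpolation.
    targets = np.linspace(0.0, arc[-1], n_points)
    x = np.interp(targets, arc, points[:, 0])
    y = np.interp(targets, arc, points[:, 1])
    resampled = np.stack([x, y], axis=1)
    # Assumption: center on the bounding box and scale the larger side to [-1, 1].
    lo, hi = resampled.min(axis=0), resampled.max(axis=0)
    center = (lo + hi) / 2.0
    half_size = max((hi - lo).max() / 2.0, 1e-12)
    return ((resampled - center) / half_size).ravel()

stroke = [(0.0, 0.0), (10.0, 0.0), (10.0, 20.0)]   # toy "L"-shaped stroke
v = preprocess(stroke)
```

The result is a 22-dimensional vector, matching the 22 inputs of the 22-10-10, 22-30-10 and 22-50-10 networks used in the experiments.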
Table 1 summarizes the results on the test set of different approaches before using Ada- 
Boost. The Diabolo classifier with hand-selected sub-classes in the training set performs 
best since it is invariant to transformations and since it can deal with the different writing 
styles. The experiments suggest that fully connected neural networks are not well suited 
for this task: small nets do poorly on both training and test sets, while large nets overfit. 
Table 1: Online digits data set error rates for different unboosted classifiers

         Diabolo classifier               fully connected MLP
         no subclasses   hand-selected    22-10-10   22-30-10   22-50-10
train:   2.2%            0.6%             5.7%       0.8%       0.4%
test:    3.3%            1.2%             8.8%       3.3%       2.8%

10-fold cross-validation was used to find the optimal number of training epochs (typically 
about 200). If training is continued until 1000 epochs, the test error increases by more 
than 1%. 
Table 2 shows the results of bagged and boosted multi-layer perceptrons with 10, 30 or 50 
hidden units, trained for either 100, 200, 500 or 1000 epochs, and using either the ordinary 
resampling scheme (R), resampling with different random selections at each epoch (E), or 
training with weights Dt on the squared error criterion for each pattern (W). 100 neural 
networks were combined. The multi-class version of the AdaBoost algorithm was used 
in all the experiments with MLPs: it yielded considerably better results than the basic 
version. Pseudoloss-AdaBoost was however not useful for the Diabolo classifier since it 
uses a powerful discriminant learning algorithm [9]. 
Table 2: Online digits test error rates for boosted MLPs

architecture         22-10-10               22-30-10               22-50-10
version              R      E      W        R      E      W        R      E      W
Bagging, 500 it             5.4%                   2.8%                   1.8%
AdaBoost:
  100 it             2.9%   3.2%   6.0%     1.7%   1.8%   5.1%     2.1%   1.8%   4.9%
  200 it             3.0%   2.8%   5.6%     1.8%   1.8%   4.2%     1.8%   1.7%   3.5%
  500 it             2.5%   2.7%   3.3%     1.7%   1.5%   3.0%     1.7%   1.7%   2.8%
  1000 it            2.8%   2.7%   3.2%     1.8%   1.6%   2.6%     1.6%   1.5%   2.2%
  5000 it            -      -      2.9%     -      -      1.6%     -      -      1.6%

In all cases AdaBoost improved the generalization error of the MLPs, for instance from
8.8% to about 2.7% for the 22-10-10 architecture. Boosting was also always superior to
Bagging. Furthermore, it seems that the number of training iterations of each individual classifier
has no significant effect on the results of the combined classifier, at least on this
database. Note that the test set has only 830 examples and small differences in the error
rate are not statistically significant. AdaBoost with weighted training of MLPs, however,
doesn't work if the learning of each individual MLP is stopped too early (1000 epochs): the
networks didn't learn the weighted examples well enough and $\epsilon_t$ rapidly approached 0.5.
When training each MLP for 5000 epochs, however, the weighted training (W) version 
achieved the same low test error. AdaBoost is less useful for very big networks (50 or more 
hidden units for this data) since each individual classifier achieves zero training error. In 
this case the probability distribution Dt doesn't change any more and AdaBoost reduces to 
Bagging (with possibly unequal probabilities).
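To give a feeling for why small differences on an 830-example test set are not significant, the following back-of-the-envelope check computes a confidence interval for two of the error rates in Table 2 (1.7% and 1.8%). It is a sketch that assumes a simple normal approximation to the binomial distribution; the function name `error_rate_interval` is hypothetical.

```python
import math

def error_rate_interval(p, n_test, z=1.96):
    """Approximate 95% interval for an error rate p measured on n_test examples
    (assumption: normal approximation to the binomial distribution)."""
    half_width = z * math.sqrt(p * (1.0 - p) / n_test)
    return p - half_width, p + half_width

lo_a, hi_a = error_rate_interval(0.017, 830)
lo_b, hi_b = error_rate_interval(0.018, 830)
intervals_overlap = lo_b <= hi_a and lo_a <= hi_b
print(f"1.7%: [{lo_a:.4f}, {hi_a:.4f}]  1.8%: [{lo_b:.4f}, {hi_b:.4f}]  overlap: {intervals_overlap}")
```

Under this approximation the half-width of each interval is just under 0.9 percentage points, far larger than the 0.1 point difference between the two rates.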
Figure 3: top: error rates of the boosted classifiers for increasing number of networks
bottom: margin distributions using 2, 5, 10 and 100 machines respectively
[Plots: the top panels (Diabolo Classifier; MLP 22-10-10, 1000 iter; MLP 22-30-10, 1000 iter) show the error rate against the number of machines (up to 100) for Bagging and AdaBoost, with AdaBoost versions (R) and (E) for the MLPs and AdaBoost (W) with 5000 iterations for MLP 22-30-10. The bottom panels (AdaBoost(E) of Diabolos; AdaBoost(E) of MLP 22-10-10; Bagging of MLP 22-30-10; AdaBoost(E) of MLP 22-30-10) show cumulative margin distributions over [-1, 1].]

Figure 3 shows the error rate of some of the boosted classifiers as the number of machines is
increased, as well as examples of the margin distributions obtained after training. AdaBoost
brings training error to zero after only a few steps, even with an MLP with only 10 hidden
units. The generalization error is also considerably improved and it continues to decrease
asymptotically after zero training error has been reached. The Diabolo classifier performs
best when combining 16 classifiers (1.4% error = 12 errors) which is almost as good as
the Diabolo classifier using hand-selected subclasses (1.2% = 10 errors). Since we know 
that one autoencoder can't learn a model of the different writing styles within one class, this
seems to be evidence that the example emphasizing of AdaBoost was able to assign them 
automatically to different machines. Bagging yields 2.2 % error in this case. The surprising 
effect of continuously decreasing generalization error even after training error reaches zero 
has already been observed by others [1, 2, 4, 6]. This seems to contradict Occam's razor, but
it may be explained by the recently proven theorem of Schapire et al. [7]: the bound on the 
generalization error (equation 1) depends only on the margin distribution and on the VC- 
dimension of the basic learning machine (one Diabolo classifier or MLP respectively), not 
on the number of machines combined by AdaBoost. Figure 3 bottom shows the margin
distributions, i.e. the fraction of examples whose margin is at most $x$ as a function of
$x \in [-1, 1]$. It is clearly visible that AdaBoost increases the number of examples with
high margin: with 100 machines all examples have a margin higher than 0.5. Note that the
margin distribution using one hundred 22-30-10 MLPs is better than the one from 100 Diabolo
classifiers, but not the error rate. We hypothesize that the difference in performance may
be due in part to a lower effective VC-dimension of the Diabolo classifier. It may also 
be that the generalization error bounds of Freund et al. are too far away from the actual 
generalization error.

Similar experiments were performed with MLPs on the "Letters" data
set from the UCI Machine Learning database. It has 16000 training and 4000 test patterns,
16 input features, and 26 classes (A-Z) of distorted machine-printed characters from 20
different fonts. The experiments were performed with a 16-70-50-26 MLP using 500 online
back-propagation epochs and resampling after each epoch (E), which performed best in
the experiments with the online data set. Each input feature was normalized according to
its mean and variance on the training set. The plain, bagged and boosted networks are
compared to decision trees (results from [1, 4]) in Table 3. In all experiments 100 classifiers
were combined. The results obtained with the boosted network are extremely good (1.5%
error) and are the best published results that the authors know of for this data set. The best
performance reported in STATLOG [3] is 6.4%. Note also that only a few neural networks
need to be combined to obtain substantial improvements: with 20 neural networks the
error already falls below 2%, whereas boosted decision trees typically "converge" later. The
W-version of AdaBoost also yields the same results on this data, but again, the networks
have to be trained longer. Similar conclusions hold for the UCI "satellite" data set (Table 3).

Table 3: Test error rates on the UCI data sets

            CART [1]                    C4.5 [4]                    MLP
            alone   bagged   boosted    alone   bagged   boosted    alone   bagged   boosted
letter      12.4%   6.4%     3.4%       13.8%   6.8%     3.3%       6.1%    4.3%     1.5%
satellite   14.8%   10.3%    8.8%       14.8%   10.6%    8.9%       12.8%   8.7%     8.1%

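The margin distributions discussed in this section can be computed from the votes of a composite classifier as sketched below. The sketch assumes that the votes of each example are normalized to sum to one, so that margins lie in [-1, 1]; the function names `margins` and `cumulative_margin_distribution`, the label numbering from 0 and the toy votes are hypothetical.

```python
import numpy as np

def margins(votes, y):
    """Margin of each example: score of the correct class minus the strongest
    score of a wrong class. votes: (N, k) with rows summing to 1; y: (N,) labels."""
    rows = np.arange(len(y))
    correct = votes[rows, y]
    others = votes.copy()
    others[rows, y] = -np.inf                 # exclude the correct class
    return correct - others.max(axis=1)

def cumulative_margin_distribution(m, grid):
    """Fraction of examples whose margin is at most x, for each x in grid."""
    return np.array([(m <= x).mean() for x in grid])

# Toy votes for 3 examples and 3 classes.
votes = np.array([[0.9, 0.1, 0.0],
                  [0.4, 0.6, 0.0],
                  [0.2, 0.3, 0.5]])
y = np.array([0, 0, 2])
m = margins(votes, y)
grid = np.array([-1.0, 0.0, 0.5, 1.0])
cdf = cumulative_margin_distribution(m, grid)
```

An example is misclassified exactly when its margin is negative, so the value of the cumulative distribution at 0 is close to the error rate, and a curve shifted to the right corresponds to larger margins.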
5 Conclusion 
As demonstrated in three real-world applications, AdaBoost can significantly improve neu- 
ral classifiers such as multi-layer networks and Diabolo networks. The behavior of Ada- 
Boost for neural networks confirms previous observations on other learning algorithms 
[1, 2, 4, 6, 7], such as the continued generalization improvement after zero training error
has been reached, and the associated improvement in the margin distribution. It also seems
that AdaBoost is not very sensitive to overtraining of the individual classifiers, so that the
neural networks can be trained for a fixed (preferably high) number of training epochs. This
makes the choice of neural network design parameters easier.
Another interesting finding of this paper is that the "weighted training" version of AdaBoost 
works well for MLPs, but requires many more training epochs (because of the weights 
on the cost function terms, the conditioning of the Hessian matrix is probably worse). 
These results add credence to the view of Freund and Schapire that the improvement in 
generalization error brought by AdaBoost is mainly due to the emphasizing (that increases 
the margin), rather than to a variance reduction due to the randomization of the resampling 
process. 
References 
[1] L. Breiman. Bias, variance, and Arcing classifiers. Technical Report 460, Statistics Department,
University of California at Berkeley, 1996.
[2] H. Drucker and C. Cortes. Boosting decision trees. In NIPS*8, pages 479-485, 1996.
[3] C. Feng, A. Sutherland, R. King, S. Muggleton, and R. Henery. Comparison of machine
learning classifiers to statistics and neural networks. In Proceedings of the Fourth International
Workshop on Artificial Intelligence and Statistics, pages 41-52, 1993.
[4] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning:
Proceedings of Thirteenth International Conference, pages 148-156, 1996.
[5] Y. Freund and R.E. Schapire. A decision theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Science, to appear.
[6] J.R. Quinlan. Bagging, Boosting and C4.5. In 14th National Conference on Artificial Intelligence, 1996.
[7] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: A new explanation
for the effectiveness of voting methods. In Machine Learning: Proceedings of Fourteenth
International Conference, in press, 1997.
[8] H. Schwenk and M. Milgram. Transformation invariant autoassociation with application to
handwritten character recognition. NIPS*7, pages 991-998. MIT Press, 1995.
[9] H. Schwenk and M. Milgram. Learning discriminant tangent models for handwritten character
recognition. In ICANN*96, pages 585-590. Springer Verlag, 1995.
[10] H. Schwenk and M. Milgram. Constraint tangent distance for online character recognition. In
International Conference on Pattern Recognition, pages D 520-524, 1996.
[11] P. Simard, Y. Le Cun, and J. Denker. Efficient pattern recognition using a new transformation
distance. NIPS*5, pages 50-58. Morgan Kaufmann, 1993.
