Removing Noise in On-Line Search using 
Adaptive Batch Sizes 
Genevieve B. Orr 
Department of Computer Science 
Willamette University 
900 State Street 
Salem, Oregon 97301 
gorr)willamette. edu 
Abstract 
Stochastic (on-line) learning can be faster than batch learning. 
Howevex', at late times, the learning rate must be annealed to re- 
move the noise present in the stochastic weight updates. In this 
annealing phase, tile convergence rate (in mean square) is at best 
proportional to 1/- where - is the number of input presentations. 
An alternative is to increase the batch size to remove the noise. In 
this paper we explore convergence for LMS using 1) small but fixed 
batch sizes and 2) an adaptive batch size. We show that the best 
adaptive batch schedule is exponential and has a rate of conver- 
gence which is the same as for annealing, i.e., at best proportional 
to 
I Introduction 
Stochastic (on-line) learning can speed learning over its batch training particularly 
when data sets are large and contain redundant information [M193]. However, at 
late times in learning, noise present in the weight updates prevents complete conver- 
gence fi'om taking place. To reduce the noise, the learning rate is slowly decreased 
(annealed} at late times. The optimal annealing schedule is asymptotically propor- 
tional to  xvhere t is tile iteration [Go187, LO93, Orr95]. This results in a rate of 
convergence (in mean square) that is also proportional to - 
t' 
An alternative method of reducing tile noise is to simply switch to (noiseless) batch 
mode when the noise regime is reached. Howevex', since batch mode can be slow, 
a better idea is to slowly increase the batch size starting with 1 (pure stochastic) 
and slowly increasing it only "as needed" until it reaches the training set size (pure 
batd). In this paper we 1) investigate the convergence behavior of LMS when 
Removing Noise in On-Line Search using Adaptive Batch Sizes 233 
using sxnall fixed batch sizes, 2) determine the best schedule when using an adaptive 
batch size at each iteration, 3) analyze the convergence behavior of the adaptive 
batch algorithm, and 4) compare this convergence rate to the alternative method 
of annealing the learning rate. 
Other authors have approached the problem of redundant data by also proposing 
tedmiques fox' training on subsets of the data. For example, Pluto [PW93] uses 
active data selection to choose a concise subset for training. This subset is slowly 
added to over time as needed. Moller [M193] proposes combining scaled conjugate 
gradient descent (SCG) with what he refers to as blocksize updating. His algorithm 
uses an iterative approach and assumes that tile block size does not vary rapidly 
during training. In this paper, we take tile simpler approach of .just choosing ex- 
emplars at random at each iteration. Given this, we then analyze in detail the 
convergence behavior. Our results are more of theoretical than practical interest 
since the equations we derive are complex functions of quantities such as the Hessian 
that are impractical to compute. 
2 Fixed Batch Size 
In this section we examine the convergence behavior fox' LMS using a fixed batch 
size. We assrune that we are given a large but finite sized training set T -_- {zi --- 
(xi, d)}i= wilere x  T " is the i th input and di   is the corresponding target. 
We further assume that the targets are generated using a signal plus noise model 
so that we can write 
di =WXi +6i (1) 
where w. 6  is the optimal weight vector and the ei is zero me noise. Since 
the training set is sumed to be large we take the average of ei and xi6i over the 
training set to be approximately zero. Note that we consider only the problem of 
optimization of the w over the training set and do not address the issue of obtaining 
good generalization over the distribution from whi( the training set w drawn. 
At each iteration, we sume that exactly n samples are randomly drawn without 
replacement froIn T where 1  n  N. We denote this batch of size n drawn at 
time t by Bn(t)  {z}= 1. When n = 1 we have pure on-line training and when 
n. = N we have pm'e batch. XVe choose to sample without replacement so that  
the batch size is increded, we have a smooth transition from on-line to batch. 
For LMS, the squared error at iteration t for a batch of size n is 
i  wt xi) (2) 
z.B,(t) 
and where ct  T4  is tile current weight in the network. The update weight 
a?XP- wilere p is the fixed learning rate. Rewriting 
equation is then t+l = t --  Owt 
this in terms of the weight error v   - w. and defining g,t  OS()/Ov we 
obtain 
= ,., + - - = - (3) 
z,6B 
Convergence (in mean square) to c'. can be characterized by the rate of change of 
the average squared norm of the weight error E[v ] where v 2 _= vTv. lrom (3) we 
obtain an expression fox' vt2+ in terms of vt, 
Ut2q_l -- Vt 2 2p ,T p2 2 
- - v g,e.,, + g,e.,' (4) 
234 G. B. Orr 
To compute the expected value of vt2+ conditioned on vt we can average the right 
side of (4) over all possible ways that the batch B,(t) can be chosen from the N 
training examples. In appendix A, we show that 
(sB,,). = (s,t)N (5) 
(})B = :v-  2 (- 1):v 
where (.). denotes average over all examples in the training set, (.) denotes average 
over the possible batches drawn at time t, and gi,t  OS(zi)/Ovt. The averages over 
the entire training set are 
1  O(zi) 
,,) =  .  - - ) ; i, - ii = n, (7) 
=1 i=1 
N 
1 
i=1 
2 
where R  (xx), S  (xxxx) , a  (e 2) d ( R) is the trace of 
These equations together with (5) and (6) in (4) gives the expected value of 
conditioned on v, 
(vt2+lvt)=vtT{i_2pR+p2 (N(n-1)R2 N-n )} p2a2(TrR)(N-n) 
Y-' i2 + (: i. s "' + .(N - 1) 
Note that this reduces to the standard stochastic and batch update equations when 
r, = 1 and r, = N, respectively. 
2.0.1 Special Cases: 1-D solution and Spherically Symmetric 
In 1-dimension we can average over vt in (9) to give 
where 
a = l - 2ftR + ft 2 f N(r- l) Ra N-n _X 
(10) 
7(:v- .) 
: n(N- 1) (11) 
and where R and $ sinplify to R = (x2)., $ = (x4)N. This is a difference equation 
which can be solved exactly to give 
1 - a t-to 
<vta>=a'-t<v>+ 1--a  (12) 
where (v') is the expected squared weight error at the initial time to. 
Figure la conpares equation (12) with simulations of 1-D LMS with gaussi inputs 
for N = 1000 md batch sizes n =10, 100, and 500. As  be seen, the agreement is 
good. Note that {v 2) decre,es exponentially until flattening out. The equihbrium 
value can be computed from (12) by setting t =  (suming ]a] < 1) to give 
, (-.) 
<'v2> -- 1--rt- 2Rn(N-1)-p(N(n-1)R2+(N-n)S) ' (13) 
Note that (v2) decre,es  n incre, es and is zero only if n = N. 
For real zero mean gaussian inputs, we can apply the gaussi moment factoring 
theorem [Hv91] which states that (xlxxxt) = RORm + RiRit + R, Ri where the 
subscripts on x denote components of x. From this, we find that S = (Trace R)R + 2R 2. 
Removing Noise in On-Line Search using Adaptive Batch Sizes 235 
E[ v^2 ] 
0.1 
0.01 
0.001; 
10 20 
(a) o.00o 
n=l 0 
0.01 
30 _ --60--.. t 0.001 
n=500 (b) 
E[ v^2 ] 
1' n=10 
0,1. 
0 '5;300 1 oof)o 1 .(3('  POOO0 
Figure 1: Simulations(solid) vs Theoretical (dashed) predictions of the squared weight 
error of 1-D LMS a) as function of the number of t, the batch updates (iterations), and 
b) as function of the number of input presentations, r. Training set size is N = 1000 and 
batch sizes are n --10, 100, and 500. Inputs were gaussian with R -- 1, er = 1 and/ - .1. 
Simulations used 10000 networks 
Equation (9) can also be solved exactly in multiple dimensions in the rather restric- 
tive case where we assume that the inputs are spherically symmetric gaussians with 
R = aI where a is a constant, I is the identity matrix, and m is the dimension. 
The update equation and solution are the same as (10) and (12), respectively, but 
where c and fi are now 
c, = l - 2ta + tfi a 2 IN(n-l) 
U2a2ma(N - n) 
N-n (m+2) fi-- (14) 
+ (N- 1)n ' n(N- 1) 
3 Adaptive Batch Size 
The tine it takes to compute the weight update in one iteration is roughly propor- 
tional to the number of input presentations, i.e the batch size. To make the compar- 
ison of convergence rates for different batch sizes meaningful, we must compute the 
change in squared weight error as a function of the number of input presentations, 
r, rather than iteration number t. 
For fixed batch size, r = nt. Figure lb displays our 1-D LMS simulations plotted as 
a function of r. As can be seen, training with a large batch size is slow but results 
in a lower equilibrium value than obtained with a small batch size.. This suggests 
that we could obtain the fastest decrease of (v 2) overall by varying the batch size at 
each iteration. The batch size to choose for the current (v 2) would be the smallest 
n that has yet to reach equilibrium, i.e. for which (v 2) > (v2)o. 
To determine the best batch size, we take the greedy approach by demanding that 
at each iteration the batch size is chosen so as to reduce the weight error at the 
next iteration by the greatest amount per input presentation. This is equivalent to 
asking what value of n naximizes h _= (('v)- (vt2+l))/n? Once we determine n ve 
then express it as a function of r. 
We treat the 1-D case, although the analysis would be similar for the spherically 
symmetric case. From (10)we have h = -} ((rt - 1)(vt 2) + fi). Differentiating h with 
respect to n and solving yields the batch size that decreases the weight error the 
most to be 
nt = min N, (2R(N- 1) + la(S- NR2))(v) + laaR ' (15) 
We have n, exactly equal to N when the current value of (vt 2) satisfies 
7) < 7 = 2(- ) - ,((X-- 2)- S) (' = )' 
(16) 
236 G. B. Orr 
E[ v^2l 
1 -, Adaptive Batch 
-'-...'".,. -- theory 
O.Ol 
o.ool 
(a) o lO o o .o o o ,o (b) ' ' 
n 
1 oo. . Batc ' 
50. 
lO. 
5 
o 1 o 20 30 40 50 60 70 
Figure 2: 1-D LMS: a) Comparison of the simulated and theoretically predicted (equation 
(18)) squared weight error as a function of t with N = 1000, R = 1, a = 1,/ = .1, and 
10000 networks. b) Corresponding batch sizes used in the simulations. 
Thus, after (v) has decreased to %, training will proceed as pure batch. 
(v) > 7c, we have nt < N and we can put (15) into (10) to obtain 
2(:v- 1)' 
2(:v- 1) / 
Solving (17) we get 
1 - a- ; 
<v) = q-,0<,v) +  _ -1 
where rl, and 1 e constants 
r = 1 - R + 2 (S - N 2) _p2 
2(- t) '  = 2(- t)' 
When 
(17) 
(18) 
(19) 
Figure 2a compares equation (18) with 1-D LMS simulations. The adaptive batch 
size was chosen by rounding (15) to the nearest integer. Early in training, the 
predicted nt is always smaller than 1 but the simulation always rounds up to 1 (can't 
have n=0). Figure 2b displays the batch sizes that were used in the simulations. A 
logarithmic scale is used to show that the batch size increases exponentially in t. 
.Ve next examine the batch size as a function of r. 
3.1 Convergence Rate per Input Presentation 
When we use (15) to choose the batch size, the number of input presentations will 
vary at each iteration. Thus, r is not simply a multiple of t. Instead, we have 
t 
+ - (20) 
i=to 
where r0 is the number of inputs that have been presented by to. This can be 
evaluated when N is very large. In this case, equations (18) and (15) reduce to 
i eR e (21) 
2((s- t)<} + t) 2(s- t) 
= (2R- R)(vt ) = 2R- R  + (2 - R)(voa)a -t' (22) 
Putting (22) into (20), and summing gives gives 
(2- tR)<v} 1 - ra 
a(t) = 2(s- a ) 
(2- R)R At + 
(23) 
Removing Noise in On-Line Search using Adaptive Batch Sizes 237 
1 
0.5 
0.1 
0.05 
0.01 
0.005 
E[v^2] 
 n= 1 O0 
 
"%""' .....  n=10 
n=adaptiQ .......... 
0.01 
0.001 
E[v^2] 
11 annealed 
theory (solid) 
simulation (dashed) 
(a) so lOO ,so =oo =so aoo aso 400 (b) ,o. 20. so. ,oo. 200. 5OO. lOOO2OOO. 
Figure 3: 1-D LMS: a) Simulations of the squared weight error as a function of r, the 
number of input presentations for N = 1000, R = 1, tr = 1, kt = .1, and 10000 networks. 
Batch sizes are n =1, 10, 100, and nt (see (15)). b) Comparison of simulation (dashed) 
and theory (see (24)) using adaptive batch size. Simulation (long dash) using an annealed 
learning rate with n = 1 and kt = R -x is also shown. 
where At = t - to and Ar = r -- 0. Assuming that [a3] < 1, the term with a -zxt 
will dominate at late times. Dropping the other terms and solving for a gives 
4 (24) 
<'vt2) = <vt2) aaat  (2 -/R) a R (r - r0)' 
Thus, when using an adaptive batch size, (v a) converges at late times as . Figure 
3a compares simulations of (v a) with adaptive and constant batch sizes. As can 
be seen, the adaptive n curve follows the n = 1 curve until just before the n = 1 
curve starts to flatten out. Figure 3b compares (24) with the simulation. Curves 
are plotted on a log-log plot to illustrate the 1/r relationship at late times (straight 
line with slope of-1). 
4 Learning Rate Annealing vs Increasing Batch Size 
With online learning (n = 1), we can reduce the noise at late times by annealing 
the learning rate using a /t schedule. During this phase, (v 2) decreases at a rate of 
1/r if/ >/?-1/2 [LO93] and slower otherwise. In this paper, we have presented an 
alternative method for reducing the noise by increasing the batch size exponentially 
in t. Here, (v 2) also decreases at rate of 1/r so that, from this perspective, an 
adaptive batch size is equivalent to annealing the learning rate. This is confirmed 
in Figure 3b which compares using an adaptive batch size with annealing. 
An advantage of the adaptive batch size comes when n reaches N. At this point r 
remains constant so that (v 2 ) decreases exponentially in . However, with annealing, 
the convergence rate of (v 2) al;vays remains proportional to 1/r. A disadvantage, 
though, occurs in multiple dimensions with nonspherical R. where the best choice of 
nt would likely be different along different directions in the weight space. Though 
it is possible to have a different learning rate along different directions, it is not 
possible to have different batch sizes. 
5 Appendix A 
In this appendix we use simple counting arguments to derive the two results in 
equations (5) and (6). We first note that there are M = (  ) ways of choosing n 
examples out of a total of N examples. Thus, (5) can be rewritten as 
1 a4 1  1 )_ g,t. (25) 
i=1 i=1 zBi) 
238 G. B. Orr 
where B(n  is the i th batch (i = 1,... ,M), and gj,t -= O(zj)lOvt for j = 1,... ,N. 
If we were to expand (25) we would find that there are exactly nM terms. From 
symmetry and since there are only N unique gj,t, we conclude that each gj,t occurs 
exactly aM times. The above expression can then be written as 
1 nM ^' 1 v 
(g"-">"- .M N g" =  g" = <g'">" (26) 
j=l 
Thus, we have equation (5). The second equation () is 
By the same argument to derive (5), the first term on the right (1/n){g,,),. In the 
second term, there are a total n(n- 1)M terms in the sum, of which only N(N- 1) 
are unique. Thus, a given g,g, occurs exactly n(n- 1)M/(N(N- 1)) times so 
that 
N 
n2M E g5't g'z = n2M N(N- 1) 1,=z,5k 
'= 
: n  ' g"g"+g" - 
,=, = '= 
(. z) (-1) 
= .- 1)(g'"); .- 1)(g,')' (s) 
Putting the simplified first term together d (28) both into (22) we obtn our 
second result in equation (6). 
References 
[Go187] 
[t-Iay91] 
[L093] 
[Me,193] 
[Orr95] 
[PW93] 
Larry Goldstein. Mean square optimality in the continuous time Robbins Monro 
procedure. Technical Report DRB-306, Dept. of Mathematics, University of 
Southern California, LA, 1987. 
Simon Haykin. Adaptive Filter Theory. Prentice Hall, New Jersey, 1991. 
Todd K. Leen and Genevieve B. Orr. Momentum and optimal stochastic search. 
In Advances in Neural Information Processing Systems, vol. 6, 1993. to appear. 
Martin Moller. Supervised learning on large redundant training sets. International 
Journal of Neural Systems, 4(1):15-25, 1993. 
Genevieve B. Orr. Dynamics and Algorithms for Stochastic learning. PhD thesis, 
Oregon Graduate Institute, 1995. 
Mark Plutowski and Halbert White. Selecting concise training sets from clean 
data. IEEE Transactions on Neural Networks, 4:305-318, 1993. 
