Sample Size Requirements For 
Feedforward Neural Networks 
Michael J. Turmon
Cornell Univ. Electrical Engineering 
Ithaca, NY 14853 
mjt@ee.cornell.edu 
Terrence L. Fine 
Cornell Univ. Electrical Engineering 
Ithaca, NY 14853 
tlfine@ee.cornell.edu 
Abstract 
We estimate the number of training samples required to ensure that 
the performance of a neural network on its training data matches 
that obtained when fresh data is applied to the network. Existing 
estimates are higher by orders of magnitude than practice indicates. 
This work seeks to narrow the gap between theory and practice by 
transforming the problem into determining the distribution of the 
supremum of a random field in the space of weight vectors, which 
in turn is attacked by application of a recent technique called the 
Poisson clumping heuristic. 
1 INTRODUCTION AND KNOWN RESULTS
We investigate the tradeoffs among network complexity, training set size, and
statistical performance of feedforward neural networks so as to allow a reasoned choice
of network architecture in the face of limited training data. Nets are functions
$\eta(x; w)$, parameterized by their weight vector $w \in \mathcal{W} \subseteq R^d$, which take as input
real vectors $x$. For classifiers, network output is restricted to $\{0, 1\}$ while for
forecasting it may be any real number. The architecture of all nets under consideration
is $\mathcal{N}$, whose complexity may be gauged by its Vapnik-Chervonenkis (VC) dimension
$v$, the size of the largest set of inputs the architecture can classify in any desired way
('shatter'). Nets $\eta \in \mathcal{N}$ are chosen on the basis of a training set $T = \{(x_i, y_i)\}_{i=1}^n$.
These n samples are i.i.d. according to an unknown probability law P. Performance 
of a network is measured by the mean-squared error 
(w) = E(/(x; w) - y)' (1) 
= P(/(x; w) / y) (for classifiers) (2) 
and a good (perhaps not unique) net in the architecture is $w^0 = \arg\min_{w \in \mathcal{W}} \mathcal{E}(w)$.
To select a net using the training set we employ the empirical error
$$\nu_T(w) = \frac{1}{n} \sum_{i=1}^{n} (\eta(x_i; w) - y_i)^2$$
sustained by $\eta(\cdot; w)$ on the training set $T$. A good choice for a classifier is then
$w^* = \arg\min_{w \in \mathcal{W}} \nu_T(w)$. In these terms, the issue raised in the first sentence of the
section can be restated as, "How large must $n$ be in order to ensure $\mathcal{E}(w^*) - \mathcal{E}(w^0) \le \epsilon$
with high probability?"
For purposes of analysis we can avoid dealing directly with the stochastically chosen 
network w* by noting 
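$$\mathcal{E}(w^*) - \mathcal{E}(w^0) \le |\nu_T(w^*) - \mathcal{E}(w^*)| + |\nu_T(w^0) - \mathcal{E}(w^0)| \le 2 \sup_{w \in \mathcal{W}} |\nu_T(w) - \mathcal{E}(w)| \qquad (3)$$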
A bound on the last quantity is also useful in its own right. 
The best-known result is in (Vapnik, 1982), introduced to the neural network
community by (Baum & Haussler, 1989):
P( sup Ir(w) - (w)l > ) _< 6 (2.,) e-"'*/' (4) 
This remarkable bound not only involves no unknown constant factors, but holds 
independent of the data distribution P. Analysis shows that sample sizes of about 
$$n_c = (4v/\epsilon^2) \log(3/\epsilon) \qquad (5)$$
are enough to force the bound below unity, after which it drops exponentially to
zero. Taking $\epsilon = 0.1$, $v = 50$ yields $n_c = 68000$, which disagrees by orders of
magnitude with the experience of practitioners who train such simple networks.
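The arithmetic behind this figure is easy to reproduce. The following is a minimal sketch that simply evaluates (5); the function name is ours and the natural logarithm is assumed, which is what reproduces the quoted value.

```python
import math

def vapnik_critical_sample_size(eps: float, v: int) -> float:
    """Critical sample size n_c = (4 v / eps^2) * log(3 / eps) from (5).

    The natural logarithm is assumed here.
    """
    return (4.0 * v / eps**2) * math.log(3.0 / eps)

n_c = vapnik_critical_sample_size(eps=0.1, v=50)
print(f"n_c = {n_c:.0f}")  # about 68000, the figure quoted in the text
```

Halving $\epsilon$ roughly quadruples the requirement, which is the $v/\epsilon^2$ scaling discussed throughout.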
More recently, Talagrand (1994) has obtained the bound 
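$$P(\sup_{w \in \mathcal{W}} |\nu_T(w) - \mathcal{E}(w)| > \epsilon) \le K_1 \left(\frac{K_2 n \epsilon^2}{v}\right)^v e^{-2n\epsilon^2} \qquad (6)$$
yielding a sufficient condition of order $v/\epsilon^2$, but the values of $K_1$ and $K_2$ are
inaccessible so the result is of no practical use.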
Formulations with finer resolution near $\mathcal{E}(w) = 0$ are used. Vapnik (1982) bounds
$P(\sup_w |\nu_T(w) - \mathcal{E}(w)| / \mathcal{E}(w)^{1/2} > \epsilon)$ (note that $\mathcal{E}(w)^{1/2}$ is approximately
proportional to the standard deviation of $\nu_T(w)$ when $\mathcal{E}(w) \approx 0$) while
Blumer et al. (1989) and Anthony and Biggs (1992) work with
$P(\sup_w |\nu_T(w) - \mathcal{E}(w)|\, 1_{\{0\}}(\nu_T(w)) > \epsilon)$. The latter obtain the sufficient condition
$$n_c = (5.8 v/\epsilon) \log(12/\epsilon) \qquad (7)$$
for nets, if any, having $\nu_T(w) = 0$. If one is guaranteed to do reasonably well on
the training set, a smaller order of dependence results.
Rults (rmon & Fine, 1993) for percepttons and P a Gaussian mixture imply 
that at let v/280e 2 samples are needed to force E(w*) -E(w ) < 2e with high 
probability. (Here w* is the best linear discriminant with weights estimated from 
the data.) Combining with Talagrand's result, we see that the general (not suming 
small yr(w)) functional dependence is vie . 
2 APPLYING THE POISSON CLUMPING HEURISTIC 
We adopt a new approach to the problem. For the moderately large values of $n$
we anticipate, the central limit theorem informs us that $\sqrt{n}\,[\nu_T(w) - \mathcal{E}(w)]$ has
nearly the distribution of a zero-mean Gaussian random variable. It is therefore
reasonable$^1$ to suppose that
$$P(\sup_{w \in \mathcal{W}} |\nu_T(w) - \mathcal{E}(w)| > \epsilon) \approx P(\sup_{w \in \mathcal{W}} |Z(w)| > \epsilon\sqrt{n}) \le 2 P(\sup_{w \in \mathcal{W}} Z(w) > \epsilon\sqrt{n})$$
where $Z(w)$ is a Gaussian process with mean zero and covariance
$$R(w, v) = E Z(w) Z(v) = \mathrm{Cov}((y - \eta(x; w))^2, (y - \eta(x; v))^2)$$
The problem about extrema of the original empirical process is equivalent to one 
about extrema of a corresponding Gaussian process. 
The Poisson clumping heuristic (PCH), introduced in the remarkable (Aldous,
1989), provides a general tool for estimating such exceedance probabilities.
Consider the excursions above level $b$ ($= \epsilon\sqrt{n} \gg 1$) by a stochastic process $Z(w)$. At
left below, the set $\{w : Z(w) \ge b\}$ is seen as a group of "clumps" scattered in weight
space $\mathcal{W}$. The PCH says that, provided $Z$ has no long-range dependence and the
level $b$ is large, the centers of the clumps fall according to the points of a Poisson
process on $\mathcal{W}$, and the clump shapes are independent. The vertical arrows (below
right) illustrate two clump centers (points of the Poisson process); the clumps are
the bars centered about the arrows.
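[Figure: excursions of $Z(w)$ above the level $b$ (left); the corresponding clump centers and clumps in weight space (right).]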
In fact, with $p_b(w) = P(Z(w) \ge b)$, $C_b(w)$ the size of a clump located at $w$, and
$\lambda_b(w)$ the rate of occurrence of clump centers, the fundamental equation is
$$p_b(w) \approx \lambda_b(w)\, E C_b(w). \qquad (8)$$
The number of clumps in $\mathcal{W}$ is a Poisson random variable $N_b$ with parameter
$\int_{\mathcal{W}} \lambda_b(w)\,dw$. The probability of at least one clump is
$P(N_b > 0) = 1 - \exp(-\int_{\mathcal{W}} \lambda_b(w)\,dw) \approx \int_{\mathcal{W}} \lambda_b(w)\,dw$
where the approximation holds because our goal is to operate in a
regime where this probability is near zero. Letting $\bar\Phi(b) = P(N(0, 1) > b)$ and
$\sigma^2(w) = R(w, w)$, we have $p_b(w) = \bar\Phi(b/\sigma(w))$. The fundamental equation becomes
$$P(\sup_{w \in \mathcal{W}} Z(w) > b) \approx \int_{\mathcal{W}} \frac{\bar\Phi(b/\sigma(w))}{E C_b(w)}\,dw \qquad (9)$$
It remains only to find the mean clump size $E C_b(w)$ in terms of the network
architecture and the statistics of $(x, y)$.
$^1$ See ch. 7 of (Pollard, 1984) for treatment of some technical details in this limit.
3 POISSON CLUMPING FOR SMOOTH PROCESSES 
Assume $Z(w)$ has two mean-square derivatives in $w$. (If the network activation
functions have two derivatives in $w$, for example, $Z(w)$ will have two almost sure
derivatives.) $Z$ then has a parabolic approximation about some $w_0$ via its gradient
$G = \nabla Z(w)$ and Hessian matrix $H = \nabla\nabla Z(w)$ at $w_0$. Provided $Z_0 = Z(w_0) \ge b$, that is
that there is a clump at $w_0$, simple computations reveal
$$C_b(w_0) \approx \kappa_d\, \frac{(2(Z_0 - b) - G^T H^{-1} G)^{d/2}}{|{-H}|^{1/2}} \qquad (10)$$
where $\kappa_d$ is the volume of the unit ball in $R^d$ and $|\cdot|$ is the determinant. The mean
clump size is the expectation of this conditioned on $Z(w_0) \ge b$.
The same argument used to show that $Z(w)$ is approximately normal shows that $G$
and $H$ are approximately normal too. In fact,
$$E[H \mid Z(w_0) = z] = -\frac{z}{\sigma^2(w_0)}\, \Lambda(w_0)$$
$$\Lambda(w_0) = -E Z(w_0) H = -\nabla_w \nabla_w R(w_0, w)|_{w = w_0}$$
so that, since $b$ (and hence $z$) is large, the second term in the numerator of (10)
may be neglected. The expectation is then easily computed, resulting in
Lemma 1 (Smooth process clump size) Let the network activation functions
be twice continuously differentiable, and let $b \gg \sigma(w)$. Then
$$E C_b(w) \approx (2\pi)^{d/2}\, |\Lambda(w)/\sigma^2(w)|^{-1/2}\, (\sigma(w)/b)^d$$
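Substituting into (9) yields
$$P(\sup_{w \in \mathcal{W}} Z(w) > b) \approx (2\pi)^{-(d+1)/2} \int_{\mathcal{W}} \left|\frac{\Lambda(w)}{\sigma^2(w)}\right|^{1/2} \left(\frac{b}{\sigma(w)}\right)^{d-1} \exp\left(-\frac{b^2}{2\sigma^2(w)}\right) dw, \qquad (11)$$
where use of the asymptotic expansion $\bar\Phi(z) \approx (z\sqrt{2\pi})^{-1} \exp(-z^2/2)$ is justified
since $(\forall w)\, b \gg \sigma(w)$ is necessary to have the individual $P(Z(w) \ge b)$ low, let alone
the supremum. To go farther, we need information about the variance $\sigma^2(w)$ of
$(y - \eta(x; w))^2$. In general this must come from the problem at hand, but suppose
for example the process has a unique variance maximum $\bar\sigma^2$ at $\bar w$. Then, since
the level $b$ is large, we can use Laplace's method to approximate the $d$-dimensional
integral.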
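How quickly the asymptotic expansion of the Gaussian tail becomes accurate can be checked numerically. The sketch below compares the exact tail, computed from the standard library's complementary error function, with the leading-order expansion used above; the function names are ours.

```python
import math

def gaussian_tail(z: float) -> float:
    """Exact upper tail of a standard normal, P(N(0,1) > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def gaussian_tail_asymptotic(z: float) -> float:
    """Leading-order expansion (z * sqrt(2*pi))^-1 * exp(-z^2 / 2)."""
    return math.exp(-0.5 * z * z) / (z * math.sqrt(2.0 * math.pi))

for z in (2.0, 3.0, 5.0, 8.0):
    exact = gaussian_tail(z)
    approx = gaussian_tail_asymptotic(z)
    print(f"z={z:4.1f}  exact={exact:.3e}  asymptotic={approx:.3e}  ratio={approx / exact:.3f}")
```

The expansion overestimates the tail, and the ratio approaches one as $z$ grows, which is why it is reserved for the regime $b \gg \sigma(w)$.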
Laplace's method finds asymptotic expansions for integrals
$$\int_{\mathcal{W}} g(w) \exp(-f(w)^2/2)\,dw$$
when $f(w)$ is $C^2$ with a unique positive minimum at $w_0$ in the interior of $\mathcal{W} \subseteq R^d$,
and $g(w)$ is positive and continuous. Suppose $f(w_0) \gg 1$ so that the exponential
factor is decreasing much faster than the slowly varying $g$. Expanding $f$ to second
order about $w_0$, substituting into the exponential, and performing the integral shows
that
$$\int_{\mathcal{W}} g(w) \exp(-f(w)^2/2)\,dw \approx g(w_0) \exp(-f(w_0)^2/2)\,(2\pi)^{d/2}\, |f(w_0) K|^{-1/2}$$
where K = VVf(w)lo, the Hessian of f. See (Wong_, 1989) for a proof. Applying 
this to (11) and using the asymptotic expansion for  in reverse yields 
Theorem 1 Let the network activation functions be twice continuously cliffetch- 
liable. Let the variance have a unique mazimum  at  in the interior of YV and 
the level b >> . Then the PCH estimate of ezceedance probability is given by 
P( sup Z(w) > b) _ IA()l/' 
Ew - la()- r()l / (b/a) (12 
w r() = vv(w, v)l==. ro, - r i positive-definte at ; 
it is -1/2 the Hessian of a(w). The leading constant thus stscfly eceeds unity. 
The above probability is just $P(Z(\bar w) \ge b)$ multiplied by a factor accounting for the
other networks in the supremum. Letting $b = \epsilon\sqrt{n}$ reveals
$$n_c = \frac{\bar\sigma^2}{\epsilon^2} \log\left(\frac{|\Lambda(\bar w)|}{|\Lambda(\bar w) - \Gamma(\bar w)|}\right) \qquad (13)$$
samples force $P(\sup_w |\nu_T(w) - \mathcal{E}(w)| \ge \epsilon)$ below unity. If the variance maximum is
not unique but occurs over a $d$-dimensional set within $\mathcal{W}$, the sample size estimate
becomes proportional to $1/\epsilon^2$, with the constant of proportionality playing the role
of the VC dimension $v$; this
is similar to Vapnik's bound although we retain dependence on $P$ and $\mathcal{N}$.
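The passage from the integral (11) to the closed form of Theorem 1 can be checked numerically in one dimension. The following is a sketch under assumptions of our own, not an example from the paper: $d = 1$, $\mathcal{W} = [-1, 1]$, variance $\sigma^2(w) = 1 - \kappa w^2$ with its maximum at $\bar w = 0$, and $\Lambda(w)$ held constant at $\lambda$, so that $\Lambda - \Gamma = \kappa$. The constants `LAM`, `KAPPA`, and `B` are illustrative values.

```python
import math

def gaussian_tail(z):
    """Exact upper tail of a standard normal, P(N(0,1) > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical one-dimensional setting (d = 1), chosen only for illustration:
# sigma^2(w) = 1 - KAPPA * w^2 on W = [-1, 1], maximized at w_bar = 0, with
# Lambda(w) constant at LAM. Then Lambda - Gamma = -(1/2) * (sigma^2)'' = KAPPA,
# which must not exceed LAM.
LAM, KAPPA, B = 1.0, 0.25, 5.0

def integrand_11(w):
    var = 1.0 - KAPPA * w * w
    # With d = 1 the factor (b / sigma)^(d - 1) equals one.
    return math.sqrt(LAM / var) * math.exp(-B * B / (2.0 * var))

def pch_integral_11(num_cells=20000):
    """Midpoint-rule evaluation of (11) over W = [-1, 1]."""
    step = 2.0 / num_cells
    total = sum(integrand_11(-1.0 + (i + 0.5) * step) for i in range(num_cells))
    return total * step / (2.0 * math.pi)  # (2*pi)^(-(d+1)/2) with d = 1

def theorem_1_estimate():
    """Closed form (12): sqrt(|Lambda| / |Lambda - Gamma|) * tail(b / sigma_bar)."""
    sigma_bar = 1.0
    return math.sqrt(LAM / KAPPA) * gaussian_tail(B / sigma_bar)

numeric, closed = pch_integral_11(), theorem_1_estimate()
print(f"integral (11): {numeric:.3e}  Theorem 1 (12): {closed:.3e}  ratio: {numeric / closed:.3f}")
```

In this setting the two values should agree to within a few percent, with the agreement expected to improve as the level `B` grows.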
The probability (12) is determined by behavior near the maximum-variance point,
which for example in classification is where $\mathcal{E}(w) = 1/2$. Such nets are uninteresting
as classifiers, and certainly it is undesirable for them to dominate the entire
probability. This problem is avoided by replacing $Z(w)$ with $Z(w)/\sigma(w)$, which
additionally allows a finer resolution where $\mathcal{E}(w)$ nears zero. Indeed, for classification,
if $n$ is such that with high probability
$$\sup_{w \in \mathcal{W}} \frac{|\nu_T(w) - \mathcal{E}(w)|}{\sigma(w)} = \sup_{w \in \mathcal{W}} \frac{|\nu_T(w) - \mathcal{E}(w)|}{\sqrt{\mathcal{E}(w)(1 - \mathcal{E}(w))}} < \epsilon, \qquad (14)$$
then $\nu_T(w^*) = 0 \Rightarrow \mathcal{E}(w^*) < \epsilon^2 (1 + \epsilon^2)^{-1} \approx \epsilon^2 \ll \epsilon$. Near $\nu_T(w^*) = 0$,
condition (14) is much more powerful than the corresponding unnormalized one. Sample
size estimates using this setup give results having a functional form similar to (7).
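The implication drawn from (14) for a net with zero training error is a one-line computation, sketched here for classification, where the variance of a 0/1 error is $\mathcal{E}(w)(1 - \mathcal{E}(w))$. The function names and the value of `eps` are ours.

```python
import math

def normalized_deviation(true_error: float, empirical_error: float) -> float:
    """|nu - E| / sqrt(E * (1 - E)), the classification form of the ratio in (14)."""
    return abs(empirical_error - true_error) / math.sqrt(true_error * (1.0 - true_error))

def max_true_error_given_zero_training_error(eps: float) -> float:
    """Largest E compatible with (14) when nu = 0: eps^2 / (1 + eps^2)."""
    return eps**2 / (1.0 + eps**2)

eps = 0.1
bound = max_true_error_given_zero_training_error(eps)
at_boundary = normalized_deviation(bound, 0.0)       # equals eps up to rounding
beyond = normalized_deviation(2.0 * bound, 0.0)      # exceeds eps, so (14) fails
print(bound, at_boundary, beyond > eps)
```

With `eps = 0.1` the admissible true error is just under 0.01, an order of magnitude below what the unnormalized criterion would guarantee at the same `eps`.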
4 ANOTHER MEANS OF COMPUTING CLUMP SIZE 
Conditional on there being a clump center at $w$, the upper bound
$$C_b(w) \le D_b(w) \equiv \int_{\mathcal{W}} 1_{[0,\infty)}(Z(w') - b)\,dw' \qquad (15)$$
is evidently valid: the volume of the clump at $w$ is no larger than the total volume of
all clumps. (The right hand side is indeed a function of $w$ because we condition on
occurrence of a clump center at $w$.) The bound is an overestimate when the number
$N_b$ of clumps exceeds one, but recall that we are in a regime where $b$ (equivalently
$n$) is large enough so that $P(N_b > 1)/P(N_b = 1) \approx \int_{\mathcal{W}} \lambda_b(w)\,dw \ll 1$. Thus error
in (15) due to this source is negligible. To compute its mean, we approximate
$$E D_b(w) = \int_{\mathcal{W}} P(Z(w') \ge b \mid w \text{ a clump center})\,dw'$$
$$\approx \int_{\mathcal{W}} P(Z(w') \ge b \mid Z(w) \ge b)\,dw' \qquad (16)$$
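The point is that occurrence of a clump center at $w_0$ is a smaller class of events than
merely $Z(w_0) \ge b$: the latter can arise from a clump center at a nearby $w \in \mathcal{W}$
capturing $w_0$. Since $Z(w)$ and $Z(w')$ are jointly normal, abbreviate $\sigma = \sigma(w)$,
$\sigma' = \sigma(w')$, $\rho = \rho(w, w') = R(w, w')/(\sigma\sigma')$, and let
$$\zeta = \zeta(w, w') = \frac{\sigma}{\sigma'}\, \frac{1 - \rho\sigma'/\sigma}{(1 - \rho^2)^{1/2}} \qquad (17)$$
$$= ((1 - \rho)/(1 + \rho))^{1/2} \quad \text{(constant variance case)} \qquad (18)$$
Evaluating the conditional probabilities of (16) presents no problem, and we obtain
Lemma 2 (Clump size estimate) For $b \gg \sigma$ the mean clump size is
$$E C_b(w) \approx E D_b(w) \approx \int_{\mathcal{W}} \bar\Phi((b/\sigma)\zeta)\,dw' \qquad (19)$$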
Remark 1. This integral will be used in (9) to find
$$P(\sup_{w \in \mathcal{W}} Z(w) > b) \approx \int_{\mathcal{W}} \frac{\bar\Phi(b/\sigma)}{\int_{\mathcal{W}} \bar\Phi((b/\sigma)\zeta)\,dw'}\,dw \qquad (20)$$
Since $b$ is large, the main contribution to the outer integral occurs for $w$ near a
variance maximum, i.e. for $\sigma'/\sigma \le 1$. If the variance is constant then all $w \in \mathcal{W}$
contribute. In either case $\zeta$ is nonnegative. By lemma 1 we expect (19) to be, as
a function of $b$, of the form $(\mathrm{const}\,\sigma/b)^p$ for, say, $p = d$. In particular, we do not
anticipate the exponentially small clump sizes resulting if $(\forall w')\,\zeta(w, w') \ge M \gg 0$.
Therefore $\zeta$ should approach zero over some range of $w'$, which happens only when
$\rho \approx 1$, that is, for $w'$ near $w$. The behavior of $\rho(w, w')$ for $w' \approx w$ is the key to
finding the clump size.
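The $(\mathrm{const}\,\sigma/b)^p$ behavior anticipated in Remark 1 can be seen in a one-dimensional sketch. The setting is an assumption of ours rather than an example from the paper: a constant-variance process ($\sigma = 1$) on $\mathcal{W} = [-10, 10]$ with the hypothetical correlation $\rho(h) = \exp(-h^2/2)$ at separation $h = w' - w$, for which $\Lambda = 1$ in Lemma 1. The code evaluates (19) by a midpoint rule and compares it with the Lemma 1 value $\sqrt{2\pi}/b$.

```python
import math

def gaussian_tail(z):
    """Exact upper tail of a standard normal, P(N(0,1) > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def zeta_constant_variance(rho):
    """Equation (18): zeta = sqrt((1 - rho) / (1 + rho))."""
    return math.sqrt((1.0 - rho) / (1.0 + rho))

def rho_squared_exponential(h):
    # Hypothetical correlation function, chosen for illustration only;
    # it gives Lambda = 1 in Lemma 1.
    return math.exp(-0.5 * h * h)

def clump_size_lemma_2(b, half_width=10.0, num_cells=20000):
    """Midpoint-rule evaluation of (19) with sigma = 1, w = 0, W = [-half_width, half_width]."""
    step = 2.0 * half_width / num_cells
    total = 0.0
    for i in range(num_cells):
        h = -half_width + (i + 0.5) * step
        total += gaussian_tail(b * zeta_constant_variance(rho_squared_exponential(h)))
    return total * step

def clump_size_lemma_1(b):
    """Lemma 1 with d = 1, sigma = 1, Lambda = 1: sqrt(2*pi) / b."""
    return math.sqrt(2.0 * math.pi) / b

b = 5.0
ratio = clump_size_lemma_2(b) / clump_size_lemma_1(b)
print(f"Lemma 2 / Lemma 1 clump size at b = {b}: {ratio:.3f}")
```

Expanding $\zeta \approx |h|/2$ for small $h$ suggests a ratio near $2/\pi \approx 0.64$, so in this setting the two estimates share the $1/b$ dependence and differ only by a constant factor.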
Remark 2. There is a simple interpretation of the clump size; it represents the
volume of $w' \in \mathcal{W}$ for which $Z(w')$ is highly correlated with $Z(w)$. The exceedance
probability is a sum of the point exceedance probabilities (the numerator of (20)),
each down-weighted according to how many other points are correlated with it. In effect,
the space $\mathcal{W}$ is partitioned into regions that tend to "have exceedances together,"
with a large clump size $E C_b(w)$ indicating a large region. The overall probability
can be viewed as a sum over all these regions of the corresponding point exceedance
probability. This has a similarity to the Vapnik argument which lumps networks
together according to their $n^v/v!$ possible actions on $n$ items in the training set. In
this sense the mean clump size is a fundamental quantity expressing the ability of
an architecture to generalize.
5 EMPIRICAL ESTIMATES OF CLUMP SIZE 
The clump size estimate of lemma 2 is useful in its own right if one has information
about the covariance of $Z$. Other known techniques of finding $E C_b(w)$ exploit
special features of the process at hand (e.g. smoothness or similarity to other
well-studied processes); the above expression is valid for any covariance structure. In
this section we show how one may estimate the clump size using the training set,
and thus obtain probability approximations in the absence of analytical information
about the unknown $P$ and the potentially complex network architecture $\mathcal{N}$.
Here is a practical way to approximate the integral giving $E D_b(w)$. For $\gamma < 1$ define
a set of significant $w'$
$$S_\gamma(w) = \{w' \in \mathcal{W} : \zeta(w, w') \le \gamma\}, \qquad V_\gamma(w) = \mathrm{vol}(S_\gamma(w)); \qquad (21)$$
then monotonicity of $\bar\Phi$ yields $E D_b(w) \ge \int_{S_\gamma} \bar\Phi((b/\sigma)\zeta)\,dw' \ge V_\gamma(w)\,\bar\Phi((b/\sigma)\gamma)$.
This apparently crude lower bound for $\bar\Phi$ is accurate enough near the origin to give
satisfactory results in the cases we have studied. For example, we can characterize
the covariance $R(w, w')$ of the smooth process of lemma 1 and thus find its $\zeta$
function. The bound above is then easily calculated and differs by only small constant
factors from the clump size in the lemma.
The lower bound for $E D_b(w)$ yields the upper bound
$$P(\sup_{w \in \mathcal{W}} Z(w) > b) \le \int_{\mathcal{W}} \frac{\bar\Phi(b/\sigma)}{V_\gamma(w)\,\bar\Phi((b/\sigma)\gamma)}\,dw \qquad (22)$$
We call $V_\gamma(w)$ the correlation volume, as it represents those weight vectors $w'$ whose
errors $Z(w')$ are highly correlated with $Z(w)$; one simple way to estimate the
correlation volume is as follows. Select a weight $w'$ and using the training set compute
$$(y_1 - \eta(x_1; w))^2, \ldots, (y_n - \eta(x_n; w))^2 \quad \text{and} \quad (y_1 - \eta(x_1; w'))^2, \ldots, (y_n - \eta(x_n; w'))^2.$$
It is then easy to estimate $\sigma^2$, $\sigma'^2$, and $\rho$, and finally $\zeta(w, w')$, which is compared
to the chosen $\gamma$ to decide if $w' \in S_\gamma(w)$.
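The membership test just described can be sketched as follows. This is a plug-in estimate assembled from the definitions in (17) and (21), not code from the paper: the function names are ours, sample moments replace the population quantities, and the toy 0/1 error vectors are made-up data.

```python
import math

def estimate_zeta(sq_err_w, sq_err_wp):
    """Plug-in estimate of zeta(w, w') in (17) from paired squared errors.

    sq_err_w[i]  = (y_i - eta(x_i; w))^2
    sq_err_wp[i] = (y_i - eta(x_i; w'))^2
    Assumes both sequences have nonzero sample variance and |rho| < 1.
    """
    n = len(sq_err_w)
    mean_w = sum(sq_err_w) / n
    mean_wp = sum(sq_err_wp) / n
    var_w = sum((a - mean_w) ** 2 for a in sq_err_w) / n
    var_wp = sum((c - mean_wp) ** 2 for c in sq_err_wp) / n
    cov = sum((a - mean_w) * (c - mean_wp) for a, c in zip(sq_err_w, sq_err_wp)) / n
    sigma, sigma_p = math.sqrt(var_w), math.sqrt(var_wp)
    rho = cov / (sigma * sigma_p)
    return (sigma / sigma_p) * (1.0 - rho * sigma_p / sigma) / math.sqrt(1.0 - rho * rho)

def in_significant_set(sq_err_w, sq_err_wp, gamma):
    """Decide whether w' belongs to S_gamma(w), i.e. whether zeta <= gamma."""
    return estimate_zeta(sq_err_w, sq_err_wp) <= gamma

# Toy 0/1 classification errors (made-up data): equal variances, correlation 0.5.
errs_w = [1, 1, 1, 1, 0, 0, 0, 0]
errs_wp = [1, 1, 1, 0, 0, 0, 0, 1]
zeta_hat = estimate_zeta(errs_w, errs_wp)
print(zeta_hat, in_significant_set(errs_w, errs_wp, gamma=0.6))
```

For the toy vectors the variances are equal, so the estimate reduces to the constant-variance form (18) with $\rho = 0.5$.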
The difficulty is that for large $d$, $S_\gamma(w)$ is far smaller than any
approximately-enclosing set. Simple Monte Carlo sampling and even importance sampling methods
fail to estimate the volume of such high-dimensional convex bodies because so few
hits occur in probing the space (Lovász, 1991). The simplest way to concentrate the
search is to let $w' = w$ except in one coordinate and probe along each coordinate
axis. The correlation volume is approximated as the product of the one-dimensional
measurements.
Simulation studies of the above approach have been performed for a perceptron
architecture with input uniform over $[-1, 1]^d$. The integral (22) is computed by Monte
Carlo sampling, and based on a training set of size $100d$, $V_\gamma(w)$ is computed at each
point via the above method. The result is that an estimated sample size of $5.4d/\epsilon^2$
is enough to ensure (14) with high probability. For nets, if any, having $\nu_T(w) = 0$,
sample sizes larger than $5.4d/\epsilon$ will ensure reliable generalization, which compares
favorably with (7).
6 SUMMARY AND CONCLUSIONS 
To find realistic estimates of sample size we transform the original problem into
one of finding the distribution of the supremum of a derived Gaussian random
field, which is defined over the weight space of the network architecture. The
latter problem is amenable to solution via the Poisson clumping heuristic. In terms
of the PCH the question becomes one of estimating the mean clump size, that
is, the typical volume of an excursion above a given level by the random field.
In the "smooth" case we directly find the clump volume and obtain estimates of
sample size that are (correctly) of order $v/\epsilon^2$. The leading constant, while explicit,
depends on properties of the architecture and the data, which has the advantage
of being tailored to the given problem but the potential disadvantage of our having
to compute them.
We also obtain a useful estimate for the clump size of a general process in terms of
the correlation volume $V_\gamma(w)$. For normalized error, (22) becomes approximately
$$P(\sup_{w \in \mathcal{W}} Z(w)/\sigma(w) > b) \le \gamma\, \mathrm{vol}(\mathcal{W})\, E[V_\gamma(w)^{-1}]\, e^{-(1 - \gamma^2) b^2/2}$$
where the expectation is taken with respect to a uniform distribution on $\mathcal{W}$. The
probability of reliable generalization is roughly given by an exponentially decreasing
factor (the exceedance probability for a single point) times a number representing
degrees of freedom. The latter is the mean size of an equivalence class of
"similarly-acting" networks. The parallel with the Vapnik approach, in which a worst-case
exceedance probability is multiplied by a growth function bounding the number of
classes of networks in $\mathcal{N}$ that can act differently on $n$ pieces of data, is striking. In
this fashion the correlation volume is an analog of the VC dimension, but one that
depends on the interaction of the data and the architecture.
Ltly, we have proposed practical methods of estimating the correlation volume 
empirically from the trMning data. Initial simulation studies bed on a perceptron 
with input uniform on a region in R a show that these approximations can indeed 
yield informative estimates of sample complexity. 
References 
Aldous, D. 1989. Probability Approximations via the Poisson Clumping Heuristic.
Springer.
Anthony, M., & Biggs, N. 1992. Computational Learning Theory. Cambridge Univ.
Baum, E., & Haussler, D. 1989. What size net gives valid generalization? Pages
81-90 of: Touretzky, D. S. (ed), NIPS 1.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. 1989. Learnability
and the Vapnik-Chervonenkis dimension. Jour. Assoc. Comp. Mach., 36, 929-965.
Lovász, L. 1991. Geometric Algorithms and Algorithmic Geometry. In: Proc.
Internat. Congr. Mathematicians. The Math. Soc. of Japan.
Pollard, D. 1984. Convergence of Stochastic Processes. Springer.
Talagrand, M. 1994. Sharper bounds for Gaussian and empirical processes. Ann.
Probab., 22, 28-76.
Turmon, M. J., & Fine, T. L. 1993. Sample Size Requirements of Feedforward
Neural Network Classifiers. In: IEEE 1993 Intern. Sympos. Inform. Theory.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer.
Wong, R. 1989. Asymptotic Approximations of Integrals. Academic.
