Generalization Error and The Expected 
Network Complexity 
Chuanyi Ji 
Dept. of Elec., Cornpt. and Syst Engr. 
Rensselaer Polytechnic Inst, itule 
Troy, NY 12180-3590 
chuanyi@ecse.rpi.edu 
Abstract 
For two layer networks with n sigmoidal hidden units, the generalization error is 
shown to be bounded by 
0(: ) + O( (:C)d 
N 
where d and N are the input dimension and the number of training samples, re- 
spectively. E represents the expectation on random number I( of hidden units 
(1 _< X _< n). The proba,bility Pr(I( = k) (1 <_ k <_ n) is dctermined by a prior 
distribution of weights, which corresponds to a Gibbs distribtttion of a regularizer. 
This relationship makes it possible to characterize explicitly how a regularization 
term affects bias/variance of networks. The bound can be obta.ined analytically 
for a large c. lass of commonly used priors. It can also be applied to estimate the 
expected network complexity E.r in practice. The result provides a quantitative 
explanation on how large networks can generalize well. 
1 Introduction 
Pegularization (or weight-deca.y) methods are widely used in supervised learning by 
adding a regularization term to an energy function. Although it is well known that 
such a regularization term effectively reduces network complexity by introducing 
more bias and less variance[4] to the networks, it is not clear whether and how the 
information given by a regularization term can be used alone to characterize the 
effective network complexity and how the estimated effective network complexity 
relates to the generalization error. This research a. ttempts to provide answers to 
these questions for two layer feedforward networks with sigmoidal hidden units. 
367 
368 Ji 
Specifically, the effective network complexity is characterized by the expected 
bet of hidden units determined by a Gibbs dist, ribution corresponding to a regulat'- 
ization term. The generalization error can then be bounded by the expected network 
complexity, and thus be tighter than the original bound given by Barron[2]. The 
new bound shows explicitly, through a bigger approximation error and a smaller 
estimation error, how a regularization term introduces more bias and less variance 
to the networks It therefore provides a quantitative explanation on how a network 
larger than necessary can also generalize well under certain conditions, which can 
not, be explained by the existing learning theory[9]. 
For a class of commonly-used regularizers, the expec'ced netxvork complexity can 
be obtained in a closed form. It is then used to estimate the expected network 
complexity for Gaussion mixture model[6]. 
2 Background and Previous Results 
A relationship has been developed bv Barron[2] between generalization error and 
network complexity for two la,yer networks nsed for function approximation. We 
will briefly describe this restlit in this section and give onr extension subsequently. 
Consider a cla.ss of two layer networks of fixed architecture with n sigmoidal hidden 
units and one (linear)output unit. Let .f,,(z; w)= w?)g(w)rz)be a net'work 
/:1 
fitnction, where w  O is the network weight vector comprising both w? and w?) 
for 1 5 l 5 ". w ) and w? are the incoming weights to the/-th hidden unit 
the weight from the /-th hidden unit to the output, respectively.   R  is 
the weight space for n hidden nnits (and input dimension d). Each sigmoid unit 
g(z) is assumed to be oftanh type: g(z)  1 as z   for 1 5 I 5  
The input is z  D  Ra. Without loss of generality, D is assumed to be a unit 
hypercube in R d, i.e., all the components of x are in [--1, 1]. 
Let j'(x) be a. target function defined in the same domain D and satisfy some 
smoot. hness conditions [2]. Consider N [raining samples independently drawn from 
some distribntion It(a:): (z,./(x)), ...,(x2v, f(:t:v)). Define au energy function e, 
where e = c + h z'"*'(" L, (w) is a regularization term as a function of w 
Ar ,,N 
for a fixed ,. A is a constant. c 
N 
1 
fimction such that 'b minimizes the 
is a quadratic error function on N training 
))2. Let ./;,,(.v;'&) be the (optimal)network 
energy fimction e: b = arg rain . The gert- 
eralization error E.q is defined to be the squared L 2 -- 
norm = f- II -- 
f(f(x) -- fn,N(X;tb))2d/t(x), vhere  is the expectation over a]] training sets of 
D 
size Ar drawn from the same distribution. Thus, the generalization error measures 
the mean squared distance between the unknoxvn function an,l the best network 
function that can be obtained for training sets of size A; The ,,e  ' ' 
 g me ahzgton error 
In the previous work by Barron, the sigmoidal hidden units arc, ,,(:)+l 
2 
show tha, t his results axe applica,ble to the class of .qt(z)'s we consider here. 
It is ea.sy to 
Generalization Error and the Expected Network Complexity 369 
Ea is shown[2] to be bounded as 
 _< o ( , ) , 
where ]?.,,, called the index of resolvability [2], can be expressed as 
R,,N = rain {11/- I1-0+ (2) 
wO, N 
where .f, is the clipped fn(a:; w) (see [2]). The index of resolvability can be further 
bounded as 
bounded as 
7.([ T' 
where O(.3) and D("rtlo9N) are the bonnds for gpproximation error 
es[iamtion error (variance), respec[ively. 
In addRion, the bound br Eq can be minimized if an additiongl regulariza.tion [erm 
L () is used in the energy hmcion to minimize the number of hidden units, i.e., 
50(toT). 
R.,,,;v _< O(,)+ O( . Therefore, the generalization en'or is 
'- l o 9 N ) 
(3) 
(bias) and 
3 Open Questions and Motivations 
Two open questions, which can not be answered by the previous result, are of the 
primary interest of this work. 
1) How do large networks generalize? 
Tle large networks refer to those with a ratio [ to be somewhat big, where W 
and N are the total number of independently modifiable weights (W  cl, for 
 large) and the nnmber of training samples, respectively. Netvorks trained witIx 
regularization terms ntay fall into this category. Such large networks are found 
o }r. a10]r to generalize well sometimes. Ilowever, when ' is big, the bomd in 
Equation (3) is too loose to loound the actual generalization error meaningfully. 
Therefore. tbr the large networks, the total nnmber of hidden units l may no longer 
be a. good estimate fbr netsyork complexity. Efforts have been made to develop 
measures on effective network complexity both analytically and empirically[I][5][10]. 
These measures depend on training data as well as a regularization term in an 
implicit way which make it dicult to see direct effects of a regularization term on 
generalization error. This naturally leads to our second qnestion. 
2) Is it possi101e to characterize network complexity for a el:,,, of networks using 
only the information given by a regularization term2? Hmv to relate the estimated 
network complexity rigorously with generalization error? 
In practice, when a regularization term (L,,,a, (w)) is used to penalize the ma g,itude 
of weights, it effectively minimizes the number of hidden units as well even tl,,,tlh an 
additional regularization term Lsr0z ) is not used. This is due to the fact th:tt, some 
of the hidden nnits may only operate in the linear region of a sigmoid when their 
This was posed as an open problem by Solla et.al. [8] 
370 Ji 
incoming weights are small and inputs are bounded. Therefore, a L,,,v(w) term ca.n 
effectively act like a L;v(n) term that reduces the effective number of hidden units, 
and thus result in a degenerate parameter space whose degrees of freedom is fewer 
than rid. This fact was not taken into consideration in the previous work, and as 
shovn later in this work, will lead to a tighter bound on 
In vhat follovs, ve will first define the expected network complexity, then use it to 
bound the generalization error. 
4 The Expected Network Complexity 
For reasons that will become apparent, ve choose to define the effective complexity 
of a feed forward two layer network as the expected mtm10er of hidden units Elf 
(1 _< If <_ ,.) which are effectively nonlinear, i.e. operating outside the central 
linear regions of their sigmoid response fnnction g(z). We define the linear region 
as an interval [ z [< b with b a positive constant. 
Consider the presynaptic input _" = w'ZPz to a hidden unit g(z), where w' is the 
incoming weight vector for the unit. Then the unit is considered to be effectively 
linear if[ z {< b for all z  D. This will happen if [ z' {< b, where z' = w'Ta ' with 
x' being any vertex of the unit hypercube D. This is because {z {_< w'T:, where : 
is the vertex of D whose elements are the sgn functions of the elements of w'. 
Next, consider network veights as random variables with a distribution p(w) = 
Aexp(-L,,,A,(w)), which corresponds to a Gibbs distribntion of a regularization 
term with a normalizing consta. ut A. C, ousider the vector :' to 10e a random vector 
also with equally probable l's and -l's. Then ] z' [< b will be a random event. The 
probability for this hidden unit to be effectively nonlineal' equals to 1 - Pr(I z [< b), 
which can be completely determined by the distributions of weights p(w) and z' 
(equally pro10able). Let, If 10e the number of hidden units which are effectively 
nonlinear. Then the probability, Pt(If = k) (1 _< /c' _< n), can be determined 
through a joint probability of k hidden units that are operating beyond the central 
linear region of sigmoid functions. The expected network complexity, EK, can then 
be obtained through Pt(It = k), which is determined by the Gibbs distribution of 
Lz,,, (w). The motivation on ntilizing such a Gibbs distri10ution comes from the fact 
that/k,N is independent of training samples but dependent of a regularization term 
vhich corresponds to a prior distril0ution of weights. Using such a formulation, as 
will be shown later, the effect of a regularization term on bias and variance can be 
characterized explicitly. 
5 A New Bound for The Generalization Error 
To develop a tighter bound For the generalization error, we consider snbspaces of 
t. he weights indexed by different munber of effectively nonlinear hidden units: O g 
02... g 0,,,. For each O j, there are j out of, hidden units which are effectively 
nonlinea. r tbr 1 5 J 5 n. Ihere[oze, the index of resolvability R,,v can be expressed 
as 
R,r = rain R;,.,,V, (4) 
Generalization Error and the Expected Network Complexity 371 
where each Rk, rain {[[ f j ][2 ,N(w) ). Next let us consider the number 
of effectively nonlinear units to be random. Since the minimum is no bigger than 
the average, we hve 
,  E,, () 
where the expectation is taken over the random variable I4 utilizing the probability 
Pt(I( = k). For cach I(, however, the two terms in K,N can be bounded as 
by the triangle inequality, where f-c, is the actual network function with n -  
hidden units operating in the region bounded by the constant b, a,nd fK is the 
corresponding network function which treats the , - I units as linea.r nnits. In 
ddition, we lmve 
L,,O)  O([I .f,-,, - J II ) + O(Zoc. V), () 
where the first term also results from the triangle inequality, and the second term 
is obtained by discretizing the degenerate parameter space $z using simHa.r tech- 
niques as in ['2] a. Applying Taylor expansion on the term [l f-,,- fK 11 u, we 
have 
II f,-x, - fc [I u  O(Y=(n- Z)). (8) 
Puting Equations (5) (6) (7) and (8) into Equa.ton (]), w, have 
1 (EI4)dlogN) + O(b(, - EK)) + o(bC), (9) 
z,  o() + o(  
where EI is the expected mnber of hidden units which are effectively nonlinear. 
If b5 0(%), wehave 
6 
A Closed Form Expression For a Class of Regularization 
Te r 111 s 
For commonly used regula. rization terms, how can we actually find the probability 
distribution of the number of (nonlinear) hidden units Pt(If = k)? And how shall 
we evaluate Elf and E? 
As a simple example, we consider a special class of prior distrilmtions for iid weights, 
i.e, p(w) = Hp(w), where zvi are the ,lcments of w  , This corresponds to 
a. large class of regularization terms which minimize the magnitudes of individual 
weights independently[7]. 
Consider each weight as a random variaMe with zero incan and a. common variance 
or. Then for large input dimension d,  -' is a.pproximately normal with zero-mean 
3Deta,ils will be given in a longer version of the pa,per in prepa,ra,tion. 
372 Ji 
and variance cr by the Central Limit Theorem[3]. Let q denote the probability that 
a unit is effectively nonlinear. We have 
b 
q: (11) 
where 
hidden units are nonlinear. 
a/, I( has a binomial distribution 
where 1 < k < n. Then 
where A = 
v 2 
'-' dy. Next consider the probability that N out of n 
Based on our (independence) assumptions on w' and 
(12) 
= (1:3) 
1 1 
--=-+zX, (14) 
E N n 
+ (1 - q)". Then the generalization error E satisfies 
1 nqd,  
< o(7 + zx) + ) 
7 Application 
As an example for applications of the t.]]eoretical resnlts, the expected netxvork com- 
plexity EIf is estimated [br Ga. ussian mixt, ure model used for time-series prediction 
(details can be found in [6])4 
In general, using only a prior distribntion of weights to estimate the network com- 
plexity E/( may lead to a less accurate measure on the effective network complexi/,y 
than incorporating information on training data also. However, if parameters of a 
regularization term also get optimized during training, as shown in this example, 
the resulting Gibbs prior distribution of weights may lead to a good estimate of the 
effective number of hidden units. 
Specifically, the corresponding Gibbs distribution p(w) of the weights from the 
Gaussion mixture is lid, which consists of a linear combination of eight Gaussian 
distributions. This function results in a skewed distribution with a sharp peak 
around the zero (see [6]). The mean and variance of the presynaptic inputs z to 
the hidden units can thus be estimated as 0.02 and 0.04, respectively. The other 
parameters used are n = 8, d = 12. b = 0. is chosen. Then q  0.4 is obtained 
through Equation (11). The effective netsyork complexity is EIV m 3 (or 4). The 
empirical result[10], which estimates the effective number of hidden units using the 
dominated eigenvalues at the hidden layer, results in about 3 effective hidden unit. s. 
4Strictly spea.]dng, the theoretical results deal ;vith regularization terms with discrete 
weights. It, can a,nd has been extended to continuous weights by D.F. McCaffrey and A.R. 
Galla,nt. Details are beyond the content of titis paper. 
Generalization Error and the Expected Network Complexity 373 
4.5 
4 
3.5 
3 
2.5 
2 
1.5 
1 
0.5 
0  
0 0.2 
variance 
increase in bias 
0.4 0.6 0.8 1 
q 
Fignre 1: Illustration of an increase A in bias and variance Bqn, as a function of q. 
A scaling actor B = 0.25 is used for the convenience of the plot. n - 20 is chosen. 
8 Discussions 
Is this new bound for the generalization tighter than the old one which takes no 
account of nwork-weight-dependent, information? If so. what does it, tell us? 
Compared with the bomd in Equation (3), the new bound results in an increase /X 
in approximatiou error (bias), and qn instea. d of r as estimat. ion error (varia,ce). 
These two terms are plotted as bract. ions of q in Figure (1). Since q is a hmction of 
cr which characterizes how strongly the magnitude of the weights is penalized, the 
larger the rr, the less the weights get penalized, the larger the q, the more hidden 
units are likely to be effectively nonlinear, thus the smaller the bia.s and larger the 
variance. When q = 1, all the hidden units are effectively nonlinear and the new 
bound reduces to the old one. This shows how a regularization term directly affects 
bias/variance. 
When the estimation error dominates, the bound for the generalization error will be 
proportional to 'nq inst, ead of n,. The valne of nq, however, depends on the choice of 
rr. For sinall or, the new bound can be nmch tighter than the old one, especially for 
large mtworks with ,. large but nq small. This will provide a practical method to 
eatingate generalization error for large networks as well as an explanation of when 
and why large networks can generalize well. 
l tow tight the bonnd really is depends on how well L,,,x (w) is chosen. Let. nc denote 
the optimal number of (nonlinear) hidden units needed to approximate .f(.). If 
L,,,,(w) is chosen so that the corresponding l>(w) is almost a delta function at n0, 
then ERzc,x' m/il,.o,,V, xvhich gives a very tight bound. Otherwise, if, for instance, 
374 Ji 
L,,N(W) penalizes network complexity so little that Et{K,N  ln,N, the bound 
will be as loose as the original one. It should also be noted that an exact value for 
the bound cannot be obtained unless some information on the unknown function f 
itself is available. 
For commonly used regularization terms, the expected network complexity can be 
estimated through a close form expression. Such expected network complexity is 
shown to be a good estimate for the actual network complexity if a Gibbs prior 
distribution of weights also gets optimized through training, and is also sharply 
peaked. More research will be done to evaluate the applicability of the theoretical 
results. 
Acknowledgement 
The support of National Science Foundation is gratefully acknowledged. 
References 
[1] S. Amari and N. Murata, "Statistical Theory of Learning Curves under En- 
tropic Loss Criterion," Neural Computalion, 5, 140-153, 1993. 
[2] A. Barron, "Approximation and Estimation Bounds for Artificial Neural Net- 
works," Proc. of The 4th Workshop on Computalional Learning Theory, 243- 
249, 1991. 
[3] W. Feller, An Introduction to Probability Theory and Its Applications, New 
york: John Wiley and Sons, 1968. 
[4] S. Geman, E. Bienenstock, and R. Doursat, "Neural Networks and the 
Bias/Variance Dilemma," Neural Compttation, 4, 1-58, 1992. 
[5] J. Moody, "Generalization, Weight Decay, and Architecture Selection for Non- 
linear Learning Systems," Proc. of Neural Information Processing Systems, 
1991. 
[6] S.J. Nowlan, and G.E. Hinton, "Simplifying Neural Networks by Soft Weight 
Sha.ring," Neural computation, 4, 473-493(1992). 
[7] R. Reed, "Pruning Algorithms-A Survey," I Trans. Neural Networks Vol. 
4,740-747, (1993). 
[8] S. Solla, "The Emergence of Generalization Ability in Learning," Presented at 
NIPS02. 
[9] V. Va. pnik, "Estimation of Dependences Based on Empirical Data," Springer- 
Vcrlag, New york, 1982. 
[10] A.S . Weigend and D.E . Rumelhart, "The Effective Dimension of the Space of 
Hidden Units," Proc. of International oint Conference on Neural Networks, 
1992. 
