Learning Curves, Model Selection and 
Complexity of Neural Networks 
Noboru Murata 
Department of Mathematical Engineering and Information Physics 
University of Tokyo, Tokyo 113, JAPAN 
E-mail: murasa;. t. u-tokyo. ac. jp 
Shuji Yoshizawa 
Dept. Mech. Info. 
University of Tokyo 
Shun-ichi Amari 
Dept. Math. Eng. and Info. Phys. 
University of Tokyo 
Abstract 
Learning curves show how a neural network is improved as the 
number of training examples increases and how it is related to 
the network complexity. The present paper clarifies the asymptotic 
properties of two learning curves and the relation between them, 
one concerning the predictive loss or generalization loss and the 
other the training loss. The result gives a natural definition of the 
complexity of a neural network. Moreover, it provides a new 
criterion of model selection. 
1 INTRODUCTION 
The learning curve shows how well the behavior of a neural network is improved as 
the number of training examples increases and how it is related to the complexity 
of neural networks. This provides us with a criterion for choosing an adequate 
network in relation to the number of training examples. Some researchers have 
attacked this problem by using statistical mechanical methods (see Levin et al. 
[1990], Seung et al. [1991], etc.) and some by information theory and algorithmic 
methods (see Baum and Haussler [1989], etc.). The present paper elucidates 
asymptotic properties of the learning curve from the statistical point of view, 
giving a new criterion for model selection. 
2 STATEMENT OF THE PROBLEM 
Let us consider a stochastic neural network, which is parameterized by a set of $m$ 
weights $\theta = (\theta^1, \ldots, \theta^m)$ and whose input-output relation is 
specified by a conditional probability $p(y|x,\theta)$. In other words, for an input 
signal $x \in R^{n_{in}}$, the probability distribution of the output 
$y \in R^{n_{out}}$ is given by $p(y|x,\theta)$. 
A typical form of the stochastic neural network is as follows: let us consider a 
multi-layered network $f(x,\theta)$ where $\theta$ is a set of $m$ parameters 
$\theta = (\theta^1, \ldots, \theta^m)$ and its components correspond to weights and 
thresholds of the network. When some input $x$ is given, the network produces an output 
$$y = f(x,\theta) + \eta(x), \tag{1}$$ 
where $\eta(x)$ is noise whose conditional distribution is given by $a(\eta|x)$. Then the 
conditional distribution of the network, which specifies the input-output relation, 
is given by 
$$p(y|x,\theta) = a(y - f(x,\theta)\,|\,x). \tag{2}$$ 
We define a training sample $\xi_t = \{(x_1,y_1), \ldots, (x_t,y_t)\}$ as a set of $t$ 
examples generated from the true conditional distribution $q(y|x)$, where $x_i$ is 
generated from a probability distribution $r(x)$ independently. We should note that 
both $r(x)$ and $q(y|x)$ are unknown and we need not assume the faithfulness of the 
model, that is, we do not assume that there exists a parameter $\theta^*$ which 
realizes the true distribution $q(y|x)$ such that $p(y|x,\theta^*) = q(y|x)$. 
Our purpose is to find an appropriate parameter $\theta$ which realizes a good 
approximation $p(y|x,\theta)$ to $q(y|x)$. For this purpose, we use a loss function 
$$L(\theta) = D(r; q|p(\theta)) + S(\theta) \tag{3}$$ 
as a criterion to be minimized, where $D(r; q|p(\theta))$ represents a general 
divergence measure between two conditional probabilities $q(y|x)$ and $p(y|x,\theta)$ 
in the expectation form under the true input-output probability 
$$D(r; q|p(\theta)) = \int r(x) q(y|x) k(x,y,\theta)\, dx\, dy \tag{4}$$ 
and $S(\theta)$ is a regularization term to fit the smoothness condition of outputs 
(Moody [1992]). So the loss function is rewritten in an expectation form 
$$L(\theta) = \int r(x) q(y|x) d(x,y,\theta)\, dx\, dy, \qquad d(x,y,\theta) = k(x,y,\theta) + S(\theta), \tag{5}$$ 
and $d(x,y,\theta)$ is called the pointwise loss function. 
A typical case of the divergence $D$ of the multi-layered network $f(x,\theta)$ with 
noise is the squared error 
$$D(r; q|p(\theta)) = \int r(x) q(y|x) \|y - f(x,\theta)\|^2\, dx\, dy. \tag{6}$$ 
The error function of an ordinary multi-layered network is in this form, and the 
conventional Back-Propagation method is derived from this type of loss function. 
Another typical case is the Kullback-Leibler divergence 
$$D(r; q|p(\theta)) = \int r(x) q(y|x) \log \frac{q(y|x)}{p(y|x,\theta)}\, dx\, dy. \tag{7}$$ 
The integration $\int r(x) q(y|x) \log q(y|x)\, dx\, dy$ is a constant called a 
conditional entropy, and we usually use the following abbreviated form instead of 
the previous divergence: 
$$D(r; q|p(\theta)) = -\int r(x) q(y|x) \log p(y|x,\theta)\, dx\, dy. \tag{8}$$ 
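As a worked special case connecting the two typical divergences, suppose (an assumption made only for this illustration, not stated in the text) that the noise is Gaussian with covariance $\sigma^2 I$ and does not depend on $x$, and write $n_{out}$ for the dimension of $y$. Then the log-likelihood loss reduces to the squared error up to a scale factor and an additive constant:

```latex
a(\eta \mid x) = (2\pi\sigma^2)^{-n_{out}/2}
                 \exp\!\left(-\frac{\|\eta\|^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
-\log p(y \mid x, \theta)
  = \frac{1}{2\sigma^2}\,\|y - f(x,\theta)\|^2
  + \frac{n_{out}}{2}\,\log(2\pi\sigma^2).
```

Under this assumption, and with $\sigma$ held fixed, the squared-error divergence and the abbreviated Kullback-Leibler form are minimized by the same parameter.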
Next, we define an optimum of the parameter in the sense of the loss function that 
we introduced. We denote by $\theta^*$ the optimal parameter that minimizes the loss 
function $L(\theta)$, that is, 
$$L(\theta^*) = \min_\theta L(\theta), \tag{9}$$ 
and we regard $p(y|x,\theta^*)$ as the best realization of the model. 
When a training sample $\xi_t$ is given, we can also define an empirical loss function: 
$$\hat L(\theta) = D(\hat r; \hat q|p(\theta)) + S(\theta), \tag{10}$$ 
where $\hat r$, $\hat q$ are the empirical distributions given by the sample $\xi_t$, 
that is, 
$$D(\hat r; \hat q|p(\theta)) = \frac{1}{t} \sum_{i=1}^{t} k(x_i, y_i, \theta), \qquad (x_i, y_i) \in \xi_t. \tag{11}$$ 
In practical cases, we consider the empirical loss function and search for the 
quasi-optimal parameter $\hat\theta$ defined by 
$$\hat L(\hat\theta) = \min_\theta \hat L(\theta), \tag{12}$$ 
because the true distributions $r(x)$ and $q(y|x)$ are unknown and we can only use 
examples $(x_i, y_i)$ observed from the true distribution $r(x)q(y|x)$. We should note 
that the quasi-optimal parameter $\hat\theta$ is a random variable depending on the 
sample $\xi_t$, each element of which is chosen randomly. 
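To make the empirical loss and the quasi-optimal parameter concrete, the following is a minimal sketch, assuming a hypothetical one-parameter model $f(x,\theta) = \theta x$ with the squared-error pointwise loss and no regularization term ($S(\theta) = 0$). The data generator, the true slope 1.5 and the noise level 0.3 are illustrative assumptions, not values from the paper.

```python
import random


def pointwise_loss(x, y, theta):
    """Squared-error pointwise loss d(x, y, theta), assuming S(theta) = 0."""
    return (y - theta * x) ** 2


def empirical_loss(sample, theta):
    """Empirical loss: the pointwise loss averaged over the training sample."""
    return sum(pointwise_loss(x, y, theta) for x, y in sample) / len(sample)


def quasi_optimal(sample):
    """Closed-form minimiser of the empirical loss for f(x, theta) = theta * x."""
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    return sxy / sxx


def draw_sample(t, rng, theta_true=1.5, noise_std=0.3):
    """Hypothetical data source: x ~ N(0, 1), y = theta_true * x + Gaussian noise."""
    sample = []
    for _ in range(t):
        x = rng.gauss(0.0, 1.0)
        sample.append((x, theta_true * x + rng.gauss(0.0, noise_std)))
    return sample


rng = random.Random(0)
sample = draw_sample(200, rng)
theta_hat = quasi_optimal(sample)
print(theta_hat, empirical_loss(sample, theta_hat))
```

Because the sample is random, `theta_hat` changes with the seed; this is the sense in which the quasi-optimal parameter is a random variable.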
The following lemma guarantees that we can use the empirical loss function instead 
of the actual loss function when the number of examples $t$ is large. 
Lemma 1 If the number of examples $t$ is large enough, it is shown that the 
quasi-optimal parameter $\hat\theta$ is normally distributed around the optimal 
parameter $\theta^*$, that is, 
$$\hat\theta \sim N\!\left(\theta^*, \frac{1}{t} Q^{-1} G Q^{-1}\right), \tag{13}$$ 
where 
$$G = \int r(x) q(y|x) \nabla d(x,y,\theta^*) \nabla d(x,y,\theta^*)^T\, dx\, dy, \tag{14}$$ 
$$Q = \int r(x) q(y|x) \nabla\nabla d(x,y,\theta^*)\, dx\, dy, \tag{15}$$ 
and $\nabla$ denotes the differential operator with respect to $\theta$. 
This lemma is proved by using the usual statistical methods. 
3 LEARNING PROCEDURE 
In many cases, however, it is difficult to obtain the quasi-optimal parameter 
$\hat\theta$ by minimizing the equation (10) directly. We therefore often use a 
stochastic descent method to get an approximation to the quasi-optimal parameter 
$\hat\theta$. 
Definition 1 (Stochastic Descent Method) In each learning step, an example 
is re-sampled from the given sample $\xi_t$ randomly, and the following modification 
is applied to the parameter $\theta_n$ at step $n$, 
$$\theta_{n+1} = \theta_n - \varepsilon \nabla d(x_{i(n)}, y_{i(n)}, \theta_n), \tag{16}$$ 
where $\varepsilon$ is a positive value called a learning coefficient and 
$(x_{i(n)}, y_{i(n)})$ is the re-sampled example at step $n$. 
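The update rule of the stochastic descent method can be sketched as follows. This is a minimal illustration, assuming the same hypothetical one-parameter model $f(x,\theta) = \theta x$ with squared-error pointwise loss; the step count, learning coefficient and synthetic data are arbitrary choices made for the example.

```python
import random


def grad_pointwise_loss(x, y, theta):
    """Gradient with respect to theta of d = (y - theta * x)^2."""
    return -2.0 * x * (y - theta * x)


def stochastic_descent(sample, theta0, eps, n_steps, rng):
    """Apply the re-sampling update n_steps times, starting from theta0."""
    theta = theta0
    for _ in range(n_steps):
        # Re-sample one example uniformly (with replacement) from the given sample.
        x, y = rng.choice(sample)
        theta = theta - eps * grad_pointwise_loss(x, y, theta)
    return theta


# Hypothetical training sample: x ~ N(0, 1), y = 1.5 * x + Gaussian noise.
rng = random.Random(1)
sample = []
for _ in range(100):
    x = rng.gauss(0.0, 1.0)
    sample.append((x, 1.5 * x + rng.gauss(0.0, 0.3)))

theta_n = stochastic_descent(sample, theta0=0.0, eps=0.01, n_steps=5000, rng=rng)

# Exact minimiser of the empirical loss, for comparison.
theta_hat = sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)
print(theta_n, theta_hat)
```

With a small learning coefficient the learned value fluctuates around the exact empirical minimiser instead of converging to it; a larger `eps` widens the fluctuation.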
This is a sequential learning method, and the operation of random sampling 
from $\xi_t$ in each learning step is called the re-sampling plan. The parameter 
$\theta_n$ at step $n$ is a random variable as a function of the re-sampled sequence 
$\omega = \{(x_{i(1)}, y_{i(1)}), \ldots, (x_{i(n)}, y_{i(n)})\}$. However, if the 
initial value of $\theta$ is appropriate (this assumption prevents being stuck in 
local minima) and if the learning step $n$ is large enough, it is shown that the 
learned parameter $\theta_n$ is normally distributed around the quasi-optimal 
parameter. 
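Lemma 2 If the learning step $n$ is large enough and the learning coefficient 
$\varepsilon$ is small enough, the parameter $\theta_n$ is normally distributed 
asymptotically, that is, 
$$\theta_n \sim N(\hat\theta, \varepsilon V), \tag{17}$$ 
where $V$ satisfies the following relation 
$$\hat G = \hat Q V + V \hat Q, \qquad \hat G = \frac{1}{t} \sum_{i=1}^{t} \nabla d(x_i, y_i, \hat\theta) \nabla d(x_i, y_i, \hat\theta)^T, \qquad \hat Q = \frac{1}{t} \sum_{i=1}^{t} \nabla\nabla d(x_i, y_i, \hat\theta). \tag{18}$$ 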
In tile IBlloving discussion, we assume that , is large enough and c is small enough, 
and we denote the learned parameter by 
0,,). (19) 
The distribution of the randon variable , theretbre, can be regarded as the normal 
distrii>ution N(O, eV). 
4 LEARNING CURVES 
It is important to evaluate the difference between two quantities $L(\tilde\theta)$ 
and $\hat L(\tilde\theta)$. The quantity $L(\tilde\theta)$ is called the predictive 
loss or the generalization error, which shows the average loss of the trained 
network when a novel example is given. On the other hand, the quantity 
$\hat L(\tilde\theta)$ is called the training loss or the training error, which shows 
the average loss evaluated by the examples used in training. Since these quantities 
depend on the sample $\xi_t$ and the re-sampled sequence $\omega$, we take the 
expectation $E$ and the variance $Var$ with respect to the sample $\xi_t$ and the 
re-sampled sequence $\omega$. 
First, let us consider the predictive loss, which is the average loss of the trained 
network when a new example (which does not belong to the sample $\xi_t$) is given. 
This averaging operation is replaced by averaging all over the input-output pairs, 
because the measure of the sample $\xi_t$ is zero. Then the predictive loss is 
written as 
$$L(\tilde\theta) = \int r(x) q(y|x) d(x,y,\tilde\theta)\, dx\, dy. \tag{20}$$ 
From the properties of $\hat\theta$ and $\tilde\theta$, we can prove the following 
important relations. 
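Theorem 1 The predictive loss asymptotically satisfies 
$$E[L(\tilde\theta)] = L(\theta^*) + \frac{1}{2t} \mathrm{tr}\, GQ^{-1} + \frac{\varepsilon}{2} \mathrm{tr}\, QV, \tag{21}$$ 
$$Var[L(\tilde\theta)] = \frac{1}{2t^2} \mathrm{tr}\, GQ^{-1}GQ^{-1} + \frac{\varepsilon^2}{2} \mathrm{tr}\, QVQV + \frac{\varepsilon}{t} \mathrm{tr}\, GV. \tag{22}$$ 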
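Roughly speaking, there exist two random values $Y_1$ and $Y_2$, and the predictive 
loss can be written in the following form: 
$$L(\tilde\theta) = L(\theta^*) + \frac{1}{2t} \mathrm{tr}\, GQ^{-1} + \frac{\varepsilon}{2} \mathrm{tr}\, QV + \frac{1}{t} Y_1 + \varepsilon Y_2 + \text{(higher-order terms)}, \tag{23}$$ 
where $Y_1$ and $Y_2$ satisfy 
$$E[Y_1] = 0, \qquad Var[Y_1] = \frac{1}{2} \mathrm{tr}\, GQ^{-1}GQ^{-1},$$ 
$$E[Y_2] = 0, \qquad Var[Y_2] = \frac{1}{2} \mathrm{tr}\, QVQV,$$ 
$$Cov[Y_1, Y_2] = \frac{1}{2} \mathrm{tr}\, GV,$$ 
with the factors $1/2$ fixed by agreement with the variance in Theorem 1. 
$E$, $Var$ and $Cov$ denote the expectation, the variance and the covariance 
respectively. 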
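Next, we consider the training loss, i.e., the average loss evaluated by the examples 
used in training. Just as we did in the previous theorem, we can get the following 
relations. 
Theorem 2 The training loss asymptotically satisfies 
$$E[\hat L(\tilde\theta)] = L(\theta^*) - \frac{1}{2t} \mathrm{tr}\, GQ^{-1} + \frac{\varepsilon}{2} \mathrm{tr}\, QV, \tag{24}$$ 
$$Var[\hat L(\tilde\theta)] = \frac{1}{t} \left\{ \int r(x) q(y|x) d(x,y,\theta^*)^2\, dx\, dy - \left( \int r(x) q(y|x) d(x,y,\theta^*)\, dx\, dy \right)^2 \right\}. \tag{25}$$ 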
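Intuitively speaking, like the predictive loss, the training loss can be expanded as 
$$\hat L(\tilde\theta) = L(\theta^*) - \frac{1}{2t} \mathrm{tr}\, GQ^{-1} + \frac{\varepsilon}{2} \mathrm{tr}\, QV + \frac{1}{\sqrt{t}} Y_3 + \text{(higher-order terms)}, \tag{26}$$ 
where $Y_3$ satisfies 
$$E[Y_3] = 0, \qquad Var[Y_3] = \int r(x) q(y|x) d(x,y,\theta^*)^2\, dx\, dy - \left( \int r(x) q(y|x) d(x,y,\theta^*)\, dx\, dy \right)^2.$$ 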
When we look at the two curves $E[L(\tilde\theta)]$ and $E[\hat L(\tilde\theta)]$ as 
functions of $t$, they are called learning curves, which represent the 
characteristics of learning. The expectations of the predictive loss and the 
training loss look quite similar. They differ in the sign of the term of order 
$1/t$. As the learning coefficient $\varepsilon$ increases, the expectations 
$E[L(\tilde\theta)]$ and $E[\hat L(\tilde\theta)]$ increase, but as the number of 
examples $t$ increases, the average predictive loss $E[L(\tilde\theta)]$ decreases 
and the average training loss $E[\hat L(\tilde\theta)]$ conversely increases. 
Moreover, their variances are different in the order of $t$. The coefficients 
$\mathrm{tr}\, GQ^{-1}$, $\mathrm{tr}\, QV$, etc. are calculated from the matrices 
$G$, $Q$ and $V$, which reflect the architecture of the network and the loss 
criterion to be minimized. We can consider these matrices as representing the 
complexity of the network. In earlier work, Amari and Murata [1991] introduced an 
effective complexity of the network, $\mathrm{tr}\, GQ^{-1}$, by analogy to Akaike's 
Information Criterion (AIC) (see Akaike [1974]). 
5 AN APPLICATION FOR MODEL SELECTION 
These results naturally lead us to a model selection criterion, which is like the AIC 
criterion of statistical model selection and which is related to those proposed by 
some researchers (see Murata et al. [1991], Moody [1992]). From the previous 
relations, we can easily show the following relation 
$$L(\tilde\theta) = \hat L(\tilde\theta) + \frac{1}{t} \mathrm{tr}\, GQ^{-1} + c, \tag{27}$$ 
where $c$ is a quantity of order $1/\sqrt{t}$ and common to all the networks of the 
same architecture. We compare the abilities of two different networks, which have 
the same architecture and are trained by the same sample, but differ in the number 
of weights or neurons (see Fig. 1). We can use a quantity, NIC (Network Information 
Criterion), 
$$NIC(\tilde\theta) = \hat L(\tilde\theta) + \frac{1}{t} \mathrm{tr}\, \hat G \hat Q^{-1}, \tag{28}$$ 
where 
$$\hat G = \frac{1}{t} \sum_{i=1}^{t} \nabla d(x_i, y_i, \tilde\theta) \nabla d(x_i, y_i, \tilde\theta)^T, \qquad \hat Q = \frac{1}{t} \sum_{i=1}^{t} \nabla\nabla d(x_i, y_i, \tilde\theta), \tag{29}$$ 
for selecting an optimal network model. Note that this quantity NIC is directly 
calculable, since all elements of it, $\hat L(\tilde\theta)$, $\hat G$ and $\hat Q$, 
are given by summing over the sample $\xi_t$. When we have two models $M_1$ and 
$M_2$, and the NIC of $M_1$ is smaller than that of $M_2$, the predictive loss of 
$M_1$ is expected to be smaller than that of $M_2$, so $M_1$ can be regarded as a 
better model in the sense of the loss function. 
This criterion cannot be used when we compare two networks of different 
architectures, for example a multi-layered network and a radial basis expansion 
network. This is because the value $c$ of the order $1/\sqrt{t}$ term is common only 
to two networks in which one is included in the other as a submodel. The criterion 
is in general valid only for such a family of networks (see Fig. 2). 
6 CONCLUSIONS 
In this paper, we show that there is a simple relation between the expectation of 
the predictive loss and that of the training loss. This result naturally leads us 
to a new model selection criterion. 
As future work, we will consider applying this result in an algorithm that 
automatically changes the number of hidden units during learning. 
References 
H. Akaike. (1974) A new look at the statistical model identification. IEEE Trans. 
AC, 19(6):716-723. 
S. Amari. (1967) Theory of adaptive pattern classifiers. IEEE Trans. EC, 
16(3):299-307. 
S. Amari and N. Murata. (1991) Statistical theory of learning curves under entropic 
loss criterion. Technical Report METR 91-12, University of Tokyo, Tokyo, Japan. 
E. B. Baum and D. Haussler. (1989) What size net gives valid generalization? 
Neural Computation, 1:151-160. 
E. Levin, N. Tishby, and S. A. Solla. (1990) A statistical approach to learning and 
generalization in layered neural networks. Proc. of IEEE, 78(10):1568-1574. 
J. E. Moody. (1992) The effective number of parameters: An analysis of 
generalization and regularization in nonlinear learning systems. In J. E. Moody, 
S. J. Hanson, and R. P. Lippmann, (eds.), Advances in Neural Information Processing 
Systems 4. San Mateo, CA: Morgan Kaufmann. 
N. Murata. (1992) Statistical asymptotic study on learning (In Japanese). PhD 
thesis, University of Tokyo, Tokyo, Japan. 
N. Murata, S. Yoshizawa, and S. Amari. (1991) A criterion for determining the 
number of parameters in an artificial neural network model. In T. Kohonen et al., 
(eds.), Artificial Neural Networks, 9-14. Holland: Elsevier Science Publishers. 
H. S. Seung, H. Sompolinsky, and N. Tishby. (1991) Statistical mechanics of learning 
from examples II. Quenched theory and unrealizable rules. Submitted to Physical 
Review A. 
[Figure 1 labels: $q(y|x)$, $\hat q(y|x)$, model $M_1$, model $M_2$, "origin of the large variance"] 
Figure 1: Geometrical representation of hierarchical models: the solid lines between 
$q(y|x)$ and $\tilde\theta_i$ show predictive losses, and the dashed lines between 
$\hat q(y|x)$ and $\tilde\theta_i$ show training losses. The large variance of the 
training loss originates in the discrepancy of $q(y|x)$ and $\hat q(y|x)$. When we 
estimate the predictive loss from the training loss, the large variance still 
remains. But in the case that the model $M_1$ includes the model $M_2$, this 
variance is common to the two models, so we do not have to take care of it. 
[Figure 2 labels: model $M_1$, model $M_2$, "origin of the large variance"] 
Figure 2: Geometrical representation of non-hierarchical models: the solid lines 
between $q(y|x)$ and $\tilde\theta_i$ show predictive losses, and the dashed lines 
between $\hat q(y|x)$ and $\tilde\theta_i$ show training losses. The discrepancy of 
$q(y|x)$ and $\hat q(y|x)$ works differently on the two models $M_1$ and $M_2$ in 
estimating predictive losses. 
