The VC-Dimension versus the Statistical 
Capacity of Multilayer Networks 
Chuanyi Ji *and Demetri Psaltis 
Department of Electrical Engineering 
California Institute of Technology 
Pasadena, CA 91125 
Abstract 
A general relationship is developed between the VC-dimension and the 
statistical lower epsilon-capacity which shows that the VC-dimension can 
be lower bounded (in order) by the statistical lower epsilon-capacity of a 
network trained with random samples. This relationship explains quan- 
titatively how generalization takes place after memorization, and relates 
the concept of generalization (consistency) with the capacity of the optimal 
classifier over a class of classifiers with the same structure and the capacity 
of the Bayesian classifier. Furthermore, it provides a general methodology 
to evaluate a lower bound for the VC-dimension of feedforward multilayer 
neural networks. 
This general methodology is applied to two types of networks which are 
important for hardware implementations: two layer (N - 2L - 1) net- 
works with binary weights, integer thresholds for the hidden units and 
zero threshold for the output unit, and a single neuron ((N- 1) net- 
works) with binary weigths and a zero threshold. Specifically, we obtain 
w , O(N). Here W is the total number 
O(?WE ) _ d2 _ O(W) and dl ~ 
of weights of the (N - 2L - 1) networks. dl and d2 represent the VC- 
dimensions for the (N- 1) and (N- 2L- 1) networks respectively. 
I Introduction 
The information capacity and the VC-dimension are two important quantities that 
characterize multilayer feedforward neural networks. The former characterizes their 
*Present Address: Department of Electrical Computer and System Engineering, Rens- 
selaer Polytech Institute, Troy, NY 12180. 
928 
The VC-Dimension versus the Statistical Capacity of Multilayer Networks 929 
memorization capability, while the latter represents the sample complexity needed 
for generalization. Discovering their relationships is of importance for obtaining 
a better understanding of the fundamental properties of multilayer networks in 
learning and generalization. 
In this work we show that the VC-dimension of feedforward multilayer neural net- 
works, which is a distribution-and network-parameter-indenpent quantity, can be 
lower bounded (in order) by the statistical lower epsilon-capacity C- (McEliece 
et.al, (1987)), which is a distribution-and network-dependent quantity, when the 
samples are drawn from two classes: 1(+1) and 22(-1). The only requirement 
on the distribution from which samples are drawn is that the optimal classification 
error achievable, the Bayes error Pbe, is greater than zero. Then we will show that 
the VC-dimension d and the statistical lower epsilon-capacity C- are related by 
C- <_ Ad, (1) 
where  = P,o- for 0 <  _< P,o; or  = Pb,- for 0 <  _< Pb,. Here e 
is the error tolerance, and P,o represents the optimal error rate achievable on the 
class of classifiers considered. It is obvious that P,o _> Pb,. The relation given 
in Equation (1) is non-trivial if Pb, > 0, P,o _< ' or Pb, < ' so that  is a non- 
negative quantity Ad is called the universal sample bound for generalization, where 
A < ,2: is a positive constant. When the sample complexity exceeds Ad, all the 
networks of the same architechture for all distributions of the samples can generalize 
with almost probability 1 for d large. A special case of interest, in which Pbe = 3, 
corresponds to random assignments of samples. Then C- represents the random 
storage capacity which characterizes the memorizing capability of networks. 
Although the VC-dimension is a key parameter in generalization, there exists no 
systematic way of finding it. The relationship we have obtained, however, brings 
concomitantly a constructive method of finding a lower bound for the VC-dimension 
of multilayer networks. That is, if the weights of a network are properly con- 
structed using random samples drawn from a chosen distribution, the statistical 
lower epsilon-capacity can be evaluated and then utilized as bounds for the VC- 
dimension. In this paper we will show how this constructive approach cqntributes 
to finding lower bounds of the VC-dimension of multilayer networks with binary 
weights. 
2 
A Relationship Between the VC-Dimension and the 
Statistical Capacity 
2.1 Definition of the Statistical Capacity 
Consider a network s whose weights are constructed from M random samples be- 
longing to two classes Let (s) = __z where Z is the total number of samples 
 M' 
classified incorrectly by the network s. Then the random variable P(s) is the train- 
ing error rate. Let 
P(M) = Pr((s) _< e), (2) 
930 Ji and Psaltis 
where 0 <_ e <_ 1. Then the statistical lower epsilon-capacity (statistical capacity in 
short) C- is the maximum M such that P(M) > 1 - r/, where r/can be arbitrarily 
small for sufficiently large N. 
Roughly speaking, the statistical lower epsilon-capacity defined here can be regarded 
as a sharp transition point on the curve P(M) shown in Fig.1. When the number 
of samples used is below this sharp transition, the network can memorize them 
perfectly. 
2.2 The Universal Sample Bound for Generalization 
Let Pe(xls) be the true probability of error for the network s. Then the gener- 
alization error AE(s) satisfies AE(s) =[ P(s)- Pe(xls) 1. We can show that the 
probability for the generalization error to exceed a given small quantity  satisfies 
the following relation. 
Theorem I 
Pr(m,aS(s) > ,') < a(au;a,e), 
wh e re 
(2M') a _ dM 
either 2M_ d, or6-., e --_ l&2Md, 
; otherwise. 
Here S is a class of networks with the same architecture. The function h(2M; d, 
has one sharp transition occurring at Ad shown in Fig. l, where A is a constant 
satisfying the equation A = ln(2A) + 1 - -A = O. 
This theorem says that when the number M of samples used exceeds Ad, generaliza- 
tion happens with probability 1. Since Ad is a distribution-and network-parameter- 
independent quantity, we call it the universal sample bound for generalization. 
2.3 A Relationship between The VC-Dimension and Cg 
Roughly speaking, since both the statistical capacity and the VC-dimension rep- 
resent sharp transition points, it is natural to ask whether they are related. The 
relationship can actually be given through the theorem below. 
Theorem 2 Let samples belonging to two classes Q1(-{-1) and Q(-1) be drawn 
independently from some distribution. The only requirement on the distributions 
considered is that the Bayes error Poe satisfies 0 ( Poe _ . Let S be a class 
of feedforward multilayer networks with a fized structure consisting of threshold 
elements and s be one network in $, where the weights of Sl are constructed from 
M (training) samples drawn from one distribution as specified above. For a given 
distribution, let Peo be the optimal error rate achievable on S and Pbe be the Bayes 
error rate. Then 
and 
Pr(P(Sl) < Peo -') _< h(2M;d,'), 
Prff(Sl) ( P&- ) _(h(2M;d,d), 
(4) 
The VC-Dimension versus the Statistical Capacity o Multilayer Networks 931 
P (M) ,. gh(2M;d,f 
 M 
C Ad 
Figure 1' Two sharp transition points for the capacity and the universal sample 
bound for generalization. 
where f(s) is equal to the training error rate of s. (It is also called the resub- 
stitution error estimator in the pattern recognition literature.) These relations are 
nontrivial if Peo > e', Poe > e' and d > 0 small. 
The key idea of this result is illustrated in Fig.1. That is, the sharp transition which 
stands for the lower epsilon-capacity is below the sharp transition for the universal 
sample bound for generalization. 
To interpret this relation, let us compare Equation (2) and Equation (5) and exam- 
ine the range of e and e' respectively. Since d, which is initially given in Inequal- 
ity (3), represents a bound on the generalization error, it is usually quite small. 
For most of practical problems, P,e is small also. If the structure of the class of 
networks is properly chosen so that Peo , Pbe, then e = Peo - e' will be a small 
quantity. Although the epsilon-capacity is a valid quantity depending on M for any 
network in the class, for M sufficiently large, the meaningful networks to be consid- 
ered through this relation is only a small subset in the class whose true probability 
of error is close to Peo. That is, this small subset contains only those networks 
which can approximate the best classifier contained in this class. 
For a special case in which samples are assigned randomly to two classes with equal 
probability, we have a result stated in Corollary 1. 
Corollary 1 Let samples be drawn independently from some distribution and then 
assigned randomly to two classes Q1(+1) and 2(-1) with equal probability. This 
is equivalent to the case that the two class conditional distributions have complete 
overlap with one another. That is, Pr(z I Q1) = pr(z [2). Then the Bayes error 
is . Using the same notation as in the above theorem, we have 
C_ d _< Ad. (6) 
932 Ji and Psaltis 
Although the distributions specified here give an uninteresting case for classification 
purposes, we will see later that the random statistical epsilon-capacity in Inequal- 
ity (6) can be used to characterize the memorizing capability of networks, and to 
formulate a constructive approach to find a lower bound for the VC-dimension. 
3 
Bounds for the VC-Dimension of Two Networks with 
Binary Weights 
3.1 A Constructive Methodology 
One of the applications of this relation is that it provides a general constructive ap- 
proach to find a lower bound for the VC-dimension for a class of networks. Specifi- 
cally, using the relationship given in Inequality (6), the procedures can be described 
as follows. 
1) Select a distribution. 
2) Draw samples independently from the chosen distribution, and then assign them 
randomly to two classes. 
3) Evaluate the lower epsilon-capacity and then use it as a lower bound for the 
VC-dimension. 
Two example are given below to demonstrate how this general approach can be 
applied to find lower bounds for the VC-dimension. 
3.2 Bounds for Two-Layer Networks with Binary Weigths 
Two-layer (N - 2L - 1) networks with binary weights and integer thresholds are 
considered in this section. 
3.2.1 A lower Bound 
The construction of the network we consider is motivated by the one used by Baum 
(Bantu, 1988) in finding the capacity for two layer networks with real weights. 
Although this particular network will fail if the accuracy of the weights and the 
thresholds is reduced, the idea of using the grandmother-cell type of network will 
be adopted to construct our network. 
We consider a two layer binary network with 2L hidden threshold units and one 
output threshold unit shown in Fig.2 a). 
The weights at the second layer are fixed and equal to +1 and -1 alternately. The 
hidden units are allowed to have integer thresholds in [-N, N], and the threshold 
for the output unit is zero. 
Let )m) = (Xl[n) X?V )) be a N dimensional random vector, where '(m)'s 
, ..., eli are 
independent random variables taking (+1) and (-1) with equal probability , 0 _< 
I _< L, and 0 _< rn _< M. Consider the/th pair of hidden units. The weights at the 
first layer for this pair of hidden units are equal. Let wli denote the weight from 
the ith input to these two hidden units, then we have 
The VC-Dimension versus the Statistical Capacity of Multilayer Networks 933 
) 
) 
2L 
N 
+ 
+ 
+ 
ith 
(a) (b) 
Figure 2' a) The two-layer network with binary weights. b) Illustration on how a 
pair of hidden units separates samples. 
M 
'li ! ' 
(7) 
where stIn(z) = i if z > O, and -1 otherwise. a's , 1 _( 1 _( L, which are indepen- 
dent random variables which take on two values +1 or -1 with equal probability, 
represent the random assignments of the LM samples into two classes fx(+l) and 
The thresholds for these two units are different and are given as 
a,L(1 q= k) (8) 
where 0 < k < 1, and t correspond to the thresholds for the units with weight +1 
and -1 at the second layer respectively. 
Fig.2 b) illustrates how this network works. Each pair of hidden units forms two 
parallel hyperplanes separated by the two thresholds, which will generates a presy- 
naptic input either +2 or (-2) to the output unit only for the samples stored in 
this pair which fall in between the planes when at equals either +1 or -1, and a 
presynaptic input 0 for the samples falling outside. When the samples as well as 
the parallel hyperplanes are random, with a certain probability they will fall either 
between a pair of parallel hyperplanes or outside. Therefore, statistical analysis is 
needed to obtain the lower epsilon-capacity. 
934 Ji and Psaltis 
Theorem 3 A lower bound C_, for the lower epsilon-capacity U_, for this 
network is: 
(1-k)2NL 
,., O( I-L ). (9) 
3.2.2 An Upper Bound 
Since the total number of possible mappings of two layer (N- 2L- 1) networks vith 
binary weights and integer thresholds ranging in t-N, N] is bounded by 2 W+ log 2N, 
the VC-dimension d is upper bounded by W + L log 2N, which is in the order of 
W. Then d _< O(W). By combining both the upper and lower bounds, we have 
W 
O( I-Z ) _< d2_< O(W). 
(lo) 
3.3 Bounds for One-Layer Networks with Binary Weigths 
The one-layer network we consider here is equivalent to one hidden unit in the above 
(N- 2L- 1) network. Specifically, the weight from the i-th input unit to the neuron 
is 
M 
wi = - x 
Cm i 1, 
(11) 
m=l 
where (1 _< i < N), x?)'s and m'S are independent and equally probable 
binary(+i) random variables, which represent elements of N-dimensional sample 
vectors and their random assignments to two classes respectively. 
Theorem 4 
The lower epsilon-capacity C'_, of this network satisfies 
Then by Corollary i we have O(N) _< O(dl), where d is the VC-dimension of 
one-layer (N- 1) networks. 
Using the similar counting arguement, an upper bound can be obtained as dl _< N. 
Then combining the lower and upper bounds, we have dl ,-, O(N) 
4 Discussions 
The general relationship we have drawn between the VC-dimension and the sta- 
tistical lower epsilon-capacity provides a new view on the sample complexity for 
generalization. Specifically, it has two implications to learning and generalziation. 
1) For random assignments of the samples (P& = ), the relationship confirms that 
generalization occurs after memorization, since the statistical lower epsilon-capacity 
The VC-Dimension versus the Statistical Capacity of Multilayer Networks 935 
for this case is the random storage capacity which charaterizes the memorizing 
capability of networks and it is upper bounded by the universal sample bound for 
generalization. 
2) For cases where the Bayes error is smaller than 1 
, the relationship indicates 
that an appropriate choice of a network structure is very important. If a network 
structure is properly chosen so that the optimal achievable error rate P,o is close 
to the Bayes error P,b, than the optimal network in this class is the one which has 
the largest lower epsilon-capacity. Since a suitable structure can hardly be chosen 
a priori due to the lack of knowledge about the underlying distribution, searching 
for network structures as well as weight values becomes necessary. Similar idea 
has been addressed by Devroye (Devroye, 1988) and by Vapnik (Vapnik, 1982) for 
structural minimization. 
We have applied this relation as a general constructive approach to obtain lower 
bounds for the VC-dimension of two-layer and one-layer networks with binary inter- 
connections. For the one-layer networks, the lower bound is tight and matches the 
upper bound. For the two-layer networks, the lower bound is smaller than the upper 
bound (in order) by a In factor. In an independent work by Littlestone (Littlestone, 
1988), the VC-dimension of so-called DNF expressions were obtained. Since any 
DNF expression can be implemented by a two layer network of threshold units with 
binary weights and integer thresholds, this result is equivalent to showing that the 
VC-dimension of such networks is O(W). We believe that the In factor in our lower 
bound is due to the limitations of the grandmother-cell type of networks used in 
our construction. 
Acknowledgement 
The authors would like to thank Yaser Abu-Mostafa and David Haussler for helpul 
discussions. The support of AFOSR and DARPA is gratefully acknowledged. 
References 
E. Baum. (1988) On the Capacity of Multilayer Perceptron. J. of Complexity, 
4:193-215. 
L. Devroye. (1988) Automatic Pattern Recognition: A Study of Probability of 
Error. IEEE Trans. on Pattern Recognition and Machine Intelligence, Vol. 10, 
No.4: 530-543. 
N. Littlestone. (1988) Learning Quickly When Irrelevant Attributes Abound: A 
New Linear-Threshold Algorithm. Machine Learning : 285-318. 
R.J. McEliece, E.C. Posner, E.R. Rodemich, S.S. Venkatesh. (1987) The Capacity 
of the Hopfield Associative Memory. IEEE Trans. Inform. Theory, Vol. IT-33, No. 
4,461-482. 
V.N . Vapnik (1982) Estimation of Dependences Based on Empirical Data, New 
York: Springer-Verlag. 
