Counting function heorem for 
multi-layer networks 
Adam Kowalczyk 
Telecom Australia, Research Laboratories 
770 Blackburn Road, Clayton, Vic. 3168, Australia 
(a.kowalczyk@trl.oz.au) 
Abstract 
We show that a randomly selected N-tuple  of points of R ' with 
probability > 0 is such that any multi-layer perceptron with the 
first hidden layer composed of hi threshold logic units can imple- 
of. > 
ment exactly 2 z_,i:0 - 
then such a perceptron must have all units of the first hidden layer 
fully connected to inputs. This implies the maximal capacities (in 
the sense of Cover) of 2n input patterns per hidden unit and 2 input 
patterns per synaptic weight of such networks (both capacities are 
achieved by networks with single hidden layer and are the same as 
for a single neuron). Comparing these results with recent estimates 
of VC-dimension we find that in contrast to the single neuron case, 
for sufficiently large n and hi, the VC-dimension exceeds Cover's 
capacity. 
I Introduction 
In the course of theoretical justification of many of the claims made about neural 
networks regarding their ability to learn a set of patterns and their ability to gen- 
eralise, various concepts of maximal storage capacity were developed. In particular 
Cover's capacity [4] and VC-dimension [12] are two expressions of this notion and 
are of special interest here. We should stress that both capacities are not easy 
to compute and are presently known in a few particular cases of feedforward net- 
works only. VC-dimension, in spite of being introduced much later, has been far 
375 
376 Kowalczyk 
more researched, perhaps due to its significance expressed by a well known relation 
between generalisation and learning errors [12, 3]. Another reason why Cover's ca- 
pacify gains less attention, perhaps, is that for the single neuron case it is twice 
higher than VC-dimension. Thus if one would hypothesise a similar relation to be 
true for other feedforward networks, he would judge Cover's capacity to be quite 
an unattractive parameter for generalisation estimates, where VC-dimenslon is be- 
lieved to be unrealistically big. One of the aims of this paper is to show that this 
last hypothesis is not true, at least for some feedforward networks with sufficiently 
large number of hidden units. In the following we will always consider multilayer 
perceptrons with r continuously-valued inputs, a single binary output, and one or 
more hidden layers, the first of which is made up of threshold logic units only. 
The derivation of Cover's capacity for a single neuron in [4] is based on the so-called 
Function Counting Theorem, proved for the linear function in the sixties (c.f. [4]), 
which states that for an N-tuple  of points in general position one can implement 
C'(N, n) ae 2 Y-i=0 Jv- different dichotomies of. Extension of this result to the 
multilayer case is still an open problem (c.f.T. Cover's address at NIPS'92). One of 
the complications arising there is that in contrast to the single neuron case even for 
percepttons with two hidden units the number of implementable dichotomies may 
be different for different N-tuples in general position [8]. Our first main result states 
that this dependence on  is relatively weak, that for a multilayer perceptron the 
number of implementable dichotomies (counting function) is constant on each of a 
finite number of connected components into which the space of N-tuples in general 
position can be decomposed. Then we show that for one of these components 
C'(N,rh) different dichotomies can be implemented, where h is the number of 
hidden units in the first hidden layer (all assumed to be linear threshold logic 
units). This leads to an upper bound on Cover's capacity of 2n input patterns per 
(hidden) neuron and 2 patterns per adjustable synaptic weight, the same as for a 
single neuron. Comparing this result with a recent lower bound on VC-dimension 
of multilayer perceptrons [10] we find that for for sufficiently large n and h the 
VC-dimension is higher than Cover's capacity (by a factor log2(hl)). 
The paper extends some results announced in [5] and is an abbreviated version of 
a forthcoming paper [6]. 
2 Results 
2.1 Standing assumptions and basic notation 
We recall that in this paper a multilayer ioercelotror means a layered feedforward 
network with one or more hidden layers, and the first hidden layer built exclusively 
from threshold logic units. 
A dickotomy of an N-tuple  = (xx,...,xnr)  (Rr) nr is a function $: {xx,...,xnr}  
{0,1}. For a multilayer perceptron F: R r ---} {0, 1} let  -* C'F() denote the 
number of different dichotomies of  which can be implemented for all possible 
selections of synaptic weights and biases. We shall call C'F() a courttire# furctior 
following the terminology used in [4]. 
Counting Function Theorem for Multi-Layer Networks 377 
) 
Example 1. C(g) = C(N, n) de__f 2 Ei=0 N-I for a single threshold logic unit 
R" --, {0, x) 
Points of an N-tuple  6 (R') N are said to be in general position if there does not 
exist an l de= f min(N,n- 1)-dimensional affine hyperplane in R ' containing (l + 2) 
of them. We use a symbol {7P(n, N) C (R') N to denote that set of all N-tuples  
in general position. 
Throughout this paper we assume to be given a probability measure dp dc__f f dx on 
R ' such that the density f: R ' --, R is a continuous function. 
2.2 Counting function is locally constant 
We start with a basic characterisations of the subset 679(n, N) C (R') N. 
Theorem 1 (i) (77)(n,N) is an open and dense subset of (Rn) N with a finite 
number of connected components. 
(it) Any of these components is unbounded, has an infinite Lebesgue measure and 
has a positive probability measure. 
Proof outline. (i) The key point to observe is that G73(n, N) = { : p() =/: 0}, 
where p: (R') N ---} R is a polynomial on (R') N. This implies immediately that 
G73(n, N) is open and dense in (R') N. The finite number of connected components 
follows from the results of Milnor [7] (c.f. [2]). 
(it) This follows from an observation that each of the connected components C/has 
the property that if (x, ...,XN) 6 Ci and a > 0, then (axe, ...,aXN) 6 i. [] 
As Example 1 shows, for a single neuron the counting function is constant on 
G7)(n, N). However, this may not be the case even for perceptrons with two hidden 
units and two inputs (c.f. [8, 0] for such examples and Corollary 8). Our first main 
result states that this dependence on  is relatively weak. 
Theorem 2 CF(X) is constant on connected components of G7)(n, N). 
Proof outline. The basic heuristic behind the proof of this theorem is quite simple. 
If we have an N-tuple  6 (R') N which is split into two parts by a hyperplane, 
then this split is preserved for any sufficiently small perturbation f  (R') N of , 
and vice versa, any split of : corresponds to a split of . The crux is to show that 
if  is in general position, then a minute perturbation f of  cannot allow a bigger 
number of splits than is possible for . We refer to [0] for details. [] 
The following corollary outlines the main impact of Theorem 2 on the rest of the 
paper. It reduces the problem of investigation of the function CF(X) on G7)(n, N) to 
a consideration of a set of individual, special cases of N-tuples which, in particular, 
are amenable to be solved analytically. 
Corollary 3 If   Q73(n,N), then CF() = CF(f) for a randomly selected N- 
tuple 2 6 (Rn) N with a probability > O. 
378 Kowalczyk 
2.3 A case of special component of gyP(n, N) 
The following theorem is the crux of the paper. 
Theorem 4 There exists a connected component CC C gYP(n, N) C (R') N such 
that 
cF() = = 2 - (for CC) 
i=o i 
with equality iff the input and first hidden/dyer are fully connected. The synaptlc 
weights to units not in the first hidden layer can be constant. 
Using now Corollary 3 we obtain: 
Corollary 5 C',() = C(N, nh) for   (R') r with a probabilit.l > O. 
The component C'C C gyP(n, N) in Theorem 4 is defined as the connected compo- 
nent containing 
N de__f (c(t),c(t.),...,c(tN)) e (R') N, (1) 
where c : R  R ' is the curve defined as c(t) ae (t,t.,...,t) for t  R and 
0 < tx < t < -.. < tr are some numbers (this example has been considered 
previously in [11]). The essential part of the proof of Theorem 4 is showing the 
basic properties of the N-tuple 15v which will be described by the Lemma below. 
Any dichotomy  of the N-tuple fiv (c.f. 1) is uniquely defined by its value at 
(2 options) and the set of indices 1 < i < i. < .-- < ia < N of all transitional 
pairs (c(ti), c(ti+x)), i.e. all indices ii such that 5(c(ti)) y 5(c(ti+x)), where j = 
1,..., k, (additional (v-x) options). Thus it is easily seen that there exist altogether 
2(v -) different dichotomies of 15v for any given number k of transitional pairs, 
where0<k<N. 
Lemma6 Given integers n,N,h > 0, k > 0 and a dichotomy  Of N with k 
transitional pairs. 
(i) Ilk _< nh, then there exist hyperplanes H(w,,,,), (w, b) 6 R n x It, such that 
5(p) = O bo+ viO(wi .pj +hi) , (2) 
+ 0, (3) 
der der 
for i = 1, ..., h and j = 1, ..., N; here vi = 1 if n is even and vi = (--1) i if n is odd, 
der 
bo = -0.5 ifn is odd, h is even and 5(Po) = 1, and bo 
= 0.5, otherwise. 
(ii) Ilk = nh, then w  0 for j = 1,...,n and i = 1,...,h, where w = 
(wi, wia, ..., wi,). 
(iii) If k > nh, *hen (2) and (3) cannot be satised. 
The proof of Lemma  relies on usage of the Vandermonde determinant and its 
derivatives. It is quite technical and thus not included here (c.f. [] for details). 
Counting Function Theorem for Multi-Layer Networks 379 
10 4 
10 3 
10 2 
10 
2 
(Theorem 7 
(Mitchison & Burbin [10]]. 
(Huang & Huang 
[2], Sakurai [1 ..' 
(Akaho & Amari -'" .. 
ts]) --" 
I 2 5 10 102 103 104 
Number of hidden units (hi) 
Figure 1: Some estimates of capacity. 
3 Discussion 
3.1 An upper bound on Cover's capacity 
The Cover's capacity (or just capacity) of a neural network F : R ' --, {0,1}, 
Gap(F), is defined as the maximal N such that for a randomly selected N-tuple 
 = (xt,...,xv) 6 (R') v of points of R ', the network can implement 1/2 of all 
dichotomies of  with probability 1 [4, 8]. 
Corollary 5 implies that Gap(F) is not greater than maximal N such that 
CF(IN)/2 N -- C(N, nhx) _> 1/2. (4) 
since any property which holds with probability 1 on (R') N must also hold prob- 
ability 1 on CC (c.f Theorem 4). The left-hand-side of the above equation is just 
the sum of the binomial expansion of (1/2 + 1/2) 'v-t up to h. xn-th term, so, using 
the symmetry argument, we find that it is >_ 1/2 if and only if it has at least half 
of the all terms, i.e. when N- 1+1 <_ 2(hxn+ 1). Thus the 2(htn+l) is the 
maximal value of N satisfying (4). t Now let us recall that a multilayer perceptron 
as in this paper can implement any dichotomy of any N-tuple  in general position 
if N _< nht + 1 [1, 11]. This leads to the following result: 
Theorem 7 
+ 1 _< Cap(F) <_ + 1). 
;Note that for large N the choice of cutoff value 1/2 is not critical, since the probability 
of a dichotomy being implementable drops rapidly as ht n approaches 2r/2. 
380 Kowalczyk 
N/ 
#w lO 
._o 8 
r-, 
 6 
E 
-. 4 
a. 2 
 0 
t::: 
I 2 5 10 10 2 10 3 10 4 
Number of hidden units (hi) 
#w 
dvdF) / [11]  
k.(Sakurai 
ap(F) / #w 'h 
eorem 7)/ 
Figure 2: Comparison of estimates of the ratios of Cover's capacity per synaptic 
weight (Cap(F)/#w) and VC-dimension per synaptic weight (dvc(F)/w). (Note 
that the upper bound for VC-dimension has so far been proved for low number of 
hidden layers [9,10].) 
for any multilayer perceptton F: R ' -* {0, 1} with the rst hidden layer built from 
the hi threshold logic units. For the most efficient networks in this class, with a 
single hidden layer, we thus obtain the following result: 
1- _< Cap(F)/#w < 2, 
where w denotes the number of synaptic weights and biases. 
3.2 A relation to C-dimension 
The VC-dimension, dye(F), is defined as the largest N such that there exists an 
N-tuple [ = (xl, ..., xv) E (R') v for which the network can implement all possible 
2 r dichotomies. Recent results of Sakurai [10] imply 
dye(F) > (1/2)hal(log2 hi + o(]og 2 hi) q- o((log 2 hl)2/r)). (5) 
For sufficiently large n and hi this estimate exceeds 2(nhz + 1) which is an upper 
bound on Cap(F). Thus, in contrast to the single threshold logic unit case we have 
the following (c.f. Fig. 3): 
Corollary 8 Cap(F) < ave(F) if hz >> 1. 
3.3 Memorisation ability of multilayer perceptton 
Corollary 8 combined with Theorem 7 and Figure 2 imply that for some cases of 
patters in general position multilayer perceptton can memorise and reliably retrieve 
Counting Function Theorem for Multi-Layer Networks 381 
(even with 100% accuracy) much more ( log.(h) times more) than 2 patterns 
per connection, as is the case for a single neuron [4]. This proves that co-operation 
between hidden units can significantly improve the storage efficiency of neural net- 
works. 
3.4 A relation to PAC learning 
Vapnik's estimate of generalisation error [12] (an error rate on independent test set) 
E.(F) _< EL(F)+ 
holds for N > dye(F) with probability larger that (1 
(i) learning error EL(F) and (ii) confidence interval 
where 
(6) 
It contains two terms: 
qt(N, dvc, r/) = (ln 2N .dvc In r/ 
The ability of obtaining small learning error EL(F) is, in a sense, controlled by 
Cap(F), while the size of the confidence interval D is controlled by both dye(F) 
and Cap(F) (through EL(F)). For a multilayer perceptron as in Theorem 7 when 
dvc(F) >> Cap(F) (Fig. 2) it can turn out that actually the capacity rather than 
the VC-dimension is the most critical factor in obtaining low generalisation error 
E(F). This obviously warrants further research into the relation between capacity 
and generalisation. 
The theoretical estimates of generalisation error based on VC-dimension are believed 
to be too pessimistic in comparison with some experiments. One may hypothesise 
that this is caused by too high values of dye(F) used in estimates such as (6). Since 
Cover's capacity in the case multilayer perceptron with h >> i turned up to be 
much lower than VC-dimension, one may hope that more realistic estimates could 
be achieved with generalisation estimates linked directly to capacity. This subject 
will obviously require further research. Note that some results along these lines can 
be found in Cover's paper [4]. 
3.5 Some open problems 
Theorem 7 gives estimates of capacity per variable connection for a network with 
the minimal number of neurons in the first hidden layer showing that these neurons 
have to be fully connected. The natural question arises at this point as to whether 
a network with a bigger number but not fully connected neurons in the first hidden 
layer can achieve a better capacity (per adjustable synaptic weight). 
The values of the counting function  -} C,() are provided in this paper for the 
particular class of points in general position, for   CC C (R') N. The natural 
question is whether they may be by chance a lower or upper bound for the counting 
function for the general case of   (R') N ? The results of Sakurai [11] seem 
to point to the former case: in his case, the sequences 15N = (p, ...,pN) turned 
out to be "the hardest" in terms of hidden units required to implement 100% of 
382 Kowalczyk 
dichotomies. Corollary 8 and Figure 1 also support this lower bound hypothesis. 
They imply in particular that there exists a N'-tuple  = (Yl, Y2, ..., yJv,)  (R') N', 
where N'  VC-dimension > N, such that Cr(:) = 2 N' >> 2 N > Cr(fir) for 
sufficiently large n and h. 
4 Acknowledgement 
The permission of Managing Director, Research and Information Technology, Tele- 
com Australia, to publish this paper is gratefully acknowledged. 
References 
[lo] 
[11] 
[12] 
[1] E. Baum. On the capabilities of multilayer perceptrons. Journal of Complezity, 
4:193-215, 1988. 
[2] S. Ben-David and M. Lindenbaum. Localization rs. identification of semi- 
algebraic sets. In Proceedings of the Sixth Annual Workshop or Computational 
Learning Theory (to appear), 1993. 
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and 
the Vapnik-Chernovenkis dimensions. Journal of the ACM, 36:929-965, (Oct. 
1989). 
[4] T.M. Cover. Geometrical and statistical properties of linear inequalities with 
applications to pattern recognition. IEEE Trans. Elec. Comp., EC-14:326- 
334, 1965. 
[5] A. Kowalczyk. Some estimates of necessary number of connections and hidden 
units for feed-forward networks. In S.J. Hanson et al., editor, Advances in Neu- 
ral Information Processing Systems, volume 5. Morgan Kaufman Publishers, 
Inc., 1992. 
[6] A. Kowalczyk. Estimates of storage capacity of multi-layer perceptton with 
threshold logic hidden units. In preparation, 1994. 
[7] J. Milnor. On Betti numbers of real varieties. Proceedings of AMS, 15:275-280, 
1964. 
[8] G.J. Mitchison and R.M. Durbin. Bounds on the learning capacity of some 
multi-layer networks. Biological Cybernetics, 60:345-356, (1989). 
[9] A. Sakurai. On the VC-dimension of depth four threshold circuits and the 
complexity of boolean-valued functions. Manuscript, Advanced Research Lab- 
oratory, Hitachi Ltd., 1993. 
A. Sakurai. Tighter bounds of the VC-dimension of three-layer networks. In 
WCNN93, 1993. 
A. Sakurai. n-h-1 networks store no less n. h + 1 examples but sometimes no 
more. In Proceedings of IJCNN9, pages III-936-III-941. IEEE, June 1992. 
V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer- 
Verlag, 1982. 
