Some Estimates of Necessary Number of 
Connections and Hidden Units for 
Feed-Forward Networks 
Adam Kowalczyk 
Telecom Australia, Research Laboratories 
770 Blackburn Road, Clayton, Vic. 3168, Australia 
(a.kowalczyk@trl.oz.au) 
Abstract 
The feed-forward networks with fixed hidden units (FHU-networks) 
are compared against the category of remaining feed-forward net- 
works with variable hidden units (VttU-networks). Two broad 
classes of tasks on a finite domain X C R n are considered: ap- 
proximation of every function from an open subset of functions on 
X and representation of every dichotomy of X. For the first task 
it is found that both network categories require the same minimal 
number of synaptic weights. For the second task and X in gen- 
eral position it is shown that VHU-networks with threshold logic 
hidden units can have approximately 1In times fewer hidden units 
than any FHU-network must have. 
1 Introduction 
A good candidate artificial neural network for short term memory needs to be: (i) 
easy to train, (ii) able to support a broad range of tasks in a domain of interest and 
(iii) simple to implement. The class of feed-forward networks with fixed hidden 
units (HU) and adjustable synaptic weights at the top layer only (shortly: FHU- 
networks) is an obvious candidate to consider in this context. This class covers a 
wide range of networks considered in the past, including the classical perceptron, 
higher order networks and non-linear associative mapping. Also a number of train- 
ing algorithms were specifically devoted to this category (e.g. perceptron, madaline 
639 
640 Kowalczyk 
or pseudoinverse) and a number of hardware solutions were investigated for their 
implementation (e.g. optical devices [8]). 
Leaving aside the non-trivial tasks of constructing the domain specific HU for a 
FHU-network [9] and then optimal loading of specific tasks, in this paper we con- 
centrate on assessing the abilities of such structures to support a wide range of tasks 
in comparison to more complex feedforward networks with multiple layers of variable 
HU (VHU-networks). More precisely, on a finite domain X two benchmark tests 
are considered: approximation of every function from an open subset of functions 
on X and representation of every dichotomy of X. Some necessary and sufficient 
estimates of minimal necessary numbers of adaptable synaptic weights and of HU 
are obtained and then combined with some sufficient estimates in [10] to provide 
the final results. In Appendix we present an outline some of our recent results on 
the extension of the classical Function-Counting Theorem [2] to the multilayer case 
and discuss some of its implications to assessing network capacities. 
2 Statement of the main results 
In this paper X will denote a subset of R ' of N points. Of interest to us are 
multilayer feed-forward networks (shortly FF-networks), Fw: X - R, depending 
on the k-tuple w: (w, ..., w) C R  of adjustable synaptic weights to be selected on 
loading to the network desired tasks. The FF-networks are split into two categories 
defined above: 
 FHU-network with fixed hidden units qSi: X - R 
k 
Fw(x) eff (x C X), 
i----1 
 VHU-networks with variable hidden units bw,,,i : X - R depending on 
__-- R/d ' 
some adjustable synaptic weights w", where w (w , w") G R ' x -- 
R  
Fw(x) (x e x). (2) 
i=1 
Of special interest are situations where hidden units are built from one or more layers 
of artificial neurons, which, for simplicity, can be thought of as devices computing 
simple functions of the form 
(y, ...,y)  R TM -* cr(wiy + wiy2 + "' + wi,,yrn), 
where cr  R - R is a non-decreasing squashing function. Two particular examples 
of squashing functions are (i) infinitely differentiable sigmoid function t  (1 + 
exp(-t)) - and (ii) the step function O(t) defined as 1 for t _> 0 and = O, otherwise. 
In the latter case the artificial neuron is called a threshold logic neuron (ThL- 
In the formulation of results below all biases are treated as synaptic weights attached 
to links from special constant HUs (-- 1). 
Estimates of Necessary Connections & Hidden Units for Feed-Forward Networks 641 
2.1 Function approximation 
The space R x of all real functions on X has the natural structure of a vector space 
isomorphic with R v. We introduce the euclidean norm 
on R X and denote by L/C R X an open, non-empty subset. We say that the FF- 
network Fw can approximate a function f on X with accuracy e 
for a weight vector w C R k. 
Theorem 1 Assume the FF-network Fw is continuously differenttable with respect 
to the adjustable synaptic weights w  R  and k ( N. If it can approximate 
any function in Lt with any accuracy then for almost every function f  Lt, if 
lin_o IIFw(,) - ill-- 0, where w(1), w(2), ... C R , then lirra_oo IIw(i)[I-- . 
In the above theorem "almost every" means with the exception of a subset of the 
Lebesgue measure 0 on R X  R N. The proof of this theorem relies on use of 
Sard's theorem from differential topology (c.f. Section 3). Note that the above the- 
orem is applicable in particular to the popular "back-propagation" network which 
is typically built from artificial neurons with the continuously differenttable sigmoid 
squashing function. 
The proof of the following theorem uses a different approach, since the network is 
not differentiably dependent on its synaptic weights to HUs. This theorem applies 
in particular to the classical FP-networks built from ThL-neurons. 
Theorem 2 A FF-network Fw must have _ N HU in the top hidden layer if all 
units o this layer have a finite number of activation levels and the network can 
approximate any unction in Lt with any accuracy. 
The above theorems mean in particular that if we want to achieve an arbitrarily good 
approximation of any function in// ae__ {f: X - R; If(x)l  A), where A  0, 
and we can use one of VHU-networks of the above type with synaptic weights of a 
restricted magnitude only, then we have to have at least N such weights. However 
that many weights are necessary and sufficient to achieve the same, with a FHU- 
network (1) if the functions bi are linearly independent on X. So variable hidden 
units give no advantage in this case. 
2.2 Implementation of dichotomy 
We say that the FF-network Fw can implement a dichotomy (X_, X+) of X if 
there exists w ff R  such that Fw ( 0 on X_ and Fw ) 0 on X+. 
Proposition 3 A FHU-network Fw can implement every dichotomy of X if and 
only i it can exactly compute every function on X. In such a case it must have 
_ N HU in the top hidden layer. 
The non-trivial part of the above theorem is necessity in the first part of it, i.e. that 
being able to implement every dichotomy on X requires N (fixed) hidden units. In 
Section 3.3 we obtain this proposition from a stronger result. Note that the above 
642 Kowalczyk 
proposition can be deduced from the classical Function-Counting Theorem [2] and 
also that an equivalent result is proved directly in [3, Theorem 7.2]. 
We say that the points of a subdomain X C R  are in general position if every 
in R  contains no more than n points of X. Note that points of every finite 
subdomain of R n are in general position after a sufficiently small perturbation and 
that the property of being in general position is preserved under sufficiently small 
perturbations. Note also that the points of a typical N-point subdomain X C R n 
are in general position, where "typical" means with the exception of subdomains X 
corresponding to a certain subset of Lebesgue measure 0 in the space (Rn) N of all 
N-tuples of points from R '. 
It is proved in [10] that for a subdomain set X C R  of N points in general position 
a VHU-network having [(N- 1)/n] (adjustable) ThL-neurons in the first (and the 
only) hidden layer can implement every dichotomy of X, where the notation 
denotes the smallest integer > t. Furthermore, examples are given showing that the 
above bound is tight. (Note that this paper corrects and gives rigorous proofs of 
some early results in [1, Lemma 1 and Theorem 1] and also improves [6, Theorem 4].) 
Combining these results with Proposition 3 we get the following result. 
Theorem 4 Assume that all N points of X C R  are in general position. In the 
class of all FF-networks which can implement every dichotomy on X there exists a 
VHU-network with threshold logic HU having a fraction 1/n+O(1/N) of the number 
of the HU that any FHU-network in this class must have. There are examples of 
X in genera/position of any even cardinality N > 0 showing that this estimate is 
tight. 
3 Proofs 
Below we identify functions f  X -* R with N-tuples of their values at N-points 
of X (ordered in a unique manner). Under this identification the FF-networks Fw 
can be regarded as a transformation 
w 6 R  - Fw 6 R S (3) 
with the range T(Fw) aeg {Fw; w 6 R k) C R N. 
3.1 Proof of Theorem 1. 
In this case the transformation (3) is continuously differentiable. Every value of it 
is singular since k < N, thus according to Sard's Theorem [5], 7(Fw) C R N has 
Lebesgue measure 0. It is enough to show now that if 
f 6 L/- 7(Fw) 
(4) 
and 
lira [[Fw(i)-fl[=0 and [[w(i)[[ <M, 
for some M > 0, then a contradiction follows. Actually from (5) it follows that 
f belongs to the topological closure cl(7M) of 7M ae__f {Fw ; w 6 R k & [[w[I < 
Estimates of Necessary Connections & Hidden Units for Feed-Forward Networks 643 
}. However, TM is a compact set as a continuous image of a closed ball {w C 
; ]]w]l _< M}, so cI(TM) -- TM. Consequently f C Td.M C T(Fw) which 
contradicts (4). Q.E.D. 
3.2 Proof of Theorem 2. 
We consider the FF-network (1) for which there exists a finite set V C R of s points 
R a" k' 
such that bw,,,i(x)  V for every w"  , 1 < i < and x  X. It is sufficient 
to show that the set T(Fw) of all functions computable by Fw is not dense in L/if 
k  < N . Actually, we can write T(Fw) as a union 
IJ Lw,,c (6) 
w,,R ,, 
def k' 
' ' R} R v 
where each Lw,, = {i= wPw,,,i; w, ...,w,  C is a linear subspace 
of dimension <_ k  _< N uniquely determined by the vectors bw,,,i  V N C R N, 
i = 1, ..., k . However there is a finite number (_< s Jr) of different vectors in V N, 
thus there is only a finite number (_< s N) of different linear subspaces in the family 
; R " k  
{Lw,, w"  }. Hence, as < N, the union (6) is a closed no-where dense 
subset of R N as a finite union of proper linear subspaces (each of which is a closed 
and nowhere dense subset). Q.E.D. 
3.3 Proof of Proposition 3. 
We state first a stronger result. We say that a set L of functions on X is convex if 
for any couple of functions qb, qb2 on X any a > 0,/3 > 0, a +/3 = 1, the function 
a +/32 also belongs to L. 
Proposition 5 Let L be a convex set of functions on X = {xx,xa,...,xv} im- 
plementing. every dichotomy of X. Then for each i  {1, 2, ..., N} there exists a 
function q5   L such that i(x) y 0 and i(x.) = 0 for 1 _< i y j _< N. 
Proof. We define a transformation SGN ' R X ---} {-1, 0, +1} N 
de_____f (sgn(qb(x)),sgn(qb(xa)),...,sgn(qb(xN))) G {-1,0, +1} N 
where sgn() deal -1 if  < 0, sgn(0) d___f 0 and sgn() deal -1-1 if  > 0. We denote by 
W the subset of {-1,0, +1} v of all points q = (qx,...,qr) such that ]=x Iq[ -- k, 
for k = 0, 1, ..., N. 
We show first that convexity of L implies for k C {1, 2, ..., N} the following 
W/ C SGN(L) => W/_ C SGN(L). 
(7) 
For the proof assume W C sG(L) and q = (qx, ...,qN)  {--1, 0, +1} N is such that 
Iqgl-- 1, We need to show that there exists L such that 
SGN(q) = q. (S) 
644 Kowalczyk 
The vector q has at least one vanishing entry, say, without loss of generality, q: 0. 
Let 4 + and 4- be two functions in L such that 
SGN(+ ) -_ q+ de_____f 
def 
SGN(-): q- = 
(+1, q, ..., qN), 
(--1, q,...,qN). 
Such 4 + and 4- exist since q+, q- 6 Wk. The function 
 de____f q_ 
belongs to L as a convex combination of two functions from L and satisfies (8). 
Now note that the assumptions of the proposition imply that WN C SGN(L). Ap- 
plying (7) repeatedly we find that W C SGN(L), which means that for every index 
i, 1 _< i _< N, there exists a function i 6 L with vanishing all entries but the i-th 
one. Q.E.D. 
Now let us see how Proposition 3 follows from the above result. Sufficiency is 
obvious. For the necessity we observe that the family Fw of functions on X is 
convex being a linear space in the case of a FHU-network (1). Now if this network 
can compute every dichotomy of X, then each function i as in Proposition 5 equals 
to Fw for some wi 6 R k. Thus T(Fw) = R N since those functions make a basis 
of R X m R N. Q.E.D. 
4 Discussion of results 
Theorem 1 combined with observations in [4] allows us to make the following contri- 
bution to the recent controversy on relevance/irrelevance of Kolmogorov's theorem 
on representation of continuous functions I n --o R, I de=f [0, 1] (c.f. [4, 7]), since I n 
contains subsets of any cardinality. 
The FF-networks for approximations of continuous functions on 
I n of rising accuracy have to be complex, at least in one of the 
following ways: 
 involve adjustment of a diverging number of synaptic weights 
and hidden units, or 
 require adjustment ofsynaptic weights of diverging magnitude, 
or 
 involve selection of "pathological" squashing functions. 
Thus one can only shift complexity from one kind to another, but not eliminate 
it completely. Although on theoretical grounds one can easily argue the virtues 
and simplicity of one kind of complexity over the other, for a genuine hardware 
implementation any of them poses an equally serious obstacle. 
For the classes of FF-networks and benchmark tests considered, the networks with 
multiple hidden layers have no decisive superiority over the simple structures with 
fixed hidden units unless dimensionality of the input space is significant. 
Estimates of Necessary Connections & Hidden Units for Feed-Forward Networks 645 
5 Appendix: Capacity and Function-Counting Theorem 
The above results can be viewed as a step towards estimation of capacity of networks 
to memorise dichotomies. We intend to elaborate this subject further now and 
outline some of our recent results on this matter. A more detailed presentation will 
be available in future publications. 
The capacity of a network in the sense of Cover [2] (Cover's capacity) is defined as 
a maximal N such that for a randomly selected subset X C R  of N points with 
probability 1 the network can implement 1/2 of all dichotomies of X. For a linear 
perceptton 
k 
Vw(x) (x x), (9) 
i=1 
where w C R n is the vector of adjustable synaptic weights, the capacity is 2n,and 2k 
for a FI-IU-network (1) with suitable chosen hidden units qSx, ..., qSk. These results are 
based on the so-called Punction-Counting Theorem proved for the linear perceptron 
in the sixties (c.f. [2]). Extension of this result to the multilayer case is still an open 
problem (c.f.T. Cover's talk on NIPS'92). However, we have recently obtained the 
following partial result in this direction. 
Theorem 6 Given a continuous probability density on R , for a randomly selected 
subset X C R  of N points the FF-network having the/rst hidden layer built from 
h ThL-neurons can implement 
C(N, n) de=f 2 i ' 
i=0 
dichotomies of X with a non-zero probability. Such a network can be constructed 
using nh variabJe synaptic weights between input and hidden layer only. 
For h - 1 this theorem reduces to its classical form for which the phrase "with 
non-zero probability" can be strengthened to "with probability 1" [2]. 
The proof of the theorem develops Sakurai's idea of utilising the Vandermonde 
determinant to show the following property of the curve c(t) $ (t, t2,...,t_) ' 
(*) for any subset X of N points x = c(t), ...,xjv = C(tN), t < 
t9. < ... < tN, any hyperpJane in R ' can intersect no more then n 
different segments [xi, xi+x] of c. 
The first step of the proof is to observe that the property (*) itself implies that 
the count (10) holds for such a set X. The second and the crucial step consists in 
showing that for a sufficiently small e > 0, for any selection of points l, ...,v C R ' 
such that [[ki - xill ( e for i = 1,..., n, there exists a curve a passing through these 
points and satisfying also the property (*). 
Theorem 6 implies that in the class of multilayer PP-networks having the first hidden 
layer built from ThL-neurons only the single hidden layer networks are the most 
646 Kowalczyk 
efficient, since the higher layers have no influence on the number of implemented 
dichotomies (at least for the class of domains x C R n considered). 
Note that by virtue of (10) and the classical argument of Cover [2] for the class 
of domains X as in the Theorem fi the capacity of the network considered is 2nh. 
Thus the following estimates hold. 
Corollary 7 In the class of FF-networks with a fized number h of hidden units 
the ratio of the mazimal capacity per hidden unit achievable by FHU-network to 
the mazimal capacity per hidden unit achievable by VHU-networks having the ThL- 
neurons in the first hidden layer only is 2h/2nh = 1In. The analogous ratio for 
capacities per variable synaptic weight (in the class of FF-networks with a fized 
number s of variable synaptic weights) is < 2s/2s = 1. 
Acknowledgement. I thank A. Sakurai of Hitachi Ltd., for helpful comments 
leading to the improvement of results of the paper. The permission of the Direc- 
tor, Telecom Australia Research Laboratories, to publish this material is gratefully 
acknowledged. 
References 
[1] E. Baum. On the capabilities of multilayer perceptrons. Journal of Complezity, 
4:193-215, 1988. 
[2] T.M. Cover. Geometrical and statistical properties of linear inequalities with 
applications to pattern recognition. IEEE Trans. Elec. Comp., EC-14:326- 
334, 1965. 
[3] R.M. Dudley. Central limit theorems for empirical measures. Ann. Probability, 
6:899-929, 1978. 
[4] F. Girosi and T. Poggio. Representation properties of networks: Kolmogorov's 
theorem is irrelevant. Neural Computation, 1:465-469, (1989). 
[5] M. Golubitsky and V. Guillemin. Stable Mapping and Their Singularities. 
Springer-Verlag, New York, 1973. 
[6] S. Huang and Y. Huang. Bounds on the number of hidden neurons in multilayer 
percepttons. IEEE Transactions on Neural Networks, 2:47-55, (1991). 
[7] V. Kurkova. Kolmogorov theorem is relevant. Neural Computation, 1, 1992. 
[8] D. Psaltis, C.H. Park, and J. Hong. Higher order associative memories and 
their optical implementations. Neural Networks, 1:149-163, (1988). 
[9] N. Redding, A. Kowalczyk, and T. Downs. Higher order separability and 
minimal hidden-unit fan-in. In T. Kohonen et al., editor, Artificial Neural 
Networks, volume 1, pages 25-30. Elsevier, 1991. 
[10] A. Sakurai. n-h-1 networks store no less n. h + I examples but sometimes no 
more. In Proceedings of IJCNN92, pages III-936-III-941. IEEE, June 1992. 
PART VIII 
SPEECH AND 
SIGNAL 
PROCESSING 
