Size of multilayer networks for exact 
learning: analytic approach 
Andr Elisseeff 
D?pt Mathdmatiques et Informatique 
Ecole Normale Supdrieure de Lyon 
46 allde d'Italie 
F69364 Lyon cedex 07, FRANCE 
H61[ne Paugam-Moisy 
LIP, URA 1398 CNRS 
lcole Normale Supdrieure de Lyon 
46 allde d'Italie 
F69364 Lyon cedex 07, FRANCE 
Abstract 
This article presents a new result about the size of a multilayer 
neural network computing real outputs for exact learning of a finite 
set of real samples. The architecture of the network is feedforward, 
with one hidden layer and several outputs. Starting from a fixed 
training set, we consider the network as a function of its weights. 
We derive, for a wide family of transfer functions, a lower and an 
upper bound on the number of hidden units for exact learning, 
given the size of the dataset and the dimensions of the input and 
output spaces. 
1 RELATED WORKS 
The context of our work is rather similar to the well-known results of Baum et al. [1, 
2, 3, 5, 10], but we consider both real inputs and outputs, instead of the dichotomies 
usually addressed. We are interested in learning exactly all the examples of a fixed 
database, hence our work is different from stating that multilayer networks are 
universal approximators [6, 8, 9]. Since we consider real outputs and not only 
dichotomies, it is not straightforward to compare our results to the recent works 
about the VC-dimension of multilayer networks [11, 12, 13]. Our study is more 
closely related to several works of Sontag [14, 15], but with different hypotheses on 
the transfer functions of the units. Finally, our approach is based on geometrical 
considerations and is close to the model of Coetzee and Stonick [4]. 
We first define the network model and the notations, then develop our analytic 
approach and prove the fundamental theorem. In the last section, we discuss our 
point of view and propose some practical consequences of the result. 
2 THE NETWORK AS A FUNCTION OF ITS WEIGHTS 
General concepts on neural networks are presented in matrix and vector notations, 
in a geometrical perspective. All vectors are written in bold and considered as 
column vectors, whereas matrices are denoted with upper-case script. 
2.1 THE NETWORK ARCHITECTURE AND NOTATIONS 
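Consider a multilayer network with $N_I$ input units, $N_H$ hidden units and $N_S$ 
output units. The inputs and outputs are real-valued. The hidden units compute a 
non-linear function $f$ which will be specified later on. The output units are 
assumed to be linear. A learning set of $N_P$ examples is given and fixed. For all 
$p \in \{1..N_P\}$, the $p$-th example is defined by its input vector 
$\mathbf{d}_p \in \mathbb{R}^{N_I}$ and the corresponding desired output vector 
$\mathbf{t}_p \in \mathbb{R}^{N_S}$. The learning set can be represented as an input 
matrix, with both row and column notations, as follows 

$$\mathcal{D} = \begin{bmatrix} \mathbf{d}_1^T \\ \vdots \\ \mathbf{d}_{N_P}^T \end{bmatrix} = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1 N_I} \\ \vdots & \vdots & & \vdots \\ d_{N_P 1} & d_{N_P 2} & \cdots & d_{N_P N_I} \end{bmatrix} = [\boldsymbol{\delta}_1, \ldots, \boldsymbol{\delta}_{N_I}]$$

Similarly, the target matrix is $\mathcal{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_{N_P}]^T$, 
with independent row vectors. 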
2.2 THE NETWORK AS A FUNCTION g OF ITS WEIGHTS 
For all $h \in \{1..N_H\}$, $\mathbf{w}_h^1 = (w_{h1}^1, \ldots, w_{h N_I}^1)^T \in \mathbb{R}^{N_I}$ 
is the vector of the weights between all the input units and the $h$-th hidden unit. 
The input weight matrix $W^1$ is defined as $W^1 = [\mathbf{w}_1^1, \ldots, \mathbf{w}_{N_H}^1]$. 
Similarly, a vector $\mathbf{w}_s^2 = (w_{s1}^2, \ldots, w_{s N_H}^2)^T \in \mathbb{R}^{N_H}$ 
represents the weights between all the hidden units and the $s$-th output unit, for all 
$s \in \{1..N_S\}$. Thus the output weight matrix $W^2$ is defined as 
$W^2 = [\mathbf{w}_1^2, \ldots, \mathbf{w}_{N_S}^2]$. 
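For an input matrix $\mathcal{D}$, the network computes an output matrix 

$$\mathcal{Z}(\mathcal{D}) = \begin{bmatrix} z_1(\mathbf{d}_1) & \cdots & z_{N_S}(\mathbf{d}_1) \\ \vdots & & \vdots \\ z_1(\mathbf{d}_{N_P}) & \cdots & z_{N_S}(\mathbf{d}_{N_P}) \end{bmatrix} = \begin{bmatrix} \mathbf{z}(\mathbf{d}_1)^T \\ \vdots \\ \mathbf{z}(\mathbf{d}_{N_P})^T \end{bmatrix} = [\mathbf{z}_1(\mathcal{D}), \ldots, \mathbf{z}_{N_S}(\mathcal{D})]$$

where each output vector $\mathbf{z}(\mathbf{d}_p)$ must be equal to the target 
$\mathbf{t}_p$ for exact learning. 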
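The network computation can be detailed as follows, for all $s \in \{1..N_S\}$ 

$$z_s(\mathbf{d}_p) = \sum_{h=1}^{N_H} w_{sh}^2 \, f\Big(\sum_{i=1}^{N_I} d_{pi} \, w_{hi}^1\Big) = \sum_{h=1}^{N_H} w_{sh}^2 \, f(\mathbf{d}_p^T \mathbf{w}_h^1)$$

Hence, for the whole learning set, the $s$-th output component is 

$$\mathbf{z}_s(\mathcal{D}) = \sum_{h=1}^{N_H} w_{sh}^2 \begin{bmatrix} f(\mathbf{d}_1^T \mathbf{w}_h^1) \\ \vdots \\ f(\mathbf{d}_{N_P}^T \mathbf{w}_h^1) \end{bmatrix} = \sum_{h=1}^{N_H} w_{sh}^2 \, F(\mathcal{D} \mathbf{w}_h^1) \qquad (1)$$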
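In equation (1), $F$ is a vector operator which transforms an $n$-vector $\mathbf{v}$ 
into an $n$-vector $F(\mathbf{v})$ according to the relation 
$[F(\mathbf{v})]_i = f([\mathbf{v}]_i)$, $i \in \{1..n\}$. The same notation $F$ 
will be used for the matrix operator. Finally, the expression of the output matrix 
can be deduced from equation (1) as follows 

$$\mathcal{Z}(\mathcal{D}) = [F(\mathcal{D} \mathbf{w}_1^1), \ldots, F(\mathcal{D} \mathbf{w}_{N_H}^1)] \, [\mathbf{w}_1^2, \ldots, \mathbf{w}_{N_S}^2]$$

$$\mathcal{Z}(\mathcal{D}) = F(\mathcal{D} W^1) \, W^2 \qquad (2)$$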
From equation (2), the network output matrix appears as a simple function of 
the input matrix and the network weights. Unlike Coetzee and Stonick, we will 
consider that the input matrix $\mathcal{D}$ is not a variable of the problem. Thus 
we express the network output matrix $\mathcal{Z}(\mathcal{D})$ as a function of its 
weights. Let $g$ be this function 

$$g : \mathbb{R}^{N_I \times N_H + N_H \times N_S} \to \mathbb{R}^{N_P \times N_S}, \qquad W = (W^1, W^2) \mapsto F(\mathcal{D} W^1) \, W^2$$

The function $g$ clearly depends on the input matrix and could have been denoted by 
$g_{\mathcal{D}}$, but this index will be dropped for clarity. 
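
As an illustration of equation (2), the following minimal NumPy sketch evaluates the network as a function of its weights for a fixed input matrix, and compares the matrix form with the unit-by-unit sum over hidden units. The choice of `tanh` as transfer function, the array sizes and all names (`g`, `D`, `W1`, `W2`, `Z_loop`) are assumptions made for the example, not part of the paper.

```python
import numpy as np

def g(W1, W2, D, f=np.tanh):
    """Network output matrix F(D W1) W2 for a fixed input matrix D."""
    return f(D @ W1) @ W2

rng = np.random.default_rng(0)
N_P, N_I, N_H, N_S = 5, 3, 4, 2        # illustrative sizes only
D = rng.normal(size=(N_P, N_I))        # one example per row
W1 = rng.normal(size=(N_I, N_H))       # column h holds the input weights of hidden unit h
W2 = rng.normal(size=(N_H, N_S))       # column s holds the output weights of output unit s

Z = g(W1, W2, D)                       # matrix form, shape (N_P, N_S)

# Same computation written as the sum over hidden units, one entry at a time.
Z_loop = np.zeros((N_P, N_S))
for p in range(N_P):
    for s in range(N_S):
        Z_loop[p, s] = sum(W2[h, s] * np.tanh(D[p] @ W1[:, h]) for h in range(N_H))
```

Both forms give the same matrix; the matrix form is the one that makes the dependence on the weights alone explicit once `D` is held fixed.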
3 FUNDAMENTAL RESULT 
3.1 PROPERTY OF FUNCTION g 
Learning is said to be exact on $\mathcal{D}$ if and only if there exists a network 
such that its output matrix $\mathcal{Z}(\mathcal{D})$ is equal to the target matrix 
$\mathcal{T}$. If $g$ is a diffeomorphic function from $\mathbb{R}^{N_I N_H + N_H N_S}$ 
onto $\mathbb{R}^{N_P N_S}$ then the network can learn any target in 
$\mathbb{R}^{N_P N_S}$ exactly. We prove that it is sufficient for the network 
function $g$ to be a local diffeomorphism. Suppose there exist a set of weights $X$, 
an open subset $U \subset \mathbb{R}^{N_I N_H + N_H N_S}$ including $X$ and an open 
subset $V \subset \mathbb{R}^{N_P N_S}$ including $g(X)$ such that $g$ is 
diffeomorphic from $U$ to $V$. Since $V$ is an open neighborhood of $g(X)$, it 
contains the point $y = g(X) + \mathcal{T}/\lambda$ for any sufficiently large real 
$\lambda$; hence there exist a real $\lambda$ and a point $y$ in $V$ such that 
$\mathcal{T} = \lambda(y - g(X))$. Since $g$ is diffeomorphic from $U$ to $V$, there 
exists a set of weights $Y$ in $U$ such that $y = g(Y)$, hence 
$\mathcal{T} = \lambda(g(Y) - g(X))$. The output units of the network compute a 
linear transfer function, hence the linear combination of $g(X)$ and $g(Y)$ can be 
integrated in the output weights and a network with twice $N_I N_H + N_H N_S$ 
weights can learn $(\mathcal{D}, \mathcal{T})$ exactly (see Figure 1). 
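[Figure 1 diagram: a network with $N_I$ inputs and $2N_H$ hidden units whose output is $\mathcal{T} = \lambda(g(Y) - g(X))$] 
Figure 1: A network for exact learning of a target $\mathcal{T}$ (unique output for clarity) 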
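
The doubling argument can be checked numerically. The sketch below assumes two arbitrary weight sets `X` and `Y` and a scalar `lam` (all hypothetical values; the proof itself only asserts that suitable ones exist) and builds one network with twice as many hidden units whose output equals `lam * (g(Y) - g(X))`. The helper names `g` and `combine` and the `tanh` transfer function are choices made for this example.

```python
import numpy as np

def g(W1, W2, D, f=np.tanh):
    """Network output matrix F(D W1) W2 for a fixed input matrix D."""
    return f(D @ W1) @ W2

def combine(X, Y, lam):
    """Weights of a network with 2*N_H hidden units computing lam * (g(Y) - g(X))."""
    (W1x, W2x), (W1y, W2y) = X, Y
    W1 = np.hstack([W1y, W1x])                # hidden units of Y, then hidden units of X
    W2 = np.vstack([lam * W2y, -lam * W2x])   # linear outputs absorb lam and the minus sign
    return W1, W2

rng = np.random.default_rng(1)
N_P, N_I, N_H, N_S = 6, 3, 2, 1
D = rng.normal(size=(N_P, N_I))
X = (rng.normal(size=(N_I, N_H)), rng.normal(size=(N_H, N_S)))
Y = (rng.normal(size=(N_I, N_H)), rng.normal(size=(N_H, N_S)))
lam = 2.5

W1, W2 = combine(X, Y, lam)
big = g(W1, W2, D)                            # output of the doubled network
target = lam * (g(*Y, D) - g(*X, D))          # linear combination of the two small networks
```

The construction relies only on the linearity of the output units, which is why the hidden layer doubles while the input and output layers are unchanged.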
For $g$ to be a local diffeomorphism, it is sufficient to find a set of weights $X$ 
such that the Jacobian of $g$ at $X$ is non-zero and to apply the theorem of local 
inversion. This analysis is developed in the next sections and requires some 
assumptions on the transfer function $f$ of the hidden units. A function which 
verifies such a hypothesis $\mathcal{H}$ will be called a $\mathcal{H}$-function and 
is defined below. 
3.2 DEFINITION AND THEOREM 
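Definition 1 Consider a function $f: \mathbb{R} \to \mathbb{R}$ which is 
$C^1(\mathbb{R})$ (i.e. with continuous derivative) and which has finite limits in 
$-\infty$ and $+\infty$. Such a function is called a $\mathcal{H}$-function iff it 
verifies the following property 

$$(\mathcal{H}) \qquad (\forall a \in \mathbb{R} \; / \; |a| > 1) \quad \lim_{x \to \pm\infty} \left| \frac{f'(ax)}{f'(x)} \right| = 0$$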
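
A small numerical illustration, under the assumption that the property is read as "the ratio |f'(ax)/f'(x)| tends to 0 as x grows, for |a| > 1": the sketch below evaluates this ratio for a hyperbolic tangent and for a Gaussian at a few increasing points. The functions `tanh_prime`, `gauss_prime` and `ratio`, the value a = 2 and the sample points are choices made for the example.

```python
import numpy as np

def tanh_prime(x):
    """Derivative of tanh, written as 1/cosh^2 to stay accurate for large x."""
    return 1.0 / np.cosh(x) ** 2

def gauss_prime(x):
    """Derivative of the Gaussian exp(-x^2)."""
    return -2.0 * x * np.exp(-x ** 2)

def ratio(f_prime, a, x):
    """|f'(a x) / f'(x)| evaluated at the points x."""
    return np.abs(f_prime(a * x) / f_prime(x))

xs = np.array([2.0, 4.0, 8.0])
r_tanh = ratio(tanh_prime, 2.0, xs)    # shrinks as x grows
r_gauss = ratio(gauss_prime, 2.0, xs)  # shrinks even faster
```

For a piecewise-linear saturation the derivative vanishes identically outside a bounded interval, so the ratio is not defined there, which is consistent with such functions being excluded.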
From this hypothesis on the transfer function of all the hidden units, the funda- 
mental result can be stated as follows 
Theorem 1 Exact learning of a set of $N_P$ examples, in general position, from 
$\mathbb{R}^{N_I}$ to $\mathbb{R}^{N_S}$, can be realized by a network with linear 
output units and a transfer function which is a $\mathcal{H}$-function, if the size 
$N_H$ of its hidden layer verifies the following bounds 
Lower Bound: $N_H = \lceil N_P N_S / (N_I + N_S) \rceil$ hidden units are necessary 
Upper Bound: $N_H = 2 \lceil N_P / (N_I + N_S) \rceil N_S$ hidden units are sufficient 
The proof of the lower bound is straightforward, since a condition for $g$ to be 
diffeomorphic from $\mathbb{R}^{N_I N_H + N_H N_S}$ onto $\mathbb{R}^{N_P N_S}$ is the 
equality of its input and output space dimensions: $N_I N_H + N_H N_S = N_P N_S$. 
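
The dimension count behind the lower bound can be sketched as follows: the smallest hidden layer is the one for which the number of weights reaches the number of target values. The helper names `n_weights` and `lower_bound` and the example sizes are hypothetical; biases are not counted, as in the text.

```python
def n_weights(n_i, n_h, n_s):
    """Number of weights N_I*N_H + N_H*N_S of the one-hidden-layer network."""
    return n_h * (n_i + n_s)

def lower_bound(n_p, n_i, n_s):
    """Smallest N_H whose weight count is at least N_P*N_S (integer ceiling)."""
    return -(-(n_p * n_s) // (n_i + n_s))

n_p, n_i, n_s = 100, 8, 2          # hypothetical problem sizes
n_h = lower_bound(n_p, n_i, n_s)   # hidden units needed to match dimensions
```

With one hidden unit fewer, the weight space has a smaller dimension than the space of targets, so the map from weights to outputs cannot be a diffeomorphism onto it.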
3.3 SKETCH OF THE PROOF FOR THE UPPER BOUND 
The function $g$ is an expression of the network as a function of its weights, for a 
given input matrix: $g(W^1, W^2) = F(\mathcal{D} W^1) \, W^2$, and $g$ can be 
decomposed according to its vectorial components on the learning set (which are 
themselves vectors of size $N_S$). For all $p \in \{1..N_P\}$ 

$$g_p(W^1, W^2) = \mathbf{z}(\mathbf{d}_p) = \Big[ \sum_{h=1}^{N_H} w_{1h}^2 \, f(\mathbf{d}_p^T \mathbf{w}_h^1), \; \ldots, \; \sum_{h=1}^{N_H} w_{N_S h}^2 \, f(\mathbf{d}_p^T \mathbf{w}_h^1) \Big]^T$$
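The derivatives of $g$ w.r.t. the input weight matrix $W^1$ are, for all 
$i \in \{1..N_I\}$, for all $h \in \{1..N_H\}$ 

$$\frac{\partial g_p}{\partial w_{hi}^1} = \big[ w_{1h}^2 \, f'(\mathbf{d}_p^T \mathbf{w}_h^1) \, d_{pi}, \; \ldots, \; w_{N_S h}^2 \, f'(\mathbf{d}_p^T \mathbf{w}_h^1) \, d_{pi} \big]^T$$

For the output weight matrix $W^2$, the derivatives of $g$ are, for all 
$h \in \{1..N_H\}$, for all $s \in \{1..N_S\}$ 

$$\frac{\partial g_p}{\partial w_{sh}^2} = \big[ \underbrace{0, \ldots, 0}_{s-1}, \; f(\mathbf{d}_p^T \mathbf{w}_h^1), \; \underbrace{0, \ldots, 0}_{N_S - s} \big]^T$$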
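The Jacobian matrix $\mathcal{M}_J(g)$ of $g$, the size of which is 
$N_I N_H + N_H N_S$ columns and $N_S N_P$ rows, is thus composed of a block-diagonal 
part (derivatives w.r.t. $W^2$) and several other blocks (derivatives w.r.t. $W^1$). 
Hence the Jacobian $J(g)$ can be rewritten $J(g) = |J_1, J_2, \ldots, J_{N_H}|$, after 
permutations of rows and columns, and using the Hadamard ($\circ$) and Kronecker 
($\otimes$) product notations, each $J_h$ being equal to 

$$J_h = \Big[ F(\mathcal{D} \mathbf{w}_h^1) \otimes I_{N_S}, \;\; \big[ F'(\mathcal{D} \mathbf{w}_h^1) \circ \boldsymbol{\delta}_1 \; \ldots \; F'(\mathcal{D} \mathbf{w}_h^1) \circ \boldsymbol{\delta}_{N_I} \big] \otimes \big[ w_{1h}^2 \ldots w_{N_S h}^2 \big]^T \Big] \qquad (3)$$

where $I_{N_S}$ is the identity matrix in dimension $N_S$ and $\boldsymbol{\delta}_i$ 
is the $i$-th column of the input matrix. 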
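
The block structure of the Jacobian can be sketched with NumPy's Kronecker product and checked against finite differences. This is only an illustration under stated assumptions: `tanh` as transfer function, rows ordered example by example (all outputs of example 1, then of example 2, and so on), and hypothetical helper names `jacobian_blocks` and `numeric_block`.

```python
import numpy as np

f = np.tanh

def f_prime(x):
    return 1.0 - np.tanh(x) ** 2

def jacobian_blocks(W1, W2, D):
    """Return [J_1, ..., J_NH]; each block has N_P*N_S rows and N_S + N_I columns."""
    N_S = W2.shape[1]
    blocks = []
    for h in range(W1.shape[1]):
        a = D @ W1[:, h]                                  # pre-activations of hidden unit h
        out_part = np.kron(f(a)[:, None], np.eye(N_S))    # derivatives w.r.t. output weights
        in_part = np.kron(f_prime(a)[:, None] * D,        # Hadamard product with each column of D
                          W2[h, :][:, None])              # derivatives w.r.t. input weights
        blocks.append(np.hstack([out_part, in_part]))
    return blocks

def numeric_block(W1, W2, D, h, eps=1e-6):
    """Central finite differences for the same columns, in the same order, as J_h."""
    def g(A, B):
        return (f(D @ A) @ B).ravel()
    cols = []
    for s in range(W2.shape[1]):
        E = np.zeros_like(W2)
        E[h, s] = eps
        cols.append((g(W1, W2 + E) - g(W1, W2 - E)) / (2 * eps))
    for i in range(W1.shape[0]):
        E = np.zeros_like(W1)
        E[i, h] = eps
        cols.append((g(W1 + E, W2) - g(W1 - E, W2)) / (2 * eps))
    return np.column_stack(cols)

rng = np.random.default_rng(2)
N_P, N_I, N_H, N_S = 4, 3, 2, 2
D = rng.normal(size=(N_P, N_I))
W1 = rng.normal(size=(N_I, N_H))
W2 = rng.normal(size=(N_H, N_S))
J = jacobian_blocks(W1, W2, D)
```

Stacking the blocks side by side with `np.hstack(J)` gives a matrix with `N_P * N_S` rows and `N_H * (N_I + N_S)` columns, matching the size stated for the Jacobian matrix.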
Our purpose is to prove that there exists a point $X = (W^1, W^2)$ such that the 
Jacobian $J(g)$ is non-zero at $X$, i.e. such that the column vectors of the Jacobian 
matrix $\mathcal{M}_J(g)$ are linearly independent at $X$. The proof can be divided 
in two steps. First we address the case of a single output unit. Afterwards, this 
proof can be used to extend the result to several output units. Since the complete 
development of both proofs requires a lot of calculations, we only present their 
sketches below. More details can be found in [7]. 
3.3.1 Case of a single output unit 
The proof is based on a linear arrangement of the projections of the column vectors 
of $J_h$ onto a subspace. This subspace is orthogonal to all the $J_i$ for $i < h$. We 
build a vector $\mathbf{w}_h^1$ and a scalar $w_h^2$ such that the projected column 
vectors are an independent family, hence they are independent of the $J_i$ for 
$i < h$. Such a construction is recursively applied until $h = N_H$. We then derive 
vectors $\mathbf{w}_1^1, \ldots, \mathbf{w}_{N_H}^1$ and $\mathbf{w}^2$ such that 
$J(g)$ is non-zero. The assumption on $\mathcal{H}$-functions is essential for 
proving that the projected column vectors of $J_h$ are independent. 
3.3.2 Case of multiple output units 
In order to extend the result from a single output to $N_S$ output units, the usual 
idea consists in considering as many subnetworks as the number of output units. From 
this point of view, the bound on the hidden units would be 
$N_H = 2 \lceil N_P / (N_I + 1) \rceil N_S$, which differs from the result stated in 
Theorem 1. A new direct proof can be developed (see [7]) and yields a better bound: 
the denominator is increased to $N_I + N_S$. 
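
To see what the larger denominator buys, the sketch below compares the two counts. It assumes that the subnetwork approach applies the single-output bound (denominator N_I + 1) once per output, and that the direct bound has the same shape with denominator N_I + N_S; the function names and the example sizes are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def subnetwork_bound(n_p, n_i, n_s):
    """One single-output subnetwork per output unit."""
    return 2 * ceil_div(n_p, n_i + 1) * n_s

def direct_bound(n_p, n_i, n_s):
    """Same count with the denominator increased to N_I + N_S."""
    return 2 * ceil_div(n_p, n_i + n_s) * n_s

n_p, n_i, n_s = 120, 5, 3                    # hypothetical problem sizes
sub = subnetwork_bound(n_p, n_i, n_s)        # hidden units with separate subnetworks
direct = direct_bound(n_p, n_i, n_s)         # hidden units with shared hidden layer
```

The two counts coincide for a single output and the direct one is never larger, since sharing the hidden layer lets every hidden unit contribute to all outputs.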
4 DISCUSSION 
The definition of a $\mathcal{H}$-function includes both sigmoids and Gaussian 
functions, which are commonly used for multilayer perceptrons and RBF networks, but 
is not valid for threshold functions. Figure 2 shows the difference between a 
sigmoid, which is a $\mathcal{H}$-function, and a saturation, which is not a 
$\mathcal{H}$-function. Figures 2(a) and 2(b) represent the span of the output space 
by the network when the weights are varying, i.e. the image of $g$. For clarity, the 
network is reduced to 1 hidden unit, 1 input unit, 1 output unit and 2 input 
patterns. For a $\mathcal{H}$-function, a ball can be extracted from the output 
space $\mathbb{R}^2$, onto which the function $g$ is a diffeomorphism. For the 
saturation, the image of $g$ is reduced to two lines, hence $g$ cannot be onto on a 
ball of $\mathbb{R}^2$. The assumption on the activation function is thus necessary 
to prove that the Jacobian is non-zero. 
Our bound on the number of hidden units is very similar to Baum's results for 
dichotomies and functions from real inputs to binary outputs [1]. Hence the present 
result can be seen as an extension of Baum's results to the case of real outputs, 
and for a wide family of transfer functions, different from the threshold functions 
addressed by Baum and Haussler in [2]. An early result on sigmoid networks has 
been stated by Sontag [14]: for a single output and at least two input units, the 
number of examples must be twice the number of hidden units. Our upper bound 
on the number of hidden units is strictly lower than that (as soon as the number of 
input units is more than two). A counterpart of considering real data is that our 
results bear little relation to the VC-dimension point of view. 
[Figure 2 panels: (a) a saturation function; (b) a sigmoid function] 
Figure 2: Positions of output vectors, for given data, when varying network weights 
5 CONCLUSION 
In this paper, we show that a number of hidden units 
$N_H = 2 \lceil N_P N_S / (N_I + N_S) \rceil$ is sufficient for a network of 
$\mathcal{H}$-functions to exactly learn a given set of $N_P$ examples in general 
position. We now discuss some of the practical consequences of this result. 
According to this formula, the size of the hidden layer required for exact learn- 
ing may grow very high if the size of the learning set is large. However, without 
a priori knowledge on the degree of redundancy in the learning set, exact learning 
is not the right goal in practical cases. Exact learning usually implies overfitting, 
especially if the examples are very noisy. Nevertheless, a sound approach could be 
to first reduce the dimension and the size of the learning set by feature 
extraction or data analysis as pre-processing. Our theoretical result could then be 
a valuable indication for sizing a network to perform exact learning on this 
representative learning set, with a good compromise between bias and variance. 
Our bound is more optimistic than the rule-of-thumb $N_P = 10w$ derived from the 
theory of PAC-learning. In our architecture, the number of weights is 
$w = 2 N_P N_S$. However, the proof is not constructive enough to yield a learning 
algorithm; in particular, it only asserts the existence of $g(Y)$ in the 
neighborhood of $g(X)$ where $g$ is a local diffeomorphism (cf. Figure 1). From this 
construction we can only conclude that $N_H = \lceil N_P N_S / (N_I + N_S) \rceil$ is 
necessary and $N_H = 2 \lceil N_P N_S / (N_I + N_S) \rceil$ is sufficient to realize 
exact learning of $N_P$ examples, from $\mathbb{R}^{N_I}$ to $\mathbb{R}^{N_S}$. 
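
The weight count quoted above can be reproduced with a few lines of arithmetic. The sketch assumes the sufficient size in the form 2 * ceil(N_P N_S / (N_I + N_S)) and counts N_H (N_I + N_S) weights without biases; the function names and the example sizes are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def sufficient_hidden_units(n_p, n_i, n_s):
    """Hidden layer size 2 * ceil(N_P N_S / (N_I + N_S))."""
    return 2 * ceil_div(n_p * n_s, n_i + n_s)

def weight_count(n_i, n_h, n_s):
    """Weights between input, hidden and output layers (no biases)."""
    return n_h * (n_i + n_s)

n_p, n_i, n_s = 1000, 8, 2                        # hypothetical problem sizes
n_h = sufficient_hidden_units(n_p, n_i, n_s)
w = weight_count(n_i, n_h, n_s)                   # equals 2 * N_P * N_S when the division is exact
examples_per_weight = n_p / w                     # to be compared with 10 in the rule of thumb
```

With these sizes the network has four weights per example, whereas the rule of thumb asks for ten examples per weight; the two regimes answer different questions (exact fitting versus generalization).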
The opportunity of using multilayer networks as auto-associative networks and for 
data compression can be discussed in the light of these results. Assume that 
$N_S = N_I$; the expression of the number of hidden units then reduces to 
$N_H = N_P$, or at least $N_H = N_P/2$. Since $N_P \geq N_I + N_S$, the number of 
hidden units must verify $N_H \geq N_I$. Therefore, a "diabolo" network architecture 
seems to be precluded for exact learning of auto-associations. A consequence may be 
that exact retrieval from data compression is hopeless when using internal 
representations of a hidden layer smaller than the data dimension. 
Acknowledgements 
This work was supported by European Esprit III Project n° 8556, NeuroCOLT 
Working Group. We thank C.S. Poon and J.V. Shah for fruitful discussions. 
References 
[1] E. B. Baum. On the capabilities of multilayer perceptrons. J. of Complexity, 
4:193-215, 1988. 
[2] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural 
Computation, 1:151-160, 1989. 
[3] E. K. Blum and L. K. Li. Approximation theory and feedforward networks. 
Neural Networks, 4(4):511-516, 1991. 
[4] F. M. Coetzee and V. L. Stonick. Topology and geometry of single hidden layer 
network, least squares weight solutions. Neural Computation, 7:672-705, 1995. 
[5] M. Cosnard, P. Koiran, and H. Paugam-Moisy. Bounds on the number of 
units for computing arbitrary dichotomies by multilayer perceptrons. J. of 
Complexity, 10:57-63, 1994. 
[6] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. 
Control, Signal Systems, 2:303-314, October 1988. 
[7] A. Elisseeff and H. Paugam-Moisy. Size of multilayer networks for exact 
learning: analytic approach. Rapport de recherche 96-16, LIP, July 1996. 
[8] K. Funahashi. On the approximate realization of continuous mappings by 
neural networks. Neural Networks, 2(3):183-192, 1989. 
[9] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks 
are universal approximators. Neural Networks, 2(5):359-366, 1989. 
[10] S.-C. Huang and Y.-F. Huang. Bounds on the number of hidden neurones in 
multilayer perceptrons. IEEE Trans. Neural Networks, 2:47-55, 1991. 
[11] M. Karpinski and A. Macintyre. Polynomial bounds for VC dimension of 
sigmoidal neural networks. In 27th ACM Symposium on Theory of Computing, 
pages 200-208, 1995. 
[12] P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. In 
Neural Information Processing Systems (NIPS*95), 1995. To appear. 
[13] W. Maass. Bounds for the computational power and learning complexity of 
analog neural networks. In 25th ACM Symposium on Theory of Computing, 
pages 335-344, 1993. 
[14] E. D. Sontag. Feedforward nets for interpolation and classification. J. Comp. 
Syst. Sci., 45:20-48, 1992. 
[15] E. D. Sontag. Shattering all sets of k points in "general position" requires 
(k-1)/2 parameters. Technical Report 96-01, Rutgers Center for Systems 
and Control (SYCON), February 1996. 
