Neural Networks with Quadratic VC 
Dimension 
Pascal Koiran* 
Lab. de l'Informatique du Paralldlisme 
Ecole Normale Supdrieure de Lyon - CNRS 
69364 Lyon Cedex 07, France 
Eduardo D. Sontag* 
Department of Mathematics 
Rutgers University 
New Brunswick, NJ 08903, USA 
Abstract 
This paper shows that neural networks which use continuous acti- 
vation functions have VC dimension at least as large as the square 
of the number of weights w. This result settles a long-standing 
open question, namely whether the well-known O(w log w) bound, 
known for hard-threshold nets, also held for more general sigmoidal 
nets. Implications for the number of samples needed for valid gen- 
eralization are discussed. 
I Introduction 
One of the main applications of artificial neural networks is to pattern classification 
tasks. A set of labeled training samples is provided, and a network must be obtained 
which is then expected to correctly classify previously unseen inputs. In this context, 
a central problem is to estimate the amount of training data needed to guarantee 
satisfactory learning performance. To study this question, it is necessary to first 
formalize the notion of learning from examples. 
One such formalization is based on the paradigm of probably approximately correct 
(PAC) learning, due to Valiant (1984). In this framework, one starts by fitting some 
function f, chosen from a predetermined class jr, to the given training data. The 
class jr is often called the "hypothesis class", and for purposes of this discussion it 
will be assumed that the functions in jr take binary values {0, 1) and are defined on a 
common domain X. (In neural networks applications, typically r corresponds to the 
set of all neural networks with a given architecture and choice of activation functions. 
The elements of X are the inputs, possibly multidimensional.) The training data 
consists of labeled samples (Xi,ei) , with each xi 6 X and each ei 6 {0, 1}, and 
*koiranlip. ens-lyon. fr. 
t sont aghilbert. rutgers. edu. 
198 P. KOIRAN, E. D. SONTAG 
"fitting" by an f means that f(zi) ---- ei for each i. Given a new example z, one 
uses f(x) as a guess of the "correct" classification of x. Assuming that both training 
inputs and future inputs are picked according to the same probability distribution 
on X, one needs that the space of possible inputs be well-sampled by the training 
data, so that f is an accurate fit. We omit the details of the formalization of 
PAC learning, since there are excellent references available, both in textbook (e.g. 
Anthony and Biggs (1992), Natarajan (1991)) and survey paper (e.g. Maass (1994)) 
form, and the concept is by now very well-known. 
After the work of Vapnik (1982) in statistics and of Blumer et. al. (1989) in com- 
putational learning theory, one knows that a certain combinatorial quantity, called 
the Vapnik-Chervonenkis (VC) dimension VC(:) of the class : of interest com- 
pletely characterizes the sample sizes needed for learnability in the PAC sense. (The 
appropriate definitions are reviewed below. In Valiant's formulation one is also in- 
terested in quantifying the computational effort required to actually fit a function 
to the given training data, but we are ignoring that aspect in the current paper.) 
Very roughly speaking, the number of samples needed in order to learn reliably is 
proportional to VC(:). Estimating VC(:) then becomes a central concern. Thus 
from now on, we speak exclusively of VC dimension, instead of the original PAC 
learning problem. 
The work of Cover (1988) and Baum and Haussler (1989) dealt with the computa- 
tion of VC(:) when the class : consists of networks built up from hard-threshold 
activations and having w weights; they showed that VC(:)= O(wlogw). (Con- 
versely, Maass (1993) showed that there is also a lower bound of this form.) It 
would appear that this definitely settled the VC dimension (and hence also the 
sample size) question. 
However, the above estimate assumes an architecture based on hard-threshold 
("Heaviside") neurons. In contrast, the usually employed gradient descent learning 
algorithms ("backpropagation" method) rely upon continuous activations, that is, 
neurons with graded responses. As pointed out in Sontag (1989), the use of ana- 
log activations, which allow the passing of rich (not just binary) information among 
levels, may result in higher memory capacity as compared with threshold nets. This 
has serious potential implications in learning, essentially because more memory ca- 
pacity means that a given function f may be able to "memorize" in a "rote" fashion 
too much data, and less generalization is therefore possible. Indeed, Sontag (1992) 
showed that there are conceivable (though not very practical) neural architectures 
with extremely high VC dimensions. Thus the problem of studying VC(:) for ana- 
log networks is an interesting and relevant issue. Two important contributions in 
this direction were the papers by Maass (1993) and by Goldberg and Jerrum (1995), 
which showed upper bounds on the VC dimension of networks that use piecewise 
polynomial activations. The last reference, in particular, established for that case 
an upper bound of O(w2), where, as before, w is the number of weights. However 
it was an open problem (specifically, "open problem number 7" in the recent survey 
by Maass (1993) if there is a matching w 2 lower bound for such networks, and more 
generally for arbitrary continuous-activation nets. It could have been the case that 
the upper bound O(w ) is merely an artifact of the method of proof in Goldberg 
and Jerrum (1995), and that reliable learning with continuous-activation networks 
is still possible with far smaller sample sizes, proportional to O(w log w). But this is 
not the case, and in this paper we answer Maass' open question in the affirmative. 
Assume given an activation er which has different limits at -Vx, and is such that 
there is at least one point where it has a derivative and the derivative is nonzero 
(this last condition rules out the Heaviside activation). Then there are architec- 
tures with arbitrary large numbers of weights w and VC dimension proportional 
Neural Networks with Quadratic VC Dimension 199 
to w 2. The proof relies on first showing that networks consisting of two types of 
activations, Heavisides and linear, already have this power. This is a somewhat 
surprising result, since purely linear networks result in VC dimension proportional 
to w, and purely threshold nets have, as per the results quoted above, VC dimension 
bounded by wlog w. Our construction was originally motivated by a related one, 
given in Goldberg and Jerrum (1995), which showed that real-number programs (in 
the Blum-Shub-Smale (1989) model of computation) with running time T have VC 
dimension (T2). The desired result on continuous activations is then obtained, 
approximating Heaviside gates by a-nets with large weights and approximating lin- 
ear gates by a-nets with small weights. This result applies in particular to the 
standard sigmoid 1/(1 + e-). (However, in contrast with the piecewise-polynomial 
case, there is still in that case a large gap between our (w ) lower bound and 
the O(w 4) upper bound which was recently established in Karpinski and Macin- 
tyre (1995).) A number of variations, dealing with Boolean inputs, or weakening 
the assumptions on tr, are discussed. The full version of this paper also includes 
some remarks on thresholds networks with a constant number of linear gates, and 
threshold-only nets with "shared" weights. 
Basic Terminology and Definitions 
Formally, a (first-order, feedforward) architecture or network .A is a connected di- 
rected acyclic graph together with an assignment of a function to a subset of its 
nodes. The nodes are of two types: those of fan-in zero are called input nodes and 
the remaining ones are called computation nodes or gates. An output node is a node 
of fan-out zero. To each gate g there is associated a function ag ' ]1 -- ]1, called the 
activation or gate function associated to g. 
The number of weights or parameters associated to a gate g is the integer rg equal 
to the fan-in of g plus one. (This definition is motivated by the fact that each input 
to the gate will be multiplied by a weight, and the results are added together with 
a "bias" constant term, seen as one more weight; see below.) The (total) number 
of weights (or parameters) of Jl is by definition the sum of the numbers ha, over all 
the gates g of .A. The number of inputs m of .A is the total number of input nodes 
(one also says that ".A has inputs in '"); it is assumed that m > 0. The number 
of outputs p of Jl is the number of output nodes (unless otherwise mentioned, we 
assume by default that all nets considered have one-dimensional outputs, that is, 
p=l). 
Two examples of gate functions that are of particular interest are the identity or 
linear gate: Id(x) = x for all x, and the threshold or Heaviside function: H(x) - 1 
if z _ O, H(x) = 0 if x < 0. 
Let Jl be an architecture. Assume that nodes of .4 have been linearly ordered as 
rl,..., r,, gx,  ., g, where the rj's are the input nodes and the gj's the gates. For 
simplicity, write ri :-- rg,, for each i = 1,...,l. Note that the total number of 
I 
parameters is n = Ei=i/2i and the fan-in of each gi is/2i -- 1. To each architecture 
Jl (strictly speaking, an architecture together with such an ordering of nodes) we 
associate a function 
F  I " x I  - R e , 
where p is the number of outputs of Jl, defined by first assigning an "output" to 
each node, recursively on the distance from the the input nodes. Assume given 
an input x E 1" and a vector of weights w E " We partition w into blocks 
(wi,..., wl) of sizes hi,..., nl respectively. First the coordinates of x are assigned 
as the outputs of the input nodes r,..., rm respectively. For each of the other 
gates gi, we proceed as follows. Assume that outputs yl,..., Yn.-1 have already 
200 P. KOIRAN, E. D. SONTAG 
been assigned to the predecessor nodes of gi (these are input and/or computation 
nodes, listed consistently with the order fixed in advance). Then the output of gi 
is by definition 
O'g, (Wi,o q- Wi,lYl q- Wi,2Y2 q- ... q- Wi,n,-lYn,-1) , 
where we are writing wi = (wi,o, wi,, wi,2,..., wi,,,-). The value of F(x, w) is 
then by definition the vector (scalar ifp = 1) obtained by listing the outputs of the 
output nodes (in the agreed-upon fixed ordering of nodes). We call F the function 
computed by the architecture .A. For each choice of weights w  1", there is a 
function Fw ' 1 '  P defined by Fw(x) := F(x, w); by abuse of terminology we 
sometimes call this also the function computed by 4 (if the weight vector has been 
fixed). 
Assume that 4 is an architecture with inputs in 1 ' and scalar outputs, and that 
the (unique) output gate has range {0, 1}. A subset A C_ 5k " is said to be shattered 
by 4 if for each Boolean function /  A - (0, 1) there is some weight w  1" so 
that F,(x) =/(z) for all x 6 A. The Vapnik-Chervonenkis (VC) dimension of 4 
is the maximal size of a subset A C_ 1 ' that is shattered by 4. If the output gate 
can take non-binary values, we implicitly assume that the result of the computation 
is the sign of the output. That is, when we say that a subset A C_ 1 " is shattered 
by 4, we really mean that A is shattered by the architecture H(4) in which the 
output of 4 is fed to a sign gate. 
2 Networks Made up of Linear and Threshold Gates 
Proposition I For every n k 1, there is a network architecture .A with inputs in 
5k  and O(V/-) weights that can shatter a set of size N = n . This architecture is 
made only of linear and threshold gates. 
Proof. Our architecture has n parameters W1,  ., W, each of them is an element 
oft = {O.wi...wwi  {0,1}}. The shattered set will be $= [n]= {1,...,n} . 
For a given choice of W = (W,..., W,, 4 will compute the boolean function 
fw: $ '-* {0,1} defined as follows: fw(x,y) is equal to the x-th bit of Wy. Clearly, 
for any boolean function f on $, there exists a (unique) W such that f = fw. 
We first consider the obvious architecture which computes the function: 
fv(Y) = W1 q- (Wz - Wz-l)Z(y- z + 1/2) (1) 
sending each point y 6 [n] to Wy. This architecture has n - 1 threshold gates, 
3(n - 1) + 1 weights, and just one linear gate. 
Next we define a second multi-output net which maps w 6 T to its binary rep- 
resentation f2(w) = (Wl,..., w,). Assume by induction that we have a net A/'/ 
that maps w to (Wl,...,wi,O.wi+x...w,). Since Wi+l - H(O.Wi+l...w,- 1/2) 
and O.wi+...w, = 2 x O.wi+x...w, - Wi+l, Af/+ can be obtained by adding one 
threshold gate and one linear gate to A/'/ (as well as 4 weights). It follows that 
has n threshold gates, n linear gates and 4n weights. 
Finally, we define a net .M s which takes as input x 6 [hi and w = (Wl,..., w,) 
{0, 1} " and outputs w. We would like this network to be as follows: 
f3(:r, W)'-" Wl q-  wzZ(g - z q- 1/2)-  Wz-lZ(g - z q- 1/2). 
z=2 z=2 
Neural Networks with Quadratic VC Dimension 2 O1 
This is not quite possible, because the products between the wi's (which are inputs 
in this context) and the Heavisides are not allowed. However, since we are dealing 
with binary variables one can write uv = H(u + v - 1.5). Thus fir3 has one linear 
gate, q(n - 1) threshold gates and 12(n - 1) + n weights. Note that fw(x,y) 
f3(x, f2(fv(y)). This can be realized by means of a net that has n + 2 linear gates, 
(n-1)+n+q(n-1) = 6n-5 threshold gates, and (3n-2)+qn+(12n-11) = 19n-13 
weights. 
The following is the main result of this section: 
Theorem I For every n _ 1, there is a network architecture `4 with inputs in  
and O(x/-) weights that can shatter a set of size N = n . This architecture is 
made only of linear and threshold gates. 
Proof. The shattered set will be S = {0, 1,...,n  - 1}. For every z  S, there 
are unique integers x, y6 {0,1,...,n-I} such that u-- nx+y. The idea of the 
construction is to compute x and y, and then feed (x + 1, y + 1) to the network 
constructed in Proposition 1. Note that x is the unique integer such that u - nx  
{0, 1,..., n - 1}. It can therefore by computed by brute force search as follows: 
r--i 
x = y, kH[H(u - nk) + H(n - 1 - (u - nk)) - 1.5]. 
k=0 
This network has 3n threshold gates, one linear gate and 8n weights. Then of course 
y: u-nx. [3 
A Boolean version is as follows. 
Theorem 2 For every d _> 1, there is a network architecture .4 with O(v/-) 
weights that can shatter the N = 2 a points of {0, 1} a. This architecture is made 
only of linear and threshold gates. 
Proof. Given u  {0,13 a, one can compute x = 1 + El512i--lui and y = 1 + 
a=l 2i-lui+a with two linear gates. Then (x, y) can be fed to the network of 
Proposition 1 (with n = 2a). [3 
In other words, there is a network architecture with 2 d weights that can compute 
all boolean functions on 2d variables. 
3 Arbitrary Sigmoids 
We now extend the preceding VC dimension bounds to networks that use just 
one activation function er (instead of both linear and threshold gates). All that is 
required is that the gate function have a sigmoidal shape and satisfy a very weak 
smoothness property: 
1. er is differentiable at some point x0 (i.e., a(xo+h) = a(xo)+rrt(xo)h+o(h)) 
where crt(x0)- 0. 
2. lim_-._oo or(x) = 0 and lim_..+oo or(x) = 1 (the limits 0 and 1 can be 
replaced by any distinct numbers). 
function satisfying these two conditions will be called sigmoidal. Given any such 
we will show that networks using only er gates provide quadratic VC dimension. 
202 P. KOIRAN, E. D. SONTAG 
Theorem 3 Let er be an arbitrary sigmoidal function. There exist architectures 
and A2 with O(v/) weights made only of er gates such that: 
 .41 can shatter a subset of of cardinality N - n2; 
 .A can shatter the N = 2 a points of {0, 1} a. 
This follows directly from Theorems 1 and 2, together with the following simulation 
result: 
Theorem 4 Let a be a an arbitrary sigmoidal function. Let Af be a network of 
T threshold and L linear gates, with a threshold gate at the output. Then Af can 
be simulated on any given finite set of inputs by a network A? oft + L gates that 
all use the activation function  (except the output gate which is still a threshold). 
Moreover, if Af has n weights then AP has O(n) weights. 
Proof. Let $ be a finite set of inputs. We can assume, by changing the thresholds of 
threshold gates if necessary, that the net input Ig(x) to any threshold gate g of N' 
is different from 0 for all inputs x  $. 
Given e  0, let A/'e be the net obtained by replacing the output functions of all gates 
by the new output function x - a(x/e) if this output function is the sign function, 
and by x - er(x) -- [(xo-t-ex)-er(xo)]/[eer(xo)] if it is the identity function. Note 
that for any a  0, lim_.0+ (x/e)= H(x) uniformly for x ]- cx,-a] t [a, q-cx] 
and lim_.0 er(x)= x uniformly for x  [-l/a, l/a]. 
This implies by induction on .the depth of g that for any gate g of N' and any input 
x  $, the net input I9,e(x) to g in the transformed net A/' satisfies lime-0 Ig,(x) = 
Ig(x) (here, we use the fact that the output function of every g is continuous at 
Ig(x)). In particular, by taking g to be the output gate of N', we see that A/' and 
A/'e compute the same function on $ if e is small enough. Such a net A/'e can be 
transformed into an equivalent net AP that uses only er as gate function by a simple 
transformation of its weights and thresholds. The number of weights remains the 
same, except at most for a constant term that must be added to each net input to 
a gate; thus if A/' has n weights, A? has at most 2n weights. [] 
4 More General Gate Functions 
The objective of this section is to establish results similar to Theorem 3, but for 
even more arbitrary gate functions, in particular weakening the assumption that 
limits exist at infinity. The main result is, roughly, that any er which is piecewise 
twice (continuously) differentiable gives at least quadratic VC dimension, save for 
certain exceptional cases involving functions that are almost everywhere linear. 
A function  : I -- I is said to be piecewise C  if there is a finite sequence 
al ( a2 ( ... ( a r such that on each interval I of the form] - ,al[, ]ai,ai+l[ or 
lap,-t-x[, 1I is C . 
(Note: our results hold even if it is only assumed that the second derivative exists in 
each of the above intervals; we do not use the continuity of these second derivatives.) 
Theorem 5 Let er be a piecewise C  function. For every n _ 1, there exists an 
architecture made of er-gates, and with O(n) weights, that can shatter a subset of 
 of cardinality n , except perhaps in the following cases: 
1.  is piecewise-constant, and in this case the VC dimension of any architec- 
ture of n weights is O(n log n); 
Neural Networks with Quadratic VC Dimension 2 03 
2.  is affine, and in this case the VC dimension of any architecture of n 
weights is at most n. 
o 
there are constants a 0 and b such that a(x) = ax + b except at a finite 
nonempty set of points. In this case, the VC dimension of any architec- 
ture of n weights is O(n2), and there are architectures of VC dimension 
f(n log n). 
Due to the lack of space, the proof cannot be included in this paper. Note that 
the upper bound of the first special case is tight for threshold nets, and that of the 
second special case is tight for linear functions in I n. 
A cknowle dgement s 
Pascal Koiran was supported by an INRIA fellowship, DIMACS, and the Interna- 
tional Computer Science Institute. Eduardo Sontag was supported in part by US 
Air Force Grant AFOSR-94-0293. 
References 
M. ANTHONY AND N.L. BIGGS (1992) Computational Learning Theory: An Introduction, 
Cambridge U. Press. 
E.B. BAUM AND D. HAUSSLER (1989) What size net gives valid generalization?, Neural 
Computation 1, pp. 151-160. 
L. BLUM, M. Sims AND S. SMALE (1989) On the theory of computation and complex- 
ity over the real numbers: NP-completeness, recursive functions and universal machines, 
Bulletin of the AMS 21, pp. 1-46. 
A. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, AND M. WARMUTH (1989) Learnability 
and the Vapnik-Chervonenkis dimension, J. of the ACM 36, pp. 929-965. 
T.M. COVER (1988) Capacity problems for linear machines, in: Pattern Recognition, L. 
KanaJ ed., Thompson Book Co., pp. 283-289. 
P. GOLDBERG AND M. JERRUM (1995) Bounding the Vapnik-Uhervonenkis dimension of 
concept classes parametrized by real numbers, Machine Learning 18, pp. 131-148. 
M. KARPINSKI AND A. MACINTYRE (1995) Polynomial bounds for VC dimension of sig- 
moidal neural networks, in Proc. 27th ACM Symposium on Theory of Computing, pp. 200- 
208. 
W. MAASS (1993) Bounds for the computational power and learning complexity of analog 
neural nets, in Proc. of the 25th ACM Symp. Theory of Computing, pp. 335-344. 
W. MAASS (1994) Perspectives of current research about the complexity of learning in neu- 
ral nets, in Theoretical Advances in Neural Computation and Learning, V.P. Roychowd- 
hury, K.Y. Siu, and A. Orlitsky, editors, Kluwer, Boston, pp. 295-336. 
B.K. NATARAJAN (1991) Machine Learning: A Theoretical Approach, M. Kaufmann Pub- 
lishers, San Mateo, CA. 
E.D. SONTAG (1989) Sigmoids distinguish better than Heavisides, Neural Computation 1, 
pp. 470-472. 
E.D. SONT^G (1992) Feedforward nets for interpolation and classification, J. Comp. 
Syst. Sci 45, pp. 20-48. 
L.G. VALIANT (1984) A theory of the learnable, Comm. of the ACM 27, pp. 1134-1142 
V.N. VAPNIK (1982) Estimation of Dependencies Based on Empirical Data, Springer, 
Berlin. 
