REMARKS ON INTERPOLATION AND 
RECOGNITION USING NEURAL NETS 
Eduardo D. Sontag* 
SYCON - Center for Systems and Control 
Rutgers University 
New Brunswick, NJ 08903 
Abstract 
We consider different types of single-hidden-layer feedforward nets: with 
or without direct input to output connections, and using either thresh- 
old or sigmoidal activation functions. The main results show that direct 
connections in threshold nets double the recognition but not the interpo- 
lation power, while using sigmoids rather than thresholds allows (at least) 
doubling both. Various results are also given on VC dimension and other 
measures of recognition capabilities. 
1 INTRODUCTION 
In this work we continue to develop the theme of comparing threshold and sigmoidal 
feedforward nets. In (Sontag and Sussmann, 1989) we showed that the "general- 
ized delta rule" (backpropagation) can give rise to pathological behavior -namely, 
the existence of spurious local minima even when no hidden neurons are used,- 
in contrast to the situation that holds for threshold nets. On the other hand, in 
(Sontag and Sussmann, 1989) we remarked that provided that the right variant be 
used, separable sets do give rise to globally convergent backpropagation, in com- 
plete analogy to the classical perceptton learning theorem. These results and those 
obtained by other authors probably settle most general questions about the case of 
no hidden units, so the next step is to look at the case of single hidden layers. In 
(Sontag, 1989) we announced the fact that sigmoidal activations (at least) double 
recognition power. Here we provide details, and we make several further remarks 
on this as well as on the topic of interpolation. 
Nets with one hidden layer are known to be in principle sufficient for arbitrary 
recognition tasks. This follows from the approximation theorems proved by various 
*E-mail: sontag@hilbert.rutgers.edu 
940 Sontag 
authors: (Funahashi, 1988), (Cybenko,1989), and (Hornik et. al., 1989). However, 
what is far less clear is how many neurons are needed for achieving a given recog- 
nition, interpolation, or approximation objective. This is of importance both in its 
practical aspects (having rough estimates of how many neurons will be needed is es- 
sential when applying backpropagation) and in evaluating generalization properties 
(larger nets tend to lead to poorer generalization). It is known and easy to prove 
(see for instance (Arai, 1989), (Chester, 1990)) that one can basically interpolate 
values at any n + 1 points using an n-neuron net, and in particular that any n + 1- 
point set can be dichotomized by such nets. Among other facts, we point out here 
that allowing direct input to output connections permits doubling the recognition 
power to 2n, and the same result is achieved if sigmoidal neurons are used but such 
direct connections are not allowed. Further, we remark that approximate interpo- 
lation of 2n - 1 points is also possible, provided that sigmoidal units be employed 
(but direct connections in threshold nets do not suffice). 
The dimension of the input space (that is, the number of "input units") can influ- 
ence the number of neurons needed, are least for dichotomy problems for suitably 
chosen sets. In particular, Baum had shown some time back (Baum, 1988) that 
the VC dimension of threshold nets with a fixed number of hidden units is at least 
proportional to this dimension. We give lower bounds, in dimension two, at least 
doubling the VC dimension if sigmoids or direct connections are allowed. 
Lack of space precludes the inclusion of proofs; references to technical reports are 
given as appropriate. A full-length version of this paper is also available from the 
author. 
2 DICHOTOMIES 
The first few definitions are standard. Let N be a positive integer. A dichotomy 
or two-coloring (S_, S+) on a set S C R N is a partition S = S_ [.J S+ of S into two 
disjoint subsets. A function f: R S - R will be said to implement this dichotomy 
if it holds that 
f(u) > O for u & $+ and f(u) < O for u & $_ 
Let jr be a class of functions from R v to R, assumed to be nontrivial, in the sense 
that for each point u  IR v there is some fz  jr so that fz(u) > 0 and some f2  jr 
so that f2 (u) < 0. This class shatters the set $ C_ R v if each dichotomy on $ can 
be implemented by some f  jr. 
Here we consider, for any class of functions jr as above, the following measures of 
classification power. First we introduce  and/z, dealing with "best" and "worst" 
cases respectively: (jr) denotes the largest integer I _> 1 (possibly oc>) so that there 
is at least some set $ of cardinality I in R S which can be shattered by jr, while 
/(jr) is the largest integer l _> 1 (possibly c) so that every set of cardinality I can 
be shattered by jr. Note that by definition,/(jr) _< (jr) for every class jr. 
In particular, the definitions imply that no set of cardinality (jr) + i can be 
shattered, and that there is at least some set of cardinality/(jr) + 1 which cannot be 
shattered. The integer  is usually called the Vapnik-Chervonenkis (VC) dimension 
of the class jr (see for instance (Saum, 1988)), and appears in formalizations of 
learning in the distribution-free sense. 
Remarks on Interpolation and Recognition Using Neural Nets 941 
A set may fail to be shattered by . because it is very special (see the example 
below with colinear points). In that sense, a more robust measure is useful: p() 
is the largest integer I >_ 1 (possibly cx) for which the class of sets S that can be 
shattered by  is dense, in the sense that given every/-element set S = {sx,..., st} 
there are points gi arbitrarily close to the respective si's such that  = {x,...,} 
can be shattered by 5. Note that 
<_ <_ 
for all/F. 
To obtain an upper bound rn for p(jr) one needs to exhibit an open class of sets of 
cardinality rn + 1 none of which can be shattered. 
Take as an example the class/F consisting of all arline functions f(a) = aa + by + c 
on 1 2. Since any three points can be shattered by an arline map provided that they 
are not colinear (just choose a line aa + by + c = 0 that separates any point which 
is colored different from the rest), it follows that 3 < p. On the other hand, no set 
of four points can ever be dichotomized, which implies that  < 3 and therefore the 
conclusion p =  = 3 for this class. (The negative statement can be verified by a 
case by case analysis: if the four points form the vertices of a 4-gon color them in 
"XOtL" fashion, alternate vertices of the same color; if 3 form a triangle and the 
remaining one is inside, color the extreme points differently from the remaining one; 
if all colinear then use an alternating coloring). Finally, since there is some set of 3 
points which cannot be dichotomized (any set of three colinear points is like this), 
but every set of two can, _p = 2 . 
We shall say that . is robust if whenever S can be shattered by . also every small 
enough perturbation of S can be shattered. For a robust class and I = p(), every 
set in an open dense subset in the above topology, i.e. almost every set of I elements, 
can be shattered. 
3 NETS 
We define a "neural net" as a function of a certain type, corresponding to the idea 
of feedforward interconnections, via additive links, of neurons each of which has a 
scalar response or activation function O. 
Definition 3.1 Let 0 : 1 - 1 be any function. A function f : 1 N - 1 is 
a single-hidden-layer neural net with k hidden neurons of type 0 and N inputs, 
or just a (k, O)-net, if there are real numbers w0, wx,..., wk, r,..., rk and vectors 
v0, v,..., v 6 1 v such that, for all u 6 1 v, 
f(u) = w0 + v0., + w, (2) 
i=1 
where the dot indicates inner product. A net with no direct i/o connections is one 
for which v0 = 0. 
For fixed 0, and under mild assumptions on 0, such neural nets can be used to 
approximate uniformly arbitrary continuous functions on compacts. In particular, 
they can be used to implement arbitrary dichotomies. 
942 Sontag 
In neural net practice, one often takes 0 to be the standard sigmoid a(x) = x 
1 +e_------- 7 or 
equivalently, up to translations and change of coordinates, the hyperbolic tangent 
tanh(x). Another usual choice is the hardlimiter, threshold, or Heaviside function 
{ ifz<O 
7(z) = if z > 0 
which can be approximated well by a(-rx) when the "gain" -), is large. Yet another 
possibility is the use of the piecewise linear function 
{ -1 if x_<-i 
r(x) = if x _> 1 
x otherwise. 
Most analysis has been done for 7/and no direct connections, but numerical tech- 
niques typically use the standard sigmoid (or equivalently tanh). The activation 
r will be useful as an example for which sharper bounds can be obtained. The 
examples a and r, but not 7'{, are particular cases of the following more general 
type of activation function: 
Definition 3.2 A function 0: IR - IR will be called a sigmoid if these two prop- 
erties hold: 
(Sl) t+ := limx-+oo 0(z) and t_ := limx._oo 0(z) exist, and t+ 
(S2) There is some point c such that 0 is differenttable at c and O'(c) = I  O. 
All the examples above lead to robust classes, in the sense defined earlier. More 
precisely, assume that 0 is continuous except for at most tintrely many points z, 
and it is left continuous at such z, and let $c be the class of (k, 0)-nets, for any 
fixed k. Then jr is robust, and the same statement holds for nets with no direct 
connections. 
4 CLASSIFICATION RESULTS 
We let I(k,O,N) denote/(), where  is the class of (k,O)-nets in IR N with no 
direct connections, and similarly for _ and , and a superscript d is used for the 
class of arbitrary such nets (with possible direct connections from input to output). 
The lower measure E is independent of dimension: 
Lemma 4.1 For each k,O,N, (k,O,N)=/(k, 0, 1) and _a(k,O,N) = _a(k,O, 1). 
This justifies denoting these quantities just as p(k,0) and _d(k,O) respectively, as 
we do from now on, and giving proofs only for N = 1. 
Lemma 4.2 For any sigmoid 0, and for each k, N, 
(k + x, 0, N) _>/(k,, N) 
and similarly for _ and . 
The main results on classification will be as follows. 
Remarks on Interpolation and Recognition Using Neural Nets 943 
Theorem ! For any sigmoid O, and .for each k, 
= 
>_ 
Theorem 2 For each k, 
4 
2) < 
2k+l 
4k+3. 
Theorem 3 For any sigmoid O, and for each k, 
+ 1 < 
4k+3 _ a(k,7-{,2) 
4k- i <_ (k, 0,2). 
These results are proved in (Sontag, 1990a). The first inequality in Theorem 2 
follows from the results in (Baum, 1988), who in fact established a lower bound 
of 2N L*-J for (,,N) (and hence for  too), for every N, not just N = 2 as 
in the Ieorem above. We conjecture, but have as yet been unable to prove, that 
direct connections or sigmoids should also improve these bounds by at least a factor 
of 2, just as in the two-dimensional case and in the worst-case analysis. Because 
of Lemma 4.2, the last statements in Theorems 1 and 3 are consequences of the 
previous two. 
5 SOME PARTICULAR ACTIVATION FUNCTIONS 
Consider the last inequality in Theorem 1. For arbitrary sigmoids, this is far too 
conservative, as the number/ can be improved considerably from 2k, even made 
infinite (see below). We conjecture that for the important practical case 0(x) = 
it is close to optimal, but the only upper bounds that we have are still too high. 
For the piecewise linear function r, at least, one has equality: 
Lemma 5.1 p_(k, ') = 2k. 
It is worth remarking that there are sigmoids O, as differentiable as wanted, even 
real-analytic, where all classification measures are infinite. Of course, the function 
0 is so complicated that there is no reasonably "finite" implementation for it. This 
remark is only of theoretical interest, to indicate that, unless further restrictions 
are made on (S1)-(S2), much better bounds can be obtained. (If only p and  
are desired to be infinite, one may also take the simpler example O(x) = sin(x). 
Note that for any I rationally independent real numbers xi, the vectors of the form 
(sin(7txx),...,sin(7x), with the 7i's real, form a dense subset of [-1, 1] , so all 
dichotomies on {xx,...,x} can be implemented with (1, sin)-nets.) 
Lemma 5.2 There is some sigmoid 0, which can be taken to be an analytic func- 
tion, so that _(1, 0) = cx. 
944 Sontag 
6 INTERPOLATION 
We now consider the following approximate interpolation problem. Assume given 
a sequence of k (distinct) points a:i,..., a:k in R N, any  > 0, and any sequence of 
real numbers yi,..., yk, as well as some class iF of functions from IR N to IR. We 
ask if there exists some 
f  iF so that If(wi)- YI <  for each i. (3) 
Let A_(iF) be the largest integer k _> 1, possibly infinite, so that for every set of data 
as above (3) can be solved. Note that, obviously, A_(iF) _< (iF). Just as in Lemma 
4.1, A_ is independent of the dimension N when applied to nets. Thus we let A_a(k, O) 
and A_(k,0) be respectively the values of A_(iF) when applied to (k,0)-nets with or 
without direct connections. 
We now summarize properties of A_. The next result --see (Sontag,1991), as well 
as the full version of this paper, for a proof-- should be compared with Theorem 
1. The main difference is in the second equality. Note that one can prove A_(k, 0) >_ 
A_a(k - 1, 7/), in complete analogy with the case of/, but this is not sufficient 
anymore to be able to derive the last inequality in t- Theorem from the second 
equality. 
Theorem 4 For any continuous sigmoid O, and for each k, 
= k+l 
A_(k, 0) _> 
Remark 6.1 Thus we can approximately interpolate any 2k - 1 points using k 
sigmoidal neurons. It is not hard to prove as a corollary that, for the standard 
sigmoid, this approximate interpolation property holds in the following stronger 
sense: for an open dense set of 2k - 1 points, one can achieve an open dense set 
of values; the proof involves looking first at points with rational coordinates, and 
using that on such points one is dealing basically with rational functions (after a 
diffeomorphism), plus some theory of semialgebraic sets. We conjecture that one 
should be able to interpolate at 2k points. Note that for k = 2 this is easy to 
achieve: just choose the slope d so that some zi - Zi+l becomes zero and the zi are 
allowed to be nonincreasing or nondecreasing. The same proof, changing the signs 
if necessary, gives the wanted net. For some examples, it is quite easy to get 2k 
points. For instance, A_(k, ') = 2k for the piecewise linear sigmoid -. [2 
7 FURTHER REMARKS 
The main conclusion from Theorem 1 is that sigmoids at least double recognition 
power for arbitrary sets. It may be the case that (k, cr, N)/F(k,7,N) . 2 for all 
N; this is true for N = 1 and is strongly suggested by Theorem 3 (the first bound 
appears to be quite tight). Unfortunately the proof of this theorem is based on a 
result from (Asano et. al., 1990) regarding arrangements of points in the plane, a 
fact which does not generalize to dimension three or higher. 
One may also compare the power of nets with and without connections, or threshold 
vs sigmoidal processors, on Boolean problems. For instance, it is a trivial conse- 
quence from the given results that parity on n bits can be computed with [__+_!q 
2  
Remarks on Interpolation and Recognition Using Neural Nets 945 
hidden sigmoidal units and no direct connections, though requiring (apparently, 
though this is an open problem) n thresholds. In addition, for some families of 
Boolean functions, the gap between sigmoidal nets and threshols nets may be in- 
finitely large (Sontag, 1990a). See (Sontag, 1990b) for representation properties of 
two-hidden-layer nets 
Acknowledgements 
This work was supported in part by Siemens Corporate Research, and in part by 
the CAIP Center, Rutgers University. 
References 
Arai, M., "Mapping abilities of three-layer neural networks," Proc. IJCNN Int. Joint 
Conf. on Neural Networks, Washington, June 18-22, 1989, IEEE Publications, 1989, 
pp. 1-419/424. 
Asano,T., J. Hershberger, J. Pach, E.D. Sontag, D. Souivaine, and S. Suri, "Sepa- 
rating Bi-Chromatic Points by Parallel Lines," Proceedings of the Second Canadian 
Conference on Computational Geometry, Ottawa, Canada, 1990, p. 46-49. 
Baum, E.B., "On the capabilities of multilayer percepttons," J. Complezity 4(1988): 
193-215. 
Chester, D., "Why two hidden layers and better than one," Proc. Int. Joint Conf. 
on Neural Networks, Washington, DC, Jan. 1990, IEEE Publications, 1990, p. 1.265- 
268. 
Cybenko, G., "Approximation by superpositions of a sigmoidal function," Math. 
Control, Signals, and Systems 2(1989): 303-314. 
Funahashi, K., "On the approximate realization of continuous mappings by neural 
networks," Proc. Int. Joint Conf. on Neural Networks, IEEE Publications, 1988, p. 
1.641-648. 
Hornik, K.M., M. Stinchcombe, and H. White, "Multilayer feedforward networks 
are universal approximators," Neural Networks 2(1989): 359-366. 
Sontag, E.D., "Sigmoids distinguish better than Heavisides," Neural Computation 
1(1989): 470-472. 
Sontag, E.D., "On the recognition capabilities of feedforward nets," Report 
SYCON-90-03, Rutgers Center for Systems and Control, April 1990. 
Sontag, E.D., "Feedback Stabilization Using Two-Hidden-Layer Nets," Report 
SYCON-90-11, Rutgers Center for Systems and Control, October 1990. 
Sontag, E.D., "Capabilities and training of feedforward nets," in Theory and Ap- 
plications of Neural Networks (R. Mammone and J. Zeevi, eds.), Academic Press, 
NY, 1991, to appear. 
Sontag, E.D., and H.J. Sussmann, "Backpropagation can give rise to spurious local 
minima even for networks without hidden layers," Comple Systems 3(1989): 91- 
106. 
Sontag, E.D., and H.J. Sussmann, "Backpropagation separates where percepttons 
do," Neural Networks(1991), to appear. 
