The Power of Approximating: a 
Comparison of Activation Functions 
Bhaskar DasGupta 
Department of Computer Science 
University of Minnesota 
Minneapolis, MN 55455-0159 
email: dasguptacs. uam. edu 
Georg Schnitger 
Department of Computer Science 
The Pennsylvania State University 
University Park, PA 16809 
mail: gorgs. psu. du 
Abstract 
We compare activation functions in terms of the approximation 
power of their feedforward nets. We consider the case of analog as 
well as boolean input. 
I Introduction 
We consider efficient approximations of a given multivariate function f: [- 1, 1] m - 
T by feedforward neural networks. We first introduce the notion of a feedforward 
net. 
Let F be a class of real-valued functions, where each function is defined on some 
subset of 7Z. A F-net C is an unbounded fan-in circuit whose edges and vertices 
are labeled by real numbers. The real number assigned to an edge (resp. vertex) 
is called its weight (resp. its threshold). Moreover, to each vertex v an activation 
function % 6 F is assigned. Finally, we assume that C has a single sink w. 
The net C computes a function fc : [-1, 1] " - 7Z as follows. The components 
of the input vector x = (xl,...,Xm) 6 [--1, 1] m are assigned to the sources of C. 
Let vl,..., Vn be the immediate predecessors of a vertex v. The input for v is then 
sv(x) = Y'.=i wiyi-iv, where wi is the weight of the edge (vi, v), tv is the threshold 
of v and yl is the value assigned to vi. If v is not the sink, then we assign the value 
7(s(x)) to v. Otherwise we assign s(x) to v. 
Then f = sw is the function computed by C where w is the unique sink of C. 
615 
616 DasGupta and Schnitger 
A great deal of work has been done showing that nets of two layers can approximate 
(in various norms) large function classes (including continuous functions) arbitrar- 
ily well (Arai, 1989; Carrol and Dickinson, 1989; Cybenko, 1989; Funahashi, 1989; 
Gallant and White, 1988; Hornik et al. 1989; Irie and Miyake,1988; Lapades and 
Farber, 1987; Nielson, 1989; Poggio and Girosi, 1989; Wet et al., 1991). Various 
activation functions have been used, among others, the cosine squasher, the stan- 
dard sigmoid, radial basis functions, generalized radial basis functions, polynomials, 
trigonometric polynomials and binary thresholds. Still, as we will see, these func- 
tions differ greatly in terms of their approximation power when we only consider 
efficient nets; i.e. nets with few layers and few vertices. 
Our goal is to compare activation functions in terms of efficiency and quality of 
approximation. We measure efficiency by the size of the net (i.e. the number of 
vertices, not counting input units) and by its number of layers. Another resource 
of interest is the Lipschitz-bound of the net, which is a measure of the numerical 
stability of the net. We say that net C has Lipschitz-bound L if all weights and 
thresholds of C are bounded in absolute value by L and for each vertex v of C and 
for all inputs x, y G [-1, 1] ", 
[%(s.(x)) - %(s.(y)) I _< L. Is.(x) - s.(y)l. 
(Thus we do not demand that activation function % has Lipschitz-bound L, but 
only that % has Lipschitz-bound L for the inputs it receives.) We measure the 
quality of an approximation of function f by function fc by the Chebychev norm; 
i.e. by the maximum distance between f and fc over the input domain [-1, 1] m. 
Let r be a class of activation functions. We are particularly interested in the 
following two questions. 
 Given a function f: [-1, 1] " - 7, how well can we approximate f by a r-net 
with d layers, size s, and Lipschitz-bound L? Thus, we are particularly interested 
in the behavior of the approximation error e(s, d) as a function of size and number 
of layers. This set-up allows us to investigate how much the approximation error 
decreases with increased size and/or number of layers. 
 Given two classes of activation functions P and r2, when do r-nets and 
nets have essentially the same "approximation power" with respect to some error 
function e(s, d)? 
We first formalize the notion of "essentially the same approximation power". 
Definition 1.1 Let e  Af 2 -- 7+ be a function. F and F2 are classes of activation 
functions. 
(a). We say that F simulates P with respect to e if and only if there is a constant 
k such that for all functions f  [-1, 1] " - 7 with Lipschitz-bound 1/e(s,d), 
if f can be approximated by a F-net with d layers, size s, Lipschitz- 
bound 2  and approximation error e(s, d), then f can also be ap- 
proximated with error e(s,d) by a Fl-net with k(d+ 1) layers, size 
(s + 1) k and Lipschitz-bound 2 sk. 
(b). We say that Pl and r2 are equivalent with respect to e if and only if P 
simulates F with respect to e and F simulates F with respect to e. 
The Power of Approximating: a Comparison of Activation Functions 617 
In other words, when comparing the approximation power of activation functions, 
we allow size to increase polynomially and the number of layers to increase by a 
constant factor, but we insist on at least the same approximation error. Observe 
that we have linked the approximation error e(s, d) and the Lipschitz-bound of the 
function to be approximated. The reason is that approximations of functions with 
high Lipschitz-bound "tend" to have an inversely proportional approximation error. 
Moreover observe that the Lipschitz-bounds of the involved nets are allowed to be 
exponential in the size of the net. We will see in section 3, that for some activation 
functions far smaller Lipschitz-bounds suffice. 
Below we discuss our results. In section 2 we consider the case of tight approx- 
imations, i.e. e(s, d) = 2 -s. Then in section 3 the more relaxed error model 
e(s, d) = s -a is discussed. In section 4 we consider the computation of boolean 
functions and show that sigmoidal nets can be far more efficient than threshold- 
nets. 
2 Equivalence of Activation Functions for Error e(s, d) = 2 -s 
We obtain the following result. 
Theorem 2.1 The following activation functions are equivalent with respect to er- 
ror e(s, d) -- 2 - . 
 the standard sigmoid a(x) =  
1 +exp(--x) ' 
 any rational function which is not a polynomial, 
 any root x , provided o is not a natural number, 
 the logarithm (for any base b > 1), 
 the gaussian e -2, 
 the radial basis functions (1 q- x") , a ( 1, a  0 
Notable exceptions from the list of functions equivalent to the standard sigmoid are 
polynomials, trigonometric polynomials and splines. We do obtain an equivalence 
to the standard sigmoid by allowing splines of degree s as activation functions for 
nets of size s. (We will always assume that splines are continuous with a single knot 
only.) 
Theorem 2.2 Assume that e(s,d)= 2 -s. Then splines (of degree s for nets of size 
s) and the standard sigmoid are equivalent with respect to e(s, d). 
Remark 2.1 
(a) Of course, the equivalence of spline-nets and {a}-nets also holds for binary 
input. Since threshold-nets can add and multiply m m-bit numbers with constantly 
many layers and size polynomial in m (Reif, 1987), threshold-nets can efficiently 
approzimate polynomials and splines. 
618 DasGupta and Schnitger 
Thus, we obtain that {r}-nets with d layers, size s and ipschitz-bound  can 
be simulated by nets of binary thresholds. The number of layers of the simulat- 
ing threshold-net will increase by a constant factor and its size will increase by a 
polynomial in (s -{- n)log(), where n is the number of input bits. (The inclusion 
of n accounts for the additional increase in size when approximately computing a 
weighted sum by a threshold-net.) 
(b) If we allow size to increase by a polynomial in s -t- n, then threshold-nets and 
(r)-nets are actually equivalent with respect to error bound 2 -. This follows, since 
a threshold function can easily be implemented by a sigmoidal gate (Maass et al., 
1991). 
Thus, if we allow size to increase polynomially (in s + n) and the number of layers 
to increase by a constant factor, then (r)-nets with weights that are at most expo- 
nential (in s + n) can be simulated by (r)-nets with wei9hts of size polynomial in 
S. 
{r}-nets and threshold-nets (respectively nets of linear thresholds) are not equiva- 
lent for analog input. The same applies to polynomials, even if we allow polynomials 
of degree s as activation function for nets of size s: 
Theorem 2.3 
(a) Let sq(x) = x 2. If a net of linear splines (with d layers and size s) approximates 
sq(x) over the interval [-1, 1], then its approximation error will be at least s -(a). 
(b) Let abs(x) =1 x 1. If a polynomial net with d layers and size s approximates 
abs(x) over the interval [-1, 1], then the approximation error will be at least s -(a). 
We will see in Theorem 2.5 that the standard sigmoid (and hence any activation 
function listed in Theorem 2.1) is capable of approximating sq(x) and abs(x) with 
error at most 2 -s by constant-layer nets of size polynomial in s. Hence the standard 
sigmoid is properly stronger than linear splines and polynomials. Finally, we show 
that sine and the standard sigmoid are inequivalent with respect to error 2 -s. 
Theorem 2.4 The function sine(Ax) can be approximated by a (r)-net C.4 with d 
layers, size s = A O/a) and error at most s (-a) On the other hand, every 
with d layers which approximates sine(Ax) with error at most , has to have size 
at least An(/a). 
Below we sketch the proof of Theorem 2.1. The proof itself will actually be more 
instructive than the statement of Theorem 2.1. In particular, we will obtain a 
general criterion that allows us to decide whether a given activation function (or 
class of activation functions) has at least the approximation power of splines. 
2.1 Activation Punctions with the Approximation Power of Splines 
Obviously, any activation function which can efficiently approximate polynomials 
and the binary threshold will be able to efficiently approximate splines. This follows 
since a spline can be approximated by the sum p -{- t  q with polynomials p and q 
The Power of Approximating: a Comparison of Activation Functions 619 
and a binary threshold t. (Observe that we can approximate a product once we can 
approximately square: (x + y)2/2- x2/2 - y2/2 = x . y.) 
Firstly, we will see that any sufficiently smooth activation function is capable of 
approximating polynomials. 
Definition 2.1 Let 7 ' 7 -- 7 be a function. We call 7 suitable if and only if 
there exists real numbers a,/3 (a > O) and an integer k such that 
(a) 7 can be represented by the power series i__oai(x -/3)i for all x e [-c,ct]. 
The coefficients are rationals of the form ai = ,, with I?il, IQI (for i > 1). 
(b) For each i > 2 there exists j with i _ j _ i k and aj  O. 
Proposition 2.1 Assume that 7 is suitable with parameter k. 
Then, over the domain [-D, D], any degree n polynomial p can be approximated 
with error  by a (7)-net Cp. Cp has 2 layers and size O(nk); its weights are 
rational numbers whose numerator and denominator are bounded in absolute value 
by 
1 
Here we have assumed that the coefficients of p are rational numbers with numerator 
and denominator bounded in absolute value by p,,. 
Thus, in order to have at least the approximation power of splines, a suitable activa- 
tion function has to be able to approximate the binary threshold. This is achieved 
by the following function class, 
Definition 2.2 
function. 
(a). 
Let F be a class of activation functions and let g  [1, c] - 7 be a 
We say that g is fast converging if and only if 
I g(x) - g(x + e) l= O(elx for x > 1, e _ O, 
0 < g(uO)du < o and I g(u)du l= O(1/N) for all N >_ 1. 
N 
(b). We say that F is powerful if and only if at least one function in F is suitable 
and there is a fast converging function g which can be approximated for all s  1 
(over the domain [-2 , 2s]) with error 2 - by a (F)-net with a constant number of 
layers, size polynomial in s and Lipschitz-bound 2 s. 
Fast convergence can be checked easily for differenttable functions by applying the 
mean value theorem. Examples are x - for a _> 1, exp(-x) and er(-x). Moreover, 
it is not difficult to show that each function mentioned in Theorem 2.1 is powerful. 
Hence Theorem 2.1 is a corollary of 
Theorem 2.5 Assume that F is powerful. 
(a) F simulates splines with respect to error e(s,d) = 2 - 
620 DasGupta and Schnitger 
(b) Assume that each activation function in F can be approximated (over the 
domain [-2 , 2]) with error 2 - by a spline-net N of size s and with constantly 
many layers. Then F is equivalent to splines. 
Remark 2.2 Obviously, 1/x is powerful. Therefore Theorem 2.5 implies that 
constant-layer (1/x)-nets of size s approximate abs(x) = Ixl with error 2 -. The 
degree of the resulting rational function will be polynomial in s. Thus Theorem 
.5 generalizes .Newman's approximation of the absolute value by rational functions. 
(Newman, 1964) 
3 Equivalence of Activation Functions for Error 
The lower bounds in the previous section suggest that the relaxed error bound 
e(s, d) = s -a is of importance. Indeed, it will turn out that many non-trivial smooth 
activation functions lead to nets that simulate (r)-nets, provided the number of 
input units is counted when determining the size of the net. (We will see in section 
4, that linear splines and the standard sigmoid are not equivalent if the number of 
inputs is not counted). The concept of threshold-property will be crucial for us. 
Definition 3.1 Let F be a collection of activation functions. We say that F has 
the threshold-property if there is a constant c such that the following two properties 
are satisfied for all m  1. 
(a) For each 7  F there is a threshold-net T.,, with c layers and size (s -t- rn) e 
which computes the binary representation of ?'(x) where ]?(x)- ?'(x)] _ 2 -". 
The input z of T.,, is given in binary and consists of 2m -t- 1 bits; m bits describe 
the integral part of x, m bits describe its fractional part and one bit indicates the 
sign. s -{- m specifies the required number of output bits, i.e. s = [1og2(sup(?(x ): 
-2 x 
(b) There is a F-net with c layers, size m e and Lipschitz bound 2 mc which approx- 
imates the binary threshold over D -- [-1, 1]- I-i/m, i/m] with error l/re. 
We can now state the main result of this section. 
Theorem 3.1 Assume that e(s, d) = s -a 
(a) Let F be a class of activation functions and assume that F has the threshold 
property. Then, (7 and F are equivalent with respect to e. Moreover, (er)-nets only 
require weights and thresholds of absolute value at most s. (Observe that F-nets are 
allowed to have weights as large as 2!) 
(b) If F and (7 are equivalent with respect to error 2 -s, then F and (7 are equivalent 
with respect to error s -a. 
(c) Additionally, the following classes are equivalent to (er)-nets with respect to e. 
(We assume throughout that all coefficients, weights and thresholds are bounded by 
2  for nets of size s). 
 polynomial nets (i.e. polynomials of degree s appear as activation function for 
nets of size s), 
The Power of Approximating: a Comparison of Activation Functions 621 
 (7)-nets, where ? is a suitable function and ? satisfies part (a) of Definition 3.1. 
(This includes the sine-function.) 
nets of linear splines 
The equivalence proof involves a first phase of extracting O(dlogs) bits from the 
analog input. In a second phase, a binary computation is mimicked. The extraction 
process can be carried out with error s - (over the domain [-1, 1]- I-i/s, i/s]) 
once the binary threshold is approximated. 
4 Computing boolean functions 
As we have seen in Remark 2.1, the binary threshold (respectively linear splines) 
gains considerable power when computing boolean functions as compared to approx- 
imating analog functions. But sigmoidal nets will be far more powerful when only 
the number of neurons is counted and the number of input units is disregarded. For 
instance, sigmoidal nets are far more efficient for "squaring", i.e when computing: 
Mn ={(x,y)' x {0,1},y{0,1} n and [x]2_>[y]} (where [z]=Zzi). 
i 
Theorem 4.1 A threshold-net computing Mn must have size at least Q(log n). But 
Mn can be computed by a a-net with constantly many gates. 
The previously best known separation of threshold-nets and sigmoidal-nets is due 
to Maass, Schnitger and Sontag (Maass et al., 1991). But their result only applies 
to threshold-nets with at most two layers; our result holds without any restriction 
on the number of layers. Theorem 4.1 can be generalized to separate threshold-nets 
and 3-times differenttable activation functions, but this smoothness requirement is 
more severe than the one assumed in (Maass et al., 1991). 
5 Conclusions 
Our results show that good approximation performance (for error 2 -s) hinges on 
two properties, namely efficient approximation of polynomials and efficient approx- 
imation of the binary threshold. These two properties are shared by a quite large 
class of activation functions; i.e. powerful functions. Since (non-polynomial) ratio- 
nal functions are powerful, we were able to generalize Newman's approximation of 
I x I by rational functions. 
On the other hand, for a good approximation performance relative to the relaxed 
error bound s -d it is already sufficient to efficiently approximate the binary thresh- 
old. Consequently, the class of equivalent activation functions grows considerably 
(but only if the number of input units is counted). The standard sigmoid is dis- 
tinguished in that its approximation performance scales with the error bound: if 
larger error is allowed, then smaller weights suffice. 
Moreover, the standard sigmoid is actually more powerful than the binary threshold 
even when computing boolean functions. In particular, the standard sigmoid is able 
to take advantage of its (non-trivial) smoothness to allow for more efficient nets. 
622 DasGupta and Schnitger 
Acknowledgements. We wish to thank R. Paturi, K. Y. Siu and V. P. Roy- 
chowdhury for helpful discussions. Special thanks go to W. Maass for suggesting 
this research, to E. Sontag for continued encouragement and very valuable advice 
and to J. Lambert for his never-ending patience 
The second author gratefully acknowledges partial support by NSF-CCR-9114545. 
References 
Arai, W. (1989), Mapping abilities of three-layer networks, in "Proc. of the Inter- 
national Joint Conference on Neural Networks", pp. 419-423. 
Carrol, S. M., and Dickinson, B. W. (1989), Construction of neural nets using 
the Radon Transform,in "Proc. of the International Joint Conference on Neural 
Networks", pp. 607-611. 
Cybenko, G. (1989), Approximation by superposition of a sigmoidal function, Math- 
ematics of Control, Signals, and System, 2, pp. 303-314. 
Funahashi, K. (1989), On the approximate realization of continuous mappings by 
neural networks, Neural Networks, 2, pp. 183-192. 
Gallant, A. R., and White, H. (1988), There exists a neural network that does not 
make avoidable mistakes, in "Proc. of the International Joint Conference on Neural 
Networks", pp. 657-664. 
Hornik, K., Stinchcombe, M., and White, H. (1989), Multilayer Feedforward Net- 
works are Universal Approximators, Neural Networks, 2, pp. 359-366. 
Irie, B., and Miyake, S. (1988), Capabilities of the three-layered percepttons, in 
"Proc. of the International Joint Conference on Neural Networks", pp. 641-648. 
Lapades, A., and Farbar, R. (1987), How neural nets work, in "Advances in Neural 
Information Processing Systems", pp. 442-456. 
Maass, W., Schnitger, G., and Sontag, E. (1991), On the computational power of 
sigmoid versus boolean threshold circuits, in "Proc. of the 32nd Annual Symp. on 
Foundations of Computer Science", pp. 767-776. 
Newman, D. J. (1964), Rational approximation to I xl, Michigan Math. Journal, 
11, pp. 11-14. 
Hecht-Nielson, R. (1989), Theory of backpropagation neural networks, in "Proc. of 
the International Joint Conference on Neural Networks", pp. 593-611. 
Poggio, T., and Girosi, F. (1989), A theory of networks for Approximation and 
learning, Artificial Intelligence Memorandum, no 1140. 
Reif, J. H. (1987), On threshold circuits and polynomial computation, in "Proceed- 
ings of the 2nd Annual Structure in Complexity theory", pp. 118-123. 
Wet, Z., Yinglin, Y., and Qing, J. (1991), Approximation property of multi-layer 
neural networks ( MLNN ) and its application in nonlinear simulation, in "Proc. of 
the International Joint Conference on Neural Networks", pp. 171-176. 
