Radial Basis Function Networks and Complexity 
Regularization in Function Learning 
Adam Krzyak 
Department of Computer Science 
Concordia University 
Montreal, Canada 
krzy zak @ cs.concordia.ca 
Tamils Linder 
Dept. of Math. & Comp. Sci. 
Technical University of Budapest 
Budapest, Hungary 
linder@inf.bme.hu 
Abstract 
In this paper we apply the method of complexity regularization to de- 
rive estimation bounds for nonlinear function estimation using a single 
hidden layer radial basis function network. Our approach differs from 
the previous complexity regularization neural network function learning 
schemes in that we operate with random covering numbers and l metric 
entropy, making it possible to consider much broader families of activa- 
tion functions, namely functions of bounded variation. Some constraints 
previously imposed on the network parameters are also eliminated this 
way. The network is trained by means of complexity regularization in- 
volving empirical risk minimization. Bounds on the expected risk in 
terms of the sample size are obtained for a large class of loss functions. 
Rates of convergence to the optimal loss are also derived. 
1 INTRODUCTION 
Artificial neural networks have been found effective in learning input-output mappings from 
noisy examples. In this learning problem an unknown target function is to be inferred from a 
set of independent observations drawn according to some unknown probability distribution 
from the input-output space IR d x IR. Using this data set the learner tries to determine a 
function which fits the data in the sense of minimizing some given empirical loss function. 
The target function may or may not be in the class of functions which are realizable by the 
learner. In the case when the class of realizable functions consists of some class of artificial 
neural networks, the above problem has been extensively studied from different viewpoints. 
In recent years a special class of artificial neural networks, the radial basis function (RBF) 
networks have received considerable attention. RBF networks have been shown to be 
the solution of the regularization problem in function estimation with certain standard 
smoothness functionals used as stabilizers (see [5], and the references therein). Universal 
198 A. Krzyak and T. Linder 
convergence of RBF nets in function estimation and classification has been proven by 
Krzyak et al. [6]. Convergence rates of RBF approximation schemes have been shown to 
be comparable with those for sigmoidal nets by Girosi and Anzellotti [4]. In a recent paper 
Niyogi and Girosi [9] studied the tradeoff between approximation and estimation errors and 
provided an extensive review of the problem. 
In this paper we consider one hidden layer RBF networks. We look at the problem of 
choosing the size of the hidden layer as a function of the available training data by means 
of complexity regularization. Complexity regularization approach has been applied to 
model selection by Barron [ 1 ], [2] resulting in near optimal choice of sigmoidal network 
parameters. Our approach here differs from Barron's in that we are using l metric en- 
tropy instead of the supremum norm. This allows us to consider a more general class 
of activation function, namely the functions of bounded variation, rather than a restricted 
class of activation functions satisfying a Lipschitz condition. For example, activations with 
jump discontinuities are allowed. In our complexity regularization approach we are able 
to choose the network parameters more freely, and no discretization of these parameters is 
required. For RBF regression estimation with squared error loss, we considerably improve 
the convergence rate result obtained by Niyogi and Girosi [9]. 
In Section 2 the problem is formulated and two results on the estimation error of complexity 
regularized RBF nets are presented: one for general loss functions (Theorem 1) and a 
sharpened version of the first one for the squared loss (Theorem 2). Approximation bounds 
are combined with the obtained estimation results in Section 3 yielding convergence rates 
for function learning with RBF nets. 
2 PROBLEM FORMULATION 
The task is to predict the value of a real random variable Y upon the observation of an IR d 
valued random vector X. The accuracy of the predictor f  IR d  R is measured by the 
expected risk 
J(f) -- EL(f(X), Y), 
where L  IR x IR -- IR + is a nonnegative loss function. It will be assumed that there exists 
a minimizing predictor f* such that 
J(f*) = infJ(f). 
A good predictor f is to be determined based on the data (X, Yl),  ., (X, Y) which 
are i.i.d. copies of (X, Y). The goal is to make the expected risk EJ(f) as small as 
possible, while fr is chosen from among a given class Y of candidate functions. 
In this paper the set of candidate functions Y will be the set of single-layer feedforward 
neural networks with radial basis function activation units and we let  = tJk=k, where 
k is the family of networks with k hidden nodes whose weight parameters satisfy certain 
constraints. In particular, for radial basis functions characterized by a kernel K: IR +  IR, 
 is the family of networks 
k 
f(x): (Ix - - cd) + wo, 
i----I 
where w0, w..., w are real numbers called weights, c,..., c C IR , Ai are nonnegative 
definite d x d matrices, and x t denotes the transpose of the column vector x. 
The complexity regularization principle for the learning problem was introduced by Vapnik 
[10] and fully developed by Barron [1], [2] (see also Lugosi and Zeger [8]). It enables 
the learning algorithm to choose the candidate class Y automatically, from which is picks 
Radial Basis Function Networks and Complexity Regularization 199 
the estimate function by minimizing the empirical error over the training data. Complexity 
regularization penalizes the large candidate classes, which are bound to have small approx- 
imation error, in favor of the smaller ones, thus balancing the estimation and approximation 
errors. 
Let .T be a subset of a space A' of real functions over some set, and let p be a pseudometric 
on A'. For e > 0 the covering number N(e, Y, p) is defined to be the minimal number of 
closed e balls whose union cover .T. In other words, N(e, .T, p) is the least integer such 
that there exist f,..., fN with N = N(e, .T, p) satisfying 
sup min p( f , f i ) _ e. 
fEr l<i<N 
In our case,  is a family of real functions on IR ', and for any two functions f and g, p is 
given by 
1 lf(zi) - g(zi)l, 
p(f, g) = 
i----1 
where z,..., z are n given points in IR ' . In this case we will use the notation N (e, .T, p) - 
N(e, .T, z), emphasizing the dependence of the metric p on z = (Zl,..., z). Let us 
define the families of functions 7-/k, k = 1,2,... by 
7-gk = {L(f(-), .)-f G .T'). 
Thus each member of 7-/ maps IR d+ 1 into R. It will be assumed that for each k we are given 
a finite, almost sure uniform upper bound on the random covering numbers N(e, 7-/, Z), 
where Z = ((X, Y),..., (X,, Y,)). We may assume without loss of generality that 
N(e, 7-/) is monotone decreasing in . Finally, assume that L(.f(X), Y) is uniformly 
almost surely bounded by a constant B, i.e., 
P{L(f(X),Y)_<B)= 1, fG.Tk, k= 1,2,... 
The complexity penalty of the kth class for n training samples is a nonnegative number A  
satisfying 
A >_ /128B 2 log N(A/8, 7-/) + ck (2) 
where the nonnegative constants c satisfy -= e -ok _< 1. Note that since N(e, 7-[) is 
nonincreasing in e, it is possible to choose such A  for all k and n. The resulting complexity 
penalty optimizes the upper bound on the estimation error in the proof of Theorem 1 below. 
We can now define our estimate. Let 
n 
fen = argmin J,(f) = argmin 1  L(f(Xi),Yi), 
that is, fk, minimizes over .T the empirical risk for n training samples. The penalized 
empirical risk is defined for each f G .T as 
',(f) = Jr(f) + 
The estimate f is then defined as the f, minimizing the penalized empirical risk over all 
classes: 
f = argmin &(fk). (3) 
fn:k_> l 
We have the following theorem for the expected estimation error of the above complexity 
regularization scheme. 
200 A. Krzyak and T. Linder 
Theorem 1 For any n and k the complexity regularization estimate (3) satisfies 
EJ(f,)- J(f*) < rain (/k, + inf J(f)- J(f*)) 
-- k>_l fE.T  
where 
tgkn = rain (u + 9Be-nu2/(S12B2)) . 
u _ 4A: ,, 
Assuming without loss of generality that log N(e, 7-/) _> 1, it is easy to see that the choice 
A, = 128B 21gN(B/x/'7t) + c (4) 
n 
satisfies (2). 
2.1 SQUARED ERROR LOSS 
For the special case when 
L(x, y) = (x - y)2 
we can obtain a better upper bound. The estimate will be the same as before, but instead of 
(2), the complexity penalty Ak, now has to satisfy 
A, _> C11gN(A'/C2'y) + c (5) 
whereCt = 3499C 4, C2 = 256C 3, andC = max{B, 1}. Here N(e, .Tk)is a uniform upper 
bound on the random ll covering numbers N(e, .T, X). Assume that the class .T - 
is convex, and let  be the closure of .T in L2(/.t), where tt denotes the distribution of X. 
Then there is a unique f e  whose squared loss J(f) achieves inffe J(f). We have the 
following bound on the difference EJ (f,) - J(f). 
Theorem 2 Assume that Y = UYk is a convex set of functions, and consider the squared 
error loss. Suppose that If(x)l <_ B for all z c IR d and f  , and P(IYI > B) - o. 
Then complexity regularization estimate with complexity penalty satisfying (5) gives 
EJ(f)-J(f)<2min A+ inf J(f)-J(f) +2n. 
The proof of this result uses an idea of Barron [1 ] and a Bernstein-type uniform probability 
inequality recently obtained by Lee et al. [7]. 
3 RBF NETWORKS 
We will consider radial basis function (RBF) networks with one hidden layer. Such a 
network is characterized by a kernel K  IR +  IR. An RBF net of k nodes is of the form 
k 
f(go) -- E WiI ([2 -- iltAi[go - i1) - wo, (6) 
i=1 
where w0, wl,..., w are real numbers called weights, el,..., c  IR a, and the A are 
nonnegative definite d x d matrices. The kth candidate class rk for the function estimation 
task is defined as the class of networks with k nodes which satisfy the weight condition 
Y,Lo lwi[ -< b for a fixed b > 0: 
- - - ca]) + w0. Iwal < b . 
i=l i=0 
Radial Basis Function Networks and Complexity Regularization 201 
Let L(x, y) = I x - yl p, and 
J(f) = Elf(X) - YI p, (s) 
where 1 <_ p < oo. Let/ denote the probability measure induced by X. Define Y to be the 
closure in L p (/) of the convex hull of the functions bK([x - c]tA[x - c]) and the constant 
function h(x) = 1, x C IR d, where _< b, c c and A varies over all nonnegative d x d 
matrices. That is, Y is the closure of  = t-Jkk, where k is given in (7). Let g G  be 
arbitrary. If we assume that I K I is uniformly bounded, then by Corollary 1 of Darken et al. 
[3], we have for 1 _< p <_ 2 that 
inf Ill - gllLp(): O(1/x/), (9) 
f 
where Ill - gllLpc) denotes the LP(i) norm (f If - f*lPd)x/p, and Y is given in (7). 
The approximation error inffE:  J(f) - J(f*) can be dealt with using this result if the 
optimal f* happens to be in W. In this case, we obtain 
inf J(f)- J(f*) - O(1/x/) 
for all 1 < p _< 2. Values of p close to 1 are of great importance for robust neural network 
regression. 
When the kernel K has a bounded total variation, it  be shown that N(c, 7-/k) _< 
(A/c)A2, where the constants A, A2 depend on sups IK (z)l, the total variation V of K, 
the dimension d, and on the the constant b in the definition (7) of fk. Then, if 1 <_ p <_ 2, 
the following consequence of Theorem 1 can be proved for L p regression estimation. 
Theorem 3 Let the kernel K be of bounded variation and assume that IYI is bounded. 
Then for 1 _< p _< 2 the error (8) of the complexity regularized estimate satisfies 
EJ(f)- J(f*) 
_ ). 
For p - 1, i.e., for L  regression estimation, this rate is known to be optimal within the 
logarithmic factor. 
For squared error loss J(f)-- E(f(X)- y)2 we have f* (z)= E(YIX = z). If f* E 3, 
then by (9) we obtain 
inf J(f) - J(f*) = O(1/k). (10) 
f 
It is easy to check that the class Ok is convex if the  are the collections of RBF nets 
defined in (7). The next result shows that we can get rid of the square root in Theorem 3. 
Theorem 4 Assume that K is of bounded variation. 
bounded random variable, and let L( z, y) = ( z - y)2. 
RBF squared regression estimate satisfies 
Suppose furthermore that IYl is a 
Then the complexity regularization 
Ed(f) infJ(f)<2min(inf J(f)infJ(f)+o(klgn)) (n 1--) 
- - +0 . 
202 A. Krzyak and T. Linder 
If f*  .T, this result and (10) give 
EJ(f)- J(f*) 
< min[O(klgn) (-)] 
+o 
- > n 
: o 
(11) 
This result sharpens and extends Theorem 3.1 of Niyogi and Girosi [9] where the weaker 
O () + O () convergence rate was obtained (in a PAC-like formulation) for the 
squared loss of Gaussian RBF network regression estimation. The rate in (11) varies linearly 
with dimension. Our result is valid for a very large class of RBF schemes, including the 
Gaussian RBF networks considered in [9]. Besides having improved on the convergence 
rate, our result has the advantage of allowing kernels which are not continuous, such as the 
window kernel. 
The above convergence rate results hold in the case when there exists an f* minimizing the 
risk which is a member of the L p () closure of .T -- U.Tk, where .Tk is given in (7). In 
other words, f* should be such that for all e > 0 there exists a k and a member f of 
with Ill - f* IILp() < e. The precise characterization of.T seems to be difficult. However, 
based on the work of Girosi and Anzellotti [4] we can describe a large class of functions 
that is contained in 3. 
Let H(z, t) be a real and bounded function of two variables z  IRa and t  IR ' . Suppose 
that , is a signed measure on IR ' with finite total variation I[ll. If g(z) is defined as 
g(x) =/R H(x,t)(dt), 
then g  LP(l_t) for any probability measure/.t on IR a. One can reasonably expect that g 
can be approximated well by functions f(x) of the form 
k 
f(x) : t,), 
i----1 
where t,,...,t e IR and =x Iwal _( II;ll. The case m ---- dand H(x,t)-- G(x- t)is 
investigated in [4], where a detailed description of function spaces arising from the different 
choices of the basis function G is given. Niyogi and Girosi [9] extends this approach to 
approximation by convex combinations of translates and dilates of a Gaussian function. In 
general, we can prove the following 
Lemma 1 Let 
g(x) = fR H(x,t)A(dt), (12) 
where H(x, t) and , are as above. Define for each k >_ 1 the class offunctions 
i=I i'-0 
Then for any probability measure i_t on IRa and for any 1 _< p < c, the function g can be 
approximated in LP(l_t) arbitrarily closely by members of  = U, i.e., 
inf Ill- gllLp()  o as k - oo. 
Radial Basis Function Networks and Complexity Regularization 203 
To prove this lemma one need only slightly adapt the proof of Theorem 8.2 in [4], or in a 
more elementary way following the lines of the probabilistic proof of Theorem 1 of [6]. To 
apply the lemma for RBF networks considered in this paper, let n = d 2 + d, t = (A, c), 
and H(z,t): K ([z - c]tA[z - c]). Then we obtain that  contains all the functions # 
with the integral representation 
- - 
= 
for which IIXll b, where b is the constraint on the weights as in (7). 
Acknowledgements 
This work was supported in part by NSERC grant OGP000270, Canadian National Networks 
of Centers of Excellence grant 293 and OTKA grant F014174. 
References 
[10] 
[ 1 ] A.R. Barron. Complexity regularization with application to artificial neural networks. 
In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 
561-576. NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991. 
[2] A. R. Barron. Approximation and estimation bounds for artificial neural networks. 
Machine Learning, 14:115-133, 1994. 
[3] C. Darken, M. Donahue, L. Gurvits, and E. Sontag. Rate of approximation results 
motivated by robust neural network learning. In Proc. Sixth Annual Workshop on 
Computational Learning Theory, pages 303-309. Morgan Kauffman, 1993. 
[4] F. Girosi and G. Anzellotti. Rates of convergence for radial basis functions and neural 
networks. In R. J. Mammone, editor, Artificial Neural Networks for Speech and Vision, 
pages 97-113. Chapman & Hall, London, 1993. 
[5] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network archi- 
tectures. Neural Computation, 7:219-267, 1995. 
[6] A. KrzyZak, T. Linder, and G. Lugosi. Nonparametric estimation and classification 
using radial basis function nets and empirical risk minimization. IEEE Transactions 
on Neural Networks, 7(2):475-487, March 1996. 
[7] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural 
networks with bounded fan-in. to be published in IEEE Transactions on Information 
Theory, 1995. 
[8] G. Lugosi and K. Zeger. Concept learning using complexity regularization. IEEE 
Transactions on Information Theory, 42:48-54, 1996. 
[9] P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis 
complexity, and sample complexity for radial basis functions. Neural Computation, 
8:819-842, 1996. 
V. N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, 
New York, 1982. 
