482 Saha and Keeler 
Algorithms for Better Representation and 
Faster Learning in Radial 
Basis Function Networks 
Avijit Saha 1 
James D. Keeler 
Microelectronics and Computer Technology Corporation
3500 West Balcones Center Drive
Austin, TX 78759
ABSTRACT 
In this paper we present upper bounds for the learning rates for 
hybrid models that employ a combination of both self-organized 
and supervised learning, using radial basis functions to build 
receptive field representations in the hidden units. The learning 
performance in such networks with nearest neighbor heuristic can 
be improved upon by multiplying the individual receptive field 
widths by a suitable overlap factor. We present results indicating
optimal values for such overlap factors. We also present a new 
algorithm for determining receptive field centers. This method 
negotiates more hidden units in the regions of the input space as a 
function of the output and is conducive to better learning when the 
number of patterns (hidden units) is small. 
1 INTRODUCTION 
Functional approximation of experimental data originating from a continuous 
dynamical process is an important problem. Data is usually available in the form of 
a set S consisting of {x,y} pairs, where x is an input vector and y is the corresponding
output vector. In particular, we consider networks with a single layer of hidden
units in which the jth output unit computes $y_j = \sum_\alpha f_\alpha R_\alpha\{x, x_\alpha, \sigma_\alpha\}$, where $y_j$ is the
1 
University of Texas at Austin, Dept. of ECE, Austin TX 78712 
Algorithms for Better Representation 483 
network output due to input $x$, $f_\alpha$ is the synaptic weight associated with the $\alpha$th
hidden neuron and the jth output unit; $R_\alpha\{x, x_\alpha, \sigma_\alpha\}$ is the Radial Basis Function
(RBF) response of the $\alpha$th hidden neuron. This technique of using a superposition of
RBF for the purposes of approximation has been considered before by [Medgassy 
'58] and more recently by [Moody '88], [Casdagli '89] and [Poggio '89]. RBF 
networks are particularly attractive since such networks are potentially 1000 times 
faster than the ubiquitous backpropagation network for comparable error rates 
[Moody '88]. 
The essence of the network model we consider is described in [Moody '88]. A typical
network that implements a receptive field response consists of a layer of linear input
units, a layer of linear output units and an intermediate (hidden) layer of nonlinear
response units. Weights are associated with only the links connecting the
hidden layer to the output layer. For the single output case the real-valued functional
mapping $f: R^n \to R$ is characterized by the following equations:
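$O(x_i) = \sum_\alpha f_\alpha R_\alpha(x_i)$ (1)
$O(x_i) = \sum_\alpha f_\alpha R_\alpha(x_i) / \sum_\alpha R_\alpha(x_i)$ (2)
$R_\alpha(x_i) = \exp\left(-\left(\|x_\alpha - x_i\| / \sigma_\alpha\right)^2\right)$ (3)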
where $x_\alpha$ is a real-valued vector associated with the $\alpha$th receptive field (hidden)
unit and is of the same dimension as the input. The output can be normalized by the
sum of the responses of the hidden units due to any input, and the expression for the
output using the normalized response function is presented in Equation 2. The $x_\alpha$ values
are the centers of the receptive field units and the $\sigma_\alpha$ are their widths. Training in
such networks can be performed in a two-stage hybrid combination of independent
processes. In the first stage, a clustering of the input data is performed. The objective
of this clustering algorithm is to establish appropriate $x_\alpha$ values for each of
the receptive field units such that the cluster points represent the input distribution
in the best possible manner. We use competitive learning with the nearest neighbor
heuristic as our clustering algorithm (Equation 5). The degree or quality of clustering
achieved is quantified by the sum-square measure in Equation 4, which is the objective
function we are trying to minimize in the clustering phase.
$\text{TSS-KMEANS} = \sum_i (x_{\alpha,closest} - x_i)^2$ (4)
$x_{\alpha,closest} = x_{\alpha,closest} + \lambda (x_i - x_{\alpha,closest})$ (5)
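The clustering stage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the clustering rate `rate`, the number of passes `epochs`, the random presentation order, and the initialization from the first `n_centers` inputs are choices made for the example, not details given for this stage in the text.

```python
import random

def nearest_index(centers, x):
    """Index of the center closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k]))

def competitive_clustering(inputs, n_centers, rate=0.1, epochs=20, seed=0):
    """Move only the winning (closest) center toward each input, as in Equation 5."""
    rng = random.Random(seed)
    # Assumption: start from the first n_centers inputs.
    centers = [list(x) for x in inputs[:n_centers]]
    for _ in range(epochs):
        order = list(range(len(inputs)))
        rng.shuffle(order)  # assumption: random presentation order
        for i in order:
            x = inputs[i]
            k = nearest_index(centers, x)
            centers[k] = [c + rate * (xi - c) for c, xi in zip(centers[k], x)]
    return centers

def tss_kmeans(inputs, centers):
    """Sum-square clustering measure of Equation 4."""
    total = 0.0
    for x in inputs:
        k = nearest_index(centers, x)
        total += sum((c - xi) ** 2 for c, xi in zip(centers[k], x))
    return total
```

Comparing `tss_kmeans` before and after `competitive_clustering` gives a quick check that the centers have moved toward the data.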
After suitable cluster points ($x_\alpha$ values) are determined, the next step is to determine
the $\sigma_\alpha$ or widths for each of the receptive fields. Once again we use the nearest
neighbor heuristic, where $\sigma_\alpha$ (the width of the $\alpha$th neuron) is set equal to the Euclidean
distance between $x_\alpha$ and its nearest neighbor. Once the receptive field centers
$x_\alpha$ and the widths ($\sigma_\alpha$) are found, the receptive field responses can be calculated
for any input using Equation 3. Finally, the $f_\alpha$ values or weights on links connecting
the hidden layer units to the output are determined using the well-known gradient
descent learning rule. Pseudo-inverse methods are usually impractical in these
problems. The rules for the objective function and weight update are given by Equations
6 and 7.
E = i (O(xi) - ti )2 (6) 
for = for + '1 {(O(x i) - t i) }R0t(x i) 
(7) 
where, i is the number of input patterns, 
output for the ith pattern. 
x i is the input vector and t i is the target 
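The supervised stage (Equations 1 to 3, 6 and 7) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the network has a single output, updates are made per pattern with the unnormalized response, and the error is taken as target minus output so that each step reduces E.

```python
import math

def responses(x, centers, widths):
    """Gaussian receptive-field responses R_alpha(x) of Equation 3."""
    out = []
    for c, s in zip(centers, widths):
        dist = math.sqrt(sum((ci - xi) ** 2 for ci, xi in zip(c, x)))
        out.append(math.exp(-((dist / s) ** 2)))
    return out

def output(x, centers, widths, weights, normalized=False):
    """Network output: Equation 1, or Equation 2 when normalized=True."""
    r = responses(x, centers, widths)
    total = sum(f * ra for f, ra in zip(weights, r))
    # Note: sum(r) can underflow to 0 for inputs far from every center.
    return total / sum(r) if normalized else total

def train_epoch(patterns, centers, widths, weights, rate):
    """One pass of per-pattern gradient descent; updates weights in place.

    Returns the sum of squared errors accumulated during the pass.
    """
    tss = 0.0
    for x, t in patterns:
        r = responses(x, centers, widths)
        err = t - sum(f * ra for f, ra in zip(weights, r))
        tss += err ** 2
        for a in range(len(weights)):
            weights[a] += rate * err * r[a]
    return tss
```

Calling `train_epoch` repeatedly with a small `rate` should drive the returned error down on a consistent training set; how large `rate` may safely be is the subject of the next section.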
2 LEARNING RATES 
In this section we present an adaptive formulation for the network learning rate $\eta$
(Equation 7). Learning rates ($\eta$) in such networks that use gradient descent are
usually chosen in an ad hoc fashion. A conservative value for $\eta$ is usually sufficient.
However, there are two problems with such an approach. If the learning rate is not 
small enough the TSS (Total Sums of Squares) measure can diverge to high values 
instead of decreasing. A very conservative estimate on the other hand will work 
with almost all sets of data but will unnecessarily slow down the learning process. 
The choice of learning rate is crucial, since for real-time or hardware 
implementations of such systems there is very little scope for interactive 
monitoring. 
This problem is addressed by Theorem 1. We present the proof for this theorem
for the special case of a single output. In the gradient descent algorithm, weight
updates can be performed after each presentation of the entire set of patterns (per
epoch basis) or after each pattern (per pattern basis); both cases are considered.
Equation p.3 gives the upper bound for $\eta$ when updates are done on a per epoch
basis. Only positive values of $\eta$ should be considered. Equations p.4 and p.5 give
the bounds for $\eta$ when updates are done on a per pattern basis without and with
normalized response function respectively. We present some simulation results for
the logistic map { x(t+1) = r x(t) [ 1 - x(t) ] } data in Figure 1. The plots are
shown only for the normalized response case, and the learning rate was set to
$\eta = \mu \{ (\sum_\alpha R_\alpha)^2 / \sum_\alpha (R_\alpha)^2 \}$. We used a fixed number of 20 hidden units, and r was set
to 4.0. The network TSS did not diverge until $\mu$ was set arbitrarily close to 1.
This is shown in Figure 1, which indicates that, with the normalized response
function, if the sum of squares of the hidden unit responses is nearly equal to
the square of the sum of the responses, then a high effective learning rate ($\eta$) can
be used.
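As a small illustration of the per-pattern bounds discussed above, the following sketch computes them from a list of hidden unit responses for one input. The function names and the argument `mu` (the fraction of the maximum allowable rate) are labels chosen for this example.

```python
def per_pattern_bound(r):
    """Upper bound 1 / sum_alpha R_alpha^2 for per-pattern updates."""
    return 1.0 / sum(ra ** 2 for ra in r)

def normalized_bound(r):
    """Upper bound (sum_alpha R_alpha)^2 / sum_alpha R_alpha^2 with normalized responses."""
    return sum(r) ** 2 / sum(ra ** 2 for ra in r)

def learning_rate(r, mu):
    """Learning rate set to a fraction mu of the normalized-response bound."""
    return mu * normalized_bound(r)
```

For example, when a single unit responds (`r = [1.0, 0.0, 0.0]`) the normalized bound is 1, and when four units respond equally (`r = [0.25] * 4`) it is 4, so the bound depends on how the response is spread over the hidden units for that input.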
Theorem 1: The TSS measure of a network will be decreasing in time, provided the
learning rate $\eta$ does not exceed $\sum_i e_i \sum_\alpha E_\alpha R_{\alpha i} / \sum_i (\sum_\alpha E_\alpha R_{\alpha i})^2$ if the
network is trained on a per epoch basis, and $1 / \sum_\alpha (R_\alpha)^2$ when updates are done
on a per pattern basis. With normalized response function, the upper bound for the
learning rate is $(\sum_\alpha R_\alpha)^2 / \sum_\alpha (R_\alpha)^2$. Note the similar result of [Widrow 1985].
Proof: $TSS(t) = \sum_{i=1}^{N} (t_i - \sum_{\alpha=1}^{K} f_\alpha R_{\alpha i})^2$ (p.1)
where N is the number of exemplars, K is the number of receptive fields, and $t_i$
is the ith target output.
$\Delta f_\alpha = 2 \eta E_\alpha$, where $E_\alpha = \sum_i e_i R_{\alpha i}$ (p.2)
Here $e_i = t_i - \sum_\alpha f_\alpha R_{\alpha i}$ is the error on the ith exemplar.
For stability, we impose the condition TSS(t) - TSS(t+1) > 0. From
Eqns (p.1) and (p.2) above, and substituting $2 \eta E_\alpha$ for $\Delta f_\alpha$, we have:
$TSS(t) - TSS(t+1) = \sum_i e_i^2 - \sum_i (t_i - \sum_\alpha f_\alpha R_{\alpha i} - \sum_\alpha 2 \eta E_\alpha R_{\alpha i})^2 > 0$
Expanding the RHS of the above expression and substituting $e_i$ appropriately:
$TSS(t) - TSS(t+1) = 4 \eta \sum_i e_i \sum_\alpha E_\alpha R_{\alpha i} - 4 \eta^2 \sum_i (\sum_\alpha E_\alpha R_{\alpha i})^2 > 0$
From the above inequality it follows that for stability in per epoch basis
training, the upper bound for the learning rate $\eta$ is given by:
$\eta \le \sum_i e_i \sum_\alpha E_\alpha R_{\alpha i} / \sum_i (\sum_\alpha E_\alpha R_{\alpha i})^2$ (p.3)
If updates are done on a per pattern basis, then N = 1 and we drop the summation
over N and the index i and we obtain the following bound:
$\eta < 1 / \sum_\alpha (R_\alpha)^2$. (p.4)
With normalized response function the upper bound for the learning rate is:
$\eta < (\sum_\alpha R_\alpha)^2 / \sum_\alpha (R_\alpha)^2$. (p.5)
Q.E.D.
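The per epoch bound (p.3) can be checked numerically on a toy problem. The sketch below is an illustration only: the function name, the response matrix layout `R[i][a]` (response of unit a to exemplar i), and the argument `fraction` are assumptions made for the example. It takes one epoch step of size $2 \eta E_\alpha$ with $\eta$ set to a chosen fraction of the bound and reports TSS before and after.

```python
def epoch_step(R, targets, weights, fraction):
    """One per-epoch update with eta = fraction * bound from (p.3).

    R[i][a] is the response of hidden unit a to exemplar i.
    Returns (tss_before, tss_after, bound).
    Assumes the errors are not all zero, otherwise the bound is undefined.
    """
    n, k = len(R), len(weights)
    e = [targets[i] - sum(weights[a] * R[i][a] for a in range(k)) for i in range(n)]
    E = [sum(e[i] * R[i][a] for i in range(n)) for a in range(k)]
    g = [sum(E[a] * R[i][a] for a in range(k)) for i in range(n)]
    bound = sum(e[i] * g[i] for i in range(n)) / sum(gi ** 2 for gi in g)
    eta = fraction * bound
    new_w = [weights[a] + 2.0 * eta * E[a] for a in range(k)]
    e_new = [targets[i] - sum(new_w[a] * R[i][a] for a in range(k)) for i in range(n)]
    return sum(ei ** 2 for ei in e), sum(ei ** 2 for ei in e_new), bound
```

With `fraction` below 1 the returned TSS decreases, and with `fraction` above 1 it increases, which is what the inequality preceding (p.3) predicts.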
[Figure 1: Normalized error vs. fraction ($\mu$) of maximum allowable learning rate.]
3 EFFECT OF WIDTH ($\sigma$) ON APPROXIMATION ERROR
In the nearest-neighbor heuristic, the $\sigma$ value of each hidden unit is set equal to the
Euclidean distance between its center and the center of its nearest neighbor. This
method is preferred mainly because it is computationally inexpensive. However, the
performance can be improved by increasing the overlap between nearby hidden unit
responses. This is done by multiplying the widths obtained with the nearest
neighbor heuristic by an overlap factor m as shown in Equation 3.1.
[Figure 2: Normalized errors vs. overlap factor (m) for the logistic map (r = 4.0); curves for 10 and 20 hidden units.]
cot = m. II xo- xct.nearestll 
(3.1) 
and II. II is the euclidJan distance norm. 
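Equation 3.1 translates directly into code. The sketch below is illustrative: the function name is hypothetical, at least two centers are assumed, and duplicate centers would produce a zero width that the caller would need to guard against.

```python
import math

def widths_with_overlap(centers, m=1.0):
    """Width of each unit: m times the distance to its nearest neighboring center.

    m = 1.0 gives the plain nearest neighbor heuristic.
    """
    widths = []
    for a, ca in enumerate(centers):
        nearest = min(
            math.dist(ca, cb) for b, cb in enumerate(centers) if b != a
        )
        widths.append(m * nearest)
    return widths
```

For centers at 0, 1 and 3 on a line, `m = 1.0` gives widths 1, 1 and 2, and `m = 2.0` doubles each of them.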
In Figures 2 and 3 we show the network performance ( normalized error ) as a 
function of m. In the logistic map case a value of r = 4.0 was used, predicting 1 
timestep into the future; training set size was 10 times the number of hidden units 
and test set size was 70 patterns. The results for the Mackey-Glass data are with 
parameter values a = 0.1, b = 0.2, A = 6, D = 4. The number of training patterns was
10 times the number of hidden units and the normalized error was evaluated based 
on the presentation of 900 unseen patterns. For the Mackey-Glass data the optimal 
values were rather well-defined; whereas for the logistic map case we found that the 
optimal values were spread out over a range. 
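For readers who want to reproduce the logistic map setup, a data generator along the following lines may be useful. It is a sketch under assumptions: the function names, the starting value `x0 = 0.3`, and the choice to take the test patterns from the part of the series that follows the training patterns are not specified in the text.

```python
def logistic_series(n, r=4.0, x0=0.3):
    """First n values of x(t+1) = r x(t) [1 - x(t)], starting from x0."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def one_step_pairs(series):
    """(input, target) pairs for predicting one timestep into the future."""
    return [([series[t]], series[t + 1]) for t in range(len(series) - 1)]

def make_sets(n_hidden, n_test=70, r=4.0, x0=0.3):
    """Training set of 10 patterns per hidden unit, followed by n_test test patterns."""
    n_train = 10 * n_hidden
    series = logistic_series(n_train + n_test + 1, r, x0)
    pairs = one_step_pairs(series)
    return pairs[:n_train], pairs[n_train:]
```

With 20 hidden units, `make_sets(20)` returns 200 training pairs and 70 test pairs.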
[Figure 2: Normalized errors vs. overlap factor (m) for varying number of hidden units (50, 100, 250, 500 and 900 units), Mackey-Glass data.]
4 EXTENDED METRIC CLUSTERING 
In this method clustering is done in higher dimensions. In our experiments we set 
the initial K hidden unit center values based on the first K exemplars. The receptive 
fields are assigned vector values of dimensions determined by the size of the 
input and the output vectors. Each center value was set equal to the vector obtained 
by concatenating the input and the corresponding output. During the clustering phase 
the output $y_i$ is concatenated with the input $x_i$ and presented to the hidden layer.
This method finds cluster points in the (I+O)-dimensional space of the input-output 
map as defined by Equations 4.1, 4.2 and 4.3. 
$X_\alpha = \langle x_\alpha, y_\alpha \rangle$ (4.1)
$X_i = \langle x_i, y_i \rangle$ (4.2)
$X_{\alpha,new} = X_{\alpha,old} + \lambda (X_i - X_{\alpha,old})$ (4.3)
Once the cluster points or the centers are determined we disable the output field, 
and only the input field is used for computing the widths and receptive field 
responses. In Figure 3 we present a comparison of the performances of such a 
network with and without the enhanced metric clustering. Variable size networks of 
only Gaussian RBF units were used. The plots presented are for the Mackey-Glass 
data with the same parameter values used in [Farmer 88]. This method works 
significantly better when the number of hidden units is low. 
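The procedure described in this section can be sketched as follows. This is an illustration rather than the authors' code: the function name, the clustering rate `rate`, the number of passes `epochs`, and the fixed presentation order are assumptions; the initialization from the first K exemplars and the dropping of the output field after clustering follow the description above.

```python
def extended_metric_centers(inputs, outputs, n_centers, rate=0.1, epochs=20):
    """Cluster in the joint input-output space, then keep only the input part."""
    # Concatenate each input vector with its output vector.
    joint = [list(x) + list(y) for x, y in zip(inputs, outputs)]
    # Initial centers: the first n_centers joint exemplars.
    centers = [list(v) for v in joint[:n_centers]]
    for _ in range(epochs):
        for v in joint:
            # Winner is the closest center in the joint space.
            k = min(
                range(n_centers),
                key=lambda a: sum((c - vi) ** 2 for c, vi in zip(centers[a], v)),
            )
            centers[k] = [c + rate * (vi - c) for c, vi in zip(centers[k], v)]
    # Disable the output field: widths and responses use the input part only.
    n_in = len(inputs[0])
    return [c[:n_in] for c in centers]
```

The returned centers have the dimension of the input, so they can be passed unchanged to the width and response computations of the earlier stages.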
[Figure 3: Performance of enhanced metric clustering algorithm. Mackey-Glass data; normalized error vs. number of units, for nearest neighbor and enhanced metric clustering.]
5 CONCLUSIONS 
One of the emerging application areas for neural network models is real time signal 
processing. For such applications and hardware implementations, adaptive methods 
for determining network parameters are essential. Our derivations for learning rates 
are important in such situations. We have presented results indicating that in RBF
networks, performance can be improved by tuning the receptive field widths by some 
suitable overlap factor. We have presented an extended metric algorithm that 
negotiates hidden units based on added output information. We have observed more 
than 20% improvement in the normalized error measure when the number of training 
patterns, and therefore the number of hidden units, used is reasonably small. 
References 
M. Casdagli. (1989). "Nonlinear Prediction of Chaotic Time Series". Physica 35D,
335-356.
J. D. Farmer and J. J. Sidorowich. (1988). "Exploiting Chaos to Predict the Future
and Reduce Noise". Tech. Report No. LA-UR-88-901, Los Alamos National Laboratory.
John Moody and Christian Darken. (1989). "Learning with Localized Receptive
Fields". In: D. Touretzky, G. Hinton and T. Sejnowski (Eds.), Proceedings of the 1988
Connectionist Models Summer School. Morgan Kaufmann Publishing, San Mateo, CA.
P. Medgassy. (1961). Decomposition of Superposition of Distribution Functions.
Publishing House of the Hungarian Academy of Sciences, Budapest.
T. Poggio and F. Girosi. (1989). "A Theory of Networks for Approximation and
Learning". A.I. Memo No. 1140, Massachusetts Institute of Technology.
B. Widrow and S. Stearns. (1985). Adaptive Signal Processing. Prentice-Hall Inc.,
Englewood Cliffs, NJ, pp. 49, 102.
