Smoothing Regularizers for 
Projective Basis Function Networks 
John E. Moody and Thorsteinn S. RSgnvaldsson * 
Department of Computer Science, Oregon Graduate Institute 
PO Box 91000, Portland, OR 97291 
moody@cse.ogi.edu denni@cca.hh.se 
Abstract 
Smoothing regularizers for radial basis functions have been studied extensively, 
but no general smoothing regularizers for projective basis functions (PBFs), such 
as the widely-used sigmoidal PBFs, have heretofore been proposed. We de- 
rive new classes of algebraically-simple rn h-order smoothing regularizers for 
networks of the form f(W, x) = Y'j%l uj7 [X Tvj '}' Vj0] + It0, with general 
projectire basis functions 9[']. These reguladzers are: 
N 
R(W,,) -  ,11vjll 2'- Global Form 
j=l 
N 
Rt(W,m) -  2 112,,, 
- u I lvj al Form 
./=1 
These regularizers bound the corresponding rn '-order smoothing integral 
where W denotes all the network weights { u, uo, v, vo}, and (x) is a weight- 
ing function on the D-dimensional input space. The global and local cases are 
distinguished by different choices of 
The simple algebraic forms R(W, m) enable the direct enforcement of smooth- 
ness without the need for cosily Monte-Carlo integrations of S(W, m). The new 
regularizers are shown to yield better generalization errors than weight decay 
when the implicit assumptions in the latter are wrong. Unlike weight decay, the 
new regularizers distinguish between the roles of the input and output weights 
.and capture the interactions between them. 
*Address as of September 1, 1996: Centre for Computer Architecture, University of Halmstad, 
P.O.Box 823, S-301 18 Halmstad, Sweden 
586 J. E. Moody and T. S. R6gnvaldsson 
1 Introduction: What are the right biases? 
Regularizafion is a technique for reducing prediction risk by balancing model bias and 
model variance. A regularizer It(W) imposes prior constraints on the network parameters 
W. Using squared error as the most common example, the objective functional that is 
minimized during training is 
M 
1 [y(i) _ f(W, a:(i))] 2 
E= 2M i= 
+ ,XR(W) , (1) 
where y(i) are target values corresponding to the inputs a: (i), M is the number of training 
patterns, and the regularization parameter A controls the importance of the prior constraints 
relative to the fit to the data. Several approaches can be applied to estimate ) (e.g. Eubank 
(1988) or Wahba (1990)). 
Regularization reduces model variance at the cost of some model bias. An important 
question arises: What are the right biases? (Geman, Bienenstock & Doursat 1992). A good 
choice of It(W) will result in lower expected prediction error than will a poor choice. 
Weight decay is often used effectively, but it is an ad hoc technique that controls weight 
values without regard to the function f(.). It is thus not necessarily optimal and not appro- 
priate for arbitrary function parameterizafions. It will give very different results, depending 
upon whether a function is parameterized, for example, as f(w, c) or as f(w - , c). 
Since many real world problems are intrinsically smooth, we propose that in many cases, 
an appropriate bias to impose is to favor solutions with low ruth-order curvature. Direct pe- 
nalizafion of curvature is a parametrization-independent approach. The desired regularizer 
is the standard D dimensional curvature functional of order m: 
Oa: m 
(2) 
Here II II denotes the ordinary euclidean tensor norm and 0'"/0a:'" denotes the mth order 
differential operator. The weighting function f/(a:) ensures that the integral converges and 
determines the region over which we require the function to be smooth. f/(a:) is not required 
to be equal to the input density p(a:), and will most often be different. 
The use of smoothing funcfionals like (2) has been extensively studied for smoothing splines 
0Eubank 1988, Hasfie & Tibshirani 1990, Wahba 1990) and for radial basis function (RBF) 
networks (Powell 1987, Poggio & Girosi 1990, Girosi, Jones & Poggio 1995). However, 
no general class of smoothing regularizers that directly enforce smoothness $(W, m) for 
projectire basis functions (PBFs), such as the widely used sigmoidal PBFs, has been 
previously proposed. 
Since explicit enforcement of smoothness using (2) requires cosfly, impractical Monte- 
Carlo integrations,  we derive algebraically-simple regularizers R(W, m) that tightly bound 
2 Derivation of Simple Regularizers from Smoothing Functionals 
We consider single hidden layer networks with D input variables, Nh nonlinear hidden 
units, and No linear output units. For clarity, we set No = 1, and drop the subscript on Nh 
I Note that (2) is not just one integral, but actually (9 (Dm ) integrals, since the norm of the operator 
om/ox m has O(D'") terms. This is extremely expensive to compute for large D or large m. 
Smoothing Regularizers for Projective Basis Function Networks 587 
(the derivation is trivially extended to the case No > 1). Thus, our network function is 
N 
= + -o 
j=l 
(3) 
where #[.] are the nonlinear transfer functions of the internal hidden units, a:  R D is the 
input vector 2 , 0 5 are the parameters associated with internal unit j, and W denotes all 
parameters in the network. 
For regularizers R(W), we will derive strict upper bounds for S(W, m). We desire the 
regularizers to be as general as possible so that they can easily be applied to different 
network models. Without making any assumptions about fl(a:) or #(.), we have the upper 
bound 
2 dDxQ(a) (4) 
S(W) NZtlJ Oxm ' 
j=l 
( )2 2 We consider two possible 
which follows from the inequality 'j;=, ai <_ NE=i a i . 
options for the weighting function fl(a:). One is to require global smoothness, in which 
case fl(a:) is a very wide function that covers all relevant parts of the input space (e.g. a 
very wide gaussian distribution or a constant distribution). The other option is to require 
local smoothness, in which case fl(a:) approaches zero outside small regions around some 
reference points (e.g. the training data). 
2.1 Projective Basis Representations 
Projective basis functions (PBFs) are of the form g[Oj, a:] = g [a:'vj + vo], where Oj = 
{vd, rio }, vj = (v5 , v52,... , VdD ) is the vector of weights connecting hidden unit j to the 
inputs, and v5o is the bias, offset, or threshold. For PBFs, expression (4) simplifies to 
N 
211vllmta(W, (5) 
with 
Is(W,m)-- /dxQ(a:) f dm#[ZS(a:)]  ,6) 
where zs( ) m rj + rS0. 
Although the most commonly used #[-]s are sigmoids, our analysis applies to many other 
forms, for example flexible fourier units, polynomials, andrational functions. 3 The classes 
of PBF transfer functions #[.] that are applicable (as determined by fl(x)) are those for 
which the integral (8) is finite and well-defined. 
2.2 Global weighting 
For the global case, we select a gaussian form for the weighting function 
= exp L ] 
2Throughout, we use small letter boldface to denote vector quantifies. 
3See for example Moody & Yarvin (1992). 
(7) 
588 J. E. Moody and T. S. Rtgnvaldsson 
and require rr to be large. Integrating out all dimensions, except the one associated with the 
projection vector v j, we are left with 
(8) 
If (d'g[z]/dz ')2 is integrable and approaches zero outside a region that is small compared 
to a, we can bound (8) by setting the exponential equal to unity. This implies 
j(w,-0 < II,,jl---- with I(m)  rrx7- dz dz---  (9) 
oo 
The bound of equation (5) then becomes 
N 
211 .112m-l 
s(w, m) < , 
j=l 
= NI(m)Ra(W, m) , 
(10) 
where the subscript (7 stands for global. Since A absorbs all constant multiplicative factors, 
we need only weigh Ra(W, m) into the training objective function. 
2.2.1 Local weighting 
For the local case, we consider weighting functions of the general form 
i=1 
(11) 
where a: (i) are a set of points, and (a: (I), a) is a function that decays rapidly for large 
- We require that lim,-,0 (a: (i), a) = J(a: - a:(i)). Thus, when the a: (i) are the 
training data points, the limiting distribution of (11) is the empirical distribution. 
In the limit a --} 0, equation (5) becomes 
211Vl12m l dmg 
j=l -- 
(12) 
For the empirical distribution, we could compute the expression within parenthesis in (12) 
for each input pattern a: (i) during training and use it as our regularization cost. This is done 
by Bishop (1993) for the special case m = 2. However, this requires explicit design for 
each transfer function and becomes increasingly complicated as we go to higher m. To 
th 
construct a simpler and more general form, we instead assume that the m derivative of 
{dmg[z] 2 
g[.] is bounded from aboveby C;(rn) -- maxz k, az, } 
This gives the bound 
N 
211,Jll 2m 
s(w, . _< VC(m) 
j----1 
= NC;(m)R;( m) 
(13) 
for the maximum local curvature of the function (the subscript L denotes local limit). 
Smoothing Regularizers for Projective Basis Function Networks 589 
3 Empirical Example 
We have done extensive simulation studies that demonstrate the efficacy of our new reg- 
ularizers for PBF networks on a variety of problems. An account is given in Moody & 
RSgnvaldsson (1996). Here, we demonstrate the value of using smoothing regularizers on a 
simple problem which illustrates a key difference between smoothing and quadratic weight 
decay, the two dimensional bilinear function 
(14) 
This example was used by Friedman & Stuetzle (1981) to demonstrate projection pursuit 
regression. It is the simplest function with interactions between input variables. 
We fit this function with one hidden layer networks using the m = { 1,2, 3} smoothing 
regularizers, comparing the results with using weight decay. In a large set of experiments, 
we find that both the global and local smoothing regularizers with m = 2 and m = 3 
outperform weight decay. An example is shown in figure 1. The local m = 1 case performs 
poorly, which is unsurprising, given that the target function is quadratic. Weight decay 
performs poorly because it lacks any form of interaction between the input layer and output 
layer weights oj and 
0.1 
0.0 I I I I I I I 0.01 I I I & I I I 
-6 -5 -4 -3 -2 -1 0 I -6 -5 -4 -3 -2 -I 0 1 
LogtO0.) LogtO0.) 
Figure 1: (a) Generalization errors on the x]x2 problem, with 40 training data points and a 
signal-to-noise ratio of 2/3, for different values of the regularization parameter and different 
orders of the smoothing regularizer. For each value of ), 10 networks with 8 hidden units' 
have been trained and averaged (geometric average). The shaded area shows the 95% 
confidence bands for the average performance of a linear model on the same problem. 
(b) Similar plot for the m = 3 smoother compared to the standard weight decay method. 
Error bars mark the estimated standard deviation of the mean generalization error of the 10 
networks. The m = 3 regularizer performs significantly better than weight decay. 
4 Quality of the Regularizers: Approximations vs Bounds 
Equations (10) and (13) are strict upper bounds to the smoothness functional $(W, m), 
eq. (2), in the global and local limits, a --} o and a --} 0. However, if the bounds are not 
sufficiently tight, then penalizing B(W, m) may not have the effect of penalizing S(W, m) 4. 
4For the proposed reguladzers R(W, m) to be effective in penalizing S(W, m), we need only 
have an approximate monotonic relationship between them. 
590 J. E. Moody and T. S. R6gnvaldsson 
The bound (4) is tighter the more uncorrelated the mth derivatives of the internal unit 
activities are. If they are uncorrelated, then the bounds of equations (10) and (13) can be 
replaced by the approximations: 
(15) 
sL(w,.0 , (16) 
( )2 2 The right hand sides differ from those in equations (10) and 
(13) only by a factor of N, so these approximations are actually proportional to the bounds. 
For our regularizers, the constant factor N doesn't matter, since it can be absorbed into 
the regularization parameter ) (along with the values of the factors It(m) or C';(m)). In 
practical terms, there is no difference between using the upper bounds (10) and (13) or the 
uncorrelated approximations (15) and (16). Our empirical results (see figure 2) indicate 
that an approximate linear relationship holds between $(W, m) and/t(W, m) for both the 
global and the local cases. This suggests that the uncorrelated hidden unit assumption 
yields a good approximation. This approximation also improves with the dimensionality of 
the input space. Extensive results and discussion are presented in (Moody & R6gnvaldsson 
1996). 
14 First order  smooths, ptojbasltanh units 
' Lin regression (c'oetali = O.OJ2) --- 
12 
> 2 
1 2 3 4 5 
Mome Carlo estimated value ol 8(W, 1) [E+3] 
. order global smoolhef, projection-based tanh units 
' ur r (r: b.)--. 
7 0 015 110 115 210 2.5 
Monte Carlo estimated value d SOY,2) [1:+6] 
Figure 2: Linear correlation between S(W, m) and the global Ro(W, m) for neural net- 
works with 10 input units, 10 internal tanh[.] PBF units, and one linear output. The values 
of S(W, m) are computed through Monte Carlo integration. The left graph shows m = 1 
and the right graph shows m = 2. Results are similar for the local form R(W, m). 
5 Summary 
Our regularizers R(W, rn) are the first general class of ruth-order smoothing regularizers 
to be proposed for projective basis function (PBF) networks. They apply to large classes 
of transfer functions #[.], including sigmoids. They differ fundamentally from quadratic 
weight decay in that they distinguish the roles of the input and output weights and capture 
the interactions between them. 
Our approach is quite different from that developed for smoothing splines and smoothing 
radial basis functions (RBFs), since we derive smoothing regularizers for given classes of 
units #[0, a:], rather than derive the forms of the units #[.] by requiring them to be Greens 
functions of the smoothing operator S(-). Our approach thus has the advantage that it can 
be applied to the types of networks most often used in practice, namely PBFs. 
Smoothing Regularizers for Projective Basis Function Networks 591 
In Moody & RSgnvaldsson (1996), we present further analysis and simulation results for 
PBFs. We have also extended our work to RBFs (Moody & R/gnvaldsson 1997). 
Acknowledgements 
Both authors thank Steve Rehfuss and Lizhong Wu for stimulating input. John Moody 
thanks Volker Tresp for a provocative discussion at a 1991 Neural Networks Workshop 
sponsored by the Deutsche Informatik Akademie. We gratefully acknowledge support for 
this work from ARPA and ONR (grant N00014-92-J-4062), NSF (grant CDA-9503968), the 
Swedish Institute, and the Swedish Research Council for Engineering Sciences (contract 
TFR-282-95-847). 
References 
Bishop, C. (1993), 'Curvature-driven smoothing: A learning algorithm for feedforward networks', 
IEEE Trans. Neural Networks 4, 882-884. 
Eubank, R. L. (1988), Spline Smoothing and Nonparametric Regression, Marcel Dekker, Inc. 
Friedman, J. H. & Stuetzle, W. (1981), 'Projection pursuit regression', J. Amer. Stat. Assoc. 
76(376), 817-823. 
Geman, S., Bienenstock, E. & Doursat, R. (1992), 'Neural networks and the bias/variance dilemma', 
Neural Computation 4(1), 1-58. 
Girosi, F., Jones, M. & Poggio, T. (1995), 'Regularization theory and neural network architectures', 
Neural Computation 7, 219-269. 
Hastie, T. J. & Tibshirani, R. J. (1990), Generalized Additive Models, Vol. 43 of Monographs on 
Statistics and Applied Probability, Chapman and Hall. 
Moody, J. E. & Yarvin, N. (1992), Networks with learned unit response functions, in J. E. Moody, 
S. J. Hanson & R. P. Lippmann, eds, 'Advances in Neural Information Processing Systems 4', 
Morgan Kaufmann Publishers, San Mateo, CA, pp. 1048-55. 
Moody, J. & R0gnvaldsson, T. (1996), Smoothing regularizers for projectire basis function networks, 
Submitted to Neural Computation. 
Moody, J. & RiSgnvaldsson, T. (1997), Smoothing regularizers for radial basis function networks, 
Manuscript in preparation. 
Poggio, T. & Girosi, F. (1990), 'Networks for approximation and learning', IEEE Proceedings 78(9). 
Powell, M. (1987), Radial basis functions for multivariable interpolation: a review., in J. Mason & 
M. Cox, eds, 'Algorithms for Approximation', Clarendon Press, Oxford. 
Wahba, G. (1990), Spline models for observational data, CBMS-NSF Regional Conference Series in 
Applied Mathematics. 
