A Smoothing Regularizer for Recurrent 
Neural Networks 
Lizhong Wu and John Moody 
Oregon Graduate Institute, Computer Science Dept., Portland, OR 97291-1000 
Abstract 
We derive a smoothing regularizer for recurrent network models by 
requiring robustness in prediction performance to perturbations of 
the training data. The regularizer can be viewed as a generaliza- 
tion of the first order Tikhonov stabilizer to dynamic models. The 
closed-form expression of the regularizer covers both time-lagged 
and simultaneous recurrent nets, with feedforward nets and one- 
layer linear nets as special cases. We have successfully tested this 
regularizer in a number of case studies and found that it performs 
better than standard quadratic weight decay. 
1 Introduction
One technique for preventing a neural network from overfitting noisy data is to add
a regularizer to the error function being minimized. Regularizers typically smooth
the fit to noisy data. Well-established techniques include ridge regression, see
(Hoerl & Kennard 1970), and more generally spline smoothing functions or Tikhonov
stabilizers that penalize the mth-order squared derivatives of the function being fit,
as in (Tikhonov & Arsenin 1977), (Eubank 1988), (Hastie & Tibshirani 1990) and
(Wahba 1990). These methods have recently been extended to networks of radial
basis functions (Girosi, Jones & Poggio 1995), and several heuristic approaches have
been developed for sigmoidal neural networks, for example, quadratic weight decay
(Plaut, Nowlan & Hinton 1986), weight elimination (Scalettar & Zee 1988),
(Chauvin 1990), (Weigend, Rumelhart & Huberman 1990) and soft weight sharing
(Nowlan & Hinton 1992).¹ All previous studies on regularization have concentrated on
feedforward neural networks. To our knowledge, recurrent learning with regularization
has not been reported before.

¹ Two additional papers related to ours, but dealing only with feedforward networks,
came to our attention or were written after our work was completed. These are (Bishop
1995) and (Leen 1995). Also, Moody & Rögnvaldsson (1995) have recently proposed
several new classes of smoothing regularizers for feedforward nets.
In Section 2 of this paper, we develop a smoothing regularizer for general dynamic 
models which is derived by considering perturbations of the training data. We 
present a closed-form expression for our regularizer for two layer feedforward and 
recurrent neural networks, with standard weight decay being a special case. In 
Section 3, we evaluate our regularizer's performance on predicting the U.S. Index 
of Industrial Production. The advantage of our regularizer is demonstrated by 
comparing to standard weight decay in both feedforward and recurrent modeling. 
Finally, we conclude our paper in Section 4. 
2 Smoothing Regularization 
2.1 Prediction Error for Perturbed Data Sets 
Consider a training data set $\{P: Z(t), X(t)\}$, where the targets $Z(t)$ are assumed to
be generated by an unknown dynamical system $F^*(I(t))$ and an unobserved noise
process:

$$Z(t) = F^*(I(t)) + \epsilon^*(t) \quad \text{with} \quad I(t) = \{X(s),\ s = 1, 2, \ldots, t\} . \tag{1}$$

Here, $I(t)$ is the information set containing both current and past inputs $X(s)$, and
the $\epsilon^*(t)$ are independent random noise variables with zero mean and variance
$\sigma^{*2}$. Consider next a dynamic network model $\hat{Z}(t) = F(\Phi, I(t))$ to be trained on data set
$P$, where $\Phi$ represents a set of network parameters, and $F(\cdot)$ is a network transfer
function which is assumed to be nonlinear and dynamic. We assume that $F(\cdot)$ has
good approximation capabilities, such that $F(\Phi_P, I(t)) \approx F^*(I(t))$ for learnable
parameters $\Phi_P$.
Our goal is to derive a smoothing regularizer for a network trained on the actual
data set $P$ that in effect optimizes the expected network performance (prediction
risk) on perturbed test data sets of form $\{Q: \tilde{Z}(t), \tilde{X}(t)\}$. The elements of $Q$ are
related to the elements of $P$ via small random perturbations $\epsilon_z(t)$ and $\epsilon_x(t)$, so that

$$\tilde{Z}(t) = Z(t) + \epsilon_z(t) , \tag{2}$$
$$\tilde{X}(t) = X(t) + \epsilon_x(t) . \tag{3}$$

The $\epsilon_z(t)$ and $\epsilon_x(t)$ have zero mean and variances $\sigma_z^2$ and $\sigma_x^2$ respectively. The
training and test errors for the data sets $P$ and $Q$ are

$$D_P = \frac{1}{N} \sum_{t=1}^{N} \left[ Z(t) - F(\Phi_P, I(t)) \right]^2 , \tag{4}$$
$$D_Q = \frac{1}{N} \sum_{t=1}^{N} \left[ \tilde{Z}(t) - F(\Phi_P, \tilde{I}(t)) \right]^2 , \tag{5}$$

where $\Phi_P$ denotes the network parameters obtained by training on data set $P$, and
$\tilde{I}(t) = \{\tilde{X}(s),\ s = 1, 2, \ldots, t\}$ is the perturbed information set of $Q$. With this
notation, our goal is to minimize the expected value of $D_Q$, while training on $D_P$.
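Consider the prediction error for the perturbed data point at time $t$:

$$d(t) = \left[ \tilde{Z}(t) - F(\Phi_P, \tilde{I}(t)) \right]^2 . \tag{6}$$

With Eqn (2), we obtain

$$\begin{aligned}
d(t) &= \left[ Z(t) + \epsilon_z(t) - F(\Phi_P, I(t)) + F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t)) \right]^2 \\
&= \left[ Z(t) - F(\Phi_P, I(t)) \right]^2 + \left[ F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t)) \right]^2 + \left[ \epsilon_z(t) \right]^2 \\
&\quad + 2 \left[ Z(t) - F(\Phi_P, I(t)) \right] \left[ F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t)) \right] \\
&\quad + 2 \epsilon_z(t) \left[ Z(t) - F(\Phi_P, \tilde{I}(t)) \right] .
\end{aligned} \tag{7}$$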
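Assuming that $\epsilon_z(t)$ is uncorrelated with $[Z(t) - F(\Phi_P, \tilde{I}(t))]$ and averaging over
the exemplars of data sets $P$ and $Q$, Eqn (7) becomes

$$\begin{aligned}
D_Q &= D_P + \frac{1}{N} \sum_{t=1}^{N} \left[ F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t)) \right]^2 + \frac{1}{N} \sum_{t=1}^{N} \left[ \epsilon_z(t) \right]^2 \\
&\quad + \frac{2}{N} \sum_{t=1}^{N} \left[ Z(t) - F(\Phi_P, I(t)) \right] \left[ F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t)) \right] .
\end{aligned} \tag{8}$$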
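The decomposition in Eqns (7) and (8) can be checked numerically. The sketch below is illustrative only: the toy map `F`, the synthetic data, and the noise levels are all assumptions and not the setup used in this paper. For simplicity it uses a static model, so that I(t) reduces to X(t).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

def F(x):
    # Hypothetical stand-in for the trained network F(Phi_P, .)
    return np.tanh(1.5 * x)

X = rng.normal(size=N)                       # inputs of data set P
Z = np.sin(X) + 0.1 * rng.normal(size=N)     # targets of data set P
eps_x = 0.05 * rng.normal(size=N)            # input perturbation
eps_z = 0.05 * rng.normal(size=N)            # target perturbation
X_q, Z_q = X + eps_x, Z + eps_z              # perturbed data set Q, Eqns (2)-(3)

D_P = np.mean((Z - F(X)) ** 2)               # Eqn (4)
D_Q = np.mean((Z_q - F(X_q)) ** 2)           # Eqn (5)

smooth = np.mean((F(X) - F(X_q)) ** 2)       # second term of Eqn (8)
noise = np.mean(eps_z ** 2)                  # third term of Eqn (8)
cross = 2 * np.mean((Z - F(X)) * (F(X) - F(X_q)))   # fourth term of Eqn (8)
resid = 2 * np.mean(eps_z * (Z - F(X_q)))    # last term of Eqn (7), averaged over t

# With `resid` included, the averaged Eqn (7) is an exact identity.
assert np.isclose(D_Q, D_P + smooth + noise + cross + resid)
```

Eqn (8) follows once `resid` is dropped, which is what the uncorrelatedness assumption justifies; on a finite sample it is small but not exactly zero.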
The third term, $\frac{1}{N} \sum_{t=1}^{N} [\epsilon_z(t)]^2$, in Eqn (8) is independent of the weights, so it can
be neglected during the learning process. The fourth term in Eqn (8) is the
cross-covariance between $[Z(t) - F(\Phi_P, I(t))]$ and $[F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t))]$. Using
the inequality $2ab \le a^2 + b^2$, we can see that minimizing the first term $D_P$ and
the second term $\frac{1}{N} \sum_{t=1}^{N} [F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t))]^2$ in Eqn (8) during training
will automatically decrease the effect of the cross-covariance term. Therefore, we
exclude the cross-covariance term from the training criterion.
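To make this step explicit, write $a_t = Z(t) - F(\Phi_P, I(t))$ and $b_t = F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t))$; these shorthand symbols are introduced here only for illustration. Applying the inequality term by term bounds the magnitude of the cross-covariance term by the sum of the first two terms:

```latex
\left| \frac{2}{N} \sum_{t=1}^{N} a_t b_t \right|
\le \frac{1}{N} \sum_{t=1}^{N} \left( a_t^2 + b_t^2 \right)
= D_P + \frac{1}{N} \sum_{t=1}^{N} \left[ F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t)) \right]^2 .
```

Driving both of these terms down during training therefore also shrinks the largest value the cross term can take.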
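The above analysis shows that the expected test error $D_Q$ can be minimized by
minimizing the objective function $D$:

$$D = \frac{1}{N} \sum_{t=1}^{N} \left[ Z(t) - F(\Phi, I(t)) \right]^2 + \frac{1}{N} \sum_{t=1}^{N} \left[ F(\Phi_P, I(t)) - F(\Phi_P, \tilde{I}(t)) \right]^2 . \tag{9}$$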
In Eqn (9), the second term is the time average of the squared disturbance
$\|\hat{\tilde{Z}}(t) - \hat{Z}(t)\|^2$ of the trained network output due to the input perturbation
$\|\tilde{I}(t) - I(t)\|^2$, where $\hat{\tilde{Z}}(t) = F(\Phi_P, \tilde{I}(t))$ is the network output for the perturbed
inputs. Minimizing this term demands that small changes in the input
variables yield correspondingly small changes in the output. This is the standard
smoothness prior, namely that if nothing else is known about the function to be
approximated, a good option is to assume a high degree of smoothness. Without
knowing the correct functional form of the dynamical system $F^*$ or using such prior
assumptions, the data fitting problem is ill-posed. In (Wu & Moody 1996), we have
shown that the second term in Eqn (9) is a dynamic generalization of the first order
Tikhonov stabilizer.
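The link to a first order stabilizer is easiest to see in the static case. The following is only a sketch under stated assumptions, not the dynamic derivation of (Wu & Moody 1996): the model is taken to be a static, differentiable map with scalar output, the input perturbation is small enough for a first order Taylor expansion, and its components are assumed uncorrelated with common variance $\sigma_x^2$.

```latex
F(\Phi, X(t) + \epsilon_x(t)) - F(\Phi, X(t))
  \approx \nabla_X F(\Phi, X(t))^{\top} \epsilon_x(t) ,
\qquad
\mathrm{E}\left\{ \left[ \nabla_X F(\Phi, X(t))^{\top} \epsilon_x(t) \right]^2 \right\}
  = \sigma_x^2 \left\| \nabla_X F(\Phi, X(t)) \right\|^2 .
```

Under these assumptions the expected squared output disturbance is proportional to the squared first derivative of the fitted function with respect to its inputs, which is the quantity a first order Tikhonov stabilizer penalizes.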
2.2 Form of the Proposed Smoothing Regularizer 
Consider a general, two layer, nonlinear, dynamic network with recurrent connections
on the internal layer² as described by

$$Y(t) = f\left( W Y(t - \tau) + V X(t) \right) , \qquad \hat{Z}(t) = U Y(t) \tag{10}$$

where $X(t)$, $Y(t)$ and $\hat{Z}(t)$ are respectively the network input vector, the hidden
output vector and the network output; $\Phi = \{U, V, W\}$ are the output, input and
recurrent connections of the network; $f(\cdot)$ is the vector-valued nonlinear transfer
function of the hidden units; and $\tau$ is a time delay in the feedback connections of
the hidden layer which is pre-defined by a user and will not be changed during learning.
$\tau$ can be zero, a fraction, or an integer, but we are interested in the cases with a
small $\tau$.³

² Our derivation can easily be extended to other network structures.
³ When the time delay $\tau$ exceeds some critical value, a recurrent network becomes
unstable and lies in oscillatory modes. See, for example, (Marcus & Westervelt 1989).
When $\tau = 1$, our model is a recurrent network as described by (Elman 1990) and
(Rumelhart, Hinton & Williams 1986) (see Figure 17 on page 355). When $\tau$ is equal
to some fraction smaller than one, the network evolves $1/\tau$ times within each input
time interval. When $\tau$ decreases and approaches zero, our model is the same as the
network studied by (Pineda 1989), and earlier, widely-studied additive networks. In
(Pineda 1989), $\tau$ was referred to as the network relaxation time scale. (Werbos 1992)
distinguished the recurrent networks with zero $\tau$ and non-zero $\tau$ by calling them
simultaneous recurrent networks and time-lagged recurrent networks respectively.
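For the time-lagged case with a delay of one step, the recursion of Eqn (10) can be written out directly. The following is a minimal sketch, assuming tanh hidden units and a zero initial hidden state; the function name `forward`, the layer sizes, and the random weights are hypothetical choices made for illustration.

```python
import numpy as np

def forward(U, V, W, X_seq, f=np.tanh):
    """Eqn (10) with a delay of 1: Y(t) = f(W Y(t-1) + V X(t)), Z_hat(t) = U Y(t)."""
    Y = np.zeros(W.shape[0])           # assumed initial hidden state Y(0) = 0
    outputs = []
    for X_t in X_seq:
        Y = f(W @ Y + V @ X_t)         # hidden-layer update with one-step feedback
        outputs.append(U @ Y)          # linear output layer
    return np.array(outputs)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 1, 5     # hypothetical sizes
U = rng.normal(size=(n_out, n_hid))    # output connections
V = rng.normal(size=(n_hid, n_in))     # input connections
W = 0.1 * rng.normal(size=(n_hid, n_hid))   # recurrent connections
X_seq = rng.normal(size=(T, n_in))

Z_hat = forward(U, V, W, X_seq)                   # recurrent model, shape (T, n_out)
Z_ff = forward(U, V, np.zeros_like(W), X_seq)     # W = 0 reduces to a feedforward net
```

Setting the recurrent matrix to zero removes all dependence on past inputs, which is the feedforward special case.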
We have found that minimizing the second term of Eqn (9) can be achieved by
smoothing the output response to an input perturbation at every time step. This
yields, see (Wu & Moody 1996):

$$\|\hat{\tilde{Z}}(t) - \hat{Z}(t)\|^2 \le \rho_\tau^2(\Phi_P) \, \|\tilde{X}(t) - X(t)\|^2 \quad \text{for } t = 1, 2, \ldots, N . \tag{11}$$

We call $\rho_\tau^2(\Phi_P)$ the output sensitivity of the trained network $\Phi_P$ to an input
perturbation. $\rho_\tau^2(\Phi_P)$ is determined by the network parameters only and is independent
of the time variable $t$.
We obtain our new regularizer by training directly on the expected prediction error
for perturbed data sets $Q$. Based on the analysis leading to Eqns (9) and (11), the
training criterion thus becomes

$$D = \frac{1}{N} \sum_{t=1}^{N} \left[ Z(t) - F(\Phi, I(t)) \right]^2 + \lambda \, \rho_\tau^2(\Phi) . \tag{12}$$
The coefficient $\lambda$ in Eqn (12) is a regularization parameter that measures the degree
of input perturbation $\|\tilde{I}(t) - I(t)\|^2$. The algebraic form for $\rho_\tau(\Phi)$ as derived in
(Wu & Moody 1996) is:

$$\rho_\tau(\Phi) = \frac{\gamma \|U\| \|V\|}{1 - \gamma \|W\|} \left\{ 1 - \exp\left( \frac{\gamma \|W\| - 1}{\tau} \right) \right\} \tag{13}$$

for time-lagged recurrent networks ($\tau > 0$). Here, $\|\cdot\|$ denotes the Euclidean matrix
norm. The factor $\gamma$ depends upon the maximal value of the first derivatives of the
activation functions of the hidden units and is given by:

$$\gamma = \max_{t,j} \left| f'(o_j(t)) \right| , \tag{14}$$
where $j$ is the index of hidden units and $o_j(t)$ is the input to the $j$th unit. In general,
$\gamma \le 1$.⁴ To ensure stability and that the effects of small input perturbations are
damped out, it is required, see (Wu & Moody 1996), that

$$\gamma \|W\| < 1 . \tag{15}$$
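The factor of Eqn (14) is easy to evaluate for common activation functions. The sketch below takes the maximum over a dense grid of hidden-unit inputs rather than over the inputs actually seen during training, so it gives an upper bound on the data-dependent value; the grid and the variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

o = np.linspace(-10.0, 10.0, 2001)     # grid of hidden-unit inputs, includes 0

# Logistic sigmoid: f'(x) = f(x) [1 - f(x)], largest at x = 0
gamma_sigmoid = np.max(np.abs(sigmoid(o) * (1.0 - sigmoid(o))))

# Hyperbolic tangent: f'(x) = 1 - tanh(x)^2, largest at x = 0
gamma_tanh = np.max(np.abs(1.0 - np.tanh(o) ** 2))
```

On this grid the sigmoid value comes out at about 0.25 and the tanh value at about 1, so both satisfy the bound of at most one.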
The regularizer Eqn (13) can be deduced for the simultaneous recurrent networks in
the limit $\tau \to 0$ by:

$$\rho(\Phi) = \rho_0(\Phi) = \frac{\gamma \|U\| \|V\|}{1 - \gamma \|W\|} . \tag{16}$$
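The output sensitivity and the training criterion of Eqn (12) can be sketched in a few lines. Several points here are assumptions rather than statements from this paper: the Euclidean matrix norm is taken to be the Frobenius norm, the delay-dependent factor of Eqn (13) is taken to be 1 - exp((gamma ||W|| - 1) / tau), which reduces to Eqn (16) as the delay goes to zero, and the function names `rho` and `criterion` and the example matrices are hypothetical.

```python
import numpy as np

def rho(U, V, W, gamma=1.0, tau=1.0):
    """Output sensitivity: Eqn (13) for tau > 0, Eqn (16) for tau = 0."""
    nU, nV, nW = (np.linalg.norm(M) for M in (U, V, W))   # Frobenius norms (assumed)
    if gamma * nW >= 1.0:
        raise ValueError("stability condition gamma * ||W|| < 1 of Eqn (15) is violated")
    rho_0 = gamma * nU * nV / (1.0 - gamma * nW)           # simultaneous limit, Eqn (16)
    if tau == 0:
        return rho_0
    # Assumed form of the delay-dependent factor in Eqn (13)
    return rho_0 * (1.0 - np.exp((gamma * nW - 1.0) / tau))

def criterion(Z, Z_hat, U, V, W, lam, gamma=1.0, tau=1.0):
    """Eqn (12) for a scalar output: mean squared error plus lam * rho^2."""
    return np.mean((Z - Z_hat) ** 2) + lam * rho(U, V, W, gamma, tau) ** 2

# Small hypothetical example
U = np.array([[1.0, -2.0]])
V = np.array([[0.5, 0.0], [0.0, 0.5]])
W = np.array([[0.3, 0.0], [0.0, 0.4]])
rho_lagged = rho(U, V, W, tau=1.0)     # time-lagged network
rho_simul = rho(U, V, W, tau=0)        # simultaneous network
```

Because the penalty couples the three weight matrices through a product and a quotient, shrinking any one of them lowers the sensitivity, and the recurrent weights are penalized increasingly strongly as the stability limit of Eqn (15) is approached.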
If the network is feedforward, $W = 0$ and $\tau = 0$, Eqns (13) and (16) become

$$\rho(\Phi) = \gamma \|U\| \|V\| . \tag{17}$$

Moreover, if there is no hidden layer and the inputs are directly connected to the
outputs via $U$, the network is an ordinary linear model, and we obtain

$$\rho(\Phi) = \|U\| , \tag{18}$$

which is standard quadratic weight decay (Plaut et al. 1986) as is used in ridge
regression (Hoerl & Kennard 1970).

⁴ For instance, $f'(x) = [1 - f(x)] f(x)$ if $f(x) = \frac{1}{1 + e^{-x}}$. Then, $\gamma = \max |f'(x)| = \frac{1}{4}$.
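The feedforward special case can be checked empirically: for a two layer tanh network, the ratio of output change to input change should never exceed the product of the weight norms. The sketch below is illustrative only; the random weights, the perturbation size, and the use of the Frobenius norm for the Euclidean matrix norm are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(1, 4))            # hypothetical output weights
V = rng.normal(size=(4, 3))            # hypothetical input weights

def net(X):
    # Feedforward special case of Eqn (10): W = 0, tanh hidden units
    return U @ np.tanh(V @ X)

# Eqn (17) with gamma = 1, the largest slope of tanh
rho_ff = 1.0 * np.linalg.norm(U) * np.linalg.norm(V)

ratios = []
for _ in range(1000):
    X = rng.normal(size=3)
    dX = 0.01 * rng.normal(size=3)     # small random input perturbation
    ratios.append(np.linalg.norm(net(X + dX) - net(X)) / np.linalg.norm(dX))

worst = max(ratios)                    # largest observed sensitivity

# For a linear model the squared sensitivity is the usual sum of squared weights
weight_decay_penalty = np.linalg.norm(U) ** 2
```

The observed worst-case ratio stays below `rho_ff`, and the squared norm in the linear case is exactly the quadratic weight decay penalty.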
The regularizer (Eqn (17) for feedforward networks and Eqn (13) for recurrent
networks) was obtained by requiring smoothness of the network output to perturbations
of data. We therefore refer to it as a smoothing regularizer. Several approaches can
be applied to estimate the regularization parameter $\lambda$, as in (Eubank 1988), (Hastie
& Tibshirani 1990) and (Wahba 1990). We will not discuss this subject in this
paper.
In the next section, we evaluate the new regularizer for the task of predicting the 
U.S. Index of Industrial Production. Additional empirical tests can be found in 
(Wu & Moody 1996). 
3 Predicting the U.S. Index of Industrial Production 
The Index of Industrial Production (IP) is one of the key measures of economic 
activity. It is computed and published monthly. Our task is to predict the one- 
month rate of change of the index from January 1980 to December 1989 for models 
trained from January 1950 to December 1979. The exogenous inputs we have used
include 8 time series such as the index of leading indicators, housing starts, the
money supply M2, and the S&P 500 Index. These 8 series are also recorded monthly.
In previous studies by (Moody, Levin & Rehfuss 1993), with the same defined 
training and test data sets, the normalized prediction errors of the one month rate 
of change were 0.81 with the netm neural network simulator, and 0.75 with the 
proj neural network simulator. 
We have simulated feedforward and recurrent neural network models. Both models
consist of two layers. There are 9 input units in the recurrent model, which receive
the 8 exogenous series and the previous month IP index change. We set the
time-delayed length in the recurrent connections $\tau = 1$. The feedforward model is
constructed with 36 input units, which receive 4 time-delayed versions of each input
series. The time-delay lengths are 1, 3, 6 and 12, respectively. The activation
functions of hidden units in both feedforward and recurrent models are tanh functions.
The number of hidden units varies from 2 to 6. Each model has one linear output
unit.
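The 36 inputs of the feedforward model correspond to 9 series at 4 delays each. One way to build such an input matrix is sketched below; the helper name `lagged_inputs`, the column ordering, and the random placeholder data are assumptions, since the original preprocessing is not described here.

```python
import numpy as np

def lagged_inputs(series, lags=(1, 3, 6, 12)):
    """Stack delayed copies of every column: output row i holds series[t - lag]
    for each lag, where t = i + max(lags) is the month being predicted."""
    T = series.shape[0]
    max_lag = max(lags)
    cols = [series[max_lag - lag : T - lag] for lag in lags]
    return np.hstack(cols)             # shape (T - max_lag, n_series * len(lags))

# Placeholder data: 360 months (Jan 1950 to Dec 1979) by 9 series
# (8 exogenous series plus the IP index change)
series = np.random.default_rng(0).normal(size=(360, 9))
X_ff = lagged_inputs(series)           # 36 columns, one block of 9 per delay
```

The first 12 months are lost because the longest delay needs a full year of history before the first usable row.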
We have divided the data from January 1950 to December 1979 into four non- 
overlapping sub-sets. One sub-set consists of 70% of the original data and each of 
the other three subsets consists of 10% of the original data. The larger sub-set is 
used as training data and the three smaller sub-sets are used as validation data. 
These three validation data sets are used, respectively, to determine when to stop
training, to select the regularization parameter, and to select the number
of hidden units.
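A 70/10/10/10 split of this kind can be sketched as follows. The assignment of individual months at random is an assumption (the text does not say whether the subsets are contiguous in time), and the function and variable names are hypothetical.

```python
import numpy as np

def partition(n, rng, fractions=(0.7, 0.1, 0.1, 0.1)):
    """Randomly split indices 0..n-1 into non-overlapping subsets of the given sizes."""
    idx = rng.permutation(n)
    cuts = np.cumsum([int(round(f * n)) for f in fractions])[:-1]
    return np.split(idx, cuts)

rng = np.random.default_rng(0)
# 360 monthly observations from January 1950 to December 1979
train, val_stop, val_lambda, val_hidden = partition(360, rng)
```

Keeping a separate validation subset for each decision (stopping, regularization parameter, number of hidden units) avoids reusing the same data for several selections.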
We have formed 10 random training-validation partitions. For each training- 
validation partition, three networks with different initial weight parameters are 
trained. Therefore, our prediction committee is formed by 30 networks. 
The committee error is the average of the errors of all committee members. All 
networks in the committee are trained simultaneously and stopped at the same 
time based on the committee error of a validation set. The value of the regulariza- 
tion parameter and the number of hidden units are determined by minimizing the 
committee error on separate validation sets. 
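Two kinds of averaging appear in this section: the average of the members' errors, used above for stopping, and the error of the averaged prediction, reported in the Committee column of Table 1. The sketch below contrasts them on synthetic placeholder data; the normalization by the target variance is an assumption, since the exact definition of the normalized error is not given here.

```python
import numpy as np

def normalized_error(pred, target):
    # Assumed normalization: mean squared error divided by the target variance
    return np.mean((pred - target) ** 2) / np.var(target)

rng = np.random.default_rng(0)
target = rng.normal(size=120)                         # placeholder for 120 test months
preds = target + 0.8 * rng.normal(size=(30, 120))     # placeholder outputs of 30 networks

member_errors = np.array([normalized_error(p, target) for p in preds])
mean_member_error = member_errors.mean()              # average of the members' errors
committee_error = normalized_error(preds.mean(axis=0), target)   # error of the averaged prediction
```

Because squared error is convex, the error of the averaged prediction can never exceed the average of the individual errors.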
Table 1: Normalized prediction errors for the one-month rate of return on the U.S.
Index of Industrial Production (Jan. 1980 - Dec. 1989). Each result is based on 30
networks.

Model        Regularizer   Mean ± Std     Median  Max    Min    Committee
Recurrent    Smoothing     0.646 ± 0.008  0.647   0.657  0.632  0.639
Networks     Weight Decay  0.734 ± 0.018  0.737   0.767  0.704  0.734
Feedforward  Smoothing     0.700 ± 0.023  0.707   0.729  0.654  0.693
Networks     Weight Decay  0.745 ± 0.043  0.748   0.805  0.676  0.731

Table 1 compares the out-of-sample performance of recurrent networks and feedforward
networks trained with our smoothing regularizer to that of networks trained
with standard weight decay. The results are based on 30 networks. As shown, the
smoothing regularizer outperforms standard weight decay with 95% confidence
(based on a t-test) for both recurrent networks and feedforward
networks. We also list the median, maximal and minimal prediction errors
over 30 predictors. The last column gives the committee results, which are based on
the simple average of 30 network predictions. We see that the median, maximal and
minimal values and the committee results obtained with the smoothing regularizer
are all smaller than those obtained with standard weight decay, in both recurrent
and feedforward network models.
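The significance claim can be roughly cross-checked from the means and standard deviations in Table 1. The sketch below computes a Welch-type t statistic from those summary values. It is not the authors' test: it assumes the 30 errors in each group are independent samples and that the reported deviations are sample standard deviations, neither of which is stated in the text.

```python
import math

def welch_t(mean_a, std_a, mean_b, std_b, n=30):
    """Welch t statistic from summary statistics of two groups of n networks each."""
    return (mean_b - mean_a) / math.sqrt((std_a ** 2 + std_b ** 2) / n)

# Smoothing regularizer (a) versus weight decay (b), values from Table 1
t_recurrent = welch_t(0.646, 0.008, 0.734, 0.018)
t_feedforward = welch_t(0.700, 0.023, 0.745, 0.043)
```

Under these assumptions the statistics come out at roughly 24 for the recurrent models and roughly 5 for the feedforward models, both well above the value of about 2 needed at the 95% level.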
4 Concluding Remarks 
Regularization in learning can prevent a network from overtraining. Several techniques
have been developed in recent years, but all these are specialized for feedforward
networks. To the best of our knowledge, a regularizer for a recurrent network has
not been reported previously.
We have developed a smoothing regularizer for recurrent neural networks that captures
the dependencies of input, output, and feedback weight values on each other.
The regularizer covers both simultaneous and time-lagged recurrent networks, with
feedforward networks and single layer, linear networks as special cases. Our smoothing
regularizer for linear networks has the same form as standard weight decay. The
regularizer depends only on the network parameters and is easy to
use. A more detailed description of this work appears in (Wu & Moody 1996).
References 
Bishop, C. (1995), 'Training with noise is equivalent to Tikhonov regularization', 
Neural Computation 7(1), 108-116. 
Chauvin, Y. (1990), Dynamic behavior of constrained back-propagation networks, 
in D. Touretzky, ed., 'Advances in Neural Information Processing Systems 2', 
Morgan Kaufmann Publishers, San Francisco, CA, pp. 642-649. 
Elman, J. (1990), 'Finding structure in time', Cognitive Science 14, 179-211.
Eubank, R. L. (1988), Spline Smoothing and Nonparametric Regression, Marcel 
Dekker, Inc. 
Girosi, F., Jones, M. & Poggio, T. (1995), 'Regularization theory and neural net- 
works architectures', Neural Computation 7, 219-269. 
Hastie, T. J. & Tibshirani, R. J. (1990), Generalized Additive Models, Vol. 43 of 
Monographs on Statistics and Applied Probability, Chapman and Hall. 
Hoerl, A. & Kennard, R. (1970), 'Ridge regression: biased estimation for nonorthog- 
onal problems', Technometrics 12, 55-67. 
Leen, T. (1995), 'From data distributions to regularization in invariant learning', 
Neural Computation 7(5), 974-981. 
Marcus, C. & Westervelt, R. (1989), Dynamics of analog neural networks with 
time delay, in D. Touretzky, ed., 'Advances in Neural Information Processing 
Systems 1', Morgan Kaufmann Publishers, San Francisco, CA. 
Moody, J. & Rögnvaldsson, T. (1995), Smoothing regularizers for feed-forward
neural networks, Oregon Graduate Institute Computer Science Dept. Technical
Report, submitted for publication, 1995.
Moody, J., Levin, U. & Rehfuss, S. (1993), 'Predicting the U.S. index of
industrial production', In proceedings of the 1993 Parallel Applications in Statistics
and Economics Conference, Zeist, The Netherlands. Special issue of Neural
Network World 3(6), 791-794.
Nowlan, S. & Hinton, G. (1992), 'Simplifying neural networks by soft weight- 
sharing', Neural Computation 4(4), 473-493. 
Pineda, F. (1989), 'Recurrent backpropagation and the dynamical approach to 
adaptive neural computation', Neural Computation 1(2), 161-172. 
Plaut, D., Nowlan, S. & Hinton, G. (1986), Experiments on learning by back prop- 
agation, Technical Report CMU-CS-86-126, Carnegie-Mellon University. 
Rumelhart, D., Hinton, G. & Williams, R. (1986), Learning internal representa- 
tions by error propagation, in D. Rumelhart & J. McClelland, eds, 'Parallel 
Distributed Processing: Exploration in the microstructure of cognition', MIT 
Press, Cambridge, MA, chapter 8, pp. 319-362. 
Scalettar, R. & Zee, A. (1988), Emergence of grandmother memory in feed forward 
networks: learning with noise and forgetfulness, in D. Waltz & J. Feldman, 
eds, 'Connectionist Models and Their Implications: Readings from Cognitive 
Science', Ablex Pub. Corp. 
Tikhonov, A. N. & Arsenin, V. I. (1977), Solutions of Ill-posed Problems, Winston; 
New York: distributed solely by Halsted Press. Scripta series in mathematics. 
Translation editor, Fritz John. 
Wahba, G. (1990), Spline models for observational data, CBMS-NSF Regional Con- 
ference Series in Applied Mathematics. 
Weigend, A., Rumelhart, D. & Huberman, B. (1990), Back-propagation, weight- 
elimination and time series prediction, in T. Sejnowski, G. Hinton & D. Touret- 
zky, eds, 'Proceedings of the connectionist models summer school', Morgan 
Kaufmann Publishers, San Mateo, CA, pp. 105-116. 
Werbos, P. (1992), Neurocontrol and supervised learning: An overview and eval- 
uation, in D. White & D. Sofge, eds, 'Handbook of Intelligent Control', Van 
Nostrand Reinhold, New York. 
Wu, L. & Moody, J. (1996), 'A smoothing regularizer for feedforward and recurrent 
neural networks', Neural Computation 8(3), 463-491. 
