Stationarity and Stability of 
Autoregressive Neural Network Processes 
Friedrich Leisch, Adrian Trapletti & Kurt Hornik
Institut für Statistik
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria
firstname.lastname@ci.tuwien.ac.at 
Institut für Unternehmensführung
Wirtschaftsuniversität Wien
Augasse 2-6 
A-1090 Wien, Austria 
adrian.trapletti@wu-wien.ac.at 
Abstract 
We analyze the asymptotic behavior of autoregressive neural network (AR-NN)
processes using techniques from Markov chains and non-linear time series
analysis. It is shown that standard AR-NNs without shortcut connections are
asymptotically stationary. If linear shortcut connections are allowed, only the
shortcut weights determine whether the overall system is stationary, hence
standard conditions for linear AR processes can be used.
1 Introduction
In this paper we consider the popular class of nonlinear autoregressive processes 
driven by additive noise, which are defined by stochastic difference equations of
the form
t = g(t-1,...,t-p,O) q- et (1) 
where (t is an iid. noise process. If g(..., 0) is a feedforward neural network with 
parameter ("weight") vector 0, we call Equation 1 an autoregressive neural network 
process of order p, short AR-NN(p) in the following. 
AR-NNs are a natural generalization of the classic linear autoregressive AR(p)
process

ξ_t = θ_1 ξ_{t-1} + ... + θ_p ξ_{t-p} + ε_t.    (2)
See, e.g., Brockwell & Davis (1987) for a comprehensive introduction to AR and
ARMA (autoregressive moving average) models. 
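As an illustration of the two model classes, the processes (1) and (2) can be simulated directly from their defining recursions; the sketch below uses arbitrary, hand-picked weights (not values from this paper) and a tanh hidden layer:

```python
import numpy as np

def simulate_ar(theta, n, rng):
    """Simulate a linear AR(p) process (Equation 2), started at zero."""
    p = len(theta)
    xi = np.zeros(n + p)
    for t in range(p, n + p):
        # (xi_{t-1}, ..., xi_{t-p}) in the order matching theta
        xi[t] = theta @ xi[t - p:t][::-1] + rng.normal()
    return xi[p:]

def simulate_arnn(beta, alpha, A, gamma0, n, rng):
    """Simulate an AR-NN(p) process (Equation 1) with one tanh hidden layer."""
    p = A.shape[1]
    xi = np.zeros(n + p)
    for t in range(p, n + p):
        x = xi[t - p:t][::-1]                       # lagged state vector
        xi[t] = gamma0 + beta @ np.tanh(alpha + A @ x) + rng.normal()
    return xi[p:]

rng = np.random.default_rng(0)
ar = simulate_ar(np.array([0.5, 0.2]), 500, rng)
arnn = simulate_arnn(beta=np.array([1.0, -0.5]), alpha=np.array([0.1, -0.3]),
                     A=np.array([[0.8, 0.1], [0.2, -0.4]]), gamma0=0.0,
                     n=500, rng=rng)
```

The hidden-layer output is bounded, so the nonlinear part of the AR-NN recursion cannot drive the state off to infinity; only the additive noise moves it around.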
One of the most central questions in linear time series theory is the stationarity of 
the model, i.e., whether the probabilistic structure of the series is constant over time 
or at least asymptotically constant (when not started in equilibrium). Surprisingly, 
this question has not gained much interest in the NN literature; in particular there
are, to our knowledge, no results giving conditions for the stationarity of AR-NN
models. There are results on the stationarity of Hopfield nets (Wang & Sheng,
1996), but these nets cannot be used to estimate conditional expectations for time 
series prediction. 
The rest of this paper is organized as follows: In Section 2 we recall some results 
from time series analysis and Markov chain theory defining the relationship between 
a time series and its associated Markov chain. In Section 3 we use these results to 
establish that standard AR-NN models without shortcut connections are stationary. 
We also give conditions for AR-NN models with shortcut connections to be station- 
ary. Section 4 examines the NN modeling of an important class of non-stationary 
time series, namely integrated series. All proofs are deferred to the appendix. 
2 Some Time Series and Markov Chain Theory 
2.1 Stationarity 
Let t denote a time series generated by a (possibly nonlinear) autoregressive pro- 
cess as defined in (1). If ]Eet= 0, then # equals the conditional expectation 
lE(tlt-,...,t-p) and #(t-1,...,t-p) is the best prediction for t in the mean 
square sense. 
If we are interested in the long term properties of the series, we may ask whether 
certain features such as mean or variance change over time or remain constant. 
The time series is called weakly stationary if Eξ_t = μ and Cov(ξ_t, ξ_{t+h}) = γ_h, ∀t,
i.e., mean and covariances do not depend on the time t. A stronger criterion is
that the whole distribution (and not only mean and covariance) of the process does
not depend on the time; in this case the series is called strictly stationary. Strict
stationarity implies weak stationarity if the second moments of the series exist. For
details see standard time series textbooks such as Brockwell & Davis (1987).
If t is strictly stationary, then ]r(t e A) --- 7r(A), Vt and r(.)is called the stationary 
distribution of the series. Obviously the series can only be stationary from the 
beginning if it is started with the stationary distribution such that 0  r. If 
it is not started with r, e.g., because 0 is a constant, then we call the series 
asymptotically stationary if it converges to its stationary distribution: 
lim lP(t G A) = r(A) 
t-- oo 
2.2 Time Series as Markov Chains 
Using the notation

x_{t-1} = (ξ_{t-1}, ..., ξ_{t-p})′    (3)
G(x_{t-1}) = (g(x_{t-1}), ξ_{t-1}, ..., ξ_{t-p+1})′    (4)
e_t = (ε_t, 0, ..., 0)′    (5)

we can write scalar autoregressive models of order p such as (1) or (2) as a first
order vector model

x_t = G(x_{t-1}) + e_t    (6)
with x_t, e_t ∈ ℝ^p (e.g., Chan & Tong, 1985). If we write

p^n(x, A) = P{x_{t+n} ∈ A | x_t = x}
p(x, A) = p^1(x, A)

for the probability of going from point x to set A ∈ B in n steps, then {x_t} with
p(x, A) forms a Markov chain with state space (ℝ^p, B, λ), where B are the Borel
sets on ℝ^p and λ is the usual Lebesgue measure.
The Markov chain {x_t} is called φ-irreducible if, for some σ-finite measure φ on B,

∀x ∈ ℝ^p: p^n(x, A) > 0 for some n ≥ 1

whenever φ(A) > 0. This means, essentially, that all parts of the state space can be
reached by the Markov chain irrespective of the starting point. Another important
property of Markov chains is aperiodicity, which loosely speaking means that there 
are no (infinitely often repeated) cycles. See, e.g., Tong (1990) for details. 
The Markov chain {x_t} is called geometrically ergodic if there exists a probability
measure π(·) on (ℝ^p, B, λ) and a ρ < 1 such that

∀x: lim_{n→∞} ρ^{-n} ‖p^n(x, ·) − π(·)‖ = 0

where ‖·‖ denotes the total variation norm. Then π satisfies the invariance equation

π(A) = ∫ p(x, A) π(dx),    ∀A ∈ B
There is a close relationship between a time series and its associated Markov chain. 
If the Markov chain is geometrically ergodic, then its distribution converges to
π and the time series is asymptotically stationary. If the time series is started with
distribution π, i.e., x_0 ∼ π, then the series {ξ_t} is strictly stationary.
3 Stationarity of AR-NN Models 
We now apply the concepts defined in Section 2 to the case where g is defined by 
a neural network. Let x denote a p-dimensional input vector, then we consider the 
following standard network architectures: 
Single hidden layer perceptrons: 
g(X) = 0 +  /i(Oi + a'iz) (7) 
i 
where ci,/i and '/0 are scalar weights, ai are p-dimensional weight vectors, 
and a(.) is a bounded sigmoid function such as tanh(.). 
Single hidden layer perceptrons with shortcut connections: 
g(x) = γ_0 + c′x + Σ_i β_i σ(α_i + a_i′x)    (8)

where c is an additional weight vector for shortcut connections between
inputs and output. In this case we define the characteristic polynomial c(z)
associated with the linear shortcuts as

c(z) = 1 − c_1 z − c_2 z² − ... − c_p z^p,    z ∈ ℂ.
Radial basis function networks: 
g(x) = γ_0 + Σ_i β_i φ(‖x − m_i‖)    (9)

where the m_i are center vectors and φ(·) is one of the usual bounded radial
basis functions such as φ(x) = exp(−x²).
Lemma 1 Let {x_t} be defined by (6), let E|ε_t| < ∞ and let the PDF of ε_t be
positive everywhere in ℝ. Then, if g is defined by any of (7), (8) or (9), the Markov
chain {x_t} is φ-irreducible and aperiodic.
Lemma 1 basically says that the state space of the Markov chain, i.e., the points that 
can be reached, cannot be reduced depending on the starting point. An example 
of a reducible Markov chain would be a series that is always positive if x_0 > 0
(and negative otherwise). This cannot happen in the AR-NN(p) case due to the 
unbounded additive noise term. 
Theorem 1 Let {t) be defined by (1), {xt} by (6), further let mltl < o<> and the 
PDF of et be positive everywhere in 11. Then 
1. If g is a network without linear shortcuts as defined in (7) and (9), then 
{xt } is geometrically ergodic and {t} is asymptotically stationary. 
If g is a network with linear shortcuts as defined in (8) and additionally 
c(z) : O, z  C'lz I <_ 1, then {xt} is geometrically ergodic and {t} is 
asymptotically stationary. 
The time series {t} remains stationary if we allow for more than one hidden layer 
(--+ multi layer perceptron, MLP) or non-linear output units, as long as the overall 
mapping has bounded range. An MLP with shortcut connections combines a (pos- 
sibly non-stationary) linear AR(p) process with a non-linear stationary NN part. 
Thus, the NN part can be used to model non-linear fluctuations around a linear 
process like a random walk. 
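The role of the shortcut weights is easiest to see in the noise-free skeleton x_t = g(x_{t-1}) of a first-order model; the weights below are arbitrary illustrations, not values from the paper:

```python
import numpy as np

def g_no_shortcut(x):
    # bounded NN part only: |g| <= 0.8, so iterates stay in [-0.8, 0.8]
    return 0.8 * np.tanh(2.0 * x)

def g_unit_root_shortcut(x):
    # shortcut weight c_1 = 1 puts a root of c(z) = 1 - z on the unit circle
    return x + 0.8 * np.tanh(2.0 * x)

x, y = 5.0, 0.5
for _ in range(100):
    x = g_no_shortcut(x)
    y = g_unit_root_shortcut(y)

# Without shortcuts the skeleton is trapped in the bounded range of the
# network; with the unit-root shortcut it drifts without bound.
print(abs(x) <= 0.8, y > 10)
```

With noise added, the first recursion is the asymptotically stationary case of Theorem 1, while the second behaves like a random walk with a nonlinear drift.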
The only part of the network that controls whether the overall process is stationary
is the set of linear shortcut connections (if present). If there are no shortcuts, then the
process is always stationary. With shortcuts, the usual test for stability of a linear 
system applies. 
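This test reduces to locating the roots of c(z): the shortcut part is stable when all roots lie outside the unit circle. A minimal check, assuming the fitted shortcut weights are available as a vector:

```python
import numpy as np

def shortcut_stationary(c):
    """True if c(z) = 1 - c_1 z - ... - c_p z^p has no root with |z| <= 1."""
    # np.roots expects coefficients ordered from the highest power downwards
    coeffs = np.concatenate(([-ci for ci in c[::-1]], [1.0]))
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(shortcut_stationary([0.5, 0.3]))   # stable shortcut part -> True
print(shortcut_stationary([1.0]))        # unit root (random walk) -> False
```

This is exactly the classical stationarity test for a linear AR(p) process applied to the shortcut weights alone.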
4 Integrated Models 
An important method in classic time series analysis is to first transform a non-
stationary series into a stationary one and then model the remainder by a stationary 
process. The probably most popular models of this kind are autoregressive inte- 
grated moving average (ARIMA) models, which can be transformed into stationary 
ARMA processes by simple differencing. 
Let Δ^k denote the k-th order difference operator:

Δξ_t = ξ_t − ξ_{t-1}    (10)
Δ²ξ_t = Δ(ξ_t − ξ_{t-1}) = ξ_t − 2ξ_{t-1} + ξ_{t-2}    (11)
Δ^k ξ_t = Σ_{n=0}^{k} (−1)^n C(k,n) ξ_{t−n},  where C(k,n) is the binomial coefficient,    (12)
with Δ¹ = Δ. E.g., a standard random walk ξ_t = ξ_{t-1} + ε_t is non-stationary because
of its growing variance, but can be transformed into the iid (and hence stationary)
noise process ε_t by taking first differences.
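The binomial form of (12) agrees with k-fold application of Δ, which can be verified numerically (illustrative sketch):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
xi = rng.normal(size=200).cumsum().cumsum()   # an order-2 integrated series

k = 2
# Repeated first differences ...
by_diff = np.diff(xi, n=k)
# ... versus the binomial expansion: Delta^k xi_t = sum_n (-1)^n C(k,n) xi_{t-n}
w = np.array([(-1) ** n * comb(k, n) for n in range(k + 1)])
by_binom = np.array([w @ xi[t - k:t + 1][::-1] for t in range(k, len(xi))])

print(np.allclose(by_diff, by_binom))   # prints True
```

For k = 2 the weights are (1, −2, 1), matching Equation (11) term by term.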
If a time series is non-stationary, but can be transformed into a stationary series 
by taking k-th differences, we call the series integrated of order k. Standard MLPs 
or RBFs without shortcuts are asymptotically stationary. It is therefore important 
to take care that these networks are only used to model stationary processes. Of 
course the network can be trained to mimic a non-stationary process on a finite time 
interval, but the out-of-sample or prediction performance will be poor, because the 
network inherently cannot capture some important features of the process. One way 
to overcome this problem is to first transform the process into a stationary series 
(e.g., by differencing an integrated series) and train the network on the transformed 
series (Chng et al., 1996). 
As differencing is a linear operation, this transformation can also be easily incor- 
porated into the network by choosing the shortcut connections and weights from 
input to hidden units accordingly. Assume we want to model an integrated series
of integration order k, such that

Δ^k ξ_t = g(Δ^k ξ_{t-1}, ..., Δ^k ξ_{t-p}) + ε_t

where Δ^k ξ_t is stationary. By (12) this is equivalent to

ξ_t = Σ_{n=1}^{k} (−1)^{n−1} C(k,n) ξ_{t−n} + g̃(ξ_{t-1}, ..., ξ_{t-p-k}) + ε_t

which (for p > k) can be modeled by an MLP with shortcut connections as defined
by (8) where the shortcut weight vector c is fixed to

c = (C(k,1), −C(k,2), ..., (−1)^{p−1} C(k,p))′,    C(k,n) := 0 for n > k,

and g̃ is such that g̃(ξ_{t-1}, ..., ξ_{t-p-k}) = g(Δ^k x_{t-1}). This is always possible and
can basically be obtained by adding c to all weights between input and first hidden
layer of g.
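Fixing the shortcut vector to these binomial weights can be sketched as follows; the helper name is hypothetical:

```python
import numpy as np
from math import comb

def integration_shortcuts(k, p):
    """Shortcut weights c_n = (-1)^(n-1) C(k, n), n = 1..p (zero for n > k)."""
    return np.array([(-1) ** (n - 1) * comb(k, n) if n <= k else 0.0
                     for n in range(1, p + 1)])

# Order-1 integration (random walk): c = (1, 0, 0)
print(integration_shortcuts(1, 3))
# Order-2 integration: c = (2, -1, 0)
print(integration_shortcuts(2, 3))
```

For k = 1 the shortcut part is exactly a random walk, c(z) = 1 − z, whose unit root signals the non-stationary component.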
An AR-NN(p) can model integrated series up to integration order p. If the order 
of integration is known, the shortcut weights can either be fixed, or the differenced 
series is used as input. If the order is unknown, we can also train the complete 
network including the shortcut connections and implicitly estimate the order of 
integration. After training, the final model can be checked for stationarity by looking
at the characteristic roots of the polynomial defined by the shortcut connections. 
4.1 Fractional Integration 
Up to now we have only considered integrated series with positive integer order of
integration, i.e., k ∈ ℕ. In recent years models with fractional integration order have
become very popular (again). Series with integration order 0.5 < k < 1 can
be shown to exhibit self-similar or fractal behavior, and have long memory. This
type of process was introduced by Mandelbrot in a series of papers modeling river
flows; see, e.g., Mandelbrot & Van Ness (1968). More recently, self-similar processes were
used to model Ethernet traffic by Leland et al. (1994). Also some financial time 
series such as foreign exchange data series exhibit long memory and self-similarity. 
The fractional differencing operator Δ^k, k ∈ [−1, 1], is defined by the series expansion

Δ^k ε_t = Σ_{n=0}^{∞} [Γ(−k + n) / (Γ(−k) Γ(n + 1))] ε_{t−n}    (13)

which is obtained from the Taylor series of (1 − z)^k. For k > 1 we first use Equa-
tion (12) and then the above series for the fractional remainder. For practical
computation, the series (13) is of course truncated at some term n = N. An AR-
NN(p) model with shortcut connections can approximate the series up to the first
p terms.
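The coefficients in (13) satisfy the recursion w_0 = 1, w_n = w_{n-1}(n − 1 − k)/n, which follows from the ratio of successive Gamma terms; a truncated sketch (helper name hypothetical):

```python
import numpy as np

def frac_diff_weights(k, N):
    """First N+1 coefficients of the fractional differencing operator Delta^k,
    computed by the recursion w_0 = 1, w_n = w_{n-1} * (n - 1 - k) / n."""
    w = np.empty(N + 1)
    w[0] = 1.0
    for n in range(1, N + 1):
        w[n] = w[n - 1] * (n - 1 - k) / n
    return w

# k = 1 recovers ordinary first differencing: weights (1, -1, 0, 0, ...)
print(frac_diff_weights(1.0, 4))
# A fractional value such as k = 0.4 gives slowly decaying weights instead
print(frac_diff_weights(0.4, 4))
```

The slow decay of the weights for fractional k is what produces the long memory: distant past noise terms never drop out of the expansion entirely.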
5 Summary 
We have shown that AR-NN models using standard NN architectures without short- 
cuts are asymptotically stationary. If linear shortcuts between inputs and outputs 
are included (which many popular software packages have already implemented),
then only the weights of the shortcut connections determine whether the overall system
is stationary. It is also possible to model many integrated time series by this kind 
of networks. The asymptotic behavior of AR-NNs is especially important for pa- 
rameter estimation, predictions over larger intervals of time, or when using the 
network to generate artificial time series. Limiting (normal) distributions of pa- 
rameter estimates are only guaranteed for stationary series. We therefore always
recommend transforming a non-stationary series into a stationary one if possible
(e.g., by differencing) before training a network on it. 
Another important aspect of stationarity is that a single trajectory displays the 
complete probability law of the process. If we have observed one sufficiently long
trajectory of the process, we can (in theory) estimate all interesting quantities of the
process by averaging over time. This need not be true for non-stationary processes 
in general, where some quantities may only be estimated by averaging over several 
independent trajectories. E.g., one might train the network on an available sam-
ple and then use the trained network afterwards, driven by artificial noise from a
random number generator, to generate new data with properties similar to the
training sample. Asymptotic stationarity guarantees that the AR-NN model
cannot show "explosive" behavior or growing variance over time.
We are currently working on extensions of this paper in several directions. AR-NN
processes can be shown to be strong mixing (the memory of the process vanishes 
exponentially fast) and have autocorrelations going to zero at an exponential rate. 
Another question is a thorough analysis of the properties of parameter estimates 
(weights) and tests for the order of integration. Finally, we want to extend the uni-
variate results to the multivariate case with a special interest towards cointegrated 
processes. 
Acknowledgement 
This piece of research was supported by the Austrian Science Foundation (FWF) under 
grant SFB010 ('Adaptive Information Systems and Modeling in Economics and Man- 
agement Science'). 
Appendix: Mathematical Proofs 
Proof of Lemma 1 
It can easily be shown that {x_t} is φ-irreducible if the support of the probability density
function (PDF) of ε_t is the whole real line, i.e., the PDF is positive everywhere in ℝ (Chan
& Tong, 1985). In this case every non-null p-dimensional hypercube is reached in p steps
with positive probability (and hence every non-null Borel set A).

A necessary and sufficient condition for {x_t} to be aperiodic is that there exists a set A
and a positive integer n such that p^n(x, A) > 0 and p^{n+1}(x, A) > 0 for all x ∈ A (Tong,
1990, p. 455). In our case this is true for all n due to the unbounded additive noise.
Proof of Theorem 1

We use the following result from nonlinear time series theory:

Theorem 2 (Chan & Tong 1985) Let {x_t} be defined by (1) and (6), and let G be compact,
i.e., preserve compact sets. If G can be decomposed as G = G_h + G_d, where G_d(·) is of
bounded range, G_h(·) is continuous and homogeneous, i.e., G_h(ax) = aG_h(x), the origin
is a fixed point of G_h and G_h is uniformly asymptotically stable, E|ε_t| < ∞ and the PDF
of ε_t is positive everywhere in ℝ, then {x_t} is geometrically ergodic.
The noise process ε_t fulfills the conditions by assumption. Clearly all networks are con-
tinuous compact functions. Standard MLPs without shortcut connections and RBFs have
a bounded range, hence G_h ≡ 0 and G = G_d, and the series {ξ_t} is asymptotically sta-
tionary. If we allow for linear shortcut connections between the input and the output,
we get G_h = c′x and G_d = γ_0 + Σ_i β_i σ(α_i + a_i′x), i.e., G_h is the linear shortcut part
of the network, and G_d is a standard MLP without shortcut connections. Clearly, G_h is
continuous, homogeneous and has the origin as a fixed point. Hence, the series {ξ_t} is
asymptotically stationary if G_h is asymptotically stable, i.e., when all characteristic roots
of G_h have magnitude less than unity. Obviously the same is true for RBFs with shortcut
connections. Note that the model reduces to a standard linear AR(p) model if G_d ≡ 0.
References 
Brockwell, P. J. & Davis, R. A. (1987). Time Series: Theory and Methods. Springer Series 
in Statistics. New York, USA: Springer Verlag. 
Chan, K. S. & Tong, H. (1985). On the use of the deterministic Lyapunov function for 
the ergodicity of stochastic difference equations. Advances in Applied Probability, 17, 
666-678. 
Chng, E. S., Chen, S., & Mulgrew, B. (1996). Gradient radial basis function networks 
for nonlinear and nonstationary time series prediction. IEEE Transactions on Neural 
Networks, 7(1), 190-194. 
Husmeier, D. & Taylor, J. G. (1997). Predicting conditional probability densities of sta- 
tionary stochastic time series. Neural Networks, 10(3), 479-497. 
Jones, D. A. (1978). Nonlinear autoregressive processes. Proceedings of the Royal Society 
London A, 360, 71-95. 
Leland, W. E., Taqqu, M. S., Willinger, W., & Wilson, D. V. (1994). On the self-similar 
nature of ethernet traffic (extended version). IEEE/ACM Transactions on Networking, 
2(1), 1-15. 
Mandelbrot, B. B. & Van Ness, J. W. (1968). Fractional Brownian motions, fractional
noises and applications. SIAM Review, 10(4), 422-437.
Tong, H. (1990). Non-linear time series: A dynamical system approach. New York, USA: 
Oxford University Press. 
Wang, T. & Sheng, Z. (1996). Asymptotic stationarity of discrete-time stochastic neural 
networks. Neural Networks, 9(6), 957-963. 
