Noisy Neural Networks and
Generalizations

Hava T. Siegelmann
Industrial Eng. and Management, Mathematics
Technion - IIT
Haifa 32000, Israel
iehava@ie.technion.ac.il

Alexander Roitershtein
Mathematics
Technion - IIT
Haifa 32000, Israel
roiterst@math.technion.ac.il

Asa Ben-Hur
Industrial Eng. and Management
Technion - IIT
Haifa 32000, Israel
asa@tx.technion.ac.il
Abstract 
In this paper we define a probabilistic computational model which 
generalizes many noisy neural network models, including the recent 
work of Maass and Sontag [5]. We identify weak ergodicity as the
mechanism responsible for restriction of the computational power 
of probabilistic models to definite languages, independent of the 
characteristics of the noise: whether it is discrete or analog, or if 
it depends on the input or not, and independent of whether the 
variables are discrete or continuous. We give examples of weakly 
ergodic models including noisy computational systems with noise 
depending on the current state and inputs, aggregate models, and 
computational systems which update in continuous time. 
1 Introduction
Noisy neural networks were recently examined, e.g. in [1, 4, 5]. It was shown in [5]
that Gaussian-like noise reduces the power of analog recurrent neural networks to 
the class of definite languages, which are a strict subset of regular languages. Let
Σ be an arbitrary alphabet. L ⊆ Σ* is called a definite language if for some integer
r, any two words coinciding on the last r symbols are either both in L or neither in
L. The ability of a computational system to recognize only definite languages can
be interpreted as saying that the system forgets all its input signals, except for the 
most recent ones. This property is reminiscent of human short term memory. 
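The definiteness property is easy to make concrete. The following sketch (a toy example of our own, not from the paper) checks by enumeration that membership in a language depends only on the last r symbols:

```python
import itertools

# Toy example (ours, not from the paper): over {a, b}, take L = "words
# whose last two symbols contain at least one b". This language is 2-definite.
def in_L(word: str) -> bool:
    return 'b' in word[-2:]

def is_definite(membership, alphabet, r, max_len=6):
    """Check by enumeration (up to max_len) that membership depends only
    on the last r symbols: every prefix of a fixed length-r suffix agrees."""
    for suffix_tuple in itertools.product(alphabet, repeat=r):
        suffix = ''.join(suffix_tuple)
        verdicts = {membership(''.join(prefix) + suffix)
                    for n in range(max_len - r + 1)
                    for prefix in itertools.product(alphabet, repeat=n)}
        if len(verdicts) > 1:
            return False
    return True

# In contrast, "even number of b's" needs unbounded memory and is not definite.
def even_bs(word: str) -> bool:
    return word.count('b') % 2 == 0

assert is_definite(in_L, 'ab', r=2)
assert not is_definite(even_bs, 'ab', r=2)
```

The contrast mirrors the remark in the text: a definite language is exactly one whose recognizer may forget everything but a bounded recent window of the input.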
"Definite probabilistic computational models" have their roots in Rabin's pioneer- 
ing work on probabilistic automata [9]. He identified a condition on probabilistic 
automata with a finite state space which restricts them to definite languages. Paz 
[8] generalized Rabin's condition, applying it to automata with a countable state 
space, and calling it weak ergodicity [7, 8]. In their ground-breaking paper [5], 
Maass and Sontag extended the principle leading to definite languages to a finite 
interconnection of continuous-valued neurons. They proved that in the presence 
of "analog noise" (e.g. Gaussian), recurrent neural networks are limited in their 
computational power to definite languages. Under a different noise model, Maass 
and Orponen [4] and Casey [1] showed that such neural networks are reduced in 
their power to regular languages. 
In this paper we generalize the condition of weak ergodicity, making it applicable
to numerous probabilistic computational machines. In our general probabilistic
model, the state space can be arbitrary: it is not constrained to be a finite or 
infinite set, to be a discrete or non-discrete subset of some Euclidean space, or 
even to be a metric or topological space. The input alphabet is arbitrary as well 
(e.g., bits, rationals, reals, etc.). The stochasticity is not necessarily defined via a
transition probability function (TPF), as in all the aforementioned probabilistic and
noisy models, but through more general Markov operators acting on measures.
Our Markov Computational Systems (MCS's) include as special cases Rabin's actual
probabilistic automata with cut-point [9], the quasi-definite automata by Paz
[8], and the noisy analog neural network by Maass and Sontag [5]. Interestingly, 
our model also includes: analog dynamical systems and neural models, which have 
no underlying deterministic rule but rather update probabilistically by using finite 
memory; neural networks with an unbounded number of components; networks of 
variable dimension (e.g., "recruiting networks"); hybrid systems that combine discrete
and continuous variables; stochastic cellular automata; and stochastic coupled
map lattices. 
We prove that all weakly ergodic Markov systems are stable, i.e. are robust with 
respect to architectural imprecisions and environmental noise. This property is desirable
for both biological and artificial neural networks. This robustness was known
up to now only for the classical discrete probabilistic automata [8, 9]. To make
weak ergodicity practical and easy to decide for given systems, we provide two
conditions on the transition probability functions under which the associated computational
system becomes weakly ergodic. One condition is based on a version
of Doeblin's condition [5] while the second is motivated by the theory of scrambling
matrices [7, 8]. In addition we construct various examples of weakly ergodic
systems which include synchronous or asynchronous computational systems, and 
hybrid continuous and discrete time systems. 
2 Markov Computational System (MCS) 
Instead of describing various types of noisy neural network models or stochastic 
dynamical systems we define a general abstract probabilistic model. When dealing 
with systems containing inherent elements of uncertainty (e.g., noise) we abandon 
the study of individual trajectories in favor of an examination of the flow of state 
distributions. The noise models we consider are homogeneous in time, in that they 
may depend on the input but not on time. The dynamics we consider
are defined by operators acting on the space of measures, called Markov
operators [6]. In the following we define the concepts which are required for such
an approach. 
Let Σ be an arbitrary alphabet and Ω be an abstract state space. We assume that
a σ-algebra B (not necessarily of Borel sets) of subsets of Ω is given, so that (Ω, B) is a
measurable space. Let us denote by P the set of probability measures on (Ω, B).
This set is called a distribution space.

Let M be the space of finite signed measures on (Ω, B) with the total variation norm
defined by

    ‖μ‖₁ = sup_{A∈B} μ(A) − inf_{A∈B} μ(A).    (1)

Denote by E the set of all bounded linear operators acting from M to itself. The
‖·‖₁-norm on M induces a norm ‖P‖ = sup_{μ∈P} ‖Pμ‖₁ on E. An operator P ∈ E
is said to be a Markov operator if for any probability measure μ ∈ P, the image Pμ
is again a probability measure. For a Markov operator, ‖P‖ = 1.
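In the finite-state case these notions become concrete: a measure is a vector, a Markov operator is a row-stochastic matrix, and the norm (1) reduces to the usual total variation. A minimal sketch (our own finite illustration, not the general measure-theoretic setting):

```python
def tv_norm(lam):
    """Norm (1) on a finite space: sup_A lam(A) - inf_A lam(A), i.e. the
    positive part minus the negative part of the signed measure."""
    return sum(v for v in lam if v > 0) - sum(v for v in lam if v < 0)

def apply_markov(mu, P):
    """Image of mu under the operator induced by row-stochastic matrix P."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu)))
            for j in range(len(P[0]))]

P = [[0.9, 0.1],
     [0.4, 0.6]]                       # a hypothetical Markov operator
mu, nu = [1.0, 0.0], [0.0, 1.0]

image = apply_markov(mu, P)
assert abs(sum(image) - 1.0) < 1e-12   # P maps distributions to distributions

# ||P|| = 1: the operator never expands total variation.
diff_before = tv_norm([a - b for a, b in zip(mu, nu)])
diff_after = tv_norm([a - b for a, b in zip(apply_markov(mu, P),
                                            apply_markov(nu, P))])
assert diff_after <= diff_before + 1e-12
```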
Definition 2.1 A Markov system is a set of Markov operators T = {P_u : u ∈ Σ}.
With any Markov system T, one can associate a probabilistic computational system.
If the probability distribution on the initial states is given by the probability
measure μ₀, then the distribution of states after n computational steps on the input
w = w₀, w₁, ..., w_n is defined as in [5, 8] by

    P_w μ₀ = P_wn · ... · P_w1 P_w0 μ₀.    (2)
Let A and R be two subsets of P with the property of having a ρ-gap:

    dist(A, R) = inf_{μ∈A, ν∈R} ‖μ − ν‖₁ = ρ > 0.    (3)

The first set is called the set of accepting distributions and the second is called the
set of rejecting distributions. A language L ⊆ Σ* is said to be recognized by the
Markov computational system M = (Ω, A, R, Σ, μ₀, T) if

    w ∈ L  ⟺  P_w μ₀ ∈ A,
    w ∉ L  ⟺  P_w μ₀ ∈ R.

This model of language recognition with a gap between accepting and rejecting
spaces agrees with Rabin's model of probabilistic automata with isolated cut-point
[9] and the model of analog probabilistic computation [4, 5].
An example of a Markov system is a system of operators defined by TPFs on (Ω, B).
Let P_u(x, A) be the probability of moving from a state x to the set of states A upon
receiving the input signal u ∈ Σ. The function P_u(x, ·) is a probability measure for
all x ∈ Ω and P_u(·, A) is a measurable function of x for any A ∈ B. In this case, the
operators P_u are defined by

    P_u μ(A) = ∫_Ω P_u(x, A) μ(dx).    (4)
3 Weakly Ergodic MCS 
Let P ∈ E be a Markov operator. The real number γ(P) = 1 − (1/2) sup_{μ,ν∈P} ‖Pμ − Pν‖₁
is called the ergodicity coefficient of the Markov operator. We denote
δ(P) = 1 − γ(P). It can be proven that for any two Markov operators P₁, P₂,
δ(P₁P₂) ≤ δ(P₁)δ(P₂). The ergodicity coefficient was introduced by Dobrushin [2]
for the particular case of Markov operators induced by a TPF P(x, A). In this special
case γ(P) = 1 − sup_{x,y} sup_A |P(x, A) − P(y, A)|.
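For a TPF given by a stochastic matrix, Dobrushin's coefficient and its submultiplicativity can be computed directly. A finite sketch with our own numbers:

```python
def delta(P):
    """delta(P) = 1 - gamma(P) = sup_{x,y} sup_A |P(x,A) - P(y,A)|,
    which for a matrix is half the worst L1 distance between two rows."""
    n = len(P)
    return max(0.5 * sum(abs(P[x][z] - P[y][z]) for z in range(n))
               for x in range(n) for y in range(n))

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P1 = [[0.7, 0.3], [0.2, 0.8]]   # hypothetical operators; delta(P1) = 0.5
P2 = [[0.6, 0.4], [0.1, 0.9]]   # delta(P2) = 0.5

# Submultiplicativity: delta(P1 P2) <= delta(P1) * delta(P2).
assert delta(matmul(P1, P2)) <= delta(P1) * delta(P2) + 1e-12
```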
Weakly ergodic systems were introduced and studied by Paz in the particular case
of a denumerable state space Ω, where Markov operators are represented by infinite-dimensional
matrices. The following definition makes no assumption on the
associated measurable space.

Definition 3.1 A Markov system {P_u, u ∈ Σ} is called weakly ergodic if for any
ε > 0, there is an integer r = r(ε) such that for any w ∈ Σ^{≥r} and any μ, ν ∈ P,

    ‖P_w μ − P_w ν‖₁ ≤ ε.    (5)
An MCS M is called weakly ergodic if its associated Markov system
is weakly ergodic.

An MCS M is weakly ergodic if and only if there is an integer r and a real number
c < 1 such that ‖P_w μ − P_w ν‖₁ ≤ c for any word w of length r. Our most general
characterization of weak ergodicity is as follows [11]:

Theorem 1 An abstract MCS M is weakly ergodic if and only if there exists
a multiplicative operator norm ‖·‖** on E, equivalent to the norm
‖P‖₁' := sup_{λ: λ(Ω)=0, λ≠0} ‖Pλ‖₁ / ‖λ‖₁, such that sup_{u∈Σ} ‖P_u‖** ≤ s
for some number s < 1. □
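Weak ergodicity can also be probed numerically: when every letter operator has δ(P_u) bounded below 1, the distance ‖P_w μ − P_w ν‖₁ decays geometrically in |w|, so the integer r(ε) exists. A brute-force search over all words, using hypothetical kernels of our own:

```python
import itertools

KERNELS = {'a': [[0.8, 0.2], [0.3, 0.7]],   # delta(P_a) = 0.5
           'b': [[0.4, 0.6], [0.5, 0.5]]}   # delta(P_b) = 0.1

def step(mu, P):
    return [sum(mu[i] * P[i][j] for i in range(len(mu)))
            for j in range(len(P[0]))]

def tv(mu, nu):
    return sum(abs(a - b) for a, b in zip(mu, nu))

eps = 1e-3
mu0, nu0 = [1.0, 0.0], [0.0, 1.0]
r = None
for length in range(1, 20):
    worst = 0.0
    for w in itertools.product('ab', repeat=length):
        mu, nu = mu0, nu0
        for letter in w:
            mu, nu = step(mu, KERNELS[letter]), step(nu, KERNELS[letter])
        worst = max(worst, tv(mu, nu))
    if worst <= eps:
        r = length
        break

# Each letter contracts total variation by at least a factor 0.5 here, so
# 2 * 0.5**r <= eps predicts r = 11, matching the exhaustive search.
assert r == 11 and worst <= eps
```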
The next theorem connects the computational power of weakly ergodic MCS's with
the class of definite languages, generalizing the results by Rabin [9], Paz [8, p. 175],
and Maass and Sontag [5].

Theorem 2 Let M be a weakly ergodic MCS. If a language L can be recognized by
M, then it is definite. □
4 The Stability Theorem of Weakly Ergodic MCS 
An important issue for any computational system is whether the machine is robust
with respect to small perturbations of the system's parameters or under some external
noise. The stability of language recognition by weakly ergodic MCS's under
perturbations of their Markov operators was previously considered by Rabin [9] and
Paz [7, 8]. We next state a general version of the stability theorem that is applicable
to our wide notion of weakly ergodic systems.
We first define two MCS's M and M' to be similar if they share the same measurable
space (Ω, B), alphabet Σ, and sets A and R, and if they differ only by their
associated Markov operators.
Theorem 3 Let M and M' be two similar MCS's such that the first is weakly
ergodic. Then there is an α > 0 such that if ‖P_u − P'_u‖₁ ≤ α for all u ∈ Σ, then
the second is also weakly ergodic. Moreover, these two MCS's recognize exactly the
same class of languages. □
Corollary 3.1 Let M and M' be two similar MCS's. Suppose that the first is
weakly ergodic. Then there exists β > 0 such that if sup_{A∈B} |P_u(x, A) − P'_u(x, A)| ≤
β for all u ∈ Σ, x ∈ Ω, the second is also weakly ergodic. Moreover, these two MCS's
recognize exactly the same class of languages. □
A mathematically deeper result which implies Theorem 3 was proven in [11]:

Theorem 4 Let M and M' be two similar MCS's, such that the first is weakly
ergodic and the second is arbitrary. Then, for any ε > 0 there exists α > 0 such
that ‖P_u − P'_u‖ ≤ α for all u ∈ Σ implies ‖P_w − P'_w‖ ≤ ε for all words w ∈ Σ*. □

Theorem 3 follows from Theorem 4. To see this, one can choose any ε < ρ in Theorem
4 and observe that ‖P_w − P'_w‖ ≤ ε < ρ implies that the word w is accepted or
rejected by M' in accordance with whether it is accepted or rejected by M.
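The stability phenomenon is visible numerically: under a weakly ergodic operator, a small per-step perturbation produces only a small, non-accumulating gap between the two flows of distributions. A sketch with our own numbers:

```python
def step(mu, P):
    return [sum(mu[i] * P[i][j] for i in range(len(mu)))
            for j in range(len(P[0]))]

P       = [[0.80, 0.20], [0.30, 0.70]]   # weakly ergodic: delta = 0.5
P_prime = [[0.81, 0.19], [0.29, 0.71]]   # perturbed by alpha = 0.01

mu, nu = [1.0, 0.0], [1.0, 0.0]
gaps = []
for _ in range(200):
    mu, nu = step(mu, P), step(nu, P_prime)
    gaps.append(sum(abs(a - b) for a, b in zip(mu, nu)))

# The gap stays of order alpha / (1 - delta); it does not grow with time.
assert max(gaps) < 0.05
```

Without the contraction (e.g. for a permutation matrix, where δ = 1), the same per-step error could accumulate linearly, which is why weak ergodicity is the key hypothesis.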
5 Conditions on the Transition Probabilities 
This section discusses practical conditions for weakly ergodic MCS's in which the
Markov operators P_u are induced by transition probability functions as in (4).
Clearly, a simple sufficient condition for an MCS to be weakly ergodic is given
by sup_{u∈Σ} δ(P_u) ≤ 1 − c for some c > 0.
Maass and Sontag used Doeblin's condition to bound the computational power of
noisy neural networks [5]. Although the networks in [5] constitute a very particular
case of weakly ergodic MCS's, Doeblin's condition is applicable also to our general
model. The following version of Doeblin's condition was given by Doob [3]:

Definition 5.1 [3] Let P(x, A) be a TPF on (Ω, B). We say that it satisfies Doeblin's
condition D_n if there exist a constant c > 0 and a probability measure ψ on (Ω, B)
such that Pⁿ(x, A) ≥ cψ(A) for any set A ∈ B. □

If an MCS M is weakly ergodic, then all its associated TPFs P_w(x, A), w ∈ Σ*, must
satisfy D_n for some n = n(w). Doob has proved [3, p. 197] that if P(x, A) satisfies
Doeblin's condition D₁ with constant c, then for any μ, ν ∈ P, ‖Pμ − Pν‖₁ ≤
(1 − c)‖μ − ν‖₁, i.e., δ(P) ≤ 1 − c. This leads us to the following definition.
Definition 5.2 Let M be an MCS. We say that the space Ω is small with respect
to M if there exists an m > 0 such that all associated TPFs P_w(x, A), w ∈ Σ^m,
satisfy Doeblin's condition D₁ uniformly with the same constant c, i.e., P_w(x, A) ≥
cψ(A) for all w ∈ Σ^m. □
The following theorem strengthens the result by Maass and Sontag [5].

Theorem 5 Let M be an MCS. If the space Ω is small with respect to M, then
M is weakly ergodic, and it can recognize only definite languages. □

This theorem provides a convenient method for checking weak ergodicity of a given
TPF. The theorem implies that it is sufficient to execute the following simple check:
choose any integer n, and then verify that for every state x and all input strings
w ∈ Σⁿ, the "absolutely continuous" part of every TPF P_w, w ∈ Σⁿ, is uniformly
bounded from below:

    ψ_w({y : p_w(x, y) ≥ c₁}) ≥ c₂,    (6)

where p_w(x, y) is the density of the absolutely continuous component of P_w(x, ·)
with respect to a probability measure ψ_w, and c₁, c₂ are positive numbers.
Most practical systems can be defined by null-preserving TPFs (including, for example,
the systems in [5]). For these systems we provide (Theorem 6) a necessary and
sufficient condition in terms of density kernels. A TPF P_u(x, A), u ∈ Σ, is called
null-preserving with respect to a probability measure μ ∈ P if it has a density with
respect to μ, i.e., P_u(x, A) = ∫_A p_u(x, z) μ(dz). It is not hard to see that null
preservation per letter u ∈ Σ implies that all TPFs P_w(x, A) of words w ∈ Σ* are
null-preserving as well. In this case δ(P_u) = 1 − inf_{x,y} ∫_Ω min{p_u(x, z), p_u(y, z)} μ(dz)
and we have:

Theorem 6 Let M be an MCS defined by null-preserving transition probability
functions P_u, u ∈ Σ. Then M is weakly ergodic if and only if there exists n such
that inf_{w∈Σⁿ} inf_{x,y} ∫_Ω min{p_w(x, z), p_w(y, z)} μ(dz) > 0. □
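In the finite case (densities with respect to counting measure), the overlap integral of Theorem 6 and Dobrushin's coefficient are linked by the identity δ(P) = 1 − min over row pairs of the row overlap. A quick numerical check, using our own matrix:

```python
P = [[0.5, 0.3, 0.2],
     [0.2, 0.4, 0.4],
     [0.3, 0.3, 0.4]]   # hypothetical row-stochastic density kernel
n = len(P)

overlap = min(sum(min(P[x][z], P[y][z]) for z in range(n))
              for x in range(n) for y in range(n))
delta = max(0.5 * sum(abs(P[x][z] - P[y][z]) for z in range(n))
            for x in range(n) for y in range(n))

# delta(P) = 1 - worst-case overlap; a positive overlap forces delta < 1.
assert abs(delta - (1 - overlap)) < 1e-12
assert overlap > 0 and delta < 1
```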
A similar result was previously established by Paz [7, 8] for the case of a denumerable
state space Ω. This theorem allows us to treat examples which are not covered by
Theorem 5. For example, suppose that the space Ω is not small with respect to an
MCS M, but for some n and any w ∈ Σⁿ there exists a measure ψ_w on (Ω, B) with
the property that for any couple of states x, y ∈ Ω,

    ψ_w({z : min{p_w(x, z), p_w(y, z)} ≥ c₁}) ≥ c₂,    (7)

where p_w(x, y) is the density of P_w(x, ·) with respect to ψ_w, and c₁, c₂ are positive
numbers. This condition may hold even if there is no y such that p_w(x, y) ≥ c₁ for
all x ∈ Ω.
6 Examples of Weakly Ergodic Systems 
1. The Synchronous Parallel Model

Let (Ω_i, B_i), i = 1, 2, ..., N, be a collection of measurable spaces. Define Ω = ∏_j Ω_j
and B = ∏_j B_j. Then (Ω, B) is a measurable space. Let
T_i = {P^i_{x,u}(x_i, A_i) : (x, u) ∈ Ω × Σ} be given stochastic kernels. Each set T_i
defines an MCS M_i. We can define an aggregate MCS by setting
A = ∏_i A_i, R = ∏_i R_i, and

    P_u(x, A) = ∏_i P^i_{x,u}(x_i, A_i).    (8)

This describes a model of N noisy computational systems that update in synchronous
parallelism. The state of the whole aggregate is a vector of states of the
individual components, and each receives the states of all other components as part
of its input.
Theorem 7 [12] Let M be an MCS defined by equation (8). It is weakly ergodic if
at least one set of operators T_i is such that δ(P^i_{x,u}) ≤ 1 − c for any (x, u)
and some positive number c. □
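A sketch of the aggregate kernel (8) for N = 2 two-state components. For simplicity the component kernels below (our own numbers) ignore the cross-coupling through the other component's state:

```python
P1 = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical component kernels
P2 = [[0.6, 0.4], [0.5, 0.5]]

def product_kernel(P1, P2):
    """Kernel (8) on the product space; states (x1, x2) indexed as 2*x1 + x2."""
    K = [[0.0] * 4 for _ in range(4)]
    for x1 in range(2):
        for x2 in range(2):
            for y1 in range(2):
                for y2 in range(2):
                    K[2 * x1 + x2][2 * y1 + y2] = P1[x1][y1] * P2[x2][y2]
    return K

K = product_kernel(P1, P2)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in K)  # rows are distributions

delta = max(0.5 * sum(abs(K[x][z] - K[y][z]) for z in range(4))
            for x in range(4) for y in range(4))
assert delta < 1   # the aggregate keeps an ergodicity coefficient below 1
```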
2. The Asynchronous Parallel Model 
In this model, at every step only one component is activated. Suppose that a collection
of N similar MCS's M_i, i = 1, ..., N, is given. Consider a probability measure
e = {e₁, ..., e_N} on the set K = {1, ..., N}. Assume that in each computational
step only one MCS is activated. The current state of the whole aggregate is represented
by the state of its active component. Assume also that the probability of
a computational system M_i to be activated is time-independent and is given by
Prob(M_i) = e_i. The aggregate system is then described by the stochastic kernels

    P_u(x, A) = Σ_{i=1}^{N} e_i P^i_u(x, A).    (9)
Theorem 8 [12] Let M be an MCS defined by formula (9). It is weakly ergodic if
at least one of the sets of operators {P_u^1}, ..., {P_u^N} is weakly ergodic. □
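The mixture kernel (9) is a convex combination, and δ is convex under mixing, so a single weakly ergodic component already pulls the aggregate's coefficient down. A sketch (our own numbers):

```python
def delta(P):
    n = len(P)
    return max(0.5 * sum(abs(P[x][z] - P[y][z]) for z in range(n))
               for x in range(n) for y in range(n))

P1 = [[0.9, 0.1], [0.2, 0.8]]   # delta = 0.7
P2 = [[0.5, 0.5], [0.5, 0.5]]   # delta = 0: this component forgets everything
e = [0.7, 0.3]                  # hypothetical activation probabilities

mix = [[e[0] * P1[x][z] + e[1] * P2[x][z] for z in range(2)]
       for x in range(2)]       # the kernel of equation (9)

assert delta(mix) <= e[0] * delta(P1) + e[1] * delta(P2) + 1e-12
assert delta(mix) < 1
```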
3. Hybrid Weakly Ergodic Systems 
We now present a hybrid weakly ergodic computational system consisting of both
continuous and discrete elements. The evolution of the system is governed by a
differential equation, while its input arrives at discrete times. Let Ω = ℝⁿ, and
consider a collection of differential equations

    dx_u(s)/ds = φ_u(x_u(s)),   u ∈ Σ,  s ∈ [0, ∞).    (10)

Suppose that φ_u(x) is sufficiently smooth to ensure the existence and uniqueness of
solutions of Equation (10) for s ∈ [0, 1] and for any initial condition.
Consider a computational system which receives an input u(t_i) at discrete times
t₀, t₁, t₂, .... In the interval t ∈ [t_i, t_{i+1}] the behavior of the system is described by
Equation (10), where s = t − t_i. A random initial condition for the time t_i is defined
by

    Prob[x_{u(t_i)}(0) ∈ A] = P_{u(t_i)}(x_{u(t_{i−1})}(1), A),    (11)

where x_{u(t_{i−1})}(1) is the state of the system after the previously completed
computations, and P_u(x, A), u ∈ Σ, is a family of stochastic kernels on Ω × B. This
describes a system which receives inputs at discrete instants of time; the input letters
u ∈ Σ cause random perturbations of the state x_{u(t_{i−1})}(1) governed by the
transition probability functions P_{u(t_i)}(x_{u(t_{i−1})}, A). At all other times the
system is a noise-free continuous computational system which evolves according to
Equation (10).
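The hybrid dynamics can be simulated directly: between inputs the state follows the flow of (10), and at each input time it is re-sampled from a stochastic kernel as in (11). A one-dimensional sketch with a contracting flow dx/ds = −x and Gaussian kernels (all numerical choices are ours, not from the paper):

```python
import math
import random

random.seed(0)

DRIFT = {'a': +1.0, 'b': -1.0}   # hypothetical letter-dependent mean shift

def kick(x, u):
    """Random initial condition as in (11): a sample from P_u(x, .)."""
    return random.gauss(x + DRIFT[u], 0.1)

def run(word, x0=0.0):
    x = x0
    for u in word:
        x = kick(x, u)            # stochastic transition at the input time
        x = x * math.exp(-1.0)    # flow of dx/ds = -x for one time unit
    return x

x_final = run("abba")
assert abs(x_final) < 5.0   # the contracting flow keeps the state bounded
```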
Let Ω = ℝⁿ, let x₀ ∈ Ω be a distinguished initial state, and let S and R be two subsets
of Ω with the property of having a ρ-gap: dist(S, R) = inf_{x∈S, y∈R} ‖x − y‖ = ρ > 0.
The first set is called a set of accepting final states and the second is called a
set of rejecting final states. We say that the hybrid computational system M =
(Ω, Σ, x₀, P_u, S, R) recognizes L ⊆ Σ* if for all w = w₀...w_n ∈ Σ* and the end
letter $ ∈ Σ the following holds: w ∈ L ⟹ Prob(x_$(1) ∈ S) ≥ 1/2 + ε, and
w ∉ L ⟹ Prob(x_$(1) ∈ R) ≥ 1/2 + ε.
Theorem 9 [12] Let M be a hybrid computational system. It is weakly ergodic if
its set of evolution operators T = {P_u : u ∈ Σ} is weakly ergodic. □
References 
[1] Casey, M., The Dynamics of Discrete-Time Computation, With Application to Re- 
current Neural Networks and Finite State Machine Extraction, Neural Computation 
8, 1135-1178, 1996. 
[2] Dobrushin, R. L., Central limit theorem for nonstationary Markov chains I, II.
Theor. Probability Appl., vol. 1, 1956, pp. 65-80, 329-383.
[3] Doob J. L., Stochastic Processes. John Wiley and Sons, Inc., 1953. 
[4] W. Maass and Orponen, P., On the effect of analog noise in discrete time computation, 
Neural Computation, 10(5), 1998, pp. 1071-1095. 
[5] W. Maass and Sontag, E., Analog neural nets with Gaussian or other common noise 
distribution cannot recognize arbitrary regular languages, Neural Computation, 11, 
1999, pp. 771-782. 
[6] Neveu J., Mathematical Foundations of the Calculus of Probability. Holden Day, San 
Francisco, 1964. 
[7] Paz A., Ergodic theorems for infinite probabilistic tables. Ann. Math. Statist. vol. 
41, 1970, pp. 539-550. 
[8] Paz A., Introduction to Probabilistic Automata. Academic Press, Inc., London, 1971. 
[9] Rabin, M., Probabilistic automata, Information and Control, vol 6, 1963, pp. 230-245. 
[10] Siegelmann H. T., Neural Networks and Analog Computation: Beyond the Turing 
Limit. Birkhauser, Boston, 1999. 
[11] Siegelmann H. T. and Roitershtein A., On weakly ergodic computational systems, 
1999, submitted. 
[12] Siegelmann H. T., Roitershtein A., and Ben-Hur, A., On noisy computational sys- 
tems, 1999, Discrete Applied Mathematics, accepted. 
