Convergence of Stochastic Iterative 
Dynamic Programming Algorithms 
Tommi Jaakkola* Michael I. Jordan 
Satinder P. Singh 
Department of Brain and Cognitive Sciences 
Massachusetts Institute of Technology 
Cambridge, MA 92139 
Abstract 
Increasing attention has recently been paid to algorithms based on 
dynamic programming (DP) due to the suitability of DP for learn- 
ing problems involving control. In stochastic environments where 
the system being controlled is only incompletely known, however, 
a unifying theoretical account of these methods has been missing. 
In this paper we relate DP-based learning algorithms to the pow- 
erful techniques of stochastic approximation via a new convergence 
theorem, enabling us to establish a class of convergent algorithms 
to which both TD(A) and Q-learning belong. 
I INTRODUCTION 
Learning to predict the future and to find an optimal way of controlling it are the 
basic goals of learning systems that interact with their environment. A variety of 
algorithms are currently being studied for the purposes of prediction and control 
in incompletely specified, stochastic environments. Here we consider learning algo- 
rithms defined in Markov environments. There are actions or controls (u) available 
for the learner that affect both the state transition probabilities, and the proba- 
bility distribution for the immediate, state dependent costs (ci(u)) incurred by the 
learner. Let pij(u) denote the probability of a transition to state j when control 
u is executed in state i. The learning problem is to predict the expected cost of a 
*E-mMI: tommi@psyche.mit.edu 
703 
704 Jaakkola, Jordan, and Singh 
fixed policy tt (a function from states to actions), or to obtain the optimal policy 
(it*) that minimizes the expected cost of interacting with the environment. 
If the learner were allowed to know the transition probabilities as well as the imme- 
diate costs the control problem could be solved directly by Dynamic Programming 
(see e.g., Bertsekas, 1987). However, when the underlying system is only incom- 
pletely known, algorithms such as Q-learning (Watkins, 1989) for prediction and 
control, and TD(,) (Sutton, 1988) for prediction, are needed. 
One of the central problems in developing a theoretical understanding of these 
algorithms is to characterize their convergence; that is, to establish under what 
conditions they are ultimately able to obtain correct predictions or optimal control 
policies. The stochastic nature of these algorithms immediately suggests the use 
of stochastic approximation theory to obtain the convergence results. However, 
there exists no directly available stochastic approximation techniques for problems 
involving the maximum norm that plays a crucial role in learning algorithms based 
on DP. 
In this paper, we extend Dvoretzky's (1956) formulation of the classical Robbins- 
Munro (1951) stochastic approximation theory to obtain a class of converging pro- 
cesses involving the maximum norm. In addition, we show that Q-learning and 
both the on-line and batch versions of TD(A) are realizations of this new class. 
This approach keeps the convergence proofs simple and does not rely on construc- 
tions specific to particular algorithms. Several other authors have recently presented 
results that are similar to those presented here: Dayan and Sejnowski (1993) for 
TD(,), Peng and Williams (1993) for TD(,), and Tsitsiklis (1993) for Q-learning. 
Our results appear to be closest to those of Tsitsiklis (1993). 
2 Q-LEARNING 
The Q-learning algorithm produces values--"Q-values"--by which an optimal ac- 
tion can be determined at any state. The algorithm is based on DP by rewriting 
Bellman's equation such that there is a value assigned to every state-action pair 
instead of only to a state. Thus the Q-values satisfy 
Q(s, u) = g,(u) + '7 y]p**,(u)mfixQ(s', u') (1) 
where  denotes the mean of c. The solution to this equation can be obtained 
by updating the Q-values iteratively; an approach known as the value iteration 
method. In the learning problem the values for the mean of c and for the transition 
probabilities are unknown. However, the observable quantity 
c,(ut) + 7maxQ(st+l, u) (2) 
where st and ut are the state of the system and the action taken at time t, respec- 
tively, is an unbiased estimate of the update used in value iteration. The Q-learning 
algorithm is a relaxation inethod that uses this estimate iteratively to update the 
current Q-values (see below). 
The Q-learning algorithm converges mainly due to the contraction property of the 
value iteration operator. 
Convergence of Stochastic Iterative Dynamic Programming Algorithms 705 
2.1 CONVERGENCE OF Q-LEARNING 
Our proof is based on the observation that the Q-learning algorithm can be viewed as 
a stochastic process to which techniques of stochastic approximation are generally 
applicable. Due to the lack of a formulation of stochastic approximation for the 
maximum norm, however, we need to slightly extend the standard results. This is 
accomplished by the following theorem the proof of which can be found in Jaakkola 
et al. (1993). 
Theorem I A random iterative process An+l(z) -- (1-an(X))An(X)-kfin(X)Fn(X) 
converges to zero w.p.1 under the following assumptions: 
1) The state space is finite. 
2 
2) Z an(X) -- , Zn an(X) < 
E{/3n(x)lPn) _ E{an(x)lPn ) uniformly m.p.1. 
3) II ;{Fn(x)lPn)IIw<_ - II/x, Ilw, where ' e (o, 1). 
4) Var{Fn(x)lPn) _< C(I+ II an II) , where C is some constant. 
Here Pn = {An, An-I,..., Fn-1,..., an-1,-.., fin-i,..-} stands for the past at step 
n. Fn(x), an(X) and fin(X) are allowed to depend on the past insofar as the above 
conditions remain valid. The notation I1' I1 refers to some weighted maximum 
norm. 
In applying the theorem, the An process will generally represent the difference 
between a stochastic process of interest and some optimal value (e.g., the optimal 
value function). The formulation of the theorem therefore requires knowledge to be 
available about the optimal solution to the learning problem before it can be applied 
to any algorithm whose convergence is to be verified. In the case of Q-learning the 
required knowledge is available through the theory of DP and Bellman's equation 
in particular. 
The convergence of the Q-learning algorithm now follows easily by relating the 
algorithm to the converging stochastic process defined by Theorem 1. I 
Theorem 2 The Q-learning algorithm given by 
qt+l(St, ut) = (1 -- at(st, ut))Qt(st, ut) + at(st, ut)[cs,(ut) + "Vt (St+l)] 
converges to the optimal Q*(s, u) values if 
I) The state and action spaces are finite. 
2) Zt at(s, u): oc and Zt a2( s, u) < oo uniformly w.p.1. 
3) Var{c,(u))is bounded. 
We note that the theorem is more powerful than is needed to prove the convergence 
of Q-learning. Its generality, however, allows it to be applied to other algorithms as well 
(see the following section on TD()). 
706 Jaakkola, Jordan, and Singh 
3) If'/= 1, all policies lead to a cost free terminal state w.p.1. 
Proof. By subtracting Q*(s, u) from both sides of the learning rule and by defining 
At(s, u) - Qt(s, u) - Q*(s, u) together with 
F(s, u) = c(u) + - q*(8, u) (3) 
the Q-learning algorithm can be seen to have the form of the process in Theorem 1 
with (s, u) = st(s, u). 
To verify that Ft(s, u) has the required properties we begin by showing that it is a 
contraction mapping with respect to some maximum norm. This is done by relating 
Ft to the DP value iteration operator for the same Markov chain. More specifically, 
maxlE{F(i,u)}l = '/maxlEpij(u)[(j) -- v*(j)]l 
J 
< '/axy.pi(u)maxIQt(j,v)- Q*(j,v)l 
J 
- '/maxZpij(u)V/x(j): T(V/X)(i) 
where we have used the notation l/a(j) = maxv Iq(J, v)-q*(j, v)l and T is the DP 
value iteration operator for the case where the costs associated with each state are 
zero. If'/< 1 the contraction property of E{Ft(i, u)} can be obtained by bounding 
j Pij(u)Va(j) by maxj Va(j) and then including the '/factor. When the future 
costs are not discounted ('/= 1) but the chain is absorbing and all policies lead to 
the terminal state w.p.1 there still exists a weighted maximum norm with respect 
to which T is a contraction mapping (see e.g. Bertsekas & Tsitsiklis, 1989) thereby 
forcing the contraction of E{Ft(i,u)}. The variance of Ft(s,u) given the past is 
within the bounds of Theorem I as it depends on Or(s, u) at most linearly and the 
variance of c(u) is bounded. 
Note that the proof covers both the on-line and batch versions. [] 
3 THE TD(A) ALGORITHM 
The TD(A) (Sutton, 1988) is also a DP-based learning algorithm that is naturally 
defined in a Markov environment. Unlike Q-learning, however, TD does not involve 
decision-making tasks but rather predictions about the future costs of an evolving 
system. TD(A) converges to the same predictions as a version of Q-learning in which 
there is only one action available at each state, but the algorithms are derived from 
slightly different grounds and their behavioral differences are not well understood. 
The algorithm is based on the estimates 
x(i) = (1 - A) E/n-lvt(n)(i) (4) 
n'--I 
where ('*)(i) are n step look-ahead predictions. The expected values of the VX(i) 
are strictly better estimates of the correct predictions than the (i)s are (see 
Convergence of Stochastic Iterative Dynamic Programming Algorithms 
707 
Jaakkola et al., 1993) and the update equation of the algorithm 
+i(it) = (it) + oq[)'(it) - (it)] (5) 
can be written in a practical recursive form as is seen below. The convergence of 
the algorithm is mainly due to the statistical properties of the x(i) estimates. 
3.1 CONVERGENCE OF TD(A) 
As we are interested in strong forms of convergence we need to impose some new 
constraints, but due to the generality of the approach we can dispense with some 
others. Specifically, the learning rate parameters ct,, are replaced by (x,,(i) which 
satisfy ,, c,,(i) - oc and Yn c2(i) < cx> uniformly w.p.1. These parameters 
allow asynchronous updating and they can, in general, be random variables. The 
convergence of the algorithm is guaranteed by the following theorem which is an 
application of Theorem 1. 
Theorem 3 For any finite absorbing Markov chain, for any distribution of starting 
states with no inaccessible states, and for any distributions of the costs with finite 
variances the TD(A) algorithm given by 
1) 
m t 
Vn+l (i) : V,(i) + c,(i) Z[ci, + 7Vn(it+l ) - Vn(it)] Z(TA)t-k Xi(k) 
t:l k:l 
E. = E. < uniformly 
t 
+z(i) = Vt(i) + (x(i)[ci, + 7(it+l) - (it)] E(TA) xi(k) 
k--1 
-.tct(i) : oc and -.,c(i) < oc uniformly w.p.1 and within sequences 
ct(i)/rnaxtesct(i)--> 1 uniformly w.p.1. 
converges to the optimal predictions w.p.1 provided 7, A 6 [0, 1] with 7A < 1. 
Proof for (1): 
previous section). 
We use here a slightly different form for the learning rule (cf. the 
,(i) 
1 
E{rn(i)) Z V(i;k) 
k=l 
where V(i; k) is an estimate calculated at 
sequence and for mathematical convenience 
c,,(i) - E{m(i)}c,,(i), where re(i) is the 
during the sequence. 
re(i) 
E{m(i)) V,, (i)] 
the k tt occurrence of state i in a 
we have made the transformation 
number of times state i was visited 
Vn+l(i) = Vn(i) q- (x,(i)[G,(i) 
708 Jaakkola, Jordan, and Singh 
To apply Theorem 1 we subtract V*(i), the optimal predictions, from both sides of 
the learning equation. By identifying a,,(i) := o,,(i)m(i)/E{m(i)}, /3,,(i) := a,,(i), 
and F,(i) :: G,(i)- V*(i)m(i)/E{m(i)} we need to show that these satisfy the 
conditions of Theorem 1. For a,(i) and/3,,(i) this is obvious. We begin here by 
showing that F,(i) indeed is a contraction mapping. To this end, 
max IE{F.(i)I X/.)l- 
i 
1 
mx I E{m(i)} E{(I/,x(i; 1)- l/*(i)) + (l/,x (i; 2)- l/*(i)) +... l l/,, }1 
which can be bounded above by using the relation 
IE{v(i; k)- v*(i) I 
_< E{ IE{V(i;k)- V*(i)Ira(i) >_ k,V)10(m(i)-k) I 
_< e{m(i) >_ k}lE{V(i)- v*(i) l v.}l 
< 7P{m(i) >_ k} m.axlV,,(i)- V*(i)[ 
where 0(x) = 0 if x < 0 and 1 otherwise. Here we have also used the fact that V(i) 
is a contraction mapping independent of possible discounting. As 5-}k P{m(i) >_ 
k} = E{m(i)} we finally get 
maxlE{F,(i) l V.}l < 7 m.ax [V,,(i)- V*(i)l 
i z 
The variance of F,(i) can be seen to be bounded by 
E{m4} m.xlV.(i)l 2 $ 
For any absorbing Markov chain the convergence to the terminal state is geometric 
and thus for every finite k, E{rn k} <_ C(k), implying that the variance of F,(i) is 
within the bounds of Theorem 1. As Theorem I is now applicable we can conclude 
that the batch version of TD() converges to the optimal predictions w.p.1. [] 
Proof for (2) The proof for the on-line version is achieved by showing that the 
effect of the on-line updating vanishes in the limit thereby forcing the two versions 
to be equal asymptotically. We view the on-line version as a batch algorithm in 
which the updates are made after each complete sequence but are made in such a 
manner so as to be equal to those made on-line. 
Define G',(i) = G,(i) + G(i) to be a new batch estimate taking into account the 
on-line updating within sequences. Here G, (i) is the batch estimate with the desired 
properties (see the proof for (1)) and G(i) is the difference between the two. We 
take the new batch learning parameters to be the maxima over a sequence, that 
is a,,(i) = maxres cq(i). As all the ct(i) satisfy the required conditions uniformly 
w.p.1 these new learning parameters satisfy them as well. 
To analyze the new batch algorithm we divide it into three parallel processes: the 
batch TD(X) with a,(i) as learning rate parameters, the difference between this and 
the new batch estimate, and the change in the value function due to the updates 
made on-line. Under the conditions of the TD(X) convergence theorem rigorous 
Convergence of Stochastic Iterative Dynamic Programming Algorithms 709 
upper bounds can be derived for the latter two processes (see Jaakkola, et al., 
1993). These results enable us to write 
II g{c'- V*)11 
 II E{G - V*)II + II C II 
 (-( + CZ)II V - V* II +C 
where C' and C' go to zero with w.p.1. This implies that for any e > 0 and 
I[ v - v* II>> e there exists 7 < 1 such that 
for n large enough. This is the required contraction property of Theorem 1. In 
addition, it can readily be checked that the variance of the new estimate falls under 
the conditions of Theorem 1. 
Theorem 1 now guarantees that for any e the value function in the on-line algorithm 
converges w.p.1 into some e-bounded region of V* and therefore the algorithm itself 
converges to V* w.p.1. [] 
4 CONCLUSIONS 
In this paper we have extended results from stochastic approximation theory to 
cover asynchronous relaxation processes which have a contraction property with 
respect to some maximum norm (Theorem 1). This new class of converging iterative 
processes is shown to include both the Q-learning and TD(A) algorithms in either 
their on-line or batch versions. We note that the convergence of the on-line version 
of TD(A) has not been shown previously. We also wish to emphasize the simplicity 
of our results. The convergence proofs for Q-learning and TD(A) utilize only high- 
level statistical properties of the estimates used in these algorithms and do not rely 
on constructions specific to the algorithms. Our approach also sheds additional 
light on the similarities between Q-learning and TD(A). 
Although Theorem 1 is readily applicable to DP-based learning schemes, the theory 
of Dynamic Programming is important only for its characterization of the optimal 
solution and for a contraction property needed in applying the theorem. The theo- 
rem can be applied to iterative algorithms of different types as well. 
Finally we note that Theorem I can be extended to cover processes that do not show 
the usual contraction property thereby increasing its applicability to algorithms of 
possibly more practical importance. 
References 
Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Mod- 
els. Englewood Cliffs, NJ' Prentice-Hall. 
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: 
Numerical Methods. Englewood Cliffs, N J: Prentice-Hall. 
Dayan, P. (1992). The convergence of TD(A) for general h. Machine Learning, 8, 
341-362. 
710 Jaakkola, Jordan, and Singh 
Dayan, P., & Sejnowski, T. J. (1993). TD(A) converges with probability 1. CNL, 
The Salk Institute, San Diego, CA. 
Dvoretzky, A. (1956). On stochastic approximation. Proceedings of the Third Berke- 
ley Symposium on Mathematical Statistics and Probability. University of California 
Press. 
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1993). On the convergence of stochastic 
iterative dynamic programming algorithms. Submitted to Neural Computation. 
Peng J., & Williams R. J. (1993). TD(A) converges with probability 1. Department 
of Computer Science preprint, Northeastern University. 
Robbins, H., & Monro, S. (1951). A stochastic approximation model. Annals of 
Mathematical Statistics, 22, 400-407. 
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. 
Machine Learning, 3, 9-44. 
Tsitsiklis J. N. (1993). Asynchronous stochastic approximation and Q-learning. 
Submitted to: Machine Learning. 
Watkins, C.J.C.H. (1989). Learning from delayed rewards. PhD Thesis, University 
of Cambridge, England. 
Watkins, C.J.C.H, & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292. 
