Convergence of Indirect Adaptive 
Asynchronous Value Iteration Algorithms 
Vijaykumar Gullapalli
Department of Computer Science 
University of Massachusetts 
Amherst, MA 01003 
vijaycs.umass.edu 
Andrew G. Barto 
Department of Computer Science 
University of Massachusetts 
Amherst, MA 01003 
barto@cs.umass.edu 
Abstract 
Reinforcement Learning methods based on approximating dynamic 
programming (DP) are receiving increased attention due to their 
utility in forming reactive control policies for systems embedded 
in dynamic environments. Environments are usually modeled as 
controlled Markov processes, but when the environment model is 
not known a priori, adaptive methods are necessary. Adaptive con- 
trol methods are often classified as being direct or indirect. Direct 
methods directly adapt the control policy from experience, whereas 
indirect methods adapt a model of the controlled process and com- 
pute control policies based on the latest model. Our focus is on 
indirect adaptive DP-based methods in this paper. We present a 
convergence result for indirect adaptive asynchronous value itera- 
tion algorithms for the case in which a look-up table is used to store 
the value function. Our result implies convergence of several ex- 
isting reinforcement learning algorithms such as adaptive real-time 
dynamic programming (ARTDP) (Barto, Bradtke, & Singh, 1993) 
and prioritized sweeping (Moore & Atkeson, 1993). Although the 
emphasis of researchers studying DP-based reinforcement learning
has been on direct adaptive methods such as Q-Learning (Watkins, 
1989) and methods using TD algorithms (Sutton, 1988), it is not 
clear that these direct methods are preferable in practice to indirect 
methods such as those analyzed in this paper. 
1 INTRODUCTION 
Reinforcement learning methods based on approximating dynamic programming 
(DP) are receiving increased attention due to their utility in forming reactive con- 
trol policies for systems embedded in dynamic environments. In most of this work, 
learning tasks are formulated as Markovian Decision Problems (MDPs) in which 
the environment is modeled as a controlled Markov process. For each observed 
environmental state, the agent consults a policy to select an action, which when 
executed causes a probabilistic transition to a successor state. State transitions 
generate rewards, and the agent's goal is to form a policy that maximizes the ex- 
pected value of a measure of the long-term reward for operating in the environment. 
(Equivalent formulations minimize a measure of the long-term cost of operating in 
the environment.) Artificial neural networks are often used to store value functions 
produced by these algorithms (e.g., (Tesauro, 1992)). 
Recent advances in reinforcement learning theory have shown that asynchronous 
value iteration provides an important link between reinforcement learning algo- 
rithms and classical DP methods for value iteration (VI) (Barto, Bradtke, & Singh, 
1993). Whereas conventional VI algorithms use repeated exhaustive "sweeps" of the 
MDP's state set to update the value function, asynchronous VI can achieve the same 
result without proceeding in systematic sweeps (Bertsekas & Tsitsiklis, 1989). If the 
state ordering of an asynchronous VI computation is determined by state sequences 
generated during real or simulated interaction of a controller with the Markov pro- 
cess, the result is an algorithm called Real-Time DP (RTDP) (Barto, Bradtke, & 
Singh, 1993). Its convergence to optimal value functions in several kinds of prob- 
lems follows from the convergence properties of asynchronous VI (Barto, Bradtke, 
& Singh, 1993). 
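The contrast between systematic sweeps and asynchronous updates can be illustrated with a small sketch. The three-state MDP below, the random choice of which state to update, and all function names are assumptions made for illustration only; they are not taken from the cited work.

```python
import random

# Hypothetical 3-state, 2-action MDP used only for illustration.
# P[a][i][j]: probability of moving from state i to state j under action a.
# R[i][a]: expected immediate reward for taking action a in state i.
P = [
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
    [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]],
]
R = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
GAMMA = 0.9


def backup(V, i):
    """One value-iteration backup for state i using the true model."""
    return max(
        R[i][a] + GAMMA * sum(P[a][i][j] * V[j] for j in range(len(V)))
        for a in range(len(P))
    )


def synchronous_vi(num_sweeps):
    """Conventional VI: each sweep recomputes all states from the old values."""
    V = [0.0] * len(R)
    for _ in range(num_sweeps):
        V = [backup(V, i) for i in range(len(V))]
    return V


def asynchronous_vi(num_updates, seed=0):
    """Asynchronous VI: update one state at a time, in place, in random order."""
    rng = random.Random(seed)
    V = [0.0] * len(R)
    for _ in range(num_updates):
        i = rng.randrange(len(V))
        V[i] = backup(V, i)
    return V


v_sync = synchronous_vi(500)
v_async = asynchronous_vi(5000)
max_gap = max(abs(a - b) for a, b in zip(v_sync, v_async))
```

In this sketch both procedures approach the same value function, provided every state keeps being selected for update. In RTDP the update order would come from the states visited by the controller rather than from a random number generator.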
2 MDPS WITH INCOMPLETE INFORMATION 
Because asynchronous VI employs a basic update operation that involves computing 
the expected value of the next state for all possible actions, it requires a complete 
and accurate model of the MDP in the form of state-transition probabilities and ex- 
pected transition rewards. This is also true for the use of asynchronous VI in RTDP. 
Therefore, when state-transition probabilities and expected transition rewards are 
not completely known, asynchronous VI is not directly applicable. Problems such 
as these, which are called MDPs with incomplete information (see footnote 1), require more
complex adaptive algorithms for their solution. An indirect adaptive method works by
identifying the underlying MDP via estimates of state transition probabilities and 
expected transition rewards, whereas a direct adaptive method (e.g., Q-Learning 
(Watkins, 1989)) adapts the policy or the value function without forming an ex- 
plicit model of the MDP through system identification. 
Footnote 1: These problems should not be confused with MDPs with incomplete state information,
i.e., partially observable MDPs.
In this paper, we prove a convergence theorem for a set of algorithms we call indirect
adaptive asynchronous VI algorithms. These are indirect adaptive algorithms that
result from simply substituting current estimates of transition probabilities and
expected transition rewards (produced by some concurrently executing identification
algorithm) for their actual values in the asynchronous value iteration computation.
We show that under certain conditions, indirect adaptive asynchronous VI algorithms
converge with probability one to the optimal value function. Moreover, we
use our result to infer convergence of two existing DP-based reinforcement learning
algorithms, adaptive real-time dynamic programming (ARTDP) (Barto, Bradtke,
& Singh, 1993), and prioritized sweeping (Moore & Atkeson, 1993).
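The substitution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-state MDP, the uniformly random exploration policy, the choice to back up only the current state, and names such as `indirect_adaptive_avi` are all hypothetical.

```python
import random

# Hypothetical 2-state, 2-action MDP, unknown to the learner.
# TRUE_P[a][i][j] and TRUE_R[i][a] are used only to simulate experience.
TRUE_P = [
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.1, 0.9], [0.6, 0.4]],
]
TRUE_R = [[1.0, 0.0], [0.0, 2.0]]
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2


def greedy_backup(P, R, V, i):
    """Value-iteration backup for state i under the model (P, R)."""
    return max(
        R[i][a] + GAMMA * sum(P[a][i][j] * V[j] for j in range(N_STATES))
        for a in range(N_ACTIONS)
    )


def solve(P, R, sweeps=2000):
    """Reference solution by ordinary value iteration on a known model."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [greedy_backup(P, R, V, i) for i in range(N_STATES)]
    return V


def indirect_adaptive_avi(steps, seed=0):
    rng = random.Random(seed)
    counts = [[[0] * N_STATES for _ in range(N_STATES)] for _ in range(N_ACTIONS)]
    reward_sum = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    # Initial model: uniform transitions and zero rewards (an assumption).
    P_hat = [[[1.0 / N_STATES] * N_STATES for _ in range(N_STATES)]
             for _ in range(N_ACTIONS)]
    R_hat = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    V = [0.0] * N_STATES
    state = 0
    for _ in range(steps):
        # Uniformly random actions, so every action keeps being tried.
        action = rng.randrange(N_ACTIONS)
        next_state = rng.choices(range(N_STATES), weights=TRUE_P[action][state])[0]
        reward = TRUE_R[state][action]  # deterministic rewards for simplicity
        # Identification: update the maximum-likelihood estimates.
        counts[action][state][next_state] += 1
        reward_sum[state][action] += reward
        n = sum(counts[action][state])
        P_hat[action][state] = [c / n for c in counts[action][state]]
        R_hat[state][action] = reward_sum[state][action] / n
        # Control: asynchronous VI backup of the current state only,
        # with the estimates substituted for the true model.
        V[state] = greedy_backup(P_hat, R_hat, V, state)
        state = next_state
    return V


v_learned = indirect_adaptive_avi(100_000)
v_true = solve(TRUE_P, TRUE_R)
max_error = max(abs(a - b) for a, b in zip(v_learned, v_true))
```

The identification step and the value update are interleaved at every time step, which is the sense in which the estimates are produced by a concurrently executing identification algorithm.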
3 CONVERGENCE OF INDIRECT ADAPTIVE ASYNCHRONOUS VI
Indirect adaptive asynchronous VI algorithms are produced from non-adaptive algo- 
rithms by substituting a current approximate model of the MDP for the true model 
in the asynchronous value iteration computations. An indirect adaptive algorithm 
can be expected to converge only if the corresponding non-adaptive algorithm, with 
the true model used in the place of each approximate model, converges. We therefore 
restrict attention to indirect adaptive asynchronous VI algorithms that correspond 
in this way to convergent non-adaptive algorithms. We prove the following theorem: 
Theorem 1 For any finite state, finite action MDP with an infinite-horizon dis- 
counted performance measure, any indirect adaptive asynchronous VI algorithm (for 
which the corresponding non-adaptive algorithm converges) converges to the optimal 
value function with probability one if 
1) the conditions for convergence of the non-adaptive algorithm are met, 
2) in the limit, every action is executed from every state infinitely often, and
3) the estimates of the state-transition probabilities and the expected transition re- 
wards remain bounded and converge in the limit to their true values with probability 
one. 
Proof The proof is given in Appendix A.2. 
4 DISCUSSION 
Condition 2 of the theorem, which is also required by direct adaptive methods 
to ensure convergence, is usually unavoidable. It is typically ensured by using a 
stochastic policy. For example, we can use the Gibbs distribution method for se- 
lecting actions used by Watkins (1989) and others. Given condition 2, condition 3 is 
easily satisfied by most identification methods. In particular, the simple maximum- 
likelihood identification method (see Appendix A.1, items 6 and 7) converges to the 
true model with probability one under this condition. 
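For concreteness, Gibbs (Boltzmann) action selection can be sketched as below. The temperature parameter, the function names, and the example action values are assumptions introduced here; the text above only names the method.

```python
import math
import random


def gibbs_probabilities(action_values, temperature):
    """Gibbs (Boltzmann) distribution over the actions of one state."""
    # Subtract the maximum before exponentiating for numerical stability.
    best = max(action_values)
    weights = [math.exp((q - best) / temperature) for q in action_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(action_values, temperature, rng=random):
    """Sample an action index according to the Gibbs distribution."""
    probs = gibbs_probabilities(action_values, temperature)
    return rng.choices(range(len(action_values)), weights=probs)[0]


# Hypothetical action values for a single state.
probs = gibbs_probabilities([1.0, 2.0, 0.5], temperature=1.0)
```

With bounded action values and a temperature kept away from zero, every action retains a strictly positive selection probability, which is the property condition 2 relies on. In an indirect method the action values would be computed from the current model estimates and value function.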
Our result is valid only for the special case in which the value function is explicitly 
stored in a look-up table. The case in which general function approximators such 
as neural networks are used requires further analysis. 
Finally, an important issue not addressed in this paper is the trade-off between 
system identification and control. To ensure convergence of the model, all actions 
have to be executed infinitely often in every state. On the other hand, on-line control 
objectives are best served by executing the action in each state that is optimal 
according to the current value function (i.e., by using the certainty equivalence 
optimal policy). This issue has received considerable attention from control theorists 
(see, for example, (Kumar, 1985), and the references therein). Although we do not 
address this issue in this paper, for a specific estimation method, it may be possible 
to determine an action selection scheme that makes the best trade-off between 
identification and control. 
5 EXAMPLES OF INDIRECT ADAPTIVE ASYNCHRONOUS VI
One example of an indirect adaptive asynchronous VI algorithm is ARTDP (Barto, 
Bradtke, & Singh, 1993) with maximum-likelihood identification. In this algorithm, 
a randomized policy is used to ensure that every action has a non-zero probability 
of being executed in each state. The following theorem for ARTDP follows directly
from our result and the corresponding theorem for RTDP in (Barto, Bradtke, & 
Singh, 1993): 
Theorem 2 For any discounted MDP and any initial value function, trial-based
ARTDP (see footnote 2) converges with probability one.
As a special case of the above theorem, we can obtain the result that in similar prob- 
lems the prioritized sweeping algorithm of Moore and Atkeson (Moore & Atkeson, 
1993) converges to the optimal value function. This is because prioritized sweeping 
is a special case of ARTDP in which states are selected for value updates based 
on their priority and the processing time available. A state's priority reflects the 
utility of performing an update for that state, and hence prioritized sweeping can 
improve the efficiency of asynchronous VI. A similar algorithm, Queue-Dyna (Peng 
& Williams, 1992), can also be shown to converge to the optimal value function 
using a simple extension of our result. 
6 CONCLUSIONS
We have shown convergence of indirect adaptive asynchronous value iteration un- 
der fairly general conditions. This result implies the convergence of several existing 
DP-based reinforcement learning algorithms. Moreover, we have discussed possi- 
ble extensions to our result. Our result is a step toward a better understanding 
of indirect adaptive DP-based reinforcement learning methods. There are several 
promising directions for future work. 
One is to analyze the trade-off between model estimation and control mentioned 
earlier to determine optimal methods for action selection and to integrate our work 
with existing results on adaptive methods for MDPs (Kumar, 1985). Second, anal- 
ysis is needed for the case in which a function approximation method, such as a 
neural network, is used instead of a look-up table to store the value function. A 
third possible direction is to analyze indirect adaptive versions of more general DP-based
algorithms that combine asynchronous policy iteration with asynchronous
policy evaluation. Several non-adaptive algorithms of this nature have been
proposed recently (e.g., (Williams & Baird, 1993; Singh & Gullapalli)).
Footnote 2: As in (Barto, Bradtke, & Singh, 1993), by trial-based execution of an algorithm we
mean its use in an infinite series of trials such that every state is selected infinitely often
to be the start state of a trial.
Finally, it will be useful to examine the relative efficacies of direct and indirect 
adaptive methods for solving MDPs with incomplete information. Although the 
emphasis of researchers studying DP-based reinforcement learning has been on di- 
rect adaptive methods such as Q-Learning and methods using TD algorithms, it is 
not clear that these direct methods are preferable in practice to indirect methods 
such as the ones discussed here. For example, Moore and Atkeson (1993) report sev- 
eral experiments in which prioritized sweeping significantly outperforms Q-learning 
in terms of the computation time and the number of observations required for 
convergence. More research is needed to characterize circumstances for which the 
various reinforcement learning methods are best suited. 
APPENDIX 
A.1 NOTATION 
1. Time steps are denoted $t = 1, 2, \ldots$, and $z_t$ denotes the last state observed
before time $t$. $z_t$ belongs to a finite state set $S = \{1, 2, \ldots, n\}$.
2. Actions in a state are selected according to a policy $\pi$, where $\pi(i) \in A$, a
finite set of actions, for $1 \le i \le n$.
3. The probability of making a transition from state $i$ to state $j$ on executing
action $a$ is $p^a(i, j)$.
4. The expected reward from executing action $a$ in state $i$ is $r(i, a)$. The
reward received at time $t$ is denoted $r_t(z_t, a_t)$.
5. $0 \le \gamma < 1$ is the discount factor.
6. Let $\hat{p}^a_t(i, j)$ denote the estimate at time $t$ of the probability of transition
from state $i$ to $j$ on executing action $a \in A$. Several different methods can be
used for estimating $\hat{p}^a_t(i, j)$. For example, if $n^a_t(i, j)$ is the observed number
of times before time step $t$ that execution of action $a$ when the system was
in state $i$ was followed by a transition to state $j$, and $n^a_t(i) = \sum_{j \in S} n^a_t(i, j)$
is the number of times action $a$ was executed in state $i$ before time step
$t$, then, for $1 \le i \le n$ and for all $a \in A$, the maximum-likelihood state-transition
probability estimates at time $t$ are
$$\hat{p}^a_t(i, j) = \frac{n^a_t(i, j)}{n^a_t(i)}, \qquad 1 \le j \le n.$$
Note that the maximum-likelihood estimates converge to their true values
with probability one if $n^a_t(i) \to \infty$ as $t \to \infty$, i.e., every action is executed
from every state infinitely often.
Let $p^a(i) = [p^a(i, 1), \ldots, p^a(i, n)] \in [0, 1]^n$, and similarly, $\hat{p}^a_t(i) = [\hat{p}^a_t(i, 1),
\ldots, \hat{p}^a_t(i, n)] \in [0, 1]^n$. We will denote the $|S| \times |A|$ matrix of transition
probabilities associated with state $i$ by $P(i)$ and its estimate at time $t$ by
$P_t(i)$. Finally, $P$ denotes the vector of matrices $[P(1), \ldots, P(n)]$, and $P_t$
denotes the vector $[P_t(1), \ldots, P_t(n)]$.
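7. Let $\hat{r}_t(i, a)$ denote the estimate at time $t$ of the expected reward $r(i, a)$,
and let $\hat{r}_t$ denote all the $|S| \times |A|$ estimates at time $t$. Again, if maximum-likelihood
estimation is used,
$$\hat{r}_t(i, a) = \frac{\sum_{k=1}^{t-1} r_k(z_k, a_k)\, I_{ia}(z_k, a_k)}{n^a_t(i)},$$
where $I_{ia} : S \times A \to \{0, 1\}$ is the indicator function for the state-action pair
$(i, a)$.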
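8. $V_t^*$ denotes the optimal value function for the MDP defined by the estimates
$P_t$ and $\hat{r}_t$ of $P$ and $r$ at time $t$. Thus, $\forall i \in S$,
$$V_t^*(i) = \max_{a \in A} \Big\{ \hat{r}_t(i, a) + \gamma \sum_{j \in S} \hat{p}^a_t(i, j)\, V_t^*(j) \Big\}.$$
Similarly, $V^*$ denotes the optimal value function for the MDP defined by
$P$ and $r$.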
9. $B_t \subseteq S$ is the subset of states whose values are updated at time $t$. Usually,
at least 
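The maximum-likelihood estimates of items 6 and 7 amount to normalized transition counts and per-pair reward averages. The sketch below illustrates this on a short, hypothetical experience log; the tuple layout `(state, action, reward, next_state)` and the function name are assumptions made here.

```python
from collections import defaultdict


def ml_estimates(transitions):
    """Maximum-likelihood estimates from (state, action, reward, next_state) tuples."""
    pair_count = defaultdict(int)    # number of times a was executed in i
    triple_count = defaultdict(int)  # number of times (i, a) was followed by j
    reward_sum = defaultdict(float)  # total reward observed for (i, a)
    for i, a, r, j in transitions:
        pair_count[(i, a)] += 1
        triple_count[(i, a, j)] += 1
        reward_sum[(i, a)] += r
    # Transition estimate: count of (i, a, j) divided by count of (i, a).
    p_hat = {
        (i, a, j): c / pair_count[(i, a)]
        for (i, a, j), c in triple_count.items()
    }
    # Reward estimate: average reward observed for (i, a).
    r_hat = {key: reward_sum[key] / n for key, n in pair_count.items()}
    return p_hat, r_hat


# Hypothetical log: action "go" tried four times in state 1.
log = [(1, "go", 1.0, 2), (1, "go", 0.0, 1), (1, "go", 1.0, 2), (1, "go", 1.0, 2)]
p_hat, r_hat = ml_estimates(log)
# p_hat[(1, "go", 2)] is 0.75, p_hat[(1, "go", 1)] is 0.25, r_hat[(1, "go")] is 0.75
```

State-action pairs that have never been tried have no estimate in this sketch, which is one reason every action must be executed from every state.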
A.2 PROOF OF THEOREM 1
In indirect adaptive asynchronous VI algorithms, the estimates of the MDP parameters
at time step $t$, $P_t$ and $\hat{r}_t$, are used in place of the true parameters, $P$ and $r$, in
the asynchronous VI computations at time $t$. Hence the value function is updated
at time $t$ as
$$V_{t+1}(i) = \begin{cases} \max_{a \in A} \big\{ \hat{r}_t(i, a) + \gamma \sum_{j \in S} \hat{p}^a_t(i, j)\, V_t(j) \big\} & \text{if } i \in B_t, \\ V_t(i) & \text{otherwise,} \end{cases}$$
where $B_t \subseteq S$ is the subset of states whose values are updated at time $t$.
First note that because $P_t$ and $\hat{r}_t$ are assumed to be bounded for all $t$, $V_t$ is also
bounded for all $t$. Next, because the optimal value function given the model $P_t$ and
$\hat{r}_t$, $V_t^*$, is a continuous function of the estimates $P_t$ and $\hat{r}_t$, convergence of these
estimates w.p. 1 to their true values implies that
$$V_t^* \to V^* \quad \text{w.p. 1},$$
where $V^*$ is the optimal value function for the original MDP. The convergence w.p.
1 of $V_t^*$ to $V^*$ implies that given an $\epsilon > 0$ there exists an integer $T > 0$ such that
for all $t \ge T$,
$$\| V_t^* - V^* \| < \frac{(1 - \gamma)}{2\gamma}\, \epsilon \quad \text{w.p. 1.} \tag{1}$$
Here, $\| \cdot \|$ can be any norm on $\mathbb{R}^n$, although we will use the $\ell_\infty$ or max norm.
In algorithms based on asynchronous VI, the values of only the states in $B_t \subseteq S$ are
updated at time $t$, although the value of each state is updated infinitely often. For
an arbitrary $z \in S$, let us define the infinite subsequence $\{t_k^z\}_{k=0}^{\infty}$ to be the times
when the value of state $z$ gets updated. Further, let us only consider updates at,
or after, time $T$, where $T$ is from equation (1) above, so that $t_k^z \ge T$ for all $z \in S$.
By the nature of the VI computation we have, for each $t \ge 1$,
$$|V_{t+1}(i) - V_t^*(i)| \le \gamma \| V_t - V_t^* \| \quad \text{if } i \in B_t. \tag{2}$$
Using inequality (2), we can get a bound for $|V_{t_k^z+1}(z) - V^*_{t_k^z}(z)|$ as
$$|V_{t_k^z+1}(z) - V^*_{t_k^z}(z)| \le \gamma^{k+1} \| V_{t_0^z} - V^*_{t_0^z} \| + (1 - \gamma^{k})\, \epsilon. \tag{3}$$
We can verify that the bound in (3) is correct through induction. The bound is
clearly valid for $k = 0$. Assuming it is valid for $k$, we show that it is valid for $k + 1$:
$$|V_{t_{k+1}^z+1}(z) - V^*_{t_{k+1}^z}(z)| \le \gamma^{k+2} \| V_{t_0^z} - V^*_{t_0^z} \| + (1 - \gamma^{k+1})\, \epsilon.$$
Taking the limit as $k \to \infty$ in equation (3) and observing that for each $z$,
$\lim_{k \to \infty} V^*_{t_k^z}(z) = V^*(z)$ w.p. 1, we obtain
$$\lim_{k \to \infty} |V_{t_k^z+1}(z) - V^*(z)| \le \epsilon \quad \text{w.p. 1.}$$
Since $\epsilon$ and $z$ are arbitrary, this implies that $V_t \to V^*$ w.p. 1.
Acknowledgements
We gratefully acknowledge the significant contribution of Peter Dayan, who pointed 
out that a restrictive condition for convergence in an earlier version of our result 
was actually unnecessary. This work has also benefited from several discussions 
with Satinder Singh. We would also like to thank Chuck Anderson for his timely 
help in preparing this material for presentation at the conference. This material 
is based upon work supported by funding provided to A. Barto by the AFOSR, 
Bolling AFB, under Grant AFOSR-F49620-93-1-0269 and by the NSF under Grant 
ECS-92-14866. 
References 
[1] A.G. Barto, S.J. Bradtke, and S.P. Singh. Learning to act using real-time 
dynamic programming. Technical Report 93-02, University of Massachusetts, 
Amherst, MA, 1993. 
[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: 
Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ, 1989.
[3] P. R. Kumar. A survey of some results in stochastic adaptive control. SIAM
Journal on Control and Optimization, 23(3):329-380, May 1985.
[4] A. W. Moore and C. G. Atkeson. Memory-based reinforcement learning: Efficient
computation with prioritized sweeping. In S. J. Hanson, J. D. Cowan,
and C. L. Giles, editors, Advances in Neural Information Processing Systems
5, pages 263-270, San Mateo, CA, 1993. Morgan Kaufmann Publishers.
[5] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna
framework. In Proceedings of the Second International Conference on Simulation
of Adaptive Behavior, Honolulu, HI, 1992.
[6] S. P. Singh and V. Gullapalli. Asynchronous modified policy iteration with
single-sided updates. (Under review).
[7] R. S. Sutton. Learning to predict by the methods of temporal differences.
Machine Learning, 3:9-44, 1988.
[8] G. J. Tesauro. Practical issues in temporal difference learning. Machine Learning,
8(3/4):257-277, May 1992.
[9] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, Cambridge
University, Cambridge, England, 1989.
[10] R. J. Williams and L. C. Baird. Analysis of some incremental variants of
policy iteration: First steps toward understanding actor-critic learning systems.
Technical Report NU-CCS-93-11, Northeastern University, College of
Computer Science, Boston, MA 02115, September 1993.
