The Asymptotic Convergence-Rate of 
Q-learning 
Cs. Szepesvri* 
Research Group on Artificial Intelligence, "JSzsef Attila" University, 
Szeged, Aradi vrt. tere 1, Hungary, H-6720 
szepes@math.u-szeged.hu 
Abstract 
In this paper we show that for discounted MDPs with discount 
factor 3'  1/2 the asymptotic rate of convergence of Q-learning 
is O(1/t R(-v)) if R(1- 3')  1/2 and O(v/loglogt/t) otherwise 
provided that the state-action pairs are sampled from a fixed prob- 
ability distribution. Here R -- Pmin/Pmax is the ratio of the min- 
imum and maximum state-action occupation frequencies. The re- 
sults extend to convergent on-line learning provided that Pmin  0, 
where Pmin and Pmax now become the minimum and maximum 
state-action occupation frequencies corresponding to the station- 
ary distribution. 
I INTRODUCTION 
Q-learning is a popular reinforcement learning (RL) algorithm whose convergence is 
well demonstrated in the literature (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman 
and Szepesvri, 1996; Szepesvri and Littman, 1996). Our aim in this paper is to 
provide an upper bound for the convergence rate of (lookup-table based) Q-learning 
algorithms. Although, this upper bound is not strict, computer experiments (to be 
presented elsewhere) and the form of the lemma underlying the proof indicate that 
the obtained upper bound can be made strict by a slightly more complicated defi- 
nition for R. Our results extend to learning on aggregated states (see (Singh et al., 
1995)) and other related algorithms which admit a certain form of asynchronous 
stochastic approximation (see (Szepesvgri and Littman, 1996)). 
Present address: Associative Computing, Inc., Budapest, Konkoly Thege M. u. 29-33, 
HUNGARY-1121 
The Asymptotic Convergence-Rate of Q-learning 1065 
2 Q-LEARNING 
Watkins introduced the following algorithm to estimate the value of state-action 
pairs in discounted Markovian Decision Processes (MDPs) (Watkins, 1990): 
Qt+(x,a) -- (1-ot(x,a))Qt(x,a) + ot(x,a)(rt(x,a) +"/maxQt(yt,b)). (1) 
bGA 
Here x & X and a  A are states and actions, respectively, X and A are finite. It is 
assumed that some random sampling mechanism (e.g. simulation or interaction with 
a real Markovian environment) generates random samples of form (xt, at,yt,rt), 
where the probability of Yt given (xt,at) is fixed and is denoted by P(xt,at,yt), 
E[rtlxt, at] = R(x,a) is the immediate average reward which is received when 
executing action a from state x, Yt and rt are assumed to be independent given the 
history of the learning-process, and also it is assumed that Var[rtlxt, at] < ' for 
some C > 0. The values 0 _< at(x, a) _< I are called the learning rate associated 
with the state-action pair (x, a) at time t. This value is assumed to be zero if 
(x, a)  (xt, at), i.e. only the value of the actual state and action is reestimated in 
each step. If 
at(x, a) = (2) 
t=l 
and 
 at2(x,a) <  (3) 
t=l 
then Q-learning is guaranteed to converge to the only fixed point Q* of the operator 
T  x x A _> X x A defined by 
(TQ)(x,a) - R(x,a) +"7 Y P(x,a,y)mxQ(y,b) 
yGX 
(convergence proofs can be found in (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman 
and Szepesvgri, 1996; Szepesvgri and Littman, 1996)). Once Q* is identified the 
learning agent can act optimally in the underlying MDP simply by choosing the 
action which maximizes Q* (x, a) when the agent is in state x (Ross, 1970; Puterman, 
1994). 
3 THE MAIN RESULT 
Condition (2) on the learning rate at(x, a) requires only that every state-action pair 
is visited infinitely often, which is a rather mild condition. In this article we take the 
stronger assumption that {(xt, at)}t is a sequence of independent random variables 
with common underlying probability distribution. Although this assumption is not 
essential it simplifies the presentation of the proofs greatly. A relaxation will be 
discussed later. We further assume that the learning rates take the special form 
 if (x,a) = (xt,a); 
S t (----, a ' 
at(x, a) = O, otherwise, 
where St (x, a) is the number of times the state-action pair was visited by the process 
(xs, as) before time step t plus one, i.e. St(x, a) -- 1 + #{ (xs, as) = (x, a), I _< s _< 
1066 C. Szepesrdri 
t ). This assumption could be relaxed too as it will be discussed later. For technical 
reasons we further assume that the absolute value of the random reinforcement 
signals rt admit a common upper bound. Our main result is the following: 
THEOREM 3.1 Under the above conditions the .following relations hold asymptoti- 
cally and with probability one: 
B 
IQt(x,a) - Q*(x,a)l _ tR(_n) (4) 
and 
IQt(x,a) - Q*(x,a)l < B/1glgt 
- t 
for some suitable 
min(x,,) p(x, a) 
bility of (x, a). 
constant B  O. Here R = 
and Pmax - may:(x,.)p(x, a), where p(x, a) 
(5) 
Pmin/Pmax, where Pmin -- 
is the sampling proba- 
Note that if ? _ 1 -Pmax/2pmin then (4) is the slower, while if ? < 1 -Pmax/2pmin 
then (5) is the slower. The proof will be presented in several steps. 
Step 1. Just like in (Littman and SzepesvAri, 1996) (see also the extended version 
(Szepesvi and Littman, 1996)) the main idea is to compare Qt with the simpler 
process 
(t+(x,a) = (1-ct(x,a))Qt(x,a) +ct(x,a)(rt(x,a) +?maxQ*(y,,b)). (6) 
Note that the only (but rather essential) difference between the definition of Qt and 
that of Qt is the appearance of Q* in the defining equation of Qt. Firstly, notice 
that as a consequence of this change the process Qt c, learly converges to Q* and 
this convergence may be investigated along each component (x, a) separately using 
standard stochastic-approximation techniques (see e.g. (Wasan, 1969; Poljak and 
Tsypkin, 1973)). 
Using simple devices one can show that the difference process At(x, a) = ]Qt(x, a) - 
(t (x, a)l satisfies the following inequality: 
zXt+x(x,a) (1-ot(x,a))zXt(x,a) +.ot(x,a)(ll/xtll + llt -Q*ll). (7) 
Here 11. I[ stands for the maximum norm. That is the task of showing the convergence 
rate of Qt to Q* is reduced to that of showing the convergence rate of At to zero. 
Step 2. We simplify the notation by introducing the abstract process whose update 
equation is 
xt+x(i) = i St(i) xt(i) + $t(i---- Ilxtll + et , (S) 
where i e 1,2,...,n can be identified with the state-action pairs, xt with At, 
et with Qt - Q*, etc. We analyze this process in two steps. First we consider 
processes when the "perturbation-term" et is missing. For such processes we have 
the following lemma: 
LEMMA 3.2 Assume that rli,,... ,rh,... are independent random variables with 
a common underlying distribution P(rt = i) = Pi  O. Then the process xt defined 
The Asymptotic Convergence-Rate of Q-learning 1067 
by 
satisfies 
x,+,(i)- { (1 
zt(i) 
,) 
s.(i) x.(i) + s.-ollx. 11.  , = i; 
0 1 
]lxtll = (t/(i-_,)) 
with probability one (w.p. 1), where R = mini Pi/ maxi Pi. 
Proof. (Outline) Let To = 0 and 
Tt*+l = min{ t >_ Tt* lVi = 1... n, 3s = s(i) : V8 = i }, 
i.e. Tt*+l is the smallest time after time Tt* such that during the time interval 
[Tt* + 1,Tt*+l] all the components of xt(.) are "updated" in Equation (9) at least 
once. Then 
XT}++i(i) _< 1  IIxT+xll, (10) 
where S = mi S(i). This inequity holds because if t (i) is the lt time in 
[T + 1, T+l] when the ith component is updated then 
XT}++X (i) 
-- Xt(i)+l(i) -- (1- 1/St}(i))Xt}(i)(i) q-7/St}(i)llxt}(i)(.)l I 
_< (1 - 1/St,(i))llxt(i)(')ll 
1 
 Ilxn+l(')11, 
where it was exploited that Ilxtll is decreasing. 
time yields 
k-1 
xr+l(.) _< II/oll 1-I (1 
j=O 
Now, iterating (10) backwards in 
Now, consider the following approximations: Tt*  Ck, where C >_ 1/Pmi n (C can be 
computed explicitly from {pi}), St,  PmaxTt*+l  Pmax/Pmin(k + 1)  (k + 1)/Pro, 
where R0 = 1/Cpmax. Then, using Large Deviation's Theory, 
( 
t*-i 1 -7 Ro(1 -7) 
II 1 II 1-  
j=o J j+l 
j=O 
Ro(1--7) 
(11) 
holds w.p.1. Now, by defining s = Tt* + 1 so that sic  k we get 
Ro (1-')') Ro (1-')') 
 Ilxoll _< Ilxoll , 
which holds due to the monotonicity of xt and 1/kRo(1-'r) and because R -- 
Pmin/Pmax  Ro. [] 
Step 3. Assume that 7 > 1/2. Fortunately, we know by an extension of the Law of 
the Iterated Logarithm to stochastic approximation processes that the convergence 
1068 
rate of IlOt-Q*I] is O (v/loglogt/t) (the uniform boundedness of the random rein- 
forcement signals must be exploited in this step) (Major, 1973). Thus it is sufficient 
to provide a convergence rate estimate for the perturbed process, xt, defined by (8), 
when et = Cv/loglogt/t for some C > 0. We state that the convergence rate of et 
is faster than that of xt. Define the process 
1_ 7 
zt+ (i) = (1 - St(i)) zt(i), 
zt(i), 
if rh = i; (12) 
if rh  i. 
This process clearly lower bounds the perturbed process, xt. Obviously, the con- 
vergence rate of zt is O(1/t -) which is slower than the convergence rate of et 
provided that ? > 1/2, proving that et must be faster than xt. Thus, asymptoti- 
cally et _< (1/- 7 - 1)xt, and so IIxtll is decreasing for large enough t. Then, by an 
argument similar to that of used in the derivation of (10), we get 
1 -7) 
8/, 
where s/* = mini S/*(i). By some approximation arguments similar to that of Step 2, 
together with the bound (l/n" 2 s"3/2v/loglogs _< s-W2x/loglogs, I > rl > O, 
which follows from the mean-value theorem for integrals and the law of integration 
by parts, we get that xt  O(1/tRO-v)). The case when ? _< 1/2 can be treated 
similarly. 
Step 5. Putting the pieces together and applying them for A t = Qt - Qt yields 
Theorem 3.1. 
4 DISCUSSION AND CONCLUSIONS 
The most restrictive of our conditions is the assumption concerning the sampling 
of (xt, at). However, note that under a fixed learning policy the process (xt,at) 
is a (non-stationary) Markovian process and if the learning policy converges in 
the sense that limt-,c P(atl,t) - P(at I xt) (here r t stands for the history of the 
learning process) then the process (xt, at) becomes eventually stationary Markovian 
and the sampling distribution could be replaced by the stationary distribution of 
the underlying stationary Markovian process. If actions become asymptotically 
optimal during the course of learning then the support of this stationary process 
will exclude the state-action pairs whose action is sub-optimal, i.e. the conditions 
of Theorem 3.1 will no longer be satisfied. Notice that the proof of convergence of 
such processes still follows very similar lines to that of the proof presented here (see 
the forthcoming paper (Singh et al., 1997)), so we expect that the same convergence 
rates hold and can be proved using nearly identical techniques in this case as well. 
A further step would be to find explicit expressions for the constant B of The- 
orem 3.1. Clearly, B depends heavily on the sampling of (xt,at), as well as the 
transition probabilities and rewards of the underlying MDP. Also the choice of har- 
monic learning rates is arbitrary. If a general sequence ct were employed then the 
artificial "time" Tt(x,a) = 1/II}=0(1 -at(x,a)) should be used (note that for the 
harmonic sequence Tt(x, a)  t). Note that although the developed bounds are 
asymptotic in their present forms, the proper usage of Large Deviation's Theory 
would enable us to develope non-asymptotic bounds. 
The Asymptotic Convergence-Rate of Q-learning 1069 
Other possible ways to extend the results of this paper may include Q-learning 
when learning on aggregated states (Singh et al., 1995), Q-learning for alternat- 
ing/simultaneous Markov games (Littman, 1994; Szepesv/ri and Littman, 1996) 
and any other algorithms whose corresponding difference process At satisfies an 
inequality similar to (7). 
Yet another application of the convergence-rate estimate might be the convergence 
proof of some average reward reinforcement learning algorithms. The idea of those 
algorithms follows from a kind of Tauberian theorem, i.e. that discounted sums 
converge to the average value if the discount rate converges to one (see e.g. Lemma 1 
of (Mahadevan, 1994; Mahadevan, 1996) or for a value-iteration scheme relying on 
this idea (Hordjik and Tijms, 1975)). Using the methods developed here the proof 
of convergence of the corresponding Q-learning algorithms seems quite possible. We 
would like to note here that related results were obtained by Bertsekas et al. et. al 
(see e.g. (Bertsekas and Tsitsiklis, 1996)). 
Finally, note that as an application of this result we immediately get that the con- 
vergence rate of the model-based RL algorithm, where the transition probabilities 
and rewards are estimated by their respective averages, is clearly better than that of 
for Q-learning. Indeed, simple calculations show that the law of iterated logarithm 
holds for the learning process underlying model-based RL. Moreover, the exact ex- 
pression for the convergence rate depends explicitly on how much computational 
effort we spend on obtaining the next estimate of the optimal value function, the 
more effort we spend the faster is the convergence. This ,bound thus provides a direct 
way to control the tradeoff between the computational effort and the convergence 
rate. 
Acknowledgements 
This research was supported by OTKA Grant No. F20132 and by a grant provided 
by the Hungarian Educational Ministry under contract no. FKFP 1354/1997. I 
would like to thank Andris Kr/mli and Michael L. Littman for numerous helpful 
and thought-provoking discussions. 
References 
Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena 
Scientific, Belmont, MA. 
Hordjik, A. and Tijms, H. (1975). A modified form of the iterative method of 
dynamic programming. Annals of Statistics, 3:203-208. 
Jaakkola, T., Jordan, M., and Singh, S. (1994). On the convergence of stochastic 
iterative dynamic programming algorithms. Neural Computation, 6(6):1185- 
1201. 
Littman, M. (1994). Markov games as a framework for multi-agent reinforcement 
learning. In Proc. of the Eleventh International Conference on Machine Learn- 
ing, pages 157-163, San Francisco, CA. Morgan K'auffman. 
Littman, M. and Szepesvgri, C. (1996). A Generalized Reinforcement Learning 
Model: Convergence and applications. In Int. Conf. on Machine Learning. 
http://iserv.iki.kfid.hu/asl-publs. html. 
1070 C. Szepesvgri 
Mahadevan, S. (1994). To discount or not to discount in reinforcement learning: A 
case study comparing R learning and Q learning. In Proceedings of the Eleventh 
International Conference on Machine Learning, pages 164-172, San Francisco, 
CA. Morgan Kaufmann. 
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algo- 
rithms, and empirical results. Machine Learning, 22(1,2,3):124-158. 
Major, P. (1973). A law of the iterated logarithm for the Robbins-Monro method. 
Studia Scientiarum Mathematicarum Hungatica, 8:95-102. 
Poljak, B. and Tsypkin, Y. (1973). Pseudogradient adaption and training algo- 
rithms. Automation and Remote Control, 12:83-94. 
Puterman, M. L. (1994). Markov Decision Processes -- Discrete Stochastic Dynamic 
Programming. John Wiley & Sons, Inc., New York, NY. 
Ross, S. (1970). Applied Probability Models with Optimization Applications. Holden 
Day, San Francisco, California. 
Singh, S., Jaakkola, T., and Jordan, M. (1995). Reinforcement learning with soft 
state aggregation. In Proceedings of Neural Information Processing Systems. 
Singh, S., Jaakkola, T., Littman, M., and Csaba Szepesvg ri (1997). On the con- 
vergence of single-step on-policy reinforcement-learning algorithms. Machine 
Learning. in preparation. 
Szepesvri, C. and Littman, M. (1996). Generalized Markov Decision Processes: Dy- 
namic programming and reinforcement learning algorithms. Machine Learning. 
in preparation, available as TR CS96-10, Brown Univ. 
Tsitsildis, J. (1994). Asynchronous stochastic approximation and q-learning. Ma- 
chine Learning, 8(3-4):257-277. 
Wasan, T. (1969). Stochastic Approximation. Cambridge University Press, London. 
Watkins, C. (1990). Learning from Delayed Rewards. PhD thesis, King's College, 
Cambridge. QLEARNING. 
