Reinforcement Learning Methods for 
Continuous-Time Markov Decision 
Problems 
Steven J. Bradtke 
Computer Science Depaxtment 
University of Massachusetts 
Amherst, MA 01003 
bradtke(}cs. umass. edu 
Michael O. Duff 
Computer Science Depaxtment 
University of Massachusetts 
Amherst, MA 01003 
duff cs. umas s. edu 
Abstract 
Semi-Maxkov Decision Problems axe continuous time generaliza- 
tions of discrete time Markov Decision Problems. A number of 
reinforcement leaxning lgorithms have been developed recently 
for the solution of Markov Decision Problems, based on the ideas 
of asynchronous dynamic programming and stochastic approxima- 
tion. Among these are TD(,), Q-leaxning, and Real-time Dynamic 
Programming. After reviewing semi-Maxkov Decision Problems 
and Bellman's optimality equation in that context, we propose l- 
gorithms similax to those named above, adapted to the solution of 
semi-Markov Decision Problems. We demonstrate these algorithms 
by applying them to the problem of determining the optimal con- 
trol for a simple queueing system. We conclude with a discussion 
of circumstances under which these algorithms may be usefully ap- 
plied. 
I Introduction 
A number of reinforcement learning lgorithms based on the ideas of asynchronous 
dynamic programming and stochastic approximation have been developed recently 
for the solution of Markov Decision Problems. Among these axe Sutton's TD(,) 
[10], Watkins' Q-learning [12], and Real-time Dynamic Programming (RTDP) [1, 
394 Steven Bradtke, Michael O. Duff 
3]. These learning alogorithms re widely used, but their domain of application 
has been limited to processes modeled by discrete-time Mrkov Decision Problems 
(MDP's). 
This paper derives analogous algorithms for semi-Mrkov Decision Problems 
(SMDP's) extending the domain of applicability to continuous time. This ef- 
fort was originally motivated by the desire to apply reinforcement learning methods 
to problems of adaptive control of queueing systems, and to the problem of adaptive 
routing in computer networks in prticular. We apply the new algorithms to the 
well-known problem of routing to two heterogeneous servers [7]. We conclude with 
a discussion of circumstances under which these algorithms may be usefully applied. 
2 Semi-Markov Decision Problems 
A semi-Maxkov process is a continuous time dynamic system consisting of a count- 
able state set, ., and a finite action set, .A. Suppose that the system is originally 
observed to be in state a  ., and that action a  .A is applied. A semi-Maxkov 
process [9] then evolves as follows: 
 The next state, y, is chosen according to the transition probabilities Pt(a) 
 A rewaxd rate p(a, a) is defined until the next transition occurs 
 Conditional on the event that the next state is y, the time until the tran- 
sition from a to y occurs has probability distribution Ft(-la ) 
One form of the SMDP is to find a policy the minimizes the expected infinite horizon 
discounted cost, the "value" for each state: 
{fo 
where a(t) and a(t) denote, respectively, the state and action at time t. 
For a fixed policy r, the value of a given state a must satisfy 
 P(r()) 0  e-tV(y)dF(tlr()). (1) 
Defining 
R(a:, y, a) - e-t' p(a, r(a))dsdF=(tlr(a)), 
the expected rewaxd that will be received on transition from state a to state y on 
action a, and 
'(, , a) - e-t dF=lr() ), 
Reinforcement Learning Methods for Continuous-Time Markov Decision Problems 395 
the expected discount factor to be applied to the value of state y on transition 
from state z on action a, it is clear that equation (1) is nearly identical to the 
value-function equation for discrete time Markov reward processes, 
= + ' (2) 
where R(, a) = y],, P,(a)R(, y, a). If transition times are identically one for 
an SMDP, then a standard discrete-time MDP results. 
Similarly, while the value function associated with an optimal policy for an MDP 
satisfies the Bellman optimality equation 
V(a)=max{R(a:,a)+7tSP,t(a)V*(Y) } , 
(3) 
the optimal value function for an SMDP satisfies the following version of the Bellman 
optimality equation: 
V* (:) = max Z P:(a) e -'p(:, a)dsdF:(tla ) + 
f' (4) 
3 Temporal Difference learning for SMDP's 
Sutton's TD(0) [10] is a stochastic approximation method for finding solutions to 
the system of equations (2). Having observed a transition from state a to state y 
with sample reward r(a, y, a'(a:)), TD(0) updates the value function estimate 
in the airection of the sample value r(a, y, r(a:)) +7V(k)(y). The TD(0) update rule 
for MDP's is 
V(+)(z) - V(k)(z)+a[r(z,y,r(z))+yV()(y) - V()(z)], (5) 
where ak is the learning rate. The sequence of value-function estimates generated 
by the TD(0) proceedure will converge to the true solution, V, with probability 
one [5,8,11] under the appropriate conditions on the a and on the definition of the 
MDP. 
The TD(0) learning rule for SMDP's, intended to solve the system of equations (1) 
given a sequence of sampled state transitions, is: 
V('+I)(z): V(')(z)+ a, [1--e-ttrr(z,y ' r(z))+ e-ttrV(')(y)- V(')(z)] (6) 
fi ' 
where the sampled transition time from state z to state y was r time units, 
-*-Z'r(z,y,r(z)) is the sample reward received in r time units, and e - is the 
sample discount on the value of the next state given a transition time of r time 
units. The TD(,) learning rule for SMDP's is straightforward to define from here. 
396 Steven Bradtke, Michael O. Duff 
4 Q-learning for SMDP's 
Denardo [6] and Watkins [12] define Q,r, the Q-function corresponding to the policy 
71', as 
Q.(, a) = R(, a) +   p,()V.(y) (7) 
Notice that a can be any action. It is not necesarily the action r(z) that would be 
chosen by policy r. The function Q* corresponds to the optimal policy. Q(z, a) 
represents the total discounted return that can be expected if any action is taken 
from state z, and policy r is followed thereafter. Equation (7) can be rewritten as 
Q,(a:,a) = R(,a)+- Z Pzt(a)Q'(Y,r(Y)) , (8) 
and Q* satisfies the Bellman-style optimality equation 
Q'(, ) = R(, ) +,  a,() m= Q'(y, a'), 
Q-learning, first described by Watkins [12], uses stochastic approximation to itera- 
tively refine an estimate for the function Q*. The Q-learning rule is very similar to 
TD(0). Upon a sampled transition from state a to state y upon selection of a, with 
sampled reward r(a, y, a), the Q-function estimate is updated according to 
Q('+)(:,a) = Q(')(:,a) + a, [r(:,y,a) + 7m,axQ(')(y,a') - Q(')(:r,a)] . (10) 
Q-functions may also be defined for SMDP's. The optimal Q-function for an SMDP 
satisfies the equation 
Z Py(a) e -' max q'(y,a')day(tla ). (11) 
This leads to the foowing Q-learning rule for SMDP's: 
q(+)(z, a) = q()(, a)+a [ 1 - e 
r(:,y,a) + e -r maxQ(k)(y,a ') - Q(k)(:r, a)] 
(12) 
5 RTDP and Adaptive RTDP for SMDP's 
The TD(0) and Q-learning algorithms are model-free, and rely upon stochastic 
approximation for asymptotic convergence to the desired function (V and Q*, re- 
spectively). Convergence is typically rather slow. Real-Time Dynamic Program- 
ming (RTDP) and Adaptive RTDP [1,3] use a system model to speed convergence. 
Reinforcement Learning Methods for Continuous-Time Markov Decision Problems 397 
RTDP assumes that a system model is known a priori; Adaptive RTDP builds a 
mode] as it interacts with the system. As discussed by Baxto et al. [1], these asyn- 
chronous DP algorithms can have computational advantages over traditional DP 
algorithms even when a system mode] is given. 
Inspecting equation (4), we see that the model needed by RTDP in the SMDP 
domain consists of three paxts: 
1. the state transition probabilities Pt(a), 
2. the expected reward on transition from state a to state y using action a, 
R(a, y, a), and 
3. the expected discount factor to be applied to the value of the next state on 
transition from state a to state y using action a, 7(a, y, a). 
If the process dynamics axe governed by a continuous time Maxkov chain, then the 
model needed by RTDP can be analytically derived through uniformization [2]. In 
general, however, the model can be very difficult to analytically derive. In these 
cses Adaptive RTDP can be used to incrementally build a system model through 
direct interaction with the system. One version of the Adaptive RTDP algorithm 
for SMDP's is described in Figure 1. 
1 Set k = 0, and set a:0 to some start state. 
2 Initialize l b, ft, and . 
3 repeat forever { 
4 For all actions a, compute 
5 
6 
7 
8 
Q(')(x.,a) = P...(a) [ + ] 
Perform the update V('+)(,) = mlno(. A 
Select an action, a. 
Perform a and observe the transition to x+ after ' time units. Update 
P. Use the sample rewd -e- 
 r(x,x+l,a) and the sample scot 
factor e - to update 
k=k+l 
Figure 1: Adaptive RTDP for SMDP's. P, R, and  are the estimates maintained 
by Adaptive RTDP of P, R, and 7. 
Notice that the action selection procedure (line 6) is left unspecified. Unlike RTDP, 
Adaptive RTDP can not always choose the greedy action. This is because it only has 
an estimate of the system model on which to base its decisions, and the estimate 
could initially be quite inaccurate. Adaptive RTDP needs to explore, to choose 
actions that do not currently appear to be optimal, in order to ensure that the 
estimated model converges to the true model over time. 
398 Steven Bradtke, Michael O. Duff 
6 Experiment: Routing to two heterogeneous servers 
Consider the queueing system shown in Figure 2. Arrivals are assumed to be Poisson 
with rate .X. Upon arrival, a customer must be routed to one of the two queues, 
whose servers have service times that axe exponentially distributed with parameters 
/ and/a respectively. The goal is compute a policy that minimizes the objective 
function: 
 {fo + ca-a)ldt }, 
where c, and ca are scalar cost factors, and n,(t) and ha(t) denote the number of 
customers in the respective queues at time 6. The pair (n(t),na(t)) is the state of 
the system at time t; the state space for this problem is countably infinite. There 
axe two actions available at every state: if an arrival occurs, route it to queue 1 or 
route it to queue 2. 
Figure 2: Routing to two queueing systems. 
It is known for this problem (and many like it [7]), that the optimal policy is a 
threshold policy; i.e., the set of states S1 for which it is optimal to route to the 
first queue is characterized by a monotonically nondecreasing threshold function F 
via S = {(n,na)ln _< F(na)}. For the case where c = ca = 1 and 
the policy is simply to join the shortest queue, and the theshold function is a line 
slicing diagnonally through the state space. 
We applied the SMDP version of Q-learning to this problem in an attempt to find 
the optimal policy for some subset of the state space. The system parameters were 
set to .X -/:/a = 1, fl= 0.1, and c = ca: 1. We used a feedforward neural 
network trained using backpropagation as a function approximator. 
Q-learning must take exploratory actions in order to adequately sample all of the 
available state transitions. At each decision time k, we selected the action ak to be 
applied to state ak via the Boltzmann distribution 
Pr{a - a} - 
where is the "computational temperature." The temperature is initialized to a 
relatively high value, resulting in a uniform distribution for prospective actions. T 
is gradually lowered as computation proceeds, raising the probability of selecting 
actions with lower (and for this application, better) Q-values. In the limit, the action 
that is greedy with respect to the Q-function estimate is selected. The temperature 
and the learning rate a axe decreased over time using a "search then converge" 
method [4]. 
Reinforcement Learning Methods for Continuous-Time Markov Decision Problems 399 
Figure 3 shows the results obtained by Q-learning for this problem. Each square 
denotes a state visited, with r,(t) running along the a-axis, and r*2(t) along the t- 
axis. The color of each square represents the probability of choosing action 1 (route 
arrivals to queue 1). Black represents probability 1, white represents probability 0. 
An optimal policy would be black above the diagonal, white below the diagonal, 
and could have arbitrary co]ors along the diagonal. 
ffiDDDDDDDDO 
DDDDDDDDDD 
B 
!!!! 
!!!!!!!! !! 
!!!!!!!!!!i 
lll!ll! 
BBDDDDDDDD 
DDDDDDDDDD 
DDDDDDDDDDDD 
Figure 3: Results of the Q-learning experiment. Panel A represents the policy after 
50,000 total updates, Panel B represents the policy after 100,000 total updates, and 
Panel C represents the policy after 150,000 total updates. 
One unsatisfactory feature of the algorithm's performance is that convergence is 
rather slow, though the schedules governing the decrease of Boltzmann temperature 
Tk and learning rate ak involve design parameters whose tweakings may result in 
faster convergence. If it is known that the optimal policies are of theshold type, 
or that some other structural property holds, then it may be of extreme practical 
utility to make use of this fact by constraining the value-functions in some way or 
perhaps by representing them as a combination of appropriate basis vectors that 
implicity realize or enforce the given structural property. 
7 Discussion 
In this paper we have proposed extending the applicability of well-known reinforce- 
ment learning methods developed for discrete-time MDP's to the continuous time 
domain. We derived semi-Markov versions of TD(0), Q-learning, RTDP, and Adap- 
tive RTDP in a straightforward way from their discrete-time analogues. While we 
have not given any convergence proofs for these new algorithms, such proofs should 
not be difficult to obtain if we limit ourselves to problems with finite state spaces. 
(Proof of convergence for these new algorithms is complicated by the fact that, in 
general, the state spaces involved are infinite; convergence proofs for traditional 
reinforcement learning methods assume the state space is finite.) Ongoing work 
is directed toward applying these techniques to more complicated systems, exam- 
ining distributed control issues, and investigating methods for incorporating prior 
400 Steven Bradtke, Michael O. Duff 
knowledge (such as structured function approximators). 
A cknowle dgement s 
Thanks to Professor Andrew Barto, Bob Crites, and to the members of the Adaptive 
Networks Laboratory. This work was supported by the National Science Foundation 
under Grant ECS-9214866 to Professor Barto. 
References 
[11] 
[12] 
[1] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time 
dynamic programming. Artificial Intelligence. Accepted. 
[2] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. 
Prentice Hall, Englewood Cliffs, N J, 1987. 
[3] S. J. Bradtke. Incremental Dynamic Programming for On-line Adaptive Opti- 
mal Control. PhD thesis, University of Massachusetts, 1994. 
[4] C. Darken, J. Chang, and J. Moody. Learning rate schedules for faster stochas- 
tic gradient search. In Neural Networks for Signal Processing  -- Proceedings 
of the 199 IEEE Workshop. IEEE Press, 1992. 
[5] P. Dayan and T. J. Sejnowski. Td(,): Convergence with probability 1. Machine 
Learning, 1994. 
[6] E. V. Denardo. Contraction mappings in the theory underlying dynamic pro- 
gramming. SIAM Review, 9(2):165-177, April 1967. 
[7] B. Hajek. Optimal control of two interacting service stations. IEEE-TAC, 
29:491-499, 1984. 
[8] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic 
iterative dynamic programming algorithms. Neural Computation, 1994. 
[9] S. M. Ross. Applied Probability Models with Optimization Applications. Holden- 
Day, San Francisco, 1970. 
[10] R. S. Sutton. Learning to predict by the method of temporal differences. 
Machine Learning, 3:9-44, 1988. 
J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Tech- 
nical Report LIDS-P-2172, Laboratory for Information and Decision Systems, 
MIT, Cambridge, MA, 1993. 
C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge 
University, Cambridge, England, 1989. 
