Multi-Grid Methods for Reinforcement 
Learning in Controlled Diffusion Processes 
Stephan Pareigis 
stp@numerik.uni-kiel.de 
Lehrstuhl Praktische Mathematik 
Christian-Albrechts-Universit/t Kiel 
Kiel, Germany 
Abstract 
Reinforcement learning methods for discrete and semi-Markov de- 
cision problems such as Real-Time Dynamic Programming can 
be generalized for Controlled Diffusion Processes. The optimal 
control problem reduces to a boundary value problem for a fully 
nonlinear second-order elliptic differential equation of Hamilton- 
Jacobi-Bellman (HJB-) type. Numerical analysis provides multi- 
grid methods for this kind of equation. In the case of Learning Con- 
trol, however, the systems of equations on the various grid-levels are 
obtained using observed information (transitions and local cost). 
To ensure consistency, special attention needs to be directed to- 
ward the type of time and space discretization during the obser- 
vation. An algorithm for multi-grid observation is proposed. The 
multi-grid algorithm is demonstrated on a simple queuing problem. 
1 Introduction 
Controlled Diffusion Processes (CDP) are the analogy to Markov Decision Problems 
in continuous state space and continuous time. A CDP can always be discretized in 
state space and time and thus reduced to a Markov Decision Problem. Algorithms 
like Q-learning and RTDP as described in [1] can then be applied to produce controls 
or optimal value functions for a fixed discretization. 
Problems arise when the discretization needs to be refined, or when multi-grid 
information needs to be extracted to accelerate the algorithm. The relation of 
time to state space discretization parameters is crucial in both cases. Therefore 
1034 $. Pareigis 
a mathematical model of the discretized process is introduced, which reflects the 
properties of the converged empirical process. In this model, transition probabilities 
of the discrete process can be expressed in terms of the transition probabilities of 
the continuous process. Recent results in numerical methods for stochastic control 
problems in continuous time can be applied to give assumptions that guarantee a 
local consistency condition which is needed for convergence. The same assumptions 
allow application of multi-grid methods. 
In section 2 Controlled Diffusion Processes are introduced. A model for the dis- 
cretized process is suggested in section 3 and the main theorem is stated. Section 4 
presents an algorithm for multi-grid observation according to the results in the pre- 
ceding section. Section 5 shows an application of multi-grid techniques for observed 
processes. 
2 Controlled Diffusion Processes 
Consider a Controlled Diffusion Process (CDP) (t) in some bounded domain f C 
]R n fulfilling the diffusion equation 
dE(t) = b((t),u(t))dt + a((t))dw. (1) 
The control u(t) takes values in some finite set U. 
(cost) for state (t) and control u(t) is 
r(t) = u(t)). (2) 
The control objective is to find a feedback control law 
u(t)=u((t)), (3) 
that minimizes the total discounted cost 
J(x,u) = IE e-tr((t),u(t)dt, (4) 
where IE is the expectation starting in x  f and applying the control law u(.). 
/ > 0 is the discount. 
The transition probabilities of the CDP are given for any initial state x  f and 
subset A c f by the stochastic kernels 
P(x,A) := prob{(t) e AI(0 ) = x,u}. (5) 
It is known that the kernels have the properties 
jf(y - x)P(x, dy) = t . b(x, u) + o(t) (6) 
ifs(y- x)(y - x)rP(x, dy) = t.a(x)a(x) r + o(t). (7) 
For the optimal control it is sufficient to calculate the optimal value function V: 
V(x) :- inf J(x,u). (8) 
The immediate reinforcement 
Multi-Grid Methods for Reinforcement Learning in Diffusion Processes 1035 
Under appropriate smoothness assumptions V is a solution of the Hamilton-Jacobi- 
Bellman (HJB-) equation 
min {aV(x) -/3V(x) + r(x,a)} = O, x  f. (9) 
c U 
Let a(x) -- cr(x)a(x) T be the diffusion matrix, then a, a E U is defined as the 
elliptic differential operator 
n n 
a :_ E aij(X)Ox, Oxj + Ebi(x,a)Ox,. (10) 
i,j=l i=1 
3 A Model for Observed CDP's 
Let fhl be the centers of cells of a cell-centered grid on f with cell sizes h0, hi - 
h0/2, h2 - hi/2, .... For any x E fh, we shall denote by A(x) the cell of x. Let 
At > 0 be a parameter for the time discretization. 
Figure 1: The picture depicts three 
cell-centered grid levels and the trajec- 
tory of a diffusion process. The approx- 
imating value function is represented 
locally constant on each cell. The tri- 
angles on the path denote the posi- 
tion of the diffusion at sample times 
0, At, 2At, 3At, .... Transitions be- 
tween respective cells axe then counted 
in matrices Q, for each control a and 
grid i. 
By counting the transitions between cells and calculating the empirical probabilities 
as defined in (20) we obtain empirical processes on every grid. By the law of 
great numbers the empirical processes will converge towards observed CDPs as 
subsequently defined. 
Definition 1 An observed process n,,at,(t) 
discrete state-space and discrete time) on ftl, 
transition probabilities 
is a Controlled Markov Chain (i.e. 
and interpolation time Ati with the 
:= prob{(Ati) e A(y)l(O ) e A(x), u} 
_ 1 /A Pt'(z'A(y))dz' 
(11) 
where x, y  ftn, and (t) is a solution of (1). Also define the observed reinforcement 
p as 
if A fat, 
(12) 
Phi,Ati (X, U) := h? (x) 0 
1036 $. Pareigis 
On every grid h the respective process h,xt has its own value function V 
By theorem 10.4.1. in Kushner, Dupuis ([5], 1999.) it holds, that 
Vn,,at,(x) - V(x) for all x e , (13) 
if the following local consistency conditions hold. 
Definition 2 Let , = n,t(At) - ,(0). , is called locally consistent 
to a ottion q (1), 
sup I/XCa,at(n/Xt)l 
= b(x,a)At+o(At) (14) 
= a(x)At + o(At) (15) 
--> 0 as h--> O. (16) 
To verify these conditions for the observed CDP, the expectation and variance can 
be calculated. For the expectation we get 
=  P,,/xt, (x, y)(y - x) 
Yh i 
= h?  (Y - x)Pt'(z'A(y))dz' 
yCZh (x) 
(17) 
Recalling properties (6) and (7) and doing a similar calculation for the variance we 
obtain the following theorem. 
Theorem 3 For observed CDPs a,at let hi and Ati be such that 
hi/ Ati -- 0 as Ati -- O. 
(18) 
Furthermore, hi,atl shall be truncated at some radius R, such that R -- 0 for 
hi - 0 and expectation and variance of the truncated process differ only in the 
order o( At) from expectation and variance of h,Ati. Then the observed processes 
hi,At, truncated at R are locally consistent to the diffusion process (.) and therefore 
the value functions Vh,At, converge to the value function V. 
4 Identification by Multi-Grid Observation 
The condition in Theorem 3 provides information as how to choose parameters in 
the algorithm with empirical data. Choose discretization values h0, At0 for the 
coarsest grid 0. At0 should typically be of order I[bllsup/ho. Then choose for the 
grid 0 
space h0 
time At0 
finer grids 
I 2 3 4 5 ... 
(19) 
2 4 8 16 32 ' ' ' 
At 0 At 0 At 0 At 0 At 0 
2 2 4 4 8 ''' 
We may now formulate the algorithm 
The sequences verify the assumption (18). 
for Multi-Grid Observation of the CDP (.). Note that only observation is being 
carried out. The actual calculation of the value function may be done separately 
as described in the next section. The choice of the control is assumed to be done 
Multi-Grid Methods for Reinforcement Learning in Diffusion Processes 1037 
by a separate controller. Let fk be the finest grid, i.e. Atk and h the finest 
discretizations. Let Ut = U att/at = Ux... xU, Att/At times. Q' is a [ft[ x Itl- 
matrix (at  Ut), containing the number of transitions between cells in ft. R' is a 
[ft I-vector containing the empirical cost for every cell in ft. The immediate cost is 
given by the system as rt oo e-atr((t), at)dt. T denotes current time. 
0. Initialize t, Q', R' for all at 6 Ut, I = 0,... ,k 
1. repeat { 
2. choose a = a(T)  U and apply a constantly on IT; T + 
3. T := T + At 
4. for l = 0 to k do { 
5. determine cell xt E ft with (T - Att)  A(xt) 
6. determine cell Yt e ft with (T) e A(yt) 
7. if Ilxk - Y11 - R (truncation radius) then goto 2. else 
8. at := (a(T- Att),a(T + At - Att),... ,a(T- Ate)) 
9. receive immediate cost rt 
10. Qt(xt,yt) := Qt(xt,yt) + 1 
11. R'(xt) := (rt + Rt(xt). -]zent Qt(xt,z)) / (1 + -]zent Qt(xt,z)) 
) (for-do) 
} (repeat) 
Before applying a multi-grid algorithm for the calculation of the value function on 
the basis of the observations, one should make sure that every box has at least 
some data for every control. Especially in the early stages of learning only the two 
coarsest grids f0, fl could be used for computation of the optimal value function 
and finer grids may be added (possibly locally) as learning evolves. 
5 Application of Multi-Grid Techniques 
The identification algorithm produces matrices Qt containing the number of tran- 
sitions between boxes in ft. We will calculate from the matrices Q the transition 
matrices P by the formula 
,x,yft, atUt, l=O,...,k. (20) 
Now we define matrices A and right hand sides f as 
A?, := (th a' - )/att := a?'/art, 
(21) 
where/t = e -zatt. The discrete Bellman equation takes the following form 
min {A'(x) - f' (x)} = 0. 
atUt 
(22) 
The problem is now in a form to which the multi-grid method due to Hoppe, Blofi 
([2], 1989) can be applied. For prolongation and restriction we choose bilinear 
interpolation and full weighted restriction for cell-centered grids. We point out, 
that for any cell x  ft only those neighboring cells shall be used for prolongation 
and restriction for which the minimum in (22) is attained for the same control as 
the minimizing control in x (see [2], 1989 and [3], 1996 for details). On every grid 
1038 $. Pareigis 
fl the defect in equation (22) is calculated and used for a correction on grid fl--1. 
As a smoother nonlinear Gauss-Seidel iteration applied to (22) is used. 
Our approach differs from the algorithm in Hoppe, Blofi ([2], 1989) in the special 
form of the matrices A  in equation (22). The stars are generally larger than 
nine-point, in fact the stars grow with decreasing h although the matrices remain 
sparse. Also, when working with empirical information the relationship between the 
matrices A  on the various grids is based on observation of a process, which implies 
that coarse grid corrections do not always correct the equation of the finest grid 
(especially in the early stages of learning). However, using the observed transition 
matrices A  on the coarse grids saves the computing time which would otherwise 
be needed to calculate these matrices by the Galerkin product (see Hackbusch [4], 
1985). 
6 Simulation with precomputed transitions 
Consider a homogeneous server problem with two servers holding data (Xl, X2) E 
[0, 1] x [0, 1]. Two independent data streams arrive, one at each server. A controller 
has to decide to which server to route. The modeling equation for the stream shall 
be 
dx =b(x,u)dt +cr(x)dw, u e {1,2} (23) 
with 
I 0 
b(x'2)=(- 1 ) or= (0 1) (24) 
The boundaries at Xl = 0 and x2 - 0 are reflecting. The exceeding data on 
either server Xl, x2 > I is rejected from the system and penalized with g(xx, 1) - 
g(1,x2) - 10, g - 0 otherwise. The objective of the control policy shall be to 
minimize 
IF,, e-Bt(Xl (t) + x2(t) + g(xl,x2))dt. (25) 
The plots of the value function show, that in case of high load (i.e. Xl, X 2 close to 
1) a maximum of cost is assumed. Therefore it is cheaper to overload a server and 
pay penalty than to stay close to the diagonal as is optimal in the low load case. 
For simulation we used precomputed (i.e. converged heuristic) transition probabili- 
ties to test the multi-grid performance. The discount/ was set to .7. The multi-grid 
algorithm reduces the error in each iteration by'a factor 0.21, using 5 grid levels 
and a V-cycle and two smoothing iterations on the coarsest grid. For comparison, 
the iteration on the finest grid converges with a reduction factor 0.63. 
7 Discussion 
We have given a condition for sampling controlled diffusion processes such that 
the value functions will converge while the discretization tends to zero. Rigorous 
numerical methods can now be applied to reinforcement learning algorithms in 
continuous-time, continuous-state as is demonstrated with a multi-grid algorithm 
for the HJB-equation. Ongoing work is directed towards adaptive grid refinement 
algorithms and application to systems that include hysteresis. 
Multi-Grid Methods for Reinforcement Learning in Diffusion Processes 1039 
0,6 
Y 
0.4 
0.2 
0,4 0.8 0 2 O t 
X X 
06 08 
Figure 2: Contour plots of the predicted reward in a homogeneous server problem with 
nonlinear costs are shown on different grid levels. On the coarsest 4 x 4 grid a sampling rate 
of one second is used with 9-point-star transition matrices. At the finest grid (64 x 64) a 
1 second is used with observation on 81-point-stars. Inside the egg-shaped 
sampling rate of  
area the value function assumes its maximum. 
References 
[1] 
[2] 
[3] 
A. Barto, S. Bradtke, S. Singh. Learning to Act using Real-Time Dynamic Pro- 
gramming, AI Journal on Computational Theories of Interaction and Agency, 
1993. 
M. Blotl and R. Hoppe. Numerical Computation of the Value Function of Op- 
timally Controlled Stochastic Switching Processes by Multi-Grid Techniques, 
Numer Funct Anal And Optim 10(3+4), 275-304, 1989. 
S. Pareigis. Lernen der L6sung der Bellman-Gleichung durch Beobachtung von 
kontinuierlichen Prozessen, PhD Thesis, 1996. 
W. Hackbusch. Multi-Grid Methods and Applications, Springer-Verlag, 1985. 
H. Kushner and P. Dupuis. Numerical Methods for Stochastic Control Prob- 
lems in Continuous Time, Springer-Verlag, 1992. 
