If we run TD(0) and MC for long enough (and give them enough data), both algorithms converge to the correct value function. However, on a finite input the two algorithms can give different results.
To understand the basic difference between MC and TD(0) we study the following example. Assume that we have performed a series of runs and obtained the following results, where each pair $(s, r)$ records a visit to state $s$ followed by an immediate reward $r$:
1. (A,0) (B,0)
2. (B,1)
3. (B,1)
4. (B,1)
5. (B,1)
6. (B,1)
7. (B,1)
8. (B,0)
We can create an infinite sample from such a finite sample as follows: each time we need to sample a run, we choose one of the eight runs uniformly at random.
We want to estimate $V(A)$ and $V(B)$ using this finite sample. For $V(B)$ we have 8 runs; 6 of them give return 1 and 2 of them give return 0. Our estimate for $V(B)$ is therefore $V(B) = 6/8 = 3/4$.
How should we estimate $V(A)$? There are two reasonable options.
1. We have one run visiting state A, with return 0, so we estimate $V(A) = 0$.
2. We have one run visiting state A, and this run moves from state A to state B. We can therefore assume that we always move from state A to state B; since the immediate reward at A is 0, this implies the estimate $V(A) = V(B) = 3/4$.
Clearly the first approach corresponds to MC. We now show that TD(0) implements the second approach. This means that the TD(0) evaluation is in fact done using the Markovian model produced by inspecting the input: for any state $s$ and action $a$, the next-state distribution is identical to the one observed in the sample.
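To make the contrast concrete, here is a minimal sketch (an illustration, not part of the original notes) that runs both estimators on the eight runs above; the learning rate, iteration count, and the assumption of undiscounted returns ($\gamma = 1$) are illustrative choices:

```python
import random

# The eight runs of the example: each run is a list of (state, reward) pairs.
runs = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# MC: average the observed return from each visited state.
returns = {"A": [], "B": []}
for run in runs:
    rewards = [r for _, r in run]
    for t, (s, _) in enumerate(run):
        returns[s].append(sum(rewards[t:]))   # undiscounted return from time t
mc = {s: sum(g) / len(g) for s, g in returns.items()}

# TD(0): resample the runs uniformly and apply the TD update.
V = {"A": 0.0, "B": 0.0}
alpha, gamma = 0.01, 1.0
random.seed(0)
for _ in range(200_000):
    run = random.choice(runs)
    for t, (s, r) in enumerate(run):
        v_next = V[run[t + 1][0]] if t + 1 < len(run) else 0.0  # terminal value is 0
        V[s] += alpha * (r + gamma * v_next - V[s])

print("MC:   ", mc)  # {'A': 0.0, 'B': 0.75}
print("TD(0):", V)   # approximately {'A': 0.75, 'B': 0.75}
```

Under uniform resampling, the TD(0) estimate of $V(A)$ settles near $3/4$ while MC stays at 0, exactly the two options listed above.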
[Figure: The empirical model.]
Claim 8.2   Given a finite sample of runs $r_1, \ldots, r_m$, if we run TD(0) until it converges (each time sampling one of the runs $r_1, \ldots, r_m$ uniformly at random), then for every state $s$ the resulting estimate $V(s)$ equals the return of the policy $\pi$ in the empirical model.
Definition   Given runs $r_1, \ldots, r_m$, the empirical model is defined as follows:
$$\hat{P}(s' \mid s, a) = \frac{\#(s,a,s')}{\#(s,a)}$$
and
$$\hat{R}(s,a) = \frac{1}{\#(s,a)} \sum_{t:\, s_t = s,\ a_t = a} r_t,$$
where $\#(s,a)$ is the number of times in $r_1, \ldots, r_m$ that we performed action $a$ in state $s$, and $\#(s,a,s')$ is the number of times that, after performing action $a$ in state $s$, we reached state $s'$.
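As an illustration (this helper and its transition format are my own, not from the notes), the counts $\#(s,a)$ and $\#(s,a,s')$ can be accumulated in a single pass over the runs:

```python
from collections import defaultdict

def empirical_model(runs):
    """Build (P_hat, R_hat) from runs of (state, action, reward, next_state) tuples."""
    n_sa = defaultdict(int)      # #(s,a)
    n_sas = defaultdict(int)     # #(s,a,s')
    r_sum = defaultdict(float)   # total immediate reward observed for (s,a)
    for run in runs:
        for s, a, r, s_next in run:
            n_sa[(s, a)] += 1
            n_sas[(s, a, s_next)] += 1
            r_sum[(s, a)] += r
    P_hat = {(s, a, s2): n / n_sa[(s, a)] for (s, a, s2), n in n_sas.items()}
    R_hat = {sa: tot / n_sa[sa] for sa, tot in r_sum.items()}
    return P_hat, R_hat

# The A/B example with a single dummy action; None marks termination.
runs = [[("A", "a", 0, "B"), ("B", "a", 0, None)]] \
     + [[("B", "a", 1, None)]] * 6 + [[("B", "a", 0, None)]]
P_hat, R_hat = empirical_model(runs)
print(P_hat)  # {('A','a','B'): 1.0, ('B','a',None): 1.0}
print(R_hat)  # {('A','a'): 0.0, ('B','a'): 0.75}
```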
It is easy to see from the definition of TD(0) that the updates depend only on successive pairs of states $(s_t, s_{t+1})$ and the immediate rewards $r_t$. When the runs are sampled uniformly, the number of updates at state $s$ whose successor is $s'$ is distributed according to the empirical counts, so at convergence the expected update vanishes and we get
$$V(s) = \hat{R}(s, \pi(s)) + \gamma \sum_{s'} \hat{P}(s' \mid s, \pi(s)) \, V(s').$$
These are the Bellman equations for a model with transitions $\hat{P}$ and rewards $\hat{R}$, which is exactly the empirical model.
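For the A/B sample there is a single (implicit) action, so the empirical Bellman equations can be solved directly (with $\gamma = 1$, as in the example):
$$V(B) = \hat{R}(B) = \tfrac{6}{8} = \tfrac{3}{4}, \qquad V(A) = \hat{R}(A) + \hat{P}(B \mid A)\, V(B) = 0 + 1 \cdot \tfrac{3}{4} = \tfrac{3}{4},$$
in agreement with the second option above, while MC gives $V(A) = 0$.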
Here is the conceptual difference between MC and TD(0): MC considers only the total return of each run from a visited state (information that is available only at the end of the run), while TD(0) assumes that there is an underlying Markovian model and makes use of this assumption.