We can write down an algorithm according to the defined operator:
$$V_{t+1}(s) = (T^\pi V_t)(s) = E\big[r(s,\pi(s)) + \gamma V_t(s')\big].$$
Instead of computing the expectation, we sample the next state $s_{t+1}$ and the reward $r_t$, and derive
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t \big(r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\big),$$
or equivalently,
$$V_{t+1}(s_t) = (1 - \alpha_t) V_t(s_t) + \alpha_t \big(r_t + \gamma V_t(s_{t+1})\big).$$
We'll call
$$d_t = r_t + \gamma V_t(s_{t+1}) - V_t(s_t)$$
the temporal difference (TD), because this expression is the difference between the reward received in the current run, $r_t + \gamma V_t(s_{t+1})$, and the expected reward, $V_t(s_t)$.
This algorithm is called TD(0).
It is easy to see that $E\big[r_t + \gamma V_t(s_{t+1}) \mid s_t = s\big] = (T^\pi V_t)(s)$, and that $E[d_t \mid s_t = s] = 0$ for every $s$ if and only if $V_t = V^\pi$.
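To make the update concrete, here is a minimal tabular TD(0) sketch in Python; the environment interface (env.reset(), env.step(a) returning the next state, the reward, and a termination flag), the policy callable, and the constant step size are assumptions made for this illustration, not part of the notes.

def td0(env, policy, num_episodes=500, gamma=0.99, alpha=0.1):
    # Tabular TD(0) policy evaluation (illustrative sketch).
    V = {}  # V[s]: current estimate of the value of state s; unseen states default to 0
    for _ in range(num_episodes):
        s = env.reset()                      # sample an initial state
        done = False
        while not done:
            a = policy(s)                    # act with the fixed policy being evaluated
            s_next, r, done = env.step(a)    # sample next state and reward
            v_next = 0.0 if done else V.get(s_next, 0.0)
            d = r + gamma * v_next - V.get(s, 0.0)   # temporal difference d_t
            V[s] = V.get(s, 0.0) + alpha * d         # V(s) <- V(s) + alpha * d_t
            s = s_next
    return V

A constant step size is used here only for simplicity; the convergence theorem below requires a decaying step-size schedule.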
The following theorem is used to show convergence of the
algorithm:
Theorem 8.1
Consider the iterative algorithm
$$r_{t+1}(i) = \big(1 - \alpha_t(i)\big)\, r_t(i) + \alpha_t(i)\,\big((H r_t)(i) + w_t(i)\big)$$
such that
1) $E\big[w_t(i) \mid \mathcal{F}_t\big] = 0$ and $E\big[w_t^2(i) \mid \mathcal{F}_t\big] \le A + B \|r_t\|^2$ for some constants $A$ and $B$;
2) the step size $\alpha_t(i)$ has the property that $\sum_{t=0}^{\infty} \alpha_t(i) = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2(i) < \infty$ (an example follows the theorem);
3) $H$ is a contraction mapping.
Then $r_t$ converges to the fixed point $r^*$ of $H$ with probability one, as $t \to \infty$.
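For example, the step sizes $\alpha_t(i) = \frac{1}{t+1}$ satisfy condition 2: $\sum_{t=0}^{\infty} \frac{1}{t+1} = \infty$ (harmonic series), while $\sum_{t=0}^{\infty} \frac{1}{(t+1)^2} < \infty$. A constant step size $\alpha_t(i) = \alpha > 0$ does not, since then $\sum_t \alpha_t^2(i) = \infty$.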
Recall that convergence with probability one means that the sequence $r_t$ has the property that
$$P\Big(\lim_{t \to \infty} r_t = r^*\Big) = 1.$$
The above theorem implies that TD(0) converges with probability one to the correct value function (the same as the MC algorithm).
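In outline, the correspondence is as follows (a sketch of the standard argument, with details omitted): identify $r_t$ in the theorem with the estimate $V_t$, take $H = T^\pi$, which is a $\gamma$-contraction in the max norm with fixed point $V^\pi$, and let the noise be
$$w_t(s_t) = r_t + \gamma V_t(s_{t+1}) - (T^\pi V_t)(s_t).$$
Then $E\big[w_t(s_t) \mid \mathcal{F}_t\big] = 0$, and the variance condition holds when the rewards are bounded; condition 2 requires that every state is visited infinitely often and that the per-state step sizes decay appropriately.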