We can write down an algorithm according to the defined operator:
$$V_{t+1}(s) = (T^\pi V_t)(s) = E\big[r(s,\pi(s)) + \gamma V_t(s')\big].$$
Instead of computing the expectation, we sample the next state $s_{t+1}$ and the reward $r_t$, and derive
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t \big(r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\big),$$
or equivalently,
$$V_{t+1}(s_t) = (1 - \alpha_t) V_t(s_t) + \alpha_t \big(r_t + \gamma V_t(s_{t+1})\big).$$
We'll call
$$d_t = r_t + \gamma V_t(s_{t+1}) - V_t(s_t)$$
the temporal difference (TD), because this expression is the difference between the reward received in the current run, $r_t + \gamma V_t(s_{t+1})$, and the expected reward, $V_t(s_t)$.
This algorithm is called TD(0).
It is easy to see that $E\big[r_t + \gamma V_t(s_{t+1}) \mid s_t = s\big] = (T^\pi V_t)(s)$, and that $E[d_t \mid s_t = s] = 0$ for every $s$ if and only if $V_t = V^\pi$.
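To make the update concrete, here is a minimal tabular TD(0) sketch in Python; the environment interface (env.reset(), env.step(a) returning the next state, the reward, and a termination flag), the policy callable, and the constant step size are assumptions made for this illustration, not part of the notes.

def td0(env, policy, num_episodes=500, gamma=0.99, alpha=0.1):
    # Tabular TD(0) policy evaluation (illustrative sketch).
    V = {}  # V[s]: current estimate of the value of state s; unseen states default to 0
    for _ in range(num_episodes):
        s = env.reset()                      # sample an initial state
        done = False
        while not done:
            a = policy(s)                    # act with the fixed policy being evaluated
            s_next, r, done = env.step(a)    # sample next state and reward
            v_next = 0.0 if done else V.get(s_next, 0.0)
            d = r + gamma * v_next - V.get(s, 0.0)   # temporal difference d_t
            V[s] = V.get(s, 0.0) + alpha * d         # V(s) <- V(s) + alpha * d_t
            s = s_next
    return V

A constant step size is used here only for simplicity; the convergence theorem below requires a decaying step-size schedule.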
The following theorem is used to show convergence of the
algorithm:
Theorem 8.1
Consider the iterative algorithm
$$r_{t+1}(i) = \big(1 - \alpha_t(i)\big)\, r_t(i) + \alpha_t(i)\,\big((H r_t)(i) + w_t(i)\big)$$
such that
1) $E\big[w_t(i) \mid \mathcal{F}_t\big] = 0$ and $E\big[w_t^2(i) \mid \mathcal{F}_t\big] \le A + B \|r_t\|^2$ for some constants $A$ and $B$;
2) the step size $\alpha_t(i)$ has the property that $\sum_{t=0}^{\infty} \alpha_t(i) = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2(i) < \infty$ (an example follows the theorem);
3) $H$ is a contraction mapping.
Then $r_t$ converges to the fixed point $r^*$ of $H$ with probability one, as $t \to \infty$.
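For example, the step sizes $\alpha_t(i) = \frac{1}{t+1}$ satisfy condition 2: $\sum_{t=0}^{\infty} \frac{1}{t+1} = \infty$ (harmonic series), while $\sum_{t=0}^{\infty} \frac{1}{(t+1)^2} < \infty$. A constant step size $\alpha_t(i) = \alpha > 0$ does not, since then $\sum_t \alpha_t^2(i) = \infty$.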
Recall that convergence with probability one means that the sequence $r_t$ has the property that
$$P\Big(\lim_{t \to \infty} r_t = r^*\Big) = 1.$$
The above theorem implies that TD(0) converges with probability one to the correct value function (the same as the MC algorithm).
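In outline, the correspondence is as follows (a sketch of the standard argument, with details omitted): identify $r_t$ in the theorem with the estimate $V_t$, take $H = T^\pi$, which is a $\gamma$-contraction in the max norm with fixed point $V^\pi$, and let the noise be
$$w_t(s_t) = r_t + \gamma V_t(s_{t+1}) - (T^\pi V_t)(s_t).$$
Then $E\big[w_t(s_t) \mid \mathcal{F}_t\big] = 0$, and the variance condition holds when the rewards are bounded; condition 2 requires that every state is visited infinitely often and that the per-state step sizes decay appropriately.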