TD( ) as generalization of MC and TD(0) methods

Next: Forward Updates Versus Backward Up: TD( ) algorithm Previous: TD( ) algorithm

TD( $\ \gamma$ ) as generalization of MC and TD(0) methods

MC uses overall return of the run to perform update, i.e., $\ R_{t}=\sum_{i=0}^{T}\lambda^{i}r_{t+i}$
TD(0) uses only immediate reward to make update and its update has the form,
$\ R_{t}^{(1)}=r_{t} + \lambda V_{t}(s_{t+1})$
We can think about other options between those two extremes. For example we can use two immediate reward values , i.e. , $\ R_{t}^{(2)}=r_{t} + \lambda r_{t+1} + \lambda^{2} V_{t}(s_{t+2})$
Generally - if n values are in use, we have, $\ R_{t}^{(n)}=r_{t} + \lambda r_{t+1} + ... + \lambda^{n-1} r_{n-1} + \lambda^{n} V_{t}(S_{n})$
Note: If $\ n$ is more than number of steps to the end of the run, we set all the rewards after the run end to 0.
The update is defined as $\ \triangle V_{t}(s_{t}) = \alpha[R_{t}^{(n)}-V_{t}(s_{t})]$
Update can be performed:

1.: online: $\ V_{t+1}(s_{t}) \leftarrow V_{t}(s_{t}) + \triangle V_{t}(s_{t})$
2.: off-line: $\ V(s) \leftarrow V(s) + \sum\triangle V_{t}(s)$

It is easy to see that operator $\ R_{t}^{(n)}$ is $\ \lambda^{n}$ contracting.
$\ [ \Vert R_{t}^{(n)}(V^{(1)}) - R_{t}^{(n)}(V^{(2)})\Vert = \lambda^{n}\Vert V^{(1)} - V^{(2)}\Vert ]$
So we expect to get improvement and reduce error while making iterations.
The central remaining question is choice of $\ n$ . One way to do it is to use weighed average of all possible values.
The method $\ TD(\gamma)$ gives a geometric distribution with parameter $\ \gamma$ over $\ R_{t}^{(n)}$ ,
i.e. weight of $\ R_{t}^{(n)}$ is $\ (1-\gamma) \gamma^{n-1} .$
We define $\ R_{t}^{\gamma}$ to be $\ R_{t}^{\gamma} = (1-\gamma)\sum_{n=1}^{\infty}\gamma^{n-1}R_{t}^{(n)}$

For finite run this expression can be rewritten as :
$\ R_{t}^{\gamma} = (1-\gamma)\sum_{n=1}^{T-1}\gamma^{n-1}R_{t}^{(n)} + \gamma^{T}R_{t}$

1.: If $\ \gamma=0$ then weight of $\ R_{t}^{(1)}$ is 1 and weights of any other $\ R_{t}^{(n)}$ equals 0. This reduces TD(0), which we introduced before.
2.: For $\ \gamma = 1$ we'll get $\ R_{t}^{(1)} = R_{t}$ (total reward) .

Next: Forward Updates Versus Backward Up: TD( ) algorithm Previous: TD( ) algorithm

Yishay Mansour
2000-01-06