Algorithm
The algorithm is as follows:
1. Define an arbitrary initial value function $V(s)$, and set $e(s) = 0$ for every state $s$.
2. Repeat for every run: set the initial state $s_0$.
3. Repeat for every step $t$ of the run:
   - Choose action $a_t$ according to the policy $\pi$.
   - Perform action $a_t$, receive the next state $s_{t+1}$ and the immediate reward $r_t$.
   - Set $\Delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $e(s_t) := e(s_t) + 1$.
   - For every state $s$: $V(s) := V(s) + \alpha \Delta_t e(s)$, and then $e(s) := \lambda \gamma e(s)$.

   Repeat steps 2,3 until the final state is reached (for every run).
Using this scheme, we update the values of previously visited states according to the new immediate reward and the state we have reached. For more distant states the weight of the update is smaller, and for closer states the weight is larger.
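To make the backward updates concrete, here is a minimal Python sketch of the loop above. The `env.reset()`, `env.step(state, action)` and `policy(state)` interfaces are assumptions made for illustration (they are not part of the notes), and the traces are reset at the start of every run, which is a common convention.

    from collections import defaultdict

    def td_lambda(env, policy, num_runs, alpha=0.1, gamma=0.9, lam=0.5):
        """Backward-view TD(lambda) with accumulating eligibility traces (a sketch)."""
        V = defaultdict(float)                 # step 1: arbitrary initial V (zeros here)
        for _ in range(num_runs):              # step 2: repeat for every run
            e = defaultdict(float)             # e(s) = 0; traces reset each run (assumed convention)
            s = env.reset()                    # initial state s_0
            done = False
            while not done:                    # step 3: repeat for every step of the run
                a = policy(s)                          # choose action according to the policy
                s_next, r, done = env.step(s, a)       # next state and immediate reward
                delta = r + gamma * V[s_next] - V[s]   # temporal difference Delta_t
                e[s] += 1.0                            # increase the trace of the current state
                for state in list(e):                  # update every state by its trace
                    V[state] += alpha * delta * e[state]
                    e[state] *= gamma * lam            # decay the trace
                s = s_next
        return V

Calling it with `lam=0.0` gives the TD(0) scheme discussed next, while `lam=1.0` gives the online analogue of MC.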
Let us examine different values of $\lambda$.

For $\lambda = 0$: $e(s) = 1$ for $s = s_t$ and $e(s) = 0$ otherwise. So we will update only the current state according to the received reward, and we once again have the TD(0) scheme.
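Substituting this trace into the update confirms the claim: only the current state is changed, and it is changed by exactly the TD(0) rule,
\[
V(s_t) := V(s_t) + \alpha \Delta_t = V(s_t) + \alpha \big( r_t + \gamma V(s_{t+1}) - V(s_t) \big),
\]
while $V(s)$ is left unchanged for every $s \neq s_t$.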
For $\lambda = 1$: $e(s)$ is the contribution of the current step to state $s$, i.e., if $s$ was visited during the run at steps $t_1, \ldots, t_k$, then
\[
e(s) = \sum_{i=1}^{k} \gamma^{t - t_i},
\]
where $\gamma^{t - t_i}$ is the contribution of the current step $t$ to the visit at step $t_i$.
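For example, if $s$ was visited at steps $t_1 = 2$ and $t_2 = 5$ and the current step is $t = 7$, then
\[
e(s) = \gamma^{7-2} + \gamma^{7-5} = \gamma^{5} + \gamma^{2},
\]
so the current temporal difference $\Delta_7$ is credited to $s$ with the same discount factors that the reward $r_7$ receives in the returns of those earlier visits.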
Although TD(1) resembles MC, there is a minor difference: MC waits for the end of the run to perform its updates, while TD(1) performs its updates online. By the end of the run both methods give the same value function, but the values computed by TD(1) will hopefully be more accurate during the run, since they are updated online.
Yishay Mansour
2000-01-06