Algorithm
The algorithm is as follows:
1. Define an arbitrary initial value function $V(s)$, and set $e(s) = 0$ for every state $s$.
2. Repeat for every run: set the initial state $s_0$.
3. Repeat for every step $t$ of the run:
   - Choose action $a_t$ according to the policy $\pi$.
   - Perform action $a_t$, receive the next state $s_{t+1}$ and the immediate reward $r_t$.
   - Set $\Delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $e(s_t) := e(s_t) + 1$.
   - For every state $s$: $V(s) := V(s) + \alpha \Delta_t e(s)$, and then $e(s) := \lambda \gamma e(s)$.

   Repeat steps 2,3 until the final state is reached (for every run).
Using this scheme, we update the values of previously visited states according to the new immediate reward and the state we have reached. For more distant states the weight of the update is smaller, and for closer states the weight is larger.
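To make the backward updates concrete, here is a minimal Python sketch of the loop above. The `env.reset()`, `env.step(state, action)` and `policy(state)` interfaces are assumptions made for illustration (they are not part of the notes), and the traces are reset at the start of every run, which is a common convention.

    from collections import defaultdict

    def td_lambda(env, policy, num_runs, alpha=0.1, gamma=0.9, lam=0.5):
        """Backward-view TD(lambda) with accumulating eligibility traces (a sketch)."""
        V = defaultdict(float)                 # step 1: arbitrary initial V (zeros here)
        for _ in range(num_runs):              # step 2: repeat for every run
            e = defaultdict(float)             # e(s) = 0; traces reset each run (assumed convention)
            s = env.reset()                    # initial state s_0
            done = False
            while not done:                    # step 3: repeat for every step of the run
                a = policy(s)                          # choose action according to the policy
                s_next, r, done = env.step(s, a)       # next state and immediate reward
                delta = r + gamma * V[s_next] - V[s]   # temporal difference Delta_t
                e[s] += 1.0                            # increase the trace of the current state
                for state in list(e):                  # update every state by its trace
                    V[state] += alpha * delta * e[state]
                    e[state] *= gamma * lam            # decay the trace
                s = s_next
        return V

Calling it with `lam=0.0` gives the TD(0) scheme discussed next, while `lam=1.0` gives the online analogue of MC.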
Let us examine different values of $\lambda$.

For $\lambda = 0$: $e(s) = 1$ for $s = s_t$ and $e(s) = 0$ otherwise. So we will update only the current state according to the received reward, and we once again have the TD(0) scheme.
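Substituting this trace into the update confirms the claim: only the current state is changed, and it is changed by exactly the TD(0) rule,
\[
V(s_t) := V(s_t) + \alpha \Delta_t = V(s_t) + \alpha \big( r_t + \gamma V(s_{t+1}) - V(s_t) \big),
\]
while $V(s)$ is left unchanged for every $s \neq s_t$.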
For $\lambda = 1$: $e(s)$ is the contribution of the current step to state $s$, i.e., if $s$ was visited during the run at steps $t_1, \ldots, t_k$, then
\[
e(s) = \sum_{i=1}^{k} \gamma^{t - t_i},
\]
where $\gamma^{t - t_i}$ is the contribution of the current step $t$ to the visit at step $t_i$.
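For example, if $s$ was visited at steps $t_1 = 2$ and $t_2 = 5$ and the current step is $t = 7$, then
\[
e(s) = \gamma^{7-2} + \gamma^{7-5} = \gamma^{5} + \gamma^{2},
\]
so the current temporal difference $\Delta_7$ is credited to $s$ with the same discount factors that the reward $r_7$ receives in the returns of those earlier visits.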
Although TD(1) resembles MC, there is a minor difference: MC waits for the end of the run to perform its updates, while TD(1) performs its updates online. By the end of the run both methods give the same value function, but the values computed by TD(1) will hopefully be more accurate during the run, since they are updated online.
Yishay Mansour
2000-01-06