Another way to look at the MC algorithm
In Lecture 7 we discussed the Monte-Carlo (MC) method for evaluating the reward of a policy. The method performs a number of runs and uses the average of the observed rewards to estimate the policy's reward.
Another way to express the estimate is the following:
\[ V_n(s) = \frac{1}{n} \sum_{m=1}^{n} R_m(s), \]
where $R_m(s)$ is the total reward of the $m$-th run, starting from the first visit to $s$ (given that $s$ was visited in that run).
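The averaging estimate above can be sketched in code. The following is an illustrative first-visit MC evaluator; the toy episode generator `run_episode` and its rewards are made up for the example, not taken from the notes:

```python
def run_episode():
    """Return one run as a list of (state, reward) pairs.
    A deterministic toy chain, purely for illustration."""
    return [("s0", 1.0), ("s1", 2.0)]

def first_visit_mc(num_runs):
    """Estimate V(s) as the average, over runs, of the total reward
    collected from the first visit to s onward (first-visit MC)."""
    totals = {}  # state -> sum of returns R_m(s)
    counts = {}  # state -> number of runs that visited s
    for _ in range(num_runs):
        episode = run_episode()
        seen = set()
        for i, (s, _) in enumerate(episode):
            if s in seen:
                continue  # only the first visit to s counts
            seen.add(s)
            # R_m(s): total reward from the first visit to s to the end of the run
            ret = sum(r for (_, r) in episode[i:])
            totals[s] = totals.get(s, 0.0) + ret
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

V = first_visit_mc(10)
# For this deterministic toy chain, V["s0"] = 3.0 and V["s1"] = 2.0.
```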
Note that
\[ V_n(s) = V_{n-1}(s) + \frac{1}{n} \bigl( R_n(s) - V_{n-1}(s) \bigr). \]
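As a quick sanity check, the incremental update reproduces the running average exactly (a small numeric sketch; the returns are made-up values):

```python
rewards = [4.0, 2.0, 6.0, 0.0]  # hypothetical returns R_1(s), ..., R_4(s)

v = 0.0  # V_0(s); with step size 1/n the initial value is forgotten after n = 1
for n, r in enumerate(rewards, start=1):
    v = v + (1.0 / n) * (r - v)  # V_n = V_{n-1} + (1/n)(R_n - V_{n-1})

# v now equals the plain average of the returns, 3.0
assert abs(v - sum(rewards) / len(rewards)) < 1e-12
```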
We can rewrite this formula as follows:
\[ V_n(s) = V_{n-1}(s) + \alpha_n \bigl( (H V_{n-1})(s) - V_{n-1}(s) + w_n \bigr), \]
where:
- $\alpha_n = \frac{1}{n}$ and $H$ is some nonlinear operator (here $(H V)(s) = E[R_n(s)] = V^{\pi}(s)$), and
- $w_n = R_n(s) - (H V_{n-1})(s)$ is "noise", and
- $E[w_n] = 0$.
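This general update scheme can be simulated directly. In the minimal numeric sketch below, the operator `H` (a one-dimensional contraction with fixed point 2) and the uniform zero-mean noise are made up for illustration; the iterate drifts toward the fixed point of `H` despite the noise:

```python
import random

def H(x):
    """A contracting operator with factor 1/2 and fixed point x* = 2
    (an arbitrary choice for this illustration)."""
    return 0.5 * x + 1.0

random.seed(0)
x = 0.0
for n in range(1, 200001):
    w = random.uniform(-1.0, 1.0)  # zero-mean noise w_n
    alpha = 1.0 / n                # step sizes with sum(alpha_n) divergent, sum(alpha_n^2) finite
    x = x + alpha * (H(x) - x + w)

# x is now close to the fixed point 2.0 of H
```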
Recall the operator $T^{\pi}$ that was introduced to compute the return of a policy,
\[ (T^{\pi} V)(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s)) \, V(s'). \]
We have already shown that $T^{\pi}$ is a contracting operator (in the max norm, with contraction factor $\gamma$).
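The contraction property can be checked numerically. In the sketch below, the two-state MDP under a fixed policy (rewards `r`, transition matrix `P`, discount `gamma`) is a made-up example: applying the policy's Bellman operator to two value vectors shrinks their max-norm distance by at least the discount factor.

```python
# Hypothetical two-state MDP under a fixed policy (made-up numbers):
r = [1.0, -0.5]        # r(s) under the policy's action
P = [[0.9, 0.1],       # P[s][s']: transition probabilities under the policy
     [0.4, 0.6]]
gamma = 0.8            # discount factor

def T(V):
    """Bellman operator for the fixed policy: (T V)(s) = r(s) + gamma * sum_s' P[s][s'] V(s')."""
    return [r[s] + gamma * sum(P[s][t] * V[t] for t in range(2)) for s in range(2)]

def max_norm(U, V):
    return max(abs(u - v) for u, v in zip(U, V))

V1 = [10.0, -3.0]
V2 = [-2.0, 7.0]
# Contraction: ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf
assert max_norm(T(V1), T(V2)) <= gamma * max_norm(V1, V2)
```

Because `gamma < 1`, iterating `T` from any starting vector converges to the unique fixed point, which is the value of the policy.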
Yishay Mansour
2000-01-06