Error Calculations

Consider the MDP in Figure 7.1. Let N be the length of the run. We choose γ = 1 (undiscounted). It is easy to see that

E[N] = 1/p,

since the distribution of N is geometric ("success" with probability p when moving to s0). Note also that the first-visit return of run j satisfies r(s1,T,j) = N_j, and different runs are independent, which implies that the r(s1,T,j) are independent; hence First Visit gives an unbiased estimate of 1/p.
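The claim E[N] = 1/p is easy to check numerically. The following is a minimal simulation sketch (the two-state structure is taken from Figure 7.1; the function and variable names are mine):

```python
import random

def run_length(p, rng):
    """Simulate one run of the MDP: from s1, each step earns reward 1
    and moves to the terminal state s0 with probability p.
    The return of the run equals its length N."""
    n = 1
    while rng.random() >= p:  # with probability 1 - p we stay in s1
        n += 1
    return n

p = 0.1
rng = random.Random(0)
samples = [run_length(p, rng) for _ in range(200_000)]
mean_n = sum(samples) / len(samples)
print(mean_n)  # close to 1/p = 10
```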
The return from the k-th appearance of s1 (counting from k = 0) until the end of the run is N - k: there are N steps in the run, each of them with reward 1, and the run ends the first time we reach s0. The average over the visits in a single run is therefore

(1/N) · Σ_{k=0}^{N-1} (N - k) = (N + 1)/2.

Hence

E[(N + 1)/2] = (1/p + 1)/2 = (1 + p)/(2p).

Note that this is different from E[N] = 1/p, the value that First Visit estimates.
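The bias can be seen directly in a small experiment. The sketch below (the dynamics of Figure 7.1 are assumed; helper names are mine) compares the average of the single-run First Visit estimate N with the average of the single-run Every Visit estimate (N + 1)/2:

```python
import random

rng = random.Random(1)
p = 0.2

def sample_run_length():
    """Geometric run length: stay in s1 with probability 1 - p each step."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n

runs = [sample_run_length() for _ in range(300_000)]
first_visit = sum(runs) / len(runs)                       # averages to E[N] = 1/p
every_visit = sum((n + 1) / 2 for n in runs) / len(runs)  # averages to (1/p + 1)/2
print(first_visit)  # close to 1/p = 5
print(every_visit)  # close to (1/p + 1)/2 = 3
```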
The difference (the bias) arises because the random variables — the returns of the different visits within a run — are dependent.
The fact that Every Visit is biased does not imply that it is inferior to First Visit as an estimator. Consider using a single run to estimate the return in the above example. For every p < 1, the expected squared error of Every Visit is smaller than that of First Visit; that is, Every Visit produces a biased estimate, but its variance is much smaller. The reason is that Every Visit uses many more samples. Note that when there are several runs, the Every Visit average is computed over all the visits in all the runs, and not within each run separately.
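To make the squared-error claim concrete, the following sketch estimates the expected squared error of each single-run estimator (the dynamics of Figure 7.1 are assumed; the closed-form values in the comments follow from Var(N) = (1 - p)/p² and the bias computed above):

```python
import random

rng = random.Random(2)
p = 0.1
true_value = 1 / p  # V(s1) = E[N]

def sample_run_length():
    """Geometric run length: stay in s1 with probability 1 - p each step."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n

trials = 200_000
se_first = 0.0
se_every = 0.0
for _ in range(trials):
    n = sample_run_length()
    se_first += (n - true_value) ** 2            # single-run First Visit estimate: N
    se_every += ((n + 1) / 2 - true_value) ** 2  # single-run Every Visit estimate: (N + 1)/2
mse_first = se_first / trials
mse_every = se_every / trials
print(mse_first)  # close to Var(N) = (1 - p)/p^2 = 90
print(mse_every)  # close to (1 - p)(2 - p)/(4 p^2) = 42.75
```

The Every Visit error splits into variance Var(N)/4 plus squared bias ((1 - p)/(2p))², which sums to (1 - p)(2 - p)/(4p²) — smaller than (1 - p)/p² for every p in (0, 1).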
Note that in this case the variance can be larger than the expected value when p is small: for example, if p = 0.1 then the expected value is 1/p = 10 while the variance is (1 - p)/p² = 0.9/0.01 = 90.
Yishay Mansour
1999-12-16