We can hope to improve the approximations by using every appearance of state s in
each single run.
For each state $s$:
1. Run $\pi$ from $s$ for $m$ times, where the $i$-th run is $T_i$.
2. For each run $T_i$ and each appearance of $s$ in it, let $r(s, T_i, j)$ be the reward of $\pi$ in run $T_i$ from the $j$-th appearance of $s$ in $T_i$ until the run ends (reaching state $s_0$).
3. Let the estimated reward of policy $\pi$ starting from $s$ be the average over all appearances of $s$ in all runs:
$$\hat{r}_\pi(s) \;=\; \frac{1}{\sum_{i=1}^{m} k_i} \sum_{i=1}^{m} \sum_{j=1}^{k_i} r(s, T_i, j),$$
where $k_i$ is the number of appearances of $s$ in run $T_i$.
The problem is that the random variables $r(s, T_i, j)$ are dependent for different values of $j$, since they are computed from overlapping suffixes of the same run.
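A minimal sketch of this every-visit estimate, assuming a hypothetical `run_policy(pi, s)` that simulates one run of the policy from $s$ and returns the visited (state, reward) pairs, starting at $s$ and ending when $s_0$ is reached:

```python
def every_visit_estimate(pi, s, m, run_policy):
    """Estimate the reward of policy pi from state s, using every
    appearance of s in each of m simulated runs (steps 1-3 above)."""
    samples = []  # one sample r(s, T_i, j) per appearance of s
    for _ in range(m):
        # Step 1: one run T_i of pi starting from s (assumed to include s itself).
        trajectory = run_policy(pi, s)          # [(state, reward), ...] ending at s_0
        rewards = [r for _, r in trajectory]
        # Step 2: every appearance of s contributes the reward accumulated
        # from that point until the run ends.
        for j, (state, _) in enumerate(trajectory):
            if state == s:
                samples.append(sum(rewards[j:]))
    # Step 3: average over all appearances of s in all runs.
    return sum(samples) / len(samples)
```

Note that samples taken from the same run share suffixes of the same reward sequence, which is exactly the dependence mentioned above.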