We can hope to improve the approximations by using every appearance of state s in
each single run.
For each state $s$:
1. Run $\pi$ from $s$ for $m$ times, where the $i$-th run is $T_i$.
2. For each run $T_i$ and each appearance of $s$ in it, let $r(s, T_i, j)$ be the reward of $\pi$ in run $T_i$ from the $j$-th appearance of $s$ in $T_i$ until the run ends (reaching state $s_0$).
3. Let the estimated reward of policy $\pi$ starting from $s$ be the average over all appearances of $s$ in all runs:
$$\hat{r}_\pi(s) \;=\; \frac{1}{\sum_{i=1}^{m} k_i} \sum_{i=1}^{m} \sum_{j=1}^{k_i} r(s, T_i, j),$$
where $k_i$ is the number of appearances of $s$ in run $T_i$.
The problem is that the random variables $r(s, T_i, j)$ are dependent for different values of $j$, since they are computed from overlapping suffixes of the same run.
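A minimal sketch of this every-visit estimate, assuming a hypothetical `run_policy(pi, s)` that simulates one run of the policy from $s$ and returns the visited (state, reward) pairs, starting at $s$ and ending when $s_0$ is reached:

```python
def every_visit_estimate(pi, s, m, run_policy):
    """Estimate the reward of policy pi from state s, using every
    appearance of s in each of m simulated runs (steps 1-3 above)."""
    samples = []  # one sample r(s, T_i, j) per appearance of s
    for _ in range(m):
        # Step 1: one run T_i of pi starting from s (assumed to include s itself).
        trajectory = run_policy(pi, s)          # [(state, reward), ...] ending at s_0
        rewards = [r for _, r in trajectory]
        # Step 2: every appearance of s contributes the reward accumulated
        # from that point until the run ends.
        for j, (state, _) in enumerate(trajectory):
            if state == s:
                samples.append(sum(rewards[j:]))
    # Step 3: average over all appearances of s in all runs.
    return sum(samples) / len(samples)
```

Note that samples taken from the same run share suffixes of the same reward sequence, which is exactly the dependence mentioned above.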