The Naive Approach
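For each state $s \in S$:
Run the policy $\pi$ from $s$ for $m$ times, where the $i$-th run is $T_i$.
Let $r_i$ be the reward of $T_i$.
Estimate the reward of the policy $\pi$ starting from $s$ by: $\hat{V}^{\pi}(s) = \frac{1}{m} \sum_{i=1}^{m} r_i$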
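The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the function names, the `run_episode` callback, and the toy episode (reward 1 with probability s/10) are all hypothetical stand-ins for running the policy in a real environment.

```python
import random
from typing import Callable, Dict, Hashable, Iterable


def naive_policy_evaluation(
    states: Iterable[Hashable],
    run_episode: Callable[[Hashable, random.Random], float],
    m: int,
    seed: int = 0,
) -> Dict[Hashable, float]:
    """Estimate the policy's reward from each state by averaging m independent runs."""
    if m <= 0:
        raise ValueError("m must be positive")
    rng = random.Random(seed)
    estimates: Dict[Hashable, float] = {}
    for s in states:
        # r_i is the reward of the i-th run T_i started from s.
        rewards = [run_episode(s, rng) for _ in range(m)]
        estimates[s] = sum(rewards) / m
    return estimates


def toy_run(s: int, rng: random.Random) -> float:
    # Hypothetical episode (an assumption for illustration):
    # reward 1 with probability s / 10, otherwise 0.
    return 1.0 if rng.random() < s / 10 else 0.0


estimates = naive_policy_evaluation(states=[2, 5, 8], run_episode=toy_run, m=2000)
```

In this toy setting the true expected reward from state s is s/10, so each entry of `estimates` should land close to 0.2, 0.5 and 0.8 respectively. Note that the cost is m full runs per state, which is what makes the approach naive.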
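The variables $r_i$ are independent since the runs $T_i$ are independent. Let $V^{\pi}(s)$ denote the expected reward of the policy from $s$, and assume each $r_i$ lies in $[0,1]$. By Chernoff's theorem:
$$\Pr\left[\,\left|\frac{1}{m}\sum_{i=1}^{m} r_i - V^{\pi}(s)\right| \ge \lambda\right] \le 2e^{-2\lambda^{2} m}.$$
This implies that, with probability at least $1-\delta$:
$$\left|\frac{1}{m}\sum_{i=1}^{m} r_i - V^{\pi}(s)\right| \le \lambda.$$
For
$$m \ge \frac{1}{2\lambda^{2}} \ln\frac{2}{\delta}$$
the above holds.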
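To get a feel for the number of runs this requires, the following sketch evaluates the sample-size condition numerically. It assumes rewards normalized to $[0,1]$ and the Hoeffding form of the Chernoff bound, $2e^{-2\lambda^2 m}$, for a single state; the constants, the failure probability `delta`, and the helper names `failure_bound` and `runs_needed` are assumptions made for illustration, not taken from the lecture.

```python
import math


def failure_bound(m: int, lam: float) -> float:
    """Upper bound on Pr[|average - expectation| >= lam] for m runs (rewards in [0, 1])."""
    return 2.0 * math.exp(-2.0 * lam ** 2 * m)


def runs_needed(lam: float, delta: float) -> int:
    """Smallest m with 2 * exp(-2 * lam^2 * m) <= delta."""
    if lam <= 0 or not (0 < delta < 1):
        raise ValueError("need lam > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * lam ** 2))


m = runs_needed(lam=0.1, delta=0.05)
print(m)  # 185
```

Under these assumptions, accuracy 0.1 with confidence 95% needs 185 runs from a single state. The count grows as $1/\lambda^2$, so halving the error quadruples the number of runs, while tightening the confidence costs only a logarithmic factor.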
Yishay Mansour
1999-12-16