Policy Iteration Algorithm
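Input: an MDP and a discount factor $\lambda$, $0 \le \lambda < 1$.
1. (initialization)
   Set $n = 0$ and choose an arbitrary deterministic policy $d_0$.
2. (policy evaluation)
   Find $v_n$ (the value of $d_n$) by solving the linear equations:
   $(I - \lambda P_{d_n}) v = r_{d_n}$
   where $P_d$ and $r_d$ denote the transition matrix and the immediate reward vector induced by policy $d$.
3. (policy improvement)
   Choose the next policy, $d_{n+1}$, to be greedy with respect to $v_n$:
   $d_{n+1} \in \arg\max_{d} \{ r_d + \lambda P_d v_n \}$
   Choose $d_{n+1} = d_n$ if possible, i.e., whenever $d_n$ itself attains the maximum.
4. If $d_{n+1} = d_n$, stop; else set $n = n + 1$ and return to (2).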
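
The four steps above can be sketched in code. What follows is a minimal illustration and not the notes' own implementation. It assumes a finite MDP given as nested lists, with `P[a][s][t]` the probability of moving from state `s` to state `t` under action `a`, and `r[s][a]` the immediate reward. The names `policy_iteration`, `solve_linear`, `lam` (the discount factor) and `tol` are all hypothetical. The tolerance `tol` is an addition used only to implement the "keep $d_n$ if possible" tie-breaking rule under floating-point arithmetic.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(n):
            if i != col:
                f = m[i][col] / m[col][col]
                for c in range(col, n + 1):
                    m[i][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def policy_iteration(P, r, lam, tol=1e-12):
    """Return (policy, value) for a finite discounted MDP (assumed layout:
    P[a][s][t] transition probabilities, r[s][a] rewards, 0 <= lam < 1)."""
    n_states, n_actions = len(r), len(r[0])
    d = [0] * n_states  # step 1: an arbitrary initial deterministic policy
    while True:
        # Step 2 (policy evaluation): solve (I - lam * P_d) v = r_d.
        A = [[(1.0 if s == t else 0.0) - lam * P[d[s]][s][t]
              for t in range(n_states)] for s in range(n_states)]
        b = [r[s][d[s]] for s in range(n_states)]
        v = solve_linear(A, b)
        # Step 3 (policy improvement): act greedily with respect to v,
        # keeping the current action whenever it is already greedy.
        new_d = []
        for s in range(n_states):
            q = [r[s][a] + lam * sum(P[a][s][t] * v[t] for t in range(n_states))
                 for a in range(n_actions)]
            best = max(q)
            new_d.append(d[s] if q[d[s]] >= best - tol else q.index(best))
        # Step 4: stop when the policy no longer changes.
        if new_d == d:
            return d, v
        d = new_d


# Toy example (made up for illustration): action 0 stays, action 1 switches
# state; only staying in state 1 pays a reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch
r = [[0.0, 0.0],
     [1.0, 0.0]]
policy, value = policy_iteration(P, r, lam=0.5)
```

On this toy MDP the loop stops after two evaluations: the initial policy "always stay" is improved once in state 0, and the resulting policy is greedy with respect to its own value. Solving the linear system exactly in step 2 is what distinguishes this scheme from methods that only approximate $v_n$. For larger state spaces, a library solver would replace the hand-written elimination.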
Yishay Mansour
1999-12-18