Policy Iteration Algorithm

Input: an MDP and a discount factor $\lambda$, with $0 \le \lambda < 1$

1. Initialize: choose $d_0\in\Pi^{MD}$ and set $n\leftarrow 0$.
2. (Policy evaluation)
Find $v_n$, the value of $d_n$, by solving the linear system
$(I - \lambda P_{d_n})v = r_{d_n}$
3. (Policy improvement)
Choose the next policy $d_{n+1}$ to be greedy with respect to $v_n$:
$d_{n+1}\in \arg\max_{d\in\Pi^{MD}}\{ r_d + \lambda P_d v_n \}$

Set $d_{n+1} = d_n$ if possible, that is, whenever $d_n$ itself attains the maximum.
4. If $d_{n+1} = d_n$, stop;
otherwise set $n\leftarrow n+1$ and return to step 2.
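
The four steps can be sketched in code for a small finite MDP. This is a minimal illustration and not part of the original notes. The data layout (`P[a][s][t]` for transition probabilities, `r[s][a]` for rewards), the helper `solve_linear`, the tolerance `tol` used to decide ties, and the two-state example are all assumptions made for the sketch.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (hypothetical helper)."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for c in range(col, n + 1):
                M[i][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def policy_iteration(P, r, lam, d0=None, tol=1e-12):
    """Policy iteration for a finite discounted MDP.

    P[a][s][t] is the probability of moving from s to t under action a,
    r[s][a] is the immediate reward, lam is the discount factor in [0, 1).
    Returns the final policy (one action index per state) and its value.
    """
    n_actions, n_states = len(P), len(r)
    # Step 1: initial deterministic policy (action 0 everywhere by default).
    d = list(d0) if d0 is not None else [0] * n_states
    while True:
        # Step 2 (policy evaluation): solve (I - lam * P_d) v = r_d.
        A = [[(1.0 if s == t else 0.0) - lam * P[d[s]][s][t]
              for t in range(n_states)] for s in range(n_states)]
        b = [r[s][d[s]] for s in range(n_states)]
        v = solve_linear(A, b)
        # Step 3 (policy improvement): greedy action in every state.
        new_d = []
        for s in range(n_states):
            q = [r[s][a] + lam * sum(P[a][s][t] * v[t] for t in range(n_states))
                 for a in range(n_actions)]
            best = max(q)
            # Keep the current action if it is greedy (up to tol, an assumption).
            new_d.append(d[s] if q[d[s]] >= best - tol else q.index(best))
        # Step 4: stop when the policy no longer changes.
        if new_d == d:
            return d, v
        d = new_d


# Hypothetical example: action 0 stays in place, action 1 moves to the other state.
# Only staying in state 1 is rewarded.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: move
]
r = [
    [0.0, 0.0],  # state 0: reward for stay, move
    [1.0, 0.0],  # state 1: reward for stay, move
]
policy, values = policy_iteration(P, r, lam=0.5)
# Working this example by hand gives policy [1, 0] (move from state 0,
# stay in state 1) with values 1 and 2.
```

Keeping the current action whenever it attains the maximum mirrors the rule "set $d_{n+1} = d_n$ if possible"; without it, ties between equally good actions could make the stopping test in step 4 fail to trigger.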

Yishay Mansour
1999-12-18