Convergence of Policy Iteration Algorithm

Next: Example: Running Policy Iteration Up: Policy Iteration Previous: Policy Iteration Algorithm

Convergence of Policy Iteration Algorithm

We shall see that when the number of states - |S|, and the number of actions - |A| are finite, there are no two non-consecutive iterations with the same policy, unless we have an optimal policy. Therefore d_n converges to the optimal policy in a finite number of steps. The key to the convergence of d_n is the monotonicity of $\{v_n\}$ .

Claim 6.4 Let v_n, v_n+1 be the values of consecutive iterations of the above algorithm, then $v_n \leq v_{n+1} \leq v_{\lambda}^*$ .

Proof:Let d_n+1 be the policy in the policy improvement step, then

$\begin{eqnarray*}r_{d_{n+1}} + \lambda{P_{d_{n+1}}}{v_n} &\geq& r_{d_n} + \lambda{P_{d_n}}{v_n}\\ &=& v_n\ \ \ \ \ \ (by\ definition\ of\ v_n)\\ \end{eqnarray*}$

Therefore: $r_{d_{n+1}} \geq (I - \lambda{P_{d_{n+1}}}){v_n}$ . Multiplying by $(I - \lambda{P_{d_{n+1}}})^{-1}$ we get:

$\begin{eqnarray*}v_{n+1} = (I - \lambda{P_{d_{n+1}}})^{-1}r_{d_{n+1}} \geq v_n \end{eqnarray*}$

$\Box$

Note: The above claim is valid even when S or A are infinite.

Theorem 6.5 Let S and A be finite, then the Policy Iteration algorithm converges to the optimal policy after at most |A|^|S| iterations.

Proof:Clearly, $\vert A\vert^{\vert S\vert} \geq \vert\Pi^{MD}\vert$ . According to claim , in each step $v_{n+1} \geq v_n$ except for the last step in which v_n+1 = v_n. Therefore no policy $\pi \in \Pi^{MD}$ can appear in two different iterations. Hence the number of iterations $\leq \vert\Pi^{MD}\vert \leq \vert A\vert^{\vert S\vert}$ $\Box$

**Figure:** Example Diagram

Next: Example: Running Policy Iteration Up: Policy Iteration Previous: Policy Iteration Algorithm

Yishay Mansour
1999-12-18