Example: Running Policy Iteration Algorithm

Next: A Comparison between VI Up: Policy Iteration Previous: Convergence of Policy Iteration

Example: Running Policy Iteration Algorithm

(Consider the MDP in figure

)
Let:

$\begin{eqnarray*}\lambda& = &0.95 \end{eqnarray*}$

Step 1:

$\begin{eqnarray*}{d_0}({s_1})& = &a_{12}\\ {d_0}({s_2})& = &a_{21} \end{eqnarray*}$

Policy Evaluation:

$\begin{eqnarray*}{v_0}({s_1})& = &10 + 0.95 v_0({s_2})\\ {v_0}({s_2})& = &-1 +... ...\\ & \Downarrow &\\ {v_0}({s_1}) = -9 &, &\ {v_0}({s_2}) = -20 \end{eqnarray*}$

Policy Improvement:

$\begin{eqnarray*}MAX\{ 5 + 0.475v_0({s_1}) + 0.475v_0({s_2}),\ 10 + 0.95v_0({s_2}) \} = MAX\{ -0.8775,\ -9 \} \end{eqnarray*}$

Therefore:

$\begin{eqnarray*}d_1({s_1})& = &a_{11}\\ d_2({s_2})& = &a_{21}\\ \end{eqnarray*}$

Policy Evaluation:

$\begin{eqnarray*}{v_1}({s_1})& = &5 + 0.475 v_1({s_1}) + 0.475 v_1({s_2})\\ {v... ...\\ &\Downarrow&\\ v_1({s_1}) = -8.571 & , & \ v_1({s_2}) = -20 \end{eqnarray*}$

The next policy improvement step shows that d₂ = d₁, and therefore the algorithm terminates and outputs d₁ as the optimal policy.

Yishay Mansour
1999-12-18