Next: Linear Programming Up: No Title Previous: Example: Running Policy Iteration

A Comparison between VI and PI Algorithms

In this section we will compare the convergence rate of the VI and PI algorithms. It will be shown that, assuming that the two algorithms begin with the same approximated value, the PI algorithm converges faster to the optimized value.

Theorem 6.6 Let $\{u_{i}\}$ be the series of values created by the VI algorithm (where u_n+1=Lu_n) and let $\{v_{i}\}$ be the series of values created by PI algorithm. If u₀=v₀, then $\forall n,\ u_{n}\leq v_{n}\leq v^*_\lambda$ .

Proof:We will use induction to prove the theorem.
Induction Basis: We assume that u₀=v₀. v₀ is the return value of a specific policy, and therefore it is clearly $\leq$ the optimal return value. Therefore: $u_{0} \leq v_{0} \leq v^*_\lambda$ .
Induction Step:

$\begin{eqnarray*}u_{n+1} = Lu_{n} = L_{p_{n}}u_{n} \ \ \ \ [Where \ p_{n} \in argmax_{d \in \Pi^{MD}}\{r_{d} + \lambda P_{d}u_{n}\}] \end{eqnarray*}$

From the induction hypothesis $u_n \leq v_n$ , and since L_{p_n} is monotonic it follows that:

$\begin{eqnarray*}L_{p_{n}}u_{n} & \leq &L_{p_{n}}v_{n} \end{eqnarray*}$

Since L is taking the maximum over all policies:

$\begin{eqnarray*}L_{p_{n}}v_{n} & \leq & Lv_{n} \end{eqnarray*}$

We denote the policy determined by PI algorithm in iteration nas d_n and therefore: Lv_n=L_{d_n}v_n

From the Optimality Equations we get:

$\begin{eqnarray*}L_{d_{n}}v_{n} &\leq& L_{d_{n}}v_{\lambda}^{d_{n}} \end{eqnarray*}$

From the definition of v_n+1 we have:

$\begin{eqnarray*}L_{d_{n}}v_{\lambda}^{d_{n}} &=& v_{n+1} \end{eqnarray*}$

And we get $u_{n+1} \leq v_{n+1}$ . In Theorem

it was proven that $v_{n+1} \leq v^*_\lambda$ and therefore $u_{n+1} \leq v_{n+1} \leq v^*_\lambda$ . $\Box$

From Theorem it follows that, assuming the same starting point, PI algorithm requires less stages then VI algorithm to converge to the optimal policy. Yet, it should be noticed that each single stage of PI requires a solution of a set of linear equations (the policy evaluation stage) and therefore it is computationally more expensive than a single stage of VI algorithm.

Next: Linear Programming Up: No Title Previous: Example: Running Policy Iteration

Yishay Mansour
1999-12-18