Correctness of Value Iteration Algorithm

Next: Convergence of Value Iteration Up: Finding the Optimal Policy: Previous: The Value Iteration Algorithm

Correctness of Value Iteration Algorithm

In the following theorem we show that the algorithm finds an $\epsilon$ -optimal policy in a finite number of steps. Note that the selected policy might in fact be the optimal policy, but we have no way of knowing that in advance.

Theorem 6.1 For the series $\{\vec{v}_{n}\}$ and the policy $\pi_{\epsilon}$ computed by VI the following holds:

1.: $\lim_{n\rightarrow \infty} v_{n} = v_{\lambda}^{*}$
2.: $\exists N, \forall n>N, \Vert v_{n+1}-v_{n}\Vert < \epsilon \cdot \frac{1-\lambda}{2\lambda}$
3.: The policy $\pi_{\epsilon}$ is $\epsilon$ -optimal
4.: If $\Vert v_{n+1}-v_{n}\Vert < \epsilon \cdot \frac{1-\lambda}{2\lambda}$ then $\Vert v_{n+1} - v_{\lambda}^{*}\Vert < \frac{\epsilon}{2}$

Proof:Parts (1) and (2) follow directly from the properties of the series v_n+1 = Lv_n, that were shown in Lecture 5.
For part (3) we assume that $\Vert v_{n+1}-v_{n}\Vert < \epsilon \cdot \frac{1-\lambda}{2\lambda}$ , as is the case when the algorithm stops, and show that $\Vert v_{\lambda}^{\pi_{\epsilon}} - v_{\lambda}^{*}\Vert < \epsilon$ , which would make the policy $\pi_{\epsilon}$ $\epsilon$ -optimal.

$\displaystyle \Vert v_{\lambda}^{\pi_{\epsilon}} - v_{\lambda}^{*}\Vert < \Vert... ...\lambda}^{\pi_{\epsilon}} - v_{n+1}\Vert + \Vert v_{n+1} - v_{\lambda}^{*}\Vert$

(6.1)

We now bound each part of the sum individually:

$\begin{eqnarray*}\Vert v_{\lambda}^{\pi_{\epsilon}} - v_{n+1}\Vert & = & \Vert L... ...{\pi_{\epsilon}} - Lv_{n+1}\Vert + \Vert Lv_{n+1} - v_{n+1}\Vert \end{eqnarray*}$

Since $\pi_{\epsilon}$ is maximal over the actions using v_n+1, it implies that $L_{\pi_{\epsilon}}v_{n+1} = Lv_{n+1}$ and we conclude that:

$\begin{eqnarray*}\Vert v_{\lambda}^{\pi_{\epsilon}} - v_{n+1}\Vert & \leq & \Ver... ...i_{\epsilon}} - v_{n+1}\Vert + \lambda\Vert v_{n+1} - v_{n}\Vert \end{eqnarray*}$

From the inequality it follows:

$\begin{eqnarray*}\Vert v_{\lambda}^{\pi_{\epsilon}} - v_{n+1}\Vert \leq \frac{\lambda}{1-\lambda}\Vert v_{n+1} - v_{n}\Vert \end{eqnarray*}$

For the second part of the sum we derive similarly that:

$\begin{eqnarray*}\Vert v_{n+1} - v_{\lambda}^{*}\Vert \leq \frac{\lambda}{1-\lambda}\Vert v_{n+1} - v_{n}\Vert \end{eqnarray*}$

From which part (4) of the theorem also follows.
Returning to inequality

, it follows:

$\begin{eqnarray*}\Vert v_{\lambda}^{\pi_{\epsilon}} - v_{\lambda}^{*}\Vert \leq \frac{2\lambda}{1-\lambda}\Vert v_{n+1} - v_{n}\Vert < \epsilon \end{eqnarray*}$

Therefore the selected policy $\pi_{\epsilon}$ is $\epsilon$ -optimal. $\Box$

Next: Convergence of Value Iteration Up: Finding the Optimal Policy: Previous: The Value Iteration Algorithm

Yishay Mansour
1999-12-18