Bounding $\|\hat{Q}_{l_D} - Q^*\|_\infty$
After $l_D$ iterations, the algorithm Phased Q-Learning returns a policy within $\epsilon$ of the optimal policy if

$$\|\hat{Q}_{l_D} - Q^*\|_\infty \le \frac{\epsilon(1-\gamma)}{2} \qquad (5)$$

(due to the definition of $Q^*$).
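For completeness, the step behind (5) is the standard relation between the accuracy of a $Q$-function and the value of its greedy policy $\hat{\pi}(s) = \arg\max_a \hat{Q}(s,a)$ (a lemma assumed from class, not stated in this section):

$$\|V^{\hat{\pi}} - V^*\|_\infty \le \frac{2}{1-\gamma}\,\|\hat{Q} - Q^*\|_\infty,$$

so condition (5) guarantees that the returned greedy policy is $\epsilon$-optimal.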
By the triangle inequality:

$$\|\hat{Q}_{l_D} - Q^*\|_\infty \le \|\hat{Q}_{l_D} - Q_{l_D}\|_\infty + \|Q_{l_D} - Q^*\|_\infty \qquad (6)$$

where $Q_l$ denotes the sequence produced by exact value iteration.
Let us bound each of these two terms separately:
Claim 7.1   Bounding the per-phase sampling error: for every pair $(s,a)$ and every phase $l$,

$$\left|\frac{1}{m}\sum_{i=1}^{m}\hat{V}_l(s_i) - \sum_{j} P^a_{sj}\,\hat{V}_l(s_j)\right| \le \epsilon_1 \qquad (7)$$

with probability at least $1 - 2e^{-2m\epsilon_1^2/V_{max}^2}$.

Proof: Note that

- $\sum_j P^a_{sj}\,\hat{V}_l(s_j)$ is the expectation of the value of the next-state samples received from PS(M) for the pair $(s,a)$ (since the samples are taken from the distribution $P^a_{sj}$);
- each $\hat{V}_l(s_i)$ is bounded (since the immediate return is bounded and $0 \le \hat{V}_l \le V_{max} = R_{max}/(1-\gamma)$);
- the samples are independent and identically distributed.

Therefore, from the Chernoff inequality we obtain

$$\Pr\left[\,\left|\frac{1}{m}\sum_{i=1}^{m}\hat{V}_l(s_i) - \sum_{j} P^a_{sj}\,\hat{V}_l(s_j)\right| > \epsilon_1\,\right] \le 2e^{-2m\epsilon_1^2/V_{max}^2} \qquad (8)$$
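Rearranging (8) shows how large $m$ must be: requiring the failure probability to be at most some $\delta_1$ (a per-pair, per-phase failure probability introduced here only for illustration) gives

$$2e^{-2m\epsilon_1^2/V_{max}^2} \le \delta_1 \quad\Longleftrightarrow\quad m \ge \frac{V_{max}^2}{2\epsilon_1^2}\,\ln\frac{2}{\delta_1}$$

samples from PS(M) for each pair $(s,a)$ in each phase.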
Claim 7.2   Bounding $\|\hat{Q}_l - Q_l\|_\infty$: assuming the bound (7) of Claim 7.1 holds for every pair $(s,a)$ in every phase,

$$\|\hat{Q}_l - Q_l\|_\infty \le \frac{\gamma\,\epsilon_1}{1-\gamma}$$
Proof: First, by induction, we show that:

$$\|\hat{Q}_l - Q_l\|_\infty \le \gamma\,\epsilon_1\,\frac{1-\gamma^l}{1-\gamma} \qquad (9)$$

Induction's base: for $l=0$ both sides of the inequality are zero, and the inequality holds.

Induction's step: assume the claim for $l-1$ and prove it for $l$. Splitting the error at phase $l$ into the sampling error of Claim 7.1 and the error propagated from phase $l-1$ (using $\|\hat{V}_{l-1} - V_{l-1}\|_\infty \le \|\hat{Q}_{l-1} - Q_{l-1}\|_\infty$, since the max operator is non-expansive):

$$\|\hat{Q}_l - Q_l\|_\infty \le \gamma\,\epsilon_1 + \gamma\,\|\hat{Q}_{l-1} - Q_{l-1}\|_\infty \le \gamma\,\epsilon_1 + \gamma\cdot\gamma\,\epsilon_1\,\frac{1-\gamma^{l-1}}{1-\gamma} = \gamma\,\epsilon_1\,\frac{1-\gamma^l}{1-\gamma}.$$
Now that we have established (9), the final bound follows, since $1-\gamma^l \le 1$ for every $l$:

$$\|\hat{Q}_l - Q_l\|_\infty \le \gamma\,\epsilon_1\,\frac{1-\gamma^l}{1-\gamma} \le \frac{\gamma\,\epsilon_1}{1-\gamma} \qquad (10)$$
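As a quick sanity check (an illustration, not part of the original notes), the following Python snippet iterates the per-phase error recursion $e_l = \gamma(\epsilon_1 + e_{l-1})$ from the induction step and compares it against the closed form (9); the values of gamma and eps1 are arbitrary.

# Numerical check of the induction in Claim 7.2 (illustrative values).
gamma, eps1 = 0.9, 0.05  # arbitrary discount factor and per-phase sampling error

e = 0.0  # e_0 = 0: both sides are zero at the base case
for l in range(1, 21):
    e = gamma * (eps1 + e)                                # recursion from the induction step
    closed = gamma * eps1 * (1 - gamma**l) / (1 - gamma)  # closed form (9)
    assert abs(e - closed) < 1e-12                        # recursion matches the closed form

print(e, gamma * eps1 / (1 - gamma))  # e_l approaches the limit bound (10)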
Claim 7.3   Bounding $\|Q_l - Q^*\|_\infty$:

$$\|Q_l - Q^*\|_\infty \le \gamma^l\,V_{max} \qquad (11)$$
Proof: We saw in class that value iteration is a $\gamma$-contraction toward $Q^*$:

$$\|Q_l - Q^*\|_\infty \le \gamma\,\|Q_{l-1} - Q^*\|_\infty \qquad (12)$$

so, unrolling (12) over $l$ phases, $\|Q_l - Q^*\|_\infty \le \gamma^l\,\|Q_0 - Q^*\|_\infty \le \gamma^l\,V_{max}$, and the bound is obtained.
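For concreteness, here is a minimal sketch (in Python) of the phased update whose error the three claims track. It is an illustration under assumed interfaces, not the notes' own pseudocode: sampler(s, a, m), the reward table R, and the tabular representation are hypothetical stand-ins for PS(M) and the notation of the previous section.

import numpy as np

def phased_q_learning(sampler, R, n_states, n_actions, gamma, m, phases):
    """One run of phased Q-learning (sketch).

    sampler(s, a, m) -> array of m next states drawn from P^a_s (stands in for PS(M)).
    R[s, a] is the bounded immediate return.
    """
    Q = np.zeros((n_states, n_actions))          # \hat{Q}_0 = 0, matching the induction base
    for _ in range(phases):                      # phases l = 1, ..., l_D
        V = Q.max(axis=1)                        # \hat{V}_l(s) = max_a \hat{Q}_l(s, a)
        Q_next = np.empty_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                next_states = sampler(s, a, m)   # m i.i.d. samples per pair, as in Claim 7.1
                # The empirical mean replaces sum_j P^a_{sj} \hat{V}_l(s_j);
                # its deviation is the epsilon_1 term bounded in Claim 7.1.
                Q_next[s, a] = R[s, a] + gamma * V[next_states].mean()
        Q = Q_next
    return Q

The per-phase empirical mean is exactly the quantity whose deviation Claim 7.1 controls, and the outer loop accumulates those deviations as analyzed in Claim 7.2.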
Yishay Mansour
2000-05-30