Finding the Constants mD and lD

Next: Proof for The Indirect Up: Proof for Phased Q-Learning Previous: Putting It All Together

Finding the Constants m_D and l_D

For parameters $w,\ m_D\ and\ l_D$ s.t.:

1.: $\varepsilon = w \cdot \sum_{i=1}^{l_D}{\gamma^i} + \frac{\gamma^{l_D-1}}{1-\gamma} \left\Vertv_1 - v_0\right\Vert$
2.: $\delta = {l_D} \cdot \vert S\vert \cdot \vert A\vert \cdot 2e^{- 2 {m_D} {w^2}}$

we get, by the above result, that with probability at least $1-\delta$ we have reached an $\varepsilon-optimal$ policy i.e.: $\Pr( \left\Vert\widehat{Q}_{l_D} - Q^*\right\Vert < \varepsilon ) > 1-\delta$
So, let's find such parameters $w,\ m_D\ and\ l_D$ . Note that m_D and l_D must be natural numbers, but it suffices to constrain them to a real value, since taking an upper value won't change the asymptotic complexity. >From (1) we get:

$\begin{eqnarray*}w &=& \frac{ \varepsilon - \frac{\gamma^{l_D-1}}{1-\gamma} \lef... ...}\left\Vertv_1- v_0\right\Vert) }{ \gamma-\gamma^{{l_D}+1} }\\ \end{eqnarray*}$

In order to use Chernoff, we must have w>0,

$\begin{eqnarray*}w>0 &\Leftrightarrow& \frac{ (1-\gamma)\varepsilon - \gamma^{l_... ...gamma)\varepsilon }{ \left\Vertv_1- v_0\right\Vert })} + 1\\ \end{eqnarray*}$

And we get the constraint:

$\begin{displaymath}l_D > log_{\gamma}{(\frac{ (1-\gamma)\varepsilon }{ \left\Vertv_1- v_0\right\Vert })} + 1\\ \end{displaymath}$ (14)

This constraint is fulfilled by setting $l_D = log_{\gamma}{(\frac{ (1-\gamma)\varepsilon }{ \left\Vertv_1- v_0\right\Vert })} + f$ , for any f>1, and in particular, for $f = log_{\gamma}{2}$ ( $log_{\gamma}{2}>1$ since $0<\gamma<1$ ), we get:

$\begin{displaymath} l_D = {\log_{\gamma}}{\frac{ (1-\gamma)\varepsilon }{ \le... ...2(1-\gamma)\varepsilon }{ \left\Vertv_1- v_0\right\Vert }} \end{displaymath}$ (15)

>From (2) we get:

$\begin{displaymath}\ln\delta = \ln{(l_D\cdot \vert S\vert\cdot \vert A\vert\cdot... ... = \ln{(2 l_D \vert S\vert\cdot \vert A\vert)} - 2 {m_D}{w^2} \end{displaymath}$

(16)

Hence:

$\begin{displaymath}m_D = \frac{ \ln{(2 l_D \vert S\vert \cdot \vert A\vert)} - \... ...l_D \vert S\vert \cdot \vert A\vert}{\delta})} } { 2{w^2} } \end{displaymath}$

(17)

By substituting with the w received from (1), we get:

$\begin{eqnarray*}% latex2html id marker 811 m_D &=& \frac{ \ln{(\frac{2 l_D \ver... ...\Vert })} }{\delta})} } { 2 {(1-\gamma)}^4 \varepsilon^2 }\\ \end{eqnarray*}$

Now, let's bound the number of calls to PS(M) i.e. let's bound $m_D \times l_D$ :

So the number of calls to PS(M) that insures with probability at least $1-\delta$ that the received policy in the Phased-Q-Learning algorithm is within $\varepsilon$ distance from the optimal policy is $O( \frac{1}{\varepsilon^2} \cdot \ln{\frac{\vert S\vert\cdot \vert A\vert}{\de... ...silon^2} \cdot \ln{\frac{1}{\varepsilon}} \cdot \ln\ln{\frac{1}{\varepsilon}} )$
$\Box$

Next: Proof for The Indirect Up: Proof for Phased Q-Learning Previous: Putting It All Together

Yishay Mansour
2000-05-30