
My Results

The article states its theorem without proof, so the core of my project is to supply one. The bounds I obtained depend on |A| (the size of the action set), but under the assumption that |A| is constant, the results are the same:

For an appropriate choice of the parameters $m_D$ and $l_D$, the total number of calls to PS(M) that the Phased-Q-Learning algorithm requires in order to ensure that, with probability at least $1-\delta$, the expected return of the resulting policy is within $\varepsilon$ of that of the optimal policy, is:

$\begin{displaymath}O( (\frac{1}{\varepsilon^2} \cdot \ln{\frac{1}{\varepsilon}})... ...vert A\vert}{\delta})} + \ln\ln{\frac{1}{\varepsilon}}) )\\ \end{displaymath}$ (3)

For an appropriate choice of the parameters $m_I$ and $l_I$, the total number of calls to PS(M) that the indirect algorithm requires in order to ensure that, with probability at least $1-\delta$, the expected return of the resulting policy is within $\varepsilon$ of that of the optimal policy, is:

$O\left( \frac{1}{\varepsilon^2} \cdot \left( \ln\frac{\vert S\vert \cdot \vert A\vert}{\delta} + \ln\ln\frac{1}{\varepsilon} \right) \right)$ (4)
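
To get a feel for how bound (4) scales, the expression inside the $O(\cdot)$ can be evaluated numerically. The sketch below is only an illustration: the function name `indirect_bound_term` is hypothetical, and the constant hidden by the $O(\cdot)$ is assumed to be 1, so the returned number indicates growth with $\varepsilon$, $\delta$, $\vert S\vert$ and $\vert A\vert$ rather than an actual count of calls to PS(M).

```python
import math


def indirect_bound_term(num_states, num_actions, delta, epsilon):
    """Evaluate the expression inside O(.) in bound (4).

    Assumption: the constant hidden by the O-notation is taken to be 1,
    so the result is a scaling indicator, not a real sample count.
    """
    if num_states < 1 or num_actions < 1:
        raise ValueError("num_states and num_actions must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if not 0.0 < epsilon < 1.0:
        # ln ln(1/epsilon) is only defined when 1/epsilon > 1.
        raise ValueError("epsilon must lie in (0, 1)")

    # ln(|S| * |A| / delta): the cost of the confidence requirement.
    confidence_term = math.log(num_states * num_actions / delta)
    # ln ln(1 / epsilon): the doubly logarithmic dependence on accuracy.
    accuracy_term = math.log(math.log(1.0 / epsilon))
    return (1.0 / epsilon ** 2) * (confidence_term + accuracy_term)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    for eps in (0.2, 0.1, 0.05):
        value = indirect_bound_term(100, 4, 0.05, eps)
        print(f"epsilon={eps}: {value:.1f}")
```

Halving $\varepsilon$ multiplies the $\frac{1}{\varepsilon^2}$ factor by four, while the logarithmic terms change only slightly, so the accuracy requirement dominates the growth of the bound.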

Yishay Mansour
2000-05-30