The Main Theorem - Bound on the Number of Samples

Theorem 5.1 (Main Theorem)

For an appropriate choice of the parameters $m_D$ and $l_D$, the total number of calls to PS(M) that the Phased-Q-Learning algorithm requires to ensure that, with probability at least $1-\delta$, the expected return of the resulting policy is within $\varepsilon$ of that of the optimal policy is:

\begin{displaymath}O\left( \left(\frac{1}{\varepsilon^2} \cdot \ln{\frac{1}{\varepsilon}}\right) \cdot \left(\ln{\frac{\vert S\vert}{\delta}} + \ln\ln{\frac{1}{\varepsilon}}\right) \right) \end{displaymath} (1)

For an appropriate choice of the parameters $m_I$ and $l_I$, the total number of calls to PS(M) that the indirect algorithm requires to ensure that, with probability at least $1-\delta$, the expected return of the resulting policy is within $\varepsilon$ of that of the optimal policy is:

\begin{displaymath}O\left( \frac{1}{\varepsilon^2} \cdot \left(\ln{\frac{\vert S\vert}{\delta}} + \ln\ln{\frac{1}{\varepsilon}}\right) \right) \end{displaymath} (2)
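
To get a feel for how the two bounds compare, the following Python sketch evaluates the expressions inside the $O(\cdot)$ of (1) and (2). It reads bound (1) as $(\frac{1}{\varepsilon^2} \cdot \ln\frac{1}{\varepsilon}) \cdot (\ln\frac{\vert S\vert}{\delta} + \ln\ln\frac{1}{\varepsilon})$. The function names are hypothetical, and the hidden constant is set to 1 purely for illustration, so the values indicate growth rates rather than actual sample counts.

```python
import math


def _check(eps: float, delta: float, num_states: int) -> None:
    # Both bounds are only meaningful for 0 < eps < 1 and 0 < delta < 1.
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0 and num_states >= 1):
        raise ValueError("need 0 < eps < 1, 0 < delta < 1, num_states >= 1")


def _log_term(eps: float, delta: float, num_states: int) -> float:
    # Shared factor: ln(|S| / delta) + ln ln(1 / eps).
    return math.log(num_states / delta) + math.log(math.log(1.0 / eps))


def phased_q_bound(eps: float, delta: float, num_states: int) -> float:
    """Expression inside O(.) in (1); hidden constant assumed to be 1."""
    _check(eps, delta, num_states)
    return (math.log(1.0 / eps) / eps**2) * _log_term(eps, delta, num_states)


def indirect_bound(eps: float, delta: float, num_states: int) -> float:
    """Expression inside O(.) in (2); hidden constant assumed to be 1."""
    _check(eps, delta, num_states)
    return (1.0 / eps**2) * _log_term(eps, delta, num_states)


if __name__ == "__main__":
    for eps in (0.2, 0.1, 0.01):
        direct = phased_q_bound(eps, delta=0.05, num_states=100)
        indirect = indirect_bound(eps, delta=0.05, num_states=100)
        print(f"eps={eps}: (1) ~ {direct:.1f}, (2) ~ {indirect:.1f}")
```

Under this reading, the two expressions differ exactly by the factor $\ln\frac{1}{\varepsilon}$, so the gap between Phased-Q-Learning and the indirect algorithm grows only logarithmically as $\varepsilon$ shrinks.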