Conclusions
We saw that both Phased Q-Learning and the
indirect algorithm converge rather rapidly to the optimal
policy as a function of the number of observed transitions. Both
have roughly the same sample complexity, with a slight advantage
to the indirect algorithm. This advantage is rather surprising,
since that sample complexity is not sufficient to
construct a good model of the given MDP.
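The phased update discussed above can be sketched as follows. In each phase, every state-action pair is backed up using the average of m sampled one-step returns under the previous phase's Q-values. The toy MDP, the helper names (sample_next, reward), and the parameter choices below are illustrative assumptions, not notation from the text.

```python
def phased_q_learning(states, actions, sample_next, reward, gamma, m, phases):
    """A minimal sketch of phased Q-learning on a finite MDP.

    sample_next(s, a) draws a next state; reward(s, a) is the expected
    reward. All names and the interface are illustrative assumptions.
    """
    # Q-values initialised to zero for every (state, action) pair.
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(phases):
        new_Q = {}
        for s in states:
            for a in actions:
                # Average m independent one-step backups computed with the
                # previous phase's Q-values -- the core of the phased update.
                total = 0.0
                for _ in range(m):
                    s2 = sample_next(s, a)
                    total += reward(s, a) + gamma * max(Q[(s2, b)] for b in actions)
                new_Q[(s, a)] = total / m
        Q = new_Q
    return Q

# Toy two-state MDP (an illustrative assumption): taking action a moves
# deterministically to state a, and being in state 1 pays reward 1.
states, actions = [0, 1], [0, 1]
Q = phased_q_learning(states, actions,
                      sample_next=lambda s, a: a,
                      reward=lambda s, a: float(s == 1),
                      gamma=0.5, m=3, phases=40)
```

Because each phase is a contraction toward the optimal Q-values, the iterates settle geometrically; with stochastic transitions, the per-phase sample size m is what drives the sample-complexity bounds compared in the text.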
Yishay Mansour
2000-05-30