Conclusions
We saw that both Phased Q-Learning and the
indirect algorithm converge rather rapidly to the optimal
policy as a function of the number of observed transitions. Both
have roughly the same sample complexity, with a slight advantage
to the indirect algorithm. This advantage is rather surprising,
since that sample complexity is not sufficient to
construct a good model of the given MDP.
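The phased update discussed above can be sketched as follows. In each phase, every state-action pair is backed up using the average of m sampled one-step returns under the previous phase's Q-values. The toy MDP, the helper names (sample_next, reward), and the parameter choices below are illustrative assumptions, not notation from the text.

```python
def phased_q_learning(states, actions, sample_next, reward, gamma, m, phases):
    """A minimal sketch of phased Q-learning on a finite MDP.

    sample_next(s, a) draws a next state; reward(s, a) is the expected
    reward. All names and the interface are illustrative assumptions.
    """
    # Q-values initialised to zero for every (state, action) pair.
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(phases):
        new_Q = {}
        for s in states:
            for a in actions:
                # Average m independent one-step backups computed with the
                # previous phase's Q-values -- the core of the phased update.
                total = 0.0
                for _ in range(m):
                    s2 = sample_next(s, a)
                    total += reward(s, a) + gamma * max(Q[(s2, b)] for b in actions)
                new_Q[(s, a)] = total / m
        Q = new_Q
    return Q

# Toy two-state MDP (an illustrative assumption): taking action a moves
# deterministically to state a, and being in state 1 pays reward 1.
states, actions = [0, 1], [0, 1]
Q = phased_q_learning(states, actions,
                      sample_next=lambda s, a: a,
                      reward=lambda s, a: float(s == 1),
                      gamma=0.5, m=3, phases=40)
```

Because each phase is a contraction toward the optimal Q-values, the iterates settle geometrically; with stochastic transitions, the per-phase sample size m is what drives the sample-complexity bounds compared in the text.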
Yishay Mansour
2000-05-30