Indirect Algorithm

The indirect algorithm works as follows:

First, it makes m_I calls to PS(M), obtaining m_I sampled next states for each state-action pair. As before, m_I is determined later by the analysis.
The next step of the indirect algorithm builds an empirical model of the transition probabilities from the collected samples: ${\widehat{P}^a}_{st} = \frac{\#(s \rightarrow_a t)}{m_I}$, where $\#(s \rightarrow_a t)$ is the number of samples in which performing action a in state s led to state t. Thus ${\widehat{P}^a}_{st}$, the transition probability in the empirical model, is an estimate of the probability of moving from state s to state t by performing action a in the given MDP M.
The third stage runs the Value-Iteration algorithm (i.e. $Q_{l+1}(s,a) = {R^a}_M(s) + \gamma \sum_{t \in S}{\widehat{P}^a}_{st}\widehat{v}_l(t)$, with $\widehat{v}_l(t) = \max_{a} Q_l(t,a)$ as in standard Value Iteration) for l_I iterations on the empirical model built in the second stage, and returns the resulting policy.
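
The three stages above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation from the notes: the function names `parallel_sample` and `indirect_algorithm`, the array layout `P[s][a][t]` and `R[s][a]`, the initial value estimate of zero, and the greedy policy extraction with $\widehat{v}_l(t) = \max_a Q_l(t,a)$ are all choices made here for illustration. The true transition table `P` is used only inside the stand-in for PS(M); the algorithm itself sees only the samples.

```python
import random


def parallel_sample(P, rng):
    """Hypothetical stand-in for PS(M): one sampled next state per (s, a)."""
    n_states, n_actions = len(P), len(P[0])
    return [[rng.choices(range(n_states), weights=P[s][a])[0]
             for a in range(n_actions)]
            for s in range(n_states)]


def indirect_algorithm(P, R, gamma, m_I, l_I, seed=0):
    """Sketch of the indirect algorithm; P is used only to simulate PS(M)."""
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])

    # Stage 1: m_I calls to PS(M), counting observed transitions s ->_a t.
    counts = [[[0] * n_states for _ in range(n_actions)]
              for _ in range(n_states)]
    for _ in range(m_I):
        nxt = parallel_sample(P, rng)
        for s in range(n_states):
            for a in range(n_actions):
                counts[s][a][nxt[s][a]] += 1

    # Stage 2: empirical model, P_hat[s][a][t] = #(s ->_a t) / m_I.
    P_hat = [[[c / m_I for c in counts[s][a]] for a in range(n_actions)]
             for s in range(n_states)]

    # Stage 3: l_I iterations of Value Iteration on the empirical model.
    v = [0.0] * n_states  # assumed initial value estimate
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(l_I):
        Q = [[R[s][a] + gamma * sum(P_hat[s][a][t] * v[t]
                                    for t in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        v = [max(Q[s]) for s in range(n_states)]

    # Return the greedy policy with respect to the final Q.
    policy = [max(range(n_actions), key=lambda a: Q[s][a])
              for s in range(n_states)]
    return P_hat, Q, policy


# Toy MDP (made up for illustration): action 0 stays, action 1 switches state;
# only staying in state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
P_hat, Q, policy = indirect_algorithm(P, R, gamma=0.9, m_I=20, l_I=50)
```

In this toy MDP the transitions are deterministic, so the empirical model coincides with the true one; with stochastic transitions the quality of `P_hat` depends on m_I, which is what the analysis quantifies.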

Note that the indirect algorithm requires m_I calls to PS(M).

Yishay Mansour
2000-05-30