Indirect Algorithm
The indirect algorithm works as
follows:
- First, it makes $m_I$ calls to PS(M), obtaining $m_I$ sampled next
states for each state-action pair. As before, $m_I$ is determined later by the
analysis.
- The second stage builds an empirical model of the transition
probabilities from the collected samples:
$$\hat{P}^a_{st} = \frac{\#(s \to t \text{ under } a)}{m_I},$$
where $\#(s \to t \text{ under } a)$ is the number of the $m_I$ samples
for the pair (s, a) whose next state is t. Note that $\hat{P}^a_{st}$,
the transition probability in the empirical model, is an estimate of the
probability of moving from state s to state t when performing action
a in the given MDP M.
- The third stage runs the Value-Iteration algorithm, i.e.
$$\hat{V}_{l+1}(s) = \max_a \Big\{ R(s,a) + \gamma \sum_t \hat{P}^a_{st} \hat{V}_l(t) \Big\},$$
on the empirical model $\hat{P}$ built in the second stage, for $l_I$
iterations, and returns the resulting policy.
Note that the indirect algorithm requires only $m_I$ calls to PS(M) in total, all of them made in the first stage.
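
The three stages above can be sketched in Python as follows. This is a minimal illustration and not the notes' own code. It makes several assumptions: the rewards `R[s][a]` are known and deterministic, there is a discount factor `gamma`, and the returned policy is the greedy one with respect to the final value estimate. The function `parallel_sample` is a hypothetical stand-in for one call to PS(M), simulated here from a true transition table `P` that a real learner would not see.

```python
import random


def parallel_sample(P, rng):
    """Hypothetical stand-in for one PS(M) call: one sampled next state per (s, a)."""
    n_states, n_actions = len(P), len(P[0])
    return [[rng.choices(range(n_states), weights=P[s][a])[0]
             for a in range(n_actions)]
            for s in range(n_states)]


def build_empirical_model(P, m_I, rng):
    """Stages 1 and 2: make m_I calls to PS(M), then turn counts into frequencies."""
    n_states, n_actions = len(P), len(P[0])
    counts = [[[0] * n_states for _ in range(n_actions)]
              for _ in range(n_states)]
    for _ in range(m_I):
        sample = parallel_sample(P, rng)
        for s in range(n_states):
            for a in range(n_actions):
                counts[s][a][sample[s][a]] += 1
    # P_hat[s][a][t] = (number of sampled transitions s -> t under a) / m_I
    return [[[c / m_I for c in counts[s][a]] for a in range(n_actions)]
            for s in range(n_states)]


def q_value(P_hat, R, gamma, V, s, a):
    """One-step lookahead value of action a in state s under the empirical model."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P_hat[s][a], V))


def value_iteration(P_hat, R, gamma, l_I):
    """Stage 3: l_I value-iteration sweeps on the empirical model."""
    n_states, n_actions = len(P_hat), len(P_hat[0])
    V = [0.0] * n_states
    for _ in range(l_I):
        V = [max(q_value(P_hat, R, gamma, V, s, a) for a in range(n_actions))
             for s in range(n_states)]
    # Assumption: the returned policy is greedy with respect to the final V.
    policy = [max(range(n_actions),
                  key=lambda a, s=s: q_value(P_hat, R, gamma, V, s, a))
              for s in range(n_states)]
    return V, policy


def indirect_algorithm(P, R, gamma, m_I, l_I, seed=0):
    """Run all three stages; only build_empirical_model ever samples from P."""
    rng = random.Random(seed)
    P_hat = build_empirical_model(P, m_I, rng)
    V, policy = value_iteration(P_hat, R, gamma, l_I)
    return P_hat, V, policy
```

A call such as `indirect_algorithm(P, R, gamma=0.9, m_I=20, l_I=50)` returns the empirical model, the value estimate, and the policy. Sampling happens only inside `build_empirical_model`, which matches the observation that the algorithm uses $m_I$ calls to PS(M) in total.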
Yishay Mansour
2000-05-30