Remarks:
- If we choose Q to be the optimal function, Q = Q*, then Q* is a fixed point of the update, since it satisfies the Bellman optimality equation
  Q*(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) max_{a'} Q*(s',a'),
  so the expected change in the Q-learning update at Q = Q* is zero.
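The fixed-point remark can be checked numerically. Below is a minimal sketch on a small hypothetical MDP (the transition and reward numbers are illustrative, not from the notes): value iteration converges to Q*, and one more Bellman backup leaves it unchanged.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action MDP.
# P[s, a, s'] : transition probabilities, R[s, a] : expected rewards
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def bellman_backup(Q):
    """One application of the Bellman optimality operator."""
    return R + gamma * P @ Q.max(axis=1)

# Compute Q* by repeated backups (they converge, since the operator
# is a gamma-contraction in the max norm).
Q = np.zeros((2, 2))
for _ in range(1000):
    Q = bellman_backup(Q)

# Q* is a fixed point: one more backup changes it by (numerically) nothing,
# so the expected Q-learning update at Q = Q* is zero.
residual = np.max(np.abs(bellman_backup(Q) - Q))
```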
- The Q-learning algorithm is off-policy, since we do not control the policy that performs the actions. In general, an off-policy algorithm does not control the actions it takes.
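A minimal sketch of the off-policy update, assuming a toy environment and a uniformly random behavior policy (both hypothetical): the update bootstraps with the max over next actions, regardless of which action the behavior policy actually takes next.

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap with max over next-state actions,
    independently of the action the behavior policy will take."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

random.seed(0)
Q = defaultdict(float)
actions = [0, 1]
s = 0
for _ in range(100):
    a = random.choice(actions)        # chosen by the (uncontrolled) behavior policy
    s_next = (s + a) % 2              # toy deterministic transition
    r = 1.0 if s_next == 1 else 0.0   # toy reward: reaching state 1 pays 1
    q_learning_update(Q, s, a, r, s_next, actions)
    s = s_next
```

Q-learning still converges toward Q* under such an arbitrary behavior policy, as long as every state-action pair is visited infinitely often.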
- For an on-policy algorithm we can give every action a small probability, so that we reach all states of the MDP.
  For an on-policy algorithm we can hope to achieve rewards that get closer to optimal.
  In the on-policy setting the actions are chosen by the current policy, which is (nearly) greedy with respect to the current Q. This yields the SARSA algorithm.
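The on-policy variant can be sketched as follows, with epsilon-greedy exploration giving every action a small probability (the toy environment and parameter values are illustrative): the update bootstraps with the action actually chosen next by the current policy, which is the defining difference from Q-learning.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Current policy: greedy in Q, with a small exploration probability."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap with the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

random.seed(0)
Q = defaultdict(float)
actions = [0, 1]
s = 0
a = epsilon_greedy(Q, s, actions)
for _ in range(200):
    s_next = (s + a) % 2              # toy deterministic transition
    r = 1.0 if s_next == 1 else 0.0   # toy reward: reaching state 1 pays 1
    a_next = epsilon_greedy(Q, s_next, actions)   # chosen by the current policy
    sarsa_update(Q, s, a, r, s_next, a_next)
    s, a = s_next, a_next
```

Replacing `Q[(s_next, a_next)]` by `max_b Q[(s_next, b)]` in the target would recover the off-policy Q-learning update.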
Yishay Mansour
2000-01-07