Notations


$Q_l$: A (state, action) value function defined by
$Q_{l+1}(s,a) = R_M^a(s) + \gamma \sum_{t\in S} P^a_{st} v_l(t)$, where $v_l(t) = \max_{b\in A} Q_l(t,b)$.
Note that this is the update performed by the Value-Iteration algorithm.
$\widehat{Q}_l$: A (state, action) value function defined by
$\widehat{Q}_{l+1}(s,a) = R_M^a(s) + \gamma \frac{1}{m_D}\sum_{k=1}^{m_D}\widehat{v}_l(t_k)$, where $\widehat{v}_l(t_k) = \max_{b\in A} \widehat{Q}_l(t_k,b)$ and $t_1, \ldots, t_{m_D}$ are the $m_D$ next states observed from $(s,a)$ in the $m_D$ calls to $PS(M)$. Note that this is the update performed by the Phased Q-Learning algorithm.
$Q^*$ denotes, as usual, the optimal value function.
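
To make the two updates concrete, the following is a minimal Python sketch of one step of each, not an implementation taken from these notes. All names are hypothetical: `R[s][a]` stands for $R_M^a(s)$, `P[s][a][t]` for $P^a_{st}$, and `sample_next_state(s, a)` plays the role of one call to $PS(M)$. The toy two-state MDP at the end is invented purely for illustration.

```python
import random


def value_iteration_step(Q, R, P, gamma):
    """Exact backup: Q_{l+1}(s,a) = R[s][a] + gamma * sum_t P[s][a][t] * v_l(t)."""
    n_states = len(Q)
    v = [max(Q[t]) for t in range(n_states)]  # v_l(t) = max_b Q_l(t, b)
    return [
        [
            R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_states))
            for a in range(len(Q[s]))
        ]
        for s in range(n_states)
    ]


def phased_q_step(Q_hat, R, sample_next_state, gamma, m_D):
    """Sampled backup: the expectation over next states is replaced by the
    average of v_hat over m_D sampled next states t_1, ..., t_{m_D}."""
    v_hat = [max(row) for row in Q_hat]  # v_hat_l(t) = max_b Q_hat_l(t, b)
    new_Q = []
    for s in range(len(Q_hat)):
        row = []
        for a in range(len(Q_hat[s])):
            # m_D calls to the (assumed) generative model for the pair (s, a)
            samples = [sample_next_state(s, a) for _ in range(m_D)]
            row.append(R[s][a] + gamma * sum(v_hat[t] for t in samples) / m_D)
        new_Q.append(row)
    return new_Q


def make_sampler(P, rng):
    """Build a stand-in for PS(M) that draws next states from known P."""
    def sample_next_state(s, a):
        return rng.choices(range(len(P[s][a])), weights=P[s][a])[0]
    return sample_next_state


# Toy MDP (assumed for illustration): 2 states, 2 actions.
R = [[1.0, 0.0], [0.0, 2.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1 under actions 0, 1
]
gamma = 0.9
Q0 = [[0.0, 0.0], [0.0, 0.0]]

Q1 = value_iteration_step(Q0, R, P, gamma)  # equals R, since v_0 = 0
Q2 = value_iteration_step(Q1, R, P, gamma)
Q2_hat = phased_q_step(Q1, R, make_sampler(P, random.Random(0)), gamma, m_D=2000)
```

In this sketch both steps start from the same table `Q1`, so `Q2_hat` differs from `Q2` only through sampling error in the entries whose transitions are stochastic. The sampler here draws from the known `P` only so that the two results can be compared; in the setting of these notes the transition probabilities are not given and only the sampled next states are available.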

Yishay Mansour
2000-05-30