The discounted value of policy $\pi$ is

$\vec{v}_{\lambda}^{\pi}=\sum_{t=1}^{\infty}\lambda^{t-1}P_{\pi}^{t-1}r_{d_{t}}$

where, for a deterministic policy, $r_{d_t}(s)=r(s,d_t(s))$ is the immediate reward for the transition from $s$ to $d_t(s)$.
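For a stationary policy the sum above can be evaluated iteratively: $v_{k+1} = r + \lambda P v_k$ converges to $\vec{v}_{\lambda}^{\pi} = (I-\lambda P)^{-1}r$, since $\Vert\lambda P\Vert = \lambda < 1$. A minimal sketch, where the 2-state transition matrix `P`, reward vector `r`, and discount `lam` are made-up illustration values, not from the notes:

```python
# Made-up 2-state chain: lam, P, r are illustrative only.
lam = 0.9
P = [[0.5, 0.5],
     [0.2, 0.8]]
r = [1.0, 0.0]

# Iterate v <- r + lam * P v; the contraction has modulus lam < 1,
# so v converges to (I - lam * P)^{-1} r.
v = [0.0, 0.0]
for _ in range(2000):
    v = [r[i] + lam * sum(P[i][j] * v[j] for j in range(2))
         for i in range(2)]
# v now approximates the discounted value of the stationary policy
```

Solving $(I-\lambda P)v = r$ directly gives the same answer; the iteration is just the partial sums of the series.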

Theorem 5.1   Let $Q$ be a matrix with $\Vert Q\Vert<1$. Then:
1.
$(I-Q)^{-1}$ exists.
2.
$(I-Q)^{-1} = \lim_{N \rightarrow \infty}\sum_{i=0}^NQ^i$.
(The proof can be found in Puterman's book.)
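The limit in part 2 is the Neumann series, and truncating it gives a practical approximation of $(I-Q)^{-1}$. A small sketch with plain Python lists (the matrix `Q` is a made-up example satisfying $\Vert Q\Vert<1$):

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def neumann_inverse(Q, N=200):
    """Approximate (I - Q)^{-1} by the truncated series sum_{i=0}^N Q^i."""
    n = len(Q)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in identity]   # i = 0 term: Q^0 = I
    power = identity
    for _ in range(N):
        power = mat_mul(power, Q)          # Q^{i+1}
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total

# Illustrative matrix with small norm, so the series converges quickly.
Q = [[0.2, 0.1],
     [0.0, 0.3]]
approx = neumann_inverse(Q)
# approx is close to the exact inverse of I - Q = [[0.8, -0.1], [0.0, 0.7]]
```

Since $\Vert Q\Vert<1$, the tail $\sum_{i>N}Q^i$ has norm at most $\Vert Q\Vert^{N+1}/(1-\Vert Q\Vert)$, so the truncation error decays geometrically in $N$.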



Yishay Mansour
1999-11-24