

The discounted value of policy $\pi$

$\vec{v}_{\lambda}^{\pi}=\sum_{t=1}^{\infty}\lambda^{t-1}P_{\pi}^{t-1}r_{d_{t}}$

(where, for deterministic policies, $r_{d_t}(s)=r(s,d_t(s))$ is the immediate reward for taking action $d_t(s)$ in state $s$)
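The series above can be approximated numerically by truncating it at a finite horizon. The following is a minimal sketch, assuming a stationary deterministic policy (the same decision rule $d$ at every step, so that $P_{\pi}^{t-1}$ is the $(t-1)$-th power of a single matrix $P$). The two-state matrix `P`, the reward vector `r`, the discount `lam`, and the helper names are hypothetical values chosen for illustration; none of them come from the notes.

```python
# Hypothetical 2-state example; P, r, and lam are illustrative, not from the notes.

def mat_vec(M, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def discounted_value(P, r, lam, horizon):
    """Truncated sum_{t=1}^{horizon} lam^(t-1) P^(t-1) r for a stationary policy."""
    value = [0.0] * len(r)
    term = list(r)  # holds lam^(t-1) P^(t-1) r, starting at t = 1
    for _ in range(horizon):
        value = [v + x for v, x in zip(value, term)]
        term = [lam * x for x in mat_vec(P, term)]
    return value

P = [[0.5, 0.5],
     [0.0, 1.0]]   # transition matrix under the fixed decision rule d
r = [1.0, 0.0]     # r_d(s) = r(s, d(s))
lam = 0.9          # discount factor, 0 <= lam < 1

v = discounted_value(P, r, lam, horizon=500)
print(v)
```

In this example the second state is absorbing with zero reward, so its value is 0, and the first state satisfies $v(1) = 1 + 0.9 \cdot 0.5 \cdot v(1)$, giving $v(1) = 1/0.55$. The truncated sum should agree with that up to floating-point error.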

Theorem 5.1 Let $Q$ be a matrix such that $\Vert Q\Vert<1$. Then

1. The inverse $(I-Q)^{-1}$ exists.
2. $(I-Q)^{-1} = \lim_{N \rightarrow \infty}\sum_{i=0}^{N}Q^{i}$

(The proof can be found in Puterman's book)
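The second claim of the theorem can be checked numerically on a small case. This is only a sketch under stated assumptions: the matrix `Q` is a hypothetical 2x2 example, the norm is taken to be the maximum absolute row sum (the notes do not say which norm is meant), and all helper names are illustrative.

```python
# Hypothetical 2x2 example; Q is illustrative and has max-row-sum norm below 1.

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B):
    """Entrywise sum of two matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sup_norm(A):
    """Maximum absolute row sum (assumed choice of matrix norm)."""
    return max(sum(abs(a) for a in row) for row in A)

def neumann_partial_sum(Q, N):
    """Return sum_{i=0}^{N} Q^i."""
    n = len(Q)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total, power = identity, identity
    for _ in range(N):
        power = mat_mul(power, Q)      # Q^i for i = 1, ..., N
        total = mat_add(total, power)
    return total

def inverse_2x2(A):
    """Closed-form inverse of a 2x2 matrix with nonzero determinant."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

Q = [[0.45, 0.45],
     [0.0, 0.9]]
I_minus_Q = [[1.0 - Q[0][0], -Q[0][1]],
             [-Q[1][0], 1.0 - Q[1][1]]]

exact = inverse_2x2(I_minus_Q)
approx = neumann_partial_sum(Q, 500)
max_error = max(abs(x - y)
                for rx, ry in zip(exact, approx)
                for x, y in zip(rx, ry))
print(sup_norm(Q), max_error)
```

Here `Q` equals $0.9$ times a row-stochastic matrix, so its maximum row sum is $0.9 < 1$ and the partial sums should approach the closed-form inverse of $I-Q$ as $N$ grows.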

Yishay Mansour
1999-11-24