When the state space is large and we cannot compute $V^*(s)$ exactly, we would like to build an approximation function
$$\hat{V}(s,r),$$
where $r$ is the parameter vector of the approximation architecture.
Let
$$\epsilon = \min_r \|\hat{V}(\cdot,r) - V^*\|_\infty$$
be the minimal distance between the family $\{\hat{V}(\cdot,r)\}$ and $V^*$.
It is clear that $\epsilon$ is a lower bound on the error of any approximation algorithm: if $\epsilon$ is large, we cannot expect a good approximation regardless of the learning process. We will show later error bounds of the form (up to constant factors)
$$\frac{\epsilon}{1-\gamma} \quad \text{or} \quad \frac{\epsilon}{(1-\gamma)^2}.$$
These bounds might seem disappointing, since as $\gamma \to 1$ they diverge. On the other hand, if we enrich our architecture (enlarge the family of functions $\hat{V}(\cdot,r)$), then $\epsilon$ decreases, and when $\gamma$ is a constant the bound goes to $0$ as $\epsilon \to 0$.
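As a quick illustration of how these bounds scale (the values $\epsilon = 0.01$, $\gamma = 0.9$ and $\gamma = 0.99$ are chosen only for this example):
$$\epsilon = 0.01:\qquad \gamma = 0.9 \;\Rightarrow\; \frac{\epsilon}{1-\gamma} = 0.1,\ \ \frac{\epsilon}{(1-\gamma)^2} = 1;\qquad \gamma = 0.99 \;\Rightarrow\; \frac{\epsilon}{1-\gamma} = 1,\ \ \frac{\epsilon}{(1-\gamma)^2} = 100.$$
For a fixed $\gamma$, halving $\epsilon$ (e.g., by enriching the architecture) halves both bounds.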
If we approximate $Q^*(s,a)$ by $\hat{Q}(s,a)$, the greedy policy $\pi$ is
$$\pi(s) = \arg\max_a \hat{Q}(s,a).$$
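For concreteness, here is a minimal sketch of this greedy selection in Python, assuming some approximation q_hat(s, a) of $Q^*$ and a finite action set are available (both names are hypothetical, not from the notes):

```python
# Minimal sketch: the greedy policy induced by an approximate Q-function.
# q_hat(s, a) stands for any approximation of Q*(s, a); the name is hypothetical.

def greedy_action(s, actions, q_hat):
    """Return argmax_a q_hat(s, a) (ties broken by the first maximizer found)."""
    return max(actions, key=lambda a: q_hat(s, a))
```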
If instead we have an approximation $\hat{V}(s)$, then the greedy policy $\pi$ is
$$\pi(s) = \arg\max_a \Big\{ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \hat{V}(s') \Big\}
= \arg\max_a \mathbb{E}_{s'}\!\left[ R(s,a) + \gamma \hat{V}(s') \right].$$
Here we also have to approximate the expectation over the next state $s'$, so there are additional errors due to this approximation too.
If there are only a few possible next states $s'$, we can compute the expectation exactly; otherwise we approximate it by taking samples.
In that case we get a stochastic policy, since each time we draw different samples we may choose a different action, unlike the case of $\hat{Q}$, where the greedy policy is deterministic.
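To illustrate why sampling makes the policy stochastic, here is a minimal sketch, assuming access to a simulator sample_next(s, a), an approximate value function v_hat(s), and a reward function R(s, a) (all three names are hypothetical helpers, not part of the notes):

```python
# Minimal sketch: sample-based greedy action selection from an approximate value function.
# sample_next(s, a), v_hat(s), and R(s, a) are assumed, hypothetical helpers.

def greedy_from_values(s, actions, R, sample_next, v_hat, gamma, n_samples=10):
    """Estimate R(s, a) + gamma * E[v_hat(s')] from samples and pick the best action.

    Because the expectation is estimated from random samples, repeated calls with
    the same state s may return different actions: the induced policy is stochastic.
    """
    best_action, best_value = None, float("-inf")
    for a in actions:
        samples = [v_hat(sample_next(s, a)) for _ in range(n_samples)]
        est = R(s, a) + gamma * sum(samples) / n_samples
        if est > best_value:
            best_action, best_value = a, est
    return best_action
```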
The next theorem relates errors in the value function to differences in the return.
Theorem 11.1
Consider a discounted problem with discount factor $\gamma < 1$. If $V$ satisfies
$$\|V - V^*\|_\infty \le \epsilon,$$
and $\pi$ is a greedy policy based on $V$, then
$$\|V^{\pi} - V^*\|_\infty \le \frac{2\gamma\epsilon}{1-\gamma}.$$
Furthermore, there exists $\delta > 0$ such that for every $\epsilon < \delta$ the greedy policy $\pi$ is an optimal policy.
Proof:
Consider the operators
$$(T V)(s) = \max_a \Big\{ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big\}$$
and
$$(T_{\pi} V)(s) = R(s,\pi(s)) + \gamma \sum_{s'} P(s' \mid s,\pi(s))\, V(s').$$
Both operators are $\gamma$-contractions with respect to $\|\cdot\|_\infty$, their fixed points are $V^*$ and $V^{\pi}$ respectively, and since $\pi$ is greedy with respect to $V$ we have $T_{\pi} V = T V$. Then
$$\|V^{\pi} - V^*\|_\infty = \|T_{\pi} V^{\pi} - T V^*\|_\infty
\le \|T_{\pi} V^{\pi} - T_{\pi} V\|_\infty + \|T V - T V^*\|_\infty
\le \gamma \|V^{\pi} - V\|_\infty + \gamma \|V - V^*\|_\infty
\le \gamma \|V^{\pi} - V^*\|_\infty + 2\gamma\epsilon.$$
This implies that
$$(1-\gamma)\,\|V^{\pi} - V^*\|_\infty \le 2\gamma\epsilon,
\qquad\text{i.e.,}\qquad
\|V^{\pi} - V^*\|_\infty \le \frac{2\gamma\epsilon}{1-\gamma}.$$
For the second part, since we have a finite number of policies, there exists $\Delta > 0$ such that every non-optimal policy $\pi'$ satisfies $\|V^{\pi'} - V^*\|_\infty \ge \Delta$. Choose $\delta > 0$ with $\frac{2\gamma\delta}{1-\gamma} < \Delta$; then for every $\epsilon < \delta$ the greedy policy $\pi$ satisfies $\|V^{\pi} - V^*\|_\infty < \Delta$, so $\pi$ must be optimal.
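The bound is easy to check numerically; for instance, with $\gamma = 0.9$ and $\epsilon = 0.05$ it equals $2 \cdot 0.9 \cdot 0.05 / 0.1 = 0.9$. The sketch below is not part of the original argument: the small random MDP and all function names are illustrative. It computes $V^*$ by value iteration, perturbs it by at most $\epsilon$, extracts the greedy policy, and compares $\|V^{\pi} - V^*\|_\infty$ with $2\gamma\epsilon/(1-\gamma)$.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=2000):
    """Approximate V* for a finite MDP.
    P has shape (A, S, S): P[a, s, s'] = transition probability; R has shape (S, A)."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
        V = Q.max(axis=1)
    return V

def greedy_policy(P, R, V, gamma):
    """Greedy policy with respect to the value function V."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.argmax(axis=1)                            # pi[s] = greedy action

def policy_value(P, R, pi, gamma):
    """Solve V^pi = R_pi + gamma * P_pi V^pi exactly."""
    S = R.shape[0]
    P_pi = P[pi, np.arange(S), :]
    R_pi = R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

rng = np.random.default_rng(0)
S, A, gamma, eps = 5, 3, 0.9, 0.05
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)                      # normalize rows to probabilities
R = rng.random((S, A))

V_star = value_iteration(P, R, gamma)
V = V_star + rng.uniform(-eps, eps, size=S)            # ||V - V*||_inf <= eps
pi = greedy_policy(P, R, V, gamma)
V_pi = policy_value(P, R, pi, gamma)

print(np.max(np.abs(V_pi - V_star)), "<=", 2 * gamma * eps / (1 - gamma))
```

By the theorem, the printed left-hand side can never exceed the right-hand side, since the perturbed $V$ satisfies $\|V - V^*\|_\infty \le \epsilon$.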