When the state space is large and we cannot compute $V^*(s)$ exactly, we would like to build an approximation function
$$\hat{V}(s,r),$$
where $r$ is the parameter vector of the approximation architecture.
Let
$$\epsilon = \min_r \|\hat{V}(\cdot,r) - V^*\|_\infty$$
be the minimal distance between the family $\{\hat{V}(\cdot,r)\}$ and $V^*$.
It is clear that $\epsilon$ is a lower bound on the error of any approximation algorithm: if $\epsilon$ is large, we cannot expect a good approximation regardless of the learning process. We will show later error bounds of the form (up to constant factors)
$$\frac{\epsilon}{1-\gamma} \quad \text{or} \quad \frac{\epsilon}{(1-\gamma)^2}.$$
These bounds might seem disappointing, since as $\gamma \to 1$ they diverge. On the other hand, if we enrich our architecture (enlarge the family of functions $\hat{V}(\cdot,r)$), then $\epsilon$ decreases, and when $\gamma$ is a constant the bound goes to $0$ as $\epsilon \to 0$.
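As a quick illustration of how these bounds scale (the values $\epsilon = 0.01$, $\gamma = 0.9$ and $\gamma = 0.99$ are chosen only for this example):
$$\epsilon = 0.01:\qquad \gamma = 0.9 \;\Rightarrow\; \frac{\epsilon}{1-\gamma} = 0.1,\ \ \frac{\epsilon}{(1-\gamma)^2} = 1;\qquad \gamma = 0.99 \;\Rightarrow\; \frac{\epsilon}{1-\gamma} = 1,\ \ \frac{\epsilon}{(1-\gamma)^2} = 100.$$
For a fixed $\gamma$, halving $\epsilon$ (e.g., by enriching the architecture) halves both bounds.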
If we approximate $Q^*(s,a)$ by $\hat{Q}(s,a)$, the greedy policy $\pi$ is
$$\pi(s) = \arg\max_a \hat{Q}(s,a).$$
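For concreteness, here is a minimal sketch of this greedy selection in Python, assuming some approximation q_hat(s, a) of $Q^*$ and a finite action set are available (both names are hypothetical, not from the notes):

```python
# Minimal sketch: the greedy policy induced by an approximate Q-function.
# q_hat(s, a) stands for any approximation of Q*(s, a); the name is hypothetical.

def greedy_action(s, actions, q_hat):
    """Return argmax_a q_hat(s, a) (ties broken by the first maximizer found)."""
    return max(actions, key=lambda a: q_hat(s, a))
```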
If instead we have an approximation $\hat{V}(s)$, then the greedy policy $\pi$ is
$$\pi(s) = \arg\max_a \Big\{ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \hat{V}(s') \Big\}
= \arg\max_a \mathbb{E}_{s'}\!\left[ R(s,a) + \gamma \hat{V}(s') \right].$$
Here we also have to approximate the expectation over the next state $s'$, so there are additional errors due to this approximation too.
If there are only a few possible next states $s'$, we can compute the expectation exactly; otherwise we approximate it by taking samples.
In that case we get a stochastic policy, since each time we draw different samples we may choose a different action, unlike the case of $\hat{Q}$, where the greedy policy is deterministic.
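To illustrate why sampling makes the policy stochastic, here is a minimal sketch, assuming access to a simulator sample_next(s, a), an approximate value function v_hat(s), and a reward function R(s, a) (all three names are hypothetical helpers, not part of the notes):

```python
# Minimal sketch: sample-based greedy action selection from an approximate value function.
# sample_next(s, a), v_hat(s), and R(s, a) are assumed, hypothetical helpers.

def greedy_from_values(s, actions, R, sample_next, v_hat, gamma, n_samples=10):
    """Estimate R(s, a) + gamma * E[v_hat(s')] from samples and pick the best action.

    Because the expectation is estimated from random samples, repeated calls with
    the same state s may return different actions: the induced policy is stochastic.
    """
    best_action, best_value = None, float("-inf")
    for a in actions:
        samples = [v_hat(sample_next(s, a)) for _ in range(n_samples)]
        est = R(s, a) + gamma * sum(samples) / n_samples
        if est > best_value:
            best_action, best_value = a, est
    return best_action
```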
The next theorem relates errors in the value function to differences in the return.
Theorem 11.1
Consider a discounted problem with discount factor $\gamma < 1$. If $V$ satisfies
$$\|V - V^*\|_\infty \le \epsilon,$$
and $\pi$ is a greedy policy based on $V$, then
$$\|V^{\pi} - V^*\|_\infty \le \frac{2\gamma\epsilon}{1-\gamma}.$$
Furthermore, there exists $\delta > 0$ such that for every $\epsilon < \delta$ the greedy policy $\pi$ is an optimal policy.
Proof:
Consider the operators
$$(T V)(s) = \max_a \Big\{ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big\}$$
and
$$(T_{\pi} V)(s) = R(s,\pi(s)) + \gamma \sum_{s'} P(s' \mid s,\pi(s))\, V(s').$$
Both operators are $\gamma$-contractions with respect to $\|\cdot\|_\infty$, their fixed points are $V^*$ and $V^{\pi}$ respectively, and since $\pi$ is greedy with respect to $V$ we have $T_{\pi} V = T V$. Then
$$\|V^{\pi} - V^*\|_\infty = \|T_{\pi} V^{\pi} - T V^*\|_\infty
\le \|T_{\pi} V^{\pi} - T_{\pi} V\|_\infty + \|T V - T V^*\|_\infty
\le \gamma \|V^{\pi} - V\|_\infty + \gamma \|V - V^*\|_\infty
\le \gamma \|V^{\pi} - V^*\|_\infty + 2\gamma\epsilon.$$
This implies that
$$(1-\gamma)\,\|V^{\pi} - V^*\|_\infty \le 2\gamma\epsilon,
\qquad\text{i.e.,}\qquad
\|V^{\pi} - V^*\|_\infty \le \frac{2\gamma\epsilon}{1-\gamma}.$$
For the second part, since we have a finite number of policies, there exists $\Delta > 0$ such that every non-optimal policy $\pi'$ satisfies $\|V^{\pi'} - V^*\|_\infty \ge \Delta$. Choose $\delta > 0$ with $\frac{2\gamma\delta}{1-\gamma} < \Delta$; then for every $\epsilon < \delta$ the greedy policy $\pi$ satisfies $\|V^{\pi} - V^*\|_\infty < \Delta$, so $\pi$ must be optimal.
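The bound is easy to check numerically; for instance, with $\gamma = 0.9$ and $\epsilon = 0.05$ it equals $2 \cdot 0.9 \cdot 0.05 / 0.1 = 0.9$. The sketch below is not part of the original argument: the small random MDP and all function names are illustrative. It computes $V^*$ by value iteration, perturbs it by at most $\epsilon$, extracts the greedy policy, and compares $\|V^{\pi} - V^*\|_\infty$ with $2\gamma\epsilon/(1-\gamma)$.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=2000):
    """Approximate V* for a finite MDP.
    P has shape (A, S, S): P[a, s, s'] = transition probability; R has shape (S, A)."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
        V = Q.max(axis=1)
    return V

def greedy_policy(P, R, V, gamma):
    """Greedy policy with respect to the value function V."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.argmax(axis=1)                            # pi[s] = greedy action

def policy_value(P, R, pi, gamma):
    """Solve V^pi = R_pi + gamma * P_pi V^pi exactly."""
    S = R.shape[0]
    P_pi = P[pi, np.arange(S), :]
    R_pi = R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

rng = np.random.default_rng(0)
S, A, gamma, eps = 5, 3, 0.9, 0.05
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)                      # normalize rows to probabilities
R = rng.random((S, A))

V_star = value_iteration(P, R, gamma)
V = V_star + rng.uniform(-eps, eps, size=S)            # ||V - V*||_inf <= eps
pi = greedy_policy(P, R, V, gamma)
V_pi = policy_value(P, R, pi, gamma)

print(np.max(np.abs(V_pi - V_star)), "<=", 2 * gamma * eps / (1 - gamma))
```

By the theorem, the printed left-hand side can never exceed the right-hand side, since the perturbed $V$ satisfies $\|V - V^*\|_\infty \le \epsilon$.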