Next: Proof for The Indirect
Up: Proof for Phased Q-Learning
Previous: Putting It All Together
Finding the Constants mD and lD
For parameters
s.t.:
- 1.
-
- 2.
-
we get, by the above result, that with probability at least
we have reached an
policy i.e.:
So, let's find such parameters
.
Note that mD and lD must be natural numbers, but it suffices
to constrain them to a real value, since taking an upper
value won't change the asymptotic complexity.
>From (1) we get:
In order to use Chernoff, we must have w>0,
And we get the constraint:
|
(14) |
This constraint is fulfilled by setting
,
for any f>1, and in particular,
for
(
since
), we get:
|
(15) |
>From (2) we get:
|
(16) |
Hence:
|
(17) |
By substituting with the w received from (1), we get:
Now, let's bound the number of calls to PS(M) i.e. let's bound
:
So the number of calls to PS(M) that insures with probability at
least
that the received policy in the
Phased-Q-Learning algorithm is within
distance
from the optimal policy is
Next: Proof for The Indirect
Up: Proof for Phased Q-Learning
Previous: Putting It All Together
Yishay Mansour
2000-05-30