Bounding $\|\hat{Q}_{l_D} - Q^*\|_\infty$
After $l_D$ iterations, the algorithm Phased Q-Learning returns a policy within $\epsilon$ of the optimal policy if

$$\|\hat{Q}_{l_D} - Q^*\|_\infty \le \frac{\epsilon(1-\gamma)}{2} \qquad (5)$$

(due to the definition of $Q^*$).
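For completeness, the step behind (5) is the standard relation between the accuracy of a $Q$-function and the value of its greedy policy $\hat{\pi}(s) = \arg\max_a \hat{Q}(s,a)$ (a lemma assumed from class, not stated in this section):

$$\|V^{\hat{\pi}} - V^*\|_\infty \le \frac{2}{1-\gamma}\,\|\hat{Q} - Q^*\|_\infty,$$

so condition (5) guarantees that the returned greedy policy is $\epsilon$-optimal.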
By the triangle inequality:

$$\|\hat{Q}_{l_D} - Q^*\|_\infty \le \|\hat{Q}_{l_D} - Q_{l_D}\|_\infty + \|Q_{l_D} - Q^*\|_\infty \qquad (6)$$

where $Q_l$ denotes the sequence produced by exact value iteration.
Let us bound each of these two terms separately:
Claim 7.1   Bounding the per-phase sampling error: for every pair $(s,a)$ and every phase $l$,

$$\left|\frac{1}{m}\sum_{i=1}^{m}\hat{V}_l(s_i) - \sum_{j} P^a_{sj}\,\hat{V}_l(s_j)\right| \le \epsilon_1 \qquad (7)$$

with probability at least $1 - 2e^{-2m\epsilon_1^2/V_{max}^2}$.

Proof: Note that

- $\sum_j P^a_{sj}\,\hat{V}_l(s_j)$ is the expectation of the value of the next-state samples received from PS(M) for the pair $(s,a)$ (since the samples are taken from the distribution $P^a_{sj}$);
- each $\hat{V}_l(s_i)$ is bounded (since the immediate return is bounded and $0 \le \hat{V}_l \le V_{max} = R_{max}/(1-\gamma)$);
- the samples are independent and identically distributed.

Therefore, from the Chernoff inequality we obtain

$$\Pr\left[\,\left|\frac{1}{m}\sum_{i=1}^{m}\hat{V}_l(s_i) - \sum_{j} P^a_{sj}\,\hat{V}_l(s_j)\right| > \epsilon_1\,\right] \le 2e^{-2m\epsilon_1^2/V_{max}^2} \qquad (8)$$
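Rearranging (8) shows how large $m$ must be: requiring the failure probability to be at most some $\delta_1$ (a per-pair, per-phase failure probability introduced here only for illustration) gives

$$2e^{-2m\epsilon_1^2/V_{max}^2} \le \delta_1 \quad\Longleftrightarrow\quad m \ge \frac{V_{max}^2}{2\epsilon_1^2}\,\ln\frac{2}{\delta_1}$$

samples from PS(M) for each pair $(s,a)$ in each phase.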
Claim 7.2   Bounding $\|\hat{Q}_l - Q_l\|_\infty$: assuming the bound (7) of Claim 7.1 holds for every pair $(s,a)$ in every phase,

$$\|\hat{Q}_l - Q_l\|_\infty \le \frac{\gamma\,\epsilon_1}{1-\gamma}$$
Proof: First, by induction, we show that:

$$\|\hat{Q}_l - Q_l\|_\infty \le \gamma\,\epsilon_1\,\frac{1-\gamma^l}{1-\gamma} \qquad (9)$$

Induction's base: for $l=0$ both sides of the inequality are zero, and the inequality holds.

Induction's step: assume the claim for $l-1$ and prove it for $l$. Splitting the error at phase $l$ into the sampling error of Claim 7.1 and the error propagated from phase $l-1$ (using $\|\hat{V}_{l-1} - V_{l-1}\|_\infty \le \|\hat{Q}_{l-1} - Q_{l-1}\|_\infty$, since the max operator is non-expansive):

$$\|\hat{Q}_l - Q_l\|_\infty \le \gamma\,\epsilon_1 + \gamma\,\|\hat{Q}_{l-1} - Q_{l-1}\|_\infty \le \gamma\,\epsilon_1 + \gamma\cdot\gamma\,\epsilon_1\,\frac{1-\gamma^{l-1}}{1-\gamma} = \gamma\,\epsilon_1\,\frac{1-\gamma^l}{1-\gamma}.$$
Now that we have established (9), the final bound follows, since $1-\gamma^l \le 1$ for every $l$:

$$\|\hat{Q}_l - Q_l\|_\infty \le \gamma\,\epsilon_1\,\frac{1-\gamma^l}{1-\gamma} \le \frac{\gamma\,\epsilon_1}{1-\gamma} \qquad (10)$$
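As a quick sanity check (an illustration, not part of the original notes), the following Python snippet iterates the per-phase error recursion $e_l = \gamma(\epsilon_1 + e_{l-1})$ from the induction step and compares it against the closed form (9); the values of gamma and eps1 are arbitrary.

# Numerical check of the induction in Claim 7.2 (illustrative values).
gamma, eps1 = 0.9, 0.05  # arbitrary discount factor and per-phase sampling error

e = 0.0  # e_0 = 0: both sides are zero at the base case
for l in range(1, 21):
    e = gamma * (eps1 + e)                                # recursion from the induction step
    closed = gamma * eps1 * (1 - gamma**l) / (1 - gamma)  # closed form (9)
    assert abs(e - closed) < 1e-12                        # recursion matches the closed form

print(e, gamma * eps1 / (1 - gamma))  # e_l approaches the limit bound (10)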
Claim 7.3   Bounding $\|Q_l - Q^*\|_\infty$:

$$\|Q_l - Q^*\|_\infty \le \gamma^l\,V_{max} \qquad (11)$$
Proof: We saw in class that value iteration is a $\gamma$-contraction toward $Q^*$:

$$\|Q_l - Q^*\|_\infty \le \gamma\,\|Q_{l-1} - Q^*\|_\infty \qquad (12)$$

so, unrolling (12) over $l$ phases, $\|Q_l - Q^*\|_\infty \le \gamma^l\,\|Q_0 - Q^*\|_\infty \le \gamma^l\,V_{max}$, and the bound is obtained.
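For concreteness, here is a minimal sketch (in Python) of the phased update whose error the three claims track. It is an illustration under assumed interfaces, not the notes' own pseudocode: sampler(s, a, m), the reward table R, and the tabular representation are hypothetical stand-ins for PS(M) and the notation of the previous section.

import numpy as np

def phased_q_learning(sampler, R, n_states, n_actions, gamma, m, phases):
    """One run of phased Q-learning (sketch).

    sampler(s, a, m) -> array of m next states drawn from P^a_s (stands in for PS(M)).
    R[s, a] is the bounded immediate return.
    """
    Q = np.zeros((n_states, n_actions))          # \hat{Q}_0 = 0, matching the induction base
    for _ in range(phases):                      # phases l = 1, ..., l_D
        V = Q.max(axis=1)                        # \hat{V}_l(s) = max_a \hat{Q}_l(s, a)
        Q_next = np.empty_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                next_states = sampler(s, a, m)   # m i.i.d. samples per pair, as in Claim 7.1
                # The empirical mean replaces sum_j P^a_{sj} \hat{V}_l(s_j);
                # its deviation is the epsilon_1 term bounded in Claim 7.1.
                Q_next[s, a] = R[s, a] + gamma * V[next_states].mean()
        Q = Q_next
    return Q

The per-phase empirical mean is exactly the quantity whose deviation Claim 7.1 controls, and the outer loop accumulates those deviations as analyzed in Claim 7.2.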
Yishay Mansour
2000-05-30