1.
2. Repeat
until t=1
The next lemma states the correction of the algorithm.
Proof:We will use a reversed induction to show that
equals the expected return of
given history ht.
Basis: for t=N the lemma is obviously true (by the definition of V).
Induction Step:
can be written as:
(1)
Xt=st is known from ht
Yt=dt(ht) is known from
since the policy is deterministic.
(2)
and at stage 2 of the algorithm we calculate exactly that expectation. Therefore, by using the induction hypothesis, it holds also for t. This implies that at the end, we have computed the expected return of
from the start state, which is by definition
.
the policy was a stochastic one (
), we would change stage 2 of the algorithm and get:
(2)