Dynamic Programming Algorithm for

Next: Computational Complexity Up: Return Estimation of a Previous: Return Estimation of a

Dynamic Programming Algorithm for $\pi\in\Pi^{HD}$

1. $t \leftarrow N$

$\forall h_N=(h_{N-1},a_{N-1},s_N) : U^{\pi}_N(h_N)\leftarrow r_N(s_N)$

2. Repeat $t\leftarrow t-1$ until t=1

$\forall h_t=(h_{t-1},a_{t-1},s_t) \in H_t$

$U^{\pi}_t(h_t)\leftarrow r_t(s_t,d_t(h_t)) + \sum_{j\in S} P_t(j\vert s_t,d_t(h_t))*U^{\pi}_{t+1}(h_t,d_t(h_t),j)$

The next lemma states the correction of the algorithm.

Lemma 3.1 for $\pi\in\Pi^{HD}, U^{\pi}_1(s)=V^{\pi}_N(s)$

Proof:We will use a reversed induction to show that $U^{\pi}_{t}(h_{t})$ equals the expected return of $\pi$ given history h_t.

Basis: for t=N the lemma is obviously true (by the definition of V).

Induction Step: $U^{\pi}_{t}(h_{t})$ can be written as:

(1) $U^{\pi}_t(h_t)= E^{\pi}_{h_t}[r_t(X_t,Y_t)+U_{t+1}(h_t,Y_t,X_{t+1)}]$

X_t=s_t is known from h_t

Y_t=d_t(h_t) is known from $\pi$ since the policy is deterministic.

(2) $U^{\pi}_t(h_t) = r_t(s_t,d_t(h_t)) + E^{\pi}_{h_t}[U^{\pi}_{t+1}(h_t,d_t(h_t),x_{t+1)}]$ and at stage 2 of the algorithm we calculate exactly that expectation. Therefore, by using the induction hypothesis, it holds also for t. This implies that at the end, we have computed the expected return of $\pi$ from the start state, which is by definition $V^{\pi}_{N}(s)$ . $\Box$ the policy was a stochastic one ( $\pi\in\Pi^{HR}$ ), we would change stage 2 of the algorithm and get:

(2) $U^{\pi}_t(h_t)\leftarrow \sum_{a\in A} q_{h_t}(a)\{r_t(s_t,d_t(h_t)) + \sum_{j\in S} P_t(j\vert s_t,d_t(h_t))*U^{\pi}_{t+1}(h_t,d_t(h_t),j)\}$

Yishay Mansour
1999-11-15