Return Estimation of a Given Policy

Next: Dynamic Programming Algorithm for Up: Finite Horizon Previous: Expected Total Reward

Return Estimation of a Given Policy

Let $\pi\in\Pi^{HR}$ be a given policy. We would like to calculate it's value $V^{\pi}_{N}(s)$ . An inefficient way of calculating $V^{\pi}_{N}(s)$ is creating all possible histories, finding the return value for each one and calculating the expectation of it. Then we can compute the value by taking weighted sum. An alternative calculation method is calculating return of the last step first, and recurse for the previous steps, until reaching the first step. Let us define a set of variables $U^{\pi}_{t}(h_{t})$ to represent the expected return starting in time t until time N where h_t notes the history (previous steps and actions) until time t. We can write $U^{\pi}_t$ as:

$U^{\pi}_t = E^{\pi}_{h_t}[\sum_{n=t}^{N-1} r_n(X_n,Y_n)+r_N(X_N)]$ where h_t = (h_t-1, a_t-1, s_t)

We can define $U^{\pi}_t$ as a function of $U^{\pi}_{t+1}$ as follows:

$U^{\pi}_t(h_t) = E^{\pi}h_t[r_t(X_t,Y_t)+U_{t+1}(h_t,Y_t,X_{t+1})]$

Note that X_t is known from the history and Y_t is known if $\pi$ is deterministic. X_t+1 is unknown for any policy since the environment is stochastic.

The idea: Calculate the values of $U^{\pi}_t$ from t=N to t=1 using a dynamic programming algorithm.

Next: Dynamic Programming Algorithm for Up: Finite Horizon Previous: Expected Total Reward

Yishay Mansour
1999-11-15