Let be a given policy. We would like to calculate it's value . An inefficient way of calculating is creating all possible histories, finding the return value for each one and calculating the expectation of it. Then we can compute the value by taking weighted sum. An alternative calculation method is calculating return of the last step first, and recurse for the previous steps, until reaching the first step. Let us define a set of variables to represent the expected return starting in time t until time N where ht notes the history (previous steps and actions) until time t. We can write as:
where ht = (ht-1, at-1, st)
We can define as a function of as follows:
Note that Xt is known from the history and Yt is known if is deterministic. Xt+1 is unknown for any policy since the environment is stochastic.
The idea: Calculate the values of from t=N to t=1 using a dynamic programming algorithm.