Let $\pi$ be a given policy. We would like to calculate its value $V^\pi$.
An inefficient way of calculating $V^\pi$ is to enumerate all possible histories, compute the return of each one, and take the expectation, i.e., the sum of the returns weighted by the probability of each history. An alternative method is to compute the return of the last step first, and then recurse backwards over the previous steps until reaching the first step.
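The brute-force method can be sketched as follows on a tiny finite-horizon MDP. The MDP itself (two states, two actions, horizon $N=3$, the transition and reward numbers) is an invented example, not part of the original text; it only illustrates "enumerate every history, weight its return by its probability":

```python
import itertools

# Toy finite-horizon MDP (all numbers are made up for illustration):
# 2 states, 2 actions, horizon N = 3.
N = 3
STATES = [0, 1]
P = {  # P[(s, a)] = probability distribution over the next state
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}  # r(s, a)
policy = {0: 0, 1: 1}  # a deterministic policy: state -> action

def brute_force_value(s1):
    """V^pi(s1): enumerate every possible history and take the
    probability-weighted sum of returns."""
    value = 0.0
    # With a deterministic policy, a history is fully determined by the
    # sequence of sampled next states (N - 1 transitions).
    for nxt in itertools.product(STATES, repeat=N - 1):
        prob, ret, s = 1.0, 0.0, s1
        for t in range(N):
            a = policy[s]
            ret += R[(s, a)]
            if t < N - 1:
                prob *= P[(s, a)][nxt[t]]
                s = nxt[t]
        value += prob * ret
    return value
```

Note the cost: the number of histories grows exponentially with the horizon ($|S|^{N-1}$ here), which is exactly why this method is inefficient.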
Let us define a set of variables $U_t(h_t)$ representing the expected return from time $t$ until time $N$, where $h_t$ denotes the history (the states and actions observed so far) up to time $t$. We can write $U_t(h_t)$ as
$$U_t(h_t) = \mathbb{E}\left[\sum_{\tau=t}^{N} r(s_\tau, a_\tau) \,\middle|\, h_t\right],$$
where
$$h_t = (h_{t-1}, a_{t-1}, s_t).$$
We can define $U_t$ as a function of $U_{t+1}$ as follows:
$$U_t(h_t) = \mathbb{E}_{a_t \sim \pi,\; s_{t+1}}\left[\, r(s_t, a_t) + U_{t+1}(h_t, a_t, s_{t+1}) \,\right].$$
Note that $s_t$ is known from the history, and $a_t$ is known if $\pi$ is deterministic. $s_{t+1}$ is unknown for any policy, since the environment is stochastic.
The idea: calculate the values of $U_t$ from $t = N$ down to $t = 1$ using a dynamic programming algorithm; the value $V^\pi$ is then obtained from $U_1$.
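The backward recursion can be sketched on a toy MDP (the two-state MDP and all its numbers are invented for illustration). One simplifying assumption is made explicit here: because the reward depends only on $(s_t, a_t)$ and the policy depends only on the current state, $U_t(h_t)$ collapses to a function of $s_t$ alone, so a table indexed by $(t, s)$ suffices instead of one indexed by full histories:

```python
# Toy finite-horizon MDP (all numbers are made up for illustration).
N = 3
STATES = [0, 1]
P = {  # P[(s, a)] = probability distribution over the next state
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.5}  # r(s, a)
policy = {0: 0, 1: 1}  # a deterministic policy: state -> action

def evaluate_policy():
    """Fill the table U[t][s] from t = N down to t = 1.

    Assumes Markov rewards and a Markov policy, so U_t depends on the
    history only through the current state s_t.
    """
    U = {N + 1: {s: 0.0 for s in STATES}}  # no reward after time N
    for t in range(N, 0, -1):              # t = N, N-1, ..., 1
        U[t] = {}
        for s in STATES:
            a = policy[s]                  # known, since pi is deterministic
            # r(s_t, a_t) plus the expectation over the stochastic s_{t+1}
            U[t][s] = R[(s, a)] + sum(
                P[(s, a)][s2] * U[t + 1][s2] for s2 in STATES
            )
    return U
```

This runs in $O(N \cdot |S| \cdot |S|)$ time for a deterministic policy, in contrast to the exponentially many histories the brute-force enumeration would visit; $U[1][s]$ gives $V^\pi$ for the starting state $s$.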