Let us define: . For the simplicity of the discussion we assume that both S and A are finite. The optimality equations (also known as Bellman equations) are:
Where UN(hN)=rN(sN) for hN=(hN-1,aN-1,sN).
Note that replacing the max operation in the equation with the action taken by policy , we get the equation for computing the return of policy .
(1)
(2)
Proof:First, we will show by induction that for and for every . Then we exhibit a specific policy for which and that this will conclude the proof.
Basis: When t=N, by definition for every policy and therefore uN(hN)=u*N(hN).
Induction Step: Let us assume that for and . Let be a policy in (that performs the operation dn' on step n). for t=n:
Consider an arbitrary policy ,
that for history hn uses stochastic action q(a). By the induction assumption,
We will now show that using reversed induction on n.
Basis: for n=N this is obviously true.
Induction Step: