When the system is started at state $s_0$, the goal of the agent is to choose an action $a$ that maximizes the value of

$$v(a) = \mathbb{E}\big[R_1(s_0, a) + r(X_2)\big].$$
We will choose a deterministic policy that selects an action as follows. By linearity of expectation,

$$v(a) = \mathbb{E}\big[R_1(s_0, a)\big] + \mathbb{E}\big[r(X_2)\big],$$

where $X_2$ is a random variable describing the state reached after action $a$. Writing it explicitly,

$$v(a) = \mathbb{E}\big[R_1(s_0, a)\big] + \sum_{s'} P(s' \mid s_0, a)\, r(s').$$
The agent will choose an action $a^*$ that maximizes $v(a)$, i.e.,

$$a^* = \arg\max_{a} v(a).$$
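To make this concrete, here is a minimal Python sketch of the computation. The number of actions and states, the rewards `R1` and `r`, and the transition matrix `P` are all hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical one-step problem: 2 actions, 3 possible next states.
# R1[a]    : expected immediate reward E[R1(s0, a)]
# P[a, s'] : transition probability P(s' | s0, a)
# r[s']    : reward for landing in state s'
R1 = np.array([1.0, 0.5])
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
r = np.array([0.0, 1.0, 3.0])

# v(a) = E[R1(s0, a)] + sum_{s'} P(s' | s0, a) * r(s')
v = R1 + P @ r

# Deterministic policy: pick the maximizing action a*.
a_star = np.argmax(v)
print(v, a_star)  # v = [1.5, 2.6], a* = 1
```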
In the above example, there is no stochastic rule that is better than the deterministic policy presented. If there were, such a stochastic rule would have an action distribution $q(a)$, and the return of such a policy would be

$$\sum_a q(a)\, v(a),$$

but since we chose $a^*$ so that

$$v(a^*) \ge v(a)$$

for any action $a$, then

$$\sum_a q(a)\, v(a) \le \sum_a q(a)\, v(a^*) = v(a^*)$$

(in simpler words, the best deterministic choice is always at least as good as any average).
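Continuing the hypothetical sketch above, a quick numeric check of this inequality: any action distribution `q` only mixes the values $v(a)$, so its return cannot exceed $v(a^*)$.

```python
# An arbitrary stochastic rule over the two actions.
q = np.array([0.4, 0.6])

# Return of the stochastic rule: sum_a q(a) * v(a).
stochastic_return = q @ v  # 0.4*1.5 + 0.6*2.6 = 2.16

# Never better than the best deterministic choice v(a*) = 2.6.
assert stochastic_return <= v[a_star]
```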