When the system is started at state $s_0$, the goal of the agent is to choose an action that maximizes the value

$$V(a) = \mathbb{E}\bigl[R_1(s_0, a) + r(x_2)\bigr].$$

We will choose a deterministic policy that selects an action according to the value

$$V(a) = \mathbb{E}\bigl[R_1(s_0, a)\bigr] + \mathbb{E}\bigl[r(x_2)\bigr],$$

where $x_2$ is a random variable describing the state reached after action $a$. Writing it explicitly,

$$V(a) = \mathbb{E}\bigl[R_1(s_0, a)\bigr] + \sum_{s'} P(s' \mid s_0, a)\, r(s').$$
The agent will choose an action $a^*$ that maximizes $V(a)$, i.e.,

$$a^* \in \arg\max_{a} V(a).$$
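To make this computation concrete, here is a minimal NumPy sketch of the one-step greedy choice. Everything in it is a made-up illustration, not part of the notes: the reward vector `R1`, the transition matrix `P`, the terminal rewards `r`, and all the numbers are hypothetical, assuming one start state, three actions, and two successor states.

```python
import numpy as np

# Hypothetical two-stage problem (all numbers invented for illustration):
R1 = np.array([1.0, 0.0, 0.5])      # E[R1(s0, a)] for each action a
P = np.array([[0.9, 0.1],           # P(s' | s0, a): rows = actions,
              [0.2, 0.8],           #                columns = next states
              [0.5, 0.5]])
r = np.array([0.0, 5.0])            # terminal reward r(s') for each next state

# V(a) = E[R1(s0, a)] + sum_{s'} P(s' | s0, a) * r(s')
V = R1 + P @ r                      # -> [1.5, 4.0, 3.0]

a_star = int(np.argmax(V))          # a* in argmax_a V(a)  -> action 1
print(V, a_star)
```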
In the above example, there is no stochastic rule that is better than the deterministic policy presented. If there were, such a stochastic rule would draw its action from some distribution $q(a)$, and the return of such a policy would be

$$\sum_{a} q(a)\, V(a),$$

but since we chose $a^*$ so that $V(a^*) \ge V(a)$ for any action $a$, then

$$\sum_{a} q(a)\, V(a) \le \sum_{a} q(a)\, V(a^*) = V(a^*)$$

(in simpler words, the best deterministic choice is always at least as good as any average).
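As a quick numerical check of this inequality, the following sketch reuses the hypothetical `V` values from the example above and samples random action distributions $q$; no sampled mixture return exceeds $V(a^*)$. The Dirichlet draw is just a convenient way to generate arbitrary distributions over actions.

```python
import numpy as np

# Hypothetical action values from the sketch above.
V = np.array([1.5, 4.0, 3.0])
a_star = int(np.argmax(V))

rng = np.random.default_rng(0)
for _ in range(1000):
    q = rng.dirichlet(np.ones(len(V)))   # a random stochastic rule q(a)
    # The mixture return sum_a q(a) V(a) never beats V(a*).
    assert q @ V <= V[a_star] + 1e-12
```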