When the system is started at state $s_0$, the goal of the agent is to choose an action $a$ that maximizes the value of

$$v(a) = \mathbb{E}\big[R_1(s_0, a) + r(X_2)\big].$$
We will choose a deterministic policy that selects an action as follows. By linearity of expectation,

$$v(a) = \mathbb{E}\big[R_1(s_0, a)\big] + \mathbb{E}\big[r(X_2)\big],$$

where $X_2$ is a random variable describing the state reached after action $a$. Writing it explicitly,

$$v(a) = \mathbb{E}\big[R_1(s_0, a)\big] + \sum_{s'} P(s' \mid s_0, a)\, r(s').$$
The agent will choose an action $a^*$ that maximizes $v(a)$, i.e.,

$$a^* = \arg\max_{a} v(a).$$
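To make this concrete, here is a minimal Python sketch of the computation. The number of actions and states, the rewards `R1` and `r`, and the transition matrix `P` are all hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical one-step problem: 2 actions, 3 possible next states.
# R1[a]    : expected immediate reward E[R1(s0, a)]
# P[a, s'] : transition probability P(s' | s0, a)
# r[s']    : reward for landing in state s'
R1 = np.array([1.0, 0.5])
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
r = np.array([0.0, 1.0, 3.0])

# v(a) = E[R1(s0, a)] + sum_{s'} P(s' | s0, a) * r(s')
v = R1 + P @ r

# Deterministic policy: pick the maximizing action a*.
a_star = np.argmax(v)
print(v, a_star)  # v = [1.5, 2.6], a* = 1
```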
In the above example, there is no stochastic rule that is better than the deterministic policy presented. If there were, such a stochastic rule would have an action distribution $q(a)$, and the return of such a policy would be

$$\sum_a q(a)\, v(a),$$

but since we chose $a^*$ so that

$$v(a^*) \ge v(a)$$

for any action $a$, then

$$\sum_a q(a)\, v(a) \le \sum_a q(a)\, v(a^*) = v(a^*)$$

(in simpler words, the best deterministic choice is always at least as good as any average).
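Continuing the hypothetical sketch above, a quick numeric check of this inequality: any action distribution `q` only mixes the values $v(a)$, so its return cannot exceed $v(a^*)$.

```python
# An arbitrary stochastic rule over the two actions.
q = np.array([0.4, 0.6])

# Return of the stochastic rule: sum_a q(a) * v(a).
stochastic_return = q @ v  # 0.4*1.5 + 0.6*2.6 = 2.16

# Never better than the best deterministic choice v(a*) = 2.6.
assert stochastic_return <= v[a_star]
```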