The Optimality Principal

Next: About this document ... Up: Finite Horizon Previous: Computational Complexity

The Optimality Principal

Let us define: $U^*_t(h_t) = \max_{\pi\in{\Pi^{HR}}} \{U^{\pi}_t(h_t)\}$ . For the simplicity of the discussion we assume that both S and A are finite. The optimality equations (also known as Bellman equations) are:

$U_t(h_t) = \max_{a\in A} \{r_t(s_t,a) + \sum_{j\in S} P_t(j\vert s_t,a)U_{t+1}(h_t,a,j)\},$

Where U_N(h_N)=r_N(s_N) for h_N=(h_N-1,a_N-1,s_N).

Note that replacing the max operation in the equation with the action taken by policy $\pi$ , we get the equation for computing the return of policy $\pi$ .

Theorem 3.2 Let U_t be the solution for the optimality equations for t = 1 up to N-1, then:

(1) $\forall t \leq N-1, \forall h_t\in H_t, \; u_t(h_t)=u^*_t(h_t)$

(2) $\forall s_1 \in S, \; u_1(s_1)=V^*_N(s_1)$

Proof:First, we will show by induction that $u_t(h_t) \geq u^*_t(h_t)$ for $1\leq t\leq N$ and for every $h_t\in H_t$ . Then we exhibit a specific policy $\pi$ for which $V^{\pi}_t(h_t)=u_t(h_t)$ and that this will conclude the proof.

Basis: When t=N, by definition $u_N(h_N)=r_N(s_N)=u^{\pi}_N(h_N)$ for every policy $\pi$ and therefore u_N(h_N)=u^*_N(h_N).

Induction Step: Let us assume that $u_t(h_t) \geq u^*_t(h_t)$ for $n+1 \leq t \leq N$ and $h_t\in H_t$ . Let $\pi'$ be a policy in $\Pi^{HR}$ (that performs the operation d_n' on step n). for t=n:

$u_n(h_n) = \max_{a \in A} \{ r(s_n,a) + \sum_{j \in S} P_n(j\vert s_n,a)u_{n+1}(h_n,a,j)\}$

Consider an arbitrary policy $\pi'$ , that for history h_n uses stochastic action q(a). By the induction assumption,

$\begin{eqnarray*}u_n(h_n) & \geq & \max_{ a \in A} \{ r(s_n,a) + \sum_{j \in S} ... ...vert s_n,a) u^{\pi'}_{n+1}(h_n,a,j)\} \\ & = & u^{\pi'}_n(h_n) \end{eqnarray*}$

Therefore, $u_n(h_n) \geq U^{\pi'}_n(h_n)$ for every $\pi'$ and in particular to the optimal policy, $u_n(h_n)\geq u^*_n(h_n)= \max_{\pi'} u^{\pi'}_n(h_n)$ . It is left to show that there exists a policy $\pi'$ for which $V^{\pi'}_N(h_n)=u_n(h_n)$ . Let us define $\pi'$ in the following manner:

$\pi'(h_n) = argmax_{a \in A} \{ r(s_n,a) + \sum_{j \in S} P_n(j\vert s_n,a)u_{n+1}(h_n,a,j) \}$

We will now show that $u_n(h_n)=u^{\pi'}_n(h_n)$ using reversed induction on n.

Basis: for n=N this is obviously true.

Induction Step:

$\begin{eqnarray*}u^{\pi'}_n(h_n) & = & r(s_n,\pi'(h_n)) + \sum_{j \in S} P_n(j\v... ...\in S} P_n(j\vert s_n, \pi'(h_n))u_{n+1}(h_n) \\ & = & u_n(h_n) \end{eqnarray*}$

The second equality is by the induction hypothesis, that $u^{\pi'}_t=u_t(h_t)$ for $t \geq n+1$ . Therefore $u^{\pi'}_n(h_n)=u_n(h_n)$ and since $u^*_n(h_n) \geq u^{\pi'}_n(h_n)$ it is clear that u_n(h_n)=u^*_n(h_n) as was required. $\Box$

Next: About this document ... Up: Finite Horizon Previous: Computational Complexity

Yishay Mansour
1999-11-15