Approximate Policy Iteration
The general structure is the same as in Policy Iteration, except for the following differences:
- We will not use $V^{\pi}$; instead we use $\tilde{V}^{\pi}$ (or $\tilde{Q}^{\pi}$), which is only an approximation of $V^{\pi}$.
The reasons for using an approximation are that the chosen architecture may not be expressive enough and that the simulations introduce noise.
- Let $\pi'$ be the greedy policy with respect to $\tilde{V}^{\pi}$. In practice we might only take a policy $\hat{\pi}$ which is close to $\pi'$.
These two differences are a source of error in the iteration.
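To make the structure concrete, the following is a minimal sketch of an approximate policy iteration loop on a small, fully known MDP. It assumes a linear architecture $\tilde{V} = \Phi w$ fit by least squares to a noisy evaluation of $V^{\pi}$; the function name, the feature matrix Phi, and the Gaussian noise model are illustrative assumptions and not part of the original algorithm description.

import numpy as np

def approximate_policy_iteration(P, R, Phi, gamma=0.9, n_iters=50, noise=0.01, seed=0):
    """Sketch: approximate policy iteration on a small, fully known MDP.

    P   : array (n_actions, n_states, n_states) of transition probabilities
    R   : array (n_states, n_actions) of rewards
    Phi : array (n_states, k) of features; the architecture is tilde-V = Phi @ w
    The Gaussian noise added to the evaluation stands in for simulation error.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy

    for _ in range(n_iters):
        # Approximate policy evaluation: solve for V^pi, corrupt it with noise,
        # then project it onto the architecture (least-squares fit of w).
        P_pi = P[pi, np.arange(n_states)]              # row s is P[pi[s], s, :]
        R_pi = R[np.arange(n_states), pi]
        V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        w, *_ = np.linalg.lstsq(Phi, V_pi + noise * rng.standard_normal(n_states),
                                rcond=None)
        V_tilde = Phi @ w                              # tilde-V, only approximates V^pi

        # Greedy improvement with respect to tilde-V (computed exactly here;
        # in the approximate setting we might only get a policy close to it).
        Q = np.stack([R[:, a] + gamma * P[a] @ V_tilde for a in range(n_actions)],
                     axis=1)
        pi = Q.argmax(axis=1)

    return pi, V_tilde

The two error sources discussed above correspond to the two approximate steps in the loop: the least-squares fit of the noisy evaluation, and the improvement step that is greedy only with respect to $\tilde{V}$ rather than $V^{\pi}$.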
Figure: Regular Policy Iteration
Figure: Approximate Policy Iteration