Next: "Encoding" Functions
Up: TD-Gammon
Previous: State encoding TD-Gammon
MDP description
1. The discount factor is set to $\gamma = 1$.
2. Immediate rewards:
   (a) In non-terminal states the immediate reward is 0.
   (b) In a winning terminal state the immediate reward is 1.
   (c) In a losing terminal state the immediate reward is 0.
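As a minimal sketch of this reward structure (the state representation and the attributes `is_terminal` and `is_win` are assumptions for illustration, not part of the notes):

```python
def reward(state):
    # Immediate reward of the backgammon MDP described above.
    # `state.is_terminal` and `state.is_win` are hypothetical attributes;
    # the notes do not fix a concrete state representation.
    if not state.is_terminal:
        return 0.0                       # (a) non-terminal states
    return 1.0 if state.is_win else 0.0  # (b) winning / (c) losing terminal state
```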
Since all intermediate rewards are 0 and $\gamma = 1$, the temporal difference is simply $V(s_{t+1}) - V(s_t)$ for all non-terminal states.
In every move the parameters are changed in the direction of the TD. Generally, assume we have a function $F(s,r) = V_r(s)$, which assigns each state $s$ a value according to the parameter vector $r$ (one can think of $r$ as the parameters of a program which receives a state $s$ as input). We update $r$ by the derivative of $V_r(s)$ with respect to $r$, i.e., according to the vector $\nabla_r V_r(s)$.
Updating $r$ in this direction will hopefully change the value of $V_r(s)$ in the "right" direction. TD tries to minimize the difference between two succeeding states: assuming that $V_r(s_{t+1}) > V_r(s_t)$, we would like to "strengthen" the weight of the action taken, and so we update in the direction of $\nabla_r V_r(s_t)$.
For example, if $r$ is a table, i.e., $V_r(s) = r_s$, then $\nabla_r V_r(s) = e_s$, the unit vector with 1 in the $s$-th coordinate and 0 elsewhere, and the update will occur only in the $s$-th entry of $r$.
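A small Python sketch of this tabular case (the state indices and the learning rate `alpha` are illustrative assumptions, not values from the notes):

```python
import numpy as np

def tabular_td_update(r, s_t, s_next, alpha=0.1):
    # One TD step when V_r(s) = r[s] (table lookup).
    # The gradient of V_r(s_t) with respect to r is the unit vector e_{s_t},
    # so only the s_t-th entry of the table changes.
    td = r[s_next] - r[s_t]      # temporal difference V(s_{t+1}) - V(s_t)
    r[s_t] += alpha * td         # update only entry s_t
    return r

# Illustrative usage: 5 states, one transition from state 2 to state 3.
r = np.zeros(5)
r[3] = 1.0                                # suppose state 3 already has value 1
tabular_td_update(r, s_t=2, s_next=3)     # r[2] becomes 0.1; other entries untouched
```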
TD-Gammon updates $r$ while running the system, where the current policy is the greedy policy with respect to the function $V_r(s)$. Specifically,

$$r_{t+1} = r_t + \alpha \big( V_r(s_{t+1}) - V_r(s_t) \big) \nabla_r V_r(s_t),$$

where $\alpha > 0$ is the learning rate.
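A sketch of this greedy policy in Python; `legal_successors` and the value function `V` are hypothetical placeholders for whatever move generator and evaluation function the program uses:

```python
def greedy_move(r, s, legal_successors, V):
    # Greedy policy with respect to V_r: among all states reachable from s
    # (for the current dice roll), pick the one with the highest estimated value.
    return max(legal_successors(s), key=lambda s_next: V(r, s_next))
```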
TD-Gammon simply "learns" by playing against itself, i.e., it updates $r$ while playing against itself. After about 300,000 games the system achieved a very good playing skill (comparable to other backgammon-playing programs).
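Putting the pieces together, the self-play training could be sketched as follows; `initial_state`, `legal_successors`, `V`, and `grad_V` are assumed interfaces (the notes do not specify them), and `V` is assumed to return the actual outcome (1 or 0) at terminal states:

```python
def self_play_train(r, V, grad_V, initial_state, legal_successors,
                    num_games=300_000, alpha=0.1):
    # Train by self-play: moves are chosen greedily with respect to V_r,
    # and r is updated after every move (on-line TD learning).
    # For simplicity this sketch has both sides maximize V_r; in a real
    # two-player setting each side would evaluate from its own perspective.
    for _ in range(num_games):
        s = initial_state()
        while not s.is_terminal:
            # Greedy move: successor state with the highest estimated value.
            s_next = max(legal_successors(s), key=lambda s2: V(r, s2))
            # TD update: r <- r + alpha * (V_r(s_{t+1}) - V_r(s_t)) * grad_r V_r(s_t)
            r = r + alpha * (V(r, s_next) - V(r, s)) * grad_V(r, s)
            s = s_next
    return r
```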
Next: "Encoding" Functions
Up: TD-Gammon
Previous: State encoding TD-Gammon
Yishay Mansour
2000-01-17