
TD-Gammon

Let $V^*(s,0)$ be the probability that white wins from state $s$ when it is white's turn, assuming both white and black play optimally. Let $V^*(s,1)$ be the same probability when it is black's turn. We estimate $V^*(s,l)$ with a neural network that computes $\tilde{V}(s,l,r)$, where $r$ is the vector of network parameters.
The Neural Network:

198 inputs.
40 nodes in the second level.
1 output node in the third level.

Network Initialization: (small) random weights.
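The architecture above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sigmoid activations, the bias terms, the initialization range `scale`, and the names `init_params` and `value` are not given in the text, and the encoding of (s, l) into the 198 inputs is assumed to be done elsewhere.

```python
import math
import random

N_INPUTS, N_HIDDEN = 198, 40  # layer sizes taken from the text


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(rng, scale=0.1):
    """Draw small random weights; `scale` is an assumed value."""
    w_hidden = [[rng.uniform(-scale, scale) for _ in range(N_INPUTS)]
                for _ in range(N_HIDDEN)]
    b_hidden = [rng.uniform(-scale, scale) for _ in range(N_HIDDEN)]
    w_out = [rng.uniform(-scale, scale) for _ in range(N_HIDDEN)]
    b_out = rng.uniform(-scale, scale)
    return w_hidden, b_hidden, w_out, b_out


def value(x, params):
    """Forward pass: 198 inputs -> 40 hidden units -> 1 output in (0, 1)."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
```

A sigmoid output is a natural (assumed) choice here because the network is meant to estimate a probability.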
Training Method: The program plays both sides. At each time step t we have a state $s_t$, a parameter vector $r_t$, and a turn $l_t$. For each state $s'$ reachable from $s_t$ (given the dice roll), we compute $\tilde{V}(s',l_{t},r_{t})$ and choose the best state: on white's turn the state with the maximum value, on black's turn the state with the minimum.
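The move-selection rule can be sketched as follows. The names `choose_successor` and `value_fn` are hypothetical; `value_fn` stands in for the network evaluated with the current turn and parameters, and `successors` for the states reachable with the current dice roll.

```python
def choose_successor(successors, turn, value_fn):
    """Greedy choice: turn 0 is white (maximize), turn 1 is black (minimize)."""
    pick = max if turn == 0 else min
    return pick(successors, key=value_fn)


# Toy usage with made-up values for three candidate states.
toy_values = {"a": 0.40, "b": 0.65, "c": 0.20}
white_choice = choose_successor(list(toy_values), 0, toy_values.get)
black_choice = choose_successor(list(toy_values), 1, toy_values.get)
```

With these toy values white picks state "b" and black picks state "c".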
Updating Parameters: At the end of each turn, we compute:

$\begin{displaymath}d_{t}=\tilde{V}(s_{t+1},l_{t+1},r_{t}) - \tilde{V}(s_{t},l_{t},r_{t})\ , \end{displaymath}$

which is the temporal difference (TD). (In the final state we replace $\tilde{V}$ with the game outcome.) In addition, we update $\overrightarrow{r_{t}}$:

$\begin{displaymath}\overrightarrow{r_{t+1}}\leftarrow \overrightarrow{r_{t}}+\alpha\, d_{t} \underbrace{\sum_{k \le t} \gamma^{t-k} \nabla_{r_{k}} \tilde{V}(s_{k},l_{k},r_{k})}_{\overrightarrow{e_{t}}}\ . \end{displaymath}$

At the end of each game a new game is started, and $r_0$ is set to the previous game's final parameter vector.

1. $\alpha$ is set to a constant (determined by experiments).
2. $\gamma$ does not affect the results significantly in this case.
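The update can be sketched with a running trace vector, assuming the sum of past gradients is decayed by a factor $\gamma$ per step, so that $e_t = \gamma e_{t-1} + \nabla \tilde{V}(s_t,l_t,r_t)$. To keep the example self-contained, the 198-40-1 network is replaced by a single sigmoid unit over a feature vector, whose gradient has a closed form. The function names, the feature vectors, and the default values of `alpha` and `gamma` are all illustrative assumptions.

```python
import math


def v_and_grad(x, r):
    """Sigmoid-of-linear stand-in for the network: returns V and dV/dr."""
    z = sum(ri * xi for ri, xi in zip(r, x))
    v = 1.0 / (1.0 + math.exp(-z))
    return v, [v * (1.0 - v) * xi for xi in x]


def td_episode(features, outcome, r, alpha=0.1, gamma=0.7):
    """Apply the TD update along one game.

    features: feature vectors of the visited (s_t, l_t), in order.
    outcome:  1.0 if white won, 0.0 otherwise.
    Returns the updated parameter vector (the input list is not modified).
    """
    r = list(r)
    e = [0.0] * len(r)  # trace: decayed sum of past gradients
    for t, x in enumerate(features):
        v_t, grad = v_and_grad(x, r)
        e = [gamma * ei + gi for ei, gi in zip(e, grad)]
        if t + 1 < len(features):
            v_next, _ = v_and_grad(features[t + 1], r)
        else:
            v_next = outcome  # final state: use the game outcome
        d = v_next - v_t  # temporal difference d_t
        r = [ri + alpha * d * ei for ri, ei in zip(r, e)]
    return r
```

Keeping the trace incrementally avoids recomputing the whole sum at every step; each gradient is evaluated with the parameters in force at the time its state was visited.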

At the end of the training phase, we get a function $\tilde{V}(s,l,r)$ (with r fixed) which we can use to play backgammon.
Improvements:

Instead of 40 nodes at the second level, we use 80: the additional 40 units are set to represent important patterns for backgammon.
Once $\tilde{V}$ has been trained, it can be used directly (looking one step ahead), or, alternatively, we can look a few steps ahead by building a game search tree.
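A deeper lookahead can be sketched as a small search that averages over dice rolls and lets each side pick its best reply. This is a minimal sketch, not the search used in TD-Gammon: `value_fn(state, turn)` and `successors_fn(state, turn, roll)` are hypothetical callables supplied by the caller, and terminal positions are assumed to be handled inside `value_fn`.

```python
from itertools import combinations_with_replacement


def dice_rolls():
    """The 21 distinct rolls of two dice with their probabilities."""
    return [((a, b), (1 if a == b else 2) / 36.0)
            for a, b in combinations_with_replacement(range(1, 7), 2)]


def lookahead_value(state, turn, depth, value_fn, successors_fn):
    """Expected value of `state` with `turn` to roll, searching `depth` plies."""
    if depth == 0:
        return value_fn(state, turn)
    pick = max if turn == 0 else min  # white maximizes, black minimizes
    total = 0.0
    for roll, prob in dice_rolls():
        children = successors_fn(state, turn, roll)
        if not children:  # no legal move: the position is unchanged
            children = [state]
        total += prob * pick(
            lookahead_value(c, 1 - turn, depth - 1, value_fn, successors_fn)
            for c in children)
    return total
```

The cost grows quickly with depth, since every ply multiplies the work by the 21 rolls times the number of legal moves, which is why only a few steps of lookahead are practical.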

Comments:

$\tilde{V}$ was not a good estimate of the probability of white winning, but the policy derived from it is a very good one. (The reason for this is as yet unclear.)
The chosen policy is fully greedy (moves are selected deterministically). However, backgammon involves a lot of randomness because of the dice. This lets the program "explore", even though no explicit exploration is performed.
TD-Gammon has achieved a skill level close to the level of the best players in the world today!


Yishay Mansour
2000-01-17