Reinforcement Learning Workshop - Project Page

Four in a Row Automated Player -
Reinforcement Learning Workshop Project

Authors

Eli Libman

Shanee Lavi

Nur Lan

Introduction

About the Game

Learning Algorithm

Game Representation

Training Process

Training is done by having two copies of the player play each other, each learning its respective role.
Bootstrapping was achieved by hard coding:

winning moves – cause victory rewards to be awarded early
random start moves – allow us to explore more of the state space
self play – allows the player to be more challenged as the learning process continues
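The three bootstrapping elements above can be sketched as a single self-play game loop. This is only an illustrative sketch: the 6x7 board, the integer player ids 1 and 2, and every function name here (`winning_move`, `play_game`, `policy`, `random_start_moves`) are assumptions made for the example, not the project's actual code.

```python
import random

ROWS, COLS = 6, 7  # assumed standard board size
EMPTY = 0

def new_board():
    return [[EMPTY] * COLS for _ in range(ROWS)]

def legal_moves(board):
    # A column is playable while its top cell (row 0) is empty.
    return [c for c in range(COLS) if board[0][c] == EMPTY]

def drop(board, col, player):
    """Return a copy of the board with the player's piece dropped into col."""
    new = [row[:] for row in board]
    for r in range(ROWS - 1, -1, -1):  # search from the bottom row upward
        if new[r][col] == EMPTY:
            new[r][col] = player
            return new
    raise ValueError("column is full")

def wins(board, player):
    """True if the player has four in a row in any direction."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS
                       and board[rr][cc] == player for rr, cc in cells):
                    return True
    return False

def winning_move(board, player):
    """Hard-coded rule: return a column that wins immediately, or None."""
    for col in legal_moves(board):
        if wins(drop(board, col, player), player):
            return col
    return None

def play_game(policy, random_start_moves=2, rng=random):
    """Play one self-play game; return the winner (1 or 2) or 0 for a draw.

    policy(board, player) is a hypothetical callback standing in for the
    learned player; both sides use the same callback (self play).
    """
    board, player = new_board(), 1
    for moves_played in range(ROWS * COLS):
        if moves_played < random_start_moves:
            col = rng.choice(legal_moves(board))   # random start move
        else:
            col = winning_move(board, player)      # take a win if one exists
            if col is None:
                col = policy(board, player)        # otherwise ask the learner
        board = drop(board, col, player)
        if wins(board, player):
            return player  # a learning update would use the final reward here
        player = 3 - player  # switch between players 1 and 2
    return 0
```

Passing a uniformly random `policy` gives a quick sanity check of the loop before any learning is plugged in.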

Learning rate – a function of t, where t is the number of games played so far in the training session.

This way we achieve a properly decaying learning rate.

Discount factor – one, because rewards are only given at the end of the game.
Epsilon – to balance exploration and exploitation, we choose the next action with an ε-greedy policy. We tested three methods for determining ε throughout the training:

A constant value: with probability ε we choose a random action, otherwise – a greedy action.
Epsilon decreasing strategy: decreasing epsilon as the learning progresses allows us to explore more at the beginning of the learning process and exploit as we converge. We tested two methods:
- Using a constant c: we reduce the exploration rate after each game. The value of c determines how long ε takes to converge to zero.
- Using a reduction function.
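As a rough illustration of the ε-greedy choice and of an epsilon-decreasing update, here is a minimal sketch. The multiplicative form of `anneal` is an assumption suggested by the "annealing factor" mentioned in the final training parameters; the exact formulas used in the project are not shown on this page. The names `epsilon_greedy`, `anneal` and the dict-based `q_values` are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from q_values, a dict mapping action -> estimated value.

    With probability epsilon a random action is chosen (exploration),
    otherwise the action with the highest estimate (exploitation).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)

def anneal(epsilon, factor):
    """Assumed update: shrink epsilon by a constant factor after each game."""
    return epsilon * factor

# Usage: estimates for three columns, with epsilon reduced after every game.
q = {0: 0.1, 3: 0.7, 5: -0.2}
epsilon = 0.9
for _ in range(3):
    action = epsilon_greedy(q, epsilon)
    epsilon = anneal(epsilon, 0.5)
```

A factor close to 1 keeps exploration alive for many games, while a smaller factor switches to mostly greedy play much sooner.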

Q Approximation weights – These are the values to be learned by the training process.

Final training parameters
Our final player was trained using the following technique:

Gamma was set to 1
Initial epsilon value was set to 0.9
The epsilon update method chosen was annealing
Altogether we played 2 million games:
- We ran a session of 1 million games with an annealing factor of 0.99999
- We ran another session of 1 million games with an annealing factor of 0.9999. This run continued the learning of the previous run (meaning the initial epsilon of this run is the final epsilon of the previous run); the only difference was the anneal factor.
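To get a feel for these numbers, the following sketch computes the exploration rate at the end of each session. It assumes that annealing means multiplying epsilon by the factor once per game; that reading is an assumption, and `epsilon_after` is a hypothetical helper.

```python
def epsilon_after(initial_epsilon, factor, games):
    """Closed form of applying `epsilon *= factor` once per game."""
    return initial_epsilon * factor ** games

# Session 1: start at 0.9, factor 0.99999, 1 million games.
eps_session_1 = epsilon_after(0.9, 0.99999, 1_000_000)

# Session 2: continues from the final epsilon of session 1, factor 0.9999.
eps_session_2 = epsilon_after(eps_session_1, 0.9999, 1_000_000)

print(f"after session 1: {eps_session_1:.3e}")
print(f"after session 2: {eps_session_2:.3e}")
```

Under this assumption, epsilon ends the first session at roughly 4e-5, and by the end of the second session it is effectively zero, so play in the second session would be almost entirely greedy.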

Related Background

Workshop Teachers

Prof. Yishai Mansour

TA: Mariano Schain