RL Player template library

RL Course - Final Project

Rusinov Alexander 308765957

Toolkit For Reinforcement Learning Process Implementation

Introduction

There is variety of applications containing some kind (or kinds) of units having following properties:
Provided with some set of data by application, the unit chooses action from list of currently applicable actions. Action could be choosen with some predefined policy or using some kind of learning process. Application designer (toolkit "user") may be interested to use learning algorithm to choose next action or to evaluate performance of some predefined policy.

Abstract

We would try to provide application designer with uniform and easy way to implement learning process.

Set of data provided by application is ussually contains too much information to make up reasonable learning process.So application designer (toolkit "user") has to create model which is reducing information used in the learning process but still describing process of execution reasonably well (precise).

We would provide application designer with toolkit for describing models of learning process and with library of learning algorithms to pick up the most suitable one.

The library is template library to permitt user to define different learning process models.

Terminology

We'll call a unit which is provided with data set and chooses action to perform : "player". We'll refer to data type containing set of data provided to the "player" as "Observation" type. Enumeration type denoting all possible actions should be provided. We'll refer it as "Action " type.

"Player" class interface.

Player class is templated with Observation & Action parameters.Generic definition of player class is placed in rl_headers/player.h file. In this file simple interface to be implemented by "player" is defined:

// new run is started : get initial observation & choose action
virtual Action start_run (Observation* startObserv) = 0;

// inside run : get current observation ,choose action , update
// statistics by using knoweledge of the last achieved reward
virtual Action do_step (Observation* currObserv, double lastReward) = 0;

// run is finished : get final observation, update statistics
virtual void end_run (Observation* currObserv, double lastReward) = 0;

// print accumulated statistics.
virtual void print_stats () = 0;

Creation of "player" instances

In the same file rl_headers/player.h generic definition of factory creating players is provided. All parameters needed for specific player creation are passed through string parameter.

virtual Player<Observation,Action>* CreatePlayer (const char* params) = 0;

Reinforcement Learning Algorithms Library

Generally, there is no limitation on how player is defined, given it implements specified in player.h interface. But the toolkit provides the user with the template library which implements players that use Reinforcement Learning algorithms (look rl_headers/lrn_alg.h ) .

All Reinforcement Learning algorithms are based on the concept of "State" and "Reward" . Action made will transfer the player to some state (or leave it in the same state) and provide player with some reward. The algorithm computes real number valued function associated with State (look rl_headers/evaluation_alg.h ) or with [State,Action] pair (look rl_headers/steplrn_alg.h ) used to evaluate total reward for the user starting from any specific state or starting from any specific state by performing any specific action.

The toolkit also provides the user with way to create model which is suitable for reinforcement learning process by creating model definition file and providing it as input for code generation process.

On-line Vs Off-line policies

We adistinguish two types of Reinforcement Learning (RL) Players.

One of them uses some predefined policy for action choise and uses Reinforcement Learning algorithm only to evaluate this not depended on the algorithm policy. Currently provided algotithms of this kind are: TD0 (look rl_headers/td0_alg.h) and Q-Learning (look rl_headers/qlrn_alg.h ).
Another kind of RL Players uses the evaluation function to choose next action (usually using eps-greedy policy!). Currently provided algotithm of this kind is: sarsa (look rl_headers/sarsa_alg.h)

Model Definition.

As stated, players have observation & action as template parameters. Different implementations for player with same observation & action types would have different internal player states, different way of transformation observation to internal state, different subsets of permitted actions for state and so on.

There is way to generate specific player implementations by creating configuration file containing learning model definition.With the toolkit the Black Jack sample application is provided. The model definition file for the apllication is provided (look bjack_game/player3.rep ).

Data for Code Generation Process

In the model definition file we place following definitions:

Fields to be used in player internal state definitions.In the present library version all such fields are of enumeration types.
So, the format is:FieldName EnumTypeName
States Types in format: StateTypeName FieldName1{,...,FieldNameN}
Transformation from Observation Type to State Type: TransName definition. This definition is provided as conventional C++ code in the model definition file or inserted into the generated code .
Range defines group of states as follows:
RangeName StateTypeName FieldName1[MinValue1..MaxValue1]{,..., FieldNameN[MinValueN..MaxValueN]} (for every field in the state definition).
Actions permitted for range are specified in the following format: ActPerRangeName RangeName Action
In the present version only Markov policies can be generated automatically . Such policy is defined,as follows : PolicyName StateTypeName ActPerRangeName1{,ActPerRangeName2,...,ActPerRangeNameN}
Player Implementation is specified in the following format:
PlayerName StateTypeName TransName ActPerRangeName1{,ActPerRangeName2,...,ActPerRangeNameN}
where StateTypeName is internal state of the player definition, TransName is C++ code used to get internal state from the current observation and list of actions per range specifies all actions which are valid for different internal states.

How To Use The Toolkit.

So, to inplement leaning process using the toolkit (look bjack_game/bjack2.cc sample):

Create data file describing models for all kinds of RL players for the current application which means : players with Observation & Action types provided by the application.
Set environment variable TEMPLATE_DIR to templates and run script scripts/gen_defs.pl using model defintion file as input to perform generation of the code for specific player implementations.
In the application module using player object include file rl_headers/rl_player.h
Include also the file produced by code generation process.
Get pointer to RL Players factory:
PlayerFactory* factory = RL_PlayerFactory::Instance();
Create new player using the following command: Player* player = factory->CreatePlayer(params); where params string contains all parameters needed for player creation: in format :
- RL PlayerName AlgKind Lambda PolicyName , when player uses some fixed policy. Possible AlgKind values are TD0,Q_Learning.
- RL PlayerName AlgKind Lambda Eps , when player uses eps-greedy policy. Possible AlgKind value is Sarsa.

Toolkit availability

Get rl_player.tar "tar" archive from the RL course site.

When opened it contains the current file and 4 directories:

rl_headers contains reinforcement learning algorithms and player creation header files.
templates contains "templates" for code generation
scripts contains script for code generation process
black_jack contains sample application

Currently the toolkit is compiled and works properly on Linux & Sun OS platforms. Migration to MS Windows is not finished yet because of differences in C++ Standard Library & Perl implementations.