There is a variety of applications containing some kind (or kinds) of unit with the following property:
Provided with some set of data by the application, the unit chooses an action from the list of currently applicable actions.
We'll call such a unit a "player". We'll refer to the data type containing the set of data provided to the "player" as the "Observation" type. An enumeration type denoting all possible actions should be provided; we'll refer to it as the "Action" type.
The Player class is templated with Observation & Action parameters. The generic definition of the Player class is placed in the rl_headers/player.h file.
In this file a simple interface to be implemented by the "player" is defined:
// new run is started : get initial observation & choose action
virtual Action start_run (Observation* startObserv) = 0;
// inside run : get current observation, choose action,
// update statistics by using knowledge of the last achieved reward
virtual Action do_step (Observation* currObserv, double lastReward) = 0;
// run is finished : get final observation, update statistics
virtual void end_run (Observation* currObserv, double lastReward) = 0;
// print accumulated statistics
virtual void print_stats () = 0;
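For illustration, a minimal player implementing this interface might look roughly as follows. This is only a sketch: just the virtual interface quoted above is taken from player.h, while the class FixedPlayer and its behaviour are hypothetical.

    #include "rl_headers/player.h"
    #include <cstdio>

    // Purely illustrative: a player that ignores its observations and always
    // returns the same action, while accumulating the rewards it receives.
    template <class Observation, class Action>
    class FixedPlayer : public Player<Observation, Action> {
    public:
        FixedPlayer (Action fixed)
            : fixed_(fixed), runs_(0), total_reward_(0.0) {}

        virtual Action start_run (Observation* startObserv)
        { return fixed_; }

        virtual Action do_step (Observation* currObserv, double lastReward)
        { total_reward_ += lastReward; return fixed_; }

        virtual void end_run (Observation* currObserv, double lastReward)
        { total_reward_ += lastReward; ++runs_; }

        virtual void print_stats ()
        {
            std::printf("runs: %d, average reward per run: %f\n",
                        runs_, runs_ > 0 ? total_reward_ / runs_ : 0.0);
        }

    private:
        Action fixed_;
        int    runs_;
        double total_reward_;
    };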
In the same file rl_headers/player.h a generic definition of the factory creating players is provided. All parameters needed for the creation of a specific player are passed through a string parameter.
virtual Player<Observation,Action>* CreatePlayer (const char* params) = 0;
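As a purely hypothetical illustration (this is not the library's RL_PlayerFactory), an application-defined factory could implement CreatePlayer by parsing the params string. The types MyObservation and MyAction are invented here, FixedPlayer is the sketch shown earlier, and the assumption that the factory template takes the same Observation & Action parameters as Player should be checked against player.h.

    #include "rl_headers/player.h"
    #include <cstring>

    // Hypothetical application types for the sketch below.
    enum MyAction { HIT, STAND };
    struct MyObservation { int value; };

    class FixedPlayerFactory : public PlayerFactory<MyObservation, MyAction> {
    public:
        virtual Player<MyObservation, MyAction>* CreatePlayer (const char* params)
        {
            // e.g. params == "FIXED HIT" -> a player that always plays HIT
            if (std::strstr(params, "HIT") != 0)
                return new FixedPlayer<MyObservation, MyAction>(HIT);
            return new FixedPlayer<MyObservation, MyAction>(STAND);
        }
    };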
Generally, there is no limitation on how a player is defined, as long as it implements the interface specified in player.h. However, the provided library implements players that use Reinforcement Learning algorithms (see rl_headers/lrn_alg.h).
All Reinforcement Learning algorithms are based on the concept of a "State", which describes the current player state, and a real-valued function associated either with a State (see rl_headers/evaluation_alg.h) or with a [State,Action] pair (see rl_headers/steplrn_alg.h).
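Conceptually, the two variants can be pictured as follows. This is only a sketch of the two function shapes, not the actual contents of evaluation_alg.h or steplrn_alg.h.

    // Sketch only: a function of the State alone (a state-value function V(s)).
    template <class State>
    struct StateValueSketch {
        virtual double value (const State& s) = 0;
    };

    // Sketch only: a function of a [State,Action] pair (an action-value
    // function Q(s,a)).
    template <class State, class Action>
    struct StateActionValueSketch {
        virtual double value (const State& s, Action a) = 0;
    };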
We also distinguish two types of Reinforcement Learning (RL) Players.
- One of them uses some predefined policy for action choice and uses the Reinforcement Learning algorithm only to evaluate this policy, which does not depend on the algorithm. Currently provided algorithms of this kind are TD0 (see rl_headers/td0_alg.h) and Q-Learning (see rl_headers/qlrn_alg.h).
- Another kind of RL Player uses the evaluated function to choose the next action (usually using an eps-greedy policy; a sketch follows this list). The currently provided algorithm of this kind is Sarsa (see rl_headers/sarsa_alg.h).
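For reference, the eps-greedy choice mentioned above can be sketched in textbook form as follows. This is generic illustration code, not the library's Sarsa implementation; the Q-values for the current state are assumed to be tabulated in a vector indexed by action.

    #include <cstdlib>
    #include <vector>

    // Textbook eps-greedy selection: with probability eps pick a random
    // applicable action (exploration), otherwise pick the action with the
    // highest current Q(s,a) estimate (exploitation).
    int eps_greedy (const std::vector<double>& qValues, double eps)
    {
        if ((double) std::rand() / RAND_MAX < eps)
            return (int) (std::rand() % qValues.size());
        int best = 0;
        for (int a = 1; a < (int) qValues.size(); ++a)
            if (qValues[a] > qValues[best])
                best = a;
        return best;
    }

    // For context, Sarsa's standard one-step tabular update of the chosen
    // [State,Action] pair is:
    //   Q(s,a) += alpha * (reward + gamma * Q(s',a') - Q(s,a))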
As stated, players have Observation & Action as template parameters. Different implementations of a player with the same Observation & Action types may have different internal player states. Those specific implementations may be created using a code generation process.
The Black Jack sample application provided with the library contains a data file for such a code generation process (see bjack_game/player2.rep). In this file we place the following definitions (an illustrative example follows the list below):
- Fields to be used in player internal state definitions. In the present library version all such fields are of enumeration types. So the format is: FieldName EnumTypeName
- State Types in the format: StateTypeName FieldName1{,...,FieldNameN}
- Transformation from Observation Type to State Type: TransName definition. This definition is provided as conventional C++ code in the definition file or inserted into the generated code.
- A Range defines a group of states as follows:
RangeName StateTypeName FieldName1[MinValue1..MaxValue1]{,...,FieldNameN[MinValueN..MaxValueN]}
(for every field in the state definition).
- Actions permitted for a range are specified in the following format:
ActPerRangeName RangeName Action
- In the present version only Markov policies can be generated automatically. Such a policy is defined as follows:
PolicyName StateTypeName ActPerRangeName1{,ActPerRangeName2,...,ActPerRangeNameN}
- Player Implementation is specified in the following format:
PlayerName StateTypeName TransName ActPerRangeName1{,ActPerRangeName2,...,ActPerRangeNameN}
where StateTypeName is the internal state definition of the player, TransName is the C++ code used to get the internal state from the current observation, and the list of actions per range specifies all actions which are valid for the different internal states.
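Putting these declaration kinds together, a definition file could contain entries roughly like the ones below. The labels on the left are only annotations (not part of the file), all names are invented for illustration, and the exact syntax, in particular how the C++ transformation code is embedded, should be taken from bjack_game/player2.rep:

    field:                  Value HandValue
    state type:             HandState Value
    transformation:         Hand2State <C++ code filling a HandState from an Observation>
    range:                  LowRange HandState Value[LOW..MID]
    actions per range:      LowActs LowRange HIT
    Markov policy:          LowPolicy HandState LowActs
    player implementation:  MyPlayer HandState Hand2State LowActs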
So, to use the library (see the bjack_game/bjack1.cc sample):
- Create a data file describing models for all kinds of RL players for the current application, which means: players with the Observation & Action types provided by the application.
- Set the environment variable TEMPLATE_DIR to templates and run the script scripts/gen_defs.pl using the data file as input to generate the code for the specific player implementations.
- In the application module using the player object, include the file
rl_headers/rl_player.h
- Get a pointer to the RL Players factory:
PlayerFactory* factory = RL_PlayerFactory::Instance();
- Get a new player using the following command (a combined sketch follows this list):
Player* player = factory->CreatePlayer(params);
where the params string contains all parameters needed for the player creation, in the following format:
- RL PlayerName AlgKind Lambda PolicyName, when the player uses some fixed policy. Possible AlgKind values are TD0, Q_Learning.
- RL PlayerName AlgKind Lambda Eps, when the player uses an eps-greedy policy. The possible AlgKind value is Sarsa.
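Putting the last steps together, the calling code could look roughly like this ("MyPlayer" is a placeholder for a player name defined in the data file, and 0.9 / 0.1 are example Lambda and Eps values):

    #include "rl_headers/rl_player.h"

    int main ()
    {
        // Get a pointer to the RL Players factory, as described above.
        PlayerFactory* factory = RL_PlayerFactory::Instance();

        // Create a Sarsa player with Lambda = 0.9 and eps-greedy Eps = 0.1.
        Player* player = factory->CreatePlayer("RL MyPlayer Sarsa 0.9 0.1");

        // ... drive the player through runs with start_run / do_step / end_run ...

        player->print_stats();
        delete player;
        return 0;
    }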
Currently the library is compiled on Linux & Sun OS. Migration to Windows is not finished yet because of differences in the C++ Standard Library & Perl implementations.