There is a variety of applications containing some kind (or kinds) of unit with the following property:
Provided with some set of data by the application, the unit chooses an action from the list of currently applicable actions.
We'll call such a unit a "player". We'll refer to the data type containing the set of data provided to the "player" as the "Observation" type. An enumeration type denoting all possible actions should be provided; we'll refer to it as the "Action" type.
The Player class is templated with Observation & Action parameters. The generic definition of the Player class is placed in the rl_headers/player.h file.
In this file a simple interface to be implemented by the "player" is defined:
// new run is started : get initial observation & choose action
virtual Action start_run (Observation* startObserv) = 0;
// inside run : get current observation, choose action,
// update statistics by using knowledge of the last achieved reward
virtual Action do_step (Observation* currObserv, double lastReward) = 0;
// run is finished : get final observation, update statistics
virtual void end_run (Observation* currObserv, double lastReward) = 0;
// print accumulated statistics
virtual void print_stats () = 0;
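For illustration, a minimal player implementing this interface might look roughly as follows. This is only a sketch: just the virtual interface quoted above is taken from player.h, while the class FixedPlayer and its behaviour are hypothetical.

    #include "rl_headers/player.h"
    #include <cstdio>

    // Purely illustrative: a player that ignores its observations and always
    // returns the same action, while accumulating the rewards it receives.
    template <class Observation, class Action>
    class FixedPlayer : public Player<Observation, Action> {
    public:
        FixedPlayer (Action fixed)
            : fixed_(fixed), runs_(0), total_reward_(0.0) {}

        virtual Action start_run (Observation* startObserv)
        { return fixed_; }

        virtual Action do_step (Observation* currObserv, double lastReward)
        { total_reward_ += lastReward; return fixed_; }

        virtual void end_run (Observation* currObserv, double lastReward)
        { total_reward_ += lastReward; ++runs_; }

        virtual void print_stats ()
        {
            std::printf("runs: %d, average reward per run: %f\n",
                        runs_, runs_ > 0 ? total_reward_ / runs_ : 0.0);
        }

    private:
        Action fixed_;
        int    runs_;
        double total_reward_;
    };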
In the same file rl_headers/player.h a generic definition of the factory creating players is provided. All parameters needed for the creation of a specific player are passed through a string parameter.
virtual Player<Observation,Action>* CreatePlayer (const char* params) = 0;
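As a purely hypothetical illustration (this is not the library's RL_PlayerFactory), an application-defined factory could implement CreatePlayer by parsing the params string. The types MyObservation and MyAction are invented here, FixedPlayer is the sketch shown earlier, and the assumption that the factory template takes the same Observation & Action parameters as Player should be checked against player.h.

    #include "rl_headers/player.h"
    #include <cstring>

    // Hypothetical application types for the sketch below.
    enum MyAction { HIT, STAND };
    struct MyObservation { int value; };

    class FixedPlayerFactory : public PlayerFactory<MyObservation, MyAction> {
    public:
        virtual Player<MyObservation, MyAction>* CreatePlayer (const char* params)
        {
            // e.g. params == "FIXED HIT" -> a player that always plays HIT
            if (std::strstr(params, "HIT") != 0)
                return new FixedPlayer<MyObservation, MyAction>(HIT);
            return new FixedPlayer<MyObservation, MyAction>(STAND);
        }
    };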
Generally, there is no limitation on how a player is defined, as long as it implements the interface specified in player.h. However, the provided library implements players that use Reinforcement Learning algorithms (see rl_headers/lrn_alg.h).
All Reinforcement Learning algorithms are based on the concept of a "State", which describes the current player state, and a real-valued function associated either with a State (see rl_headers/evaluation_alg.h) or with a [State,Action] pair (see rl_headers/steplrn_alg.h).
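Conceptually, the two variants can be pictured as follows. This is only a sketch of the two function shapes, not the actual contents of evaluation_alg.h or steplrn_alg.h.

    // Sketch only: a function of the State alone (a state-value function V(s)).
    template <class State>
    struct StateValueSketch {
        virtual double value (const State& s) = 0;
    };

    // Sketch only: a function of a [State,Action] pair (an action-value
    // function Q(s,a)).
    template <class State, class Action>
    struct StateActionValueSketch {
        virtual double value (const State& s, Action a) = 0;
    };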
We also distinguish two types of Reinforcement Learning (RL) Players.
- One of them uses some predefined policy for action choice and uses the Reinforcement Learning algorithm only to evaluate this policy, which does not depend on the algorithm. Currently provided algorithms of this kind are TD0 (see rl_headers/td0_alg.h) and Q-Learning (see rl_headers/qlrn_alg.h).
- Another kind of RL Player uses the evaluated function to choose the next action (usually using an eps-greedy policy; a sketch follows this list). The currently provided algorithm of this kind is Sarsa (see rl_headers/sarsa_alg.h).
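For reference, the eps-greedy choice mentioned above can be sketched in textbook form as follows. This is generic illustration code, not the library's Sarsa implementation; the Q-values for the current state are assumed to be tabulated in a vector indexed by action.

    #include <cstdlib>
    #include <vector>

    // Textbook eps-greedy selection: with probability eps pick a random
    // applicable action (exploration), otherwise pick the action with the
    // highest current Q(s,a) estimate (exploitation).
    int eps_greedy (const std::vector<double>& qValues, double eps)
    {
        if ((double) std::rand() / RAND_MAX < eps)
            return (int) (std::rand() % qValues.size());
        int best = 0;
        for (int a = 1; a < (int) qValues.size(); ++a)
            if (qValues[a] > qValues[best])
                best = a;
        return best;
    }

    // For context, Sarsa's standard one-step tabular update of the chosen
    // [State,Action] pair is:
    //   Q(s,a) += alpha * (reward + gamma * Q(s',a') - Q(s,a))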
As stated, players have Observation & Action as template parameters. Different implementations of a player with the same Observation & Action types may have different internal player states. Those specific implementations may be created using a code generation process.
The Black Jack sample application provided with the library contains a data file for such a code generation process (see bjack_game/player2.rep). In this file we place the following definitions (an illustrative example follows the list below):
- Fields to be used in player internal state definitions. In the present library version all such fields are of enumeration types. So the format is: FieldName EnumTypeName
- State Types in the format: StateTypeName FieldName1{,...,FieldNameN}
- Transformation from Observation Type to State Type: TransName definition. This definition is provided as conventional C++ code in the definition file or inserted into the generated code.
- A Range defines a group of states as follows:
RangeName StateTypeName FieldName1[MinValue1..MaxValue1]{,...,FieldNameN[MinValueN..MaxValueN]}
(for every field in the state definition).
- Actions permitted for a range are specified in the following format:
ActPerRangeName RangeName Action
- In the present version only Markov policies can be generated automatically. Such a policy is defined as follows:
PolicyName StateTypeName ActPerRangeName1{,ActPerRangeName2,...,ActPerRangeNameN}
- Player Implementation is specified in the following format:
PlayerName StateTypeName TransName ActPerRangeName1{,ActPerRangeName2,...,ActPerRangeNameN}
where StateTypeName is the internal state definition of the player, TransName is the C++ code used to get the internal state from the current observation, and the list of actions per range specifies all actions which are valid for the different internal states.
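Putting these declaration kinds together, a definition file could contain entries roughly like the ones below. The labels on the left are only annotations (not part of the file), all names are invented for illustration, and the exact syntax, in particular how the C++ transformation code is embedded, should be taken from bjack_game/player2.rep:

    field:                  Value HandValue
    state type:             HandState Value
    transformation:         Hand2State <C++ code filling a HandState from an Observation>
    range:                  LowRange HandState Value[LOW..MID]
    actions per range:      LowActs LowRange HIT
    Markov policy:          LowPolicy HandState LowActs
    player implementation:  MyPlayer HandState Hand2State LowActs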
So, to use the library (see the bjack_game/bjack1.cc sample):
- Create a data file describing models for all kinds of RL players for the current application, which means: players with the Observation & Action types provided by the application.
- Set the environment variable TEMPLATE_DIR to templates and run the script scripts/gen_defs.pl using the data file as input to generate the code for the specific player implementations.
- In the application module using the player object, include the file
rl_headers/rl_player.h
- Get a pointer to the RL Players factory:
PlayerFactory* factory = RL_PlayerFactory::Instance();
- Get a new player using the following command (a combined sketch follows this list):
Player* player = factory->CreatePlayer(params);
where the params string contains all parameters needed for the player creation, in the following format:
- RL PlayerName AlgKind Lambda PolicyName, when the player uses some fixed policy. Possible AlgKind values are TD0, Q_Learning.
- RL PlayerName AlgKind Lambda Eps, when the player uses an eps-greedy policy. The possible AlgKind value is Sarsa.
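Putting the last steps together, the calling code could look roughly like this ("MyPlayer" is a placeholder for a player name defined in the data file, and 0.9 / 0.1 are example Lambda and Eps values):

    #include "rl_headers/rl_player.h"

    int main ()
    {
        // Get a pointer to the RL Players factory, as described above.
        PlayerFactory* factory = RL_PlayerFactory::Instance();

        // Create a Sarsa player with Lambda = 0.9 and eps-greedy Eps = 0.1.
        Player* player = factory->CreatePlayer("RL MyPlayer Sarsa 0.9 0.1");

        // ... drive the player through runs with start_run / do_step / end_run ...

        player->print_stats();
        delete player;
        return 0;
    }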
Currently the library is compiled on Linux & Sun OS. Migration to Windows is not finished yet because of differences in the C++ Standard Library & Perl implementations.