Let’s MIX together
This work is about learning in an environment whose exact description would require a model of enormous size.
The environment discussed in this article is the game MIX, described below. We will see the difficulties involved in trying to build learning models for the MIX game, and one suggested solution to the problem.
The MIX game is a generalization of the famous X-MIX-DRIX (tic-tac-toe) game.
Two players compete on a game board of 10 rows by 20 columns; one is the X player and the other is the O player, each placing his own symbol. The X player puts his symbol on the board, then the O player does the same, and so on. A symbol may be placed only in a feasible slot, meaning there can be no unoccupied slots between that symbol and the bottom row; in effect, a symbol drops to the lowest free slot of its column. A player can only put his symbol in a column that is not fully occupied, so there are at most 20 possible plays in a turn.
The winner of the game is the player who has a sequence of four consecutive symbols, either horizontally or vertically – but not diagonally.
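To make these rules concrete, here is a minimal sketch of a board representation in C; the names (ROWS, COLS, board, height, legal, play, undo, wins) are illustrative and not necessarily those used in mix.c:

#define ROWS 10
#define COLS 20
#define EMPTY '.'

char board[ROWS][COLS];          /* row 0 is the bottom row          */
int  height[COLS];               /* occupied slots in each column    */

/* A play is possible only in a column that is not full; the symbol
   occupies the lowest free slot, so no gaps can appear below it.   */
int  legal(int col)              { return height[col] < ROWS; }
void play(int col, char symbol)  { board[height[col]++][col] = symbol; }
void undo(int col)               { board[--height[col]][col] = EMPTY; }

/* Four consecutive symbols, horizontally or vertically only. */
int wins(char s) {
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c + 3 < COLS; c++)
            if (board[r][c] == s && board[r][c+1] == s &&
                board[r][c+2] == s && board[r][c+3] == s) return 1;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r + 3 < ROWS; r++)
            if (board[r][c] == s && board[r+1][c] == s &&
                board[r+2][c] == s && board[r+3][c] == s) return 1;
    return 0;
}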
The learning model that will be described now is based on the SARSA algorithm, implemented on an MDP of 130 states (actually far fewer, as will be explained later) and combined with a ‘future seeing’ ability.
An exact model of the MIX game, as described above, would contain 3**(10*20) = 3**200 states, since each of the 200 slots is either empty, X, or O. That is of course far too large a model to handle on any ordinary computer, not to mention the complexity involved in learning a value for each of those states.
So the first decision about the model was to limit it to a reasonable size, using several boolean parameters (bits) to identify the global state of the environment.
The MDP used in this work has 2**7 states, which are the combinations of seven boolean parameters (bits), plus two extra states: one indicating an X win and the other an O win.
The next issue was to decide which parameters would give the best information about the model. Determining the parameters that define a state of the MDP was the heart of the whole problem, and it is obviously where the program’s ‘wisdom’ lies. Many parameters were tried, by giving each to one of the players and not to the other, and it soon became clear that no ‘local’ parameter, indicating a specific sequence or trap on the board, would give enough information. The final parameters, described below, make sense in that they capture three global situations: a certain win, a win if the player gets the turn, and a potential for winning.
Here are the parameters that were used in the final version of the program:
The first parameter indicates whose turn it is – the X’s or the O’s.
The second and third bits indicate whether each player has a ‘mate’ sequence. A mate is the “_SSS_” sequence, where ‘S’ is the player’s symbol and ‘_’ is a feasible slot. A player who has a mate sequence will win the game when the turn is his, because the other player cannot block both open ends of the sequence in a single play.
The fourth and fifth bits indicate whether each player has a ‘chess’ sequence. A chess is a sequence of four slots, three of them occupied by the player’s symbol and the fourth an empty feasible slot. The possible chess sequences are either vertical, three of the player’s symbols stacked with an empty slot above, or the horizontal patterns ‘_SSS’, ‘S_SS’, ‘SS_S’, ‘SSS_’. A player who has a chess sequence will win the game if the turn is his.
The sixth and seventh bits indicate whether each player has a ‘potential chess’ sequence, where a potential chess is a chess sequence with a non-feasible slot in place of the feasible one. A player cannot win directly from this sequence, but it can become a chess sequence once that slot becomes feasible.
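As an illustration, the seven parameters can be packed into a single state index. The bit layout below simply follows the order of the list above; the actual layout in mix.c may differ:

/* illustrative bit layout, in the order the parameters were listed */
enum {
    BIT_TURN        = 1 << 0,    /* whose turn it is: X or O        */
    BIT_MATE_X      = 1 << 1,    /* X has a mate sequence _XXX_     */
    BIT_MATE_O      = 1 << 2,
    BIT_CHESS_X     = 1 << 3,    /* X has a chess sequence          */
    BIT_CHESS_O     = 1 << 4,
    BIT_POT_CHESS_X = 1 << 5,    /* X has a potential chess         */
    BIT_POT_CHESS_O = 1 << 6
};

#define N_STATES (128 + 2)       /* 2**7 combinations + two extras  */
#define X_WINS   128             /* fixed value  1                  */
#define O_WINS   129             /* fixed value -1                  */

double value[N_STATES];          /* one learned value per MDP state */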
One can see from the above definitions that many of the bit combinations are impossible: if the X player has a mate sequence, for example, he obviously has a chess sequence and a potential chess sequence as well. So the real number of states in the MDP (the possible bit combinations) is less than 60.
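Reading the implication in the text as “a mate implies both a chess and a potential chess”, a short loop over all 128 bit combinations (using the illustrative names above) confirms the bound:

int count = 0;
for (int s = 0; s < 128; s++) {
    /* a mate sequence implies a chess and a potential chess */
    if ((s & BIT_MATE_X) && (!(s & BIT_CHESS_X) || !(s & BIT_POT_CHESS_X)))
        continue;
    if ((s & BIT_MATE_O) && (!(s & BIT_CHESS_O) || !(s & BIT_POT_CHESS_O)))
        continue;
    count++;
}
/* count comes out at 50: five consistent settings per player
   (four with no mate, one with everything set), squared, times
   two for the turn bit. */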
The SARSA algorithm, implemented on the above MDP, attaches a value to each of the possible states of the MDP, except the two extra states, which have the fixed values 1 for the winning state and -1 for the losing state. During a game run, the algorithm gives a new value to each state the game passes through, except the winning/losing states. The new value is a weighted average of (1-α) of the current value and α of the next state’s value, where

α = 1/(run_number/(total_number_of_runs/10) + 1)

with integer division, so that α decreases stepwise from 1 toward 0, dropping to the next value once every tenth of the total runs.
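In code, the backup and the schedule might look as follows (n and N stand for run_number and total_number_of_runs; the integer division is what produces the stepwise decrease):

/* alpha holds each value for N/10 runs, stepping through
   1, 1/2, 1/3, ... as training progresses               */
double alpha = 1.0 / (n / (N / 10) + 1);

/* weighted-average backup for a visited state s with successor s_next;
   the X_WINS / O_WINS states keep their fixed values of 1 and -1     */
value[s] = (1.0 - alpha) * value[s] + alpha * value[s_next];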
The ε parameter, which affects the algorithm’s strategy, its balance between exploration and exploitation, is set to

ε = run_number/total_number_of_runs,

so ε grows linearly from 0 to 1. That is, from run to run the algorithm increasingly prefers using its existing knowledge over searching for new knowledge: on the first run it only explores for new data, and on the last run it acts purely according to the accumulated data.
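Note that this convention is the inverse of the usual ε-greedy one: here ε is the probability of exploiting. A sketch of the resulting action selection (random_legal_move is a hypothetical helper; best_move is sketched below):

#include <stdlib.h>   /* for rand() */

/* n = current run, N = total runs; choose a column for player me */
int choose_move(int n, int N, char me, char opp) {
    double eps = (double)n / N;          /* grows linearly from 0 to 1 */
    if ((double)rand() / RAND_MAX < eps)
        return best_move(me, opp);       /* exploit learned values     */
    return random_legal_move();          /* explore a random column    */
}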
The ‘future seeing’ ability mentioned before is the behavior of the algorithm when exploiting the accumulated data: it examines every possible state over the next two plays, i.e. every move of the player and every possible reaction of the other player, and the state with the highest value is chosen.
Combining this ability with the updating of the SARSA values has a very strong effect on the learning: when exploiting data, the algorithm checks all the possible states for the next two moves and uses them to update the relevant SARSA values. As a result, the state values are updated not only by the moves actually played but also during the exploitation of the data, and this makes the algorithm converge to optimal values much faster.
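A sketch of the two-ply search, read here as a minimax-style choice; state_of() is a hypothetical helper that inspects the board for the mate/chess/potential-chess patterns and returns the MDP state index (a full version would also check for an immediate win first):

int best_move(char me, char opp) {
    int    best_col = -1;
    double best_val = -1e9;
    for (int c = 0; c < COLS; c++) {     /* every move of the player  */
        if (!legal(c)) continue;
        play(c, me);
        double worst = 1e9;
        for (int r = 0; r < COLS; r++) { /* every opponent reaction   */
            if (!legal(r)) continue;
            play(r, opp);
            double v = value[state_of()];
            if (v < worst) worst = v;    /* opponent minimizes        */
            undo(r);
        }
        undo(c);
        if (worst > best_val) { best_val = worst; best_col = c; }
    }
    return best_col;
}

The values visited inside this loop are exactly the ones the SARSA backup above can be applied to, which is how exploitation itself accelerates the learning.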
The learning itself happens over 500 games, played by the algorithm for both players, and produces the data file ‘mix.init’, containing the value of each state of the MDP.
Afterwards, any human can play against the program, which uses those values to choose its plays (always exploiting knowledge).
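A plausible sketch of how the values might be written to and read back from mix.init; the plain-text format here is an assumption, not necessarily the one mix.c uses (error checking omitted):

#include <stdio.h>

void save_values(const char *path) {      /* e.g. "mix.init" */
    FILE *f = fopen(path, "w");
    for (int s = 0; s < N_STATES; s++)
        fprintf(f, "%f\n", value[s]);     /* one value per line */
    fclose(f);
}

void load_values(const char *path) {
    FILE *f = fopen(path, "r");
    for (int s = 0; s < N_STATES; s++)
        fscanf(f, "%lf", &value[s]);
    fclose(f);
}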
In its final version, the program is a very good opponent for anyone.
Actually, the only way to defeat it is to prepare a potential chess in the upper rows, forcing the program, when the occupation of the board slots leaves it no other option, to let that potential chess become a chess.
Of course, the limited model loses information. In particular, since the parameters describing the state are binary, the program cannot distinguish between one potential chess and many: it will be satisfied with a single potential chess and will not try to build others once it has one.
The small number of parameters, whatever its benefits in memory and complexity, is a problem in itself, because the situation definitions are quite coarse.
Another limitation is that the future seeing looks only two moves ahead. This makes the program quite naive, trying to maximize the state value immediately: whenever the program has a potential chess, it will try to turn it into a feasible threat because of the value involved, which of course lets the opponent block it immediately. On the other hand, looking more than two steps ahead brings little benefit as long as the parameters are few; only a more delicate state definition would enable the program to plan effective traps, and deeper search otherwise just increases the complexity.
As discussed above, the best way to improve the program is to make the state definitions more delicate. Then increasing the number of bits (the model size) and the ability to see farther than two moves ahead will enhance its performance.
Another way to make the model more sensitive is to replace the boolean parameters with accumulating counters, especially those representing the chess, so that the program will recognize a trap state in which two chess sequences arise together, and the potential chess, so that the program will have a motivation to build up many potential chess sequences. A simple neural network could implement this counter-parameter model.
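One way to picture the change: each boolean becomes a small, capped counter, so the state space stays bounded while double threats become visible. The struct below is only an illustration of the idea, not code from mix.c:

/* illustrative counter features replacing the seven booleans */
struct features {
    int turn;          /* 0 = X to play, 1 = O to play              */
    int mate[2];       /* effectively still boolean                 */
    int chess[2];      /* 0..3: two at once is an unstoppable trap  */
    int pot_chess[2];  /* 0..3: rewards building several potentials */
};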
Once the program is improved as suggested above, there will be a reason to let it learn from different human opponents and adopt some of their trap methods.
Another issue is the convergence to the optimal values. In the model described here, the final state values came out quite similar from run to run. When the model is made more complicated, this fine convergence may be lost. In such a case it would be possible to apply a genetic algorithm to playing agents, each one the product of a single run, in order to converge to optimal values.
One last suggestion is to change the game by defining a diagonal sequence as a winning sequence as well. This would make the environment much more complicated, although the general situation definitions described above should still be proper for the new game; only the chess and potential chess recognition would need updating.
As was shown, this work suggests a proper solution to a large-scale problem.
Given only minimal prior knowledge, the program was able to evaluate by itself each possible state of the model, and to use that learned information when playing afterwards.
Header file – mix.h
Program code – mix.c
Initialization file - mix.init (use it only if you want the program to skip the learning phase)