Diminishing the Size of the Searched MDP
by Utilizing Symmetries
Introduction
The MAXQ Value Function Decomposition
A New State Abstraction
Example: 2x2x2 Hungarian Cube
Summary
References
Introduction
This work is based on a paper by Thomas G. Dietterich [1], which presents an approach to hierarchical reinforcement learning of fully observable MDPs called the MAXQ value function decomposition.
Hierarchical reinforcement learning is a general name for several methods that reduce the policy space searched by a learning algorithm for a specific problem, by limiting the search to policies that satisfy constraints derived from a hierarchical description of the problem in terms of sub-goals.
The MAXQ value function decomposition is one such method. One of its advantages is that it allows the use of state abstractions, which make it possible to ignore aspects of a state that are irrelevant to a given subtask. This work suggests a different kind of state abstraction not treated in the paper. The new state abstraction is intended to exploit symmetries that exist in a specific problem, and can diminish the size of the learned MDP by an order of magnitude or more.
What follows is a brief introduction to the relevant topics presented in the MAXQ paper. Afterwards, the new state abstraction is described and defined formally, and an example of its application to the problem of the Hungarian Cube is given.
The MAXQ Value Function Decomposition

The MAXQ paper [1] presents an approach to hierarchical reinforcement learning that is based on a programmer identifying sub-goals in the original problem's MDP and defining subtasks that can achieve these goals.
This definition of sub-goals is used to decompose the original problem's MDP into smaller MDPs that are organized hierarchically: each smaller MDP's actions can be either primitive actions of the original MDP or non-primitive actions, where a non-primitive action is the execution of another small MDP. Each execution of a small MDP ends when a terminating state is reached.
Using this organization, the value function of the original MDP can be decomposed into an additive combination of the value functions of the smaller MDPs. The value function of any policy that is consistent with the given hierarchy can be represented using this method, named the MAXQ value function decomposition.
The MAXQ paper presents an online, model-free learning algorithm for the MAXQ hierarchy, called MAXQ-Q, which converges with probability 1 to a recursively optimal policy. The recursively optimal policy is the policy in which every sub-MDP follows its optimal policy given that the sub-MDPs it can use as actions execute their own recursively optimal policies. This policy is not always the optimal hierarchical policy, since each sub-MDP learns its policy without regard to the context of its parent MDPs. The paper considers two reward settings: episodic and discounted infinite horizon.
Some formal definitions and equalities used in the paper to describe the calculations of the MAXQ-Q algorithm are used later in this work:
Since a non-primitive action can take more than one step to execute, which affects the reward in a discounted infinite-horizon problem, the transition probability function also incorporates the probability of completing the action a in a specific number of steps N: P_i^π(s', N | s, a). The transition probabilities depend on the hierarchical policy π, which is executed for the non-primitive actions.
The completion function C^π(i, s, a) is the expected discounted cumulative reward of completing subtask M_i after invoking the subroutine for subtask M_a in state s. The reward is discounted back to the point in time at which a begins execution.
A new function defined for a MAXQ graph is the pseudo-reward function R̃_i(s'). This function allows the programmer to define different rewards for different terminating states of a subtask, without changing the original MDP.
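For reference, these quantities fit together as follows. The relations below are a summary written in the notation of [1] (π_i denotes the policy of subtask i); they are quoted here only as background, not as the numbered equations of this work:

```latex
% Summary of the MAXQ decomposition, following the notation of [1].
\begin{align*}
  Q^{\pi}(i, s, a) &= V^{\pi}(a, s) + C^{\pi}(i, s, a) \\
  V^{\pi}(i, s) &=
    \begin{cases}
      Q^{\pi}\bigl(i, s, \pi_i(s)\bigr) & \text{if } i \text{ is composite} \\
      \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is primitive}
    \end{cases} \\
  C^{\pi}(i, s, a) &= \sum_{s', N} P_i^{\pi}(s', N \mid s, a)\; \gamma^{N}\;
                      Q^{\pi}\bigl(i, s', \pi_i(s')\bigr)
\end{align*}
```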
An important benefit of using the MAXQ method is that it allows the use of state abstractions. State abstractions are ways to let subtasks ignore large parts of the state space, and consequently to diminish substantially the actual size of the problem to be solved by the MAXQ-Q algorithm. The paper presents five different conditions that allow introducing state abstractions without affecting the convergence of the learning algorithm. Each condition can be applied independently of the others, and defines a set of states, or of values stored by the algorithm, that can be ignored or merged. An example of such an abstraction is ignoring, within a certain subtask, dimensions of the state that are irrelevant for that subtask.
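As an illustration of this last kind of abstraction, here is a minimal sketch with hypothetical state variables (not taken from [1]): a subtask that only navigates can ignore every state dimension except the agent's position.

```python
from typing import NamedTuple, Tuple

class FullState(NamedTuple):
    """Hypothetical full state of an original MDP."""
    position: Tuple[int, int]     # agent location on a grid
    cargo: int                    # what the agent currently carries
    destination: Tuple[int, int]  # where the cargo must go

def navigate_view(s: FullState) -> Tuple[int, int]:
    """State abstraction for a 'navigate' subtask: only the position is
    relevant, so all full states that agree on it share one entry in the
    subtask's value and completion tables."""
    return s.position
```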
A New State Abstraction

This work suggests a new kind of state abstraction. The motivation is to exploit symmetries that exist in many problems in order to diminish the state space that has to be searched by a learning algorithm. When several states in a problem are symmetric to each other, it will be shown that it is possible to find the optimal policy using only one representative state for each group of symmetric states, where the representative states together form a new, smaller MDP. This means that in a problem with several symmetries, the size of the MDP to be explored can be a small fraction of the size of the original MDP.
Examples of symmetries can be found in many games. Two-dimensional and three-dimensional Tic-Tac-Toe have a spatial symmetry: a state of the game and its mirror image along any of the game board's symmetry axes can be treated as exactly the same state, with the exception that any action on the state should also be mirrored to give the corresponding action for the mirrored state.
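A minimal sketch of this idea for 3x3 Tic-Tac-Toe, assuming a board stored as a tuple of 9 cells in row-major order: the representative state is the lexicographically smallest board over the eight reflections and rotations of the square, and an action (a cell index) is mapped through the same transformation. The particular orientation chosen for the action mapping is only one possible convention.

```python
from itertools import product

N = 3
CELLS = list(product(range(N), range(N)))  # (row, col) pairs, row-major

def rot90(rc):
    """Rotate a (row, col) coordinate by a quarter turn."""
    r, c = rc
    return (c, N - 1 - r)

def mirror(rc):
    """Reflect a (row, col) coordinate along the vertical axis."""
    r, c = rc
    return (r, N - 1 - c)

def compose(f, g):
    return lambda rc: f(g(rc))

# The 8 symmetries of the square: 4 rotations, each optionally mirrored.
SYMMETRIES = []
_t = lambda rc: rc
for _ in range(4):
    SYMMETRIES.append(_t)
    SYMMETRIES.append(compose(mirror, _t))
    _t = compose(rot90, _t)

def transform(board, sym):
    """Apply a symmetry to a board stored as a tuple of 9 cells (row-major)."""
    out = [None] * (N * N)
    for i, rc in enumerate(CELLS):
        r2, c2 = sym(rc)
        out[r2 * N + c2] = board[i]
    return tuple(out)

def canonical_symmetry(board):
    """The symmetry that maps `board` to its smallest equivalent board."""
    return min(SYMMETRIES, key=lambda sym: transform(board, sym))

def T(board):
    """Representative state: lexicographic minimum over the 8 symmetries."""
    return transform(board, canonical_symmetry(board))

def L(board, cell_index):
    """Map a move (cell index) on `board` to the corresponding move on
    T(board) -- one possible convention for the action permutation."""
    sym = canonical_symmetry(board)
    r2, c2 = sym(CELLS[cell_index])
    return r2 * N + c2
```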
Another kind of symmetry that doesn’t involve axes of symmetry can be found in the maze problem presented by Parr [2] and referenced in the MAXQ paper. The maze is presented in figure 1.
In this problem, the agent starts in the upper left corner and has to reach the lower right corner. The obstacles on the way must be detoured around. Both the MAXQ paper and Parr's paper define the detouring of a single obstacle as a subtask that can be used by a parent task of moving from intersection to intersection. In these papers there is no room for searching for the best detouring policy, since it is given with no degrees of freedom. But consider a problem in which the best policy for detouring an obstacle has to be learned. Since all the obstacles have the same structure (with an inner division into obstacles with one open side and obstacles with two open sides), the best detouring policy can be learned on all of them equally and simultaneously, if we 'rotate' the actions according to the relative rotation of the obstacle.
The most dramatic effect of using the symmetries of a problem can be seen in the Hungarian Cube, where viewing different orientations and different color schemes of the cube as symmetric reduces the number of states reachable from the starting state by orders of magnitude.
The following definitions formalize the concept of state symmetry, in a way that will enable learning the optimal policy for the original problem MDP over a diminished MDP. The formalization and proofs will use the MAXQ value function decomposition and hierarchy for more generality, but they are, of course, also valid for a regular MDP, which can be viewed as a one-level hierarchy.
In this formalization, the representative state of each group of states is defined by a function T from the set of states of the specific subtask into the same set of states. The conversion of actions that is necessary for some of the symmetries is defined by a function L that assigns to each state a permutation of the actions allowed in the subtask.
To achieve the reduction in the number of states, the set of representative states has to be a proper subset of the set of states of the subtask. While this is not needed for the definitions and proofs, it is, in practice, the most important feature of the representation function.
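In code, T and L could be represented roughly as follows. This is a hypothetical sketch of the signatures only; the concrete State and Action types depend on the problem.

```python
from typing import Callable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

# T: maps every state of the subtask to the representative of its
# equivalence class; the image of T is the (smaller) set of states that
# the learning algorithm actually has to distinguish.
RepresentationFn = Callable[[State], State]

# L: for each state, a permutation of the subtask's actions, translating
# actions between a state and its representative.
ActionPermutationFn = Callable[[State, Action], Action]
```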
Definition 2
Obviously, any policy that is abstract with respect to a state symmetry (T, L) for some node i, is also abstract with respect to all the state symmetries of the descendants of node i, that result from narrowing the domains of T and L.
Lemma 2

Let M be an MDP with a full-state MDP graph, and let (T, L) be a state symmetry for node i in a MAXQ graph H of M. Let π be a hierarchical policy that is abstract with respect to (T, L). Then the action-value function at node i, for all the states that are equivalent under (T, L), can be represented compactly, with only one value of the value function and of the completion function for each equivalence class of states in Si. The action-value function can be computed as follows:
Proof:
The policy π is well defined over sym(Mi), by considering its decisions for states in sym(Si) only.
The action-value function of π over sym(Mi) is the unique solution to the following Bellman equation, for any state in sym(Si):
The Bellman equation for the original MDP, Mi, for any state in Si, is:
Let w = T(s). Because (T,L) is a state symmetry for node i, we can use equation 2, equation 3 and equation 4, and with the assumptions stated above for w, the following equalities hold:
Using these, equation 6 becomes:
The summation over the states can be rearranged as a summation of summations over each equivalence class of states in Si to give:
We want to find the solution to the Bellman equations for the original MDP Mi, presented in equation 7, by referring to the solution for sym(Mi). Since the solution to the Bellman equations is unique, we only need to describe a set of values that solves the set of equations.
First, let us assume that for every pair of states in Si that are equivalent under (T, L), and for every pair of actions that correspond under L, the associated values are equal. It will be shown that the given solution is consistent with this assumption.
Since π is an abstract policy with respect to (T, L), for any pair of states that are equivalent under (T, L), the actions it chooses correspond under L. Hence, following our assumption, the associated values are equal. We can now rewrite equation 7 to get:
Since the bracketed expression does not depend on the individual state within each equivalence class, the order of summation can be rearranged to give:
Rewriting equation 2 for w and s gives:
With this and with our definitions for sym(Mi), equation 8 becomes:
Specifically, this equality is also true for s = w. Since L(w, a) = a for representative states, this gives, over T(Si), exactly the set of equations presented by equation 5 for sym(Mi). Therefore, for every representative state and every action, we can let the value for Mi be equal to the corresponding value for sym(Mi). In addition, for every remaining state, its value depends only on the value of its representative, so equation 9 for such a state is a simple assignment. This completes the solution of the Bellman equations for Mi. Note that this solution does in fact follow our assumption: for every pair of states that are equivalent under (T, L), together with their corresponding actions under L, equation 9 is exactly the same, and therefore the values are equal.
To conclude, we have shown the following relation between the values for node i:
This can be rewritten to give:
Q.E.D.
Having shown that any abstract policy can be represented compactly, we would like to show that the optimal policy at node i is also abstract with respect to (T, L).
Corollary 1
Proof:
Q.E.D.
The above lemma and corollary show that we can save space by representing the completion function, value function and action-value function only for a subset of Si, namely the image of T, and use these values to recover the values for all the states in Si, either when evaluating a policy or when finding the optimal policy. The next corollary will show that we can not only save representation space, but also diminish the effective size of the MDP at node i to the size of the image of T.
Corollary 2
Under the same conditions as in Lemma 2, the MDP sym(Mi) can be simulated by running the original MDP Mi and adding a simple interface defined by the functions T and L.
Proof:
To simulate sym(Mi) using Mi, the following interface will be added:
We have shown that the algorithm, using the described interface, runs over a simulation of sym(Mi).
Q.E.D.
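A minimal sketch of what such an interface could look like in code, assuming a simple environment protocol (reset() returning a state, step(action) returning a state, a reward, and a termination flag) and the hypothetical helper action_for_original_state, which realizes one of the two possible orientations of the permutation L:

```python
class SymmetryInterface:
    """Interface layer between a learning agent and an environment for the
    original MDP Mi: the agent only ever observes representative states
    T(s), and actions it chooses for T(s) are translated back to actions
    on the underlying state s before being executed."""

    def __init__(self, env, T, action_for_original_state):
        # T(s): representative of s's equivalence class.
        # action_for_original_state(s, a): the action on s that corresponds
        # to action a on T(s) (hypothetical helper derived from L).
        self.env = env
        self.T = T
        self.action_for_original_state = action_for_original_state
        self._s = None

    def reset(self):
        self._s = self.env.reset()
        return self.T(self._s)

    def step(self, abstract_action):
        concrete_action = self.action_for_original_state(self._s, abstract_action)
        self._s, reward, done = self.env.step(concrete_action)
        return self.T(self._s), reward, done
```

From the agent's point of view, the wrapped environment is a simulation of sym(Mi): it only ever sees one representative per equivalence class.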
Example: 2x2x2 Hungarian Cube

In this section, the example of a 2x2x2 Hungarian Cube will be used to demonstrate the application of the definition of a state symmetry. This simplification of the 3x3x3 Cube is used to shorten the description, but the construction can easily be applied to the full cube as well. The following analysis of the configurations of the Hungarian Cube is partly based on [3].
The goal is to find, for a specific start state, the shortest path to a terminating state: this state can be any of the configurations in which on every face of the cube, all the squares are the same color. Therefore, the reward scheme will be discounted infinite horizon, with a reward of 1 for reaching a terminal state.
The number of different configurations reachable from the starting state of the 2x2x2 Cube is 8!·3^7 = 88,179,840. Each configuration can be represented by the color of each of the 24 squares on the faces of the cube. The order of the squares in the representation can be chosen arbitrarily. From now on, this order will be used to describe the different squares in a state.
The possible actions on the cube are rotating one of the halves of the cube around its central axis, clockwise or counter-clockwise. This gives 12 different possible actions. An action can be represented by the rotation axis, the half to be moved, and the direction of rotation (clockwise or counter-clockwise).
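A possible concrete encoding of states and actions is sketched below. The names are hypothetical; as noted above, any fixed ordering of the 24 squares works.

```python
from typing import NamedTuple, Tuple

# A state: the color of each of the 24 squares, in the fixed arbitrary order
# chosen above (colors encoded, e.g., as integers 0..5).
CubeState = Tuple[int, ...]

class CubeAction(NamedTuple):
    axis: str        # 'X', 'Y' or 'Z': the rotation axis
    half: int        # 0 or 1: which half of the cube to rotate
    clockwise: bool  # direction of rotation

ALL_ACTIONS = [CubeAction(axis, half, clockwise)
               for axis in ('X', 'Y', 'Z')
               for half in (0, 1)
               for clockwise in (True, False)]   # 3 * 2 * 2 = 12 actions
```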
The state symmetry that will be defined exploits both the spatial symmetry and the symmetry between the different colors.
First, we will define an ordering over the states that does not differentiate between states that differ only in their color assignments.
To compare two states, their representations are first converted into the same color scheme: for each state, the color of the first square is replaced by an arbitrary color '1'. The next new color found when going over the squares in order is replaced by a color '2', and so forth.
Second, the new representations are compared lexicographically: the first square with a different 'color' determines which of the states is greater. If, in the new color scheme, both states are exactly the same, they are equal.
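A sketch of this canonical relabeling and comparison, using the CubeState encoding assumed above:

```python
def relabel_colors(state: CubeState) -> CubeState:
    """Replace each color by the order of its first appearance, so that two
    states differing only by a permutation of colors map to the same
    representation."""
    mapping = {}
    out = []
    for color in state:
        if color not in mapping:
            mapping[color] = len(mapping) + 1   # '1', '2', ... as integers
        out.append(mapping[color])
    return tuple(out)

def color_blind_less_than(s1: CubeState, s2: CubeState) -> bool:
    """Lexicographic comparison of the relabeled representations."""
    return relabel_colors(s1) < relabel_colors(s2)
```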
Now, we can define the state symmetry:
To find T(s), we compute the minimal state, s', in the set of states that can be reached by composing any subset of the following transformations on s. Each row of the table also lists the corresponding conversion of actions used for L; a code sketch follows the table.
Transformation for T(s) | Corresponding action conversion for L
--- | ---
Mirroring the X axis | For a = (X-axis, half, direction), invert the half and the direction
Mirroring the Y axis | For a = (Y-axis, half, direction), invert the half and the direction
Mirroring the Z axis | For a = (Z-axis, half, direction), invert the half and the direction
Permuting the X axis and the Y axis | Invert the direction of rotation of every a; for a = (X-axis, half, direction) and a = (Y-axis, half, direction), replace X-axis by Y-axis and vice versa
Permuting the Y axis and the Z axis | Invert the direction of rotation of every a; for a = (Y-axis, half, direction) and a = (Z-axis, half, direction), replace Y-axis by Z-axis and vice versa
Permuting the Z axis and the X axis | Invert the direction of rotation of every a; for a = (Z-axis, half, direction) and a = (X-axis, half, direction), replace Z-axis by X-axis and vice versa
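Building on the sketches above, T (and the matching action conversion used for L) could be computed as follows. The generators are hypothetical placeholders: each table row would have to be implemented as a pair of functions, one permuting the 24 squares of a CubeState and one converting a CubeAction as described in the right-hand column.

```python
from itertools import combinations

# Hypothetical generators, one (state_fn, action_fn) pair per row of the
# table above.
GENERATORS = [
    # (mirror_x_state, mirror_x_action),
    # (mirror_y_state, mirror_y_action),
    # ... the six rows of the table
]

def candidate_transforms():
    """All compositions of subsets of the generators (the empty subset is
    the identity), following the definition of T above."""
    for r in range(len(GENERATORS) + 1):
        for subset in combinations(GENERATORS, r):
            def state_fn(s, subset=subset):
                for sf, _ in subset:
                    s = sf(s)
                return s
            def action_fn(a, subset=subset):
                for _, af in subset:
                    a = af(a)
                return a
            yield state_fn, action_fn

def T_and_L(s):
    """Return T(s) -- the minimal reachable state under the color-blind
    ordering, relabeled to the canonical color scheme -- together with the
    action conversion of the minimizing transformation, which plays the
    role of L(s, .)."""
    best_state, best_action_fn = None, None
    for state_fn, action_fn in candidate_transforms():
        candidate = relabel_colors(state_fn(s))   # same color scheme for all
        if best_state is None or candidate < best_state:
            best_state, best_action_fn = candidate, action_fn
    return best_state, best_action_fn
```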
It can be seen from the definitions of T and L that whenever s1 and s2 are symmetric, the corresponding actions defined by L take the pair of states into two states that are also symmetric. By the definition of T, all the states that are symmetric to each other are taken by T to the same state, because T always chooses the orientation that gives the minimal state out of all the possible symmetric configurations, and always gives the same color scheme. Therefore, condition (2.1) holds.
For condition (2.2), we need to show that for any pair of states that are symmetric under (T, L), and for any pair of corresponding actions, the values are equal. Using the definition of the value function in equation 1, since the actions at this level are always primitive, and since the reward in this problem is independent of the starting state and of the action, we get:
Therefore, by the same argument made for condition 2.1, and since the terminating states form an equivalence class with respect to (T,L), the equality holds.
Condition 2.3 is trivially satisfied, since we have no pseudo-reward in this problem.
We have shown that (T, L) is a state symmetry. This symmetry can be used to implement an interface to a learning agent as described in the proof of Corollary 2. The size of the biggest equivalence class of states in the original MDP is the number of different reachable orientations times the number of different reachable color schemes, which is 576. (The equivalence class is smaller for states that are symmetric to themselves.) The size of the searched MDP can therefore be diminished substantially, by a factor of up to 576.
Summary

This work shows that in a problem that has inner symmetries, these symmetries can be exploited to speed up the learning algorithm, using an interface between the learning agent and the environment. The discovery and formalization of the symmetries in a specific problem have to be done by hand. Nonetheless, in some problems this effort can turn an unsolvable problem into a solvable one. With more work on this subject, it is likely that the state symmetries of groups of problems with similar structure can be defined in general terms with few free parameters. An example of such a group of similar problems is the group of problems with the planar or spatial symmetry of a square or a cube.
References

[1] Dietterich, T. G. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13, 227-303.
[2] Parr, R. & Russell, S. (1998). Reinforcement Learning with Hierarchies of Machines. Computer Science Division, UC Berkeley.
[3] Hofstadter, D. R. (1985). Metamagical Themas: Questing for the Essence of Mind and Pattern, pages 301-366. Basic Books, New York.