A MULTI-AGENT APPROACH TO POMDPS USING OFF-POLICY REINFORCEMENT LEARNING AND GENETIC ALGORITHMS

This paper introduces novel concepts for accelerating learning in an off-policy reinforcement learning algorithm for Partially Observable Markov Decision Processes (POMDPs) by leveraging a multi-agent framework. Reinforcement learning (RL) is a slow but elegant approach to learning in an unknown environment. Although the action-value method (Q-learning) is faster than the state-value method, the rate of convergence to an optimal policy or maximum cumulative reward remains a constraint. Consequently, in an attempt to optimize the learning phase of an RL problem within a POMDP environment, we present two multi-agent learning paradigms: multi-agent off-policy reinforcement learning, and an ingenious genetic algorithm (GA) approach to multi-agent offline learning using feedforward neural networks. At the end of training (episodes and epochs for reinforcement learning and the genetic algorithm, respectively), we compare the convergence rates of both algorithms with respect to creating the underlying MDPs for POMDP problems. Finally, we demonstrate the impact of layered resampling of the Monte Carlo particle filter for improving belief-state estimation accuracy with respect to ground truth within POMDP domains. Initial empirical results suggest practicable solutions.


INTRODUCTION
Recent advances in the field of artificial intelligence (AI) have unveiled a wide range of efficient algorithms [1] which, if skillfully hybridized, could result in plausible models for solving some of the problems in the field.
Since the introduction of the value iteration algorithm for planning [2][3][4][5] in the 1970s, it has undergone several refinements by numerous authors attempting to adapt it to more complex real-world problems. The combinatorial explosion of linear components in the value function (also referred to as the curse of dimensionality in some literature sources) is one of the major reasons that POMDPs are impractical for most applications [6][7][8]. Another related problem with value iteration is the exponential growth of distinct action-observation histories (also referred to as the curse of history). Some ingenious pruning methods have been used to ameliorate the problem, but these pruning methods are themselves computationally expensive to implement and only work for small finite-horizon problems [9,10]. Better strategies have been implemented, such as PBVI (Point-Based Value Iteration) [3], which iteratively updates a subset of representative belief points. Another promising method, implemented for both discrete and continuous belief states, is the MCMDP (Monte Carlo Markov Decision Process) approach [6,11]. This method attempts to map POMDPs directly to their underlying MDPs using a Bayes particle filter for belief updates.
On a parallel front, reinforcement learning (RL) is a slow but elegant approach to learning in an unknown environment. Although the action-value method (Q-learning) is faster than the state-value method, the rate of convergence to an optimal policy or maximum cumulative reward remains a constraint. However, RL has the advantage of learning an underlying MDP for both dynamic and stochastic environments [12][13][14].
In this paper, the authors investigate, via experiments, the contributions of multiple agents to accelerating the rate (thereby shortening the duration) at which the utilities converge to an optimal policy for planning within POMDP environments. The agents leverage a greedy strategy for online exploration-exploitation using an off-policy, model-free algorithm [15]. We then compare this multi-agent RL model with an ingenious multi-agent framework equipped with a feedforward neural network that is optimized offline via an objective function (based on localization of the goal and absorbing nodes) using a genetic algorithm. We then identify the promises and constraints of both paradigms and propose future recommendations.
Furthermore, because every POMDP can be mapped directly to its underlying MDP, we examine how an agent armed with a single range sensor could minimize the margin of error between its belief state and its actual state via an ingenious resampling algorithm for the Monte Carlo particle filter [16][17][18][19]. The rationale is to unveil (in the event of the failure of one or more sensors) a cost-saving and relatively efficient approach to robot localization in POMDP environments. Empirical results show that this simple procedure quickly filters out outliers responsible for large errors in the initial approximate belief of an agent's state.

HYBRID GENETIC ALGORITHMS
Evolutionary computation is a field that includes genetic algorithms and genetic programming, along with evolutionary techniques that capture the entire process of selection and mutation [20][21][22]. The biological model of natural selection and genetics forms the basis on which these computational techniques are implemented. A class of 'random search algorithms' with theory firmly embedded in biological models of selection and evolution is referred to as genetic algorithms (GAs). Given a clearly defined problem to be solved, a basic GA can be represented as a set of bit strings (chromosomes), each of which can be decoded to represent a solution to the problem. Each chromosome is tested to see how good it is at solving the problem by assigning it a fitness score [23,24]. The probability of a chromosome being selected is proportional to its fitness: the higher the fitness score, the better the probability of the chromosome being selected. A popular method of selection is roulette wheel selection, sketched below. This iterative process unveils an ingenious paradigm for optimal path creation, maximizing the coverage of the search space when solving the multi-source/target problem.
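As an illustration, the following is a minimal sketch of one common implementation of roulette wheel selection; the function name and the use of Python's random module are our own choices, not taken from the paper.

```python
import random

def roulette_wheel_select(population, fitnesses):
    """Select one chromosome with probability proportional to its fitness.

    A minimal sketch; assumes non-negative fitness scores.
    """
    total = sum(fitnesses)
    # Spin the wheel: pick a point on the cumulative fitness line.
    pick = random.uniform(0, total)
    cumulative = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        cumulative += fitness
        if cumulative >= pick:
            return chromosome
    return population[-1]  # guard against floating-point round-off
```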
An evolutionary neural network (ENN) is a hybridization of two powerful AI algorithms: the genetic algorithm and the artificial neural network [25][26][27]. Both are biologically inspired and, when combined, are often designed as feedforward ENNs. This combination is achieved by evolving the weights of a fixed neural network while providing the network with a set of inputs [28][29][30].

REINFORCEMENT LEARNING
Reinforcement learning is the science of sequential decision making. For grid world agents, it is characterized by an agent's ability to maximize long-term rewards by leveraging past experiences obtained via interaction with a stochastic environment. Because the environment is initially unknown to the agent, the agent has to surmount the challenge of handling the delicate balance between exploring and exploiting the environment while maximizing the expected long-term reward. Consequently, RL agents usually combine online learning and planning simultaneously via policy optimization [31,32].
The utilities of each state in RL are often referred to as the state-value function. Analogous to the state-value function is the action-value function, often referred to as the Q-value function. The process of learning with Q-value functions is referred to as Q-learning [33], whose update rule is:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $Q(s_t, a_t)$ is the current value of the state-action pair under a specific action policy $\pi$; $r_{t+1}$ is the received reward; $\max_a Q(s_{t+1}, a)$ is the maximum Q-value of the subsequent state; $\alpha$ is the learning rate and $\gamma$ is the discount factor.
In this paper, we adopt Q-learning for our implementation because it learns considerably faster than learning the state-value function. However, the reinforcement learning process is generally slow. Consequently, we attempt to accelerate the learning phase via the introduction of multiple agents.
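For concreteness, the following minimal tabular sketch shows the Q-learning backup and an ε-greedy action selector. Sharing one Q-table among the agents is our reading of how multiple agents could accelerate convergence; all names and hyperparameter values here are illustrative assumptions, not the paper's specification.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1  # illustrative hyperparameters

def q_update(Q, s, a, reward, s_next, actions):
    """One off-policy (Q-learning) backup for a single agent."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions):
    """Greedy exploration-exploitation: mostly exploit, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Multiple agents can share one Q-table so that each agent's experience
# accelerates the others' convergence (our assumed reading of the setup).
Q = defaultdict(float)
```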

MDP AND POMDPs
MDPs have a reputation for robotic navigation in a known environment. The environment is assumed to be Markovian (i.e., the effect of an action depends stochastically only on the current state of the world and the executed action). Because the resulting state of an action is not deterministic, the subsequent state of the agent may be unintended. Amidst the stochasticity, the robot must navigate from its current location to a goal location in the minimum possible number of steps. Thus, MDPs create a policy for every possible node in a grid world that is fully observable and stochastic [3,6,11]. MDPs are usually defined as a tuple < S, A, T, R > where: S is the set of environment states, which must encapsulate all relevant information for taking correct decisions, e.g., the map, the exact location within the map, and the state of the world (open or closed door).
A is the set of all actions that the agent can execute; a simplified example would be UP, DOWN, LEFT, RIGHT. T is the stochastic transition function $T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$, the probability of executing an action $a$ from state $s$ at time $t$ and arriving at state $s'$ at time $t+1$. R is the reward function, which models the utility of the current state as well as the cost of taking a particular action, $R(s, a)$. A negative living reward (non-zero cost) is usually associated with grid world implementations.
In this paper, our simulation is based on planning problems with a finite and discrete state and action space. The purpose of planning is to find a policy $\pi$ (a set of optimal actions) that describes the agent's behavior in order to maximize the sum of expected discounted rewards, where $\gamma$ is the discount factor, $0 \leq \gamma < 1$, as $t$ tends towards infinity. This keeps the solution bounded. However, since our horizon is finite (i.e., has an absorbing or goal state), we set $\gamma = 1$. For every state $s$, we can compute a utility function with the following equation:

$$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t)\right]$$

The optimal utility for each state is given by the Bellman equation

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')$$

The optimal policy is given by the equation

$$\pi^{*}(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\, U(s')$$

In real-world domains, most of the assumptions behind the implementation of MDPs fall apart because the agent cannot directly observe the state of the environment. POMDPs give us a more efficient alternative for modeling real-world problems via a probability distribution over states, also referred to as a belief state. This is because the actual state of the world cannot be fully observed due to inaccurate sensor readings. Alternatively, in POMDP environments, beliefs provide a sufficient statistic for the history, thereby availing sufficient information for the optimal policy per state, with the assumption that the underlying MDP is also Markovian [3].
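A minimal sketch of value iteration and policy extraction under these equations follows; the function signatures (T and R passed in as callables) are our own assumptions for illustration.

```python
def value_iteration(states, actions, T, R, gamma=1.0, tol=1e-6):
    """Repeated Bellman backups until the utilities converge.

    T(s, a) -> list of (s_next, prob); R(s) -> immediate reward.
    gamma=1 is safe only for finite-horizon problems with absorbing
    or goal states, as in the paper's grid worlds.
    """
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * U[s2] for s2, p in T(s, a)) for a in actions)
            new_u = R(s) + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U

def optimal_policy(states, actions, T, U):
    """Extract pi*(s) = argmax_a sum_s' T(s, a, s') U(s')."""
    return {s: max(actions, key=lambda a: sum(p * U[s2] for s2, p in T(s, a)))
            for s in states}
```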
POMDPs can therefore be defined as a belief-space MDP with the tuple < B, A, T, R_B > such that: -B is the set of belief states, i.e., probability distributions over the states S; -A is the set of possible actions; -T is the belief transition function T(b, a, b'_o), representing the probability of starting at belief b, taking an action a, and arriving at a new belief state b'_o;
-R_B is the reward at each belief state.
Just like the MDP model, we define the Bellman update operator [9] for the belief-space MDP (POMDP) as

$$U(b) = \max_{a} \Big[ R_B(b, a) + \gamma \sum_{o} P(o \mid b, a)\, U(b'_o) \Big]$$

Consequently, like MDPs, the goal of POMDPs is to find the policy $\pi^{*}(b)$ for action selection that maximizes the reward.

PARTICLE FILTERS ALGORITHM
The particle filter is an elegant algorithm with the potential of mapping trajectory history into belief states, which consequently aids agents in learning a mapping from belief states to actions in POMDPs [34][35][36]. Particle filters are implementations of recursive Bayesian filtering used for modeling non-Gaussian distributions [37,38]. Using the motion and sensor observation models, the algorithm iteratively updates the belief states via a sequence of prediction and correction steps, usually referred to as belief updates [39,40].
The prediction step is given by

$$\overline{bel}(x_t) = \int p(x_t \mid u_t, x_{t-1})\, bel(x_{t-1})\, dx_{t-1}$$

while the correction step is given by

$$bel(x_t) = \eta\, p(z_t \mid x_t)\, \overline{bel}(x_t)$$

Combining both equations, we get the Bayes particle filter equation as follows:

$$bel(x_t) = \eta\, p(z_t \mid x_t) \int p(x_t \mid u_t, x_{t-1})\, bel(x_{t-1})\, dx_{t-1}$$

where $\eta$ is the normalization factor, $bel(x_t)$ is the belief of being in state $x_t$ at time $t$, $p(z_t \mid x_t)$ is the probability of sensing $z_t$ given a state location at time $t$, and $u_t$ is the action or motion at time $t$.
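A minimal sketch of these prediction, correction, and roulette-wheel resampling steps is given below; the motion_model and sensor_model callables are hypothetical placeholders for the models the text describes.

```python
import random

def predict(particles, action, motion_model):
    """Prediction step: propagate each particle through the stochastic
    motion model p(x_t | u_t, x_{t-1})."""
    return [motion_model(p, action) for p in particles]

def correct(particles, observation, sensor_model):
    """Correction step: weight particles by the sensor likelihood
    p(z_t | x_t), then normalize (the factor eta in the text)."""
    weights = [sensor_model(observation, p) for p in particles]
    eta = sum(weights) or 1e-12
    return [w / eta for w in weights]

def resample(particles, weights):
    """Roulette-wheel (importance) resampling of the particle set."""
    return random.choices(particles, weights=weights, k=len(particles))
```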

EXPERIMENTAL SETUP
The experiments can be divided into two subsections: Section A and Section B.

Section A
In this section (Section A), we show how the multi-agent Q-learning RL algorithm [41][42][43][44][45] converges quickly when compared with a single off-policy agent. It is important to note that the learning algorithm creates an underlying MDP model for the grid world (Fig. 1) at convergence.
The first simulation had a single RL agent in a 30 × 20 grid world (Fig. 2). In the second simulation (Fig. 3), three more agents were added to the single agent. In a deliberate attempt to investigate the significance of the addition of a single agent, we ran a third simulation with 5 agents (Fig. 4). The results show a significant difference in the convergence rate. It is interesting to note that the multi-agents displayed some emergent behaviors (outside the scope of this research) during the online training process while the algorithm migrated towards convergence.

GENETIC ALGORITHM PARADIGM
For comparison, we simulate an alternate approach to creating an underlying MDP model for a grid world using multiple agents (4 agents), each equipped with a feedforward neural network whose weights are optimized using a genetic algorithm.
The objective function of these agents is to learn the model of the world via exploration. Training is done offline via epochs over multiple generations. The fitness function for each generation of the multi-agents is given by

$$F = \sum_{(i,j)} R(s_{i,j}) + R_a + R_g$$

where $R(s_{i,j})$ is the positive living reward for each newly explored state $(i, j)$ in the grid world, and $R_a$, $R_g$ are extra rewards assigned to absorbing and goal nodes.
The iteration terminates after a predefined number of epochs or after a predefined minimum sum of rewards has been obtained. When the simulation terminates, it creates the underlying MDP (optimal policy) using dynamic programming with respect to the goal node. It is important to note that the entire learning procedure is considered to be offline. Each epoch ran for a fixed duration (3750 units of CPU time) over 12 epochs (Fig. 5) before termination.
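The sketch below shows one GA generation over the flat weight vectors of a fixed feedforward network. The operators used (elitism, one-point crossover, Gaussian mutation) are our own assumptions; the paper does not specify these details.

```python
import random

def evolve(population, fitness_fn, elite=2, mutation_rate=0.05, sigma=0.1):
    """One GA generation over chromosomes that are lists of NN weights.

    Assumes fitness_fn returns a non-negative score (the paper's fitness
    is a sum of positive exploration rewards).
    """
    fitnesses = [fitness_fn(w) for w in population]
    # Elitism: carry the best chromosomes over unchanged.
    next_gen = sorted(population, key=fitness_fn, reverse=True)[:elite]
    while len(next_gen) < len(population):
        # Fitness-proportional (roulette wheel) parent selection.
        p1, p2 = random.choices(population, weights=fitnesses, k=2)
        cut = random.randrange(1, min(len(p1), len(p2)))  # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [w + random.gauss(0.0, sigma) if random.random() < mutation_rate
                 else w for w in child]                   # Gaussian mutation
        next_gen.append(child)
    return next_gen
```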

Section B
In Section B, we simulate the planning phase for a single agent in a POMDP environment that leverages the underlying MDP created in Section A. Our methodology incorporates the particle filter algorithm, leveraging roulette wheel selection for the resampling phase [46].

Figure 6 -Agent motion model for POMDPs
In our simulation, four sensor nodes are strategically placed at the edges of the grid world, with which the agent is able to localize itself with respect to its belief update [47]. Gaussian noise was added to the sensor inputs. For simplicity, we discretized the agent's motion within the stochastic environment. The key idea is to efficiently map the belief state of the agent (the averaged output of the particle filter) to the actual state of the agent. From Fig. 6, the agent's policy is mapped directly to its belief, which is based on the underlying MDP. Consequently, an accurate mapping would ultimately guide the agent to the goal node.
The resampling model depicted in Fig. 7 is the traditional resampling model, where particles are initialized randomly within the entire grid world [48], as depicted with the capital A. Thereafter, a new weighted sample based on importance weights is produced (lower case a) via the roulette wheel selection algorithm. The x, y coordinates of the belief state are thereafter obtained by averaging the particles' x, y coordinates. Fig. 8 shows the average result of this model. It is important to note that the agent motion model (Fig. 6) is iterated about five times with zero motion at the initialization phase before state transitions commence. The key idea is to minimize the error between the belief state and the actual state before any transition begins. It is also important to note that the initial state (position) of the agent in the world is unknown.
An improved model (Fig. 9) attempts to eliminate outliers resulting from the weighted samples by passing those samples through the roulette wheel a second time to produce a better weighted sample (Fig. 10) (lower case b) before averaging. Introducing a third resampling layer (Fig. 11) produced even better results on the averages, as shown in Fig. 12; a sketch of this layered scheme is given at the end of this subsection. In our final model, we include a preprocessing phase with N (N = 1000) particles randomly replicated 4 times in batches over the entire world, as depicted in the A, B, C and D segments (Fig. 13).
The agent intuitively attracts the batch of particles with the highest probabilistic weights into the iterative phase, leaving behind the other batches of N particles. This procedure keeps the computational complexity low while improving accuracy, as shown in Fig. 14. This implementation drastically reduced the frequency of occurrence of false negatives (Fig. 15a: the agent believes it is in a wall when it is actually not) and false positives (Fig. 15b: the agent believes it is not in a wall when it actually is) when observed over multiple runs. The final model maintained true positives (the agent's belief and actual state are approximately the same) over multiple runs, as shown in Fig. 15c.
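Below is a minimal sketch of the layered resampling idea described above. Re-weighting the particle set with the sensor model between passes is our assumption; the paper describes the scheme only at the level of Figs. 9-13.

```python
import random

def layered_resample(particles, observation, sensor_model, layers=3):
    """Pass the particle set through weighting and roulette-wheel
    resampling several times against the same observation, so that
    low-weight outliers are progressively filtered out before averaging."""
    for _ in range(layers):
        weights = [sensor_model(observation, p) for p in particles]
        total = sum(weights) or 1e-12
        weights = [w / total for w in weights]
        particles = random.choices(particles, weights=weights,
                                   k=len(particles))
    return particles

def belief_estimate(particles):
    """Belief state = average of the surviving particles' (x, y) coordinates."""
    n = len(particles)
    return (sum(p[0] for p in particles) / n,
            sum(p[1] for p in particles) / n)
```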

THE AMCL (ADAPTIVE MONTE CARLO LOCALIZATION) APPROACH
The AMCL model is a relatively recent state-of-the-art algorithm against which we compare our proposed localization algorithm. This algorithm adaptively adjusts the number of free particles during the resampling phase based on their weights. By leveraging the Kullback-Leibler divergence (KLD) [50,51], AMCL keeps the number of particles linear in the number of non-empty cells of the state space, and maintains an upper bound on the number of resampled particles throughout the sense-and-move cycle [52]. The agent's belief state and actual state transitions from the start position are shown in Fig. 16.
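For reference, the following is a sketch of the KLD-sampling particle bound commonly used by AMCL (following Fox's formulation); the parameter values shown are illustrative assumptions.

```python
import math

def kld_particle_bound(k, epsilon=0.05, z_quantile=2.33):
    """Upper bound on the number of particles needed so that the KL
    divergence between the sample-based and true distributions stays
    below epsilon with the confidence implied by z_quantile
    (2.33 is roughly the 0.99 quantile of the standard normal).

    k is the number of non-empty histogram bins (cells) of the state
    space currently occupied by particles.
    """
    if k <= 1:
        return 1
    a = 2.0 / (9.0 * (k - 1))
    return int(math.ceil((k - 1) / (2.0 * epsilon)
                         * (1.0 - a + math.sqrt(a) * z_quantile) ** 3))
```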

DISCUSSION OF RESULTS
We have obtained preliminary results from ongoing research in two phases: phase one for a typical learning problem and phase two for a complementary planning problem within a POMDP environment. In the first phase, we simulate learning of a POMDP environment using online, off-policy Q-learning with both single and multiple agents. The rationale is for the agents to learn an optimal policy within a stochastic environment. The simulation results showed a significant difference in CPU time over episodes between the single- and multi-agent frameworks. The multi-agent setup (with a size of 4) converged much faster. With the addition of an extra agent, we witnessed even further improvement in CPU time.
In contrast, we simulate an alternative offline learning approach using feedforward neural networks for multiple agents (with a size of 4) whose weights were optimized using a genetic algorithm over multiple epochs. This approach enables the agents to learn the model of the world by localizing all absorbing states, including the goal node, and thereafter terminating with an optimal policy with respect to the goal node using dynamic programming [49]. This model converged faster than the Q-learning model, though not without some drawbacks: it is not naturally suited for dynamic environments (such as open/closed doors) without a major modification to the algorithm, which could impact computational complexity. In the second phase (the planning phase), results show how segmented initialization of N particles combined with multi-layer resampling improved belief-state accuracy with respect to ground truth for scenarios in which sensor fusion may be impracticable. Our proposed approach to the resampling phase revealed better accuracy when compared with the AMCL (KLD) algorithm. Consequently, the mapping of the POMDP to the underlying MDP was of relatively high fidelity.

CONCLUSION
In this paper, the authors compare two learning paradigms for POMDP problems and also contribute to the planning phase via a clever modification to the resampling stage of the particle filter algorithm. The proposed algorithms could be implemented in partially observable environments where a search and rescue operation may be required.
The multi-agent Q-learning showed more robustness for both static and dynamic environments; however, it converges more slowly than its multi-agent feedforward neural network counterpart. The feedforward neural network offline learning paradigm, on the other hand, is unable to adequately model dynamic environments.
The results from the grid world for state representation using multi-agent Q-learning showed that increasing the number of agents increases the rate of convergence. Though this may be true for a grid world with a computable finite state space, future research may reveal the veracity of this theory in more complex scenarios where states are represented using feature vectors.
Furthermore, we leveraged a classical resampling method (the roulette wheel) to demonstrate how an ingenious adaptation of the particle filter algorithm improved belief-state accuracy with respect to robot localization within POMDP environments.