
ToyCarIRL

Implementation of an Inverse Reinforcement Learning algorithm on a toy car in a 2D world (Apprenticeship Learning via Inverse Reinforcement Learning, Abbeel & Ng, 2004)

Install / Use

/learn @jangirrishabh/ToyCarIRL

README

Reinforcement learning (RL) is the most basic and intuitive form of trial-and-error learning; it is the way most living organisms with some capacity for thought learn. Often referred to as learning by exploration, it is how a newborn human baby learns to take its first steps: by taking random actions initially, then slowly figuring out which actions lead to a forward walking motion.

Note: this post assumes a good understanding of the reinforcement learning framework; please familiarize yourself with RL through weeks 5 and 6 of this excellent online course, AI_Berkeley.

Now, the question I kept asking myself is: what is the driving force for this kind of learning? What compels the agent to learn a particular behavior the way it does? Upon learning more about RL, I came across the idea of rewards: the agent tries to choose its actions so that the rewards it gets from a particular behavior are maximized. To make the agent perform different behaviors, it is the reward structure that one must modify/exploit. But suppose we only have knowledge of an expert's behavior; how do we then estimate the reward structure given a particular behavior in the environment? This is precisely the problem of Inverse Reinforcement Learning (IRL): given the optimal expert policy (or rather, a policy assumed to be optimal), we wish to determine the underlying reward structure.

<div class="imgcap" align="middle"> <img src="/assets/IRL/rl_des.png" width="50%"> <div class="thecap" align="middle"> The reinforcement learning framework. </div> </div> <div class="imgcap" align="middle"> <img src="/assets/IRL/irl_des.png" width="50%"> <div class="thecap" align="middle" > The Inverse reinforcement learning framework. </div> </div>

Again, this is not an introduction to Inverse Reinforcement Learning; rather, it is a tutorial on how to use/code an Inverse Reinforcement Learning framework for your own problem. IRL lies at the very core of it, though, and it is essential to know about it first. IRL has been extensively studied and algorithms have been developed for it; please go through the papers Ng and Russell, 2000 and Abbeel and Ng, 2004 for more information.

This post adapts the algorithm from Abbeel and Ng, 2004 to solve the IRL problem.

Problem to be solved

The idea here is to program a simple agent in a 2D world full of obstacles to copy/clone different behaviors in the environment. The behaviors are provided via expert trajectories given manually by a human/computer expert. This form of learning from expert demonstrations is called apprenticeship learning in the scientific literature. At its core lies Inverse Reinforcement Learning; we are just trying to figure out the different reward functions for these different behaviors.

Apprenticeship vs. imitation learning - what is the difference?

Broadly, both fall under learning from demonstration (LfD). Both methods learn from demonstrations, but they learn different things:

  • Apprenticeship learning via inverse reinforcement learning will try to infer the goal of the teacher. In other words, it will learn a reward function from observation, which can then be used in reinforcement learning. If it discovers that the goal is to hit a nail with a hammer, it will ignore blinks and scratches from the teacher, as they are irrelevant to the goal.

  • Imitation learning (a.k.a. behavioral cloning) will try to directly copy the teacher. This can be achieved by supervised learning alone. The AI will try to copy every action, even irrelevant ones such as blinking or scratching, and even mistakes. You could use RL here too, but only if you have a reward function.

Working Environment

<div class="imgcap" align="middle"> <img src="/assets/IRL/envo.png" width="50%"> <div class="thecap" align="middle" > The white dots represent the extent to which the agent's sensors extend. </div> </div>
  • Agent: the agent is a small green circle with its heading direction indicated by a blue line.

  • Sensors: the agent is equipped with 3 combined distance-and-color sensors, and their readings are the only information the agent has about the environment.

  • State Space: the state of the agent consists of 8 observable features:

  1. Distance sensor 1 reading ( /40 to normalize)
  2. Distance sensor 2 reading ( /40 to normalize)
  3. Distance sensor 3 reading ( /40 to normalize)
  4. No. of sensors seeing black color ( /3 to normalize)
  5. No. of sensors seeing yellow color ( /3 to normalize)
  6. No. of sensors seeing brown color ( /3 to normalize)
  7. No. of sensors seeing red color ( /3 to normalize)
  8. Boolean flag indicating a crash/bump into an obstacle (1: crash, 0: alive)

Note: the normalization ensures that every observable feature value lies in the range [0,1], which is a necessary condition on the rewards for the IRL algorithm to converge.

  • Rewards: the reward after each frame is computed as a weighted linear combination of the feature values observed in that frame. The reward r_t in the t-th frame is the dot product of the weight vector w with the vector of feature values observed in the t-th frame, i.e. the state vector phi_t, so that r_t = w^T phi_t.

  • Available Actions: with every new frame the agent automatically takes a forward step. The available actions can turn the agent left or right, or do nothing (a simple forward step). Note that the turning actions include the forward motion as well; there is no in-place rotation.

  • Obstacles: the environment consists of rigid walls, deliberately colored differently. The agent's color sensing helps it distinguish between obstacle types. The environment is designed this way for easy testing of the IRL algorithm.

  • Starting position (state): the bot's starting state is fixed, since the IRL algorithm requires the starting state to be the same across all iterations.
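As a concrete illustration, the state construction and reward computation described above can be sketched as follows. The function names and the weight values are hypothetical (in the actual project the weights are produced by the IRL algorithm); only the normalization and the dot-product reward follow the description above.

```python
import numpy as np

def make_state(d1, d2, d3, n_black, n_yellow, n_brown, n_red, crashed):
    """Build the 8-dimensional normalized state vector:
    three distance readings (/40), four color counts (/3), and a crash flag."""
    return np.array([
        d1 / 40.0, d2 / 40.0, d3 / 40.0,
        n_black / 3.0, n_yellow / 3.0, n_brown / 3.0, n_red / 3.0,
        1.0 if crashed else 0.0,
    ])

def reward(w, phi):
    """r_t = w^T phi_t: weighted linear combination of the feature values."""
    return float(np.dot(w, phi))

# Hypothetical weight vector that rewards clear distances and
# penalizes seeing red obstacles and crashing.
w = np.array([0.1, 0.1, 0.1, 0.0, 0.0, 0.0, -0.5, -1.0])
phi = make_state(20, 40, 35, 0, 0, 0, 1, False)
r = reward(w, phi)
```

Because every feature lies in [0,1], the magnitude of each weight directly reflects how much that feature matters to the learned behavior.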

Important modifications over the RL algorithm in Matt's code

Note that the RL algorithm is adopted almost entirely from this post by Matt Harvey, with minor changes, so it makes sense to discuss the changes I have made. Even if the reader is comfortable with RL, I highly recommend glancing over that post to understand how the reinforcement learning takes place.

The environment has been changed significantly: the agent can now sense not only the distance from its 3 sensors but also the color of the obstacles, enabling it to distinguish between obstacle types. The agent is also smaller, and its sensing dots are closer together, for more resolution and better performance. Obstacles had to be made static for now to simplify testing of the IRL algorithm; this may well lead to overfitting, but I am not concerned about that at the moment. As discussed above, the agent's observation set (state) has been increased from 3 features to 8, with the inclusion of the crash feature. The reward structure is completely changed: the reward is now a weighted linear combination of these 8 features, and the agent no longer receives a -500 reward for bumping into obstacles. Rather, the bump feature value is +1 on a crash and 0 otherwise, and it is up to the algorithm to decide what weight to assign this feature based on the expert behavior.

As stated in Matt's blog, the aim here is not just to teach the RL agent to avoid obstacles. Why assume anything about the reward structure? Let the reward structure be decided entirely by the algorithm from the expert demonstrations, and see what behavior a particular setting of rewards achieves!

Inverse Reinforcement Learning

Important definitions:

  1. The features or basis functions phi_i, which are basically the observables in the state. The features for the current problem are discussed above in the state-space section. We define phi(s_t) to be the sum of all the features phi_i, such that:

    \phi(s_t) = \phi_1 + \phi_2 + \phi_3 + \dots + \phi_n

  2. Rewards r_t: a weighted linear combination of the feature values observed at each state s_t.

    r(s, a, s') = w_1 \phi_1 + w_2 \phi_2 + w_3 \phi_3 + \dots + w_n \phi_n = w^T \phi(s_t)
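To make these definitions concrete, here is a minimal sketch of the projection variant of the Abbeel & Ng (2004) algorithm: estimate discounted feature expectations from sampled trajectories, then iteratively update the weight vector w as the difference between the expert's feature expectations and their projection onto the line through the previously matched point and the newest policy's feature expectations. Function names and the discount factor are illustrative, not taken from the repository.

```python
import numpy as np

def feature_expectations(trajectories, gamma=0.9):
    """Average discounted sum of feature vectors over sampled trajectories."""
    mu = np.zeros(len(trajectories[0][0]))
    for traj in trajectories:
        for t, phi in enumerate(traj):
            mu += (gamma ** t) * np.asarray(phi)
    return mu / len(trajectories)

def projection_step(mu_expert, mu_new, mu_bar_prev):
    """One iteration of the projection method.

    Projects the expert's feature expectations onto the line between the
    previous projection mu_bar_prev and the newest policy's mu_new, then
    returns the new weights w, the projection mu_bar, and the margin
    t = ||mu_expert - mu_bar||. The loop terminates when t is small.
    """
    d = mu_new - mu_bar_prev
    den = np.dot(d, d)
    if den > 0:
        mu_bar = mu_bar_prev + (np.dot(d, mu_expert - mu_bar_prev) / den) * d
    else:
        mu_bar = mu_bar_prev
    w = mu_expert - mu_bar
    t = np.linalg.norm(w)
    return w, mu_bar, t
```

Each returned w defines a reward r = w^T phi; an RL learner (such as the Q-network in this project) is then trained against that reward to produce the next policy's feature expectations.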
