Building a DQN in PyTorch: Balancing Cart Pole with Deep RL

Part 3 of the Reinforcement Learning Series

Introduction

Hi Geeks, welcome to Part 3 of our Reinforcement Learning Series. In the last two blogs, we covered some basic RL concepts and studied the multi-armed bandit problem along with its solution methods. This blog will be a bit longer: we will first learn some new concepts and then apply Deep Learning to build a deep RL agent, which we will then train to balance the cart pole.

The code repository corresponding to this blog can be accessed here.

The Cart Pole Balancing Problem

We will be using the CartPole-v0 environment provided by OpenAI Gym. Although the environment is well known, I am including its complete description here for the sake of completeness.

Description

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart’s velocity.

Cart Pole Environment

State Space

The observation of this environment is a four-tuple: cart position, cart velocity, pole angle, and pole angular velocity.

Action Space

There are just two possible actions: Left or Right, corresponding to the direction in which the agent pushes the cart.

Reward

The reward is 1 for every step taken, including the termination step.

Starting State

All observations are assigned a uniform random value between ±0.05.

Episode Termination

1. Pole Angle is more than ±12°

2. Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)

3. Episode length is greater than 200 (500 for v1).

Solved Requirements

Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.

The Behavior of a Random Agent

We will first check the average reward that a random agent can earn. By random agent, I am referring to an agent that selects actions randomly, i.e. without using any environment information. Running a simple evaluation snippet gave an average reward of 23.3 in my case; it may vary slightly in yours. Either way, the problem is far from solved.
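A minimal sketch of such a random-agent evaluation, assuming the classic Gym API where env.step() returns a 4-tuple and env.reset() returns the observation, looks like this:

```python
import gym

env = gym.make("CartPole-v0")
n_episodes = 100
total_reward = 0.0

for _ in range(n_episodes):
    obs = env.reset()
    done = False
    while not done:
        # Pick an action uniformly at random, ignoring the observation.
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        total_reward += reward

print("Average reward over", n_episodes, "episodes:", total_reward / n_episodes)
```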

The Real Question !!!

Consider the following interaction between Agent and Environment.

Taken From Reinforcement Learning — An Introduction

Based on the observations and rewards received from the environment, the agent selects some action. The agent must have a policy (a.k.a. strategy) according to which it selects actions. Just having a policy is not enough; the agent must also have a mechanism to improve this policy as it interacts more and more with the environment. Now, some of the questions are:

1. How to represent a policy?

2. How to evaluate the current policy (Policy Evaluation)?

3. How to improve the policy (Policy Improvement)?

Deep Learning to The Rescue

Ideally, we would first discuss these issues in the context of traditional tabular methods, but that would make this blog very long. To summarize, we will still be using the traditional approaches, but with deep neural networks as function approximators. Without a neural network, applying these algorithms would require storing a table of dimension S x A, where S is the number of possible states and A is the number of actions available in the environment. Even for a seemingly simple environment like CartPole, whose observations are continuous, such a table is too large to be usable in practice.

Let us see how we can use Deep Learning to address the above concerns :

  1. In DL we use neural networks as function approximators. We can represent our policy via a deep neural network: it looks at the given observation and tells us which action is best to take in the current state. We refer to such a network as a Policy Network.
  2. By policy evaluation, we mean checking how “good” or “impactful” our current policy is. The loss of the Policy Network can be used for this. In this blog, we will use the Mean Squared Error between predicted and target returns to evaluate our policy network.
  3. The Policy Evaluation step gives us the loss value of the current policy network. With this information, we can use Gradient Descent to optimize the weights of the policy network to minimize this loss. In this way, the policy network can be improved (a tiny code illustration of these two steps follows this list).
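To make points 2 and 3 concrete, here is a tiny illustration, with random placeholder data and a stand-in network, of evaluating predictions with a mean squared error loss and then improving the network with a single gradient descent step:

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)                      # stand-in for a policy/Q-network
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

predicted = net(torch.rand(8, 4))          # predicted returns for a dummy batch
target = torch.rand(8, 2)                  # target returns (placeholder values)

loss = nn.MSELoss()(predicted, target)     # policy evaluation: how far off are we?
optimizer.zero_grad()
loss.backward()                            # policy improvement: one gradient descent step
optimizer.step()
```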

Deep Q-Network

A DQN is a Q-value function approximator. At each time step, we pass the current environment observations as input. The output is the Q-value corresponding to each possible action.

Q-Network
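For CartPole, a small fully connected network is enough. A minimal PyTorch sketch of such a Q-network (the hidden size of 64 is an assumption, not a prescribed choice) maps the 4-dimensional observation to 2 Q-values, one per action:

```python
import torch
import torch.nn as nn

def build_q_network(obs_dim=4, n_actions=2, hidden=64):
    # Maps an observation to one Q-value per possible action.
    return nn.Sequential(
        nn.Linear(obs_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )

q_net = build_q_network()
obs = torch.rand(1, 4)            # a dummy CartPole observation
q_values = q_net(obs)             # shape (1, 2): one Q-value per action
best_action = q_values.argmax(dim=1)
```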

But wait… where are the ground truths ???

In Supervised Learning, we have a ground truth corresponding to each input data point. The network prediction can be compared against the corresponding ground truth to evaluate its performance. But here we do not have ground truths, or at least not in the usual sense.

In most cases, we do not know the exact dynamics of the environment, which means we do not know the true value of selecting an action in a given state. Even if the environment dynamics were known, we would need to run the agent-environment interaction for a sufficiently long time, ideally until the end of the episode, and only then go back and compute the ground-truth values. Note that this would also require storing the entire sequence of interactions, which is not feasible in most scenarios.

Discounted Returns as Ground Truth

The value of taking action a in state s, i.e. q(s, a), can be written as:

q(s_t, a_t) = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + …

where γ is the discount factor, whose value lies in the interval [0, 1]. The idea is that we care not only about the immediate reward but also about the future rewards that can result after taking this action.

The discount rate determines the present value of future rewards: a reward received k time steps in the future is worth only γ^(k−1) times what it would be worth if it were received immediately¹.

With a bit of rearrangement, the above equation can be written recursively as:

q(s_t, a_t) = R_t + γ·max_a q(s_{t+1}, a)
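To see why the rearrangement works, here is a tiny check (with made-up reward values) that the discounted sum equals the immediate reward plus γ times the discounted return from the next step onward, which is what the max over next-state Q-values estimates:

```python
gamma = 0.95
rewards = [1.0, 1.0, 1.0, 1.0]   # hypothetical rewards R_t, R_{t+1}, R_{t+2}, R_{t+3}

# Direct definition: sum of discounted rewards starting at time t.
direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# Recursive form: R_t + gamma * (discounted return from t+1 onward).
return_from_next = sum(gamma ** k * r for k, r in enumerate(rewards[1:]))
recursive = rewards[0] + gamma * return_from_next

print(direct, recursive)          # both print the same value
```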

Training Algorithm

Step-1: Initialize the environment and get the initial observation.

Step-2: Input the observation (obs) to the Q-network and get the Q-value corresponding to each action. Store the maximum of these Q-values in X.

Step-3: With probability epsilon, select a random action; otherwise, select the action corresponding to the maximum Q-value.

Step-4: Execute the selected action in the environment and collect the generated reward (r_t) and the next-state observation (obs_next).

Step-5: Pass the next-state observation through the Q-network and store the maximum of the resulting Q-values in a variable, say q_next_state. If the discount factor is gamma, then the ground truth can be calculated as:

Y = r_t + gamma * q_next_state

Step-6: Take X as the predicted return of the current state and Y as the target return. Calculate the loss and perform an optimization step.

Step-7: Set obs = obs_next.

Step-8: Repeat Step-2 to Step-7 for n episodes.
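Putting these steps together, a bare-bones sketch of this loop (before the replay buffer and target network introduced below, and with illustrative hyperparameters) might look like this:

```python
import random

import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v0")
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
gamma, epsilon = 0.95, 1.0

for episode in range(1000):                            # Step-8: repeat for n episodes
    obs = env.reset()                                  # Step-1: initial observation
    done = False
    while not done:
        q_values = q_net(torch.FloatTensor(obs))       # Step-2: Q-values for obs
        X = q_values.max()
        if random.random() < epsilon:                  # Step-3: epsilon-greedy choice
            action = env.action_space.sample()
        else:
            action = int(q_values.argmax().item())
        obs_next, reward, done, _ = env.step(action)   # Step-4: act, observe
        with torch.no_grad():                          # Step-5: target return
            q_next_state = q_net(torch.FloatTensor(obs_next)).max()
        Y = reward + gamma * q_next_state
        loss = loss_fn(X, Y)                           # Step-6: loss + optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        obs = obs_next                                 # Step-7
    epsilon = max(0.05, epsilon * 0.995)               # illustrative epsilon decay
```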

Balancing Exploration and Exploitation

In the beginning, our agent has no idea of the environment dynamics, so we should let it explore; as it interacts more with the environment, it should increasingly exploit what it has learned. There is a need to balance this exploration and exploitation. We can either choose the action corresponding to the maximum Q-value (exploitation), or, with a probability epsilon, select a random action (exploration). In this agent's training, we start with epsilon = 1, i.e. 100% exploration, and slowly decrease it to 0.05.

Catastrophic Forgetting and Need For Replay Buffer

There is a serious issue with the above training process. After each step of agent-environment interaction, we are performing an optimization step. This can lead to catastrophic forgetting.

Today’s deep learning methods struggle to learn rapidly in the incremental, online settings that are most natural for the reinforcement learning algorithms emphasized in this book. The problem is sometimes described as one of “catastrophic interference” or “correlated data.” When something new is learned it tends to replace what has previously been learned rather than adding to it, with the result that the benefit of the older learning is lost. Techniques such as “replay buffers” are often used to retain and replay old data so that its benefits are not permanently lost¹.

So, as you might have guessed by now, we will be using a replay buffer to address this problem. The agent will gather its experience in the replay buffer, and a random batch of experiences will then be sampled from it. This batch will be used to train the agent using mini-batch gradient descent.
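A minimal replay buffer can be built on top of a deque. This is a sketch with placeholder capacity and batch size; the actual values used in the driver code are discussed below:

```python
import random
from collections import deque

import numpy as np
import torch

class ReplayBuffer:
    def __init__(self, capacity=256):
        # The deque drops the oldest experience automatically once it is full.
        self.buffer = deque(maxlen=capacity)

    def collect(self, obs, action, reward, obs_next):
        self.buffer.append((obs, action, reward, obs_next))

    def sample_batch(self, batch_size=16):
        batch = random.sample(self.buffer, batch_size)
        obs, actions, rewards, obs_next = zip(*batch)
        return (torch.tensor(np.array(obs), dtype=torch.float32),
                torch.tensor(actions, dtype=torch.int64),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(np.array(obs_next), dtype=torch.float32))
```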

Training Instability and the Need for Two Identical Q-Networks

Until now, the same Q-network has been used to predict the Q-values of both the current state and the next state. The Q-value of the next state is then used to calculate the ground truth. In simple words,

We execute an optimization step to bring the prediction closer to the ground truth, but at the same time we are changing the weights of the very network that gave us the ground truth. This causes instability in training.

The solution is to have another network, called the Target Network, which is an exact copy of the Main Network. This target network is used to generate the target values, or ground truth. Its weights are held fixed for a certain number of training steps, after which they are updated with the weights of the Main Network. In this way, the distribution of our target returns is also held fixed for those iterations, which increases training stability.
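In PyTorch, keeping the two networks in sync is just a matter of copying the state dict every few training steps; a rough sketch (the sync frequency of 5 is illustrative):

```python
import copy

import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)   # the target network starts as an exact copy

network_sync_freq = 5               # illustrative: sync every 5 training steps
network_sync_counter = 0

def maybe_sync_target():
    global network_sync_counter
    network_sync_counter += 1
    if network_sync_counter >= network_sync_freq:
        # Copy the main network's weights into the target network.
        target_net.load_state_dict(q_net.state_dict())
        network_sync_counter = 0
```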

Also, note that we have been using the terms policy network and Q-network almost interchangeably, but these are two different types of networks. Given a state, a policy network outputs a probability distribution over actions, while a Q-network outputs a Q-value for every action.

Coding our DQN Agent

It seems quite natural to wrap our agent in a class. The agent receives state observations and rewards from the environment and acts on the environment based on the current observation. The Deep Q-Network is the brain of our agent: the agent learns from its interactions and adjusts the weights of the Q-network accordingly. Let us quickly go through the code:

The __init__ function builds two identical deep neural networks. Before that, we first seed the torch random generator so that the weights of the neural networks are initialized deterministically.

“Seed is also a Hyper-parameter” 🙂

Kindly remove all occurrences of “.cuda()” from this code if you do not have CUDA support on your machine. The variable network_sync_freq denotes the number of training steps to take before updating the target network with the weights of the main network. The variable network_sync_counter is incremented after each training step in the train() function and is reset to 0 when it reaches network_sync_freq. The variable experience_replay is a deque. In the train() function, the Q-values of the current state are estimated using the main Q-network, while the Q-values of the next state are calculated using the target network and then used to compute the target return.

The rest of the code is pretty much self-explanatory.
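Here is a condensed sketch of such an agent, following the structure described above; the layer sizes, learning rate, and default hyperparameters are my assumptions, and the .cuda() calls are omitted for portability:

```python
import copy
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class DQN_Agent:
    def __init__(self, seed, obs_dim=4, n_actions=2, lr=1e-3,
                 sync_freq=5, exp_replay_size=256):
        torch.manual_seed(seed)                      # "seed is also a hyper-parameter"
        self.q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                   nn.Linear(64, n_actions))
        self.target_net = copy.deepcopy(self.q_net)  # identical target network
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.loss_fn = nn.MSELoss()
        self.gamma = 0.95
        self.network_sync_freq = sync_freq
        self.network_sync_counter = 0
        self.experience_replay = deque(maxlen=exp_replay_size)

    def get_action(self, obs, epsilon):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            return random.randrange(2)
        with torch.no_grad():
            q_values = self.q_net(torch.FloatTensor(obs))
        return int(q_values.argmax().item())

    def collect_experience(self, experience):
        # experience = (obs, action, reward, obs_next)
        self.experience_replay.append(experience)

    def sample_from_experience(self, batch_size=16):
        batch = random.sample(self.experience_replay, batch_size)
        obs, actions, rewards, obs_next = zip(*batch)
        return (torch.tensor(np.array(obs), dtype=torch.float32),
                torch.tensor(actions, dtype=torch.int64),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(np.array(obs_next), dtype=torch.float32))

    def train(self, batch_size=16):
        obs, _actions, rewards, obs_next = self.sample_from_experience(batch_size)
        # Sync the target network every network_sync_freq training steps.
        self.network_sync_counter += 1
        if self.network_sync_counter >= self.network_sync_freq:
            self.target_net.load_state_dict(self.q_net.state_dict())
            self.network_sync_counter = 0
        # Predicted return: max Q-value of the current state (main network),
        # following Step-2/Step-6 of the training algorithm above.
        pred = self.q_net(obs).max(dim=1).values
        # Target return: reward + gamma * max Q-value of next state (target network).
        with torch.no_grad():
            q_next = self.target_net(obs_next).max(dim=1).values
        target = rewards + self.gamma * q_next
        loss = self.loss_fn(pred, target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```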

Deep Q-Network Agent

Driver code

The driver code is very simple. We first initialize both the environment and the agent. The replay buffer is then filled to its full capacity, 256 in this case. After that, we alternate between two phases: the buffer is held fixed while we run 4 training steps, each sampling a random batch of 16 experiences from it, and then the agent interacts with the environment for the next 128 time steps, collecting new experience into the buffer. Note that since the buffer is a deque filled to full capacity before the main training loop, each new experience inserted into it also removes one element from the front.

To balance exploration and exploitation, we use the epsilon-greedy strategy: we start with full exploration by setting epsilon = 1 and update it after each episode, slowly decreasing it to 0.05.
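A condensed sketch of this driver loop, reusing the hypothetical DQN_Agent class sketched above (the episode count, seed, and epsilon schedule are illustrative):

```python
import gym

env = gym.make("CartPole-v0")
agent = DQN_Agent(seed=1423, exp_replay_size=256)

# Fill the replay buffer to its full capacity (256) with fully random behaviour.
obs = env.reset()
for _ in range(256):
    action = agent.get_action(obs, epsilon=1.0)
    obs_next, reward, done, _ = env.step(action)
    agent.collect_experience((obs, action, reward, obs_next))
    obs = env.reset() if done else obs_next

epsilon = 1.0
n_episodes = 10000
for episode in range(n_episodes):
    obs, done = env.reset(), False
    while not done:
        # Hold the buffer fixed and run 4 training steps with batches of 16.
        for _ in range(4):
            agent.train(batch_size=16)
        # Then interact for up to 128 time steps, collecting new experience.
        for _ in range(128):
            action = agent.get_action(obs, epsilon)
            obs_next, reward, done, _ = env.step(action)
            agent.collect_experience((obs, action, reward, obs_next))
            obs = obs_next
            if done:
                break
    # Decay epsilon after each episode, down to a floor of 0.05.
    epsilon = max(0.05, epsilon - (1.0 - 0.05) / n_episodes)
```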

Driver Code for training the agent

Some Plots

  1. This plot shows how the reward varies as training progresses. Roughly after 6500 episodes, the agent scores the maximum reward in each episode.

Variation of Reward with episode

2. This plot shows the variation in loss value as training progresses.

x-axis: epoch, y-axis: loss

3. This plot shows the variation of epsilon as training progresses.

x-axis: epoch, y-axis: epsilon

Video Time !!!

This video shows how gracefully our agent balances the cart pole; the pole almost appears to be still. It scored the maximum reward each time I tried, although averaging over a large number of episodes would be a better evaluation.

CartPole Balancing via Deep Q-Network

The above video was generated with a short recording snippet.
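A rough equivalent of such a snippet, using gym's (since deprecated) Monitor wrapper and assuming a trained agent object with a q_net attribute like the sketch above, might look like this:

```python
import gym
import torch

# Wrap the environment so that the episode is recorded to ./videos.
env = gym.wrappers.Monitor(gym.make("CartPole-v0"), "./videos", force=True)

obs, done = env.reset(), False
while not done:
    # Act greedily: always pick the action with the highest Q-value.
    with torch.no_grad():
        q_values = agent.q_net(torch.FloatTensor(obs))
    obs, reward, done, _ = env.step(int(q_values.argmax().item()))

env.close()   # finalizes and saves the recorded video
```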

Limitations

There are some limitations to our DQN Agent. Let us look at some of them.

Hacks… A lot of them !!!

As you can easily observe, getting the right hyperparameter values requires a lot of experimentation. Even the way the neural networks are initialized has a significant effect on training.

Online vs Offline training

Due to the need for a target network to stabilize training and a replay buffer to address catastrophic forgetting, our agent cannot be trained in a purely online manner.

Bad Generalization

I was not able to get the same agent to work in other environments, mainly because our agent is a very basic one. However, the agent described in the original DQN paper was able to generalize across different environments.

Conclusion

Combining Deep Learning and Reinforcement Learning is very fascinating. Building this DQN and getting it to work was an amazing experience, but there are still a lot of limitations to this approach. DQN was introduced in 2013, and the DQN we implemented in this blog is a much simpler version of the one proposed. In the paper, it is described as:

We refer to convolutional networks trained with our approach as Deep Q-Networks (DQN).

Since 2013, a lot of progress has been made in Deep Reinforcement Learning. There is a great compilation of resources at this link. With this blog, I have just tried to scratch the surface; there is a long way to go from here. So we will keep exploring!!!

What’s Next: A Journey to AWS DeepRacer

Our Team FRacer in Time Trial Race of AWS Deep Racer

In the next few blogs, we will take you on a journey through the AWS DeepRacer platform. We will describe how we used these skills to break into the top 1% of the global ranking in the AWS DeepRacer Virtual Circuit contest for August 2020.