Deep Q-Network (DQN)-II

Experience Replay and Target Networks

This is the second post devoted to Deep Q-Network (DQN), in the “Deep Reinforcement Learning Explained” series, in which we will analyse some challenges that appear when we apply Deep Learning to Reinforcement Learning. We will also present in detail the code that solves the OpenAI Gym Pong game using the DQN network introduced in the previous post.

Challenges in Deep Reinforcement Learning

Unfortunately, reinforcement learning is more unstable when neural networks are used to represent the action-values, despite applying the wrappers introduced in the previous section. Training such a network requires a lot of data, but even then, it is not guaranteed to converge on the optimal value function. In fact, there are situations where the network weights can oscillate or diverge, due to the high correlation between actions and states.

In order to solve this, in this section we will introduce two techniques used by the Deep Q-Network:

  • Experience Replay
  • Target Network

There are many more tips and tricks that researchers have discovered to make DQN training more stable and efficient, and we will cover the best of them in future posts in this series.

Experience Replay

We are trying to approximate a complex, nonlinear function, Q(s, a), with a Neural Network. To do this, we calculate targets using the Bellman equation and then treat the task as a supervised learning problem. However, one of the fundamental requirements for SGD optimization is that the training data be independent and identically distributed, and when the Agent interacts with the Environment, the sequence of experience tuples can be highly correlated. A naive Q-learning algorithm that learns from these experience tuples in sequential order runs the risk of being swayed by the effects of this correlation.
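
Concretely, for a single transition (s, a, r, s′), the target that we regress Q(s, a) towards is the usual Q-learning target from the previous post (written here only as a reminder):

y = r + γ · max_{a′} Q(s′, a′)        (or simply y = r if s′ is terminal)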

We can prevent the action values from oscillating or diverging catastrophically by keeping a large buffer of our past experience and sampling training data from it, instead of using only our latest experience. This structure is called a replay buffer or experience buffer. The replay buffer contains a collection of experience tuples (S, A, R, S′). The tuples are gradually added to the buffer as we interact with the Environment. The simplest implementation is a buffer of fixed size, with new data added to the end of the buffer so that it pushes the oldest experience out.

The act of sampling a small batch of tuples from the replay buffer in order to learn is known as experience replay. In addition to breaking harmful correlations, experience replay allows us to learn from individual tuples multiple times, recall rare occurrences, and in general make better use of our experience.

In summary, the basic idea behind experience replay is to store past experiences and then use a random subset of them to update the Q-network, rather than using only the single most recent experience. To store the Agent’s experiences, we use a deque from Python’s built-in collections library. It is basically a list with a maximum size: if you append to it when it is already full, it removes the first item and adds the new item to the end. The experiences themselves are tuples of [state, action, reward, done flag, new state] that hold the transitions obtained from the Environment.

import collections
import numpy as np

Experience = collections.namedtuple(
    'Experience',
    field_names=['state', 'action', 'reward', 'done', 'new_state'])


class ExperienceReplay:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # sample batch_size distinct indices and unpack the transitions
        indices = np.random.choice(len(self.buffer), batch_size,
                                   replace=False)
        states, actions, rewards, dones, next_states = \
            zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions), \
               np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), \
               np.array(next_states)

Each time the Agent does a step in the Environment, it pushes the transition into the buffer, keeping only a fixed number of steps (in our case, 10k transitions). For training, we randomly sample the batch of transitions from the replay buffer, which allows us to break the correlation between subsequent steps in the environment.

Most of the experience replay buffer code is quite straightforward: it basically exploits the capabilities of the deque class. In the sample() method, we create a list of random indices and then repack the sampled entries into NumPy arrays for more convenient loss calculation.
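
For example, a minimal usage sketch of this class (with dummy 4-dimensional observations, purely for illustration) could look like this:

# illustrative only: fill the buffer with dummy transitions and sample a batch
buffer = ExperienceReplay(capacity=10000)

for _ in range(100):
    state = np.zeros(4, dtype=np.float32)        # placeholder observation
    new_state = np.ones(4, dtype=np.float32)     # placeholder next observation
    buffer.append(Experience(state, action=0, reward=1.0,
                             done=False, new_state=new_state))

states, actions, rewards, dones, next_states = buffer.sample(batch_size=32)
print(states.shape, actions.shape)   # (32, 4) (32,)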

Target Network

Remember that in Q-Learning, we update a guess with a guess, and this can potentially lead to harmful correlations. The Bellman equation provides us with the value of Q(s, a) via Q(s′, a′). However, the states s and s′ are only one step apart. This makes them very similar, and it is very hard for a Neural Network to distinguish between them.

When we perform an update of our Neural Network’s parameters to make Q(s, a) closer to the desired result, we can indirectly alter the values produced for Q(s′, a′) and other nearby states. This can make our training very unstable.

To make training more stable, there is a trick, called target network, by which we keep a copy of our neural network and use it for the Q(s’, a’) value in the Bellman equation.

That is, the predicted Q-values of this second network, called the target network, are used to compute the training targets for the main Q-network. It is important to highlight that the target network’s parameters are not trained; instead, they are periodically synchronized with the parameters of the main Q-network. The idea is that using the target network’s Q-values to build the targets improves the stability of training.

Later, when we present the code of the training loop, we will go into more detail about how to initialize and use this target network.
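
As a preview, the core of the trick fits in a couple of lines (DQN is the network class from the previous post; obs_shape and n_actions are placeholders for the observation shape and the number of actions of the Environment):

# two networks with the same architecture
net = DQN(obs_shape, n_actions)
target_net = DQN(obs_shape, n_actions)

# periodically copy the weights of the main network into the target network;
# the target network itself is never updated by the optimizer
target_net.load_state_dict(net.state_dict())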

Deep Q-Learning Algorithm

There are two main phases that are interleaved in the Deep Q-Learning algorithm. One is where we sample the Environment by performing actions and store the observed experience tuples in a replay memory. The other is where we randomly select a small batch of tuples from this memory and learn from that batch using a gradient descent (SGD) update step.

These two phases are not directly dependent on each other, and we could perform multiple sampling steps and then one learning step, or even multiple learning steps with different random batches. In practice, you will not be able to run the learning step immediately: you need to wait until you have enough experience tuples in the replay memory D.

The rest of the algorithm is designed to support these steps. We can summarize the previous explanations with the following pseudocode for the basic DQN algorithm, which will guide our implementation:
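
The sketch below is a schematic reconstruction of that algorithm from the steps described in this post (N denotes the memory capacity and C the synchronization period):

initialize replay memory D with capacity N
initialize the main Q-network with random weights θ
initialize the target network with weights θ⁻ = θ

for each episode:
    observe the initial state s
    for each step of the episode:
        # sample phase
        with probability ε select a random action a,
            otherwise select a = argmax_a Q(s, a; θ)
        execute a, observe reward r, done flag and next state s′
        store the transition (s, a, r, done, s′) in D
        s ← s′

        # learn phase (only once D holds enough transitions)
        sample a random minibatch of transitions from D
        for every sampled transition, set the target
            y = r                                   if done
            y = r + γ · max_{a′} Q(s′, a′; θ⁻)      otherwise
        perform one SGD step on the loss (y − Q(s, a; θ))² with respect to θ
        every C steps, copy θ into θ⁻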

In the beginning, we need to create the main network and the target network, and initialize an empty replay memory D. Note that memory is finite, so we may want to use something like a circular queue that retains only the most recent experience tuples. We also need to initialize the Agent, one of the main components, which interacts with the Environment.

Note that we do not clear out the memory after each episode; this enables us to recall and build batches of experiences across episodes.

Coding the Training Loop

Hyperparameters and execution time

Before going into the code, let us mention that DeepMind’s Nature paper contained a table with all the details of the hyperparameters used to train its model on the 49 Atari games used for evaluation. DeepMind kept all those parameters the same for all games, but trained an individual model for every game. The team’s intention was to show that the method is robust enough to solve lots of games with varying complexity, action space, reward structure, and other details using one single model architecture and set of hyperparameters.

However, our goal in this post is to solve just the Pong game, a quite simple and straightforward game in comparison to the other games in the Atari test set, so the hyperparameters in the paper are not the most suitable for a didactic post like this one. For this reason, we decided to use more tailored parameter values for our Pong Environment that converge to a mean score of 19.0 in a reasonable wall-clock time, depending on the GPU type that Colab assigns to our execution (a couple of hours at most). Remember that we can find out the type of GPU assigned to our runtime environment with the command !nvidia-smi.

Let’s start introducing the code in more detail. The entire code of this post can be found on GitHub (and can be run as a Google Colab notebook using this link). We skip the package imports, which are quite straightforward, and focus on the explanation of the hyperparameters:

DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
MEAN_REWARD_BOUND = 19.0

gamma = 0.99
batch_size = 32
replay_size = 10000
learning_rate = 1e-4
sync_target_frames = 1000
replay_start_size = 10000

DEFAULT_ENV_NAME identifies the Environment to train on, and MEAN_REWARD_BOUND is the reward boundary at which to stop training. We will consider that the game has converged when our Agent reaches an average of 19 games won (out of 21) over the last 100 games. The remaining parameters indicate:

  • gamma is the discount factor
  • batch_size is the minibatch size
  • learning_rate is the learning rate
  • replay_size is the replay buffer size (maximum number of experiences stored in replay memory)
  • sync_target_frames indicates how frequently we sync model weights from the main DQN network to the target DQN network (how many frames in between syncing)
  • replay_start_size is the count of frames (experiences) to add to the replay buffer before starting training.

Finally, the hyperparameters related to the epsilon decay schedule are the same as in the previous post:

eps_start=1.0
eps_decay=.999985
eps_min=0.02

Agent

One of the main components we need is an Agent, which interacts with the Environment and saves the results of these interactions directly into the experience replay buffer, performing the three steps of the sample phase indicated in the previous pseudocode.

First of all, during the Agent’s initialization, we need to store references to the Environment and to the experience replay buffer D, passed as the exp_buffer argument when the Agent object is created:

class Agent:
    def __init__(self, env, exp_buffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self._reset()

    def _reset(self):
        self.state = self.env.reset()
        self.total_reward = 0.0

In order to perform the Agent’s step in the Environment and store its result in the experience replay memory, we suggest the following code:

    def play_step(self, net, epsilon=0.0, device="cpu"):
        done_reward = None

        # epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = self.env.action_space.sample()
        else:
            state_a = np.array([self.state], copy=False)
            state_v = torch.tensor(state_a).to(device)
            q_vals_v = net(state_v)
            _, act_v = torch.max(q_vals_v, dim=1)
            action = int(act_v.item())

The method play_step uses an ϵ-greedy(Q) policy to select actions at every time step. In other words, with probability epsilon (passed as an argument) we take a random action; otherwise, we use the network to obtain the Q-values for all possible actions and choose the best one.

After obtaining the action, the method performs the step in the Environment to get its result: new_state, reward and is_done:

        new_state, reward, is_done, _ = self.env.step(action)
        self.total_reward += reward

Finally, the method stores the observation in the experience replay buffer and then handles the end-of-episode situation:

        exp = Experience(self.state, action, reward, is_done, new_state)
        self.exp_buffer.append(exp)
        self.state = new_state
        if is_done:
            done_reward = self.total_reward
            self._reset()
        return done_reward

The result of the function is the total accumulated reward if we have reached the end of the episode with this step, or None if not.

Main Loop

In the initialization part, we create our environment with all required wrappers applied, the main DQN neural network that we are going to train, and our target network with the same architecture. We also create the experience replay buffer of the required size and pass it to the agent. The last things we do before the training loop are to create an optimizer, a buffer for full episode rewards, a counter of frames and a variable to track the best mean reward reached (because every time the mean reward beats the record, we will save the model in a file):

env = make_env(DEFAULT_ENV_NAME)
net = DQN(env.observation_space.shape,
          env.action_space.n).to(device)
target_net = DQN(env.observation_space.shape,
                 env.action_space.n).to(device)

buffer = ExperienceReplay(replay_size)
agent = Agent(env, buffer)

epsilon = eps_start

optimizer = optim.Adam(net.parameters(), lr=learning_rate)
total_rewards = []
frame_idx = 0

best_mean_reward = None

At the beginning of the training loop, we count the number of iterations completed and update epsilon as we introduced in the previous post. Next, the Agent makes a single step in the Environment (using as arguments the current neural network and value for epsilon). Remember that this function returns a non-None result only if this step is the final step in the episode. In this case, we report the progress in the console (count of episodes played, mean reward for the last 100 episodes and the current value of epsilon):

while True:
    frame_idx += 1
    epsilon = max(epsilon * eps_decay, eps_min)
    reward = agent.play_step(net, epsilon, device=device)

    if reward is not None:
        total_rewards.append(reward)
        mean_reward = np.mean(total_rewards[-100:])
        print("%d: %d games, mean reward %.3f, (epsilon %.2f)" %
              (frame_idx, len(total_rewards), mean_reward, epsilon))

After that, every time our mean reward for the last 100 episodes reaches a new maximum, we report it in the console and save the current model parameters to a file. Also, if this mean reward exceeds the specified MEAN_REWARD_BOUND (19.0 in our case), we stop training. The third if helps us ensure that the experience replay buffer is large enough before training starts:

if best_mean_reward is None or best_mean_reward < mean_reward:
    torch.save(net.state_dict(),
               DEFAULT_ENV_NAME + "-best.dat")
    best_mean_reward = mean_reward
    if best_mean_reward is not None:
        print("Best mean reward updated %.3f" %
              (best_mean_reward))

if mean_reward > MEAN_REWARD_BOUND:
    print("Solved in %d frames!" % frame_idx)
    break

if len(buffer) < replay_start_size:
    continue

Learn phase

Now we will start to describe the part of the code, from the main loop, that refers to the phase where the network learn (a portion of the previous pseudocode):

The whole code that we wrote for implementing this part is as follows:

batch = buffer.sample(batch_size)
states, actions, rewards, dones, next_states = batch

states_v = torch.tensor(states).to(device)
next_states_v = torch.tensor(next_states).to(device)
actions_v = torch.tensor(actions).to(device)
rewards_v = torch.tensor(rewards).to(device)
done_mask = torch.ByteTensor(dones).to(device)

state_action_values = net(states_v).gather(1,
                          actions_v.unsqueeze(-1)).squeeze(-1)
next_state_values = target_net(next_states_v).max(1)[0]
next_state_values[done_mask] = 0.0
next_state_values = next_state_values.detach()

expected_state_action_values = next_state_values * gamma + rewards_v

loss_t = nn.MSELoss()(state_action_values,
                      expected_state_action_values)

optimizer.zero_grad()
loss_t.backward()
optimizer.step()

if frame_idx % sync_target_frames == 0:
    target_net.load_state_dict(net.state_dict())

We are going to dissect it to facilitate its description since it is probably the most complex part to understand.

The first thing to do is to sample a random minibatch of transitions from the replay memory:

batch = buffer.sample(batch_size)
states, actions, rewards, dones, next_states = batch

Next, the code wraps the individual NumPy arrays with batch data in PyTorch tensors and copies them to the GPU (we are assuming that a CUDA device is specified in the device argument):

states_v = torch.tensor(states).to(device)
next_states_v = torch.tensor(next_states).to(device)
actions_v = torch.tensor(actions).to(device)
rewards_v = torch.tensor(rewards).to(device)
done_mask = torch.ByteTensor(dones).to(device)

This code is inspired by the code of Maxim Lapan. It is written in a form that maximally exploits the capabilities of the GPU by processing all batch samples in parallel with vector operations. But explained step by step, it can be understood without problems.

Then, we pass the observations to the main model and extract the specific Q-values for the taken actions using the gather() tensor operation. The first argument to this function call is the dimension index that we want to perform the gathering on. In this case, it is equal to 1, because it corresponds to the actions dimension:

state_action_values = net(states_v).gather(1,
                          actions_v.unsqueeze(-1)).squeeze(-1)

The second argument is a tensor of indices of the elements to be chosen. Here the code is a bit more complex to explain, so let’s try! Maxim Lapan suggests using the functions unsqueeze() and squeeze(). Because the index should have the same number of dimensions as the data we are processing (2D in our case), we apply unsqueeze() to actions_v (which is 1D) to build the index argument for the gather() function. Finally, to remove the extra dimension we have created, we use the squeeze() function. Let’s illustrate what gather() does on a simple example:
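
Consider a batch of four entries and four actions, with made-up Q-values (a standalone snippet, not part of the training code):

import torch

# made-up Q-values for a batch of 4 states, each with 4 possible actions
q_vals = torch.tensor([[ 1.0,  2.0,  3.0,  4.0],
                       [ 5.0,  6.0,  7.0,  8.0],
                       [ 9.0, 10.0, 11.0, 12.0],
                       [13.0, 14.0, 15.0, 16.0]])

# actions actually taken for each of the 4 batch entries
actions = torch.tensor([3, 0, 2, 1])

# unsqueeze() turns the 1D index into shape (4, 1), gather() picks one
# Q-value per row along dim=1, and squeeze() removes the extra dimension
chosen_q = q_vals.gather(1, actions.unsqueeze(-1)).squeeze(-1)
print(chosen_q)   # tensor([ 4.,  5., 11., 14.])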

Note that the result of gather() applied to tensors is a differentiable operation that will keep all gradients with respect to the final loss value.

Now that we have calculated the state-action values for every transition in the replay buffer, we need to calculate target “y” for every transition in the replay buffer too. Both vectors are the ones we will use in the loss function. To do this, remember that we must use the target network.

In the following code, we apply the target network to our next state observations and calculate the maximum Q-value along the same action dimension, 1:

next_state_values = target_net(next_states_v).max(1)[0]

The max() function returns both the maximum values and the indices of those values (so it calculates both max and argmax). Because in this case we are interested only in the values, we take the first entry of the result.
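
A small illustrative example of this behaviour (with made-up values):

import torch

q_next = torch.tensor([[1.0, 5.0, 3.0],
                       [4.0, 2.0, 0.0]])

values, indices = q_next.max(1)   # maximum over the actions dimension
print(values)    # tensor([5., 4.])  -> used as max_a' Q(s', a')
print(indices)   # tensor([1, 0])    -> the argmax, which we discard here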

Remember that if the transition in the batch is from the last step in the episode, then our value of the action doesn’t have a discounted reward of the next state, as there is no next state from which to gather the reward:

next_state_values[done_mask] = 0.0

Although we cannot go into detail here, it is important to highlight that the calculation of the next state values by the target neural network should not affect the gradients. To achieve this, we use the detach() function of the PyTorch tensor, which makes a copy of it that is disconnected from the parent operation, to prevent gradients from flowing into the target network’s graph:

next_state_values = next_state_values.detach()

Now, we can calculate the Bellman approximation value for the vector of targets (“y”), that is the vector of the expected state-action value for every transition in the replay buffer:

expected_state_action_values = next_state_values * gamma + rewards_v

We have all the information required to calculate the mean squared error loss:

loss_t = nn.MSELoss()(state_action_values,
                      expected_state_action_values)

The next piece of the training loop updates the main neural network using the SGD algorithm by minimizing the loss:

optimizer.zero_grad()
loss_t.backward()
optimizer.step()

Finally, the last line of the code syncs parameters from our main DQN network to the target DQN network every sync_target_frames:

if frame_idx % sync_target_frames == 0:
target_net.load_state_dict(net.state_dict())

And that’s all the code for the main loop!

What is next?

This is the second of three posts devoted to presenting the basics of Deep Q-Network (DQN), and in it we have presented the algorithm in detail. In the next post, we will talk about the performance of the algorithm and also show how we can use it.