Dueling Deep Q Networks
Dueling Network Architectures for Deep Reinforcement Learning
Review & Introduction
Let’s go over some important definitions before going through the Dueling DQN paper. Most of these should be familiar.
- Given the agent’s policy $\pi$, the action value and state value are defined as, respectively:
  $$Q^{\pi}(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right], \qquad V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\left[ Q^{\pi}(s, a) \right]$$
  where $R_t = \sum_{\tau \ge t} \gamma^{\tau - t} r_\tau$ is the discounted return.
- The above Q function can also be written recursively, via the Bellman equation:
  $$Q^{\pi}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma\, \mathbb{E}_{a' \sim \pi(s')}\left[ Q^{\pi}(s', a') \right] \;\middle|\; s, a, \pi \right]$$
- The advantage is the quantity obtained by subtracting the V-value from the Q-value:
  $$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$$
Recall that the Q value represents the value of choosing a specific action at a given state, and the V value represents the value of the given state regardless of the action taken. Then, intuitively, the Advantage value shows how advantageous selecting an action is relative to the others at the given state.
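As a toy illustration (with made-up numbers): suppose a state $s$ has two actions with $Q^{\pi}(s, a_1) = 5$ and $Q^{\pi}(s, a_2) = 3$, and the policy picks both equally often, so $V^{\pi}(s) = 4$. Then $A^{\pi}(s, a_1) = +1$ and $A^{\pi}(s, a_2) = -1$: choosing $a_1$ is one unit of return better than the state’s average.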
What Changes & Motivation
- Wang et al. present a novel dueling architecture that explicitly separates the representation of state values and state-dependent action advantages into two separate streams.
- The key motivation behind this architecture is that for some games, it is unnecessary to know the value of each action at every timestep. The authors give an example of the Atari game Enduro, where it is not necessary to know which action to take until collision is imminent.
- By explicitly separating the two estimators, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action at each state. As in the Enduro example, this architecture becomes especially relevant in tasks where actions do not always affect the environment in meaningful ways.
Architecture
Like the standard DQN architecture, we have convolutional layers to process game-play frames. From there, we split the network into two separate streams, one for estimating the state-value and the other for estimating state-dependent action advantages. After the two streams, the last module of the network combines the state-value and advantage outputs.
Now, how do we combine/aggregate the two values?
It seems intuitive to simply sum the two values, as immediately suggested by the definition of the advantage:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$$

where $\theta$ denotes the parameters of the convolutional layers, and $\alpha$ and $\beta$ are the parameters of the advantage and value streams, respectively.
However, the authors point out two issues with this approach:
- It is problematic to assume that $V(s; \theta, \beta)$ and $A(s, a; \theta, \alpha)$ give reasonable estimates of the state value and the action advantages, respectively; nothing constrains them to do so. Naively adding these two values can, therefore, be problematic.
- The naive sum of the two is “unidentifiable,” in that given the Q value, we cannot recover the V and A uniquely. It is empirically shown in Wang et al. that this lack of identifiability leads to poor practical performance.
Therefore, the last module of the neural network implements the forward mapping shown below:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \max_{a' \in \mathcal{A}} A(s, a'; \theta, \alpha) \right)$$

which forces the Q-value of the maximizing action to equal $V$, solving the identifiability issue.
Alternatively, as used in Wang et al.’s experiments, we can replace the max with the mean of the advantages:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right)$$

We then choose the optimal action $a^*$ based on:

$$a^{*} = \arg\max_{a' \in \mathcal{A}} Q(s, a'; \theta, \alpha, \beta)$$
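To make the identifiability point concrete, here is a tiny NumPy sketch (numbers made up, purely illustrative) showing that the naive sum admits many $(V, A)$ decompositions of the same $Q$, while the mean-subtraction constraint pins the decomposition down:

```python
import numpy as np

# Hypothetical numbers, purely for illustration.
# Two different (V, A) pairs that produce the *same* Q under the naive sum:
V1, A1 = 10.0, np.array([1.0, -1.0, 0.0])
V2, A2 = 12.0, np.array([-1.0, -3.0, -2.0])
assert np.allclose(V1 + A1, V2 + A2)  # identical Q, so V and A are not recoverable

# With the mean-subtraction aggregation, the advantages are centered at zero,
# so V and A can be read back off from Q directly:
Q = V1 + (A1 - A1.mean())   # = [11., 9., 10.]
V_recovered = Q.mean()      # 10.0, the value estimate
A_recovered = Q - Q.mean()  # [1., -1., 0.], the centered advantages

# The greedy action is unchanged, since V and the subtracted mean
# are shared across all actions:
assert Q.argmax() == A1.argmax()
```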
Training
Because the dueling architecture shares the same input-output interface as the standard DQN architecture, the training process is identical. We define the loss of the model as the mean squared error between the predicted Q-value and the TD target:

$$L(\theta) = \mathbb{E}_{(s, a, r, s')}\left[ \left( y - Q(s, a; \theta) \right)^2 \right], \qquad y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$$

where $\theta^{-}$ are the parameters of the target network, and take a gradient descent step to update our model parameters.
Implementation
Now, let’s go through the implementation of Dueling DQN.
1. Network architecture: As discussed above, we want to split the state-dependent action advantages and the state-values into two separate streams. We also define the forward pass of the network with the forward mapping as discussed above:
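Below is a minimal PyTorch sketch of such a network (not the exact code from my repository): the layer sizes assume 84×84 stacked Atari frames, the class and stream names are illustrative, and the forward pass uses the mean-subtraction aggregation from above.

```python
import torch
import torch.nn as nn


class DuelingDQN(nn.Module):
    """Dueling network: shared conv trunk, then separate V and A streams."""

    def __init__(self, in_channels: int = 4, num_actions: int = 6):
        super().__init__()
        # Shared convolutional feature extractor (standard DQN trunk, 84x84 input).
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # State-value stream: outputs a single scalar V(s).
        self.value_stream = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1)
        )
        # Advantage stream: outputs one A(s, a) per action.
        self.advantage_stream = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        value = self.value_stream(h)          # shape (batch, 1)
        advantage = self.advantage_stream(h)  # shape (batch, num_actions)
        # Mean-subtraction aggregation: Q = V + (A - mean_a A).
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```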
2. Next, we will implement the update function:
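Here is a hedged sketch of what the update step can look like, assuming an online network, a frozen target network, and a replay-buffer `batch` of `(states, actions, rewards, next_states, dones)` tensors; the function and argument names are illustrative, and details such as the replay buffer and target-network syncing are unchanged from the vanilla DQN post.

```python
import torch
import torch.nn.functional as F


def update(online_net, target_net, optimizer, batch, gamma: float = 0.99):
    """One gradient step on the MSE TD loss (replay buffer not shown)."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken (actions is a LongTensor of shape (batch,)).
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target from the frozen target network: r + gamma * max_a' Q_target(s', a').
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Mean squared error loss, followed by a gradient descent step.
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```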
Besides these, nothing changes from the standard DQN implementation. For the full implementation, check out my vanilla DQN post or my GitHub repository.