[Paper Summary] Dueling Network Architectures for Deep Reinforcement Learning

1. Introduction

Dueling Network

A single-stream Q-network (top) and the dueling Q-network (bottom)

The dueling Q-network has two separate streams while sharing the convolutional layers.

One stream estimates the state value and the other estimates the advantage of each action.

What do the value and advantage streams learn?

The value stream learns to pay attention to the road.

The advantage stream learns to pay attention only when there are cars immediately in front, so as to avoid collisions.

The Dueling architecture represents both the value and advantage functions with a single deep model whose output combines the two to produce a state-action value.
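As a concrete illustration, here is a minimal PyTorch sketch of such a model. The layer sizes follow the standard DQN convention for 84×84×4 Atari frames; the class name and exact dimensions are illustrative assumptions, not the paper's code. The mean-subtraction in `forward` is the aggregating module discussed in Section 3.

```python
import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    """Minimal dueling Q-network sketch (assumes 84x84x4 Atari frames)."""

    def __init__(self, num_actions: int):
        super().__init__()
        # Shared convolutional trunk (standard DQN layer sizes).
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Two separate fully-connected streams.
        self.value_stream = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1)
        )
        self.advantage_stream = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.conv(x)
        value = self.value_stream(features)          # shape (B, 1)
        advantage = self.advantage_stream(features)  # shape (B, num_actions)
        # Combine the streams; subtracting the mean advantage keeps V and A identifiable.
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```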

2. Background

The optimal Q function satisfies the Bellman equation
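Written out in the paper's notation (γ is the discount factor):

```latex
\[
Q^{*}(s,a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s',a') \;\middle|\; s,\, a \,\right]
\]
```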

The advantage function relates the Q function to the state-value function.

The value function V measures how good it is to be in a particular state s. The Q function measures the value of choosing a particular action when in this state. The advantage function subtracts the value of the state from the Q function to obtain a relative measure of the importance of each action.
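In symbols, under a policy π:

```latex
\[
V^{\pi}(s) \;=\; \mathbb{E}_{a \sim \pi(s)}\!\left[ Q^{\pi}(s,a) \right],
\qquad
A^{\pi}(s,a) \;=\; Q^{\pi}(s,a) \;-\; V^{\pi}(s)
\]
```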

2.1 Deep Q-networks

The Q-network that produces the target y^DQN is frozen for a fixed number of iterations (the target network).

The main (online) Q-network is updated by gradient descent on the squared TD error.

Experience replay: transitions are stored in a replay buffer and sampled at random during learning, which improves data efficiency and reduces correlation among the samples.
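Following the paper, the target and the loss at iteration i are:

```latex
\[
y_i^{\mathrm{DQN}} \;=\; r + \gamma \max_{a'} Q\!\left(s', a';\, \theta^{-}\right)
\]
\[
L_i(\theta_i) \;=\; \mathbb{E}_{(s,a,r,s')}\!\left[ \left( y_i^{\mathrm{DQN}} - Q(s,a;\theta_i) \right)^{2} \right]
\]
```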

2.2 Double Deep Q-networks

The only thing that differs from DQN is the target.

DQN suffers from overoptimistic value estimates because the max operator uses the same network both to select and to evaluate an action; Double DQN decouples action selection from evaluation.
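The Double DQN target selects the greedy action with the online network and evaluates it with the target network:

```latex
\[
y_i^{\mathrm{DDQN}} \;=\; r + \gamma\, Q\!\left(s',\, \arg\max_{a'} Q(s',a';\theta_i);\; \theta^{-}\right)
\]
```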

3. The Dueling Network Architecture

A special aggregating module combines the two streams of fully-connected layers to output a single Q estimate.

From the definition of the advantage function, a first attempt is to simply sum the two streams.

Here θ denotes the parameters of the convolutional layers, and α and β are the parameters of the two streams of fully-connected layers.
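That is, as in the paper:

```latex
\[
Q(s,a;\theta,\alpha,\beta) \;=\; V(s;\theta,\beta) \;+\; A(s,a;\theta,\alpha)
\]
```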

The advantage function above is unidentifiable in the sense that, given Q, we cannot recover V and A uniquely: adding a constant to V and subtracting the same constant from A leaves Q unchanged.

To address this, the module can force the advantage function estimator to have zero advantage at the chosen action.
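The two aggregating modules from the paper: the first subtracts the maximum advantage, so the greedy action has zero advantage; the second, which is the one actually used, subtracts the mean advantage, sacrificing some identifiability in exchange for more stable optimization.

```latex
\[
Q(s,a;\theta,\alpha,\beta) \;=\; V(s;\theta,\beta)
  \;+\; \Big( A(s,a;\theta,\alpha) - \max_{a'} A(s,a';\theta,\alpha) \Big)
\]
\[
Q(s,a;\theta,\alpha,\beta) \;=\; V(s;\theta,\beta)
  \;+\; \Big( A(s,a;\theta,\alpha) - \tfrac{1}{|\mathcal{A}|} \sum_{a'} A(s,a';\theta,\alpha) \Big)
\]
```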

An ε-greedy policy is used, with ε chosen to be 0.001.
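For reference, a minimal ε-greedy action-selection sketch (the function name and tensor handling here are illustrative assumptions):

```python
import random

import torch


def epsilon_greedy_action(q_network, state, num_actions, epsilon=0.001):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))  # add a batch dimension
    return int(q_values.argmax(dim=1).item())
```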

4. Experiments

As the action space grows, the dueling network outperforms the single-stream Q-network by a larger margin.

4.2 General Atari Game-Playing

They also evaluate by measuring the percentage improvement in score over the better of the human and baseline agent scores:
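The measure, up to notation, is:

```latex
\[
\frac{\mathrm{Score}_{\mathrm{Agent}} - \mathrm{Score}_{\mathrm{Baseline}}}
     {\max\{\mathrm{Score}_{\mathrm{Human}},\, \mathrm{Score}_{\mathrm{Baseline}}\} - \mathrm{Score}_{\mathrm{Random}}}
\]
```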

This indicates how much the dueling architecture improves over the baseline single-stream network.

Duel Clip does better than Single Clip, as the results table in the paper shows.

Also, the combination of prioritized replay and the dueling network results in vast improvements over the previous state of the art.

What do the two streams of the dueling network do?

1. The value stream pays attention to the score and to the horizon, where the appearance of a car could affect future performance.

2. The advantage stream cares more about cars that are on an immediate collision course.

6. Conclusion

The Dueling architecture leads to dramatic improvements over existing approaches for deep RL in the challenging Atari domain.

https://arxiv.org/pdf/1511.06581.pdf