4. Deep Q-Networks

Distributional RL

The most fundamental deviation from standard DQN is the reintroduction of probability into the agents. All Q-learning agents derive from the Bellman equation (Equation 2-10), which defines the optimal trajectory as the action that maximizes the expected return. The key word here is expected. Q-learning implements this as an exponentially decaying average. Whenever you use a summary statistic you are making implicit assumptions about the underlying distribution. If you choose the mean as your summary statistic, you are implicitly assuming that the data follows a normal distribution. Is this correct?

For example, imagine you are standing at the cliff edge in the Cliffworld environment. There is a large negative reward if you fall off the cliff. But there is also a much greater (though still near-zero) reward for walking along the cliff edge. “Q-Learning Versus SARSA” showed that, on average, SARSA followed the “safe” path far away from the cliff. Q-learning took the risky path next to the cliff because the smaller number of steps to the goal resulted in a slightly greater expected return. All states in the Cliffworld environment lead to one of two outcomes, and each outcome has its own distribution, reflecting the small chance that exploration pushes the agent toward the alternative outcome. The expected return is not normally distributed. In general, rewards are multimodal.
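
To see how misleading the mean can be, consider the short sketch below. The reward values are illustrative, not taken from the actual Cliffworld implementation; it samples returns from the two outcomes and shows that the mean lands in a region that no single episode ever visits:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical Cliffworld-style returns: most episodes reach the goal
# with a small positive return, but occasionally exploration sends the
# agent off the cliff and incurs a large penalty.
goal_returns = rng.normal(loc=1.0, scale=0.2, size=900)
cliff_returns = rng.normal(loc=-100.0, scale=1.0, size=100)
returns = np.concatenate([goal_returns, cliff_returns])

# The expected return sits between the two modes; no single episode
# ever produces a value anywhere near it.
print(f"mean return: {returns.mean():.2f}")  # roughly -9.1
```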

Now imagine a Cliffworld-like environment where the difference between success and failure is not as dramatic. Imagine that the reward upon success is +1 and upon failure is –1. Then the expected return, for at least some states, will be close to zero. In these states the expected value suggests that no action makes any difference, which can lead to agents randomly wandering around meaningless states (compare this with gradient descent algorithms that plateau when the loss function becomes too flat). This is shown in Figure 4-5. In general, reward distributions are much more complex.
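
The following sketch makes this concrete with made-up numbers: two actions with very different risk profiles but identical expected returns, so a greedy agent has nothing to choose between them:

```python
import numpy as np

# Two actions with different outcome distributions but (by construction)
# identical expected returns. All numbers are illustrative.
# Action 0: a coin flip between +1 (success) and -1 (failure).
# Action 1: a timid action with the same odds but tiny stakes.
outcomes = {
    0: (np.array([+1.0, -1.0]), np.array([0.5, 0.5])),
    1: (np.array([+0.1, -0.1]), np.array([0.5, 0.5])),
}

q_values = {a: float(np.dot(v, p)) for a, (v, p) in outcomes.items()}
print(q_values)  # {0: 0.0, 1: 0.0}

# A greedy agent that only sees the expectations has no basis for
# preferring either action, even though their risk profiles differ.
best_action = max(q_values, key=q_values.get)
```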

Figure 4-5. A depiction of multimodal rewards for a state. (a) presents a simple grid environment with a penalizing cliff on one side and a rewarding goal on the other. The reward distribution, shown in (b), has two opposite rewards, and the mean, which is the same as the value function for that state, is zero.

In practice, this leads to convergence issues. The worst-case scenario is that the agent never converges at all. Even when it does, researchers often describe the agent as “chattering”: the agent approaches a solution but never settles on one. It can oscillate, as in Figure 4-6, or behave in more complex ways. Note that this is not the only cause of chattering.

Figure 4-6. A depiction of chattering using a plot of the rewards during training.

Learning the distribution of rewards can lead to more stable learning.9 Bellemare et al. reformulate the Bellman equation to account for random rewards, which you can see in Equation 4-3, where γ is the discount factor and R is a stochastic reward that depends on a state and action.

Equation 4-3. Distributional Bellman equation

$$Z(s, a) \stackrel{D}{=} R(s, a) + \gamma Z(S', A')$$
Here I temporarily break from this book’s convention and use capital letters to denote stochastic variables. Z represents the distribution of returns; the expectation of Z is the Q-value. Like the Q-value, Z should be updated recursively according to the received reward. The key difference is the emphasis that the return, Z, is a random variable distributed over all future states, S, and future actions, A. In other words, Zπ, the return distribution under a policy π, is a mapping of state-action pairs to a distribution over returns. This representation preserves multimodality, and Bellemare et al. suggest that this leads to more stable learning. I recommend that you watch this video to see an example of the distribution of rewards learned during a game of Space Invaders.
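
The link between Z and the Q-value is just an expectation. For a discrete representation like the one C51 uses (the atom locations and probabilities below are made up for illustration), that expectation is a dot product:

```python
import numpy as np

# A discrete approximation of Z(s, a): support points (atoms) and the
# probability the agent assigns to each. The values are illustrative.
atoms = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
probs = np.array([0.4, 0.05, 0.1, 0.05, 0.4])   # bimodal, sums to 1

# The Q-value is the expectation of the return distribution. Here the
# expectation is zero even though the agent almost always receives a
# return close to +1 or -1.
q_value = float(np.dot(probs, atoms))
```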

The first incarnation of this idea is an algorithm called C51. It asks two key questions: how do you represent the distribution, and how do you create the update rule for Equation 4-3? The choice of distribution or function used to model the returns is important. C51 uses a discrete parametric distribution; in other words, a histogram of returns. The “51” in C51 refers to the number of histogram bins chosen in the researchers’ implementation; their results suggest that performance improves with the number of bins, albeit with diminishing returns.
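
In code, the histogram amounts to a fixed support: a minimum return, a maximum return, and 51 evenly spaced atoms in between. The sketch below uses the bounds commonly quoted for the Atari experiments, V_min = −10 and V_max = 10, but treat them as hyperparameters to tune for your own environment:

```python
import numpy as np

n_atoms = 51                   # the "51" in C51
v_min, v_max = -10.0, 10.0     # assumed return range; tune per environment

# Evenly spaced support points (bin centres) for the return histogram.
atoms = np.linspace(v_min, v_max, n_atoms)
delta_z = (v_max - v_min) / (n_atoms - 1)   # spacing between atoms

# For every state-action pair the network outputs n_atoms probabilities,
# one per atom, so Z(s, a) is approximated by a categorical distribution.
```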

To produce an estimate of the distribution, Bellemare et al. built an NN with 51 output “bins.” When the environment produces a return that falls into one of the bins, the agent can train the NN to predict that bin. This is now a supervised classification problem, and the agent can be trained using a cross-entropy loss. The “C” in C51 refers to the fact that the agent is predicting classes, or categories.
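
A minimal sketch of that training signal follows (NumPy only, with the target histogram reduced to a one-hot bin; the full algorithm first projects the distributional Bellman target of Equation 4-3 onto the 51 atoms, a step omitted here):

```python
import numpy as np

n_atoms = 51

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Stand-in for the network's output for one state-action pair: 51
# unnormalised logits that the softmax turns into a return histogram.
logits = np.random.default_rng(0).normal(size=n_atoms)
predicted = softmax(logits)

# Target histogram. In the real algorithm this is the distributional
# Bellman target projected onto the same 51 atoms; here it is simply a
# one-hot bin for brevity.
target = np.zeros(n_atoms)
target[30] = 1.0

# Cross-entropy loss, exactly as in supervised classification.
loss = -np.sum(target * np.log(predicted + 1e-8))
```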