
8. Modern Convolutional Neural Networks

Now that we understand the basics of wiring together CNNs, let’s take a
tour of modern CNN architectures. This tour is, by necessity,
incomplete, thanks to the plethora of exciting new designs being added.
Their importance derives from the fact that not only can they be used
directly for vision tasks, but they also serve as basic feature
generators for more advanced tasks such as tracking
(Zhang et al., 2021), segmentation
(Long et al., 2015), object detection
(Redmon and Farhadi, 2018), or style transfer
(Gatys et al., 2016). In this chapter, most sections
correspond to a significant CNN architecture that was at some point (or
currently) the base model upon which many research projects and deployed
systems were built. Each of these networks was briefly a dominant
architecture and many were winners or runners-up in the ImageNet
competition, which has
served as a barometer of progress on supervised learning in computer
vision since 2010. It is only recently that Transformers have begun to
displace CNNs, starting with
Dosovitskiy et al. (2021) and followed by the Swin
Transformer (Liu et al., 2021). We will cover this development later
in the chapter on Attention Mechanisms and Transformers.

While the idea of deep neural networks is quite simple (stack together
a bunch of layers), performance can vary wildly across architectures and
hyperparameter choices. The neural networks described in this chapter
are the product of intuition, a few mathematical insights, and a lot of
trial and error. We present these models in chronological order, partly
to convey a sense of the history so that you can form your own
intuitions about where the field is heading and perhaps develop your own
architectures. For instance, batch normalization and residual
connections, both described in this chapter, have offered two popular
ideas for training and designing deep models; both have since been
applied to architectures beyond computer vision.
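
To make these two ideas concrete before their dedicated sections, here is a minimal sketch, assuming PyTorch (the framework used throughout this book), of a block that combines convolutions with batch normalization and adds its input back through a residual (skip) connection. The class name `PreviewResidualBlock` is a hypothetical placeholder for illustration, not an architecture from the literature.

```python
import torch
from torch import nn
from torch.nn import functional as F

class PreviewResidualBlock(nn.Module):
    """Hypothetical preview block: two 3x3 convolutions, each followed by
    batch normalization, with a residual connection around the pair."""
    def __init__(self, num_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.conv2 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        # Residual connection: add the input back before the final activation
        return F.relu(Y + X)

# The 3x3 convolutions with padding=1 preserve the spatial dimensions,
# so the input can be added to the output elementwise.
X = torch.randn(1, 8, 32, 32)
blk = PreviewResidualBlock(8)
print(blk(X).shape)  # torch.Size([1, 8, 32, 32])
```

The chapters on batch normalization and ResNet below explain why each of these ingredients helps when training deep networks; this sketch only shows how they fit together mechanically.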

We begin our tour of modern CNNs with AlexNet
(Krizhevsky et al., 2012), the first large-scale
network deployed to beat conventional computer vision methods on a
large-scale vision challenge; the VGG network
(Simonyan and Zisserman, 2014), which makes use of a number of
repeating blocks of elements; the network in network (NiN) that
convolves whole neural networks patch-wise over inputs
(Lin et al., 2013); GoogLeNet that uses networks with
multi-branch convolutions (Szegedy et al., 2015); the
residual network (ResNet) (He et al., 2016), which remains
one of the most popular off-the-shelf architectures in computer vision;
ResNeXt blocks (Xie et al., 2017) for sparser
connections; and DenseNet (Huang et al., 2017) for
a generalization of the residual architecture. Over time, many special
optimizations for efficient networks were developed, such as coordinate
shifts (ShiftNet) (Wu et al., 2018). This culminated in the
automatic search for efficient architectures such as MobileNet v3
(Howard et al., 2019). It also includes the
semi-automatic design exploration of
Radosavovic et al. (2020) that led to the
RegNetX/Y models, which we will discuss later in this chapter. The work is
instructive insofar as it offers a path to marry brute force computation
with the ingenuity of an experimenter in the search for efficient design
spaces. Also of note is the work of Liu et al. (2022), as it
shows that training techniques (e.g., optimizers, data augmentation, and
regularization) play a pivotal role in improving accuracy. It also shows
that long-held assumptions, such as the size of a convolution window,
may need to be revisited, given the increase in computation and data. We
will cover this and many more questions in due course throughout this
chapter.