VGG Neural Networks: The Next Step After AlexNet
AlexNet came out in 2012 and was a revolutionary advancement: it improved on traditional Convolutional Neural Networks (CNNs) and became one of the best models for image classification… until VGG came along.
AlexNet. When AlexNet was published, it easily won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) and proved itself to be one of the most capable models for image classification out there. Its key features include using ReLU instead of the tanh function, optimization for multiple GPUs, and overlapping pooling. It addressed overfitting with data augmentation and dropout. So what was wrong with AlexNet? Well, nothing was particularly “wrong” with it; people just wanted even more accurate models.
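As a rough illustration (my own sketch, not the authors’ code), the AlexNet-style ingredients mentioned above — ReLU activations, overlapping max-pooling (a 3×3 window with stride 2), and dropout in the classifier — look roughly like this in PyTorch; the layer sizes are simplified placeholders rather than the full AlexNet:

```python
import torch
import torch.nn as nn

# Minimal sketch of AlexNet-style ingredients (not the original implementation):
# ReLU activations, overlapping 3x3/stride-2 max-pooling, and dropout.
block = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),  # large first-layer receptive field
    nn.ReLU(inplace=True),                       # ReLU instead of tanh
    nn.MaxPool2d(kernel_size=3, stride=2),       # overlapping pooling (window > stride)
    nn.Flatten(),
    nn.LazyLinear(4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),                           # dropout against overfitting
    nn.LazyLinear(1000),
)

x = torch.randn(1, 3, 227, 227)   # dummy image batch
print(block(x).shape)             # torch.Size([1, 1000])
```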
The Dataset. The general baseline for image recognition is ImageNet, a dataset that consists of more than 15 million images labeled with more than 22 thousand classes. Built by web-scraping images and crowd-sourcing human labelers, ImageNet even hosts its own competition: the previously mentioned ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). Researchers from around the world are challenged to innovate methodology that yields the lowest top-1 and top-5 error rates (the top-5 error rate is the percentage of images for which the correct label is not among the model’s five most likely predictions). The competition provides a 1,000-class training set of 1.2 million images, a validation set of 50 thousand images, and a test set of 150 thousand images; data is plentiful. AlexNet won this competition in 2012, and models based on its design won the competition in 2013.
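To make the metric concrete, here is a small sketch (my own, not part of the competition tooling) of how a top-5 error rate can be computed, assuming `scores` is an (N, num_classes) array of class scores and `labels` holds the true class indices:

```python
import numpy as np

def top_k_error(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label is NOT among the k highest-scoring classes."""
    # Indices of the k largest scores per sample (order within the top-k doesn't matter).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Tiny synthetic example: 4 samples, 10 classes.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))
labels = np.array([1, 3, 5, 7])
print(top_k_error(scores, labels, k=5))
```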
Configurations of VGG; depth increases from left to right and the added layers are bolded. The convolutional layer parameters are denoted as “conv<receptive field size> — <number of channels>”. Image credits to Simonyan and Zisserman, the original authors of the VGG paper.
VGG Neural Networks. While previous derivatives of AlexNet focused on smaller window sizes and strides in the first convolutional layer, VGG addresses another very important aspect of CNNs: depth. Let’s go over the architecture of VGG (a code sketch follows the list):
- Input. VGG takes in a 224×224-pixel RGB image. For the ImageNet competition, the authors cropped a 224×224 patch from each (rescaled) image to keep the input size consistent.
- Convolutional Layers. The convolutional layers in VGG use a very small receptive field (3×3, the smallest size that still captures left/right and up/down). Some configurations also use 1×1 convolution filters, which act as a linear transformation of the input followed by a ReLU non-linearity. The convolution stride is fixed at 1 pixel, with 1 pixel of padding for the 3×3 filters, so that the spatial resolution is preserved after convolution.
- Fully-Connected Layers. VGG has three fully-connected layers: the first two have 4096 channels each and the third has 1000 channels, 1 for each class.
- Hidden Layers. All of VGG’s hidden layers use ReLU (a huge innovation from AlexNet that cut training time). VGG does not generally use Local Response Normalization (LRN), as LRN increases memory consumption and training time with no particular increase in accuracy.
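Putting these pieces together, here is a minimal PyTorch sketch of a VGG-16-style network (configuration D from the paper): stacks of 3×3 convolutions with stride 1 and 1-pixel padding, 2×2 max-pooling between stacks, and the three fully-connected layers. This is an illustrative reimplementation, not the authors’ original code:

```python
import torch
import torch.nn as nn

# Channel counts per stage for configuration D (VGG-16); 'M' marks a 2x2 max-pool.
CFG_D = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]

def make_vgg16(num_classes: int = 1000) -> nn.Sequential:
    layers, in_ch = [], 3
    for v in CFG_D:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3x3 receptive field, stride 1, padding 1 preserves spatial resolution.
            layers.append(nn.Conv2d(in_ch, v, kernel_size=3, stride=1, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_ch = v
    layers += [
        nn.Flatten(),
        # Three fully-connected layers: 4096, 4096, then one output per class.
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),
    ]
    return nn.Sequential(*layers)

model = make_vgg16()
x = torch.randn(1, 3, 224, 224)  # 224x224 RGB input
print(model(x).shape)            # torch.Size([1, 1000])
```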
The Difference. VGG, while based on AlexNet, has several differences that separate it from other competing models:
- Instead of using large receptive fields like AlexNet (11×11 with a stride of 4), VGG uses very small receptive fields (3×3 with a stride of 1). A stack of three 3×3 layers covers the same effective receptive field as a single 7×7 layer, but because there are now three ReLU units instead of just one, the decision function is more discriminative. The stack also has fewer parameters: 27C² weights instead of the 49C² a single 7×7 layer would need, where C is the number of channels (the sketch after this list makes this concrete).
- VGG incorporates 1×1 convolutional layers to make the decision function more non-linear without changing the receptive fields.
- The small-size convolution filters allow VGG to have a large number of weight layers, and the added depth leads to improved performance. This isn’t a unique feature, though: GoogLeNet, another model built on deep CNNs and small convolution filters, also showed up in the 2014 ImageNet competition.
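To make the parameter comparison concrete, here is a quick back-of-the-envelope calculation (ignoring biases, and assuming C input and C output channels per layer, as in the paper’s argument):

```python
# Parameters of a stack of three 3x3 conv layers vs. a single 7x7 layer,
# both mapping C channels to C channels (biases ignored).
def stacked_3x3_params(C: int) -> int:
    return 3 * (3 * 3 * C * C)    # 27 * C^2

def single_7x7_params(C: int) -> int:
    return 7 * 7 * C * C          # 49 * C^2

C = 256
print(stacked_3x3_params(C))  # 1,769,472
print(single_7x7_params(C))   # 3,211,264
```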
Performance of VGG at multiple test scales. Image credits to Simonyan and Zisserman, the original authors of the VGG paper.
Results. On a single test scale, VGG achieved a top-1 error of 25.5% and a top-5 error of 8.0%. At multiple test scales, VGG got a top-1 error of 24.8% and a top-5 error of 7.5%. VGG also achieved second place in the 2014 ImageNet competition with a top-5 error of 7.3%, which the authors reduced to 6.8% after the submission.
Now what? VGG is an innovative object-recognition model with up to 19 weight layers. Built as a deep CNN, VGG also outperforms baselines on many tasks and datasets beyond ImageNet. VGG remains one of the most widely used image-recognition architectures today.
I’ve attached some further resources below that may be interesting.