How to choose the size of the convolution filter or Kernel size for CNN?
Convolution is essentially a dot product between a kernel (or filter) and a same-sized patch of the image (the local receptive field). Convolution is closely related to correlation and is translation-equivariant: translating the input and then applying the convolution gives the same result as applying the convolution first and then translating the output.
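To make the "dot product with a patch" idea concrete, here is a minimal NumPy sketch of a naive 2D "valid" cross-correlation (what deep learning libraries call convolution); the image and kernel values are illustrative:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation: slide the kernel over the
    image and take a dot product with each same-sized patch."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # dot product
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                  # simple averaging filter
print(conv2d_valid(image, kernel).shape)        # (3, 3)
```

Note the output is smaller than the input (5×5 → 3×3) because the kernel only visits positions where it fully overlaps the image.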
While learning CNNs, you find different kernel sizes at different places in code, and the question naturally arises: is there a specific way to choose these dimensions? The short answer is no. In practice, the deep learning community has converged on one popular choice, the 3×3 kernel. But then another question strikes you: why only 3×3, and not 1×1, 2×2, 4×4, etc.? Keep reading and you will get a crisp reason in the next few minutes!!
Broadly, we divide kernel sizes into smaller and larger ones. Smaller kernel sizes include 1×1, 2×2, 3×3, and 4×4, whereas larger ones are 5×5 and above, though for 2D convolution we rarely go beyond 5×5. In 2012, when the AlexNet CNN architecture was introduced, it used larger kernel sizes such as 11×11 and 5×5, and training took on the order of a week even on two GPUs. Because of this long and expensive training, such large kernel sizes fell out of favor.
One reason to prefer small convolutional kernels over a fully connected network is that convolution reduces computational cost through weight sharing, which ultimately means fewer weights to update during back-propagation. Then came the VGG convolutional neural networks (published at ICLR 2015), which replaced the large convolution layers with stacks of 3×3 convolution layers, but with many filters. Since then, the 3×3 kernel has become the popular choice. But still, why not 1×1, 2×2, or 4×4 among the smaller sizes?
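The VGG argument can be checked with simple arithmetic: two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with fewer weights. The channel count below is illustrative:

```python
# Parameter-count comparison behind the VGG design choice: two stacked
# 3x3 convolutions see the same 5x5 receptive field as one 5x5
# convolution, but need fewer weights (bias terms ignored).
channels = 64  # assumed input/output channel count, same for all layers

params_5x5 = 5 * 5 * channels * channels              # one 5x5 layer
params_3x3_stack = 2 * (3 * 3 * channels * channels)  # two 3x3 layers

print(params_5x5)        # 102400
print(params_3x3_stack)  # 73728
```

So the 3×3 stack uses about 28% fewer parameters, and also inserts an extra non-linearity between the two layers.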
- A 1×1 kernel is mainly used for dimensionality reduction, i.e., reducing the number of channels. It captures the interaction between input channels at a single pixel of the feature map. So 1×1 was ruled out for feature extraction: the features it produces are extremely fine-grained and local, with no information from neighboring pixels.
- 2×2 and 4×4 kernels are generally avoided because odd-sized filters divide the previous layer's pixels symmetrically around the output pixel. With even-sized kernels such as 2×2 and 4×4, this symmetry is lost, which introduces distortions across the layers. That is why we don't use 2×2 and 4×4 kernel sizes.
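The channel-mixing role of the 1×1 kernel mentioned above can be sketched in NumPy: a 1×1 convolution is just a matrix multiply over the channel axis at every pixel. The shapes here are illustrative:

```python
import numpy as np

# A 1x1 convolution mixes channels at each pixel independently; it is
# commonly used to shrink the channel dimension (here 256 -> 64).
fmap = np.random.rand(256, 14, 14)  # feature map: channels x H x W
weights = np.random.rand(64, 256)   # 1x1 kernel: 64 output channels

# For every pixel (h, w), multiply the 256-channel vector by `weights`.
reduced = np.einsum('oc,chw->ohw', weights, fmap)
print(reduced.shape)  # (64, 14, 14)
```

Spatial size is untouched; only the channel count changes, which is exactly why 1×1 is used for cheap dimensionality reduction rather than spatial feature extraction.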
Therefore, 3×3 remains the default choice for practitioners to this day. But convolutions are still among the most expensive parts of a network!
Bonus: digging further, I found another interesting approach, used in Google's Inception V3 architecture for the ImageNet recognition challenge, that replaces a 3×3 convolution layer with a 1×3 layer followed by a 3×1 layer, effectively splitting the 3×3 convolution into a series of one-dimensional convolutions. And it turns out to be quite cost-friendly!!
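The saving from this factorization is easy to count: per output pixel (ignoring channel counts), a 3×3 kernel needs 9 multiplications, while a 1×3 followed by a 3×1 needs only 3 + 3 = 6:

```python
# Multiplications per output pixel for a full 3x3 kernel versus the
# Inception-style factorization into 1x3 followed by 3x1.
cost_3x3 = 3 * 3
cost_factorized = (1 * 3) + (3 * 1)

print(cost_3x3, cost_factorized)  # 9 6
saving = 1 - cost_factorized / cost_3x3
print(round(saving, 2))           # 0.33
```

A roughly 33% reduction per layer, which is why the Inception authors applied the same trick to larger n×n kernels as well.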
Thanks for giving it a read. I found this to be one of the most common questions put up by deep learning novices (including me….;), since a clear and crisp reason for using a specific kernel size is not generally covered in most courses. It's my first article on Medium, so if you liked it, do not forget to give a clap!! Have a nice day!!