What are convolutional neural networks?

Mục Lục

What is convolutional neural network (CNN or convnet)?

A convolutional neural network (CNN or convnet) is a subset of machine learning. It is one of the various types of artificial neural networks which are used for different applications and data types. A CNN is a kind of network architecture for deep learning algorithms and is specifically used for image recognition and tasks that involve the processing of pixel data.

There are other types of neural networks in deep learning, but for identifying and recognizing objects, CNNs are the network architecture of choice. This makes them highly suitable for computer vision (CV) tasks and for applications where object recognition is vital, such as self-driving cars and facial recognition.

Inside convolutional neural networks

Artificial neural networks (ANNs) are a core element of deep learning algorithms. One type of an ANN is a recurrent neural network (RNN) that uses sequential or time series data as input. It is suitable for applications involving natural language processing (NLP), language translation, speech recognition and image captioning.

The CNN is another type of neural network that can uncover key information in both time series and image data. For this reason, it is highly valuable for image-related tasks, such as image recognition, object classification and pattern recognition. To identify patterns within an image, a CNN leverages principles from linear algebra, such as matrix multiplication. CNNs can also classify audio and signal data.

A CNN’s architecture is analogous to the connectivity pattern of the human brain. Just like the brain consists of billions of neurons, CNNs also have neurons arranged in a specific way. In fact, a CNN’s neurons are arranged like the brain’s frontal lobe, the area responsible for processing visual stimuli. This arrangement ensures that the entire visual field is covered, thus avoiding the piecemeal image processing problem of traditional neural networks, which must be fed images in reduced-resolution pieces. Compared to the older networks, a CNN delivers better performance with image inputs, and also with speech or audio signal inputs.

Types of AI apps explained
Convolutional neural network, a subset of machine learning, is a type of artificial neural network.

CNN layers

A deep learning CNN consists of three layers: a convolutional layer, a pooling layer and a fully connected (FC) layer. The convolutional layer is the first layer while the FC layer is the last.

From the convolutional layer to the FC layer, the complexity of the CNN increases. It is this increasing complexity that allows the CNN to successively identify larger portions and more complex features of an image until it finally identifies the object in its entirety.

Convolutional layer. The majority of computations happen in the convolutional layer, which is the core building block of a CNN. A second convolutional layer can follow the initial convolutional layer. The process of convolution involves a kernel or filter inside this layer moving across the receptive fields of the image, checking if a feature is present in the image.

Over multiple iterations, the kernel sweeps over the entire image. After each iteration a dot product is calculated between the input pixels and the filter. The final output from the series of dots is known as a feature map or convolved feature. Ultimately, the image is converted into numerical values in this layer, which allows the CNN to interpret the image and extract relevant patterns from it.

Pooling layer. Like the convolutional layer, the pooling layer also sweeps a kernel or filter across the input image. But unlike the convolutional layer, the pooling layer reduces the number of parameters in the input and also results in some information loss. On the positive side, this layer reduces complexity and improves the efficiency of the CNN.

Fully connected layer. The FC layer is where image classification happens in the CNN based on the features extracted in the previous layers. Here, fully connected means that all the inputs or nodes from one layer are connected to every activation unit or node of the next layer.

All the layers in the CNN are not fully connected because it would result in an unnecessarily dense network. It also would increase losses and affect the output quality, and it would be computationally expensive.

How convolutional neural networks work

A CNN can have multiple layers, each of which learns to detect the different features of an input image. A filter or kernel is applied to each image to produce an output that gets progressively better and more detailed after each layer. In the lower layers, the filters can start as simple features.

At each successive layer, the filters increase in complexity to check and identify features that uniquely represent the input object. Thus, the output of each convolved image — the partially recognized image after each layer — becomes the input for the next layer. In the last layer, which is an FC layer, the CNN recognizes the image or the object it represents.

With convolution, the input image goes through a set of these filters. As each filter activates certain features from the image, it does its work and passes on its output to the filter in the next layer. Each layer learns to identify different features and the operations end up being repeated for dozens, hundreds or even thousands of layers. Finally, all the image data progressing through the CNN’s multiple layers allow the CNN to identify the entire object.

A table depicting the differences between a convolutional neural network vs. recurrent neural network.

CNNs vs. neural networks

The biggest problem with regular neural networks (NNs) is a lack of scalability. For smaller images with fewer color channels, a regular NN may produce satisfactory results. But as the size and complexity of an image increases, the need for computational power and resources also increases which necessitates a larger and more expensive NN.

Moreover, the problem of overfitting also arises over time, wherein the NN tries to learn too many details in the training data. It may also end up learning the noise in the data, which affects its performance on test data sets. Ultimately, the NN fails to identify the features or patterns in the data set and thus the object itself.

In contrast, a CNN uses parameter sharing. In each layer of the CNN, each node connects to another. A CNN also has an associated weight; as the layers’ filters move across the image, the weights remain fixed — a condition known as parameter sharing. This makes the whole CNN system less computationally intensive than an NN system.

Benefits of using CNNs for deep learning

Deep learning is a subset of machine learning that uses neural networks with at least three layers. Compared to a network with just one layer, a network with multiple layers can deliver more accurate results. Both RNNs and CNNs are used in deep learning, depending on the application.

For image recognition, image classification and computer vision (CV) applications, CNNs are particularly useful because they provide highly accurate results, especially when a lot of data is involved. The CNN also learns the object’s features in successive iterations as the object data moves through the CNN’s many layers. This direct (and deep) learning eliminates the need for manual feature extraction (feature engineering).

CNNs can be retrained for new recognition tasks and built on preexisting networks. These advantages open up new opportunities to use CNNs for real-world applications without increasing computational complexities or costs.

As seen earlier, CNNs are more computationally efficient than regular NNs since they use parameter sharing. The models are easy to deploy and can run on any device, even smartphones.

Deep neural network diagram
Diagram illustrating how a neural network processes data.

Applications of convolutional neural networks

Convolutional neural networks are already used in a variety of CV and image recognition applications. Unlike simple image recognition applications, CV enables computing systems to also extract meaningful information from visual inputs (e.g., digital images) and then take appropriate action based on this information.

The most common applications of CV and CNNs are used in fields such as the following:

Healthcare. CNNs can examine thousands of visual reports to detect any anomalous conditions in patients, such as the presence of malignant cancer cells.
Automotive. CNN technology is powering research into autonomous vehicles and self-driving cars.
Social media. Social media platforms use CNNs to identify people in a user’s photograph and help the user tag their friends.
Retail. E-commerce platforms that incorporate visual search allow brands to recommend items that are likely to appeal to a shopper.
Facial recognition for law enforcement. Generative adversarial networks (GANs) are used to produce new images that can then be used to train deep learning models for facial recognition
Audio processing for virtual assistants. CNNs in virtual assistants learn and detect user-spoken keywords and process the input to guide their actions and respond to the user.

Learn how to build a machine learning model in 7 steps and see how automated machine learning improves project efficiency.