What is Feature Pyramid Network (FPN)?

💡 Overview

Feature Pyramid Network is designed to combine the features from different levels of a convolutional network, in order to better detect objects at different scales.

🧠 Key Takeaways

  • A feature pyramid network (FPN) is a neural network component used in computer vision for object detection.

  • The first step in the network is to create a convolutional feature pyramid (C1-C7) from the input image.

  • C1 (conv1) is the first feature map, at half the resolution of the input image.

  • C7 (conv7) is the output of the last convolutional layer.

  • For example, C2 is good at detecting small objects, C3 at medium-sized objects, C4 at large objects, and so forth.

  • C2 is detail-aware (zoomed in), whereas C5 is context-aware (zoomed out).

  • To benefit from context-aware features at every scale, we need to fuse the features from adjacent layers.

  • That’s why we use a feature pyramid network (FPN).
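To make the pyramid levels in the takeaways concrete, here is a minimal sketch of how the spatial size shrinks from C1 to C7. It assumes each level Ci has a stride of 2**i relative to the input (so C1 is half the input size) and that the input size is divisible by 128. The helper name `pyramid_sizes` is hypothetical and is not part of any library.

```python
def pyramid_sizes(height, width, first_level=1, last_level=7):
    """Return {level_name: (h, w)}, assuming level Ci has stride 2**i.

    Hypothetical helper for illustration only; real backbones may round
    differently when the input size is not divisible by the stride.
    """
    sizes = {}
    for i in range(first_level, last_level + 1):
        stride = 2 ** i
        sizes[f"C{i}"] = (height // stride, width // stride)
    return sizes


sizes = pyramid_sizes(512, 512)
for name, (h, w) in sizes.items():
    print(f"{name}: {h}x{w}")
```

For a 512x512 input, the fine levels keep many spatial positions (detail), while the coarse levels keep only a few positions that each cover a large part of the image (context).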

🤔 What is Feature Pyramid Network?


You have probably heard that FPN builds high-level semantic feature maps at all scales. What does that mean?

It means we combine feature maps from different levels of the network: each coarse, semantically strong map is upsampled and merged with the finer map from the level below it, producing a new feature map at that finer scale.

By doing so, we create a semantically strong feature map at each scale.
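The fusion described above can be sketched in plain Python. This is a simplified illustration, not the MMDetection implementation: it works on single-channel 2D lists, uses nearest-neighbor upsampling, and merges by element-wise addition. The 1x1 lateral convolutions and 3x3 output convolutions of a real FPN are omitted, and the function names (`upsample2x`, `add_maps`, `top_down`) are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D list of numbers."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def add_maps(a, b):
    """Element-wise sum of two 2D lists with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down(c_maps):
    """Fuse maps ordered finest to coarsest; each map is half the previous size.

    Assumption: lateral and output convolutions are skipped, so this only
    shows the upsample-and-add pattern of the top-down pathway.
    """
    p_maps = [c_maps[-1]]  # the coarsest level passes through unchanged
    for c in reversed(c_maps[:-1]):
        merged = add_maps(c, upsample2x(p_maps[0]))
        p_maps.insert(0, merged)
    return p_maps


# Toy inputs: constant maps so the contribution of each level is easy to see.
c3 = [[1] * 4 for _ in range(4)]
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]
p3, p4, p5 = top_down([c3, c4, c5])
```

In this toy example every value in `p3` carries a contribution from `c4` and `c5`, which is the sense in which the finest output level also receives context from the coarser levels.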

Feature Pyramid Network (FPN) diagram

As you can see, the P2/P3 feature maps highlight details (low-level semantics), whereas the P5/P6 feature maps highlight context (high-level semantics).

FPN Feature Maps (P2, P3, P4, P5)

💠 Feature Pyramid Network (FPN) Types

There are several types of FPN, including the original FPN, NAS-FPN, and BiFPN.

Check the Feature Pyramid Network (FPN) Comparison to see which one is the best for your use case.

Feature Pyramid Network (FPN) is also known as the neck in object detection architectures

✍️ Some other observations:

  • FPN is used in both One-Stage and Two-Stage Object Detection Architectures

  • The model diagram corresponds to the One-Stage Object Detection Architecture

  • The (P3-P5) layers are also referred to as the convolutional (C3-C5) layers in some papers

  • P7out is simply referred to as P7 in other papers

  • There are other, more sophisticated FPN designs. YOLO-ReT proposed a new design. Check out my YOLO-ReT post for more details

Comparing models

🎯 Actionable resources for FPN

You can train one of the many object detection models that use FPN in this notebook: Getting Started in Object Detection Notebook

👨‍💻 Code snippet

# VFNet Model
if selection == 0:
    model_type = models.mmdet.vfnet
    backbone = model_type.backbones.resnet50_fpn_mstrain_2x

# RetinaNet Model
if selection == 1:
    model_type = models.mmdet.retinanet
    backbone = model_type.backbones.resnet50_fpn_1x

# YOLOX Model
if selection == 4:
    model_type = models.mmdet.yolox
    backbone = model_type.backbones.yolox_s_8x8

...

If you want a peek at how FPN is used in RetinaNet in the MMDetection library, check out this code snippet:

Source: MMDetection RetinaNet Configuration File

model = dict(
    type='RetinaNet',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_input',
        num_outs=5),
    ...

  • In the above code snippet, the backbone is a ResNet-50 and the neck is an FPN

  • in_channels (List[int]): Number of input channels per scale. in_channels=[256, 512, 1024, 2048]

  • out_channels (int): Number of output channels (used at each scale). out_channels=256

  • start_level (int): Index of the first backbone level used to build the feature pyramid. With start_level=1, the first backbone output (256 channels) is skipped and the pyramid starts from the second one

  • num_outs (int): Number of output scales (P3 to P7). num_outs=5
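To see how these parameters lead to the P3 to P7 outputs, here is a small sketch that derives the output level names from the config values. It assumes the four ResNet outputs selected by out_indices=(0, 1, 2, 3) correspond to C2-C5, that level i has stride 2**i, and that any outputs beyond the available backbone levels come from extra convolutions. The helper `fpn_output_levels` is hypothetical and is not an MMDetection function.

```python
def fpn_output_levels(in_channels, start_level, num_outs, first_backbone_level=2):
    """List (name, stride, source) for each FPN output level.

    Hypothetical helper for illustration. Assumes in_channels[0] belongs to
    C2 (first_backbone_level=2) and that level i has stride 2**i.
    """
    num_from_backbone = len(in_channels) - start_level
    first = first_backbone_level + start_level
    levels = []
    for k in range(num_outs):
        idx = first + k
        # Levels beyond the backbone outputs are produced by extra convs.
        source = "backbone" if k < num_from_backbone else "extra conv"
        levels.append((f"P{idx}", 2 ** idx, source))
    return levels


levels = fpn_output_levels(
    in_channels=[256, 512, 1024, 2048],
    start_level=1,
    num_outs=5,
)
for name, stride, source in levels:
    print(f"{name}: stride {stride} ({source})")
```

With the values from the config, three outputs (P3-P5) are built from backbone features and the remaining two (P6, P7) come from extra convolutions, which matches num_outs=5.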

📚 References

📰 Paper for more details.

MMDetection Repo

MMDetection Documentation

IceVision Repo

IceVision Documentation