STN Explained

Spatial Transformer Networks (STNs) use an explicit procedure to learn invariance to translation, scaling, rotation, and other more general warps, making the network attend to the most relevant regions. The STN was the first attention mechanism to explicitly predict important regions and to provide a deep neural network with transformation invariance.

Taking a 2D image as an example, a 2D affine transformation can be formulated as follows, where $A$ denotes a $2 \times 3$ learnable affine matrix:

\begin{align}
A = f_\text{loc}(U)
\end{align}
\begin{align}
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
\end{align}

Here, $U$ is the input feature map, and $f_\text{loc}$ can be any differentiable function, such as a lightweight fully-connected or convolutional network. $(x_i^s, y_i^s)$ are the source coordinates in the input feature map, while $(x_i^t, y_i^t)$ are the corresponding target coordinates in the output feature map, so the predicted $A$ tells the network which input location to sample for each output location.
To keep the whole process differentiable, so that it can be updated in an end-to-end manner, bilinear sampling is used to read the input features at these (generally non-integer) source coordinates.
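
As a concrete instance, a pure translation by $(t_x, t_y)$ corresponds to

\begin{align}
A = \begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \end{pmatrix},
\qquad
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{pmatrix} x_i^t + t_x \\ y_i^t + t_y \end{pmatrix},
\end{align}

so every output pixel is read from the input shifted by $(t_x, t_y)$. Setting $t_x = t_y = 0$ recovers the identity transform, which is how $f_\text{loc}$ is typically initialized so that training starts from "no warp".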

STNs automatically focus on discriminative regions and learn invariance to some geometric transformations.
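
Putting the pieces together, below is a minimal PyTorch sketch of the module; the localization-network layers and input shapes are illustrative assumptions, not a reference implementation. `F.affine_grid` generates the source coordinates $x_i^s = A x_i^t$ for every output location, and `F.grid_sample` performs the differentiable bilinear sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal spatial transformer sketch; layer sizes are illustrative assumptions."""

    def __init__(self, in_channels: int = 1):
        super().__init__()
        # Localization network f_loc: regresses the 6 entries of the 2x3 affine matrix A from U.
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 10, kernel_size=5), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 6),
        )
        # Start from the identity transform so the module initially passes U through unchanged.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, U: torch.Tensor) -> torch.Tensor:
        A = self.loc(U).view(-1, 2, 3)                           # A = f_loc(U)
        grid = F.affine_grid(A, U.size(), align_corners=False)  # source coords x_i^s for each target x_i^t
        # Differentiable bilinear sampling of U at the (non-integer) source coordinates.
        return F.grid_sample(U, grid, mode="bilinear", align_corners=False)

U = torch.randn(4, 1, 28, 28)     # batch of input feature maps / images
V = STN(in_channels=1)(U)         # warped output, same shape as U
```

Note that `affine_grid` and `grid_sample` work in normalized coordinates in $[-1, 1]$, so the translation entries of $A$ are expressed as fractions of the feature-map extent rather than pixel offsets.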