Convolution is a core building block in computer vision: early algorithms employed convolutional filters to blur images, extract edges, or detect features. Convolution has been heavily exploited in modern neural networks thanks to its efficiency and generalization ability compared with fully connected models.
However, convolution makes modeling long-range relations challenging. To address this, [1] propose to adopt axial-attention, which not only allows efficient computation but also recovers the large receptive field of stand-alone attention models. The core idea is to factorize 2D attention into two 1D attentions, applied sequentially along the height-axis and the width-axis. This efficiency makes it affordable to attend over large regions and to build models that learn long-range or even global interactions.
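As a back-of-the-envelope comparison (our own rough accounting, in the spirit of the complexity argument in [1]), for an $h \times w$ feature map with attention span $m$:

$$\underbrace{O(h^2 w^2)}_{\text{global 2D attention}} \;\rightarrow\; \underbrace{O(h w\, m^2)}_{\text{local } m \times m \text{ attention}} \;\rightarrow\; \underbrace{O(h w\, m)}_{\text{axial attention, span } m}$$

Because the axial cost grows only linearly in the span, the span can even be set to the full axis length while keeping computation tractable, which is what makes global interactions affordable.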
They augment the positional terms to be context-dependent, making the attention position-sensitive at marginal cost. They show the effectiveness of these axial-attention models on ImageNet: they build an Axial-ResNet by replacing the 3 × 3 convolution in all residual blocks with the position-sensitive axial-attention layer, and further make the network fully attentional by adopting axial-attention layers in the stem. As a result, Axial-ResNet attains state-of-the-art results among stand-alone attention models on ImageNet.
Given an input feature map $x \in \mathbb{R}^{h \times w \times d_{in}}$ with height $h$, width $w$, and channels $d_{in}$, the output at position $o = (i, j)$, $y_o \in \mathbb{R}^{d_{out}}$, is computed by pooling over the projected input as:

$$y_o = \sum_{p \in \mathcal{N}_{m \times m}(o)} \operatorname{softmax}_p\left(q_o^\top k_p + q_o^\top r_{p-o}\right) v_p$$

where $\mathcal{N}_{m \times m}(o)$ is the local $m \times m$ square region centered around location $o = (i, j)$; the queries $q_o = W_Q x_o$, keys $k_p = W_K x_p$, and values $v_p = W_V x_p$ are all linear projections of the input; $W_Q, W_K \in \mathbb{R}^{d_q \times d_{in}}$ and $W_V \in \mathbb{R}^{d_{out} \times d_{in}}$ are all learnable matrices; and the learnable vector $r_{p-o} \in \mathbb{R}^{d_q}$ is the added relative positional encoding that measures the compatibility from location $p = (a, b)$ to location $o = (i, j)$.
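To make the equation concrete, here is a minimal NumPy sketch of this computation for a single output position. The function name, the boundary handling, and the indexing scheme for `r` are our own illustrative choices, not the authors' code:

```python
import numpy as np

def local_position_sensitive_attention(x, W_q, W_k, W_v, r, i, j, m):
    """Compute y_o at o = (i, j) over an m x m window (m assumed odd).

    x:        (h, w, d_in) input feature map
    W_q, W_k: (d_q, d_in) projections; W_v: (d_out, d_in) projection
    r:        (m, m, d_q) learnable relative positional encodings,
              indexed by the offset (a - i, b - j) shifted into [0, m)
    """
    h, w, _ = x.shape
    half = m // 2
    q_o = W_q @ x[i, j]                          # query at the output position
    logits, values = [], []
    for a in range(max(0, i - half), min(h, i + half + 1)):
        for b in range(max(0, j - half), min(w, j + half + 1)):
            k_p = W_k @ x[a, b]                  # key at position p = (a, b)
            v_p = W_v @ x[a, b]                  # value at position p
            r_po = r[a - i + half, b - j + half]  # relative encoding r_{p-o}
            # compatibility: content term q^T k plus positional term q^T r
            logits.append(q_o @ k_p + q_o @ r_po)
            values.append(v_p)
    logits = np.array(logits)
    weights = np.exp(logits - logits.max())      # softmax over the window
    weights /= weights.sum()
    return weights @ np.stack(values)            # y_o in R^{d_out}
```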
They call this design position-sensitive self-attention: queries, keys, and values each receive their own learnable relative positional encoding, which captures long-range interactions with precise positional information at a reasonable computational overhead.
[1] propose to adopt axial-attention in stand-alone self-attention, ensuring both global connectivity and efficient computation. They first define an axial-attention layer along the width-axis of an image as simply a one-dimensional position-sensitive self-attention, and they use an analogous definition for the height-axis.
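Concretely, restating the position-sensitive formulation from [1], the width-axis layer at position $o = (i, j)$ attends over a $1 \times m$ span within row $i$, with separate learnable relative encodings $r^q$, $r^k$, $r^v$ for queries, keys, and values:

$$y_o = \sum_{p \in \mathcal{N}_{1 \times m}(o)} \operatorname{softmax}_p\left(q_o^\top k_p + q_o^\top r^q_{p-o} + k_p^\top r^k_{p-o}\right)\left(v_p + r^v_{p-o}\right)$$

The only change from the 2D formulation above is that the attention region shrinks from $m \times m$ to a $1 \times m$ span along one axis.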
One axial-attention layer propagates information along one particular axis. To capture global information, they employ two axial-attention layers consecutively, for the height-axis and the width-axis respectively. Both axial-attention layers adopt the multi-head attention mechanism: N single-head attentions are applied in parallel to the input, and the final output is obtained by concatenating the results from each head.
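A hedged PyTorch sketch of these two consecutive 1D multi-head attentions is shown below. For brevity it uses torch's generic `nn.MultiheadAttention` by reshaping each column (then each row) into an independent 1D sequence; the paper's layer additionally carries the relative positional encodings described above:

```python
import torch
import torch.nn as nn

class AxialAttention2D(nn.Module):
    """Height-axis then width-axis multi-head attention over a (B, C, H, W) map."""

    def __init__(self, channels, num_heads=8):
        super().__init__()
        # channels must be divisible by num_heads
        self.height_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.width_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        # Height-axis pass: every column becomes a length-h sequence.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.height_attn(cols, cols, cols)
        x = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)
        # Width-axis pass: every row becomes a length-w sequence.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.width_attn(rows, rows, rows)
        return rows.reshape(b, h, w, c).permute(0, 3, 1, 2)

# Usage (hypothetical shapes):
# x = torch.randn(2, 64, 32, 32)
# y = AxialAttention2D(channels=64, num_heads=8)(x)   # -> (2, 64, 32, 32)
```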
To transform a ResNet into an Axial-ResNet, they replace the 3 × 3 convolution in the residual bottleneck block with two multi-head axial-attention layers (one for the height-axis and the other for the width-axis). Optional striding is performed on each axis after the corresponding axial-attention layer, and the two 1 × 1 convolutions are kept to shuffle the features. This forms the axial-attention block, which is stacked multiple times to obtain an Axial-ResNet.
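For illustration, a minimal sketch of the resulting block, reusing the `AxialAttention2D` sketch above (the paper's block also includes batch normalization and the optional per-axis striding, omitted here):

```python
import torch.nn as nn

class AxialBlock(nn.Module):
    """Residual bottleneck with the 3x3 conv swapped for axial attention."""

    def __init__(self, in_ch, mid_ch, num_heads=8):
        super().__init__()
        self.conv_down = nn.Conv2d(in_ch, mid_ch, kernel_size=1)  # 1x1 conv: shuffle features
        self.axial = AxialAttention2D(mid_ch, num_heads)          # height- then width-axis attention
        self.conv_up = nn.Conv2d(mid_ch, in_ch, kernel_size=1)    # 1x1 conv: shuffle features
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv_down(x))
        out = self.axial(out)
        out = self.conv_up(out)
        return self.relu(out + x)   # residual connection
```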
They demonstrate the effectiveness of their model on four large-scale datasets. In particular, the model outperforms all existing stand-alone self-attention models on ImageNet.
[1] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation.
Credit: BecomingHuman, by Nabil MADALI