Self-supervised learning methods generally construct a supervised signal from an unlabeled dataset. In natural language processing, for example, an unlabeled corpus is used to train a model to predict the next word from its context, or a hidden word within a sentence. These tasks are called pretext tasks.

Figure 1: Loss based on pseudo label generated from images
The advantage of this approach is that a model can be pre-trained on a large corpus of images and then used for downstream tasks, with a clear improvement in performance compared to training from scratch.

What is the problem with self-supervised learning?
In theory, we could train a model on an unlimited number of images, right? Not really. To assign pseudo-labels P_i to these images, we need to group them according to common characteristics. There are essentially two ways to do this.

Clustering-based approaches, which create groups of images based on the similarity of their features. For example, we can use an existing model such as a ResNet50 to extract the features of each image and then cluster these features into K groups of images.
Noise contrastive estimation based approaches, which maximize the semantic similarity between two similar images and, conversely, minimize the similarity between two different images. If you are interested in learning more about the topic, you can read this article or the original research paper.
Both approaches suffer from problems related to the computation time and resources required, which in practice limits them to training models on small image corpora.

Most clustering methods require a forward pass over the dataset to compute the cluster assignments. This pass must be performed offline, which limits the use of the method.
Methods based on noise contrastive estimation compare images in pairs. They use strong data augmentation (random resized crops, color distortion, rotation, etc.). Comparing images pairwise on a large corpus quickly becomes very expensive.
Motivated by these problems, SwAV addresses the computation time issue by using online clustering. It also introduces a multi-crop policy, which provides multiple views of the same image without adding a large computational cost. This method substantially improves performance, as we will see in the next part.

SwAV addresses these computational problems by keeping the advantages of a contrastive loss without pairwise image comparisons, which makes it scalable. To do so, it uses a clustering method that enforces consistency between transformations of the same image: different transformations of one image should end up in the same cluster.

Contrastive Loss
Figure 2: Contrastive Loss
In the original definition of contrastive loss, one has to compute the similarity between each pair of images which quickly becomes intractable. To solve this problem, SwAV avoids comparing every pair of images by mapping the image features to a set of trainable prototype vectors.
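
To make the cost of pairwise comparison concrete, here is a minimal NumPy sketch of a contrastive loss. It assumes an InfoNCE-style formulation with a temperature of 0.1, which is one common variant and not necessarily the exact loss shown in Figure 2; the function name `info_nce_loss` and the toy data are hypothetical. Note the (B, B) similarity matrix: its size grows quadratically with the number of images compared.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative assumption).

    z1[i] and z2[i] are features of two views of image i (the positive pair);
    every other row of z2 acts as a negative for z1[i].
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                # (B, B): every pair is compared
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))    # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                     # toy features, not real data
loss_aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))
loss_random = info_nce_loss(z, rng.normal(size=z.shape))
print(loss_aligned, loss_random)
```

When the two views are nearly identical the loss is close to zero, while unrelated views give a much larger loss.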

Online Cluster Assignments and Contrasting Them
SwAV introduces an online clustering mechanism based on the assignment of each image to a code, which is shared between the different batches. The advantage of this method is that it is fully scalable: one can imagine applying it to an unlimited number of images. In reality, however, some constraints exist. Let us look in a little more detail at how the clustering done by SwAV works.

Figure 3: SwAV method
SwAV starts by taking an image X, then selects two transformations t1 and t2 from the set T of transformations applicable to an image, which gives two views x1 and x2. It then applies a nonlinear mapping f_theta to these views. In our case, f_theta can be, for example, the features generated by a ResNet50 projected onto the unit sphere, which gives z1 and z2 from x1 and x2 respectively.

Now, we try to assign a prototype to each feature zt. We then set up a "swapped" prediction problem with the following loss function:

Figure 4: Swap Loss function
Intuitively, their method compares the features zt and zs using the intermediate codes qt and qs . If these two features capture the same information, it should be possible to predict the code from the other feature. A similar comparison appears in contrastive learning where features are compared directly.

To do so, we try to assign to z_t the prototype c_k that maximizes the softmax of the dot product between z_t and c_k. The loss is the cross-entropy between the code of the other augmented image and the probability obtained by taking a softmax over the dot products between z_t and the prototypes c_k.

Figure 5: loss function
where the function l(z, q) measures the fit between features z and a code q , as detailed later.
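
The swapped prediction described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the codes are hard one-hot vectors chosen by hand (SwAV computes them with the Sinkhorn-Knopp algorithm), the temperature of 0.1 is illustrative, and the names `swapped_loss`, `log_softmax` and `unit_rows` are hypothetical.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def swapped_loss(z_t, z_s, q_t, q_s, prototypes, temperature=0.1):
    """L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t).

    l(z, q) is the cross-entropy between a code q and the softmax of the
    dot products between z and the prototypes, scaled by the temperature.
    """
    log_p_t = log_softmax(z_t @ prototypes.T / temperature)
    log_p_s = log_softmax(z_s @ prototypes.T / temperature)
    loss_t = -(q_s * log_p_t).sum(axis=1)   # predict the code of s from z_t
    loss_s = -(q_t * log_p_s).sum(axis=1)   # predict the code of t from z_s
    return float(np.mean(loss_t + loss_s))

def unit_rows(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
prototypes = unit_rows(rng.normal(size=(3, 8)))           # K=3 prototypes, D=8
z_t = unit_rows(rng.normal(size=(4, 8)))                  # B=4 features
z_s = unit_rows(z_t + 0.05 * rng.normal(size=(4, 8)))     # second view
q_t = np.eye(3)[[0, 1, 2, 0]]                             # hand-picked codes (assumption)
q_s = np.eye(3)[[0, 1, 2, 0]]
loss = swapped_loss(z_t, z_s, q_t, q_s, prototypes)
print(loss)
```

Only B × K dot products are needed here, instead of the B × B comparisons of a pairwise contrastive loss.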

If we generalize the loss function to the set of image pairs for all batches, we obtain the following loss function.

Figure 6: Global loss function
This loss function does not allow online training on a large number of images: it requires computing the loss over all pairs of images and batches, forcing the training to be done offline. To solve this problem, Mathilde Caron et al. had the idea of applying their function and assigning prototypes at the batch level, which allows their method to generalize.

The codes are computed using the prototypes C such that all the examples in a batch are equally partitioned by the prototypes. The equipartition constraint is very important here, as it ensures that the codes for different images in a batch are distinct, thus preventing the trivial solution where every image has the same code.

Given B feature vectors Z = [z₁, z₂, . . . , z_B], we are interested in mapping them to the prototypes C = [c₁, . . . , c_K]. This mapping, i.e. the codes, is represented by Q = [q₁, . . . , q_B], and Q is optimized to maximize the similarity between the features and the prototypes. Put more simply, Q starts out as the dot product between Z and the weights of a shallow layer that represents C.

Figure 7: Maximize similarity between prototype and features
where H is the entropy function and ε is a parameter that controls the smoothness of the mapping. This equation is an instance of the optimal transport problem, a well-known problem that can be solved with the Sinkhorn-Knopp algorithm.

Asano et al. [2] enforce an equal partition by constraining the matrix Q to belong to the transportation polytope. They work on the full dataset, whereas the SwAV authors adapt this solution to work on minibatches by restricting the transportation polytope to the minibatch:

Figure 8: Constraint Matrix Q to assure each image is assigned to a different prototype
where 1_K denotes the vector of ones in dimension K . These constraints enforce that on average each prototype is selected at least B/K times in the batch.

Figure 9: Optimal solution for Q transport matrix
The optimal solution for the Q matrix is defined by Figure 9, where u and v are renormalization vectors in R^K and R^B respectively. The renormalization vectors are computed using a small number of matrix multiplications using the iterative Sinkhorn-Knopp algorithm.

What are the prototypes?

Figure 10: a simplified view of SwAV
The prototypes C are represented by a shallow layer: the weights of this dense layer are the prototypes, and they are learned during the back-propagation step. The output of this shallow layer gives our Q matrix, which we transform into a matrix Q* (the solution of the problem defined in Figure 9) that represents the probability of being assigned to the prototype C_k (i.e. to cluster k) while maximizing the similarity between prototypes and features. To find this matrix Q*, the authors use the Sinkhorn-Knopp algorithm described in the next section.

Sinkhorn-Knopp algorithm
What does the Sinkhorn-Knopp algorithm actually do? It transforms a non-negative matrix Q into a doubly stochastic matrix Q* (a term explained just below). To do this, the algorithm looks for two diagonal matrices u and v such that Q* = uQv.

A doubly stochastic matrix is simply a matrix whose rows and columns each sum to one.

In our case, the Q matrix represents the dot product between the prototypes C and the features Z extracted from a batch of images. In other words, the Q matrix represents the probability of matching each image with each of the prototypes.

Here is the pseudo code of this algorithm.

Figure 11: Pseudo code of Sinkhorn-Knopp algorithm
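
As a complement to the pseudo code, here is a minimal NumPy sketch of the alternating row and column normalization. It is a sketch under assumptions, not the authors' exact implementation: the function name `sinkhorn`, the value of epsilon and the number of iterations are illustrative, and the normalization targets (each prototype receives B/K of the batch, each image's code sums to 1) follow the equipartition constraint described earlier.

```python
import numpy as np

def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Turn similarity scores (B, K) into codes with balanced prototype usage.

    Each returned row (one image) sums to 1, and each column (one prototype)
    sums to roughly B / K once the iterations have converged.
    """
    Q = np.exp((scores - scores.max()) / epsilon).T   # (K, B), non-negative
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)   # each prototype gets equal mass...
        Q /= K                              # ...namely 1/K of the total
        Q /= Q.sum(axis=0, keepdims=True)   # each image gets equal mass...
        Q /= B                              # ...namely 1/B of the total
    return (Q * B).T                        # rows now sum to 1

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 16))                # B=8 toy features
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
C = rng.normal(size=(4, 16))                # K=4 toy prototypes
C /= np.linalg.norm(C, axis=1, keepdims=True)
codes = sinkhorn(Z @ C.T, epsilon=0.5, n_iters=200)
print(codes.sum(axis=1))   # per-image totals
print(codes.sum(axis=0))   # per-prototype totals
```

A larger epsilon gives smoother (less peaked) codes; the toy call uses many iterations only so that the column totals are visibly close to B / K.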
Multi-crop: Augmenting views with smaller images
Figure 12: Multi-crop
It is easy to see the benefit of having multiple views of the same image, obtained with different augmentation methods, to facilitate the training of the model. However, adding these image pairs increases the time complexity and the memory quadratically.

Mathilde Caron et al. had the idea of introducing low-resolution images into the training, in addition to the two high-resolution images. The advantage is that this method enlarges the training base without increasing the computation time and the memory too much.

Figure 13: Multi-crop loss
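
A quick back-of-the-envelope calculation illustrates why low-resolution crops are cheap. It uses the crop configuration reported later in this article for SEER (2 × 224 + 4 × 96) and assumes, as a rough proxy only, that the cost of a convolutional backbone scales with the number of input pixels; the helper `pixels_per_image` is hypothetical.

```python
def pixels_per_image(crops):
    """Total number of input pixels for a list of (count, side) square crops."""
    return sum(count * side * side for count, side in crops)

two_full = pixels_per_image([(2, 224)])                # two standard views
multi_crop = pixels_per_image([(2, 224), (4, 96)])     # 2 x 224 + 4 x 96
six_full = pixels_per_image([(6, 224)])                # six full-resolution views

print(two_full, multi_crop, six_full)
print(round(multi_crop / two_full, 2))   # extra cost relative to two views
print(round(multi_crop / six_full, 2))   # cost relative to six full views
```

Under this proxy, six views with multi-crop cost well under half of six full-resolution views.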
Results
Figure 14: performance
This table summarizes the performance of a ResNet50 trained with supervision or with SwAV. We can see that SwAV is better in all cases.

For the linear classification the metric is the top-1 accuracy, except for VOC07 where the authors used the mAP (mean Average Precision). For the object detection part, they used the classical metrics AP_50(Average Precision with IoU (Intersection over Union) = 0.5) for VOC07+12 and AP (Average Precision) for COCO.

Figure 15: Design space design
Today, in many computer vision applications, improvements come from convolutional networks that are more or less dense, deep or wide, or from adding residual blocks. But these architectures are essentially designed empirically, by hand. To address this problem, Ilija Radosavovic et al. defined spaces containing parameterizations of convolutional networks (width, depth, …), and through their analysis arrived at a subspace named RegNet. This new space is a lower-dimensional space containing simple, regular networks. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. They analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5× faster on GPUs.

Designing Network Design Spaces
A design space is a large, possibly infinite, population of model architectures. The core insight is that we can sample models from a design space, giving rise to a model distribution, and turn to tools from classical statistics to analyze the design space. In this work, the authors propose to design progressively simplified versions of an initial, unconstrained design space. They refer to this process as design space design.

How to get a model distribution?

To obtain a model distribution, the authors randomly draw n = 500 parameterizations from a design space and train the 500 corresponding models. For efficiency, the models are trained for only a few epochs and in a low-compute regime (i.e. with reasonable input dimensions).

Now that we have our model distribution we need to define an error function to characterize the performance of our models.

Figure 16: error empirical distribution function (EDF)
F(e) gives the fraction of models with error less than e. This function will allow us to evaluate the quality of a model distribution from our space.
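
The error EDF can be computed in a few lines. The sketch below follows the definition above (fraction of models with error strictly less than e); the function name `error_edf` and the error values are hypothetical and only serve as an illustration.

```python
import numpy as np

def error_edf(errors, thresholds):
    """F(e): fraction of models whose error is strictly less than e."""
    errors = np.asarray(errors, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    return (errors[None, :] < thresholds[:, None]).mean(axis=1)

# Hypothetical top-1 errors (in %) of five sampled models.
errors = [42.0, 45.5, 47.0, 51.0, 60.0]
edf = error_edf(errors, [40, 46, 50, 70])
print(edf)
```

A design space whose EDF rises earlier (more models with low error) is considered better.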

To summarize:

(1) we generate distributions of models obtained by sampling and training n models from a design space
(2) we compute and plot error EDFs to summarize design space quality
(3) we visualize various properties of a design space and use an empirical bootstrap to gain insight
(4) we use these insights to refine the design space
The AnyNet Design Space
The authors use a neural network composed of a stem followed by two parts:

A head: composed of a fully connected layer and a softmax function to predict a class
A body: consists of 4 stages operating at progressively reduced resolution. Each stage consists of a sequence of identical blocks.
Figure 17: Network architecture
The stem and the head are fixed; the authors only tried to optimize the feature extraction part (i.e. the backbone).

The body consists of 4 stages. Each stage i contains a number di of blocks of width wi, as well as other parameters related to the convolutional network, such as the bottleneck ratio bi and the group width gi. We immediately realize that the number of possible combinations is very large: with 4 parameters for each of the 4 stages, we have 16 degrees of freedom in total.

To obtain valid models, they perform log-uniform sampling of di ≤ 16, wi ≤ 1024 and divisible by 8, bi ∈ {1,2,4}, and gi ∈ {1,2,…,32}. They repeat the sampling until they obtain n = 500 models in their target complexity regime (360MF to 400MF), and train each model for 10 epochs.

There are (16·128·3·6)^4 ≈ 10^18 possible model configurations in the AnyNetX design space. Instead of analyzing the complete space, they looked for factors that have no influence on the empirical error and can therefore simplify the initial space. For example, they created the AnyNetXb space, which shares a single bottleneck ratio bi = b across all stages. They observed that the empirical error did not vary between the initial AnyNetX space and the new AnyNetXb space. They repeated this operation to create other simpler and smaller subspaces.
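
The size of the space and the sampling of one configuration can be sketched as follows. Several points are assumptions: the six group widths are taken to be the powers of two from 1 to 32 (which matches the factor 6 in the count), the log-uniform sampler is one plausible reading of the description, and the filtering on the 360MF to 400MF complexity regime is omitted. All names are hypothetical.

```python
import math
import random

DEPTH_MAX = 16
WIDTH_MAX = 1024                      # widths are multiples of 8
BOTTLENECK_RATIOS = [1, 2, 4]
GROUP_WIDTHS = [1, 2, 4, 8, 16, 32]   # assumption: powers of two (6 values)

per_stage = DEPTH_MAX * (WIDTH_MAX // 8) * len(BOTTLENECK_RATIOS) * len(GROUP_WIDTHS)
total_configs = per_stage ** 4        # 4 independent stages

def log_uniform_int(rng, high):
    """Integer in [1, high], sampled uniformly on a log scale."""
    return round(math.exp(rng.uniform(0.0, math.log(high))))

def sample_stage(rng):
    return {
        "d": log_uniform_int(rng, DEPTH_MAX),
        "w": 8 * log_uniform_int(rng, WIDTH_MAX // 8),
        "b": rng.choice(BOTTLENECK_RATIOS),
        "g": rng.choice(GROUP_WIDTHS),
    }

rng = random.Random(0)
model = [sample_stage(rng) for _ in range(4)]   # one AnyNetX-like configuration
print(per_stage, f"{total_configs:.2e}")
print(model)
```

Each sampled model has 4 × 4 = 16 free parameters, which is where the 16 degrees of freedom come from.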

Figure 18: EDF AnyNetXa vs AnyNetXb
Finally, we are able to find an AnyNetXe space with a cumulative reduction of O(10⁷) compared to the initial AnyNetXa space.

The RegNet Design Space
Figure 19: 20 best models for AnyNetXe
Figure 19 shows the relationship between the position of a block and its width. Despite the large variance between the results obtained by the different models, we can define a linear function that represents the relationship between the position of a block (i.e. its index j, up to the depth d) and its width.

Figure 20: relation between width and block index j
Figure 20 introduces the linear function that captures the link between the width and the block index j: uj = w0 + w_alpha · j for 0 ≤ j < d. Here d is the depth, j the index of the block, w0 the initial width and w_alpha > 0 the slope (positive, to indicate that the width increases when the index j increases).

To quantize uj, they introduce a new parameter wm > 0. For each block, we first compute sj from the value of uj given in Figure 20, such that uj = w0 · wm^sj.

Then, to quantize uj, we simply round sj (denoted by ⌊sj⌉) and compute the quantized per-block widths wj via wj = w0 · wm^⌊sj⌉.

We can convert the per-block wj to our per-stage format by simply counting the number of blocks with constant width: each stage i has block width wi = w0 · wm^i, and its number of blocks di is the number of indices j for which ⌊sj⌉ = i.

This parameterization has been tested by fitting it to models from AnyNetX. The resulting design space is referred to as RegNet.
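
The quantized linear parameterization can be sketched in NumPy as follows, using the same steps: linear widths uj, exponents sj, rounding, then grouping blocks of equal width into stages. The parameter values in the example are hypothetical, `w_a` stands for the slope written w_alpha above, and the final rounding of widths to multiples of 8 is an assumption borrowed from the sampling constraint on wi.

```python
import numpy as np

def regnet_widths(w_0, w_a, w_m, depth, q=8):
    """Per-stage widths and depths from a quantized linear function."""
    u = w_0 + w_a * np.arange(depth)              # u_j = w_0 + w_a * j
    s = np.round(np.log(u / w_0) / np.log(w_m))   # rounded exponent s_j
    w = w_0 * np.power(w_m, s)                    # w_j = w_0 * w_m ** round(s_j)
    w = (np.round(w / q) * q).astype(int)         # assumption: multiples of q
    # Blocks sharing the same width form one stage.
    stage_widths, stage_depths = np.unique(w, return_counts=True)
    return stage_widths.tolist(), stage_depths.tolist()

# Hypothetical parameters, chosen only for illustration.
widths, depths = regnet_widths(w_0=24, w_a=36.0, w_m=2.5, depth=13)
print(widths, depths)
```

With only four numbers (w_0, w_a, w_m and the depth), this describes the widths and depths of every stage, which is what makes the RegNet space so much smaller than AnyNetX.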

RegNet is not a single model but a model space constrained by a quantized linear function. It is intended to contain a powerful set of models with a well-chosen number of parameters, ensuring inference at an acceptable speed and better scalability.

Figure 21: Comparison between RegNet and AnyNet
As shown in Figure 21, the models selected within the RegNet space are always better than the models from AnyNet in terms of cumulative empirical error.

The RegNetY space
Figure 22: Squeeze and Excitation block
The RegNetY space keeps the characteristics of RegNetX and modifies the stages: within each stage, a block called squeeze and excitation is added to the sequence of blocks. The purpose of this block is to learn a weight indicating the importance of each channel in a block.

The SE (Squeeze and Excitation) block is composed of the following layers:

A global average pooling layer that squeezes each channel to a single number
A fully connected layer followed by a ReLU function, which adds the necessary nonlinearity and reduces the dimension of the input by a ratio that is a hyper-parameter
A second fully connected layer followed by a Sigmoid activation, which returns a weight between 0 and 1 for each channel (the weights are independent and are not constrained to sum to 1)
Finally, each feature map of the convolutional block is weighted by the output of the SE block.
This architecture provided a significant improvement over the RegNetX space (see Figure 23).

Figure 23: empirical cumulative error between RegNetX and RegNetY
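
The squeeze and excitation steps listed above can be sketched in NumPy for a single feature map. This is a minimal illustration, not the RegNetY implementation: the weights are random, the reduction ratio of 4 is an arbitrary choice, and the function name `squeeze_excitation` is hypothetical.

```python
import numpy as np

def squeeze_excitation(x, w1, b1, w2, b2):
    """Apply an SE block to a feature map x of shape (C, H, W).

    w1: (C // r, C) and w2: (C, C // r) are the two fully connected layers.
    """
    squeezed = x.mean(axis=(1, 2))                        # global average pooling
    hidden = np.maximum(0.0, w1 @ squeezed + b1)          # FC + ReLU (reduction)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # FC + sigmoid
    return x * weights[:, None, None], weights            # rescale each channel

rng = np.random.default_rng(0)
channels, ratio = 8, 4                                    # illustrative sizes
x = rng.normal(size=(channels, 5, 5))
w1 = rng.normal(size=(channels // ratio, channels))
b1 = np.zeros(channels // ratio)
w2 = rng.normal(size=(channels, channels // ratio))
b2 = np.zeros(channels)
out, weights = squeeze_excitation(x, w1, b1, w2, b2)
print(weights)
```

Each channel receives its own weight in (0, 1), so several channels can be kept or suppressed at the same time.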
Model
SEER is a model trained with the SwAV method, which makes it possible to train on a large unlabeled corpus. It uses an architecture from the RegNetY space, following the parametrization in Figure 24. In this case, the corpus was composed of 1B random, public and non-EU Instagram images.

Figure 24: RegNetY parametrization
It has 4 stages with stage depths (2, 7, 17, 1) and stage widths (528, 1056, 2904, 7392), leading to a total of 695.5M parameters.

For the training, they used the following configuration:

RegNetY-256GF with SwAV using 6 crops per image of resolutions 2 × 224 + 4 × 96
Multi-crop data augmentation
During pre-training, they use a 3-layer multi-layer perceptron (MLP) projection head of dimensions 10444 × 8192, 8192 × 8192 and 8192 × 256
16K prototypes
Temperature τ set to 0.1
Perform 10 iterations of Sinkhorn algorithm
Use a weight decay of 10^−5
LARS optimizer and O1 mixed-precision optimization from Apex library
Train their model with stochastic gradient descent with batch size of 8192
Performance
Finetuning Large Pretrained Models

Methodology:

Pretrain 6 RegNet architectures of different capacities, namely RegNetY-{8,16,32,64,128,256}GF, on 1B random, public and non-EU Instagram images with SwAV
Finetune these models on the task of image classification on ImageNet, using the standard 1.28M training images with labels
Evaluate on 50k images in the standard validation set
The reported results show that RegNetY-128GF and RegNetY-256GF outperform the SOTA methods, with a Top-1 accuracy of 83.8 and 84.2 respectively on ImageNet.

Low-shot learning

Low-shot learning consists in pre-training a model on a large collection of images (in our case) and fine-tuning it on a small fraction of the data for the task we want to solve.

Methodology:

Use the same pre-trained models
Consider two datasets for low-shot learning, namely ImageNet and Places205
Assume limited access to the dataset during transfer learning, both in terms of labels and images
They compare their approach with semi-supervised approaches and self-supervised pre-training on low-shot learning. Their model is fine-tuned on either 1% or 10% of ImageNet, and does not access the rest of the ImageNet images. By contrast, the other methods use all the images from ImageNet during pre-training or fine-tuning.

Nonetheless, this approach achieves a top-1 accuracy of 77.9% with only 10% of ImageNet, which is competitive with these methods (2% gap). On 1% of the data, i.e. 10K images, the gap increases significantly, but note that the other methods use the full ImageNet for pre-training.

Detection and segmentation

Method:

Train a Mask-RCNN model on the COCO benchmark with pretrained RegNetY-64GF and RegNetY-128GF as backbones
For both downstream tasks and architectures, the self-supervised pre-training outperforms supervised pre-training by 1.5 to 2 AP points. However, the gap in performance between the different architectures is small (0.1 to 0.5 AP) compared to what was observed on ImageNet.

In this article, I have introduced many concepts. Let us summarize them.

SwAV, a self-supervised learning approach that opens the door to training generic computer vision models on large amounts of unlabeled data
RegNet, an effective design space built according to those principles, and a family of highly scalable SOTA models for many computer vision tasks
SEER, a RegNetY-based model pre-trained with SwAV as a generic computer vision model, which can be fine-tuned on downstream tasks and beats SOTA models
This new method opens the door to training generic computer vision models, as is done in natural language processing with GPT-3: models trained on billions of images, capable of pushing computer vision even further on many tasks and significantly improving performance.