So now the GPT model is generating images and pixel values: a model originally built to generate text is now doing pixel generation. It works one pixel at a time, like a language model, but over pixels instead of tokens.
And there is no built-in notion of spatial structure, no convolution operation at all, which is really cool.
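The pixel-by-pixel generation idea can be sketched as a plain sampling loop, just like next-token sampling in a language model. The `model` here is a hypothetical stand-in (a uniform distribution over colors) for the real transformer:

```python
import numpy as np

def sample_image(model, seq_len, n_colors, rng):
    # Autoregressive sampling: each pixel is drawn conditioned on all
    # pixels generated so far, exactly like next-token sampling in a
    # language model.  `model` maps a prefix to a probability
    # distribution over the next pixel's color.
    pixels = []
    for _ in range(seq_len):
        probs = model(pixels)  # shape (n_colors,)
        pixels.append(rng.choice(n_colors, p=probs))
    return np.array(pixels)

# Stand-in "model": a uniform distribution over 16 colors.
rng = np.random.default_rng(0)
uniform = lambda prefix: np.full(16, 1 / 16)
img = sample_image(uniform, seq_len=32 * 32, n_colors=16, rng=rng)
print(img.shape)  # (1024,)
```

The real model would condition `probs` on the prefix; the loop structure is the same.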
And the images it generates are cute, and the quality is amazing. This line of investigation is really different.
How much can we push the generative models?
The actual objective is: what if we train a model to generate images, how well do its features transfer to classification tasks? And the answer was AMAZING!
This is so cool: basically, it tells us that the features the model learns in order to generate images are really good for classification tasks as well.
We can pre-train via a generation task, which is so cool. In general, pre-training for language models is done on text; here they are pre-training on image generation instead.
They downsample the images and feed them in one pixel at a time, so the problem space is smaller. This is needed because training on full-resolution images would take much more time.
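A minimal sketch of the downsample-and-flatten step, using simple striding on a synthetic image (the real pipeline also quantizes colors to a small palette, which I only approximate here by collapsing channels):

```python
import numpy as np

# Hypothetical example: shrink a 224x224 RGB image to 32x32 by simple
# striding, then flatten it to a 1D pixel sequence in raster order,
# the form the autoregressive model consumes.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

stride = 224 // 32
small = image[::stride, ::stride]           # (32, 32, 3)
# The paper quantizes RGB values into a small color palette; here we
# just average the channels to one value per pixel to keep it short.
gray = small.mean(axis=-1).astype(np.uint8)
sequence = gray.reshape(-1)                 # (1024,) pixel sequence
print(sequence.shape)  # (1024,)
```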
We want the model to predict the next element of the sequence. Attention is restricted to earlier pixels, so the model only uses information about what it has already seen.
Basically it runs in one direction and relies on its previous predictions. Both autoregressive and BERT-style objectives have been used in language modeling; now they are simply applied to image data.
And remember, no labels are needed for this task; we can do it in a fully unsupervised setting.
Now there are two ways to use this pre-trained model: either we fine-tune the whole network with an added head, or we train only the classification head.
We are judging how good the learned representations are for classification. Linear probing is one way to see how linearly separable the representation is.
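A linear probe can be sketched in a few lines: freeze the pretrained features and train only a linear classifier on top. The `features` below are synthetic stand-ins for activations taken from a layer of the pretrained model:

```python
import numpy as np

# Linear probe sketch: the representation is frozen; only one linear
# layer is trained with softmax cross-entropy via gradient descent.
rng = np.random.default_rng(0)
n, d, k = 200, 16, 2
w_true = rng.normal(size=(d, k))
features = rng.normal(size=(n, d))            # frozen representations
labels = (features @ w_true).argmax(axis=1)   # synthetic labels

W = np.zeros((d, k))
for _ in range(200):
    logits = features @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = features.T @ (p - np.eye(k)[labels]) / n
    W -= 0.5 * grad

acc = ((features @ W).argmax(axis=1) == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

If the representation is linearly separable, this single layer is enough to classify well; that is exactly what the probe measures.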
The pre-training was done on ImageNet as well as some images from the internet, and linear probing works really well.
96 percent on CIFAR is actually really good.
That single linear layer is the only thing that gets trained, and it seems the 20th layer learns the best representation. (This is a hella interesting result.)
The first layers take care of the low-level features, and the final layers take care of the fine details needed for generating pixels.
The middle layers carry more or less global information. And that information is GOOD for classification.
And there are three different model scales; the large one can be used with reasonable computing power. The larger model does reach a lower validation loss, and that is the general trend.
As the loss goes down, the linear probe accuracy goes up, which indicates that the learned representation is much more linearly separable. The larger models just seem to be better.
This is really, really cool, and they have made some very interesting findings. And fine-tuning on top of the final layer works well too.
And the generative loss is critical; we should keep it in the loss term during fine-tuning.
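Keeping the generative loss in the mix during fine-tuning can be sketched as a simple weighted sum. The weighting factor `lam` here is an illustrative assumption, not the paper's exact setting:

```python
def joint_loss(gen_nll, clf_nll, lam=1.0):
    # Fine-tuning objective sketch: the classification loss plus the
    # generative (next-pixel) loss, so the model keeps its generative
    # ability while adapting to the downstream task.
    # `lam` (the weight on the generative term) is an assumed knob.
    return clf_nll + lam * gen_nll

print(joint_loss(gen_nll=2.0, clf_nll=1.0, lam=0.5))  # 2.0
```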