How a language model that was ‘too dangerous to be released’ can be used to enable a host of deep-learning powered programming tools.
The field of Natural Language Processing (NLP) has exploded over the last few years and has proven to be one of the most useful areas of Artificial Intelligence (AI). NLP is a broad field that enables applications like sentiment analysis, language translation, question answering, voice assistants and much more! However, are there even more applications of NLP than the ones we have today?
I believe so. The last frontier of NLP is applying its techniques to a more niche yet powerful task — understanding programming languages at a semantic level. This would enable a whole host of tools like automatic code commenting, programming-language translation and plain-language explanations of code. However, achieving this requires cutting-edge neural language models, like OpenAI’s GPT-2, to get the best possible performance on this particular task.
At the most basic level, RNNs have loops to transfer information from one step of the network to the next. If we examine RNNs further, we can see that they form a chain-like structure where the hidden state contains information from all the steps leading up to the current state. This hidden state is also passed into the network along with the specified input to take into consideration previous data.
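The recurrence described above can be sketched in a few lines of plain Python. This is a minimal, illustrative toy (the weight values and dimensions are made up, and real models learn these weights and add bias terms), but it shows how the hidden state is passed back into the network at every step:

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh):
    # One RNN step: the new hidden state mixes the current input x
    # with the previous hidden state h_prev, then applies tanh.
    size = len(h_prev)
    return [
        math.tanh(
            sum(W_xh[i][j] * x[j] for j in range(len(x)))
            + sum(W_hh[i][j] * h_prev[j] for j in range(size))
        )
        for i in range(size)
    ]

# Toy weights: 2-dim inputs, 3-dim hidden state (illustrative values only)
W_xh = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1]]
W_hh = [[0.5, 0.0, 0.1], [0.1, 0.4, 0.0], [0.0, 0.1, 0.3]]

h = [0.0, 0.0, 0.0]  # initial hidden state
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for x in sequence:
    # The same h is fed back in each step, so it accumulates
    # information from every earlier input in the sequence.
    h = rnn_step(x, h, W_xh, W_hh)
```

Because `h` is threaded through every step, the final hidden state depends on the entire sequence — this is the "chain-like structure" described above.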
To learn more and read about a cool application of RNNs, check out this article I wrote.
Traditionally, RNNs and, more specifically, LSTM networks have been the go-to choice for language models. Unlike an RNN, which follows a more linear, sequential training pattern, attention mechanisms are able to find non-linear relationships between words in a sentence that may not be in sequential order. An RNN architecture relies heavily on the assumption that the words surrounding a word are the most important words in relation to it, which may not hold in longer sentences.
However, other types of language models known as transformers have been on the rise, driven by the popularity of OpenAI’s GPT-2 and Google’s BERT models. Transformers have now become the go-to choice for NLP applications due to their remarkable performance and more efficient training. In the ground-breaking paper “Attention Is All You Need,” Google established that attention is the only mechanism needed to achieve state-of-the-art results in NLP tasks, instead of the more traditional RNN approach. Check out the research paper.
Transformers are composed of encoder and decoder blocks (some models are exclusively encoder-based and some are fully decoder-based). Each model is made of a stack of encoders, decoders, or a mix of the two, and each block in the stack takes embeddings as input. The most interesting and unique aspect of transformer models is the self-attention layer found within every block. To understand the self-attention layer, we need to understand what attention means in this context: the attention mechanism pays “attention” to the parts of a sentence that are most relevant to the task at hand. By learning to place “attention” on the right parts of a sentence, transformer models are able to be much more memory-efficient and produce better results by discarding irrelevant information.
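The core of the self-attention layer can be sketched as scaled dot-product attention. This is a simplified toy in plain Python: in a real transformer the queries, keys, and values are separate learned projections of the embeddings, and everything runs as batched matrix math, but the weighting logic is the same:

```python
import math

def softmax(xs):
    # Numerically stable softmax: turns raw scores into weights
    # that are non-negative and sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(queries, keys, values):
    # Scaled dot-product self-attention: each output position is a
    # weighted average of all value vectors, where the weights come
    # from how well that position's query matches every key.
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # the "attention" placed on each token
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Three toy token embeddings; for simplicity we reuse them directly
# as queries, keys, and values (real models learn Q/K/V projections).
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(emb, emb, emb)
```

Each output row blends information from every token in the sequence at once — no sequential chain is needed, which is what lets transformers relate distant words directly.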
To understand attention a little better, let’s take a look at an example:
Consider the sentence “The animal didn’t cross the street because it was too tired,” and its variant ending in “too wide.” When resolving the “it” in the first sentence, the transformer model places its attention on “animal.” When dealing with the “it” in the second sentence, the model places most of its attention on “street” instead. This shows how the model distributes attention based on the other tokens in the sentence.
The popularity of transformers has grown in large part due to the rise of general language models. Neural language models like GPT-2 and BERT are trained on a huge corpus of text in a process known as pretraining. Anyone can then fine-tune such a model by training it on a smaller corpus from a specific domain. However, to understand why GPT-2 is so powerful, we need to understand what sets it apart from other transformer-based language models like BERT. In fact, GPT-2 was deemed so powerful that the full model was initially considered too dangerous to release to the public. After seeing some examples of GPT-2 at play, I wouldn’t be surprised if you started worrying about AI being potentially dangerous.
There are two fundamental differences between GPT-2 and BERT. The first is that GPT-2 relies solely on decoder blocks, whereas BERT relies only on encoder blocks. This difference in blocks results in a fundamentally different application of the self-attention mechanism: the encoder block uses plain self-attention, whereas the decoder uses masked self-attention. Masked self-attention differs from self-attention in that the model is not allowed to peek at tokens to the right of the current position (words that come after the word being predicted).
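The masking step can be sketched concretely. In this toy example (the score values are made up), scores for “future” tokens are set to negative infinity before the softmax, so those tokens receive exactly zero attention weight:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_mask_scores(scores):
    # Masked self-attention: position i may only attend to positions <= i.
    # Scores for tokens to the right are set to -inf so softmax
    # assigns them zero weight.
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

# Raw attention scores for a 3-token sequence (toy values)
scores = [[0.5, 0.2, 0.1],
          [0.3, 0.8, 0.4],
          [0.1, 0.6, 0.9]]

weights = [softmax(row) for row in causal_mask_scores(scores)]
```

The first token can only attend to itself, the second to the first two tokens, and so on — which is exactly what lets a decoder-only model be trained to predict the next word without cheating.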
Another difference between GPT-2 and BERT is that GPT-2 outputs only one token at a time, whereas BERT can output multiple tokens at once. In this respect, GPT-2 is more like an RNN, which also outputs one token at a time. While this may seem to make GPT-2 inferior, the difference in architecture makes GPT-2 better at certain tasks and aids unsupervised training.
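This one-token-at-a-time (autoregressive) decoding loop can be sketched as follows. The `toy_model` here is a hypothetical stand-in for brevity — a real GPT-2 would return a probability distribution over its vocabulary, from which we would pick (for example) the most likely token:

```python
def generate(model, prompt_tokens, n_new):
    # GPT-2-style autoregressive decoding: the model predicts one token,
    # which is appended to the context and fed back in for the next step.
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        next_token = model(tokens)  # greedy: take the model's top choice
        tokens.append(next_token)
    return tokens

# Hypothetical stand-in "model": just emits the last token plus one.
toy_model = lambda toks: toks[-1] + 1

result = generate(toy_model, [5], 3)  # → [5, 6, 7, 8]
```

Because each new token becomes part of the context for the next prediction, generation is inherently sequential — this is the RNN-like behaviour described above.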
In fact, the extra-large version of GPT-2, with 1.5 billion parameters, was able to write a very convincing news story from just a two-sentence prompt about unicorns (even human journalists would have a difficult time writing about unicorns):
Based on the impressive results of GPT-2 on natural language, it would be foolish not to think that understanding programming languages at a semantic level can be done better using neural language models that have been fine-tuned on programming languages. In fact, this approach has been tested in two research papers that I will link below. As of right now, repurposing general language models like GPT-2 for programming languages has produced the best results compared to traditional techniques like RNN or LSTM models, not to mention the significantly reduced training time and data required. Not only that, an AI-powered software tools company known as TabNine is using OpenAI’s GPT-2 model to provide better code auto-completion suggestions.
With all this being said, I would like to leave you with a potentially interesting question: Is applying AI to expedite the development of the code that powers AI the beginning of an era when AI can simply program new and better AI?