Attention is all you need. An explanation of the Transformer | by Pierrick RUGERY | Jul, 2020

September 18, 2020
in Neural Networks

Embedding aims at creating a vector representation of words. Words that have the same meaning will be close in terms of Euclidean distance. For example, the words bathroom and shower are associated with the same concept, so we can see that the two words are close in Euclidean space; they express similar senses or concepts.

For the encoder, the authors decided to use an embedding of size 512 (i.e. each word is modeled by a vector of size 512).
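To make this concrete, here is a minimal sketch of an embedding lookup; the vocabulary and the randomly initialised matrix are toy placeholders, not the trained weights of the model.

```python
import numpy as np

# Toy embedding lookup: each token maps to a row of a (vocab_size, d_model) matrix.
# In the real model this matrix is learned during training.
d_model = 512                                          # embedding size used by the authors
vocab = {"the": 0, "big": 1, "yellow": 2, "cat": 3}    # toy vocabulary (assumption)
embedding_matrix = np.random.randn(len(vocab), d_model) * 0.01

def embed(tokens):
    """Convert a list of tokens into a (seq_len, d_model) matrix of word vectors."""
    return embedding_matrix[[vocab[t] for t in tokens]]

x = embed(["the", "big", "yellow", "cat"])
print(x.shape)   # (4, 512)
```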


Part 2: Positional Encoding

The position of a word plays a determining role in understanding the sequence we try to model. Therefore, we add information about the word's position within the sequence to its vector. The authors of the paper used the following functions (see figure 2) to model the position of a word within a sequence.

figure 2: positional encoding functions
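For reference, the two functions from the original paper are PE(p_t, 2i) = sin(p_t / 10000^(2i / d_model)) and PE(p_t, 2i+1) = cos(p_t / 10000^(2i / d_model)), where p_t is the position of the word in the sequence and i indexes the dimensions of the vector.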

We will try to explain positional encoding in more detail. Let us take an example.

The big yellow cat
1 2 3 4

We note p_t ∈ [1, 4] the position of the word in the sequence.
d_model is the dimension of the embedding, in our case d_model = 512, and i is the index of the dimension within the vector. We can now rewrite the two positional equations.

figure 3: rewrite equations

We can see that the wavelength lambda_i increases as the dimension index increases; the wavelengths form a geometric progression along the sinusoids from 2pi to 10000·2pi.

figure 4: the wavelength for different dimensions

In this model, the information about the absolute position of a word in the sequence is added directly to the initial vector. To do this, the positional encoding must have the same size as the initial vector, d_model.
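Here is a small sketch of these sinusoidal encodings; the function name and shapes are illustrative, and positions are counted from 0 here.

```python
import numpy as np

# Sinusoidal positional encoding as described above: sine on even dimensions,
# cosine on odd dimensions, with the exponent 2i / d_model controlling the wavelength.
def positional_encoding(seq_len, d_model=512):
    """Return a (seq_len, d_model) matrix of position encodings."""
    positions = np.arange(seq_len)[:, None]        # positions 0 .. seq_len-1
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding has the same size as the word vectors, so it is simply added to them:
x = np.random.randn(4, 512)          # stand-in for the embedded sentence "The big yellow cat"
x = x + positional_encoding(4)       # inject absolute position information
```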


If you want to better understand how the sinusoidal functions provide this notion of relative position, I recommend this post.

Part 3: Attention mechanism

Scaled Dot-Product Attention

figure 5: Scaled Dot-Product Attention

Let’s start by explaining the attention mechanism. The main purpose of attention is to estimate the relative importance of the key terms compared to the query term related to the same person or concept. To that end, the attention mechanism takes a query Q that represents a word vector, the keys K, which are the vectors of all the other words in the sentence, and the values V, which also represent word vectors.

In our case (the two self-attention layers), Q, K and V all come from the same sentence. In other words, the attention mechanism gives us the importance of each word within a specific sentence.

Let’s show an example of what this function does.

Let’s take the following sequence for example: “The big yellow cats”

When we compute the normalized dot product between the query and the keys, we get a tensor that represents the relative importance of each other word for the query.

tensor([0.0864, 0.5847, 0.1607, 0.1683]) #example for query big

To go deeper into the mathematics, we can try to understand why the authors used the dot product to measure the relation between two words.

A word is represented by a vector in a Euclidean space, in this case a vector of size 512.

Example: “big” -> [0.33, 0.85,……………., -0.74]

When computing the dot product between Q and K.T, we compute the orthogonal projection of Q onto K. In other words, we estimate how well the vectors (i.e. the query and the keys) are aligned and return a weight for each word in the sentence.

Then, we scale the result by the square root of d_k, because for large values of d_k the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counter this effect, we scale the dot product by 1/sqrt(d_k). The softmax function then regularizes the terms and rescales them between 0 and 1 (i.e. it transforms the dot products into a probability distribution); the main goal is to normalize all the weights between 0 and 1.

Finally, we multiply the result (i.e. the weights) by the values (i.e. all the word vectors) to reduce the importance of non-relevant words and focus only on the most important ones.
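Putting the steps together (dot product, scaling by 1/sqrt(d_k), softmax, weighted sum of the values), here is a minimal sketch of scaled dot-product attention; the vectors are random placeholders, so the printed weights will differ from the example tensor above, but they have the same form.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot products, scaled by 1/sqrt(d_k)
    weights = softmax(scores)            # each row sums to 1 (a probability distribution)
    return weights @ V, weights          # weighted sum of the values

# Self-attention on a 4-word sentence ("The big yellow cats"), with d_model = 512:
X = np.random.randn(4, 512)
output, weights = scaled_dot_product_attention(X, X, X)
print(weights[1])   # relative importance of each word for the query "big"
```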

Multi Head Attention

figure 6: Multi Head Attention

The Transformer model uses the Multi-Head Attention mechanism, which is simply a projection of Q, K and V into h linear subspaces.

On each of these projected versions of queries, keys and values we then perform the attention function in parallel, producing d_v-dimensional output values. These are concatenated and projected again, which gives the final values, as depicted in figure 6.

During the training phase, the Multi-Head Attention mechanism has to learn the best projection matrices (WQ, WK, WV).

The outputs of the Multi-Head Attention mechanism, h attention matrices for each word, are then concatenated to produce one matrix per word. This attention architecture allows us to learn more complex dependencies between words without adding much training time, thanks to the linear projections which reduce the size of each word vector (in this paper there are 8 projections into spaces of size 64, and 8*64 = 512, the initial vector size).
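Here is a rough sketch of this mechanism, reusing the scaled_dot_product_attention function from the previous snippet; the weight matrices below are random placeholders standing in for the learned projections WQ, WK, WV and the final output projection.

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h                             # 64, the size of each projection space
WQ = np.random.randn(h, d_model, d_k) * 0.01   # one projection matrix per head
WK = np.random.randn(h, d_model, d_k) * 0.01
WV = np.random.randn(h, d_model, d_k) * 0.01
WO = np.random.randn(h * d_k, d_model) * 0.01  # final projection after concatenation

def multi_head_attention(X):
    heads = []
    for i in range(h):
        Q, K, V = X @ WQ[i], X @ WK[i], X @ WV[i]         # project into a 64-dim space
        out, _ = scaled_dot_product_attention(Q, K, V)    # attention in each head
        heads.append(out)
    concat = np.concatenate(heads, axis=-1)               # 8 * 64 = 512, the initial size
    return concat @ WO                                    # final linear projection

X = np.random.randn(4, d_model)
print(multi_head_attention(X).shape)                      # (4, 512)
```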

figure 7: transformer architecture

In this part, we are going to describe, step by step, how the encoder and the decoder work together to translate an English sentence into a French sentence.

Part 1: Encoder

  1. Use embedding to convert a sequence of tokens to a sequence of vectors.
figure 8: Embedding

The embedding part converts the word sequence into vectors; in our case each word is converted into a vector of size 512.

2. Add positional information to each word vector

figure 9: positional encoding

The great strength of recurrent neural networks is their ability to learn complex dependencies between elements of a sequence and to remember them. Transformers instead use positional encoding to introduce the relative position of a word within a sequence.

3. Apply Multi Head Attention

figure 10: attention mechanism

4. Use the Feed Forward network (a rough sketch of the full encoder layer is given below)
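Here is how the four steps above fit together, reusing the embed, positional_encoding and multi_head_attention functions from the earlier snippets; the residual connections and layer normalisation shown in figure 7 are left out to keep the sketch focused on the four listed steps.

```python
import numpy as np

d_model, d_ff = 512, 2048                       # d_ff is the inner feed-forward size from the paper
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

def feed_forward(X):
    """Position-wise feed-forward network: two linear layers with a ReLU in between."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

def encode(tokens):
    X = embed(tokens)                           # 1. embedding
    X = X + positional_encoding(len(tokens))    # 2. positional information
    X = multi_head_attention(X)                 # 3. multi-head self-attention
    return feed_forward(X)                      # 4. feed forward

print(encode(["the", "big", "yellow", "cat"]).shape)   # (4, 512)
```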

Part 2: Decoder

  1. Use embedding to convert the French sentence to vectors
figure 11: decoder embedding

2. Add positional information to each word vector

figure 13: positional encoding

3. Apply Multi Head Attention

figure 14: multi head attention

4. Feed Forward network

5. Use Multi Head Attention with encoder output

figure 15: multi head attention encoder/decoder

In this part, we can see that the Transformer uses an output from the encoder together with the input from the decoder; this allows it to determine how the vectors that encode the sentence in English are related to the vectors that encode the sentence in French.

6. Feed forward again

7. Linear + softmax

These two blocks compute a probability distribution over the vocabulary for the next word; the decoder returns the word with the highest probability as the next word.

In our case the next word after “LE” is “GROS”.
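As a sketch of this last step, the decoder output for the current position is projected onto the target (French) vocabulary and the most probable word is returned; the vocabulary and the projection matrix below are toy placeholders, not the trained model.

```python
import numpy as np

fr_vocab = ["LE", "GROS", "CHAT", "JAUNE"]              # toy French vocabulary (assumption)
W_out = np.random.randn(512, len(fr_vocab)) * 0.01      # linear layer to vocabulary size

def next_word(decoder_output_last):
    """Linear + softmax: return the most probable next word."""
    logits = decoder_output_last @ W_out                # linear projection
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                         # softmax over the vocabulary
    return fr_vocab[int(np.argmax(probs))]              # word with the highest probability

print(next_word(np.random.randn(512)))                  # with trained weights this would be "GROS"
```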

Results

The authors of the research paper compared the Transformer architecture with other state-of-the-art models in 2017.

As you can see, the Transformer model outperforms all other models on the BLEU test, which evaluates the algorithm on a translation task by comparing the difference between the translation produced by the algorithm and human translations.

figure 16: bleu score for transformer

Credit: BecomingHuman By: Pierrick RUGERY
