Natural Language Processing (NLP) is a subfield of Artificial Intelligence, which deals with processing, understanding, and modeling Human Language.
The main challenge in modeling Human Language is that language arrives as raw text (sentences and words). Identifying the context in which a word appears is of paramount importance for Text Analytics and for building text-based intelligent systems.
Hence, we need to convert the text into a machine-interpretable form, say a set of numeric vectors that represent the input text. Such a mathematical representation should capture:
- the actual meaning of the word
- the semantic meaning of the word, i.e., the context in which it occurs
Word Embeddings are mathematical representations of words that model both the actual and the semantic meaning of a word. The concept of embeddings arises from a branch of Natural Language Processing called “Distributional Semantics”. It is based on a simple intuition:
“Words that occur in similar contexts tend to have similar meanings.”
In other words, a word’s meaning is given by the words that it appears frequently with.
A Word Embedding is an encoded, vector representation of a word, such that words whose vectors lie closer to each other in the vector space are similar in their semantic meaning.
Word Embeddings help to model Distributional Semantics in the Text Analytics domain. Using such word representations, AI models can be built to solve real-world text use cases such as Information Retrieval, Document Classification, Question Answering, Named Entity Recognition, and Text Parsing.
There are several methods to generate the Word Embeddings such as:
- Dimensionality Reduction
- Neural Network
- Co-occurrence Matrix.
Let us quickly look into each of the methods and see where GloVe fits in.
I. Dimensionality Reduction Methods:
First, let us try to understand why we need Dimensionality Reduction methods.
A One-Hot Encoded Vector is the simplest form of Word Embedding. Let us see an example —
Here, Vocabulary Size is 6.
Vocabulary = [‘The’, ‘cat’, ‘sat’, ‘on’, ‘the’, ‘mat’]
Each word is One-Hot encoded and represented as a vector.
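As a quick illustration, here is a minimal NumPy sketch of such a One-Hot encoding (the variable names are my own, and the vocabulary is kept case-sensitive to match the example above):

```python
import numpy as np

# Toy vocabulary from the sentence "The cat sat on the mat"
# ('The' and 'the' are treated as two separate tokens here, so the size is 6)
vocabulary = ['The', 'cat', 'sat', 'on', 'the', 'mat']

# One-Hot encode: each word becomes a vector with a single 1 at its index
one_hot = {word: np.eye(len(vocabulary))[i] for i, word in enumerate(vocabulary)}

print(one_hot['cat'])   # [0. 1. 0. 0. 0. 0.]
print(one_hot['mat'])   # [0. 0. 0. 0. 0. 1.]
```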
However, real-world sentences are not as simple as the one shown here. Sentences vary in length, with some words appearing frequently and others very rarely. Hence, if we use One-Hot vectors to represent such a vocabulary, the resulting matrix would be sparse and very high-dimensional. This is also known as the ‘Curse of Dimensionality’.
“Word Representations suffer from the inherent Curse of Dimensionality due to its multidimensional representation in word vector space.”
This makes such representations nearly impossible to use in most Language Models, which is why Dimensionality Reduction was heavily used in the early days to make word representations tractable.
The idea is simple: build a word vector representation, say in the form of multiple One-Hot vectors, and then apply a Dimensionality Reduction algorithm such as Matrix Factorization via Singular Value Decomposition (SVD) to obtain dense, lower-dimensional word vectors.
Latent Semantic Analysis (LSA) applies SVD to a “Document-Term Matrix”. LSA is built on the Distributional Hypothesis: words that are close in meaning will occur in similar pieces of text.
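For a concrete picture of the LSA pipeline, here is a minimal sketch using scikit-learn (the toy corpus and the choice of two latent dimensions are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# A tiny toy corpus; any list of documents would do
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Build the Document-Term Matrix (documents x vocabulary counts)
vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(corpus)

# Truncated SVD on the Document-Term Matrix is the core of LSA:
# it projects sparse, high-dimensional counts onto a few dense latent dimensions
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(doc_term)  # dense document representations
word_vectors = lsa.components_.T           # dense representation per vocabulary term

print(doc_vectors.shape)    # (3, 2)
print(word_vectors.shape)   # (vocabulary size, 2)
```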
The major drawback of such Dimensionality Reduction based methods is twofold:
- they are computationally expensive
- they don’t distinguish the different contexts in which the same word occurs.
For example, the word ‘club’ in ‘club sandwich’ means something different from ‘club’ in ‘football club’ or ‘clubhouse’. Such differences in context are not captured by either SVD or LSA.
The next development in the Word Embeddings came in the form of Feed-Forward Neural Network-based methods.
II. Neural Network-based Methods:
These methods make use of a Feed-Forward Neural Network to learn the word representations. Basically, a single hidden layer Neural Network is trained, and the hidden layer weights are then used as a proxy representation for the input word. Similarity between two words is modeled as a simple dot product between their word vectors (the dot product of unit-normalized vectors is the Cosine Similarity, hence it serves as the similarity measure here).
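To make the similarity measure concrete, here is a minimal NumPy sketch (the two word vectors below are made-up toy values, not learned weights):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity is the dot product of the two vectors
    # after normalizing each to unit length
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two hypothetical learned word vectors (values chosen purely for illustration)
v_cat = np.array([0.8, 0.1, 0.3])
v_dog = np.array([0.7, 0.2, 0.4])

print(cosine_similarity(v_cat, v_dog))  # close to 1.0 -> similar words
```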
A few of the Neural Network based methods are:
- Continuous Bag of Words (CBoW) and
- Skip-Gram models.
A typical Word2Vec Neural Model is given in the above diagram — here, each word is processed via an Input Layer, followed by a single Hidden Layer. Once the training is complete, the weights of the hidden layer will be used as a proxy representation for the input word.
Such a Neural Language Model is capable of capturing the semantic and syntactic relationships among word vectors. In other words, Neural Network based models capture word analogies better than Frequency based methods such as Matrix Factorization. For example, the vector offsets Woman → Man and Queen → King encode the same semantic relationship, and hence these word pairs appear in similar configurations in the word vector space.
In fact, the Euclidean distance between Woman and Queen would be approximately the same as the distance between Man and King. Hence, we can represent the semantic relationship algebraically as follows:
Queen - Woman ≈ King - Man
Queen - Woman + Man ≈ King
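Assuming a set of pretrained word vectors is available (the model name below refers to one of the sets distributed through gensim's downloader and is just an illustrative choice), this analogy can be checked in a few lines:

```python
import gensim.downloader as api

# Downloads a set of pretrained GloVe vectors via gensim-data;
# any pretrained word-vector set containing these words would work as well
vectors = api.load("glove-wiki-gigaword-50")

# "Queen - Woman + Man ≈ King": positive terms are added, negative ones subtracted
result = vectors.most_similar(positive=["queen", "man"], negative=["woman"], topn=1)
print(result)  # typically [('king', ...)] or a closely related word
```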
Both the Skip-Gram and Continuous Bag of Words models use a shallow, local context window to learn the word representations.
Skip-Gram model → Given the center word, predict the surrounding context words.
Continuous Bag of Words (CBoW) → Given the surrounding context words, predict the center word.
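As a sketch of how these two variants are typically trained in practice, gensim's Word2Vec exposes both through the sg flag (the toy corpus and hyperparameter values below are purely illustrative, and the parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real corpus would be far larger
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# sg=1 trains a Skip-Gram model, sg=0 trains CBoW;
# window controls the size of the local context window
skipgram = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"].shape)  # (10,) -> learned word vector
```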
The major drawbacks of such Neural Network based Language Models are:
- High Training & Testing time
- Inability to capture statistical information at a global level (they do not take into account the global counts or co-occurrences of words).
III. Co-occurrence aka Count based Methods:
A few of the co-occurrence or count based methods of generating Word Vectors are:
- Term Frequency-Inverse Document Frequency (TF-IDF) Vectorization
- Term-Document Frequency Vectorization
Such count-based methods capture the statistical information better than the Neural Language Models.
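Both of these vectorizers are available off the shelf; a minimal scikit-learn sketch (with a toy two-document corpus) could look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Term-Document count vectorization: raw term counts per document
counts = CountVectorizer().fit_transform(corpus)

# TF-IDF vectorization: counts re-weighted so that words frequent in one
# document but rare across the corpus receive higher weight
tfidf = TfidfVectorizer().fit_transform(corpus)

print(counts.toarray())
print(tfidf.toarray().round(2))
```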
The Best of Both Worlds:
We have seen that Dimensionality Reduction based methods (such as SVD and LSA) capture statistical information, but they perform poorly on predictive tasks such as solving an analogy.
In contrast, Neural Network based Language Models rely on shallow, local context windows. Thanks to the local context they capture, such models are usually good at predictive tasks such as solving an analogy; however, global statistical information is not sufficiently captured.
So, what we need is a model that captures the best of both worlds.
Enter GloVe: a co-occurrence based model that also captures the global context by using ratios of conditional co-occurrence probabilities.
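To ground that idea, here is a minimal sketch (toy corpus, illustrative names) of building a global co-occurrence matrix and the conditional co-occurrence probabilities derived from it; ratios of exactly these probabilities are what GloVe builds its objective on:

```python
import numpy as np

# Toy corpus and window size; purely illustrative
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "log"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Global co-occurrence counts X[i, j]: how often word j appears
# within `window` words of word i, across the whole corpus
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[index[word], index[sent[j]]] += 1

# Conditional co-occurrence probability P(j | i) = X[i, j] / sum_k X[i, k]
P = X / X.sum(axis=1, keepdims=True)
print(P[index["sat"], index["cat"]])
```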