This is the fourth part of the metric series, where we will discuss the evaluation of ML/DL models using metrics. NLP models are a little tricky to evaluate because the output of these models is text: a sentence or a paragraph. So we have to check the syntax and semantics as well as the context of the model's output, and we use different types of techniques to evaluate these models.
1. N-Gram
In NLP, a gram refers to a word and N is an integer, so an N-gram is a sequence of N words; if a phrase is made up of 10 words, it is called a 10-gram. N-grams are used to predict the probability of the next word in a sentence, depending on how the model is trained (on bigrams, trigrams and so on), using the probability of a word occurring given the words before it.
Fig.1 N-Gram
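A minimal sketch of how bigram counts can be used to estimate the probability of the next word (the corpus and sentences below are made up for illustration):

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (tuples of n consecutive words)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus, made up for illustration, used to estimate bigram probabilities.
corpus = "the cat sat on the mat the cat ate the fish".split()

bigram_counts = Counter(ngrams(corpus, 2))
unigram_counts = Counter(corpus)


def next_word_prob(prev_word, word):
    """P(word | prev_word) estimated from raw bigram counts."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]


print(ngrams("the cat sat on the mat".split(), 2))
print(next_word_prob("the", "cat"))  # 2 / 4 = 0.5 in this toy corpus
```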
Jupyter Notebook Link
2. BLEU Score
BLEU stands for BiLingual Evaluation Understudy. It was invented to evaluate translation from one language to another.
Steps to calculate BLEU are as follows:
Step 1: For each word in the predicted sentence, assign the value 1 if the word appears in the reference sentence, else assign 0.
Step 2: Normalise the count so that it has a range of [0–1], i.e. total count / number of words in the predicted sentence.
Fig.2 BLEU Score
BLEU with N-grams
As mentioned above, we assign 1 whenever a predicted word appears in the reference, so a prediction that simply repeats a matching word can still get a BLEU score of 1. To handle this we use combinations of words, i.e. N-grams, to capture the order of the words in the sentence. We also clip the count of each word to the highest number of times it appears in any reference sentence, which helps us avoid unnecessary repetition of words. Finally, to discourage predictions that are much shorter than the reference sentences, we introduce a brevity penalty, which only leaves the score untouched when the prediction is at least as long as the reference.
Fig.3 BLEU Score with N-grams and Brevity Penalty
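Below is a toy sketch of these ideas: clipped n-gram counts, the geometric mean of the precisions, and a brevity penalty. It is a simplification of full BLEU (which is usually taken from a library such as NLTK for real evaluation), and the example sentences are made up:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Clipped n-gram precision: each candidate n-gram is counted at most
    as many times as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
    clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)


def bleu(references, candidate, max_n=4):
    """Toy BLEU: geometric mean of clipped precisions times the brevity penalty."""
    cand_len = len(candidate)
    if cand_len == 0:
        return 0.0
    precisions = [modified_precision(references, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    ref_len = min(len(r) for r in references)  # simplification: full BLEU uses the closest reference length
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * geo_mean


refs = [["the", "cat", "is", "on", "the", "mat"]]
cand = ["the", "cat", "sat", "on", "the", "mat"]
print(round(bleu(refs, cand, max_n=2), 3))  # ~0.707 for this toy pair
```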
Problems with the BLEU score are as follows:
It does not consider meaning.
It does not consider sentence structure.
It does not handle morphologically rich languages well.
Jupyter Notebook Link
3. Cosine Similarity
It is a metric used to measure the similarity between two documents. An earlier, commonly used approach for matching similar documents was based on counting the maximum number of words common to both documents. But this approach has a flaw: as the size of the documents increases, the number of common words tends to increase even if the two documents are unrelated. As a result, cosine similarity came into existence, removing the flaw of "count the common words" or of the Euclidean distance.
Mathematically, cosine similarity measures the angle between two vectors projected in a multi-dimensional space. In the NLP context, the two vectors are the arrays of word counts associated with the two documents. Cosine considers the direction instead of the magnitude, whereas the Euclidean distance considers the magnitude. This is advantageous because even if two similar documents are far apart by Euclidean distance because of their size (say the word 'cricket' appears 50 times in one document and 10 times in another), they can still have a small angle between them. The smaller the angle, the higher the similarity.
Fig. 4 Cosine Similarity
Projection of Cosine Similarity is shown below:
Fig. 5 Cosine Similarity Projection
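A minimal sketch of cosine similarity over raw word-count vectors (the example documents are made up):

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the word-count vectors of two documents."""
    vec_a, vec_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b)


doc1 = "cricket is played with a bat and a ball"
doc2 = "the batsman scored a century in the cricket match"
doc3 = "the recipe needs two cups of flour"
print(round(cosine_similarity(doc1, doc2), 3))  # higher: both documents are about cricket
print(round(cosine_similarity(doc1, doc3), 3))  # lower: unrelated topics
```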
Suppose you have another set of documents on a completely different topic, say 'food'. You want a similarity metric that gives higher scores for documents belonging to the same topic and lower scores when comparing documents from different topics. So we need to consider semantic meaning, i.e. words similar in meaning should be treated as similar: for example, 'President' vs 'Prime Minister', 'Food' vs 'Dish', 'Hi' vs 'Hello'. Converting the words into their respective word vectors and then computing the similarity addresses this problem; this is known as soft cosine similarity.
Fig. 6 Soft Cosine Similarity
Jupyter Notebook Link
4. Jaccard Index
The Jaccard Index, also known as the Jaccard similarity coefficient, is used to understand the similarity or diversity between two finite sample sets and has range [0, 1]. If data is missing in a sample set, it is replaced by zero or the mean, or the missing data is imputed with the k-nearest neighbours algorithm or the Expectation-Maximisation (EM) algorithm.
Fig. 7 Jaccard Index
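A small sketch of the Jaccard index over the word sets of two sentences (the example sentences are made up):

```python
def jaccard_index(tokens_a, tokens_b):
    """|A ∩ B| / |A ∪ B| over the sets of words in two samples; range [0, 1]."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a | set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


sent1 = "the cat sat on the mat".split()
sent2 = "the dog sat on the log".split()
print(round(jaccard_index(sent1, sent2), 3))  # 3 shared words / 7 distinct words ≈ 0.429
```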
Jupyter Notebook Link
5. Word Error Rate (WER)
WER is derived from the Levenshtein distance, but works at the word level instead of the phoneme level. Word Error Rate is one of the most common metrics used to compare the accuracy of transcripts produced by speech recognition APIs, as well as machine translation systems. It is a very important metric when comparing one system with another and when evaluating subsystems, but it does not describe the nature of the translation errors.
This problem is addressed by first aligning the recognised word sequence with the reference (spoken) word sequence using dynamic string alignment. The issue is also examined through a theory called the power law, which states the correlation between perplexity and word error rate.
Fig. 8 Word Error Rate
Fig. 9 Weighted Word Error Rate
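A minimal sketch of WER computed from a word-level Levenshtein distance (the example strings are made up):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("the quick brown fox jumps", "the quick brown fax jumped"))  # 2 / 5 = 0.4
```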
Jupyter Notebook Link
6. ROUGE
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a collection of metrics for evaluating transcripts produced by a machine, i.e. machine-generated summaries as well as text generated by NLP models, based on the overlap of N-grams; a toy computation is sketched after the list of variants below.
The metrics available in ROUGE are as follows:
ROUGE-N: Overlap of N-grams between the system and reference summaries.
ROUGE-1 refers to the overlap of unigrams (each word) between the system and reference summaries.
ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.
ROUGE-L: Longest Common Subsequence (LCS) based statistics. The longest common subsequence problem takes sentence-level structure similarity into account naturally and identifies the longest co-occurring in-sequence n-grams automatically.
ROUGE-W: Weighted LCS-based statistics that favor consecutive LCSes.
ROUGE-S: Skip-bigram based co-occurrence statistics. Skip-bigram is any pair of words in their sentence order.
ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
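A toy sketch of how ROUGE-N recall and the LCS behind ROUGE-L can be computed (for real evaluation a packaged scorer such as the rouge-score library is normally used; the example sentences are made up):

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N as recall: overlapping n-grams / n-grams in the reference."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    overlap = sum(min(cnt, cand_counts[gram]) for gram, cnt in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)


def lcs_length(a, b):
    """Length of the longest common subsequence, used by ROUGE-L."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


reference = "the cat was found under the bed"
candidate = "the cat was under the bed"
print(round(rouge_n_recall(reference, candidate, 1), 3))  # ROUGE-1 recall
print(round(rouge_n_recall(reference, candidate, 2), 3))  # ROUGE-2 recall
print(round(lcs_length(reference.split(), candidate.split()) / len(reference.split()), 3))  # ROUGE-L recall
```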
7. NIST
NIST stands for the National Institute of Standards and Technology, situated in the US. This metric is used to evaluate the quality of text produced by ML/DL models. NIST is based on the BLEU score: BLEU calculates n-gram precision by giving equal weight to each n-gram, whereas NIST also calculates how much information is present in an n-gram, i.e. when the model produces a correct n-gram and that n-gram is rare, it is given more weight. In simple words, more weight or credit is given to an n-gram which is correct and rare to produce, compared to an n-gram which is correct and easy to produce.
For example, if the trigram "task is completed" is correctly matched, it will receive a lower weight than a correct match of the trigram "Goal is achieved", as the latter is less likely to occur.
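NLTK provides a NIST scorer; below is a minimal usage sketch, assuming the nltk package is installed (the example sentences are made up):

```python
# Requires: pip install nltk
from nltk.translate.nist_score import sentence_nist

reference = ["the", "goal", "is", "achieved", "by", "the", "team"]
hypothesis = ["the", "goal", "is", "achieved", "by", "a", "team"]

# sentence_nist takes a list of reference token lists and a hypothesis token list;
# informative (rarer) n-grams contribute more weight than common ones.
print(sentence_nist([reference], hypothesis, n=3))
```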
8. SQUAD
SQUAD refers to Stanford Question Answering Dataset. It is collection of the dataset which includes the Wikipedia article and question related to it. The NLP model are trained on this dataset and try to answer the questions. SQUAD consist of 100,000+ question answer pairs and 500+ articles from Wikipedia. Though it is not defined as a metric but it is used to judge the model usability and predictive power to analyze the text and answer question which is very crucial in NLP applications like chatbot, voice assistance chatbots etc.
The key features of SQuAD are as follows:
It is a closed dataset, i.e. the answer to a question is always present in the given passage as a contiguous span of it, e.g. "Name of the spacecraft was Apollo 11".
Most of the answers, almost 75%, are less than or equal to 4 words long.
Finding an answer can be simplified to finding the start index and the end index of the span of the context that corresponds to the answer (see the sketch below).
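A tiny illustration of this start/end-index formulation: a SQuAD-style gold answer is stored as the answer text plus its character start offset in the context (the context, question and offsets below are made up):

```python
# Toy illustration of the SQuAD answer format: the model only has to predict a
# start and an end position inside the context.
context = "Apollo 11 was the spaceflight that first landed humans on the Moon."
question = "What was the name of the spacecraft?"

# A SQuAD-style gold answer: the answer text plus its character start offset.
answer = {"text": "Apollo 11", "answer_start": 0}

start = answer["answer_start"]
end = start + len(answer["text"])
assert context[start:end] == answer["text"]
print(context[start:end])
```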
9. MS MARCO
MS MARCO stands for the Microsoft Machine Reading Comprehension dataset. Similar to SQuAD, MS MARCO consists of 1,010,916 anonymised questions collected from Bing's query logs, with answers generated purely by humans. It also contains 182,669 completely human-written answers, with passages extracted from 3,563,535 web documents.
NLP models are trained on this dataset and try to perform the following tasks:
Answer the question based on the passage. As mentioned above, the questions are anonymised user queries, so custom or generalised word2vec, doc2vec or GloVe embeddings are required to train the model (Question Answering).
Rank the retrieved passages for the given question (Passage Ranking).
Predict whether the question can be answered from the given set of passages; if yes, extract and synthesise the predicted answer like a human.