## Explanation and code walkthrough of simple but crucial metrics to evaluate machine learning models.

This article covers several simple but helpful classification metrics for evaluating models in Python, using a coronavirus testing example throughout to discuss the real-world implications of evaluating tests and models. The code walkthrough is dispersed throughout the article and covers ROC curves, confusion matrices, and precision, recall, accuracy, and F1 scores.

Before delving into a technical, wordy definition of binary classification, let’s consider a real-world, especially relevant application of the system. Right now, the global coronavirus pandemic has halted all forms of social and economic interaction: not only are we isolated, but our economy suffers with us. The general consensus among world leaders, scientists, and civilians is that without reliable COVID-19 testing, society cannot return to normal.

Let’s imagine that scientists suddenly develop a plausible test for the virus. Before issuing the test, scientists must ensure that it can accurately and reliably determine whether or not individuals carry the virus. Any test result falls into one of four outcomes:

- **True positive** → The patient is diagnosed as having the virus, and truly does have it (the test works).
- **True negative** → The patient is diagnosed as not having the virus, and truly does not have it (the test works).
- **False positive** → The patient is diagnosed as having the virus, but actually does not have it (the test does not work).
- **False negative** → The patient is diagnosed as not having the virus, but actually does have it (the test does not work).

The test therefore has two diagnoses, or a binary output: the patient carries coronavirus, or the patient does not. The goal is to classify patients into the correct category 100% of the time. We can confidently assume that scientists wouldn’t approve a test without thoroughly evaluating its effectiveness; this philosophy is just as crucial in machine learning. We will keep revisiting the coronavirus testing example when explaining classification metrics, because it operates very similarly to a binary classification system.
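The four outcomes above can be tallied directly from a set of true and predicted labels. A minimal sketch, using made-up labels (both arrays here are hypothetical, with 1 = positive and 0 = negative):

```python
# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Count each of the four outcomes by comparing each prediction to the truth
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

print(tp, tn, fp, fn)
```

scikit-learn’s `confusion_matrix` function, covered later in this article, computes these same four counts in a single call.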


Classification is the process of sorting data into categories. In the same way, binary classification is a useful machine learning technique for splitting data into two known categories, or “classifications”. The methodology is simple: a model is trained with a set of predictor variables (x-values) and a response variable (y-value) with two outcomes: 0 or 1. In terms of our coronavirus example, an individual’s symptoms, medical history, and age would serve as predictor variables. The test result would be the y-variable, with “negative” and “positive” coronavirus results stored as “0” and “1” in binary fashion.

The purpose of evaluating a classification model is to ensure it classifies data the way it’s supposed to. For example, a coronavirus test needs to be checked and double-checked to ensure that it correctly identifies patients carrying the virus. But how do we know that the model is working, if the model is predicting future outcomes?

We “check our work” by splitting our dataset. A classification model is trained using only 80–90% of the dataset. This “teaches” the model associations between each predicting factor and its corresponding output in the data, so in the future, data containing those predictors can be sorted into an appropriate category. The 10–20% of the data saved serves as our test data — this is how the model can be “checked” for accuracy and reliability. We apply the model to the test data and allow it to make classifications — because we already know what the “correct answers” are, we can check if it accurately classified the data or not. This is done through various methods of evaluation, several of which I will explain below.
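The split described above is a single call to scikit-learn’s `train_test_split`. This sketch substitutes a synthetic dataset from `make_classification` for real patient records:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for patient records:
# 100 rows, 5 predictor columns, labels of 0 or 1
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Hold out 20% of the rows as unseen test data for "checking our work"
trainX, testX, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

print(len(trainX), len(testX))
```

With `test_size=0.2`, 80 of the 100 rows train the model and the remaining 20 are reserved for evaluation.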

## The ROC Curve

The “Receiver Operating Characteristic” or ROC curve is used to illustrate the diagnostic ability of a binary classification system; in short, it tells us how good a model is at classifying data.

Created using the scikit-learn package, the curve is plotted like a classic line graph, where x = False Positive Rate and y = True Positive Rate. These are the same true/false positives as defined before in our coronavirus example: the False Positive Rate (FPR) is the rate at which our model incorrectly flags a negative case as positive, and the True Positive Rate (TPR) is the rate at which our model correctly identifies a positive case. The goal is to minimize FPR and maximize TPR, with our ROC curve bending toward the upper left corner.

All ROC graphs also include a dotted diagonal line running from (0, 0) to (1, 1). This is known as the “no skill” line, reflecting a 50/50 guess; a curve close to the no-skill line indicates a weaker classifier. Essentially, if a model predicting coronavirus tended close to the “no-skill” line, it would be classifying patients into either category nearly at random (like flipping a coin). ROC graphs are perfect for an immediate, quick read on how good your model is: if the curve hugs the no-skill line, something definitely needs to be fixed.
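To see the no-skill line in context, it can be drawn alongside a model’s curve. A minimal sketch, where the ROC points are illustrative made-up values rather than output from a real model:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Illustrative ROC points for a hypothetical model (not from real data)
fpr = [0.0, 0.1, 0.3, 1.0]
tpr = [0.0, 0.5, 0.8, 1.0]

plt.plot(fpr, tpr, label="model")
# The "no skill" diagonal from (0, 0) to (1, 1): a coin-flip classifier
plt.plot([0, 1], [0, 1], linestyle="--", label="no skill")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.savefig("roc.png")
```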

ROC curves follow this general code:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# create training and test sets (X = predictors, y = response column)
trainX, testX, train_y, test_y = train_test_split(X, y, test_size=0.2)

# fit our logistic regression
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, train_y)

# save predicted classes and predicted probabilities of the positive class
pred_y = model.predict(testX)
pred_prob = model.predict_proba(testX)[:, 1]

# plot false and true positive rates, using the roc_curve function
fpr, tpr, thresholds = roc_curve(test_y, pred_prob)
plt.plot(fpr, tpr)
plt.show()
```
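The curve can also be condensed into a single-number summary, the area under the ROC curve (AUC), via `roc_auc_score`. A self-contained sketch, again substituting synthetic data for the article’s dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the article's dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
trainX, testX, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(solver='lbfgs').fit(trainX, train_y)
pred_prob = model.predict_proba(testX)[:, 1]

# Area under the ROC curve: 1.0 is a perfect classifier,
# 0.5 matches the no-skill diagonal
auc = roc_auc_score(test_y, pred_prob)
print(round(auc, 2))
```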

## The Confusion Matrix

A confusion matrix goes beyond the ROC curve by illustrating exactly how and where your model is classifying data. Confusion matrices are useful where ROC curves fall short: an ROC curve is typically summarized by a single number, the area under the curve, which describes the overall diagnostic ability of your model, and it’s easy to understand that a higher number means a better true positive predictor. A confusion matrix, however, provides insight into the other three possible outcomes of your model’s classification: True Negative, False Positive, and False Negative.

Take this confusion matrix, for example, built from the same data and classifications as the previous ROC curve. The previous ROC curve returned a value of about 0.69, or 69% diagnostic ability [not pictured in visualization]. By creating a confusion matrix (using scikit-learn), we see that 57% of test cases were correctly identified as true negatives, and only 11% as true positives. It also reveals that among the misclassifications, false negatives account for 20% of all cases: all information we hadn’t gleaned from the ROC curve alone.

These results carry wider implications the ROC curve was unable to reveal. Let’s use our coronavirus example again: say the coronavirus test had a 69% diagnostic ability, which scientists decide is reasonable. A confusion matrix would support the 69% figure, but would also reveal to what extent the false outcomes, or errors, needed to be fixed. If the coronavirus test was producing more false negatives than false positives, i.e. missing positive cases, then the test would be extremely inadequate at diagnosing patients, and dangerous to put in use. We wouldn’t know this without evaluating the test using a confusion matrix and visibly seeing the sheer number of false negative outcomes.

Confusion matrices use this general formula:

```python
from sklearn import metrics

# create the confusion matrix using the previous logistic regression's
# true labels (test_y) and predictions (pred_y)
cm = metrics.confusion_matrix(test_y, pred_y)
```
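To read the matrix as percentages of all test cases, the way the figures above are quoted, newer versions of scikit-learn (0.22+) accept a `normalize` argument. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the article's dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
trainX, testX, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(solver='lbfgs').fit(trainX, train_y)
pred_y = model.predict(testX)

# normalize='all' expresses every cell as a share of all test cases,
# so the four cells sum to 1.0
cm = confusion_matrix(test_y, pred_y, normalize='all')
tn_share, fp_share = cm[0]  # row 0: actual negatives
fn_share, tp_share = cm[1]  # row 1: actual positives
print(cm)
```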

## Precision, Recall, Accuracy, and F1 Scores

Some alternate measures of model reliability employing the same true/false positive/negative principles are precision, recall, accuracy, and F1 scores. Each describes a different quality of the model, and which score to maximize depends on the type of model.

**Precision:** the proportion of predicted positive outcomes that are actually positive (or TP / (TP + FP)). Models with high precision protect against false positives.

**Recall:** the proportion of actual positive outcomes that the model correctly identifies (or TP / (TP + FN)). High-recall tests protect against false negatives.

**Accuracy:** the overall proportion of cases the model identifies correctly (or (TP + TN) / Total). Because accuracy is such a general metric, it is used most often when the class distribution is even and the costs of false positives and false negatives are similar.

**F1:** the harmonic mean of precision and recall, a useful single score when the class distribution is uneven (the harmonic mean penalizes a large gap between precision and recall, which often arises with uneven numbers of “0s” and “1s”).
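The four formulas above can be checked by hand. Treating the percentages quoted earlier (57% TN, 11% TP, 20% FN, and therefore 12% FP) as counts out of 100 hypothetical test cases:

```python
# Hypothetical counts of the four outcomes out of 100 test cases
tp, tn, fp, fn = 11, 57, 12, 20

precision = tp / (tp + fp)                          # TP / (TP + FP)
recall = tp / (tp + fn)                             # TP / (TP + FN)
accuracy = (tp + tn) / (tp + tn + fp + fn)          # (TP + TN) / Total
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 2), round(recall, 2), round(accuracy, 2), round(f1, 2))
```

Note how accuracy (0.68) looks respectable while recall is much lower, which is exactly why a model that misses positive cases can hide behind a single headline number.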

Accuracy, precision, recall, and F1 all employ the same sklearn module to output metrics:

```python
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(test_y, pred_y))
print("Precision:", metrics.precision_score(test_y, pred_y))
print("Recall:", metrics.recall_score(test_y, pred_y))
print("F1:", metrics.f1_score(test_y, pred_y))
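scikit-learn’s `classification_report` bundles precision, recall, and F1 for every class into one printable table. A minimal sketch with made-up labels:

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# One call summarising per-class precision, recall, and F1
report = classification_report(y_true, y_pred)
print(report)
```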

The use of just three tools, the ROC curve, the confusion matrix, and the precision/recall/accuracy/F1 scores, reveals insights about our model that would otherwise go unnoticed when using general scoring functions like model.score(). As in coronavirus testing, a model or test should not be presumed adequate without a proper evaluation of its performance across various metrics. Each of the methods described employs simple code, yet each provides a valuable, different perspective from which the model can be improved or entirely remade.