Today, we’re going to dive into the final steps of our machine learning life cycle. This is where we face the reality check: How good is our current model? Does it already add value for our client, and is it ready to be deployed to production?
In our previous articles we covered data collection and data preparation, model evaluation, and model training. Now we address validating the model performance, getting feedback from our client, and deploying the model to production.
There is one central question: What benefits does our current model already offer to our client? We have to validate the model performance at this stage and find out where we stand now.
A brief update on the actual task to be solved: Our client is a dance federation and our task is to build an AI model capable of classifying images as aesthetic or unaesthetic. The dance federation wants to automatically identify aesthetic images to use them for marketing.
The first step of model validation is ensuring that the validation and test accuracy are close to each other. A large gap between the two is a sign that the model does not generalize well to unseen data.
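As a minimal sketch of this check, the snippet below compares validation and test accuracy on made-up labels. The labels, the helper names and the tolerance `MAX_GAP` are assumptions for illustration only; a sensible tolerance depends on the dataset size and the project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def generalization_gap(val_acc, test_acc):
    """Absolute difference between validation and test accuracy."""
    return abs(val_acc - test_acc)


# Made-up labels for illustration: 1 = aesthetic, 0 = unaesthetic.
val_true = [1, 0, 1, 1, 0, 0, 1, 0]
val_pred = [1, 0, 1, 0, 0, 0, 1, 0]
test_true = [1, 1, 0, 0, 1, 0, 1, 0]
test_pred = [1, 0, 0, 0, 1, 1, 1, 0]

val_acc = accuracy(val_true, val_pred)
test_acc = accuracy(test_true, test_pred)
gap = generalization_gap(val_acc, test_acc)

MAX_GAP = 0.05  # hypothetical tolerance, not a recommended value

print(f"validation={val_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
if gap > MAX_GAP:
    print("Warning: validation and test accuracy differ noticeably.")
```

With only a handful of test images, the gap is noisy, so the comparison is more meaningful on reasonably large validation and test sets.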
Next, we have to do some further analysis to get a feeling for the weak points in the behavior of the model. This is important because the training data is only an approximation of the problem to be solved. In production, the model has to cope with very different data, and we want to find out how the model would perform on that.
We can use a confusion matrix to gain more detailed insights into the performance of our model. A typical confusion matrix looks like this:
Let’s say we have a multiclass classification problem with a, b, c and d as possible classes. The confusion matrix tells us basically two things:
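- We see the accuracy per class, that is, the fraction of inputs of each class that were predicted correctly. These values lie on the diagonal from upper-left to bottom-right in the matrix. In this example, our model perfectly predicts classes c and d, but has some problems with a and b.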
- If the model makes false predictions, we can see whether there is a bias towards certain classes. Class b inputs were wrongly predicted as class a, reducing the accuracy of class b to 0.67.
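The two readings above can be sketched in a few lines of plain Python. The labels below are made up for illustration and are not the data behind the matrix in this article; they are chosen so that class b is sometimes mistaken for class a. The function names are assumptions as well.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes (raw counts)."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def normalize_rows(matrix):
    """Divide each row by its sum, giving fractions per true class."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


classes = ["a", "b", "c", "d"]
# Made-up labels for illustration only.
y_true = ["a", "a", "a", "b", "b", "b", "c", "c", "d", "d"]
y_pred = ["a", "a", "b", "b", "b", "a", "c", "c", "d", "d"]

cm = confusion_matrix(y_true, y_pred, classes)
cm_norm = normalize_rows(cm)

# The diagonal of the row-normalized matrix is the accuracy per class.
per_class = {c: cm_norm[i][i] for i, c in enumerate(classes)}

for name, row in zip(classes, cm_norm):
    print(name, [f"{value:.2f}" for value in row])
print(per_class)
```

The off-diagonal entries in a row show where the wrong predictions for that class went, which is where a bias towards a particular class becomes visible.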
A confusion matrix is great for analyzing the bias towards certain classes in multiclass prediction problems. It’s like a signpost pointing towards how the dataset can be improved to fix the misbehavior in the next training runs. In our case with the dance images, we have to solve a binary classification problem (images are either aesthetic or not).
We can use the confusion matrix to calculate an F1 score to judge the overall correctness of the predictions. The F1 score is the harmonic mean of precision and recall. If we have multiple trained models with nearly equal accuracies, the F1 score can help to pick the best performing model.
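To make this concrete, here is a small sketch that derives precision, recall and F1 from the four cells of a binary confusion matrix. The counts for the two models are hypothetical and chosen so that both reach the same accuracy while their F1 scores differ.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts on 100 images each; "positive" means aesthetic.
models = {
    "model_a": {"tp": 40, "fp": 10, "fn": 10, "tn": 40},
    "model_b": {"tp": 30, "fp": 0, "fn": 20, "tn": 50},
}

scores = {}
for name, c in models.items():
    precision, recall, f1 = precision_recall_f1(c["tp"], c["fp"], c["fn"])
    scores[name] = {
        "accuracy": accuracy(c["tp"], c["fp"], c["fn"], c["tn"]),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, {k: round(v, 3) for k, v in scores[name].items()})
```

In this made-up comparison both models have the same accuracy, but the second one misses many aesthetic images (low recall), which lowers its F1 score.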
Before we present the model to our client, we also check manually to get a feeling for the model performance. We inspect the predictions for some selected examples. The goal is to get an intuition for what the model has really learnt and how it generalizes when confronted with totally unknown and unexpected inputs.
This is important in order to avoid weird results. For example, after some testing we could find that our aesthetics model is biased towards people wearing hats. This could be a result of too many aesthetic training images with people wearing hats. We should fix this problem in the dataset before the next model training.
Or another example: The model has learned that a certain color mood is an indication of aesthetics because the training data contains many images with an Instagram filter.
Having the predictions manually inspected by an experienced machine learning practitioner is often the only way to notice and fix such problems in the dataset.
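A simple tally can support the manual inspection. The sketch below assumes that the reviewer has tagged each inspected image by hand (the tags `hat` and `no_hat` and all values are hypothetical) and compares how often the model predicts "aesthetic" per tag.

```python
from collections import defaultdict


def positive_rate_by_tag(records):
    """records: iterable of (tag, predicted_label), with 1 = aesthetic."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for tag, predicted in records:
        totals[tag] += 1
        positives[tag] += predicted
    return {tag: positives[tag] / totals[tag] for tag in totals}


# Hypothetical hand-tagged inspection results.
inspected = [
    ("hat", 1), ("hat", 1), ("hat", 1), ("hat", 0),
    ("no_hat", 1), ("no_hat", 0), ("no_hat", 0), ("no_hat", 0),
]

rates = positive_rate_by_tag(inspected)
for tag, rate in rates.items():
    print(f"{tag}: {rate:.2f} predicted aesthetic")
```

A difference between the rates is only a hint, not proof of a bias: images with hats might really be aesthetic more often. The numbers tell the reviewer where to look more closely.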
In this step we usually compare the current iteration and its progress with the previously deployed model (unless this is the first iteration).
We sum up our findings (accuracy, F1 score and the overall behavior observed during manual inspection) and develop countermeasures for wrong predictions: Which misbehavior can be attributed to which aspects of the data? How can we change this, e.g. by adding new images, removing old ones, augmenting images, or randomly varying the colours of the images?
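Randomly varying the colours could look roughly like the following sketch. It represents an image as a plain list of RGB tuples to stay free of dependencies; the function name `jitter_colors` and the parameter `max_shift` are assumptions for illustration, and a real project would more likely use the augmentation utilities of its image or deep learning library.

```python
import random


def jitter_colors(pixels, max_shift=20, rng=None):
    """Shift each RGB channel of the whole image by one random offset.

    pixels: list of (r, g, b) tuples with values in 0..255.
    max_shift: largest absolute offset applied to a channel.
    """
    if rng is None:
        rng = random.Random()
    # One offset per channel, shared by all pixels, so the colour mood
    # of the image changes while its content stays the same.
    shifts = [rng.randint(-max_shift, max_shift) for _ in range(3)]
    return [
        tuple(min(255, max(0, value + shift)) for value, shift in zip(pixel, shifts))
        for pixel in pixels
    ]


# Tiny made-up "image" with three pixels.
image = [(120, 64, 200), (0, 255, 30), (250, 10, 5)]
augmented = jitter_colors(image, max_shift=20, rng=random.Random(42))
print(augmented)
```

Applying a fresh random shift every time an image is used for training makes it harder for the model to rely on one particular colour mood, such as a common photo filter.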
Then we present a non-technical overall status to our client, address the problems of the model and offer countermeasures. This is often one of the most difficult tasks in machine learning projects, as we have to translate technical insights into non-technical advice on how the model could be further improved. Eventually, we agree on the next steps and go into the next iteration to implement the changes as discussed.
If the model already adds value for the client, it can be integrated as a prototype or even deployed to production. Generally, the model should be deployed as soon as possible, since we can then get valuable insights and feedback on how the model actually performs with real data. The deployed model is also the performance baseline for the following training iteration.
Once we have deployed the first model, we have gone through the entire machine learning life cycle, from conception to deployment. More iterations follow for various reasons, e.g. if users get unexpected results or there is a model shift (new classes should be added and others are not used any more).
Machine learning is a continuous cycle where the progress occurs in iterations.