Machine learning projects start like any other technology project. There is a problem or a need, and we begin to explore the task and discuss possible approaches to solve it. However, the execution of AI projects is fundamentally different from traditional technology ventures, because they are more iterative and exploratory in nature. This is why every machine learning project is carried out as a life cycle process.
Rule of Thumb: Start Small, Fail Fast
Machine learning projects always involve a high degree of uncertainty in terms of workload and result quality. To minimize the risk and investment for our customers, we strictly follow the “start small, fail fast” philosophy.
This means we build a feature-complete system with the minimal possible workload to get fast feedback on whether the model and the available data play well together. Then we improve the data and model in iterations (one iteration is a complete life cycle run) to raise the result quality to the needed level.
Let’s look at the life cycle phases.
1. Data Collection
Machine learning models solve a given problem on the basis of data. Therefore, everything starts with collecting enough samples with proper metadata.
Quality, quantity, and balance of the data are the decisive points in data collection. The more data we have, and the better its quality and balance, the more accurately the model will learn and predict.
The quality of the samples is important because wrong or misleading samples or metadata (called noisy data) will confuse the model and dramatically lower the prediction quality. We can improve the quality with data cleaning (phase 2 of the life cycle).
Having balanced data means having roughly the same amount of training data for each class. Unbalanced training data can lead to biased models, as the classes are not represented equally.
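A quick way to spot imbalance is to compute each class's share of the training set. The following is a minimal sketch (the function name and the 90/10 example data are illustrative, not from a specific project):

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of a list of training labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# A heavily skewed set: 90% "cat", only 10% "dog"
labels = ["cat"] * 90 + ["dog"] * 10
shares = class_balance(labels)
# shares == {"cat": 0.9, "dog": 0.1} — a clear imbalance warning
```

If one class dominates like this, the model can reach high accuracy simply by always predicting the majority class, which is exactly the bias described above.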
In computer vision projects we often face a lack of training data (correctly labeled images). To increase the quantity and improve the balance of the data, we may be able to use data synthesis to create training data programmatically ourselves. This process can be very complex, and there are various methods to do it.
Another common method to create more data is called data augmentation. We create additional samples by modifying existing ones, e.g. through random cropping, adding noise, or changing colors or brightness.
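Two of the augmentations mentioned above, random cropping and noise injection, can be sketched on a raw pixel array with plain NumPy (real pipelines typically use a library such as torchvision or albumentations; the sizes and noise level here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def augment(image, crop_size):
    """Create two augmented variants of a grayscale image array:
    a random crop and a copy with Gaussian pixel noise."""
    h, w = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    # Add noise, then clip back into the valid 0..255 pixel range
    noisy = np.clip(image + rng.normal(0, 10, image.shape), 0, 255)
    return crop, noisy

# A synthetic 64x64 grayscale "image" stands in for a real sample
image = rng.integers(0, 256, size=(64, 64)).astype(float)
crop, noisy = augment(image, crop_size=48)
```

Each original sample can yield many such variants, which increases quantity without any new labeling effort.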
2. Data Preparation
Once we have collected enough data, we need to bring it into a structure we can feed to the model.
We clean the data by identifying noisy, false, or misleading samples and correcting or removing them from the training set. Additionally, we preprocess the data to normalize it. In our case this mostly means scaling or cropping images, converting them into a suitable format, and creating a folder structure we can use for training.
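The folder structure mentioned above often follows the `<root>/<split>/<class>` convention that many training frameworks read directly. A minimal sketch, assuming the split and class names shown here (they are placeholders, not a prescribed layout):

```python
from pathlib import Path
import tempfile

def build_training_tree(root, classes):
    """Create a <root>/<split>/<class> folder layout,
    one directory per class under each data split."""
    for split in ("train", "val", "test"):
        for cls in classes:
            (Path(root) / split / cls).mkdir(parents=True, exist_ok=True)

root = tempfile.mkdtemp()
build_training_tree(root, ["cat", "dog"])
```

Cleaned and preprocessed images are then sorted into these class folders, so the split membership and the label are both encoded in the path.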
Collecting, cleaning, and preprocessing data are our biggest and most time-consuming challenges. It is not unusual to spend a major portion of the project time on these tasks.
3. Model Evaluation and Training
During model evaluation we take a closer look at different models and model architectures in order to find out which architectures work well with certain data and certain problems.
There are models that work well with text, e.g. translation, term classification. Other models work well with images, e.g. classification models, detection models, or localization models. Our experience, best practice orientation, and scientific research lead us to the appropriate model for our current project.
Before we start training the model, we split the data set into actual training data (the majority of the data, let’s say 75%), validation data (10%), and test data (15%). The actual distribution can vary depending on the amount of data available. Training data and validation data are used during model training. The test data is used after the training to validate the model performance on unseen data.
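The 75/10/15 split described above can be sketched in a few lines of plain Python (libraries such as scikit-learn provide `train_test_split` for the same purpose; the fixed seed here is only for reproducibility of the example):

```python
import random

def split_dataset(samples, train_frac=0.75, val_frac=0.10, seed=7):
    """Shuffle samples and split them into train / validation / test.
    Whatever remains after train and validation goes to test."""
    items = samples[:]
    random.Random(seed).shuffle(items)  # shuffle a copy, keep input intact
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_set, val_set, test_set = split_dataset(list(range(100)))
# With 100 samples this yields 75 / 10 / 15 items
```

Shuffling before splitting matters: if the data is ordered by class or by collection date, an unshuffled split would give the model a test set drawn from a different distribution than the training set.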
Training a model in the field of computer vision is more complex and time-consuming than most text-based machine learning tasks. This is because we use deep and complex models, and the data needed for these models tends to be very large, up to terabytes. Computation is therefore very time-consuming.
4. Model Validation
After finishing the training as described above, we assess the quality of the model. We work with the model to understand its behavior: which aspects are already solved very well, and which are not. By inspecting the visual data, we derive the necessary changes to the training set to optimize result quality. One adjustment could be, for example, to collect or synthesize more data for a specific category.
Sometimes we even have to change the model architecture, especially if we find that the model either cannot grasp the task or simply memorizes the training set (underfitting and overfitting, respectively).
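A common first signal for both failure modes is the relationship between training and validation accuracy. The heuristic below is a rough sketch with arbitrary illustrative thresholds, not a fixed rule:

```python
def fit_diagnosis(train_acc, val_acc, gap_threshold=0.10, low_threshold=0.70):
    """Rough heuristic: low accuracy on both sets suggests underfitting;
    a large train/validation gap suggests overfitting (memorization)."""
    if train_acc < low_threshold and val_acc < low_threshold:
        return "underfitting"
    if train_acc - val_acc > gap_threshold:
        return "overfitting"
    return "ok"

# Near-perfect training accuracy but weak validation accuracy:
# the model has memorized the training set
verdict = fit_diagnosis(train_acc=0.99, val_acc=0.70)
# verdict == "overfitting"
```

In practice, this check is done by plotting both accuracy (or loss) curves over the training epochs and watching where they diverge.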
5. Comparison and Feedback
In this step it is time to share the progress we have made so far with our customer. We present our findings on the quality and condition of the model, showing what worked and what did not. Good teamwork with our customer is crucial here. Together, we discuss possible improvements to the model, for example gathering more data and where to get this data from. In close cooperation, we plan the next iteration of model training.
The deployment of our current model version acts as the quality baseline for the following training iteration. If the model already adds value for the customer, it can be integrated into their prototype or even into production. Meanwhile, we begin the next iteration of training, and the life cycle starts again.