Our usual domain of expertise at Cord is working with video data, but we recently came across a problem where the only available data was images. We therefore couldn’t rely on the spatiotemporal correlations between frames that are reliably present in video to make the annotation process more efficient. We could, however, still use the principles of algorithmic labelling to automate labelling the data. Before we get into that, here is the problem:
Company A wants to build a deep learning model that looks at a plate of food and estimates the calorie count of that plate. They have an open source dataset that they want to use as a first step to identify the individual ingredients on the plate. The dataset is labelled with image-level classifications, but not with bounding boxes around the “food objects” themselves. Our goal is to re-label the dataset so that every image has a correctly placed bounding box around each item of food.
Instead of drawing these bounding boxes by hand we will label the data using algorithmic labelling.
Why Algorithmic Labelling?
Before we talk about solving this with algorithmic labelling, let’s look at our existing options for labelling this dataset. We can:
- hand label it ourselves. It takes me about six seconds to draw a bounding box, so with ~3,000 images it will take me about five hours to label everything manually.
- send the data elsewhere to be labelled. Outsourcing will likely cost around $0.15 per image, for a total of about $450. It will also take time to write a spec and get the data through a round trip with an external provider.
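The back-of-envelope numbers above can be checked with a few lines of Python (the per-box time and per-image price are the rough estimates from the text, not measured figures):

```python
n_images = 3000
seconds_per_box = 6    # rough time to draw one bounding box by hand
cost_per_image = 0.15  # estimated outsourced labelling price, USD

# Option 1: hand labelling, one box per image
hand_label_hours = n_images * seconds_per_box / 3600

# Option 2: outsourcing at a flat per-image rate
outsourced_cost = n_images * cost_per_image

print(f"Hand labelling: ~{hand_label_hours:.0f} hours")  # ~5 hours
print(f"Outsourcing:    ~${outsourced_cost:.0f}")        # ~$450
```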
If we look at the cost/time tradeoffs of algorithmic labelling against these two options, it might not seem like a slam dunk. Writing a good program takes time, perhaps initially even more time than you would spend annotating the data yourself. But it comes with very important benefits:
- Once you have an algorithm working, it is both reusable for similar problems, and extensible to fit slightly altered problems.
- The initial time cost of writing a program is fixed: it does not increase with the amount of data you have. Writing a good labelling algorithm is thus scalable.
- Most importantly, writing labelling algorithms improves your final models. The data science process does not start once you have a labelled dataset; it starts once you have any data at all. Thinking through an algorithm to label your data will give you insight into the data that you would miss if you simply sent it to an external party to annotate.
With algorithmic labelling there is a strong positive externality: you come to genuinely understand the data through the program you write. The time taken is fixed, but the program, and your insight, exist forever.
With all this in mind, let’s think through a process we can use to write a program for this data. The high-level steps will be:
- Examine the dataset
- Write and test a prototype
- Run the program out of sample and review
Examine the dataset
The first step in any data science process should be to look at the data and take an inventory of its organisational structure and of the common properties and invariants it holds. This usually starts with the data’s directory structure:
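A minimal sketch of how we might print that structure from Python, assuming the images live under a hypothetical `food_dataset` root directory:

```python
import os

def print_tree(root: str, max_entries: int = 5) -> None:
    """Print the directory tree under `root`, truncating long file listings."""
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath[len(root):].count(os.sep)
        indent = "    " * depth
        print(f"{indent}{os.path.basename(dirpath) or dirpath}/")
        for name in sorted(filenames)[:max_entries]:
            print(f"{indent}    {name}")
        if len(filenames) > max_entries:
            print(f"{indent}    ... ({len(filenames) - max_entries} more)")

print_tree("food_dataset")  # hypothetical dataset root
```

Listing a handful of files per folder is usually enough to see whether the dataset is organised by class, by split, or by something else entirely.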
We can also inspect the images themselves:
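A quick way to do that inspection programmatically is with Pillow; this sketch just pulls out the basic properties (dimensions and colour mode) for a single file, with the path being a hypothetical example:

```python
import os
from PIL import Image

def summarise_image(path: str) -> dict:
    """Load an image and report the basic properties we care about."""
    with Image.open(path) as img:
        return {
            "file": os.path.basename(path),
            "size": img.size,  # (width, height) in pixels
            "mode": img.mode,  # e.g. "RGB" or "L"
        }

# e.g. summarise_image("food_dataset/apple_pie/0001.jpg")
```

Running this over a sample of files quickly reveals whether image sizes and colour modes are consistent, which matters for any downstream labelling algorithm.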