Summary: Transfer Learning (TL) may be the most important aid to the adoption of deep learning in the last several years. The new LEEP measure predicts how well a transfer will work, and should make TL faster, cheaper, and better.
What is the single most important innovation in deep learning in the last several years? There are several candidates. You might argue for TensorFlow-specific chips, or the BERT architecture that has transformed NLP. I would argue that in this age of data science implementation it is a discovery that makes our deep learning models excel on all three dimensions of the faster-better-cheaper scale, and that is transfer learning.
Transfer learning has become a major force for the adoption of all sorts of deep learning applications, from image classification to NLP. Back in the bad old days, say before 2017, before we learned to use transfer learning, the adoption of DL models was constrained by these well-known barriers:
- Complexity leading to extended time to success and an abnormally high outright failure rate.
- Extremely large quantities of labeled data needed to train.
- Large amounts of expensive specialized compute resources.
- Scarcity of data scientists qualified to create and maintain the models.
The advantages of starting with a pretrained model are significant.
- Its hyperparameters have already been tested and found to be successful, eliminating much of the experimentation around tuning.
- The earlier or shallower layers of a CNN are essentially learning the features of the image set, such as edges, shapes, and textures. Only the last one or two layers of a CNN perform the most complex task of summarizing the vectorized image data into the classification data for the 10, 100, or 1,000 different classes they are supposed to identify. These earlier, shallow layers of the CNN can be thought of as featurizers, discovering the previously undefined features on which the later classification is based.
In simplified TL, the pretrained transfer model is simply chopped off at the last one or two layers. Once again, the early, shallow layers are those that have identified and vectorized the features, and typically only the last one or two layers (the head) need to be replaced.
The output of the truncated ‘featurizer’ front end is then fed to a standard classifier like an SVM or logistic regression to train against your specific images.
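As a minimal sketch of that recipe (not from any paper), suppose we have already pushed our images through the truncated pretrained CNN and collected its penultimate-layer activations; here random vectors stand in for those features, and all names and dimensions are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: in practice `features` would be the output of the
# truncated pretrained CNN (e.g. a 512-dim activation vector per image).
# Random vectors stand in for them here so the sketch is self-contained.
rng = np.random.default_rng(0)
n_images, feat_dim, n_classes = 200, 512, 10

features = rng.normal(size=(n_images, feat_dim))    # stand-in CNN features
labels = rng.integers(0, n_classes, size=n_images)  # your target labels

# Train a simple head classifier on the frozen features.
clf = LogisticRegression(max_iter=1000)
clf.fit(features, labels)
print(clf.score(features, labels))  # training accuracy of the new head
```

An SVM (`sklearn.svm.SVC`) would drop in the same way; the point is that only this small classifier is trained, not the CNN.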
Today, the two most common methods of TL are these:
- Retrain the head classifier only: Keep the feature extractor layers of the base model, add a new head classifier, and retrain only that new head from scratch using the new target data.
- Fine-tune the entire model: Replace the base model's head classifier with a new head, then fine-tune the entire model, both feature extractor and new head classifier, using the new target data.
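The two strategies can be sketched in PyTorch. This is an illustrative toy, not anyone's production code: a tiny stand-in "base model" with a feature extractor and a head, where the layer sizes and the 5-class target task are made up for the example.

```python
import torch.nn as nn

# Toy stand-in for a pretrained base model: a "feature extractor"
# followed by the original 10-class head.
base = nn.Sequential(
    nn.Sequential(nn.Linear(32, 16), nn.ReLU()),  # "feature extractor"
    nn.Linear(16, 10),                            # original head
)

# Strategy 1: retrain the head only. Freeze the feature extractor and
# swap in a new head for the (hypothetical) 5 target classes.
for p in base[0].parameters():
    p.requires_grad = False
base[1] = nn.Linear(16, 5)  # new head, trained from scratch

trainable = [n for n, p in base.named_parameters() if p.requires_grad]
print(trainable)  # only the new head's weight and bias remain trainable

# Strategy 2: fine-tune the entire model. Replace the head as above but
# leave every parameter trainable (usually with a small learning rate).
for p in base.parameters():
    p.requires_grad = True
```

The optimizer then only updates parameters with `requires_grad=True`, which is what makes strategy 1 so much cheaper.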
Some Practical Limitations
That was the dream: take any one of dozens of large-scale pretrained DL classifiers and use it for your own application. Of course, there were some limitations that were understood at an intuitive level. The main condition is that the base model must have been trained on objects that look at least mostly like the ones you are trying to classify.
So objects found around the house, or buildings, or plants could logically be transferred to other similar objects, but if you tried to transfer a model trained on desks and chairs to, say, molecular or genomic images, that logically wouldn't work.
Over the last few years, several methods have been put forth to try to predict how well the base model's training will transfer to your own. The principal ones are Negative Conditional Entropy (NCE) and the H score. Both attempt to predict how the accuracy of the new model will be affected by the training of the base model; that is, will it transfer well or not.
LEEP (Log Expected Empirical Prediction) Score
The major advance described here is a newly announced scoring technique that predicts how well the transfer learning will work, providing a clearly interpretable score and a very simple calculation that requires only one forward pass through the target data set. LEEP comes to us from researchers at AWS: Cuong Nguyen, Tal Hassner, Matthias Seeger, and Cedric Archambeau.
Not only is LEEP easier and faster to calculate, it is also much easier to interpret, with higher scores meaning better transferability. Level 5 in this transfer test indicates the model most likely to transfer with the least additional introduced error. This test was conducted transferring CIFAR10 to CIFAR100.
The LEEP score also predicts transferability more accurately than the NCE or H score methods.
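Following the paper's definition, the calculation itself is short; a hedged NumPy sketch (notation and function name are mine): `theta` holds the source model's softmax outputs over its own classes for each target example, gathered in that single forward pass, and `y` holds the target labels.

```python
import numpy as np

def leep(theta, y, n_target_classes):
    """LEEP score from source-model softmax outputs `theta` (n x Z)
    on the target set and integer target labels `y` (n,)."""
    n = theta.shape[0]
    # Empirical joint distribution over (target label y, source label z),
    # averaging each example's softmax into its target label's row.
    joint = np.zeros((n_target_classes, theta.shape[1]))
    for i in range(n):
        joint[y[i]] += theta[i]
    joint /= n
    # Empirical conditional P(y | z).
    cond = joint / joint.sum(axis=0, keepdims=True)
    # Expected Empirical Prediction of each example's own label,
    # then the average log gives the LEEP score.
    eep = (theta @ cond.T)[np.arange(n), y]
    return np.log(eep).mean()

# Toy check: a source model whose outputs perfectly separate the target
# labels gets the maximum score of 0.0; real scores are negative.
theta = np.eye(3)[[0, 1, 2, 0, 1, 2]]
y = np.array([0, 1, 2, 0, 1, 2])
print(leep(theta, y, 3))  # → 0.0
```

Scores are always at most 0, so comparing candidate base models reduces to picking the one with the LEEP score closest to 0.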
The value of this innovation in measurement is in making TL easier and more accurate. And since TL is the key to ever-broader adoption, this is a major breakthrough. For the actual LEEP computation and more detail, see their full paper here.
About the author: Bill is Contributing Editor for Data Science Central. Bill is also President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist since 2001. His articles have been read more than 2.1 million times.
[email protected] or [email protected]