Welcome to My Week in AI! Each week this blog will have the following parts:
- What I have done this week in AI
- An overview of an exciting and emerging piece of AI research
Absorbing Best Practices
This week I attended the Spark + AI Summit, hosted by Databricks. The conference offered many informative and useful talks, mostly on the topics of data engineering and productionizing machine learning models. I found two talks particularly enlightening: ‘Accelerating MLFlow Hyper-parameter Optimization Pipelines with RAPIDS’ by John Zedlewski from NVIDIA, and ‘Scaling up Deep Learning by Scaling Down’ by Nick Pentreath from IBM.
Training Models Rapidly
RAPIDS is a suite of open-source libraries from NVIDIA that run on GPUs and can be used in place of the standard Python data science libraries (pandas, scikit-learn, PyTorch, Matplotlib). None of these standard libraries, with the exception of PyTorch, has built-in GPU support; they compute on the CPU instead, which takes a significant amount of time. By running on the GPU, the RAPIDS libraries allow machine learning model development to happen in a tiny fraction of the typical computation time.
Models that took an hour to train using scikit-learn were trained in less than five minutes using cuML, the corresponding RAPIDS library. When I heard this statistic I was astounded by the amount of time saved, and on top of the speed, the libraries are very easy to use. RAPIDS has a counterpart for each of Python’s data science libraries, and each counterpart exposes the same functions as the library it replaces. For example, to use RAPIDS’ version of pandas, you just replace each instance of pandas in your code with cudf, and similarly with the other libraries. The talk went on to demonstrate a hyperparameter sweep with Hyperopt, and how RAPIDS integrates with it to make the sweep extremely fast compared to a grid search in scikit-learn. RAPIDS is a toolkit I plan to explore further, as computation time is a significant frustration for me (as it is for many data scientists!).
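The drop-in replacement pattern can be sketched in a few lines. This is a minimal illustration, not code from the talk; cudf only works on machines with a supported NVIDIA GPU, so the sketch falls back to pandas when cudf is unavailable:

```python
# Drop-in pattern: alias cudf (GPU) or pandas (CPU) to the same name.
try:
    import cudf as pd  # GPU-accelerated DataFrame library from RAPIDS
except ImportError:
    import pandas as pd  # CPU fallback; identical API for these calls

df = pd.DataFrame({"model": ["a", "b", "a", "b"],
                   "accuracy": [0.91, 0.88, 0.93, 0.90]})

# This groupby/mean works unchanged under either library.
mean_acc = df.groupby("model").accuracy.mean()
print(mean_acc)
```

Because the two libraries share an API for common operations, the rest of the pipeline does not need to know which one is in use.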
Optimizing Models for Production
Pentreath’s talk was on running deep learning models for inference on edge devices like mobile phones. These devices typically have limited resources, so the models have to be scaled down in order to run efficiently. Pentreath presented four main ways of doing this: architecture improvement, model pruning, quantization, and model distillation. Each of the four techniques leads to significant efficiency improvements; however, their effect on accuracy varies. Architecture improvement and model distillation typically cause a decrease in accuracy, whereas model pruning and quantization can often cause an increase in accuracy. I think it is easy for models to become bloated, so these techniques can be useful for managing memory and computation time regardless of whether or not the models are being run on edge devices.
A cheaper and more accurate BERT
The research I’m highlighting this week also focuses on scaled-down models. This week’s paper, ‘ALBERT: A Lite BERT for Self-supervised Learning of Language Representations’ by Lan et al.¹, presents a successor to the famous BERT and was presented at the ICLR conference in April 2020. The authors demonstrated two ways to reduce the training time and memory consumption of BERT, whilst also attaining superior accuracy on benchmark tasks.
This optimized architecture, ALBERT, uses two parameter reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. Factorized embedding parameterization splits the vocabulary embedding matrix into two smaller matrices so that the vocabulary embedding is no longer connected to the size of the hidden layers in the model. Cross-layer parameter sharing means all parameters are shared across each layer, so the number of parameters does not necessarily grow as the network becomes deeper.
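A quick back-of-the-envelope calculation shows why the factorization helps. Using sizes in the ballpark of the paper’s (a 30,000-wordpiece vocabulary, hidden size 1,024, embedding size 128):

```python
# Parameter count: direct V x H embedding vs. factorized V x E + E x H.
V, H, E = 30_000, 1_024, 128

untied = V * H              # BERT-style embedding tied to hidden size
factorized = V * E + E * H  # small embedding plus projection layer

print(untied)               # 30720000
print(factorized)           # 3971072
print(round(untied / factorized, 1))  # ~7.7x fewer embedding parameters
```

The embedding no longer scales with the hidden size H, so the hidden layers can be made wider without blowing up the vocabulary matrix.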
Furthermore, the researchers used a sentence-order prediction loss in training the model instead of the next-sentence prediction loss used in training BERT. Next-sentence prediction loss is a binary classification loss used to predict whether two sequences of text appear sequentially in a dataset. It was originally intended to improve BERT’s performance on downstream tasks, such as natural language inference, but because it mixes topic prediction with coherence prediction, studies have found it to be unreliable. The loss proposed by Lan et al. focuses only on coherence prediction, and helped to train an ALBERT model that is consistently more accurate on downstream tasks than BERT.
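The idea behind sentence-order prediction can be sketched very simply: consecutive segments in their original order are positive examples, and the same segments with their order swapped are negatives. This is an illustrative toy, not the authors’ actual preprocessing code:

```python
# Build toy sentence-order prediction (SOP) training pairs:
# label 1 = correct order, label 0 = swapped order.
def make_sop_pairs(segments):
    pairs = []
    for a, b in zip(segments, segments[1:]):
        pairs.append(((a, b), 1))  # original order -> positive
        pairs.append(((b, a), 0))  # swapped order  -> negative
    return pairs

doc = ["The cat sat down.", "Then it fell asleep.", "It woke at dusk."]
pairs = make_sop_pairs(doc)
```

Because both examples in each pair come from the same document, the model cannot rely on topic cues and must learn inter-sentence coherence instead.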
Other important takeaways are that an ALBERT configuration analogous to BERT-large has 1/18th the number of parameters and trains in less than two-thirds of the time, and that ALBERT can achieve state-of-the-art accuracy on three standard NLP benchmarks: GLUE, RACE and SQuAD.
Overall, seeing the advances made in NLP research since BERT was released has been very exciting for me, and NLP tasks are much easier when I can use such powerful and optimized pretrained models.