Deep Learning (DL) models are revolutionizing the business and technology world with jaw-dropping performances in one application area after another — image classification, object detection, object tracking, pose recognition, video analytics, synthetic picture generation — just to name a few.
However, they are anything but classical Machine Learning (ML) algorithms/techniques. DL models use millions of parameters and create extremely complex and highly nonlinear internal representations of the images or datasets that are fed to them.
Much of the theory and mathematical machinery behind classical ML (regression, support vector machines, etc.) was developed with linear models in mind. However, practical real-life problems are often nonlinear in nature and therefore cannot be solved effectively using those ML methods.
A simple illustration is shown below and, although it is somewhat over-simplified, it conveys the idea. Deep learning models are inherently better suited to tackling such nonlinear classification tasks.
However, at its core, structurally, a deep learning model consists of stacked layers of linear perceptron units and simple matrix multiplications are performed over them. Matrix operations are essentially linear multiplication and addition.
It can also be shown, mathematically, that the universal approximation power of a deep neural network (the ability to approximate any mathematical function to a sufficient degree) does not hold without these nonlinear activation stages in between the layers.
Evidently, it does so by comparing its predictions to the ground truth (labeled images, for example) and tuning the parameters of the model. The difference between the prediction and the ground truth is called the ‘classification error’.
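To make the point about nonlinear activation stages concrete, here is a minimal sketch in plain Python. The weight matrices `W1` and `W2`, the input `x`, and the helper `matmul` are made up for illustration. It shows that two stacked linear layers give exactly the same result as one merged linear layer, while inserting a ReLU between them breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Apply ReLU element-wise to a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


# Hypothetical weights for two layers and a column-vector input.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [[1.0], [2.0]]

# Two linear layers applied one after the other ...
two_linear_layers = matmul(W2, matmul(W1, x))
# ... are equivalent to a single layer whose weight matrix is W2 * W1.
one_merged_layer = matmul(matmul(W2, W1), x)

# With a nonlinearity in between, the network no longer collapses.
with_activation = matmul(W2, relu(matmul(W1, x)))

print(two_linear_layers, one_merged_layer, with_activation)
```

However many purely linear layers are stacked, the same collapse applies, which is why depth alone adds no expressive power without activation functions.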
These two components — activation functions and nonlinear optimizers — are at the core of every deep learning architecture. However, there is considerable variety in the specifics of these components and in the next two sections, we go over the latest developments.
The physical structure of a typical biological neuron consists of a cell body, an axon for sending messages to other neurons, and dendrites for receiving signals from other neurons.
In artificial neural networks, we extend this idea by shaping the outputs of neurons with activation functions. They push the output signal strength up or down in a nonlinear fashion depending on its magnitude. High-magnitude signals propagate further and take part in shaping the final prediction of the network, whereas weakened signals die off quickly.
Some common activation functions are described below.
In the logistic function, a small change in the input only causes a small change in the output as opposed to the stepped output. Hence, the output is smoother than the step function output.
While sigmoid functions were among the first used in early neural network research, they have fallen out of favor recently. Other functions have been shown to produce the same performance with fewer iterations. However, the idea is still quite useful for the last layer of a DL architecture (either stand-alone or as a softmax function) for classification tasks. This is because the output lies between 0 and 1 and can therefore be interpreted as a probability.
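As a rough illustration, the logistic (sigmoid) function and its multi-class counterpart, softmax, can be sketched in plain Python as follows. The function names and the example scores are chosen for this sketch only. Both implementations are written to avoid overflow for large inputs.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical class scores
print(sigmoid(0.0), probs)
```

The sigmoid output always lies between 0 and 1, and the softmax outputs sum to 1, which is what allows them to be read as class probabilities.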
Tanh is a non-linear activation function that compresses all its inputs to the range (-1, 1). The mathematical representation is given below:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
ReLU is a non-linear activation function that was first popularized in the context of convolutional neural networks (CNNs). If the input is positive, the function outputs the value itself; if the input is negative, the output is zero.
The function doesn’t saturate in the positive region, thereby avoiding the vanishing gradient problem to a large extent. Furthermore, evaluating the ReLU function is computationally efficient because it does not involve computing exp(x), and in practice training often converges much faster than with logistic/Tanh for the same performance (classification accuracy, for example). For this reason, ReLU has become the de facto standard for large convolutional neural network architectures such as Inception, ResNet, MobileNet, VGGNet, etc.
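The saturation argument can be checked numerically. The following is a small sketch (function names are illustrative) that compares the gradient of ReLU with the gradient of tanh for a large positive input.

```python
import math


def relu(x):
    """ReLU: pass positive inputs through, clamp negatives to zero."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which shrinks toward 0 for large |x|."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 10.0):
    print(x, relu_grad(x), tanh_grad(x))
```

For an input of 10, the ReLU gradient is still 1 while the tanh gradient is almost zero, which is the saturation that slows learning in deep stacks of tanh or sigmoid units.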
In this variant of ReLU (Leaky ReLU), instead of producing zero for negative inputs, the function produces a small value proportional to the input, e.g. 0.01x, as if it were ‘leaking’ some value in the negative region instead of producing hard zeros.
Because the negative side has a small slope (0.01) instead of zero, the gradient does not vanish there. If the input is negative, the output is 0.01 times the input and the gradient is 0.01, which ensures that neurons don’t die. So, the apparent advantages of Leaky ReLU are that it doesn’t saturate in the positive or negative region, it avoids the ‘dead neurons’ problem, it is easy to compute, and it produces close to zero-centered outputs.
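A minimal sketch of Leaky ReLU and its gradient, assuming the commonly used slope of 0.01 for the negative region (the parameter name `slope` is chosen for this example):

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU: identity for positive inputs, slope * x for negative ones."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    """Derivative of Leaky ReLU: 1 on the positive side, slope on the negative side."""
    return 1.0 if x > 0 else slope


print(leaky_relu(-5.0), leaky_relu_grad(-5.0))
print(leaky_relu(3.0), leaky_relu_grad(3.0))
```

Because the gradient on the negative side is a small constant rather than zero, a unit that receives negative inputs still gets a weight update.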
Although you are more likely to come across one of the aforementioned activation functions when dealing with common DL architectures, it is good to know about some recent developments in which researchers have proposed alternative activation functions to speed up large model training and hyperparameter optimization tasks.
Swish is one such function, proposed by the Google Brain team, who searched for better activation functions using automated techniques that included reinforcement learning.
The Google team’s experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
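The formula is not reproduced in the text above. As published by its authors, Swish is f(x) = x · sigmoid(βx), where β is a constant or trainable parameter. The sketch below assumes β = 1, and the function names are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


for x in (-5.0, -1.0, 0.0, 1.0, 10.0):
    print(x, swish(x))
```

For large positive inputs the function behaves almost like ReLU, while for small negative inputs it dips slightly below zero instead of clamping to exactly zero.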
As stated earlier, a deep learning model works by gradually reducing the prediction error with respect to a training dataset by adjusting the weights of the connections.
The idea is to construct a cost function (or loss function) that measures the difference between the actual output and the output predicted by the model. The gradients of this cost function with respect to the model weights are then computed and propagated back layer by layer. This way, the model learns which weights are responsible for a larger error and tunes them accordingly.
The cost function of a deep learning model is a complex, high-dimensional nonlinear function that can be thought of as an uneven terrain with ups and downs. We want to reach the bottom of the valley, i.e. minimize the cost. The gradient indicates the direction of steepest increase, so to find the minimum point in the valley we need to move in the opposite direction: we update the parameters in the negative gradient direction to minimize the loss.
The learning rate controls how much we adjust the weights with respect to the loss gradient. It is a hyperparameter chosen before training rather than a randomly initialized quantity (it is the weights that are randomly initialized). The lower the learning rate, the slower the convergence to a minimum. Too high a learning rate can make gradient descent overshoot the minimum and fail to converge.
Basically, the update equation for weight optimization is:
$$w \leftarrow w - \alpha \frac{\partial C}{\partial w}$$
Here, α is the learning rate, C is the cost function, and w (also written ω) is the weight vector. We update the weights in proportion to the negative of the gradient, scaled by the learning rate.
There are a few variations of the core gradient descent algorithm:
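The update rule and the effect of the learning rate can be tried on a toy problem. The sketch below assumes a one-dimensional cost C(w) = w², whose gradient is 2w and whose minimum is at w = 0. The function name and the specific learning rates are chosen only for illustration.

```python
def minimize_quadratic(learning_rate, steps=50, w=5.0):
    """Run plain gradient descent on C(w) = w**2 and return the final weight."""
    for _ in range(steps):
        grad = 2.0 * w             # dC/dw for C(w) = w**2
        w -= learning_rate * grad  # move against the gradient
    return w


too_small = minimize_quadratic(0.001)  # barely moves away from the start
reasonable = minimize_quadratic(0.1)   # ends close to the minimum at 0
too_large = minimize_quadratic(1.1)    # overshoots more on every step

print(too_small, reasonable, too_large)
```

With the smallest rate the weight is still near its starting value after 50 steps, with the largest it grows without bound, and only the middle setting gets close to the minimum.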
Stochastic gradient descent uses a single datapoint (randomly chosen) to calculate the gradient and update the weights with every iteration. The dataset is shuffled to make it randomized. As the dataset is randomized and weights are updated for each single example, the cost function and weight update are generally noisy.
Mini-batch gradient descent is a variation of stochastic gradient descent in which a mini-batch of samples is used instead of a single training example. It is widely used, converges faster, and is more stable. The batch size can vary depending on the dataset and is commonly 128 or 256. The data in each batch fits easily in memory, the process is computationally efficient, and it benefits from vectorization. If the search (for minima) gets stuck at a local minimum, some noisy random steps can take it out again.
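Stochastic and mini-batch gradient descent differ only in how many samples contribute to each update. The following sketch fits a straight line y = w·x + b to made-up, noise-free data. The function name, the dataset, and the hyperparameter values are assumptions for this example. Setting `batch_size=1` gives stochastic gradient descent.

```python
import random


def minibatch_sgd(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # reshuffle so each epoch sees a new batch order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient of (w*x + b - y)**2 over the batch.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical dataset generated from y = 3x + 1.
points = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

w_sgd, b_sgd = minibatch_sgd(points, batch_size=1)  # one sample per update
w_mb, b_mb = minibatch_sgd(points, batch_size=8)    # eight samples per update

print(w_sgd, b_sgd, w_mb, b_mb)
```

Both settings recover a slope near 3 and an intercept near 1 here. On real, noisy data the single-sample updates fluctuate more from step to step than the averaged mini-batch updates.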
The idea of momentum is borrowed from simple physics, where it can be thought of as a property of matter that maintains the inertial state of an object rolling downhill. Under gravity, the object gains momentum (increases speed) as it rolls further down.
Here is a detailed article on various sub-types and formulations of Nesterov accelerated gradient (NAG).
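One common way to write the momentum update, assumed here because the text does not spell out the equations, keeps a running velocity v ← γ·v + α·∇C and then applies w ← w − v. The sketch below compares it with plain gradient descent on the toy cost C(w) = w². The names and hyperparameter values are illustrative.

```python
def descend(momentum, learning_rate=0.01, steps=100, w=5.0):
    """Gradient descent on C(w) = w**2 with an optional momentum term."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * w
        # Accumulate a decaying sum of past gradients, like a ball gaining speed.
        velocity = momentum * velocity + learning_rate * grad
        w -= velocity
    return w


plain = descend(momentum=0.0)          # ordinary gradient descent
with_momentum = descend(momentum=0.9)  # same learning rate, plus momentum

print(plain, with_momentum)
```

With the same small learning rate and the same number of steps, the run with momentum ends much closer to the minimum at 0 because consecutive gradients that point the same way reinforce each other.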
Often the dataset exhibits sparsity, and the parameters tied to rarely seen features need larger updates than the others. This can be achieved by tuning the learning rate differently for different sets of parameters, and AdaGrad does precisely that.
AdaGrad performs larger updates for infrequent parameters and smaller updates for frequent parameters. It is well suited to sparse data, as in large-scale neural networks. For example, GloVe word embedding (an essential encoding scheme for Natural Language Processing or NLP tasks) uses AdaGrad, where infrequent words require larger updates and frequent words require smaller updates.
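Below we show the update equation for AdaGrad, where all the past gradients are accumulated in the denominator (the sum-of-squares term):
$$w_{t+1,i} = w_{t,i} - \frac{\alpha}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i}, \qquad G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^{2}$$
Here, g_{t,i} is the gradient of the cost function with respect to parameter i at step t, α is the learning rate, and ε is a small constant that prevents division by zero.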
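A minimal sketch of the AdaGrad update follows. It assumes the common formulation in which each parameter's step is the learning rate times the gradient, divided by the square root of that parameter's accumulated squared gradients plus a small constant. The function name, the two-parameter setup, and the gradient pattern are made up for illustration.

```python
import math


def adagrad_step(weights, grads, accum, learning_rate=0.1, eps=1e-8):
    """Apply one in-place AdaGrad update to every parameter."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients
        weights[i] -= learning_rate * g / (math.sqrt(accum[i]) + eps)


weights = [0.0, 0.0]
accum = [0.0, 0.0]
step_sizes = []
for step in range(10):
    # Parameter 0 receives a gradient on every step (a frequent feature);
    # parameter 1 receives one only on the last step (an infrequent feature).
    grads = [1.0, 1.0 if step == 9 else 0.0]
    before = list(weights)
    adagrad_step(weights, grads, accum)
    step_sizes.append([abs(a - b) for a, b in zip(weights, before)])

print(step_sizes[0], step_sizes[-1])
```

On the last step both parameters see the same gradient, yet the infrequently updated parameter takes a step roughly three times larger, because its accumulated sum of squares is still small.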
In this article, we went over two core components of a deep learning model: the activation function and the optimizer algorithm. The power of deep learning to learn highly complex patterns from huge datasets stems largely from these components, as they help the model learn nonlinear features in a fast and efficient manner.
Both are areas of active research, and new developments keep arriving to enable the training of ever larger models and faster yet stable optimization while learning over multiple epochs on terabytes of data.