Machine Learning (ML) development is an iterative process in which the accuracy of a model’s predictions is improved by repeating the training and evaluation phases. In each iteration, developers adjust certain settings. Any setting that is selected manually, based on what was learned from previous experiments, qualifies as a model hyper-parameter. Hyper-parameters represent design decisions whose values cannot be estimated directly from the training data or derived from ML theory. They are the knobs you turn between training runs to improve the model’s predictive accuracy. In other words, they are variables that govern the training process itself, as opposed to the model parameters (such as weights) that training learns. They are usually specified by practitioners experienced in ML development and tuned separately for each predictive modeling problem.
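To make the distinction concrete, here is a minimal sketch in plain Python. It assumes a toy one-weight model `y = w * x` trained by gradient descent on mean squared error; the function name `train`, the data, and the specific learning-rate values are illustrative assumptions, not taken from any particular library. The weight `w` is learned from the data, while `learning_rate` and `epochs` are hyper-parameters chosen by hand.

```python
def train(xs, ys, learning_rate, epochs):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # model parameter: learned from the data
    n = len(xs)
    for _ in range(epochs):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


# Toy data generated with a true weight of 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Same data, same model; only the hand-picked hyper-parameter differs.
w_good = train(xs, ys, learning_rate=0.05, epochs=100)      # converges near 2
w_slow = train(xs, ys, learning_rate=0.0001, epochs=100)    # still far from 2
w_diverged = train(xs, ys, learning_rate=0.2, epochs=100)   # overshoots and blows up
```

On this toy problem, a learning rate that is too small leaves the weight far from its target after 100 epochs, and one that is too large makes the updates diverge. Nothing in the data says which value to pick, which is why it has to be tuned.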
Building an ML model is a long process that requires domain knowledge, experience and intuition. In ML, hyper-parameter optimization, or tuning, is the problem of choosing an optimal set of hyper-parameters for a learning algorithm. For a given problem, we usually do not know the best combination of hyper-parameter values in advance. We may use rules of thumb, copy values used on other problems, or search for the best values by trial and error. When a machine learning algorithm is adapted to a specific problem, for example through the optimization settings exposed by a higher-level API, the hyper-parameters must also be tuned to find the values that result in a model with higher prediction accuracy. Hyper-parameter tuning is often described as searching the parameter space for optimum values. With Deep Learning models, the search space is usually very large, and a single model might take days to train. The common hyper-parameters are:
- Epochs – The number of full training passes over the entire dataset. In one epoch, each example is seen once.
- Learning rate – A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.
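- Momentum in Stochastic Gradient Descent – A coefficient, typically between 0 and 1, that controls how much of the previous update is carried over into the current one. It is often described as a friction-like term that governs how quickly the descent settles as it approaches the bottom.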
- Regularization method – Regularization is used to prevent the model from over-fitting. Common kinds of regularization include L1 regularization (Lasso) and L2 regularization (Ridge).
- Regularization rate – The strength of the penalty on a model’s complexity. The scalar value λ specifies the importance of the regularization function relative to the loss function. Raising the value of λ reduces over-fitting, but setting it too high makes the model under-fit and lowers its accuracy.
- Early stopping – A form of regularization in which a callback function tests a training condition after every epoch, typically whether a monitored metric such as validation loss has improved. If a set number of epochs elapses without any improvement, it automatically stops the training. The patience parameter is the number of epochs without improvement to wait before stopping.
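- K in k-means clustering – The number of clusters to be discovered.
- C and Sigma – For Support Vector Machines. C is the penalty applied to misclassified training points, and sigma is the width of the Gaussian (RBF) kernel.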
- Number of hidden layers – For Neural Networks
- Number of units per layer – For Neural Networks
- max_depth – The maximum depth of each tree in the Random Forest method.
- n_estimators – The number of trees in the Random Forest. More trees generally give better performance, with diminishing returns and at the cost of longer training time.
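Several of the entries in this list can be written down in a few lines of code. The sketch below uses plain Python to show one common form of the momentum update, the way the regularization rate weights a penalty against the loss, and the patience logic behind early stopping. The function names (`sgd_momentum_step`, `regularized_loss`, `early_stopping_epoch`) and the sample numbers are hypothetical, chosen only for illustration. Real frameworks implement variants of these ideas with their own APIs.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate, momentum):
    """One update with momentum.

    The gradient step is learning_rate * grad; the momentum coefficient
    decides how much of the previous velocity is carried over.
    """
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


def regularized_loss(data_loss, weights, rate, method="l2"):
    """Add a complexity penalty, scaled by the regularization rate."""
    if method == "l1":
        penalty = sum(abs(w) for w in weights)   # L1 (Lasso)
    elif method == "l2":
        penalty = sum(w * w for w in weights)    # L2 (Ridge)
    else:
        raise ValueError("method must be 'l1' or 'l2'")
    return data_loss + rate * penalty


def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None
    if the loss never goes `patience` epochs without improving."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Two momentum updates with a constant gradient: the second step is larger
# because part of the first step's velocity is carried over.
w, v = sgd_momentum_step(1.0, 0.0, grad=2.0, learning_rate=0.1, momentum=0.9)
w2, v2 = sgd_momentum_step(w, v, grad=2.0, learning_rate=0.1, momentum=0.9)

# Made-up validation losses, one per epoch.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]
stop_at = early_stopping_epoch(losses, patience=3)
never_stops = early_stopping_epoch(losses, patience=4)
```

With these made-up losses, a patience of 3 stops training before the later improvement at the final epoch is seen, while a patience of 4 lets training continue. This is the trade-off the patience setting controls.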
Model optimization using hyper-parameter tuning is a search problem: the goal is to identify the best combination of these parameters. The commonly used methods are grid search, random search and Bayesian optimization. In grid search, a list of candidate values is constructed for each hyper-parameter within a specified range, and every combination of these values is tried in turn. The number of experiments therefore increases drastically with the number of hyper-parameters. For example, three hyper-parameters with ten candidate values each already require 1,000 training runs. Random search, rather than training on all possible configurations, trains the network on only a subset of them. The configurations are sampled at random from the search space, each sampled configuration is trained and evaluated, and the best-performing one is kept. Bayesian optimization uses ML techniques to choose the hyper-parameters themselves. It builds a model of how the results depend on the hyper-parameters and uses it to predict regions of the hyper-parameter space that might give better results. A Gaussian process is commonly used as this model. It is fitted to the results of previously conducted experiments with various parameter configurations and is then used to select the next configuration to try.
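The difference between grid search and random search can be sketched with the Python standard library alone. In the example below, `validation_loss` is a hypothetical stand-in for "train a model with this configuration and measure its validation loss". It is a simple made-up function with its minimum at a learning rate of 0.1 and a momentum of 0.9, so the example runs instantly. The function names, candidate values and ranges are likewise assumptions made for illustration.

```python
import itertools
import random


def validation_loss(learning_rate, momentum):
    """Hypothetical objective; in practice this would train and evaluate a model."""
    return (learning_rate - 0.1) ** 2 + (momentum - 0.9) ** 2


def grid_search(grid):
    """Evaluate every combination of the candidate values."""
    names = list(grid)
    best_config, best_loss = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        loss = validation_loss(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


def random_search(ranges, n_trials, seed=0):
    """Evaluate n_trials configurations sampled uniformly from the ranges."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.uniform(low, high)
                  for name, (low, high) in ranges.items()}
        loss = validation_loss(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


# Grid search: 4 * 3 = 12 evaluations, one per combination.
grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "momentum": [0.0, 0.5, 0.9]}
grid_best, grid_loss = grid_search(grid)

# Random search: the same budget of 12 evaluations, spread over continuous ranges.
ranges = {"learning_rate": (0.001, 1.0), "momentum": (0.0, 0.99)}
rand_best, rand_loss = random_search(ranges, n_trials=12)
```

Adding a third hyper-parameter with five candidate values would multiply the grid's cost to 60 evaluations, whereas the random-search budget stays at whatever `n_trials` is set to. Bayesian optimization is not shown here because it needs a surrogate model such as a Gaussian process, which is usually taken from a dedicated library.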