Missing values in a dataset can significantly affect analysis, model training, and predictions, depending on a few different factors. Let's dig in and see.
In statistics, missing data or missing values occur when no data value is stored for a variable in an observation or instance. For a single instance, this could be a corrupted or unrecorded data point; across the whole dataset, the missingness could follow a pattern or, on the other hand, be completely random.
For many reasons, it is important to categorize the pattern of missing values:
1. Univariate or multivariate: either only one attribute has missing values, or more than one attribute does.
2. Monotone or non-monotone: if a variable stops being recorded after a certain instance and is never recorded again, the pattern is monotone.
3. Connected or unconnected: the pattern is connected if any observed data point can be reached from any other observed data point by moving along horizontal or vertical paths.
4. Planned or random: the missingness may be part of the study design (for example, questions deliberately skipped for some subjects) or may occur with no design at all.
The above categories can be determined by inspection, but as a dataset grows from kilobytes to a few hundred megabytes, it becomes impossible to visualize the data and come to a conclusion.
Determining the randomness of missing data
There are three main mechanisms of missing values: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). To help determine which applies, the Journal of the American Statistical Association published a method in 1988 called "Little's test of missing completely at random", which is useful for testing the assumption of missing completely at random for multivariate, partially observed quantitative data.
Missing values have to be handled according to their distribution to achieve reasonable accuracy in the trained model. To confirm the randomness of missing values, Little's MCAR test should be performed; the potential bias due to missing data depends on the mechanism causing the data to be missing. Little's test is a chi-square test of MCAR for multivariate quantitative data: it tests whether there is a significant difference between the means of the different missing-value patterns.
When handling missing values in machine learning, the preprocessing stage usually addresses missing data. As a rule of thumb, if more than 20% of the values in an attribute are missing, it is hard to fill or recover them without affecting training. The best solution is usually to drop such attributes and fill the remaining ones using a proper mechanism.
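As a quick sketch of that rule of thumb, the per-attribute missing fraction can be computed with pandas and used to drop columns above the 20% threshold. The tiny DataFrame below is made up for illustration; it is not one of the datasets discussed in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "c" is 80% missing, well past the
# 20% threshold discussed above, while "a" and "b" sit right at it.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],
    "c": [np.nan, np.nan, np.nan, 4.0, np.nan],
})

# Fraction of missing values per attribute (mean of the boolean mask).
missing_frac = df.isna().mean()

# Keep only attributes with at most 20% missing values.
threshold = 0.20
kept = df.loc[:, missing_frac <= threshold]
print(list(kept.columns))  # "c" is dropped; "a" and "b" survive
```

The surviving attributes can then be imputed with one of the mechanisms described below.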
Here are some examples of using Little's MCAR test:
1. CKD dataset from the UCI repository
When considering the dataset below, the last 4 attributes have to be dropped because of their high proportion of missing values.
Little's MCAR test results:
The table above clearly shows that the p-value of the test is zero. Since it is less than 0.05, we reject the null hypothesis of Little's test; that is, the missing data is not Missing Completely at Random (MCAR).
2. Credit approval dataset from the UCI repository
Missing value handling methods
- Delete all instances with at least one missing value: if you use this method, the remaining dataset should be large enough to represent the population without those instances.
- Recover by redoing the experiment or contacting the subject: this is one of the most expensive ways to recover values; redoing an experiment takes time and costs money, and if it is a public-opinion kind of dataset, regulations like the GDPR may restrict contacting the subject again.
- Domain-expert guessing: by consulting domain experts it is possible to estimate missing values from their experience, but this method is slow and costly due to experts' high hourly rates.
- Fill with the mean: this is one of the easiest and most common imputation methods: fill each missing value with the attribute's average. It is not always a good choice because it reduces the variability of the data, but in some cases it makes sense.
- Fill with the median: similar to the above method, the problem is that it reduces the variability of the data. Moreover, if the missing values are widely distributed, it can also increase the noise in the data.
- Using machine learning algorithms: there are two main ways to use ML for imputation. The first is multiple-regression analysis: build a regression equation from the rest of the data and use it to predict the missing values; this method needs enough complete instances to train the model. The second, and most popular, is multiple imputation: several plausible values are simulated for each missing entry based on the correlations in the data, random errors are incorporated into the predictions, and the simulated datasets are then averaged.
Here is one of my favourite missing-value imputers: KNNImputer. The KNNImputer class fills in missing values using the k-Nearest Neighbors approach. Usually, a Euclidean distance metric that supports missing values is used to find the nearest neighbours, but depending on the dataset you can use whichever distance metric suits it better; in advanced cases, you can compute a weighted average over the neighbours and consider the centroid of the included points. Each missing value is imputed from the attribute's average over a predefined number of nearest neighbours. One of the key problems is imputing missing values for outliers: even when you don't have all the required points nearby, you can take the average or weighted average of the points that do fall close to that instance, for which you have to define the maximum distance at which two instances are considered neighbours.
The main benefit of the above method is that it does not require training a predictive model: it relies only on the distances and spatial structure of the data points.
For an easy implementation, try sklearn.impute.KNNImputer.
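A minimal sketch of that class on a made-up matrix: KNNImputer uses the `nan_euclidean` distance (which skips missing coordinates) to find the `n_neighbors` closest rows, then fills each gap with the average of those neighbours' values for that feature.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up matrix: the NaN in row 2 will be filled from the values of
# that column in the two rows closest to row 2 (by nan_euclidean distance).
X = np.array([[1.0, 2.0, 4.0],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Setting `weights="distance"` instead gives closer neighbours more influence, which matches the weighted-average variant described above.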
In addition, if you prefer more sophisticated methods, neural networks and tree-based models are always there to try.