After understanding the Financial Inclusion in Africa competition page and different sections in it, let’s solve the problem provided.
1. Load the Dataset
Make sure you have downloaded the dataset provided in the competition. You can download the dataset here.
Import important python packages.
Load the train and test dataset.
Let’s observe the shape of our datasets.
train data shape : (23524, 13)
test data shape : (10086, 12)
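Assuming the competition files are named Train_v2.csv and Test_v2.csv (the test filename is mentioned later in the article when discussing the submission file), the loading step might be sketched as:

```python
import pandas as pd

def load_data(train_path="Train_v2.csv", test_path="Test_v2.csv"):
    """Read the competition train and test files into DataFrames."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    return train, test

# Example usage (prints the shapes quoted above):
# train, test = load_data()
# print("train data shape :", train.shape)
# print("test data shape :", test.shape)
# train.head()  # first five rows
```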
The above output shows the number of rows and columns for the train and test datasets. The train dataset has 13 variables: 12 independent variables and 1 dependent variable. The test dataset has the 12 independent variables only.
We can observe the first five rows of our dataset by using the head() method from the pandas library.
It is important to understand the meaning of each feature so you can really understand the dataset. You can read the VariableDefinition.csv file to understand the meaning of each variable presented in the dataset.
The SubmissionFile.csv gives us an example of how our submission file should look. This file will contain the uniqueid column combined with the country name from the Test_v2.csv file and the target we predict with our model. Once we have created this file, we will submit it to the competition page and obtain a position on the leaderboard.
2. Understand the Dataset
We can get more information about the features presented by using the info() method from pandas.
The output lists each variable/feature, its size, whether it contains missing values, and its data type. The dataset has no missing values; 3 features are of integer type and 10 are of object type.
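For readers who prefer an explicit check over scanning the info() output, a small hypothetical helper (not from the article) can compute the same facts:

```python
import pandas as pd

def summarize(df):
    """Count missing values and features per data type, as info() reports."""
    return {
        "missing_values": int(df.isnull().sum().sum()),
        "int_features": sum(t.kind == "i" for t in df.dtypes),
        "object_features": sum(t == "object" for t in df.dtypes),
    }
```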
If you want to learn how to handle missing data in your dataset, I recommend reading 'How to Handle Missing Data with Python' by Jason Brownlee.
I won’t go further into understanding the dataset because I have already published an article about exploratory data analysis (EDA) with the Financial Inclusion in Africa dataset. You can read and download the EDA notebook at the link below.
3. Data Preparation for Machine Learning
Before you train the model for prediction, you need to perform data cleaning and preprocessing. This is a very important step; your model will not perform well without these steps.
The first step is to separate the independent variables and the target (bank_account) from the train data, then transform the target values from the object data type into numerical values by using LabelEncoder.
The target values have been transformed into a numerical data type: 1 represents ‘Yes’ and 0 represents ‘No’.
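That encoding step, sketched on a hypothetical slice of the target column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical sample of the bank_account target column
y = pd.Series(["Yes", "No", "No", "Yes"], name="bank_account")

le = LabelEncoder()
y_encoded = le.fit_transform(y)
# LabelEncoder orders classes alphabetically, so 'No' -> 0 and 'Yes' -> 1
```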
I have created a simple preprocessing function to handle the conversion of the categorical features into numerical values. The same preprocessing function will be applied to both the train and test independent variables.
Preprocess both train and test dataset.
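The article's actual preprocessing code is not reproduced here; a minimal sketch that performs the same kind of conversion (one-hot encoding every categorical column, which is consistent with the feature count growing beyond the original 12) might be:

```python
import pandas as pd

def preprocess(df):
    """One-hot encode every object-typed column; numeric columns pass through."""
    cat_cols = df.select_dtypes(include="object").columns
    return pd.get_dummies(df, columns=list(cat_cols))
```

Note that pd.get_dummies can produce different columns for train and test if a category appears in only one of them; in that case the test frame should be reindexed to the train columns before prediction.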
Observe the first row in the train data.
Observe the shape of the train data.
Now we have more independent variables than before (37 variables). This doesn’t mean all these variables are important to train our model. You need to select only important features that can increase the performance of the model. But I will not apply any feature selection technique in this article; if you want to learn and know more about feature selection techniques, I recommend you read the following articles:
4. Model Building and Experiments
A portion of the train dataset will be used to evaluate our models and find the one that performs best before applying it to the test dataset.
Only 10% of the train dataset will be used for evaluating the models. The parameter stratify=y_train ensures that the class proportions (‘Yes’ and ‘No’) are preserved in both the train and validation sets.
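That split, sketched on toy data with scikit-learn's train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy feature matrix and balanced binary target
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# hold out 10% for validation, preserving the class proportions
X_train, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)
```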
I have selected five algorithms for this classification problem to train and predict who is most likely to have a bank account.
From these algorithms, we can find the one that performs better than the others. We will start by training these models using the train set after splitting our train dataset.
After training the five models, let’s use them to predict on our evaluation set and see how they perform. We will use the evaluation metric provided on the competition page, which states:
The evaluation metric for this challenge will be the percentage of survey respondents for whom you predict the binary ‘bank account’ classification incorrectly.
This means the lower the incorrect percentage we get, the better the model performance.
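This metric is simply one minus accuracy; a tiny sketch with toy labels:

```python
from sklearn.metrics import accuracy_score

# toy true labels and predictions
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# competition metric: fraction of respondents classified incorrectly
error_rate = 1 - accuracy_score(y_true, y_pred)  # -> 0.25
```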
The XGBoost classifier performs better than the other models, with an error rate of 0.110.
Let’s check the confusion matrix for the XGB model.
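A sketch of that check with scikit-learn's confusion_matrix, on toy labels:

```python
from sklearn.metrics import confusion_matrix

# toy validation labels and model predictions
y_val = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

# rows = true class, columns = predicted class
cm = confusion_matrix(y_val, y_pred)
```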
Our XGBoost model performs well at predicting class 0 but poorly at predicting class 1. This may be caused by the imbalance in the data provided (the target variable has more ‘No’ values than ‘Yes’ values). You can learn the best ways to deal with imbalanced data here.
One way to increase model performance is to apply grid search, a parameter-tuning approach that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.
The above source code evaluates which values of min_child_weight, gamma, subsample, and max_depth give us better performance.
Let’s use these parameter values and see if the XGB model performance will increase.
Error rate of the XGB model: 0.10837229069273269
Our XGB model’s error rate has improved from the previous 0.110 to 0.108.
5. Making the First Submission
After improving the XGBoost model’s performance, let’s now see how the model performs on the competition test dataset and how we rank on the competition leaderboard.
First, we make predictions on the competition test data set.
Then we create a submission file according to the instructions provided in the SubmissionFile.csv.
Let’s observe a sample of the results from our submission DataFrame.
Save the results in a CSV file.
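Putting those steps together on a hypothetical slice of the test set; the "<uniqueid> x <country>" id format is an assumption based on the SubmissionFile.csv description earlier in the article:

```python
import pandas as pd

# hypothetical slice of the competition test set and model predictions
test = pd.DataFrame({"uniqueid": ["uniqueid_1", "uniqueid_2"],
                     "country": ["Kenya", "Rwanda"]})
predictions = [1, 0]  # stand-in for model.predict(processed_test)

submission = pd.DataFrame({
    # assumption: the submission id combines uniqueid and country name
    "uniqueid": test["uniqueid"] + " x " + test["country"],
    "bank_account": predictions,
})
submission.to_csv("first_submission.csv", index=False)
```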
We named our submission file first_submission.csv. Now we can upload it to the Zindi competition page by clicking the Submit button and selecting the file to upload. You also have the option to add a comment for each submission.
Then click the submit button to upload your submission file. Congratulations, you just made your first Zindi submission! The system will evaluate your results according to the evaluation methods for this competition.
Now you can see that the performance of our XGB model on the provided test dataset is 0.109.
You can also see your position on the Leaderboard.
In this article, I have given an overview of how to make your first submission to a Zindi competition. I suggest you take further steps to handle the imbalance in the data and find alternative feature engineering and selection techniques you can apply to increase your model’s performance, or try other machine learning algorithms. If you get stuck, don’t forget to ask for help on the discussion boards!
You can access the notebook for this article in the link below.
You can watch an interview with the CEO of Zindi Africa, Celina Lee on AI Kenya Podcast.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Feel free to leave a comment too. Till then, see you in the next post! I can also be reached on Twitter @Davis_McDavid
One last thing: Read more articles like this in the following links.