The full code for this project can be found at: https://github.com/imesh059/Credit-approval-Prediction
Credit approval is one of the critical tasks for a bank to handle, since most applicants fall neither at the clear-approval end nor the clear-rejection end: borderline applicants have to be evaluated carefully. A pitch or a discussion with the applicant can lead to a misjudgement, depending on the experience of the evaluating officer. In the real world there are several levels of credit cards, and approving each applicant at the right level plays a critical role in motivating customers to spend more and to stay with the company. An initial rejection without justification gives a bad image of the bank and, in the long run, damages its name, which is an intangible asset. Therefore, for this project, the credit approval data set from the UCI (cite) repository has been used.
Predicting whether an application will be approved is the final task, but getting there from the raw data set required a step-by-step approach.
The attributes in the data set have different types of entries (floats, integers, categorical and binary data), and to train a model, all attributes must be converted to numeric form. Therefore, the binary attributes were encoded as 1 and 0, and the multi-categorical attributes were encoded using one-hot encoding, where the integer-encoded variable is removed and a new binary variable is added for each unique category.
If a plain integer encoding were used for training, it could mislead the model by implying an ordinal relationship between the integers that, in many cases, does not exist at all. After one-hot encoding, the 15 attributes were expanded into 38 attributes.
Ex: A4: (u, y, l, t) => A4_u, A4_y, A4_l, A4_t
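The encoding step above can be sketched with pandas; the toy frame below reuses the A4 categories from the example (the data values are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking the A4 attribute from the credit data set
# (category names follow the example above; rows are illustrative).
df = pd.DataFrame({"A4": ["u", "y", "l", "t", "u"]})

# One-hot encode: the original column is removed and a 0/1 column
# is added for each unique category.
encoded = pd.get_dummies(df, columns=["A4"], dtype=int)
print(encoded.columns.tolist())
```

`pd.get_dummies` handles the whole frame at once, so in the real data set passing all categorical columns to `columns=` expands the 15 attributes into 38 in one call.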
Missing Value Handling
One of the most important steps in preprocessing is handling missing values. First, a clear understanding is needed of how the missing values are spread throughout the data set; they then have to be handled according to their distribution to achieve reasonable accuracy. Here, Little's MCAR test was performed to confirm that the values are missing completely at random. The potential bias due to missing data depends on the mechanism causing the data to be missing and on the analytical methods applied to amend the missingness. Little's test is a chi-square test of MCAR for multivariate quantitative data, which checks whether there is a significant difference between the means of the different missing-value patterns.
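Little's MCAR test itself is not part of the mainstream Python libraries, so the sketch below only covers the surrounding workflow: inspecting the missingness pattern per column and then filling in simple replacements (median for numeric, mode for categorical). The column names and values are made-up stand-ins for the UCI file, which marks missing entries with `?`:

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries marked '?', as in the raw UCI file.
df = pd.DataFrame({
    "A2": ["30.83", "?", "24.50", "27.83"],   # continuous, stored as text
    "A4": ["u", "y", "?", "u"],               # categorical
})

# Turn the '?' markers into real NaNs and fix the numeric dtype.
df = df.replace("?", np.nan)
df["A2"] = pd.to_numeric(df["A2"])

# Per-column missing counts: the first thing to inspect before imputing.
print(df.isna().sum())

# Simple imputation: median for the numeric column, mode for the categorical.
df["A2"] = df["A2"].fillna(df["A2"].median())
df["A4"] = df["A4"].fillna(df["A4"].mode()[0])
```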
Feature Engineering: Remove Outliers
Outliers are unusual values in a data set; they can distort statistical analyses and violate their assumptions. Therefore, the outliers have to be properly identified and removed. Since it is impossible to define outliers for categorical data, only the continuous numerical attributes were used for this step.
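The post does not say which detection rule was used, so as one common illustration, here is Tukey's IQR rule applied to a made-up continuous column:

```python
import numpy as np

# Toy continuous column with one extreme value.
x = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 50.0])

# Tukey's IQR rule: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = x[(x >= lo) & (x <= hi)]   # the extreme 50.0 is dropped
print(kept)
```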
Initially, the distributions of the 15 base attributes against the class attribute A16 were visualized. As Figure 3 shows, most attributes are well distributed between the two classes, while A8 and A11 show a clear bias, which helps a lot in classification.
Visualize with dimensional reduction
Here we have considered Multidimensional Scaling (MDS), Spectral Embedding for non-linear dimensionality reduction, Locally Linear Embedding (LLE), Isomap, t-SNE and PCA. Figure 4 shows the distribution along the highest-variance components of each embedding.
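All six methods follow the same `fit_transform` pattern in scikit-learn; the sketch below shows two of them (one linear, one non-linear) on random stand-in data, since the real 38-column matrix is not reproduced here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
# Stand-in for the encoded attribute matrix: 60 samples, 10 features.
X = rng.normal(size=(60, 10))

# Project to 2-D with a linear (PCA) and a non-linear (Isomap) method;
# each 2-D embedding can then be scatter-plotted, coloured by class.
X_pca = PCA(n_components=2).fit_transform(X)
X_iso = Isomap(n_neighbors=5, n_components=2).fit_transform(X)
print(X_pca.shape, X_iso.shape)
```

MDS, Spectral Embedding, LLE and t-SNE plug into the same two lines by swapping the estimator class.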
The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions. At each step, it selects individuals at random from the current population to be parents and uses them to produce the children for the next generation. Over successive generations, the population “evolves” toward an optimal solution. You can apply the genetic algorithm to a variety of optimization problems that are not well suited to standard optimization algorithms, including problems in which the objective function is discontinuous, non-differentiable, stochastic, or highly nonlinear. The genetic algorithm can also address mixed-integer programming problems, where some components are restricted to be integer-valued.
The genetic algorithm uses three main types of rules at each step to create the next generation from the current population:
1. Selection rules select the individuals, called parents, that contribute to the population at the next generation.
2. Crossover rules combine two parents to form children for the next generation.
3. Mutation rules apply random changes to individual parents to form children.
Implementation of Genetic Algorithm
The algorithm we implemented consists of four steps:
1. Initialization of the hyper-parameters
2. Selection of hyper-parameters for each generation
3. Crossover of the selected parents
4. Mutation of the children in each generation
Initially, six parameters were selected for optimization: learning_rate, n_estimators, max_depth, min_child_weight, colsample_bytree, and gamma. These parameters were randomly initialized, and limits were set on the range within which each can vary.
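The initialization step might look like the sketch below; the six parameter names come from the text, but the ranges and the population size are hypothetical choices for illustration:

```python
import random

random.seed(42)

# Hypothetical search ranges for the six XGBoost parameters named above.
PARAM_RANGES = {
    "learning_rate":    (0.01, 0.3),
    "n_estimators":     (50, 500),
    "max_depth":        (3, 10),
    "min_child_weight": (1, 10),
    "colsample_bytree": (0.5, 1.0),
    "gamma":            (0.0, 5.0),
}
INT_PARAMS = {"n_estimators", "max_depth", "min_child_weight"}

def random_individual():
    """Draw one parameter set uniformly within the allowed limits."""
    ind = {}
    for name, (lo, hi) in PARAM_RANGES.items():
        ind[name] = random.randint(lo, hi) if name in INT_PARAMS else random.uniform(lo, hi)
    return ind

# The initial population: one random parameter set per individual.
population = [random_individual() for _ in range(8)]
```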
Then, in the second step, 10-fold cross-validation accuracy was used to evaluate each model and measure its fitness; based on the fitness level, the parents were selected.
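A minimal sketch of such a fitness function follows. To keep it self-contained it uses a synthetic data set and scikit-learn's `GradientBoostingClassifier` as a stand-in for the `XGBClassifier` used in the project; only three of the six tuned parameters are wired up:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the credit data set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fitness(params):
    """Mean 10-fold CV accuracy: the score used to rank individuals."""
    model = GradientBoostingClassifier(
        learning_rate=params["learning_rate"],
        n_estimators=params["n_estimators"],
        max_depth=params["max_depth"],
        random_state=0,
    )
    return cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()

score = fitness({"learning_rate": 0.1, "n_estimators": 50, "max_depth": 3})
```

Ranking the population by this score and keeping the top individuals as parents implements the selection rule.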
There are various crossover methods for genetic algorithms, such as single-point, k-point and uniform crossover. Here, uniform crossover has been used, which selects each parameter for the child independently from either parent.
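Uniform crossover over a parameter dictionary can be sketched as below (the parent values are illustrative):

```python
import random

random.seed(1)

def uniform_crossover(parent_a, parent_b):
    """Copy each parameter from either parent with equal probability."""
    return {k: (parent_a if random.random() < 0.5 else parent_b)[k]
            for k in parent_a}

p1 = {"learning_rate": 0.05, "max_depth": 3, "gamma": 0.0}
p2 = {"learning_rate": 0.20, "max_depth": 8, "gamma": 2.5}
child = uniform_crossover(p1, p2)
```

Because each parameter is drawn independently, any mix of the two parents is possible, unlike single-point crossover, which can only swap contiguous blocks.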
Finally, in mutation, the parent parameters are changed by random amounts, which keeps the search unpredictable and well suited to the algorithm; however, there is a limit on how much each parameter can change.
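One way to sketch a bounded mutation is below; the mutation rate and step scale are assumed values, and the clipping back into the allowed range implements the "limit for the change" mentioned above:

```python
import random

random.seed(2)

def mutate(ind, ranges, rate=0.3, scale=0.1):
    """Perturb each parameter with probability `rate` by a small random
    step, then clip the result back into its allowed [lo, hi] range."""
    child = dict(ind)
    for name, (lo, hi) in ranges.items():
        if random.random() < rate:
            step = (hi - lo) * scale * random.uniform(-1, 1)
            child[name] = min(hi, max(lo, child[name] + step))
    return child

ranges = {"learning_rate": (0.01, 0.3), "gamma": (0.0, 5.0)}
parent = {"learning_rate": 0.1, "gamma": 1.0}
child = mutate(parent, ranges)
```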
When applying this algorithm, we carried out all the same preprocessing steps used before applying the other algorithms.
As Figure 11 above shows, the best model changes and improves over the generations.
Finally, after implementing all the algorithms and obtaining the results, the best algorithm for this data set had to be selected. For that, we considered the F1 score, 10-fold accuracy, recall and precision of each algorithm and selected the highest. The XGB classifier with genetic optimization and the neural network trained on the full data set came out hand in hand.
In conclusion, the raw data was first passed through preprocessing, which consisted of encoding the categorical data and replacing the missing values. The data was then visualized in several ways using dimensionality reduction methods; outliers were removed, the correlation matrix was checked, and the features were examined against domain understanding. Next, 11 different models were trained using classical machine learning algorithms, a neural network was trained on both the full data set and the PCA-reduced data set, and finally two models (the XGB classifier and the CatBoost classifier) were trained with genetic optimization for 100 generations. Based on the F1 score and accuracy, the selected models were the XGB classifier with genetic optimization and the neural network trained on the full data set.