Richter’s Predictor— Data Challenge from DrivenData | by data4help

October 14, 2020

Before moving on, it is important to validate that the step we just took had a positive effect on the overall performance of the model. We do that by quickly running the new data through our baseline random forest model. Our accuracy is now 73.14%, a slight improvement over the baseline model!

Our performance has increased. That tells us that we took a step in the right direction.
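As a point of reference, a quick validation run of this kind could look like the sketch below. This is not the authors' exact code; the file and column names follow the DrivenData challenge's data dictionary, and the hyperparameters are illustrative.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the competition data (file and column names as provided
# by the DrivenData challenge; adjust paths as needed)
X = pd.read_csv("train_values.csv", index_col="building_id")
y = pd.read_csv("train_labels.csv", index_col="building_id")["damage_grade"]

# Baseline random forest, evaluated with 5-fold cross-validation;
# we restrict ourselves to the numeric columns for this quick check
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
scores = cross_val_score(model, X.select_dtypes("number"), y, cv=5)
print(f"Mean accuracy: {scores.mean():.2%}")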


Arguably the most important set of variables within this challenge is the information on where a house is located. This makes sense intuitively: if a house is located closer to the epicenter, then we would also expect a higher damage grade.

The location information provided within this challenge is threefold: we get three different geo-identifiers with different levels of granularity. For simplicity, we regard the three identifiers as describing a town, a district, and a street (see below).

These geo-identifiers in their initial state are given in a simple numeric format, as can be seen below.

These integers do not prove to be very useful: even though they are in a numeric format, they do not exhibit any correlation with the target (see graphic below). That is, a higher identifier value is not associated with higher or lower damage, which makes it difficult for the model to learn from these variables.

In order to create a more meaningful variable for the model to learn from, we apply a powerful tool, oftentimes used in data science challenges, called encoding. Encoding is normally used to transform categorical variables into a numeric format. At first glance this might not seem to apply to our case, since the geo-identifiers are already given as numeric variables. However, this understanding of encoders is shortsighted: whether something represents a categorical variable does not depend on its format, but on the interpretation of the variable. Hence, these variables could gain greatly in importance when undergoing such a transformation!

There are a dozen different encoding methods, which are nicely summarized in this blogpost. The most promising method for our case is target encoding, which replaces each category with the average target value of that group.
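As a sketch, plain target encoding of the first geo-identifier could look like this (assuming train and test are DataFrames with the challenge's geo_level_1_id column, and train carries the damage_grade target):

# Mean damage grade per town, learned on the training data only
# so that no test-set information leaks into the encoding
town_means = train.groupby("geo_level_1_id")["damage_grade"].mean()

# Replace the identifier with its group's mean target value;
# towns unseen in training fall back to the global mean
train["geo1_encoded"] = train["geo_level_1_id"].map(town_means)
test["geo1_encoded"] = (test["geo_level_1_id"]
                        .map(town_means)
                        .fillna(train["damage_grade"].mean()))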

Unfortunately, it is not that easy. This method may work fine for the first geo-identifier (town), but it has serious drawbacks for the more granular second and third geo-identifiers (district and street). The reason is that many districts and streets occur only with very low frequency. In these cases, the mean target value of a group with a small sample size is not representative of the target distribution of the group as a whole, and the estimate therefore suffers from high variance as well as high bias. This problem is quite common when dealing with categorical variables of high cardinality.
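The group sizes make the issue tangible. A quick count (again using the challenge's geo_level_3_id column for the street level; the threshold of ten is arbitrary) shows how many streets carry only a handful of observations:

# Number of training observations per street identifier
street_counts = train["geo_level_3_id"].value_counts()

# Streets with fewer than ten observations have group means
# that are too noisy to be used as an encoding directly
small = (street_counts < 10).sum()
print(f"{small} of {street_counts.size} streets have fewer than 10 rows")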

One workaround for this problem is a mixture of Empirical Bayes and shrinkage, motivated by [1]. Here, the encoding of a subgroup is the weighted average of the subgroup's mean target value and the mean of its prior.

In our example, that means the encoded value for a certain street is the weighted average of the mean target value of the observations on this street and the mean of the district this street is in (one variable level higher). This method shrinks the influence of the potentially few observations for one street and takes the bigger picture into account, thereby reducing the overfitting problem shown before when we had only a couple of observations for a given street.

The question may now arise how we determine the weighting factor lambda. Using the methodology of [1], lambda for a subgroup with n observations can be written as

lambda = n / (n + m), with m = sigma^2 / tau^2

where m is the ratio of the variance within the group (street, sigma^2) to the variance of the main group (district, tau^2). That formula makes intuitive sense when we consider a street with a few observations whose damage grades differ massively. The mean damage grade of this street would suffer from high bias and variance (high sigma). If this street lies in a low-variance district (low tau), it is sensible to drag the street's mean towards the district's, and this is essentially what the m coefficient captures.
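Below is a minimal sketch of how such a shrunken encoding could be computed with pandas, using the lambda = n / (n + m) form above. This is our reading of [1], not the authors' exact implementation, and the column names again follow the challenge's data dictionary.

import numpy as np
import pandas as pd

def shrunk_encoding(df, group, parent_group, target):
    """Empirical Bayes-style target encoding: shrink each subgroup's
    mean towards its parent group's mean, weighted by lambda = n/(n+m)."""
    stats = df.groupby(group)[target].agg(["mean", "var", "count"])

    # Map each subgroup (street) to its parent (district) and fetch
    # the parent's mean and variance (the prior)
    parent = df.groupby(group)[parent_group].first()
    prior = df.groupby(parent_group)[target].agg(["mean", "var"])
    prior_mean = prior["mean"].reindex(parent.values).to_numpy()
    tau2 = prior["var"].reindex(parent.values).to_numpy()

    # m = sigma^2 (within street) / tau^2 (within district)
    m = stats["var"].to_numpy() / np.where(tau2 > 0, tau2, np.nan)
    lam = stats["count"].to_numpy() / (stats["count"].to_numpy() + m)
    lam = np.nan_to_num(lam)  # undefined variance -> rely on the prior

    return pd.Series(lam * stats["mean"].to_numpy()
                     + (1 - lam) * prior_mean, index=stats.index)

# Encode streets, shrinking each street mean towards its district
street_enc = shrunk_encoding(train, "geo_level_3_id",
                             "geo_level_2_id", "damage_grade")
train["geo3_encoded"] = train["geo_level_3_id"].map(street_enc)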

It is worth mentioning that the in-sample model performance will drop when applying the Empirical Bayes shrinkage method compared to a plain target encoder. This is not surprising, since we were dealing with an overfitted model before.

Lastly, we run our model again to see whether our actions improved the overall model performance. The resulting F1 score of 76.01% tells us that our changes resulted in an overall improvement.

At this point, it is fair to ask whether we need all the variables we currently use in our prediction model. If possible, we would like to work with as few features as possible (parsimony) without losing too much in our scoring metric.

One benefit of working with tree models is the ability to inspect feature importances. This metric indicates how important each feature is to the model's predictions. The following code and graph display the variables nicely.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Extract and sort the feature importances of the fitted random forest
fimportance = main_rmc_model["model"].feature_importances_
fimportance_df = pd.DataFrame()
fimportance_df.loc[:, "f_imp"] = fimportance
fimportance_df.loc[:, "col"] = dummy_train.columns
fimportance_df.sort_values(by="f_imp", ascending=False, inplace=True)

# Horizontal bar chart: one bar per feature, sorted by importance
fig, ax = plt.subplots(1, 1, figsize=(12, 24))
ax = sns.barplot(x="f_imp", y="col",
                 data=fimportance_df,
                 palette="GnBu_d")
ax.tick_params(axis="both", which="major", labelsize=20)
ax.set_xlabel("Feature Importance in %", fontsize=24)
ax.set_ylabel("Features", fontsize=24)
path = r"{}feature_importance.png".format(output_path)
fig.savefig(path, bbox_inches="tight")

As we can see from the graph above, the most important variables for predicting the damage grade of a house are the average damage grades of the different geo-locations. This makes sense, since the level of destruction of one house is likely to be correlated with the average damage of the houses around it.

06.1 Low importance of binary variables

The feature importance also shows that nearly all binary variables have a low importance, meaning they provide the model with little to no predictive information. To understand this better, we take a look at the mean of every binary variable, which is a number between zero and one.

# Mean of every binary variable (i.e. the share of ones per column)
binary_variables = variable_dict["binary_variables"]
mean_binary = pd.DataFrame(dummy_train.loc[:, binary_variables].mean())
mean_binary.loc[:, "type"] = mean_binary.index

# Bar chart of the means; most bars sit well below 0.1
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
ax = sns.barplot(x="type", y=0,
                 data=mean_binary,
                 palette="GnBu_d")
ax.tick_params(axis="both", which="major", labelsize=16)
ax.set_xticklabels(mean_binary.loc[:, "type"], rotation=90)
path = r"{}binaries_mean.png".format(output_path)
fig.savefig(path, bbox_inches="tight")

As can be seen above, nearly all variables have a mean below ten percent, implying that most rows are equal to zero, a phenomenon we normally describe as sparsity. Furthermore, it is visible that the binary variables with a mean above ten percent also have a higher feature importance within our prediction model.

This finding is in line with the fact that tree models, and especially bagging models like the Random Forest we use, do not work well with sparse data. Furthermore, a binary variable that is nearly always zero (e.g. has_secondary_usage_school) simply does not carry much meaning, given its low correlation with the target.

Using cross-validation, we find that keeping only the features with an importance of at least 0.01% leaves us with the same F1 score as using all features. This leaves us with 53 variables in total, which, relative to the number of rows we have (260k), seems reasonable and therefore appropriate for the task.
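A sketch of that pruning step (reusing fimportance_df and dummy_train from the snippet above; model and y stand in for the estimator and target from the earlier sketches, and the micro-averaged F1 scorer is an assumption based on the challenge's metric):

from sklearn.model_selection import cross_val_score

# Keep only the features whose importance is at least 0.01%
keep = fimportance_df.loc[fimportance_df["f_imp"] >= 0.0001, "col"]
print(f"Keeping {len(keep)} of {dummy_train.shape[1]} features")

# Verify via cross-validation that the F1 score does not drop
scores = cross_val_score(model, dummy_train[keep], y,
                         cv=5, scoring="f1_micro")
print(f"Micro F1 with the reduced feature set: {scores.mean():.2%}")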

One of our key learnings in this challenge was how to handle the massive imbalance of the target variable. Namely, not to touch it at all!

Looking at the chart below, we can see that the first damage grade appears far less often than the second. It may be tempting to apply some over- or undersampling to better balance the data and show the model an equal number of each damage grade. The main problem with this approach is that the test data comes from the same (imbalanced) distribution as the training data. Improving the accuracy for the rarest damage grade through sampling methods therefore comes at the cost of lower accuracy on the most frequent, and therefore more important, damage grade two.
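Checking the class balance is a one-liner, and one common way to respect the imbalance without resampling is to keep the folds stratified during cross-validation. A sketch (the stratification is our suggestion, not necessarily the authors' setup):

from sklearn.model_selection import StratifiedKFold

# Relative frequency of each damage grade in the training data
print(train["damage_grade"].value_counts(normalize=True))

# Stratified folds preserve the imbalanced distribution in every
# split, so validation scores reflect the real class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)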

07.1 Performance & Concluding Remarks

Following all the steps of this blogpost (with a few minor tweaks and hyperparameter tuning) led us to finish 34th out of 2,861 competitors.

We are overall quite happy with the placement, given the amount of work we put in. This challenge touched on many different aspects and taught us a lot. Data science challenges are a perfect learning opportunity since they are very close to real-life problems.

We are looking forward to our next one!

References

[1] Micci-Barreca, Daniele (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explorations, 3(1), 27–32. doi:10.1145/507533.507538

Credit: BecomingHuman By: data4help
