
New Algorithm For Density Estimation and Noise Reduction

April 14, 2020
in Data Science

KernelML – Hierarchical Density Factorization

The purpose, problem statement, and potential applications came from this post on datasciencecentral.com. The goal is to approximate any multi-variate distribution using a weighted sum of kernels. Here, a kernel refers to a parameterized distribution. This method of using a decaying weighted sum of kernels to approximate a distribution is similar to a Taylor series, where a function can be approximated around a point using the function's derivatives. KernelML is a particle optimizer that uses parameter constraints and sampling methods to minimize a customizable loss function. The package uses a Cythonized backend and parallelizes operations across multiple cores with Numba. KernelML is now available on the Anaconda cloud and PyPI (pip). Please see the KernelML extension on the documentation page.
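As a rough sketch of the idea (using made-up centers, widths, and a decay factor rather than anything fitted by KernelML), a 1-D density can be written as a decaying weighted sum of uniform kernels:

    import numpy as np

    def uniform_kernel(x, center, width):
        # Uniform density on [center - width/2, center + width/2]
        return (np.abs(x - center) <= width / 2.0) / width

    def mixture_density(x, centers, widths, decay=0.5):
        # Geometrically decaying weights, normalized to sum to one
        weights = decay ** np.arange(len(centers))
        weights /= weights.sum()
        return sum(w * uniform_kernel(x, c, h)
                   for w, c, h in zip(weights, centers, widths))

    x = np.linspace(-3.0, 3.0, 601)
    approx = mixture_density(x, centers=[0.0, 1.0, -1.5],
                             widths=[2.0, 1.0, 0.8])

An optimizer's job is then to choose the centers and widths so that the mixture matches the empirical distribution.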


The approach targets three uses:

  • Approximate any empirical distribution
  • Build a parameterized density estimator
  • Outlier detection and dataset noise reduction

The solution I came up with was incorporated into a Python package, KernelML. The example code can be found here.

My solution uses the following:

  1. Particle Swarm/Genetic Optimizer
  2. Multi-Agent Approximation using IID Kernels
  3. Reinforcement Learning

Most kernels have hyper-parameters that control the mean and variance of the distribution. While these parameters are potentially differentiable, I decided against a gradient-based method: the gradients for the variance parameters can vanish, and constraining the variance makes the parameters non-differentiable. It therefore makes sense to use a mixed-integer or particle swarm strategy to optimize the kernels' hyper-parameters. I decided to use a uniform distribution kernel because of its robustness to outliers in higher dimensions.
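To see why gradients are unhelpful here, note that a uniform kernel's density is piecewise constant in its location and width parameters, so its gradient is zero almost everywhere. A minimal sketch (not KernelML's internal representation):

    import numpy as np

    def uniform_kernel_density(X, center, half_width):
        # Constant density inside the hyper-rectangle, zero outside;
        # the hard indicator makes the output piecewise constant in
        # `center` and `half_width`, so gradients vanish almost everywhere.
        inside = np.all(np.abs(X - center) <= half_width, axis=1)
        return inside / np.prod(2.0 * half_width)

    X = np.random.rand(1000, 3)
    density = uniform_kernel_density(X, center=np.full(3, 0.5),
                                     half_width=np.full(3, 0.25))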

Over the past year, I've independently developed an optimization algorithm to solve non-linear, constrained optimization problems. It is by no means perfect, but building it from scratch allowed me to 1) make modifications based on the task and 2) better understand the problem I was trying to solve.

My initial approach used a multi-agent strategy to simultaneously fit any multi-variate distribution. The agents, in this case the kernels, were independent and identically distributed (IID). I made an algorithm, called density factorization, to fit an arbitrary number of agents to a distribution. The optimization approach and details can be found here. The video below shows a frame-by-frame example of how the solution might look over the optimization procedure.

This algorithm seemed to perform well on non-sparse, continuous distributions. One problem was that it used IID kernels, which is an issue when modeling skewed data: every kernel has the same 1/K weight, where K is the number of kernels. In theory, hundreds of kernels could be optimized at once, but this solution lacked efficiency and granularity.

I decided to use a hierarchical, reinforcement-style approach to fitting the empirical multi-variate distribution. The initial reward, R_0, was the empirical distribution, and the discounted reward, R_1, represented the data points not captured by the initial multi-agent algorithm at R_0. Equation (1) shows the update process for the reward.
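Written out in symbols, a form of the update consistent with the definitions that follow is:

    R_(t+1) = lambda * p_t * U_t        (1)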

where p is the percentage of unassigned data points, R is the empirical distribution at step t, U is the empirical distribution of the unassigned data points, and lambda is the discount factor. The reward update is the product of p and lambda.

This works because, by definition, the space between data points increases as the density decreases. As data points in under-populated regions are clustered, the cluster sizes increase to capture new data points. Because the reward update is less than the percentage of unassigned data points, denser regions are represented multiple times before the algorithm moves on to the less dense regions.
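A minimal sketch of the hierarchical loop, with hypothetical fit_kernels and assign_points helpers standing in for the multi-agent optimization and membership steps (neither name comes from the KernelML API):

    import numpy as np

    def hierarchical_density_factorization(X, fit_kernels, assign_points,
                                           discount=0.9, coverage=0.99):
        # Repeatedly fit kernels to the points not yet captured, scaling
        # the reward down by discount * (fraction unassigned) each pass.
        unassigned = np.ones(len(X), dtype=bool)
        reward_scale, solutions = 1.0, []
        while unassigned.mean() > 1.0 - coverage:   # stop at target coverage
            kernels = fit_kernels(X[unassigned], reward_scale)
            captured = assign_points(X, kernels)    # boolean mask over X
            if not (captured & unassigned).any():   # no new points captured
                break
            solutions.append(kernels)
            unassigned &= ~captured
            reward_scale = discount * unassigned.mean()
        return solutions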

The algorithm uses pre-computed rasters to approximate the empirical density, which means that the computational complexity depends on the number of dimensions rather than the number of data points. The example below shows how the estimated and empirical distributions might look in the 2-D case.
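One way to pre-compute such a raster (not necessarily KernelML's exact scheme) is a fixed-resolution histogram, after which the optimizer's cost depends on the grid size rather than on the number of points:

    import numpy as np

    X = np.random.randn(10_000, 2)        # stand-in 2-D dataset
    counts, edges = np.histogramdd(X, bins=50)
    raster = counts / counts.sum()        # gridded empirical density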

After fitting the initial density factorization algorithm, the reward is updated by the discount factor to up-weight the data points that have not yet been captured. The plot below shows how the empirical distribution might look after a few updates.

The number of samples in each cluster must be greater than the minimum-leaf-sample parameter, which prevents clusters from accidentally modeling outliers by chance; this is mostly an issue in higher-dimensional space due to the curse of dimensionality. If a cluster does not meet this constraint, it is pruned from the cluster solution. The process continues until 1) a new cluster solution does not capture any new data points or 2) more than 99% of the data points have been captured (this threshold is also adjustable).
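The pruning rule itself is straightforward; a sketch over a hypothetical array of cluster labels, with -1 marking unassigned points:

    import numpy as np

    def prune_small_clusters(labels, min_leaf_samples=20):
        # Relabel points in clusters below the minimum-leaf-sample
        # threshold as unassigned (-1).
        labels = labels.copy()
        ids, counts = np.unique(labels[labels >= 0], return_counts=True)
        for cid in ids[counts < min_leaf_samples]:
            labels[labels == cid] = -1
        return labels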

As the input space increases in dimensionality, the Euclidean distance between data points increases. For example, for an input space of uniform random variables, the expected distance between points grows by a factor of sqrt(D), where D is the number of dimensions.
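The sqrt(D) scaling is easy to verify numerically; for uniform random points the expected distance grows like sqrt(D/6) as D increases:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (1, 2, 10, 100):
        a, b = rng.random((2, 5000, d))          # pairs of uniform points
        mean_dist = np.linalg.norm(a - b, axis=1).mean()
        # the ratio settles near sqrt(1/6) ~ 0.41 as d grows
        print(d, round(mean_dist, 3), round(mean_dist / np.sqrt(d), 3))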

To create a presentable example, the curse of dimensionality is simulated in 2-D by creating an under-sampled (sparse) training dataset and an over-sampled validation dataset. Two of the clusters were moved closer together to make cluster separation more difficult, as sketched below.
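A hypothetical version of that setup (the article does not give the exact cluster parameters) might look like:

    import numpy as np

    rng = np.random.default_rng(42)
    # Three Gaussian clusters; the last two centers are deliberately close.
    centers = np.array([[-4.0, 0.0], [2.0, 2.0], [3.0, 1.0]])

    def sample(n_per_cluster):
        return np.vstack([rng.normal(c, 1.0, size=(n_per_cluster, 2))
                          for c in centers])

    X_train = sample(50)       # under-sampled (sparse) training set
    X_valid = sample(2000)     # over-sampled validation set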

The density can be estimated by counting the number of clusters assigned to each data point. Because the solution is parameterized, it can be applied to the validation dataset after training. The plot below shows the histogram of the density estimate after running the model on the training dataset.
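Because dense regions end up covered by multiple overlapping clusters, a point's density estimate can simply be the number of clusters containing it; a sketch using hypothetical axis-aligned box clusters:

    import numpy as np

    def density_estimate(X, boxes):
        # Count, for each point, how many (center, half_width) boxes
        # contain it; overlapping boxes mean denser regions score higher.
        counts = np.zeros(len(X), dtype=int)
        for center, half_width in boxes:
            counts += np.all(np.abs(X - center) <= half_width, axis=1)
        return counts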

The density can be used to visualize the denser areas of the data. The green rings show the true distributions' two-standard-deviation threshold. The plot below visualizes the density for the training dataset.

Outliers can be defined by a percentile of the density estimate, e.g., the 5th or 10th. The plot below shows the outliers defined by the 10th percentile; the green rings show the true distributions' two-standard-deviation threshold.
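Thresholding the density estimate at a percentile is then a one-liner:

    import numpy as np

    def flag_outliers(density, percentile=10):
        # Points whose density falls below the chosen percentile
        return density < np.percentile(density, percentile)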

The plot below shows the histogram of the density estimate for the validation dataset.

The plot below visualizes the density for the validation dataset. The green rings show the true distributions' two-standard-deviation threshold.

The outliers, defined by the 10th percentile, are visualized below. The green rings show the true distributions' two-standard-deviation threshold.

This particular use case focused on outlier detection. However, the algorithm also provides cluster assignments and density estimates for each data point. Other outlier detection methods, e.g., local outlier factor (LOF), can produce similar results. LOF depends on the number of nearest neighbors and the contamination parameter; while these are easy to tune in 2-D, they are not so easy to tune in multiple dimensions. Hierarchical density factorization provides a robust method to fit multi-variate distributions without extensive hyper-parameter tuning. While the algorithm's cost does not depend on the number of data points, it is still relatively slow, and many improvements could be made to its efficiency and speed. The example notebook includes a comparison to LOF and a multi-variate example using the Pokemon dataset.
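For reference, the LOF baseline mentioned above can be run in scikit-learn as follows; n_neighbors and contamination are exactly the parameters that become hard to tune in higher dimensions:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    X = np.random.randn(1000, 2)               # stand-in dataset
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
    is_outlier = lof.fit_predict(X) == -1      # fit_predict marks outliers -1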

Originally posted here.


Credit: Data Science Central. By: Rohan Kotwani
