Credit: Data Science Central

In this data science article, emphasis is placed on *science*, not just on data. State-of-the-art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications.

**1. Introduction**

In a previous article (see here) I attempted to approximate a random variable representing real data by a weighted sum of simple *kernels*, such as independent, identically distributed uniform random variables. The purpose was to build Taylor-like series approximations to more complex models (each term in the series being a random variable), to

- avoid over-fitting,
- approximate any empirical distribution (the inverse of the percentiles function) attached to real data, with the distribution attached to the first few terms of the weighted sum representation,
- easily compute data-driven confidence intervals regardless of the underlying distribution,
- derive simple tests of hypothesis,
- perform model reduction,
- optimize data binning to facilitate feature selection, and to improve visualizations of histograms,
- create perfect histograms,
- build simple density estimators,
- perform interpolations, extrapolations, or predictive analytics,
- perform clustering and detect the number of clusters.

While I found very interesting properties of stable distributions during this research project (see my previous article), I could not come up with a solution to all these problems. The fact is that these weighted sums would usually converge (in distribution) to a normal distribution if the weights did not decay too fast — a consequence of the central limit theorem. And even when using uniform kernels (as opposed to Gaussian ones) with fast-decaying weights, the sum would converge to an almost symmetrical, Gaussian-like distribution. In short, very few real-life data sets could be approximated by this type of modeling.
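This behavior is easy to reproduce by simulation. The sketch below (the kernel family U(–1, 1) and the weight sequence 1/*k* are illustrative choices, not taken from the previous article) draws many weighted sums of uniform kernels and checks that the sample skewness is near zero, i.e., the result is almost symmetrical and Gaussian-like:

```python
import random
import statistics

def weighted_sum(n_terms=50, rng=random):
    """One draw of SUM over k of p_k * U_k, with uniform kernels
    U_k ~ U(-1, 1) and slowly decaying weights p_k = 1/k."""
    return sum(rng.uniform(-1, 1) / k for k in range(1, n_terms + 1))

random.seed(0)
samples = [weighted_sum() for _ in range(20_000)]

# Sample skewness close to 0 indicates a symmetric, Gaussian-like
# limit, no matter how skewed the target distribution is.
m = statistics.fmean(samples)
s = statistics.pstdev(samples)
skew = statistics.fmean([((x - m) / s) ** 3 for x in samples])
```

This is precisely why weighted sums cannot match a skewed target such as the negative binomial used later in this article.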

*Source for picture: SAS blog*

I also tried with independently but NOT identically distributed kernels, and again failed to make any progress. By not identically distributed kernels, I mean basic random variables from the same family, say with a uniform or Gaussian distribution, but with parameters (mean and variance) that are different for each term in the weighted sum. The reason is that sums of Gaussians, even with different parameters, are still Gaussian, and sums of Uniforms end up being Gaussian too, unless the weights decay fast enough. Details about why this happens are provided in the appendix.

Now, in this article, starting in the next section, I offer a full solution, using mixtures rather than sums. The possibilities are endless.

**2. Approximations Using Mixture Models**

The problem is specified as follows. You have a univariate random variable *Y* that represents any of the quantitative features in your data set, and you want to approximate it by a mixture of *n* elementary independent random variables called *kernels*, denoted as *X*(*n*, *k*) for *k* = 1, …, *n*, with decreasing probability weights *p*(*n*, *k*) that converge to zero. The approximation of *Y* based on the first *n* kernels is denoted as *Y*(*n*). By approximation, we mean that the data-generated empirical distribution of *Y* is well approximated by the known, theoretical distribution of *Y*(*n*), and that as *n* tends to infinity, the two become identical (hopefully).

Moving forward, *N* denotes your sample size, that is, the number of observations; *N* can be very large, even infinite, but you want to keep *n* as small as possible. Generalizations to the multivariate case are possible but not covered in this article. The theoretical version of this consists in approximating any known statistical distribution (not just empirical distributions derived from data sets) by a small mixture of elementary (also called atomic) kernels.

In statistical notation, the density of the mixture is the weighted average of the kernel densities:

*f*[*Y*(*n*)](*y*) = *p*(*n*, 1) · *f*[*X*(*n*, 1)](*y*) + … + *p*(*n*, *n*) · *f*[*X*(*n*, *n*)](*y*),

where *f*[*X*(*n*, *k*)] denotes the density of the kernel *X*(*n*, *k*), and the weights *p*(*n*, *k*) are positive and sum to one.

We also want *Y*(*n*) to converge to *Y*, in distribution, as *n* tends to infinity. This implies that for large *n*, the weights *p*(*n*, *k*) must tend to zero as *k* tends to infinity.
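To make the setup concrete, sampling from *Y*(*n*) amounts to picking a kernel at random according to the weights, then drawing from that kernel. A minimal sketch (the two uniform kernels and the weights 0.7 and 0.3 are toy values, not taken from the article):

```python
import random

def sample_mixture(kernels, weights, rng=random):
    """One draw from Y(n): pick kernel k with probability p(n, k),
    then draw from X(n, k). Weights must sum to 1."""
    u, acc = rng.random(), 0.0
    for draw, p in zip(kernels, weights):
        acc += p
        if u <= acc:
            return draw()
    return kernels[-1]()  # guard against floating-point rounding

# Toy example: two uniform kernels with disjoint supports,
# weights p(2, 1) = 0.7 and p(2, 2) = 0.3 (assumed values)
random.seed(1)
kernels = [lambda: random.uniform(0, 5), lambda: random.uniform(5, 20)]
samples = [sample_mixture(kernels, [0.7, 0.3]) for _ in range(10_000)]
```

About 70% of the draws land in the first kernel's support, which is exactly what the weights prescribe.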

**The error term**

There are various ways to define the distance between two distributions, say *Y*(*n*) and *Y*. See here for details; one of the most popular ones is the Kolmogorov-Smirnov metric. Regardless of the metric used, the error term is denoted as *E*(*n*) = || *Y* – *Y*(*n*) ||. Of course, the problem, for a given value of *n*, is to minimize *E*(*n*). As *n* tends to infinity, by carefully choosing the parameters in the model (that is, the weights, as well as the means and variances of the kernels), the error *E*(*n*) is supposed to converge to 0. Note that the kernels are independent random variables, but not identically distributed: a mix of kernels with different means and variances is not only allowed, but necessary to solve this optimization problem.
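As an illustration, the Kolmogorov-Smirnov distance can be approximated on a grid of evaluation points. The toy distributions below are assumptions chosen so the answer is easy to verify by hand:

```python
def ks_distance(cdf_a, cdf_b, grid):
    """Kolmogorov-Smirnov distance: largest gap between two CDFs,
    taken over a grid of evaluation points."""
    return max(abs(cdf_a(x) - cdf_b(x)) for x in grid)

# Toy example: Y ~ Uniform(0, 2) versus a one-kernel "mixture"
# Y(1) ~ Uniform(0, 1); the largest CDF gap occurs at x = 1.
cdf_y  = lambda x: min(max(x / 2.0, 0.0), 1.0)
cdf_y1 = lambda x: min(max(x, 0.0), 1.0)
error = ks_distance(cdf_y, cdf_y1, [i / 100.0 for i in range(201)])
```

Here the error is 0.5, reached at *x* = 1 where one CDF equals 1 and the other 0.5.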

**Kernels and model parameters**

Besides the weights, the other parameters of the model are the parameters attached to each kernel *X*(*n*, *k*). Typically, each kernel *X*(*n*, *k*) is characterized by two parameters: *a*(*n*, *k*) and *b*(*n*, *k*). In the case of Gaussian kernels, *a*(*n*, *k*) is the mean and *b*(*n*, *k*) is the variance; in this article, *b*(*n*, *k*) is set to 1. In the case of Uniform kernels with *Y* taking on positive values, *a*(*n*, *k*) is the lower bound of the support interval, while *b*(*n*, *k*) is the upper bound; in this case, since we want the support domains to form a partition of the set of positive real numbers (the set of potential observations), we use, for any fixed value of *n*, *a*(*n*, 1) = 0 and *b*(*n*, *k*) = *a*(*n*, *k*+1).

Finally, the various kernels should be re-arranged (sorted) in such a way that *X*(*n*, 1) always has the highest weight attached to it, followed by *X*(*n*, 2), *X*(*n*, 3) and so on. The methodology can also be adapted to discrete observations and distributions, as we will discuss later in this article.

**Algorithms to find the optimum parameters**

The goal is to find optimum model parameters, for a specific *n*, to minimize the error *E*(*n*). And then try bigger and bigger values of *n*, until the error is small enough. This can be accomplished in various ways.

The solution consists in computing the derivatives of *E*(*n*) with respect to all the model parameters, and then finding the roots (parameter values that make the derivatives vanish; see for instance this article). For a specific value of *n*, you will have to solve a non-linear system of *m* equations with *m* parameters. In the case of Gaussian kernels, *m* = 2*n*. For uniform kernels, *m* = 2*n* + 1 (*n* weights, *n* interval lower bounds, plus the upper bound of the rightmost interval). No exact solution can be found, so you need to use an iterative algorithm; potential techniques include gradient-based methods and Monte-Carlo simulations, both discussed below.

You can also use Monte-Carlo simulations, however here you face the curse of dimensionality, the dimension being the number *m* of parameters in your model. In short, even for *n* as small as *n* = 4 (that is, *m* = 8), you will need to test trillions of randomly sampled parameter values (*m*-dimensional vectors) to get a solution close enough to the optimum, assuming that you use raw Monte-Carlo techniques. The speed of convergence is an exponential function of *m*. Huge improvements to this method are discussed later in this section, using some kind of step-wise algorithm to find local optima, reducing it to a 2-dimensional problem. By contrast, speed of convergence is quadratic for gradient-based methods, if *E*(*n*) is convex in the parameter space. Note that here, *E*(*n*) may not be convex though.

**Convergence and uniqueness of solution**

In theory, both convergence and the fact that there is only one global optimum are guaranteed. It is easy to see that under the constraints imposed here on the model parameters, two different mixture models must have two distinct distributions. In the case of Uniform kernels, this is because the support domains of the kernels form a partition, and are thus disjoint. In the case of Gaussian kernels, as long as each kernel has a different mean, no two mixtures can have the same distribution: the proof is left as an exercise. To put it differently, **any relatively well behaved statistical distribution is uniquely characterized by the set of parameters associated with its mixture decomposition**. When using Gaussian kernels, this is equivalent to the fact that any infinitely differentiable density function is uniquely characterized by the coefficients of its Taylor series expansion. This is discussed in the appendix.

The fact that, under certain conditions, some of the optimization algorithms described in the previous subsection converge to the global optimum, is more difficult to establish. It is always the case with the highly inefficient Monte Carlo simulations. In that case, the proof is pretty simple and proceeds as follows:

- Consider the discrete case where *Y* takes only positive integer values (for example, your observations consist of counts), and use the discrete Uniform kernel.
- In that case, the solution will converge to a mixture model where each kernel is a binomial random variable, and its associated weight is the frequency of the count in question in your observed data. This is actually the global optimum, with *E*(*n*) converging to 0 as *n* tends to infinity.
- Continuous distributions can be approximated by discrete distributions after proper re-scaling. For instance, a Gaussian distribution can be perfectly approximated by sequences of increasingly granular binomial distributions. Thus, the convergence to a global optimum can be derived from the convergence obtained for the discrete approximations.

The stopping rule, that is, deciding when *n* is large enough, is based on how fast *E*(*n*) continues to improve as *n* increases. Initially, for small but increasing values of *n*, *E*(*n*) will drop sharply, but for some value of *n*, usually between *n* = 3 and *n* = 10, improvements will start to taper off, with *E*(*n*) very slowly converging to 0. If you plot *E*(*n*) versus *n*, the curve will exhibit an elbow, and you can decide to stop at the elbow. See the elbow rule here.
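The stopping rule can be sketched in a few lines. In the example below, the value for *E*(1) is made up for illustration, while *E*(2) through *E*(5) match the errors reported later in this article (0.41, 0.31, 0.24, 0.24):

```python
def elbow(errors, tol=0.05):
    """Elbow stopping rule. errors[i] holds E(i+1); return the n at
    which the relative improvement of E(n+1) over E(n) first drops
    below tol, i.e., the n at which to stop."""
    for i in range(1, len(errors)):
        if (errors[i - 1] - errors[i]) / errors[i - 1] < tol:
            return i
    return len(errors)

# E(1) = 0.80 is a hypothetical value; the rest come from the
# Monte-Carlo results reported in section 3.
n_stop = elbow([0.80, 0.41, 0.31, 0.24, 0.24])
```

With these numbers the rule stops at *n* = 4, consistent with the observation that *n* = 5 brings no further improvement.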

Finally, let us denote as *a*(*k*) the limit of *a*(*n*, *k*) as *n* tends to infinity; *b*(*k*) and *p*(*k*) are defined in a similar manner. Keep in mind that the kernels must be ordered by decreasing value of their associated weights. In the continuous case, a theoretical question is whether or not these limits exist. With Uniform kernels, *p*(*n*, *k*), as well as *b*(*n*, *k*) – *a*(*n*, *k*) (that is, the length of the *k*-th interval), should both converge to 0, regardless of *k*, as *n* tends to infinity. The limiting ratio *p*(*n*, *k*) / (*b*(*n*, *k*) – *a*(*n*, *k*)) represents the value of *Y*'s density at the point covered by the interval in question. Also, the sum of *p*(*n*, *k*) over all *k*'s should still be equal to one, at the limit as *n* tends to infinity. In practice, we are only interested in small values of *n*, typically much smaller than 20.

**Find near-optimum with fast step-wise algorithm**

A near optimum may be obtained fairly quickly with small values of *n*, and in practice this is good enough. To further accelerate the convergence, one can use the following step-wise algorithm, with the Uniform kernel. At iteration *n*+1, modify only two adjacent kernels that were obtained at iteration *n* (that is, kernels with adjacent support domains) as follows:

- Increase the upper bound of the left interval and decrease the lower bound of the right interval accordingly, or the other way around. Note that the density of each kernel still integrates to 1 over its own interval, before and after the modification, since we are using uniform kernels.
- Adjust the two weights, but keep the sum of the two weights unchanged.

So in fact you are only modifying two parameters (two degrees of freedom). Pick the two intervals, as well as the new weights and lower/upper bounds, in such a way as to minimize *E*(*n*+1).
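The two-parameter move can be sketched as follows. This is a minimal Python sketch, not the author's implementation: the L1 error on the pmf and the three candidate weight splits are simplifying assumptions.

```python
def mixture_pmf(bounds, weights, vmax):
    """Pmf of a discrete-uniform mixture; kernel k is uniform on
    {bounds[k], ..., bounds[k+1]-1}, a partition of {0, ..., vmax}."""
    pmf = [0.0] * (vmax + 1)
    for k, w in enumerate(weights):
        lo, hi = bounds[k], bounds[k + 1]
        for v in range(lo, hi):
            pmf[v] = w / (hi - lo)
    return pmf

def l1_error(bounds, weights, empirical):
    pmf = mixture_pmf(bounds, weights, len(empirical) - 1)
    return sum(abs(p - e) for p, e in zip(pmf, empirical))

def stepwise_move(bounds, weights, empirical, splits=(0.25, 0.5, 0.75)):
    """One step-wise iteration: move the boundary between two adjacent
    intervals and re-split their combined weight, keeping the best
    candidate move. Only two parameters change per move."""
    best_err, best = l1_error(bounds, weights, empirical), (bounds, weights)
    for k in range(len(weights) - 1):
        for b in range(bounds[k] + 1, bounds[k + 2]):
            for s in splits:
                nb = bounds[:k + 1] + [b] + bounds[k + 2:]
                tot = weights[k] + weights[k + 1]
                nw = weights[:k] + [tot * s, tot * (1 - s)] + weights[k + 2:]
                err = l1_error(nb, nw, empirical)
                if err < best_err:
                    best_err, best = err, (nb, nw)
    return best_err, best

# Toy target: a true 2-interval mixture, starting from a wrong boundary
empirical = [0.16] * 5 + [0.04] * 5
err, (bounds, weights) = stepwise_move([0, 3, 10], [0.5, 0.5], empirical)
```

Note that the sum of the two modified weights is preserved, as required.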

**3. Example**

Here, I illustrate some of the concepts explained earlier with an example based on simulated data. The source code and the data are provided so that my experiment can be replicated, and the technical details understood. The 10,000 data points generated (representing *Y*) are deviates from a skewed, non-symmetrical negative binomial distribution, taking integer values between 0 and 110. Thus we are dealing with discrete observations and distributions. The kernels have discrete uniform distributions, for instance uniform on {5, 6, 7, 8, 9, 10, 11} or on {41, 42, 43, 44}. The choice of a non-symmetrical target distribution (for *Y*) illustrates the fact that the methodology also works for non-Gaussian target variables, unlike the classic central limit theorem framework, which applies to sums (rather than mixtures) and where convergence is always towards a Gaussian. Here instead, convergence is towards a simulated negative binomial.

I tried to find online tools to generate deviates from any statistical distribution, but haven't found any interesting ones. Instead, I used R to generate the 10,000 deviates, using three commands described below.
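A rough Python equivalent of the three R lines described below (the negative binomial parameters, random seed, and output file name are illustrative assumptions, not the author's values):

```python
import numpy as np

rng = np.random.default_rng(42)
# 10,000 negative binomial deviates (size and p are assumed values,
# chosen to give a skewed distribution roughly in the 0-110 range)
y = rng.negative_binomial(2, 0.08, size=10_000)
# histogram with 50 bins, as in the R hist() call described below
counts, edges = np.histogram(y, bins=50)
# export the deviates for the script that fits the mixture model
np.savetxt("nbinom_deviates.txt", y, fmt="%d")
```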

The first line of code generates the 10,000 deviates from a negative binomial distribution, the second line produces its histogram with 50 bins (see picture below, where the vertical axis represents frequency counts, and the horizontal axis represents values of Y.) The third line of code exports the data to an output file that will first be aggregated and then used as an input for the script that (1) computes the model parameters, and (2) computes and minimizes the error *E*(*n*).

*Histogram for the 10,000 deviates (negative binomial distribution) used in our example*

**Data and source code**

The input data set for the script that processes the data, can be found here. It consists of the 10,000 negative binomial deviates (generated with the above R code), and aggregated / sorted by value. For instance, the first entry (104) means that among the 10,000 deviates, 104 of them have a value equal to 0. The second entry (175) means that among the 10,000 deviates, 175 of them have a value equal to 1. And so on.
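The aggregation step can be reproduced in a few lines of Python (toy data below; the real input file is the one linked above):

```python
from collections import Counter

# Toy stand-in for the 10,000 deviates
deviates = [0, 1, 0, 3, 1, 0, 2, 2, 0]
# Aggregate and sort by value: list of (value, count) pairs
agg = sorted(Counter(deviates).items())
# → [(0, 4), (1, 2), (2, 2), (3, 1)]
```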

The script is written in Perl (you are invited to write a Python version) but it is very easy to read and well documented. It illustrates the raw Monte-Carlo simulations with 4 discrete uniform kernels. So it is very inefficient in terms of speed, but easy to understand, with few lines of code. You can find it here. It produced the distribution (mixture of 4 kernels) that approximates the above histogram, see picture below.
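The Perl script itself is linked above; as a language-agnostic illustration, here is a minimal Python sketch of the same kind of raw Monte-Carlo search (this is not a port of the script: the L1 error metric and the parameter sampling scheme are assumptions):

```python
import random

def fit_mixture_mc(counts, n_kernels=4, n_iter=50_000, seed=42):
    """Raw Monte-Carlo search for a discrete-uniform mixture fit:
    sample random interval partitions and weights, keep the
    parameters achieving the lowest error so far (the "records")."""
    rng = random.Random(seed)
    total = sum(counts)
    empirical = [c / total for c in counts]
    vmax = len(counts) - 1
    best_err, best_bounds, best_weights = float("inf"), None, None
    for _ in range(n_iter):
        # Random partition of {0, ..., vmax} into n_kernels intervals
        cuts = sorted(rng.sample(range(1, vmax + 1), n_kernels - 1))
        bounds = [0] + cuts + [vmax + 1]
        # Random weights summing to 1
        raw = [rng.random() for _ in range(n_kernels)]
        weights = [r / sum(raw) for r in raw]
        # Mixture pmf: kernel k is uniform on {bounds[k], ..., bounds[k+1]-1}
        pmf = [0.0] * (vmax + 1)
        for k in range(n_kernels):
            lo, hi = bounds[k], bounds[k + 1]
            for v in range(lo, hi):
                pmf[v] = weights[k] / (hi - lo)
        # L1 error between mixture pmf and empirical pmf
        err = sum(abs(p - e) for p, e in zip(pmf, empirical))
        if err < best_err:
            best_err, best_bounds, best_weights = err, bounds, weights
    return best_err, best_bounds, best_weights

# Toy aggregated counts (value -> frequency), standing in for the
# negative binomial data used in the article
counts = [16] * 5 + [4] * 5
err, bounds, weights = fit_mixture_mc(counts, n_kernels=2, n_iter=2_000)
```

As discussed earlier, this brute-force approach converges very slowly as the number of kernels grows; it is shown here only because it is easy to understand.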

*Approximation of above histogram with mixture model, using 4 uniform kernels*

**Results**

The chart below shows a contour plot for the error *E*(2), when using *n* = 2 discrete uniform kernels, that is, two intervals, with lower bounds of the first interval displayed on the vertical axis, and upper bounds on the horizontal axis. The upper bound of the second (rightmost) interval was set to the maximum observed value, equal to 110. Ignore the curves above the diagonal; they are just a mirror of the contours below. Outside the kernel intervals, densities were kept at 0. Clearly, the best (discrete) kernel intervals to approximate the distribution of *Y* are visually around {1, 2, …, 33} and {34, …, 110}, corresponding to a lower bound of 1 and an upper bound of 33 for the first interval; this yields an error *E*(2) less than 0.45.

The contour plot below was produced with the contour function in R, using this data set as input.
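For readers without R, a hypothetical Python/matplotlib equivalent follows. Since the input data set is linked rather than reproduced here, a synthetic stand-in surface with a minimum near (lower bound 1, upper bound 33) is used:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for the E(2) grid (the real input file is linked
# in the article): a bowl with minimum 0.41 at lower = 1, upper = 33
lower = np.arange(0, 40)
upper = np.arange(2, 110)
L, U = np.meshgrid(lower, upper, indexing="ij")
E2 = 0.41 + ((L - 1) / 40.0) ** 2 + ((U - 33) / 80.0) ** 2

plt.contour(upper, lower, E2, levels=15)
plt.xlabel("upper bound of first interval")
plt.ylabel("lower bound of first interval")
plt.title("Error E(2) over kernel interval parameters")
plt.savefig("contour_E2.png")
```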

The interesting thing is that the error function *E*(*n*), as a function of the mixture model parameters, exhibits large areas of convexity containing the optimum parameters, when *n* = 2. This means that gradient descent algorithms (adapted to the discrete space here) can be used to find the optimum parameters. These algorithms are far more efficient than Monte-Carlo simulations.

*Contour plot showing the area where the optimum parameters are located, minimizing E(2)*

I haven’t checked if the convexity property still holds in the continuous case, or when you include the weight parameters in the chart, or for higher values of *n*. It still might, if you use the fast step-wise optimization algorithm described earlier. This could be the best way to go numerically, taking advantage of gradient descent algorithms, and optimizing only a few parameters at a time.

Now I discuss the speed of convergence, and the improvements obtained by increasing the number of kernels in the model. Here, optimization was carried out via very slow, raw Monte-Carlo simulations. The table below shows the interval lower bounds and weights associated with the discrete uniform kernels, for *n* = 4, obtained by running 2 million simulations. The upper bound of the rightmost interval was set to the maximum observed value, equal to 110. For any given *n*, only simulations performing better than all the previous ones are displayed: in short, these are the records. Using *n* = 5 does not improve the final error E(*n*). The lowest errors with *n* = 2, 3, and 4 were respectively 0.41, 0.31, and 0.24. They were obtained respectively at iterations 7,662 (*n* = 2), 96,821 (*n* = 3) and 1,190,575 (*n* = 4). This shows how slowly Monte-Carlo converges, and the fact that the number of required simulations grows exponentially with the dimension *m* of the parameter space. The Excel spreadsheet, featuring the same table for *n* = 2, 3, 4, and 5, can be found here.

**4. Applications**

The methodology proposed here has many potential applications in machine learning and statistical science. These applications were listed in the introduction. Here, I just describe a few of them in more detail.

**Optimal binning**

These mixtures allow you to automatically create optimum binning of univariate data, with bins of different widths and different sizes. In addition, the optimum number of bins can be detected using the elbow rule described earlier. Optimum binning is useful in several contexts: visualization (to display meaningful histograms), decision trees, and feature selection procedures. Some machine learning algorithms, for instance this one (see also here), rely on features that are not too granular and properly binned, to avoid over-fitting and improve processing time. These mixture models are handy tools to help with this.

**Predictive analytics**

Since this methodology creates a simple model to fit with your data, you can use that model to predict frequencies, densities (including performing full density estimation), intensities, or counts attached to unobserved data points, especially if using kernels with infinite support domains, such as Gaussian kernels. It can be used as a regression technique, or for interpolation or extrapolation, or for imputation (assigning a value to a missing data point), all of this without over-fitting. Generalizing this methodology to multivariate data will make it even more useful.
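A minimal sketch of this idea with Gaussian kernels follows; the means and weights below are assumptions, standing in for a fitted mixture:

```python
import math

def mixture_density(y, means, weights, sigma=1.0):
    """Density of a Gaussian mixture at any point y, observed or not.
    Because Gaussian kernels have infinite support, this can be used
    for interpolation, extrapolation, or imputation."""
    norm = sigma * math.sqrt(2 * math.pi)
    return sum(w * math.exp(-((y - a) ** 2) / (2 * sigma ** 2)) / norm
               for a, w in zip(means, weights))

# Hypothetical fitted mixture: two kernels with means 0 and 5,
# evaluated at an unobserved point y = 2.5
density_at_unobserved = mixture_density(2.5, [0.0, 5.0], [0.6, 0.4])
```

The returned value is a proper density: it is positive everywhere and integrates to one over the real line.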

**Test of hypothesis and confidence intervals**

These mixtures help build data-driven intermediate models, something in-between a basic Gaussian or exponential or other parametric fit (depending on the shape of the kernels) and non-parametric empirical distributions. The model also comes with its core parameters (the model parameters) automatically estimated. Confidence intervals and tests of hypothesis are easy to derive, using the approximate mixture model distribution to determine statistical significance, *p*-values, or confidence levels, the same way you would with standard, traditional parametric distributions.
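For instance, a data-driven confidence interval can be read off the fitted mixture's CDF. A minimal sketch with a toy discrete pmf (all values below are assumptions):

```python
def mixture_quantile(pmf, q):
    """Smallest value v whose cumulative mixture probability reaches q,
    for a discrete mixture pmf over the values 0..len(pmf)-1."""
    acc = 0.0
    for v, p in enumerate(pmf):
        acc += p
        if acc >= q:
            return v
    return len(pmf) - 1

# Toy fitted mixture pmf over values 0..9 (assumed weights)
pmf = [0.16] * 5 + [0.04] * 5
# 95% data-driven confidence interval from the mixture CDF
ci_95 = (mixture_quantile(pmf, 0.025), mixture_quantile(pmf, 0.975))
```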

**Clustering**

Mixture models were invented long ago for clustering purposes, in particular under a Bayesian framework. This is also the case here, and even more so as this methodology gets extended to deal with multivariate data. One advantage is that it can automatically detect the optimum number of clusters, thanks to its built-in stopping rule, known as the elbow rule. By taking advantage of convexity properties in the parameter space to use gradient descent algorithms for optimization, the techniques described in this article could perform unsupervised clustering faster than classical algorithms, and be less computationally intensive.

**5. Appendix**

We discuss here, from a more theoretical point of view, two fundamental results mentioned earlier.

**Gaussian mixtures uniquely characterize a broad class of distributions**

Let us consider an infinite mixture model with Gaussian kernels, each with a different mean *a*(*k*), same variance equal to 1, and weights *p*(*k*) that are strictly decreasing. Then the density associated with this mixture is

*f*(*y*) = (2π)^(–1/2) · SUM over *k* ≥ 1 of *p*(*k*) · exp( –(*y* – *a*(*k*))² / 2 ).

Two different sets of (*a*(*k*), *p*(*k*)) will result in two different density functions, thus the representation uniquely characterizes a distribution. Also, the exponential functions in the sum can be expanded as Taylor series: writing exp( –(*y* – *a*(*k*))² / 2 ) = exp(–*y*²/2) · exp(–*a*(*k*)²/2) · exp(*a*(*k*) · *y*), and expanding exp(*a*(*k*) · *y*) as a power series in *y*, we have:

*f*(*y*) = (2π)^(–1/2) · exp(–*y*²/2) · SUM over *m* ≥ 0 of *c*(*m*) · *y*^*m* / *m*!, where *c*(*m*) = SUM over *k* ≥ 1 of *p*(*k*) · exp(–*a*(*k*)²/2) · *a*(*k*)^*m*.

Density functions infinitely differentiable at *y* = 0 can be represented in this way. Convergence issues are beyond the scope of this article.

**Weighted sums fail to achieve what mixture models do**

It is not possible, using an infinite weighted sum of independent kernels of the same family, with decaying weights converging to zero but not too fast, to represent any arbitrary distribution. This fact was established here in the case where all the kernels have the exact same distribution. It is mostly an application of the central limit theorem. Here we generalize this theorem to kernels from the same family of distributions, but not necessarily identical. By contrast, the opposite is true if you use mixtures instead of weighted sums.

With a weighted sum of Gaussian kernels of various means and variances, we always end up with a Gaussian distribution (see here for explanation.) With Uniform kernels (or any other kernel family) we can prove the result as follows:

- Consider a sum of *n* kernels from the same family. Say *n*(1) of them have (almost) the same parameters, another *n*(2) of them have the same parameters but different from the first group, another *n*(3) of them have the same parameters but different from the first two groups, and so on, with *n* = *n*(1) + *n*(2) + …
- Let *n* tend to infinity, with *n*(1), *n*(2) and so on also tending to infinity. The weighted sum within each group will converge to a Gaussian, by virtue of the central limit theorem.
- The overall sum across all groups will tend to a sum of Gaussians, and thus must be Gaussian. This depends on how fast the weights are decaying. Details about the decay rate, for the result to be correct, are provided in my previous article.

By contrast, a mixture of any number of Gaussian kernels with different means is not Gaussian.

*To not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn, or visit my old web page here.*


By: Vincent Granville