How should an estimator be chosen? The academic training of economists and finance professionals has traditionally favored the minimum variance unbiased estimator (MVUE). Sometimes, the maximum likelihood estimator (MLE) is chosen. From time to time, the method of moments (MOM) or the generalized method of moments (GMM) is used. Because of their subjective nature, Bayesian methods are rarely used.
The problem with this hierarchy is that it is preference-based and ignores the axiomatic structures that underlie the possible choices. The argument to be made here is that the choice of an estimator can have unexpected consequences.
To illustrate this, and before going back to the underlying principles, we are going to look at a simple information-based lottery situation where the choice of estimator creates a surprising outcome.
Our story begins with an engineer and the owner of a bakery. The owner has a problem. His best baker is a ghost, and the cakes he makes are invisible.
The important thing about the cakes is not that they are invisible, but that they become visible once cut. The cakes have an unusual magical property. If the cakes are cut into two pieces, the large piece turns purple while the small piece turns green. But, if the cakes are cut precisely in half, then both halves glow like gold and float in the air. The gold cakes sell for much more than the ordinary green and purple ones, so much so that the owner hired an engineer to maximize the number of gold cakes.
Specifically, the contract is to build a device or devices that are to be attached to an existing table that will maximize the number of golden cakes.
All of the cakes are six inches in diameter and remain invisible until cut. After baking the cakes, the ghost places them on the large square table. Because ghosts are not subject to limitations like gravity or barriers such as walls, doors, or tables, the ghost places the cake on the table randomly. Not only is the cake placed at a random location, but the locations are uniformly distributed over the surface of the table.
Upon investigation, the engineer determines that the cake releases an invisible light from the decay of the ectoplasm used to ice the cake. The ectoplasm detector determines that the release of ghostly particles is also uniformly distributed over the surface of the cake.
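The uniform spread of particles over the cake can be simulated by rejection sampling: propose points uniformly in the square that encloses the cake and keep only those that land inside the circle. The following is a minimal sketch in R under that assumption. The helper name `sample_cake_points` and its arguments are illustrative and not part of the story; the center (7, 8) and the three-inch radius are the values used later in the article.

```r
# Sketch: draw n points uniformly over a disk by rejection sampling.
# sample_cake_points is a hypothetical helper name, not from the article.
sample_cake_points <- function(n, center = c(7, 8), radius = 3) {
  xs <- numeric(0)
  ys <- numeric(0)
  while (length(xs) < n) {
    # Propose points uniformly in the enclosing square [-1, 1] x [-1, 1].
    px <- runif(2 * n, -1, 1)
    py <- runif(2 * n, -1, 1)
    keep <- px^2 + py^2 <= 1   # accept only proposals inside the unit disk
    xs <- c(xs, px[keep])
    ys <- c(ys, py[keep])
  }
  # Scale to the cake's radius and shift to its center.
  cbind(x = xs[1:n] * radius + center[1],
        y = ys[1:n] * radius + center[2])
}

set.seed(1)
pts <- sample_cake_points(40)   # one cake's worth of forty detections
```

About pi/4 of the proposals are accepted, which is why the sketch proposes more points than it needs on each pass.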
The engineer realizes that if the center of the cake can be found and if a blade could be placed on one corner of the table, treating it as the origin, then a blade cut through the center of the cake would split the cake into two equal pieces.
The engineer was excited. A long blade was easy to forge, even though the blade had to have magical glyphs inscribed in it. Furthermore, attaching the blade at a corner was a straightforward engineering task. Detecting the decaying particles of ectoplasm was simple because detectors already existed, but finding the center of the cake did not have an obvious solution.
So the engineer picks up a statistics textbook. She reads up on statistical estimators and decides to use the MVUE, which, in this case, is the sample mean. She excludes the MLE because the likelihood has no unique maximum, so there cannot be a unique estimator.
The MVUE has two very desirable properties. First, because it is unbiased, no estimator can be more accurate on average in finite samples. Second, because its sampling distribution has the smallest variance among unbiased estimators, there is no less risky way to construct an unbiased estimator, given no prior information as to where the cake is located.
The engineer calculates how many points can be collected until the cake is cool enough to cut. It turns out to be forty points. The engineer is more than satisfied with the sample size. Being proud of the design, the engineer tells many friends about it, including a clever elf.
The clever elf goes to the bakery owner and recommends cutting the cake in a large open area so the entire town can watch the process for each cake. The elf was right; the town was fascinated. The elf also suggested building a display to show the location of every data point. Like a countdown for a rocket launch, watching a value appear in the data display increased anticipation. That, in turn, increased sales of coffee, cookies, and muffins.
Once enough elves had arrived, they started making bets with one another about whether the left side or the right side would be green. The bakery turned into a veritable casino filled with all the elves in town.
That lasted until the clever elf started betting. At first, no one paid any attention to the elf. However, word started getting around that the clever elf was winning a lot of money. A few of the older and wiser elves started paying attention.
They noticed that about one-fifth of the time, the clever elf would not gamble at all. In fact, the clever elf often said that he was sick at those times and apologized because he felt he couldn’t concentrate. The older, wiser elves brought in a wizard to detect if magic was being used. It was not. They did notice that when a new data point appeared on the display, the clever elf would input it into his cell phone.
They also noticed that during the other four-fifths of the time, the clever elf would bet as much as possible, even borrowing money from others. They also noticed that the clever elf never lost. The clever elf was winning one hundred percent of the time, or not gambling at all.
Word got around. No one would take his bets and gambling halted when he was around because everyone would mimic his bets. No one would take the other side of the bet.
The clever elf realized that he had outsmarted himself, so he went to the orcs. He explained to the orcs about how much fun the elves were having gambling and so the orcs went to the bakery.
The clever elf changed tactics. The elf realized that he would have to gamble all the time. He just chose to make small bets twenty percent of the time and bet everything eighty percent of the time. The behavior was so obvious that the orcs quickly realized something was going on, and they accused the clever elf of cheating. The clever elf learned a hard lesson about gambling with orcs.
Finally, the clever elf went to the humans. He decided that there were so many humans that if he gambled the same amount on every bet and spread it over enough people, no one would notice. The clever elf won around ninety percent of the time.
Eventually, the clever elf built up enough money that he went to the king and asked to be admitted to the nobility. The king agreed, for a very substantial fee, of course, and the disclosure of how the elf won so much money.
The elf explained that whenever the MVUE fell outside the support of the Bayesian posterior, it sat at a location that could not physically be the center of the cake.
The clever elf started to explain Bayes’ law, but the King stopped him. He said, “Show me this Bayes’ law in pictures.”
The elf explained that Bayesian analysis is built on the likelihood function and prior knowledge. Of course, no one had prior knowledge of where the baker would place the cakes, so each point was equally likely. The prior density of the cake’s center at any point on the table was one divided by the area of the table if the center could be on an edge, and one divided by a reduced area if the entire cake had to be on the table.
Nonetheless, it is the likelihood that matters here. Because the cake is of known diameter, it is known that every observed point must lie within three inches of the center of the cake. The likelihood function reverses this property. Each point within three inches of the first observation is equally likely to be the true center of the cake. The likelihood is the same for every point inside or on the perimeter of that circle and zero everywhere else.
This new circle is the posterior density after the first observation. This posterior density is also the prior density for the second observation. This process of changing beliefs about where the center of the cake is located is called “posterior updating.”
Once a second observation happens, the intersection of the two circles becomes the support of the posterior distribution of beliefs about possible locations. In this case, it creates a posterior that looks like a lens.
The only points that could be the center of the cake sit inside this lens. This lens-like shape becomes the posterior distribution after the second observation and the prior distribution for the third observation.
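The updating described above can be sketched numerically by treating each observation as a filter on a grid of candidate centers. The sketch below is illustrative only: the grid spacing and the helper name `within_reach` are assumptions, and the two observations are the first two points quoted in the article's three-point example. It checks the grid areas against the closed-form areas of a circle and of the lens formed by two equal circles.

```r
# Sketch: the posterior support as an intersection of disks, on a grid.
# The grid resolution and helper names are illustrative assumptions.
radius <- 3
step   <- 0.01
grid   <- expand.grid(cx = seq(1, 13, by = step),
                      cy = seq(4.5, 11.5, by = step))

# TRUE where a candidate center lies within `radius` of the observation.
within_reach <- function(grid, obs, radius) {
  (grid$cx - obs[1])^2 + (grid$cy - obs[2])^2 <= radius^2
}

obs1 <- c(4.5, 8)
obs2 <- c(9.5, 8)

support1 <- within_reach(grid, obs1, radius)             # a full circle
support2 <- support1 & within_reach(grid, obs2, radius)  # the lens

# Approximate areas: number of feasible grid nodes times the cell area.
area_circle <- sum(support1) * step^2
area_lens   <- sum(support2) * step^2

# Closed-form area of the lens formed by two equal circles a distance d apart.
d <- sqrt(sum((obs1 - obs2)^2))
area_lens_exact <- 2 * radius^2 * acos(d / (2 * radius)) -
  (d / 2) * sqrt(4 * radius^2 - d^2)

print(c(circle = area_circle, lens = area_lens, lens_exact = area_lens_exact))
```

Because the prior is flat, each update only removes candidate centers; it never reweights the ones that remain.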
It is with the third observation that the Frequentist sample mean and the Bayesian posterior mean no longer agree. The three observed points are (4.5,8), (9.5,8), and (6.75,8). The reason that there is no difference in the posterior from the second to the third observation is that the valid choices for the center of the cake based only on the third observation make no changes to the intersection. That brings up an essential element of the multiplicative nature of Bayesian analysis versus the averaging nature of Frequentist analysis.
Only new information about a parameter gets into the posterior density or mass function. If, as here, the prior already has the same or more information content than the likelihood, then the posterior will equal the prior. For a data point to impact the posterior calculations, that data point has to have information not already known about a parameter.
Although it is true that, had the third point been observed before the second, all three points would have impacted the calculation of the second and third posteriors, the joint posterior does not depend on the information in the final observation, only on the first and second.
The Frequentist calculation, on the other hand, is based on averaging over the sample space. It has several valuable properties.
First, it is the MVUE. Second, it has good properties over the entire sample space but may not have good properties for a specific sample. Third, if it is used in a decision as if it were the correct point, it minimizes the maximum possible loss that could occur. The MVUE is based on the impact of the average amount of information given a true model and does not depend on knowing the real value of the parameter.
The difference causes the Bayesian posterior mean to be (7,8), and the Frequentist sample mean to be (6.91667,8). In this case, both are inside the posterior.
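The two estimates quoted above can be reproduced with a short calculation. The following sketch assumes a flat prior, so the posterior is uniform on its support and the posterior mean is the centroid of the feasible region; the grid and the helper name `feasible` are illustrative choices rather than the author's method.

```r
# Sketch: sample mean versus posterior mean for the three quoted points.
obs    <- rbind(c(4.5, 8), c(9.5, 8), c(6.75, 8))
radius <- 3

# Frequentist estimate: the coordinate-wise sample mean.
sample_mean <- colMeans(obs)

# Candidate centers on a grid symmetric about x = 7 and y = 8
# (an illustrative assumption; the spacing is arbitrary).
grid <- expand.grid(cx = seq(6, 8, by = 0.005),
                    cy = seq(6, 10, by = 0.005))

# A candidate is feasible if it lies within `radius` of the first k points.
feasible <- function(k) {
  ok <- rep(TRUE, nrow(grid))
  for (i in seq_len(k)) {
    ok <- ok & ((grid$cx - obs[i, 1])^2 + (grid$cy - obs[i, 2])^2 <= radius^2)
  }
  ok
}

support_two   <- feasible(2)
support_three <- feasible(3)

# With a flat prior the posterior mean is the centroid of the support.
posterior_mean <- c(mean(grid$cx[support_three]),
                    mean(grid$cy[support_three]))

print(round(sample_mean, 5))
print(round(posterior_mean, 3))
print(identical(support_two, support_three))  # does the third point matter?
```

Run as written, the sample mean comes out at (6.91667, 8), the grid centroid lands at approximately (7, 8), and the support is identical with two or three observations, which is the sense in which the third point carries no new information.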
The clever elf shows the King the plot of the posterior for the second cake that was baked, assuming it was centered on (7,8). It is clear that the sample mean (green) is outside the set of possible values (blue, with red for the posterior mean).
The engineer splutters at hearing this. The elf, now called Lord Lolthlorian, asks the engineer whether she remembers seeing, in her first statistics course, a confidence interval such as [-2,12] for a sample that could only contain positive numbers.
The engineer replied, “Well, of course, but the only thing that matters about a ninety-five percent confidence interval is that the interval covers the parameter ninety-five percent of the time or higher. It does not matter that the left bound is impossible; it only matters that the interval covers the parameter often enough.”
“Yes,” said Lord Lolthlorian, “and for the MVUE, what matters is that it minimizes quadratic loss over the sample space, while, for the posterior mean, what matters is that it minimizes quadratic loss over the supported parameter space. It can happen that the posterior mean is not itself a possible value, such as when there is a hole in the parameter space and the average of the posterior falls in that hole.
In that case, however, a quadratic loss would be inappropriate for either method. In that case, the Bayesian posterior mean would not be a closed operation over the possible set.”
The engineer, recovering her balance, points out that the goal was not to build a casino but to cut cakes well. Lord Lolthlorian agreed. He said, “We have seen a thousand cakes cut; we should look at each cut and see how it worked out. We can ignore the specific cuts and look at the performance of the estimators. First, we should check to verify the data appears as it should.”
They make a chart of all 40,000 points to see if they see any unusual holes or patterns. They see none.
They also find the marginal densities along the x- and y-axes using kernel density estimation, expecting to see a rounded, roughly parabolic hump (strictly, a semi-ellipse), and that is approximately what they do see. The marginal density along the y-axis is omitted here.
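For points spread uniformly over a disk of radius r, the exact marginal density along either axis is a semi-ellipse, f(x) = 2 sqrt(r^2 - (x - c)^2) / (pi r^2), which a kernel density estimate renders as a hump much like an inverted parabola. The sketch below compares that formula with a kernel density estimate from simulated points. The helper name `marginal_density` and the polar-coordinate sampler are illustrative assumptions, not the article's own code.

```r
# Sketch: exact marginal density of one coordinate for points uniform on a disk.
# marginal_density is a hypothetical helper name.
marginal_density <- function(x, center = 7, radius = 3) {
  inside <- abs(x - center) <= radius
  out <- numeric(length(x))
  out[inside] <- 2 * sqrt(radius^2 - (x[inside] - center)^2) / (pi * radius^2)
  out
}

# The density should integrate to one over the cake's width.
total_mass <- integrate(marginal_density, lower = 4, upper = 10)$value

# Compare with a kernel density estimate from simulated points.
set.seed(1)
n     <- 40000
theta <- runif(n, 0, 2 * pi)
rad   <- 3 * sqrt(runif(n))        # the square root gives uniform coverage
x_sim <- 7 + rad * cos(theta)
kde   <- density(x_sim)            # base R Gaussian kernel density estimate

peak_exact <- marginal_density(7)  # equals 2 / (pi * radius)
peak_kde   <- approx(kde$x, kde$y, xout = 7)$y
print(c(total_mass = total_mass, peak_exact = peak_exact, peak_kde = peak_kde))
```

The kernel estimate smooths the sharp drop to zero at the edges of the cake, so agreement is best near the center.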
Next, they find the kernel density estimates of the two types of estimators, placing them on the same scale for comparison. They do so along each axis, although the sampling distributions along the y-axis are omitted here for space.
Finally, they construct Tukey’s five-number summary, together with the mean, for both dimensions, looking at both the Frequentist sample mean and the Bayesian estimates of the center.

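| Statistic | x_bar | y_bar | Posterior_c_x | Posterior_c_y |
| --- | --- | --- | --- | --- |
| Min. | 6.126 | 7.192 | 6.565 | 7.654 |
| 1st Qu. | 6.836 | 7.851 | 6.945 | 7.953 |
| Median | 7.001 | 8.005 | 7.001 | 8.000 |
| Mean | 6.999 | 8.010 | 6.999 | 8.002 |
| 3rd Qu. | 7.157 | 8.170 | 7.052 | 8.051 |
| Max. | 7.671 | 8.763 | 7.540 | 8.452 |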
This disturbs the engineer. Lord Lolthlorian points out that the Bayesian estimator, subject to the choice of a prior density, automatically makes tradeoffs between accuracy and precision. The Bayesian estimator is intrinsically biased but exchanges that bias for increased precision. The tradeoff happens in a manner such that a Bayesian estimator cannot be first-order stochastically dominated by another estimator, subject to the prior.
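The exchange of bias for precision is usually formalized through the mean squared error. As a standard identity, which is not stated in the original and uses notation assumed here, for an estimator \hat{\theta} of a parameter \theta:

```latex
\operatorname{MSE}(\hat{\theta})
  = \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right]
  = \operatorname{Var}(\hat{\theta})
    + \bigl(\mathbb{E}[\hat{\theta}] - \theta\bigr)^2
```

An unbiased estimator sets the second term to zero, but a biased estimator can still have the smaller mean squared error if its variance is sufficiently lower.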
Still unsettled, the engineer asks, “Why isn’t this taught in statistics courses for practitioners?” Lord Lolthlorian responds that there is only so much time in a program to teach statistical methods, and these methods require calculus and numerical integration skills. Also, although it is not as useful as a tool for gambling, lotteries, finance, or cake cutting, the MVUE would be superior in certain inferential problems. If there were a null to falsify, particularly a “sharp null,” the Frequentist method would shine. Also, if the maximum loss were terrible, then the MVUE may be superior.
A tool’s usefulness is different when a bad sample could mean that everyone dies than when it means that some people lose money on a gamble or some cakes come out a little lopsided.
Lord Lolthlorian says, “As long as we are discussing it, we should look at the properties of the Bayesian and Frequentist estimators when the MVUE is inside the posterior and when it is outside the posterior, to gain some information about how each of them performs as an estimator.”
When the MVUE is inside the posterior, the two sampling distributions are very similar, although the Bayesian estimator has a bit more mass in the center. Only the density along the x-axis is shown for brevity.
It is when the MVUE is outside the posterior that the differences become substantial. A value near the perimeter of the cake, or a run of values on one side of the axis, will tend to pull the sample mean. On the other hand, a value near the perimeter tends to cut down the size of the posterior, making it more precise. Likewise, a run of many values located near together has almost no impact on the posterior density because the points have approximately the same information in them.
The impact of this can be seen on the sampling distributions of both types of estimators based on whether the MVUE has been pulled outside the posterior or not.
The precision of the MVUE deteriorates when it is pulled outside the posterior either by values near the perimeter or runs, although it is equally accurate.
The Bayesian estimator, however, has its precision improved by the added information from runs or the presence of edge values. Because the MVUE was outside the posterior in seven hundred and ninety-one of the one thousand cases, the effect is more pronounced than it would be if falling outside the posterior were rare.
Looking at the sampling distribution of the estimates of the center where all cakes were placed at (7,8) for all 1000 cakes results in a pronounced difference in precision between the two. The MVUE generates a wide, mildly sloping hill, while the Bayesian posterior mean generates a narrow, steep mountain.
The engineer and the owner of the bakery confer. Lord Lolthlorian asks if they are going to switch to Bayesian methods. The engineer replies, “no.”
Lord Lolthlorian remained surprised until the engineer came out thirty minutes later with a six-by-six-inch table with a metal frame to slowly move the cake from the edge if it is placed partly over an edge and a blade that cuts diagonally across one of the four corners.
The engineer says, “If you cannot control for the natural variation in nature, you change the nature of the problem so that you do not have to deal with this.”
To maximize revenue, the cutting of the golden cake was performed out of sight, and the cake was given as a prize for coffee drinkers that bought lottery tickets while watching the old cake splitter still produce cakes, but with hidden data so no one could replicate the clever elf’s solution.
The clever elf moved to the United States to work in finance. He was offered a job in statistical arbitrage; he declined the position and instead set up a real arbitrage fund.
The elf noted that there were several sources of arbitrage present in the finance market and that some data scientists were unknowingly violating a principle of probability and a principle of macroeconomics. Together, the use of these tools was creating arbitrage opportunities against market makers and hedge funds.
The elf could not help but notice the irony that tools designed to capture statistical arbitrage opportunities were accidentally creating arbitrage opportunities.
A friend who had introduced the elf to finance asked, “Well, couldn’t we just use the same formulas but with a Bayesian estimator?” Lord Lolthlorian responded, “No.”
The models in use are built on Frequentist axioms, and when one attempts to derive them in a Bayesian framework, the results are not the same. Models that look like do not follow as solutions under Bayesian probability interpretations. Underlying these models is an assumption that the parameters are known. Parameters are random variables in Bayesian thinking. A data scientist cannot just pick up one model and plop it into a Bayesian space; the data scientist has to start over.
His friend complains, “But these are valid Frequentist models, and there is no edge to the cake in a parameter space that is a half-bounded subset of the real numbers.”
The elf replied, “you are thinking about this as a regression and in terms of parameters, but there is an edge. That edge has to be the edge for any model of finance. It requires that the price of an asset, in equilibrium, must equal its discounted cash flows, which must also equal the replacement cost of the firm.”
The friend complained, “but what are the two principles being violated?”
The elf explained that the probability issue is called coherence. A statistic is only coherent if fair gambles could be placed on it. Frequentist statistics are not coherent because that is not their goal. Their goal is protective. It is to control for certain types of errors. That makes the frequencies subject to a loss function rather than being actual probabilities.
Because the sampling distributions of the mean and the median are different, they generate different confidence intervals, different predictive intervals, and different point estimates. They imply different frequencies. The distributions are first conditioned on a loss function. Bayesian predictions minimize the KL divergence between the model and nature, so they are, intrinsically, the closest model to nature that is possible given the information.
That happens because the KL divergence can be derived directly from Bayes’ theorem. It is a direct transformation of the Bayesian posterior predictive density and nature’s density.
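For reference, the Kullback-Leibler (KL) divergence mentioned here has a standard definition. In the notation below, which is assumed for illustration and does not appear in the original, p stands for nature's density and q for the model's predictive density:

```latex
D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \, \log \frac{p(x)}{q(x)} \, dx
```

The divergence is never negative and equals zero only when the two densities agree almost everywhere, which is the sense in which a smaller value means a model closer to nature.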
The macroeconomics principle is a bit more subtle. An example of this can be seen in the savings paradox. If everyone starts saving, then no one consumes, and the return on savings collapses, and investment generates total losses. If everyone consumes, then no seeds are saved for the next season, and everyone dies.
If all data scientists use the same general methodology, relying on backtesting and cross-validation instead of checking the rationality of their models against ground truth, then they create Keynesian-style rigidities that would not otherwise exist in nature because they have unintentionally adopted highly similar trading rules.
When Long Term Capital Management (LTCM) collapsed, it was a surprise because the markets had been functioning as predicted up to the collapse. What was missing was that LTCM was, unintentionally, controlling the trading rules for the entire system. As such, everything had to go the way they priced everything. The rigidities in the system were their own. When cash flows for the underlying assets diverged from their models, the system collapsed.
“If most data scientists are using the same models and the models do not match the physical reality, then they unintentionally create long-run arbitrage,” explained the elf.
“Want to go out for some cake?” asked the friend.
“I know a great bakery,” said the elf.
David Harris can be found on LinkedIn.
The code to produce the data is:
#clear variables
rm(list = ls())
#grab libraries
library(ggplot2)
library(export)
#create repeatable set of random variables
set.seed(101)
#must be 4/pi times greater than target or more
Number_of_Samples <- 1400
Sample_size <- 40
#center of circle
c_x <- 7
c_y <- 8
#propose points in the square [-1,1] x [-1,1] and keep those inside the unit circle
x <- matrix(runif(Number_of_Samples*Sample_size, -1, 1), nrow = 1)
y <- matrix(runif(Number_of_Samples*Sample_size, -1, 1), nrow = 1)
Boolean <- matrix(rep(0, Number_of_Samples*Sample_size), nrow = 1)
Boolean <- ifelse(x**2 + y**2 <= 1, 1, 0)
x <- x[Boolean == 1]
y <- y[Boolean == 1]
#target number
Number_of_Samples <- 1000
rm(Boolean)
#scale to a three-inch radius and shift to the center of the cake
x <- x[1:(Number_of_Samples*Sample_size)]*3 + c_x
y <- y[1:(Number_of_Samples*Sample_size)]*3 + c_y
x <- matrix(x, nrow = Sample_size, ncol = Number_of_Samples)
y <- matrix(y, nrow = Sample_size, ncol = Number_of_Samples)
#initialize and define estimators for x and y axis, top=frequentist, bottom=bayesian
#constructed as a matrix for clean use of apply function
estimators <- matrix(rep(0, 5*Number_of_Samples), ncol = Number_of_Samples, nrow = 5)
row.names(estimators) <- c("x_bar", "y_bar", "Posterior_c_x", "Posterior_c_y", "Is_Freq_Est_Possible")
bayesian_posterior_means <- function(variables, s = Sample_size){
  #this splits data back into x and y
  #variables passed as single matrix to permit using apply family
  x <- variables[1:s]
  y <- variables[(s+1):(2*s)]
  #it is cleaner to recreate mean x and y than to pass them
  x_bar <- mean(x)
  y_bar <- mean(y)
  #creates acceptance-rejection variables
  AR_TRIES <- 10000
  #creates random draws for acceptance-rejection
  r_c_x <- runif(AR_TRIES, min = max(x) - 3, max = min(x) + 3)
  r_c_y <- runif(AR_TRIES, min = max(y) - 3, max = min(y) + 3)
  bayes_feasible <- rep(0, AR_TRIES)
  #creates a reporting variable as to whether the frequentist sample mean is more than three units from
  #at least one observation
  freq_feasible <- 0
  #tests each element of random possible solutions for feasibility
  for (i in 1:AR_TRIES) {
    if(max((x - r_c_x[i])**2 + (y - r_c_y[i])**2) <= 9) bayes_feasible[i] <- 1
  }
  if(max((x - x_bar)**2 + (y - y_bar)**2) <= 9) freq_feasible <- 1
  #posterior means for a uniform, bounded distribution are the marginal means
  bayes_x <- mean(r_c_x[bayes_feasible == 1])
  bayes_y <- mean(r_c_y[bayes_feasible == 1])
  return(c(x_bar, y_bar, bayes_x, bayes_y, freq_feasible))
}
#applies bayesian posterior mean construction over the data set
estimators[1:5,] <- apply(matrix(rbind(x, y), ncol = Number_of_Samples), 2, bayesian_posterior_means)
#constructs mean squared error of each type of estimator
Frequency_Mean_Squared_Error <- sum((estimators[1,] - c_x)**2 + (estimators[2,] - c_y)**2)/Number_of_Samples
Bayesian_Mean_Squared_Error <- sum((estimators[3,] - c_x)**2 + (estimators[4,] - c_y)**2)/Number_of_Samples
Frequency_to_Bayesian_Relative_Efficiency <- Frequency_Mean_Squared_Error/Bayesian_Mean_Squared_Error
Percentage_of_MVUEs_That_Are_Impossible <- (Number_of_Samples - sum(estimators[5,]))/Number_of_Samples
Credit: Data Science Central By: David Harris