Credit: Google News
Lost amongst the hype and hyperbole surrounding machine learning today, especially deep learning, is the critical distinction between correlation and causation. Developers and data scientists increasingly treat their creations as silicon lifeforms “learning” concrete facts about the world, rather than what they truly are: piles of numbers detached from what they represent, mere statistical patterns encoded into software. We must recognize that those patterns are merely correlations amongst vast reams of data, rather than causative truths or natural laws governing our world.
As machine learning has expanded beyond its roots in the worlds of computer science and statistics into nearly every conceivable field, the data scientists and programmers building those models are increasingly detached from an understanding of how and why the models they are creating work. To them, machine learning is akin to a black box in which you blindly feed different mixes of training data in one side, twirl some knobs and dials and repeat until you get results that seem to work well enough to throw into production.
Beyond the obvious problem that such models are extraordinarily brittle, the larger issue is the way in which these models are being deployed.
It is entirely reasonable to use machine learning algorithms to sift out extraordinarily nuanced patterns in large datasets. Indeed, one of the most powerful applications of machine learning lies in identifying the unexpected patterns underlying phenomena of interest in a dataset, or in verifying that expected patterns exist.
Where things go wrong is when we reach beyond these correlations towards implying causation.
Pattern verification is an especially powerful way of using machine learning models, both to confirm that they are picking up on theoretically suggested signals and, perhaps even more importantly, to understand the biases and nuances of the underlying data. Unrelated variables moving together can reveal a powerful and undiscovered new connection with strong predictive or explanatory power. On the other hand, they could just as easily represent spurious statistical noise or a previously undetected bias in the data.
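One simple way to probe whether two variables moving together reflect a real connection or mere noise is a permutation test: shuffle one variable to destroy any genuine link, and ask how often chance alone reproduces the observed correlation. The sketch below is a minimal illustration on entirely synthetic data, not a recipe from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two variables that appear to move together.
# Here y genuinely (if weakly) depends on x, by construction.
n = 200
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

observed = abs(np.corrcoef(x, y)[0, 1])

# Permutation test: shuffling y breaks any real relationship, so the
# shuffled correlations show what chance alone can produce.
perms = 5000
count = 0
for _ in range(perms):
    r = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
    if r >= observed:
        count += 1

p_value = count / perms
print(f"observed |r| = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

A tiny p-value says the pattern is unlikely to be pure noise; it says nothing about whether x causes y, or whether both are driven by something else.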
Bias detection is all the more critical as we deploy machine learning systems in applications with real world impact using datasets we understand little about.
Perhaps the biggest issue with current machine learning trends, however, is our flawed tendency to interpret or describe the patterns captured in models as causative rather than correlations of unknown veracity, accuracy or impact.
One of the most basic tenets of statistics is that correlation does not imply causation. In turn, a signal’s predictive power does not imply that the signal is actually related to, or explains, the phenomenon being predicted.
This distinction matters when it comes to machine learning because many of the strongest signals these algorithms pick up in their training data are not actually related to the thing being measured.
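This point is easy to demonstrate: given enough candidate features, some will correlate with any target purely by chance. The NumPy sketch below uses entirely synthetic data in which, by construction, no feature has any relationship to the target:

```python
import numpy as np

rng = np.random.default_rng(42)

# 50 observations of a target, plus 1,000 candidate "signals" --
# every one of them pure noise, generated independently of the target.
n_obs, n_features = 50, 1000
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_features))

# Correlation of each noise feature with the target.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
best = np.abs(corrs).max()

# With this many candidates, the "best" feature looks impressively
# predictive even though none has any relationship to y at all.
print(f"strongest |correlation| among pure-noise features: {best:.2f}")
```

A model trained greedily on such a dataset would happily latch onto these chance correlations, which is exactly why in-sample predictive strength alone proves nothing about the real world.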
This is partly because the thing we are most interested in often cannot be directly observed through any single variable. Predicting most events, from the likelihood a user will buy a given product to the likelihood a given country will collapse into civil war tomorrow, relies on a patchwork of signals, none of which directly measures the actual thing we are interested in.
In essence, in most machine learning, the actual thing we hope to have our model learn cannot be learned directly from the data we are giving it.
This may be because the available medium (such as photographs) does not fully capture the phenomenon we hope to have the model recognize (such as identifying dogs). A pile of dog photographs cannot build a model that recognizes their barks.
More often, it is because the thing we hope to measure (like the conversion of a website visitor into a customer) cannot be directly assessed through any single variable. Instead, we must proxy it through all sorts of unrelated variables that capture bits and pieces of the intangible “thing” we are trying to predict.
These bits and pieces, however predictive they may be, are merely genuinely or spuriously correlated with the thing we’re trying to predict. They do not necessarily cause or even explain how and why that thing occurs.
It is entirely possible to learn that a certain shade of color in a purchase button on a website makes it more likely that users will complete a sales transaction. That pattern may be strong enough that it holds true across demographics and, when implemented, meaningfully increases sales. The problem is that it is unlikely that the specific color is the triggering factor. The sales increase is instead likely related either to the context in which the button appears on the page or to what that color connotes for the site’s customer demographic.
Only by moving from correlation to causation, and understanding why that pattern is so predictive, can we begin to trust that it will continue to function as expected over time.
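The button example is a classic confounding story, and it can be simulated. In the hypothetical scenario below, the page context is the true cause: it drives both which button color a visitor sees and whether they convert. Color looks strongly predictive in observational data, yet randomizing color independently of context (as an A/B test would) reveals that changing the color alone does nothing:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Observational data: page context (the true cause) influences both
# which button color is shown and whether the visitor converts.
context = rng.integers(0, 2, size=n)                  # 1 = high-intent page
green = (context == 1) & (rng.random(n) < 0.9)        # green mostly on high-intent pages
green |= (context == 0) & (rng.random(n) < 0.1)
convert = rng.random(n) < np.where(context == 1, 0.20, 0.02)

obs_lift = convert[green].mean() - convert[~green].mean()

# Intervention: assign color at random, independent of context.
green_ab = rng.random(n) < 0.5
ab_lift = convert[green_ab].mean() - convert[~green_ab].mean()

print(f"observational lift from green button: {obs_lift:.3f}")
print(f"randomized (causal) lift:             {ab_lift:.3f}")
```

A model trained on the observational data would rank button color as a powerful signal, and it would keep predicting well right up until someone acted on it.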
Moving from correlation to causation is especially important when it comes to understanding the conditions under which a machine learning model may fail, how long we can expect it to continue being predictive and how widely applicable it may be.
Using machine learning to identify correlative patterns in data is an extremely powerful approach to understanding both the nuances and biases of our data and the unexpected, yet very real, patterns that our current theoretical understanding failed to point us towards.
On the other hand, when we attempt to reach past this usage towards treating our models as “discovering” or “learning” causative new “natural laws” or concrete “facts” about the world, we tread upon dangerous ground.
Putting this all together, the ease with which modern machine learning pipelines can transform a pile of data into a predictive model, without requiring an understanding of statistics or even programming, has been a key driving force in the field’s rapid expansion into industry. At the same time, it has eroded the distinction between correlation and causation as the new generation of data scientists building and deploying these models conflates their predictive prowess with explanatory power.
In the end, as technology places ever more powerful tools in the hands of those who do not understand how they work, we create great business and societal risk unless we find ways of building interfaces to these models that communicate these distinctions, and issues like data bias, to a growing user community that lacks an awareness of those concerns.