This article was written by Tom Fawcett and Drew Hardin.
Throughout its history, Machine Learning (ML) has coexisted with Statistics uneasily, like an ex-boyfriend accidentally seated with the groom’s family at a wedding reception: both uncertain where to lead the conversation, but painfully aware of the potential for awkwardness. The unease stems in part from the fact that Machine Learning has adopted many of Statistics’ methods but was never intended to replace Statistics, or even, originally, to have a statistical basis. Nevertheless, Statisticians and ML practitioners have often ended up working together, or working on similar tasks, each wondering what the other was about. The question, “What’s the difference between Machine Learning and Statistics?” has been asked for decades.
Machine Learning is largely a hybrid field, taking its inspiration and techniques from all manner of sources. It has changed directions throughout its history and often seemed like an enigma to those outside of it. Since Statistics is better understood as a field, and ML seems to overlap with it, the question of the relationship between the two arises frequently. Many answers have been given, ranging from the neutral or dismissive:
- “Machine learning is essentially a form of applied statistics”
- “Machine learning is glorified statistics”
- “Machine learning is statistics scaled up to big data”
- “The short answer is that there is no difference”
to the questionable or disparaging:
- “In Statistics the loss function is pre-defined and wired to the type of method you are running. In machine learning, you will most likely write a custom program for a unique loss function specific to your problem.”
- “Machine learning is for Computer Science majors who couldn’t pass a Statistics course.”
- “Machine learning is Statistics minus any checking of models and assumptions.”
- “I don’t know what Machine Learning will look like in ten years, but whatever it is I’m sure Statisticians will be whining that they did it earlier and better.”
The question has been asked, and continues to be asked regularly, on Quora, StackExchange, LinkedIn, KDNuggets, and other social sites. Worse, there are questions of which field “owns” which techniques [“Is logistic regression a statistical technique or a machine learning one? What if it’s implemented in Spark?”, “Is Regression Analysis Really Machine Learning?” (Mayo, see References)]. We have seen many answers that we regard as misguided, irrelevant, confusing, or simply wrong.
We (Tom, a Machine Learning practitioner, and Drew, a professional Statistician) have worked together for several years, observing each other’s approaches to analysis and problem solving on data-intensive projects. We have spent hours trying to understand each other’s thought processes and discussing the differences. We believe we have an understanding of the role of each field within data science, which we attempt to articulate here.
The difference, as we see it, is not one of algorithms or practices but of goals and strategies. Neither field is a subset of the other, and neither lays exclusive claim to a technique. They are like two pairs of old men sitting in a park playing two different board games. Both games use the same type of board and the same set of pieces, but each pair plays by different rules and has a different goal because the games are fundamentally different. Each pair looks at the other’s board with bemusement and thinks the other pair is not very good at the game.
The purpose of this blog post is to explain the two games being played.