For baseball stat geeks, the data mined with machine learning is a game-changer. Here’s how one MLB team is using it to its advantage.
TechRepublic’s Karen Roby spoke with Jeremy Raadt and Zane MacPhee of the Minnesota Twins. MacPhee is an analyst in baseball research, and Raadt is director of baseball systems. They discussed the use of machine learning in baseball. The Twins are having a good year: As of publication, they sit at the top of the American League Central. The following is an edited transcript of their conversation.
Karen Roby: Jeremy, this is not the typical industry that we think about machine learning really being entrenched in. But we’re hearing a little bit more about it, and that’s why I’m really intrigued by this. Just talk a little bit about, with the Twins, how you guys have been introduced to machine learning and really what it’s being used for.
SEE: TechRepublic Premium editorial calendar: IT policies, checklists, toolkits, and research for download (TechRepublic Premium)
Jeremy Raadt: Typically, you don’t think about pro sports as much when it comes to big data and this type of machine learning, but it’s becoming more and more prevalent as the amount of data has increased. Just in the last couple of years alone, it’s grown exponentially with the number of sensors and the amount of data coming in from various sources. So now we can attach sensors to bats. We have video, we have all sorts of things going on now that are tracking, at a much more granular level, what’s happening in the game of baseball. That really allows us to start teasing out the luck component and trying to determine the true talent of the player we’re evaluating.
Karen Roby: That’s really fascinating. Talk a little bit about how machine learning is helping you narrow down the information you need to know.
Zane MacPhee: You know, going back 10 years, all the information that you could ever imagine getting from a baseball game was in a box score: how many singles a player got, how many strikeouts, how many walks. And just over the last 10 years and going into this new decade, we’ve seen a birth of metrics available to us as analysts in our player evaluation quest. Like Jeremy was mentioning, bat sensors. Statcast at the MLB level is very popular; you’ll see that on telecasts all the time. You know, all that player tracking and pitch tracking information.
And as that data gets bigger and the questions it can answer get more derivative, you can start asking questions about the outcomes, or the drivers of the outcomes, of plate appearances or pitches, and start answering questions like why a pitch was good based on its movement or velocity. As the data gets bigger and the questions get deeper and more interrelated, our reliance on machine learning techniques is going to increase. And as we look toward the future, it’s not like we’re going to get less data; we’re going to get more and more. We see this as a place where building an infrastructure early on will set us up very well down the line to turn that information into actionable decisions as quickly as possible.
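The kind of question MacPhee describes, estimating how good a pitch is from its movement and velocity, can be sketched with a simple classifier. The sketch below is purely illustrative: the features, the synthetic data, and the relationship between break, velocity, and swing-and-miss rate are all invented for the example, and are not the Twins' actual models or data.

```python
# Hypothetical sketch: estimating the chance a pitch gets a swing-and-miss
# ("whiff") from its velocity and movement, using synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pitch-tracking features: velocity (mph), horizontal and
# vertical break (inches). A real pipeline would pull Statcast-style feeds.
n = 1000
velo = rng.normal(93, 3, n)
h_break = rng.normal(8, 4, n)
v_break = rng.normal(12, 5, n)

# Fabricated ground truth: harder pitches with more break miss more bats.
logit = 0.15 * (velo - 93) + 0.05 * h_break + 0.08 * (v_break - 12) - 1.2
whiff = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([velo, h_break, v_break])
model = LogisticRegression().fit(X, whiff)

# Score a single new pitch: 96 mph, 10" horizontal, 16" vertical break.
prob = model.predict_proba([[96, 10, 16]])[0, 1]
print(f"estimated whiff probability: {prob:.2f}")
```

In practice the interesting part is exactly what MacPhee notes: as more derivative features arrive (bat sensors, biomechanics), the feature set and model complexity grow, which is where heavier ML tooling starts to pay off.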
Karen Roby: Is there a time savings involved here by using machine learning?
SEE: Natural language processing: A cheat sheet (TechRepublic)
Zane MacPhee: I mean, if you were to go to a website like FanGraphs, which is one of the more prominent public baseball research-y websites, you’ll see a couple of projection systems that are based exclusively on box scores. And when we were building those in house, they would take maybe, on the high end, an hour to run front to back, just because there’s not much data on every player.
And now, when we start doing research on this pitch level and player tracking information that’s available to us, the simulations and models we want to run are more complex. We want to simulate them a lot more. We want to be able to manipulate them in real time. And so, without machine learning techniques or some more data infrastructure techniques that we’re using, those models would take days or weeks to run. And in this landscape, information needs to be actionable as soon as possible. And so, running a model and waiting a week for a response is not really going to help our decision-makers make quick, timely, and responsible decisions.
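For context, the box-score projection systems MacPhee mentions can be surprisingly small. A toy version, in the spirit of public methods like Tom Tango's Marcel system (weight recent seasons more heavily, then regress toward league average), might look like this. The weights, league-average figure, and regression constant below are invented for illustration and are not the Twins' actual system:

```python
# Toy projection: weight recent seasons, then regress toward league
# average to account for small samples. All constants are illustrative.

def project_obp(seasons, league_avg=0.320, regression_pa=1200):
    """Project on-base percentage from past seasons.

    seasons: list of (obp, plate_appearances), most recent season first.
    """
    weights = [5, 4, 3]  # recent seasons count more
    num = sum(w * obp * pa for w, (obp, pa) in zip(weights, seasons))
    den = sum(w * pa for w, (_, pa) in zip(weights, seasons))
    eff_pa = den / max(weights[: len(seasons)])  # rough effective sample
    # Blend the player's weighted rate with the league average.
    blended = (num / den * eff_pa + league_avg * regression_pa) / (
        eff_pa + regression_pa
    )
    return round(blended, 3)

# A player trending up over three seasons projects between their raw
# rate and the league average.
print(project_obp([(0.380, 600), (0.350, 550), (0.330, 500)]))
```

The contrast MacPhee draws is between something this small, which runs in milliseconds, and pitch-level simulation models over millions of tracked events, where the runtime problem he describes actually bites.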
Karen Roby: Jeremy, talk a little bit about how you went about making the decision to move forward with a program, with a plan here, and vendor selection, that type of thing, how difficult was it and what were some of the roadblocks that you ran into?
Jeremy Raadt: Yeah, for the last couple of years, we were attempting to do a lot of this on our own: stringing different technologies together step by step, building a lot of the server infrastructure on the cloud ourselves, wiring together the communication. It became too much for our small team to handle. We don’t have that many people on our team. And so we were looking for resources that could help us get to the next level without taking up as many human resources.
And so over the last year or so, we started investigating different platforms. The one we landed on was Databricks, as our way of being able to process and handle the amount of data we have, and to do it in real time. That was one of the considerations: As we are now getting these metrics in real time, within seconds after they happen on the field, we want to make them actionable.
Databricks came to our attention because of its deep integration with the Microsoft ecosystem, which the Twins have had a long-standing relationship with. We did investigate some other platforms as well, but what we found really set Databricks apart, beyond the machine learning and the actual compute power it offers using Spark, is the whole ecosystem around how you manage your models, how you do the DevOps, how you enable your analysts to be more productive, to share code, to test their models and have all those runs tracked. They had that whole rich ecosystem we could dive into. Since we have a limited number of people on our team, we could lean heavily on that to coordinate a lot of the efforts instead of having people try to wire those things up ourselves.
SEE: Inside UPS: The logistics company’s never-ending digital transformation (free PDF) (TechRepublic)
Karen Roby: Zane, sometimes when we talk about data, of course, it’s often referred to as the “new oil” of the digital economy, because there’s just so much information that can be at your fingertips. Is it sometimes a little bit overwhelming? Do you feel like you can analyze too much? Do you ever feel that way?
Zane MacPhee: Oh yeah. But that’s a motivating factor, I think. The deeper you dig into the data world, the deeper the hole is. And that’s a really motivating factor for our research team: We’re driven by the deeper and deeper questions you can answer and the deeper understanding you can gain.
The one thing I would add to that is there’s a lot of data at our disposal, and it’s, in a lot of ways, really hard to know how to weigh that appropriately or to make it actionable. Our decision-makers are evidence-based, and our job is to present them with evidence for them to make their decisions on.
And so part of the competitive landscape is turning this data into actionable, little bite-size morsels for our decision-makers. And when we have all this data at our disposal, it can be very overwhelming. Integrating Databricks and some practices there has allowed us to streamline that process, and hopefully cut through the noise and avoid over-researching, just because we can do things a lot quicker.
Karen Roby: Back to the idea that we don’t typically think of machine learning as being involved in baseball. But consider IT in general: With so many companies, IT was this department off to the side, brought in only when there was a small network issue or when computers went down. But IT obviously has such a big seat at the table now in really every industry, because it’s so woven into the fabric of everything we do. Do you feel like that’s the case here with sports? Is IT brought into the conversation much more now than ever before?
Jeremy Raadt: Yeah, for sure. I think that’s one thing that gives the Twins a competitive advantage over some other teams, in other sports as well as in Major League Baseball: We actually integrate the baseball technology and IT. In fact, I’m on both the IT team and the baseball team. We have really deep crossover between us. So we can apply the expertise of IT, the years of experience doing networking and server administration, to a lot of the baseball technology. That has given us a huge competitive advantage, the ability to move really fast and make really good decisions.
Karen Roby: In a nutshell, do you feel like machine learning is an advantage for you guys, and what sets you apart potentially from other teams, especially looking down the road?
SEE: AI on the high seas: Digital transformation is revolutionizing global shipping (free PDF) (TechRepublic)
Jeremy Raadt: Yeah. In my opinion, machine learning allows us to tackle this mountain of data to answer the questions we’ve been asking in our heads for the last 20 years. We feel like we can finally answer them. Machine learning gives us the tools, and the platforms we’re using right now let us get those answers even faster. Not only is the data increasing, but our speed and ability to get the data in-house is increasing. In fact, we can get it within seconds of the play happening. Being able to react so much quicker is the advantage we feel machine learning gives us over other teams.
Karen Roby: And Zane, as an analyst too, in baseball, I’m guessing, probably in your sleep, you’re thinking of players and how they’re batting, what they’re doing, how they’re feeling. So, this certainly has to be a huge help to you personally.
Zane MacPhee: Yeah. I think, when you track the level of information in our data sources as an analyst, the questions you could answer were very straightforward at the very beginning: Basically, how good you would be in the future as a player was a function of your past performance. And now we’re getting more and more derivatives of the processes that lead to outcomes, whether it be the swing characteristics and bat sensors Jeremy was mentioning, or pitch-level information, and I think the industry in general is moving more toward biomechanics as well. The more derivative you get, the less clear the questions, or the reasons an outcome is happening, become.
And so machine learning is going to help cut through some of that noise where maybe 10 years ago, I knew all the questions that I needed to ask, and machine learning is kind of helping us maintain relevance in our curiosity and kind of cut through some of that noise.
Jeremy Raadt: It’s like the golden age of being a baseball nerd right now. It’s so much fun! Yeah, it’s a great time for people getting into machine learning. There’s so much data coming in, and it’s growing. Looking to the future, we see a lot with biomechanics, being able to track every ligament and movement and strain. There’s so much more data coming down the pike that this is just going to become an arms race.
Karen Roby: It’s definitely not our granddads’ baseball. It’s really interesting to see how machine learning is making its way into sports, an industry where people never really saw it coming.
Jeremy Raadt: Yep. It’s a fun area to be in, for sure.