Modernizing Baseball Data and Analytics with the Lakehouse

Alexander Booth
5 min read · Aug 24, 2023


Globe Life Field in Arlington, TX

The world of baseball has changed dramatically since the Oakland A’s first used sabermetric methods to evaluate players (popularized by the 2011 film Moneyball). With the right data, MLB teams can gain a distinct competitive edge, and at the Texas Rangers we’ve fully embraced a data-driven approach to do exactly that.

Across the board, we rely on data and analytics to make decisions that will help us win more games and hopefully championships. We use biometric and in-game statistics to assess player performance, identify strengths and weaknesses, and make informed decisions regarding team composition, player development, and roster management. And these insights aren’t limited to only our team. We are able to use data to evaluate potential prospects, identify talent, and make strategic decisions during player acquisitions and drafts. Before and after each game, we analyze player performance, field positioning, and more to formulate game strategies, optimize in-game decision-making, and gain insights into opponent tendencies and patterns.

However, it wasn’t long before we found that our underlying infrastructure wasn’t able to cost-effectively scale to meet the ground-breaking use cases we were trying to deliver to the organization. There has simply been an explosion of new data sources, including streaming and unstructured or semi-structured data generated by thousands of sensors from cameras all around our stadium, and our old infrastructure couldn’t keep up.

Position and scope of Hawkeye cameras at a baseball stadium

We then decided to switch to a multi-cloud data warehouse, but with compute and storage locked together, this system couldn’t handle the scale we were dealing with. In the last few years alone, there have been many new data sources, including across the Amateur and International space, that we had not considered ingesting before. As a result, the multi-cloud data warehouse became too rigid to adapt to our changing needs, too complex to manage with our lean engineering team, and too costly to scale as we continued to ingest more data at higher velocity.

Compounding this was the fact that a lot of our data and systems were disjointed: nobody knew where data was or how to prepare it within the labyrinthine systems. With everyone saving data in different locations, data quality suffered, effort was duplicated, and analysis became more difficult. It was clear we needed a cloud-based, scalable platform that could grow with us and help us build a World Series-caliber team. That’s why we turned to Databricks.

Lakehouse enables smarter pre- and post-game reporting

We knew we needed to get our transformation ETL layer out of our previous solution. With Databricks Lakehouse, we’ve been able to unify data, analytics and AI — we can now conduct analysis and carry out machine learning where the data resides in the Lakehouse. This has greatly simplified the process, allowing us to be more agile and efficient while also saving costs as we scale.

Now, we use Unity Catalog for federated data governance across the entirety of the Texas Rangers’ data environment, along with granular access controls and permissions. And with generative AI, we are exploring use cases like text summarization (think articles by your favorite beat reporter). Feeding these insights into business intelligence reports helps us make faster and more accurate decisions.

Meanwhile, Delta Lake contains our different layers of data to streamline our processes — the bronze layer contains our raw data, then the silver layer contains cleaned data, and then the gold layer is used for reporting. Auto Loader helps efficiently process new data files, and then Delta Live Tables speeds up data engineering processes. And to tie it all together, the team has been able to collaborate more effectively on everything with Databricks Notebooks and Workflows thanks to a unified view and templates across different data formats and sources.
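To make the bronze, silver, and gold layering concrete, here is a minimal sketch in plain Python. It is an illustration only and not our production pipeline: in practice these layers are Delta tables populated by Auto Loader and Delta Live Tables, while the field names, sample values, and the helper functions `to_silver` and `to_gold` below are all hypothetical.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as they arrived (hypothetical fields and values).
bronze = [
    {"pitcher": "A", "pitch_type": "FF", "velo_mph": "95.1"},
    {"pitcher": "A", "pitch_type": "FF", "velo_mph": "96.3"},
    {"pitcher": "A", "pitch_type": "SL", "velo_mph": ""},   # missing sensor reading
    {"pitcher": "B", "pitch_type": "ff", "velo_mph": "93.0"},  # inconsistent code
    {"pitcher": "B", "pitch_type": "FF", "velo_mph": "93.0"},
]


def to_silver(rows):
    """Silver: cast types, normalize codes, and drop rows that fail validation."""
    cleaned = []
    for row in rows:
        try:
            velo = float(row["velo_mph"])
        except ValueError:
            continue  # skip rows with an unparseable velocity
        cleaned.append(
            {
                "pitcher": row["pitcher"],
                "pitch_type": row["pitch_type"].upper(),
                "velo_mph": velo,
            }
        )
    return cleaned


def to_gold(rows):
    """Gold: reporting-ready aggregate, average velocity per pitcher and pitch type."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["pitcher"], row["pitch_type"])].append(row["velo_mph"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
```

The point of the separation is that each layer has one job: bronze preserves the raw feed so nothing is lost, silver is where validation rules live, and gold holds only what a report needs to read.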

With a unified approach to data, analytics and AI, the lakehouse has helped us simplify infrastructure management and harness our data to gain invaluable insights across various areas that support the aforementioned use cases. Our in-depth pre-game reporting enables us to analyze hitter and pitcher tendencies, while post-game reporting allows us to assess team performance and areas of focus moving forward. By capturing data at hundreds of frames per second, we can delve into player mechanics — tracking joints, body parts, and ground force exertion — to predict throwing speed and bat speed.

Furthermore, we are able to accurately predict ball trajectories for specific hitters and recommend optimal player defensive positioning on the field. Additionally, we have revolutionized hit probability and launch angle analysis, determining the most effective ways to put the ball in play. These insights have given our team a competitive advantage in strategy development and player performance optimization.
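As a rough illustration of why launch angle matters, the sketch below computes how far a batted ball would carry under a textbook projectile model with no air resistance or spin. This is a deliberate simplification and not the model we use: a real baseball is slowed considerably by drag, so these distances overestimate actual carry. The function name `carry_distance_ft` and its parameters are hypothetical.

```python
import math

G_FT_S2 = 32.174             # standard gravity in ft/s^2
MPH_TO_FT_S = 5280 / 3600    # miles per hour to feet per second


def carry_distance_ft(exit_velo_mph, launch_angle_deg, contact_height_ft=0.0):
    """Horizontal distance travelled before the ball returns to the ground.

    Drag-free projectile motion only; air resistance and spin are ignored.
    """
    speed = exit_velo_mph * MPH_TO_FT_S
    theta = math.radians(launch_angle_deg)
    vx = speed * math.cos(theta)
    vy = speed * math.sin(theta)
    # Positive root of: contact_height + vy * t - (g / 2) * t^2 = 0
    flight_time = (vy + math.sqrt(vy ** 2 + 2 * G_FT_S2 * contact_height_ft)) / G_FT_S2
    return vx * flight_time


# Compare a few launch angles at the same exit velocity (illustrative inputs).
for angle in (15, 30, 45, 60):
    print(angle, round(carry_distance_ft(100, angle), 1))
```

Even this toy model shows that the same exit velocity produces very different outcomes depending on launch angle, which is why the two are analyzed together.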

Texas Rangers All-Star, Josh Jung, discusses how analytics has changed his approach at the plate

Disrupting MLB with real-time data

In baseball, stats are everything, and Databricks has helped the Texas Rangers bring about some impressive numbers. With our previous solution, it would take 24–48 hours to generate post-game reporting, which was too late to be useful for the next game. Now, we can generate reports in hours, and in some cases we can even generate reports in real time for immediate consumption.

It also used to take up to six weeks to ingest new data sources due to the complexity of our legacy system. Now, it takes less than a week to roll out new data for analytics and ML, resulting in 7x more velocity when producing new data pipelines.

And, thanks to the unified system, we’ve improved collaboration across the board. Less technical analysts can be productive with data, letting us function with a lean data engineering team. This means we now have 3x more analysts who can work with data self-sufficiently than before.

The best part is that we’re able to do all of this more cost-efficiently. For the same cost as our previous multi-cloud data warehouse, we can work faster, more collaboratively, more flexibly, with more data sources, and at scale.

In conclusion

Data availability is the catalyst of innovation, and Databricks has allowed us to disrupt MLB and take the game of baseball to the next level. Maybe you’ll even see this impact the next time the Texas Rangers take the field.



Alexander Booth

Alexander is an Asst. Director of R&D within Baseball Ops for the Texas Rangers Baseball Club. He holds an MS in Data Science from Northwestern and a BA from WashU.