Laying the Groundwork — Creating a New Baseball Projection System: Part 1
Only some of my blog posts are reposted on Medium. Original post here: http://ericdykstra.me/blog/laying-the-groundwork
This series of posts is about building a baseball projection system from scratch to be used for the purposes of fantasy baseball. If you already know the concept of baseball projections and of fantasy baseball, you can go ahead and skip to the next subheading, as the next few paragraphs will describe these.
A baseball projection is a forecast of a player’s season: his expected batting average, how many runs he is expected to score, how often he is expected to strike out, and so on. It looks exactly like a real statistical season, except that the season hasn’t happened yet. For example, the Steamer projection for Mike Trout gives him a .304 batting average over 529 at bats in 147 games for the 2017 season.
Baseball teams use projection systems to estimate how much a player will be worth, and use that data to drive decisions about salaries and trades. The story of Sean Smith, creator of the projection system CHONE, shows just how valuable this information is to Major League teams. After his projections had a few years of results that bested all other systems, he was hired away by a Major League team that wanted his expertise all to itself. For a more in-depth primer on projection systems, check out this article over at Fangraphs.
Fantasy baseball is a game played by baseball fans. A group of usually 10–15 “managers” take turns drafting from a pool of real MLB players. They try to choose the players who will have the best overall statistics at the end of the season in a number of categories (home runs, for example). The players’ real-life statistics are tallied together, and the team with the best combined statistics wins. This means that the fantasy managers who are the best at predicting performance will win their fantasy baseball league. I’m sure you can already see how fantasy baseball managers use projections to try to get an edge on their opponents. If you’re interested in knowing more about the history and different styles of fantasy sports, the Wikipedia article is a good resource.
Why Build a New Projection System?
I’ve been a baseball fan all my life, and a stathead since before Fangraphs existed. I was calculating WAR in spreadsheets to make the case that Grady Sizemore should have won the MVP in 2006, before WAR was a published statistic anywhere. I took this obsession to the next level when I created an Excel-based, game-by-game score prediction model to try to beat Las Vegas betting odds.
My next project after that was beating daily fantasy baseball, which I did 3 years in a row, culminating in the 2014 season when I made $1,894.04 profit on a 12.52% ROI. My method for this was a program that used a combination of player stat projections, Las Vegas over/under odds, and a variation of the Knapsack Problem algorithm to find the lineups with the best estimated score. You can see the game-by-game results of my season that I uploaded to a Google Spreadsheet here. I stopped playing in the 2015 season when I moved to Japan, as none of the daily fantasy sites let me play anymore (believe me, I’ve tried).
So why this project and why now? Well, I almost always get the urge to do something baseball-related as the offseason wears on. It’s still more than a month before my own season starts, and a couple of weeks more before the MLB season gets underway. It’s also just about the right amount of time to get this project off the ground and built from start to finish. I figure I can get my projections done by mid-March working on this just on weekends, which leaves plenty of time before my fantasy baseball drafts. Building a projection system also feels like the right balance: something I am confident I can do with some effort, with plenty to learn along the way.
At its core, every projection system is just data in one end, and a projection for the next season out the other. To get an idea of how simple a projection system can be, I’m going to introduce you to Marcel the Monkey. Marcel is a projection system created by Tom Tango, consultant to MLB teams and co-author of The Book: Playing the Percentages in Baseball (one of the books that sparked my interest in MLB statistics). It is meant to be a sort of baseline projection; any projection system worth its salt should be able to beat a monkey. It’s very simple: it takes the 3 most recent years’ data, weights the most recent data more heavily, regresses to the mean, and adjusts for the age of the player.
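To make the Marcel recipe concrete, here is a minimal sketch of those steps for a single rate stat. The module name, the 5/4/3 weights, the 1200 plate appearances of league-average regression, and the age curve pivoting at 29 are all assumptions chosen for illustration; they are loosely modeled on how Marcel is commonly described, not taken from this post or from Tango's own code.

```elixir
defmodule MarcelSketch do
  # Assumed weights for the three most recent seasons, newest first.
  @weights [5, 4, 3]
  # Assumed amount of league-average playing time blended in (regression).
  @regression_pa 1200

  # seasons: list of {events, plate_appearances} tuples, newest first.
  # league_rate: league-average events per plate appearance for the same stat.
  def project_rate(seasons, league_rate, age) do
    weighted = Enum.zip(seasons, @weights)

    # Weight each season's events and playing time, then total them.
    events = weighted |> Enum.map(fn {{ev, _pa}, w} -> ev * w end) |> Enum.sum()
    pa = weighted |> Enum.map(fn {{_ev, pa}, w} -> pa * w end) |> Enum.sum()

    # Regress toward the league rate by adding league-average playing time.
    regressed = (events + league_rate * @regression_pa) / (pa + @regression_pa)

    regressed * age_factor(age)
  end

  # Assumed age curve: a small boost below 29, a smaller penalty above it.
  def age_factor(age) when age < 29, do: 1 + (29 - age) * 0.006
  def age_factor(age), do: 1 + (29 - age) * 0.003
end
```

Calling something like MarcelSketch.project_rate([{30, 600}, {25, 600}, {20, 600}], 0.03, 29) returns a rate that sits between the player's own weighted rate and the league rate, which is the whole point of regressing to the mean.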
Since a projection is just a data transformation, I’m going to be using the functional programming language Elixir to create these projections. It’s a language with a simple way of expressing data flows, using “pipes”, indicated by the |> symbol, to pass data through functions. A simple Elixir “pipeline” to find the sum of the odd numbers between 1 and 100 might look like this:
1..100 |> Stream.filter(&(rem(&1, 2) == 1)) |> Enum.sum()
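If pipes are new to you, it can help to see the same computation written both ways. This is a small sketch, with is_odd as a helper name I made up for the example, showing that a pipe simply feeds the value on its left in as the first argument of the function on its right.

```elixir
# Hypothetical helper: any one-argument predicate would work here.
is_odd = fn n -> rem(n, 2) == 1 end

# Nested calls have to be read from the inside out.
nested = Enum.sum(Enum.filter(1..100, is_odd))

# The piped version reads left to right, in the order the data flows.
piped = 1..100 |> Enum.filter(is_odd) |> Enum.sum()

IO.inspect({nested, piped})
# prints {2500, 2500}
```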
So here’s the basic “pipeline” that I’m going to use to make my projections:
players_stats |> normalize_for_environment() |> project_season() |> adjust_for_2017_environment()
So what does this mean? We will start with a player’s stats from the previous 3 years, and adjust them for the playing environment they were in. Then we take those adjusted numbers and apply a slightly more nuanced version of what Marcel the Monkey does, producing a weighted, regressed, age-adjusted projection. Finally, we take that projection and adjust for the environment that player is going to be playing in for the 2017 season.
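Here is a rough skeleton of how those three stages might fit together for a single stat (home runs). Everything inside the functions is a placeholder: the env_factor field, the 5/4/3 weights, the extra env_factor argument on the last step, and the sample numbers are all made up for illustration, and the real versions of each step will be considerably more involved.

```elixir
defmodule ProjectionPipeline do
  # Step 1: scale each past season to a neutral environment.
  # env_factor is a hypothetical multiplier (above 1.0 favors hitters).
  def normalize_for_environment(seasons) do
    Enum.map(seasons, fn s -> %{s | hr: s.hr / s.env_factor} end)
  end

  # Step 2: placeholder projection, a weighted average of the
  # neutralized totals (newest season first, assumed 5/4/3 weights).
  def project_season(seasons) do
    pairs = Enum.zip(seasons, [5, 4, 3])
    total = pairs |> Enum.map(fn {s, w} -> s.hr * w end) |> Enum.sum()
    used = pairs |> Enum.map(fn {_s, w} -> w end) |> Enum.sum()
    %{hr: total / used}
  end

  # Step 3: scale the neutral projection into the target environment.
  def adjust_for_2017_environment(projection, env_factor) do
    %{projection | hr: projection.hr * env_factor}
  end
end

# Made-up sample data, newest season first.
players_stats = [
  %{year: 2016, hr: 22, env_factor: 1.10},
  %{year: 2015, hr: 18, env_factor: 0.90},
  %{year: 2014, hr: 20, env_factor: 1.00}
]

projection =
  players_stats
  |> ProjectionPipeline.normalize_for_environment()
  |> ProjectionPipeline.project_season()
  |> ProjectionPipeline.adjust_for_2017_environment(1.0)

IO.inspect(projection)
```

Each stage takes the previous stage's output as its first argument, which is what lets the whole thing read as a single pipe.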
Each of these steps is fairly complex, and will include multiple steps of data transformation. I’ll cover exactly what “playing environment” entails, and how we’re going to adjust for it, in the next post.