Analyzing the Game of Baseball on GCP

Eric Schmidt
Published in baseballongcp
41 min read · Oct 25, 2016

Project “Uecker” — The Harry Doyle Method, by @notthateric

tldr; Data analysis with Google Cloud Platform services is fun. If you like to have fun and like baseball — here is a guide to help you start hacking.

Overview

This post reviews a user journey related to hacking on and analyzing baseball data, primarily the 2016 Major League Baseball (MLB) regular season and postseason, using Google Cloud Platform services and tools. This journey is bucketed into question formulation, data ingestion, data processing, and data analysis.

Note: Throughout this document I will use the formatting of “Note:” to highlight something that is important; however, the context may not fit or tie into the existing flow. Suffice to say if you see “Note:” you want to remember it for later.

Note: In this document I mix ‘I’ and ‘we’, which is a general reference to the team that helped with this project. There is no ‘I’ in team, but I speak on their behalf in this document.

Why Baseball

  • It’s fun. Well maybe… some people say baseball is boring, others say data analysis is boring. Our hope was to find a middle ground, with a trend towards fun.
  • There is a massive amount of data out there related to the game. One can find historical game logs back to the early 1900s or more modern data like ball rotation speeds for every pitch from every pitcher. Access to data is empowering; it would be a shame to just let it sit there.
  • There is an overwhelming amount of punditry and analysis, spanning a century plus of games. Like the access to data, these existing questions, insights, and analyses help birth new ideas. At a minimum we can verify existing insights and beliefs.
  • Baseball is a highly statistically driven sport, and the majority of its statistics are based on basic arithmetic. It's an approachable analysis domain. Young kids learn ratios by memorizing the batting averages of their favorite players.

Audience

This paper was written with a focus on data-centric developers/engineers who build ETL systems and batch and streaming pipelines to move, sanitize, and prepare data; on data architects who think about what types of systems to build and how; on data analysts who drive day-to-day operational analysis; and on data scientists who attempt to find order and value in the face of ever-growing data and business complexity.

If you simply played baseball as a kid or like the game and you are a data nerd then you might enjoy this paper.

Project Goal(s)

The primary goal of this project was to share guidance and direction on how you might use Google Cloud Platform technologies, specifically Cloud Dataflow, Cloud Datalab, and BigQuery within a particular context.

Along the way I will:

  • Demonstrate how managed serverless services can help save time and money
  • Show why making mistakes is a good thing
  • Discover unknowns by looking at data vs. just “Googling it”
  • Deal with a small data set with big data problems

I will not:

  • Predict the outcome of the World Series
  • Reinvent Moneyball
  • Build a real time betting system capable of beating Vegas enough to retire

Takeaways

After reading this paper and related notebooks and following basic instructions you should be able to:

  • Access the 2016 game data within BigQuery
  • Write some basic queries against the data from BigQuery and from Cloud Datalab
  • Understand the role of Cloud Dataflow as an ETL tool to help move, shape, and aggregate data
  • Think about data analysis along a continuum of question to insight.
  • If you are new to baseball you may take away some new perspective on the game which you can use to impress your friends, family, or boss.

Bob Uecker

Note: If you are into baseball and do not know Bob Uecker — this paper cannot help you. Insert sarcastic emoticon here.

While working on this project I have had many folks ask what is an “UH-ker” or “YOO-aaa-ker”. Is it a disease, an exotic food, a number with lots of zeros? It’s pronounced “YOO-kur”.

Bob Uecker was a professional baseball player, specifically a catcher. He was charismatic (read: very funny) and considered to be a strong defensive player, but statistically not at the top of the heap. In 1967 he led the league with the most passed balls. After 5 years in the majors, he hung up his cleats in 1967 with a batting average of .146. Then in 1971 he became the play-by-play announcer for the Milwaukee Brewers.

Note: Make sure you click on the statistics link. You will hit up baseball-reference.com or mlb.com a decent amount during this paper. Get used to it.

I wasn’t alive when he played, but I did get to see him as a kid on Johnny Carson and then on CBS and NBC as a color commentator. Through his humor he made baseball and sports even more fun.

Then in 1989 he starred in the movie Major League as the commentator Harry Doyle, which became a cult classic. I have watched it an embarrassingly large number of times. Once again, he made baseball even more fun. This is excellent commentary of his top lines from the movie.

After all of these years, at the age of 82, Bob Uecker is still funny. He is a World Series winner from 1964, he hit 14 home runs in the majors, he is a movie star, and a rock solid play-by-play announcer.

For Project Uecker I decided to channel the spirit of Harry Doyle to provide some guidance on analyzing baseball. Even when we fail, it will still be fun — much like Uecker’s playing career. Did I mention he was on a winning World Series team? <grin>

Warming Up

Note: If you are new to baseball you were just passively introduced to your first baseball concept: “Warming Up”. Pitchers warm up in a bullpen prior to entering a game. Not all pitchers that warm up enter the game and sometimes pitchers don’t get to warm up before entering.

To start, we will back away from baseball for a few minutes. As a developer advocate at Google, I spend the majority of my time helping existing customers, potential customers, and fellow data analysis nerds understand how to use or in some cases not use a particular product or product feature to accomplish their goal(s). Their goals are almost always focused on a question.

Question Vectors

In the “boring” non-sport world these questions span challenges like:

  • What was the impact on net-revenue last week due to the previous week’s marketing campaign?
  • What is my current stock level on product X and how many online baskets are targeting this product?
  • Will dropping the price of product Y by 10% increase sales for product X?

In the “exciting” world of sport — like baseball — these questions might look like:

  • What was player X’s on base percentage with runners in scoring position last year?
  • What is my current pitcher’s performance against left handed hitters when playing on the road after less than 5 days of rest and when eating doughnuts for dinner?
  • Will replacing the current pitcher X with a new pitcher Y help get the next batter out?

If you don't understand concepts like net-revenue or on-base percentage don't worry, I'll make sure to describe context where needed. The one thing to take note of here is that each list of questions has the same time and analysis pattern. "What was" is a historical perspective, "What is" is focused on now or near real time, and "Will" is focused on the future. And each question has its own potential answer path: basic analytics (counts, sums, etc.), complex analytics (correlation, causation, etc.), and predictive analytics (pattern analysis, machine learning).

Regardless of vertical (retail, healthcare, gaming, finance, manufacturing, or even sports) or discipline in which one works (marketing, sales, recruiting, security, accounting, coaching, etc.) the endgame for answering these questions is the same. Unfortunately, the answer path rests within more questions: "How long will this take? How accurate are the answers? How hard is this going to be? What is this going to cost me?"

Time To Answer

Generically, I like to call this problem space "Time to Answer". The time to answer depends on many factors which impact how we choose to answer the question and what risk or cost we will tolerate along the way. A highly delayed answer may be valueless OR a fast answer that is inaccurate may be dangerous OR the path to an answer may be so complex that the residual technical debt is too great.

Note: I recently published a webinar covering the relation and tension between speed, cost, ease, and accuracy and how to think about "Time to Answer" for both batch and streaming data processing. Project Uecker is mainly focused on the batch processing domain, although we introduce some streaming concepts for in-game analysis.

With this “Time to Answer” concept let’s look at the journey from question to insight. Unfortunately, once again, new questions arise.

  • Question: What is the right question to be answering? How do I know if I am accurate?
  • Ingestion: What techniques should I use to ingest or extract my data?
  • Transformation: Should I or do I need to apply transformations to my data?
  • Storage: Should I store in a blob store, in a database, in a HDFS, …?
  • Imperative Analysis: What imperative analysis should I perform so that downstream consumers don’t have to perform redundant analysis?
  • Interactive Analysis: How can I dynamically interact with this data? I need to load more data, how will that impact speed?
  • Machine Learning: Will I need to use machine learning techniques, if so how do my upstream decisions impact my machine learning workflow?

And don’t forget collaboration which may wrap each step. In particular — How can I interact with various team members who are critical to my analysis and question answering journey?

These questions are a continuum. In a perfect world we'd all methodically work each section and then move on, but in the real world we bounce from step to step, sometimes even ignoring important steps. The key is how quickly and how easily you can move from step to step.

In a perfect world all tuples would be green for any question context, but the world is not perfect mainly because every question context is different and not all tools, products, and providers are the same. This rating appears fast, but gets expensive, and is hard to use.

Note: The color distribution is completely arbitrary. And I am making no specific comparisons of one product vs. another; rather I'd like you to think about the challenges you face today within each step — in your real world use case(s). How much pain (or not) do you experience when, say, performing ETL? Can you quantify the pain? How expensive is your data warehouse? What would you change?

In the real world what we really want is the ability to jump between steps with consistent accuracy, with dialable speed and cost knobs, wrapped in an easy to use environment.

Ok, enough conceptual talk, let’s get back to the business of baseball. I’ll touch on these concepts as we continue the journey.

Baseball 101

Before going any further — I will review the fundamentals of baseball. If you have played baseball and/or watched baseball and have a fundamental understanding of the game then feel free to skip this section and go to Determining the Right Question.

Note: I thought about not writing this section, but during my interactions with fellow Googlers and the broader data analysis community I got some comments like: "I don't understand baseball, I think it's boring, I never played, I don't like sports, etc.". Granted, my cohort group here is skewed towards the nerd world. What I did notice is that after I explained some of the fundamentals of game play and the value of basic statistics, the game can get super fun.

Baseball is a game of sport between two teams played on a field. The team who scores the most runs wins the game.

Field of Play

  • The field is roughly a quarter of a circle, where the outer edge of the circle is the outfield wall and the center is home plate.
  • The field is divided into the infield and outfield. The infield contains four bases or bags laid out in a diamond pattern. The bases: home plate, 1st, 2nd, and 3rd are each 90ft apart.
  • The distance from the pitcher’s mound to home plate is 60ft. 6in across every park.
  • The distance from home plate to the outfield wall varies. The distances down the RF line, the LF line, the RF alley, the LF alley, and to dead CF are different at every park.

Rosters

  • Each team has an active roster of players
  • Roster sizes change over the course of the season. In the postseason rosters are limited to 25 players.
  • For each game a team places 9 players in their lineup.
  • Each player is assigned a position 1st base, 2nd base, left field, etc. and a batting order 1st, 2nd, … 9th.
  • In the American League pitchers do not bat; instead, teams use a designated hitter (DH) — so they still field 9 defensive players like any other team, but the DH hits in the lineup (in place of the pitcher) and does not play a defensive position. Edgar Martinez, who played 18 seasons as a Seattle Mariner and inexplicably has yet to be elected to the Baseball Hall of Fame, and the recently retired David Ortiz of the Boston Red Sox are widely regarded as the greatest designated hitters of all time.
  • The remaining roster players are spread across starting pitchers, relief pitchers, and fielders and or hitters that can be rotated into the game as needed.
  • Typically, an active roster will consist of five starting pitchers, seven relief pitchers, two catchers, six infielders, and five outfielders — although teams will often tweak their rosters in the postseason to adjust to a 5 or 7 game series against the same opponent

Flow of Play

  • Teams take turns being in the field OR batting over the course of innings.
  • In professional/competitive baseball there are 9 innings per game. Extra innings are played if the score is tied at the end of 9 innings. There are no ties in baseball (well, yes, in rare cases — it's complicated).
  • There are two halves to an inning, the top and the bottom.
  • The home team always takes the field in the top of the inning and therefore bats in the bottom of the inning.
  • The away team always takes the field in the bottom of the inning and therefore bats in the top of the inning.

The Ball

  • The fielding team possesses the ball, primarily pitching the ball to the batter in an attempt to get him out.
  • The most notable position is the pitcher, mainly because he/she controls the ball in play and is responsible for pitching/throwing the ball to the plate for the batter to address.
  • The batting team, more specifically the batter at the plate, is responsible for putting the ball in play in the field OR, if awesome, hitting the ball over the outfield wall (this is called a home run).

Innings

  • A game of 9 innings is played over halves, the top and bottom. In order to stop a team from batting, the fielding team has to manufacture 3 outs.
  • Outs are recorded by a pitcher striking out a batter, a fielder catching a batted ball in the air, or a batter or runner failing to reach a base before the ball gets there on a ball in play ("put out" or "force out"). There are 33 ways to get out. Most outs are manufactured from strike outs, force outs, and put outs.
  • The hitting team rotates through a hitting lineup 1 through 9.

At Bats

  • Each hitter gets 4 balls and 3 strikes to manage. At the start of the at bat, the batter is at 0 balls and 0 strikes. This is called the count.
  • As the pitcher throws balls and strikes, or as the hitter swings and misses or fouls off a ball with 0 or 1 strikes, the count increases (a foul ball with two strikes will keep the count intact and another pitch will be thrown — but a foul on a bunt attempt with two strikes will result in a strikeout).
  • Umpires call balls and strikes based on the pitch location relative to the player and home plate.
  • Once the count reaches 3–2, the batter will either walk (ball 4), strike out (strike 3), foul the ball off (preserving the count for another attempt), or hit the ball and get on base or get out. The small sketch after this list encodes these count rules.
  • Once three outs are recorded, the teams switch places, with the batting team taking the field and the fielding team getting their turn to bat.
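To make those count rules concrete, here is a minimal Python sketch of how a count evolves pitch by pitch. The pitch-result names are simplified placeholders of my own, not codes from any real feed.

```python
# Minimal sketch of how a count evolves. Pitch results are simplified to:
# "ball", "called_strike", "swinging_strike", "foul", "foul_bunt", "in_play".

def update_count(balls, strikes, result):
    """Return (balls, strikes, outcome); outcome is None while the at bat continues."""
    if result == "ball":
        balls += 1
        return balls, strikes, ("walk" if balls == 4 else None)
    if result in ("called_strike", "swinging_strike"):
        strikes += 1
        return balls, strikes, ("strikeout" if strikes == 3 else None)
    if result == "foul":
        if strikes < 2:              # a foul only adds a strike up to two strikes
            strikes += 1
        return balls, strikes, None
    if result == "foul_bunt":
        strikes += 1                 # a bunted foul with two strikes is a strikeout
        return balls, strikes, ("strikeout" if strikes == 3 else None)
    if result == "in_play":
        return balls, strikes, "ball_in_play"
    raise ValueError("unknown pitch result: " + result)

# Example at bat: foul, foul, foul, ball, swinging strike -> strikeout on a 1-2 count.
balls, strikes, outcome = 0, 0, None
for pitch in ["foul", "foul", "foul", "ball", "swinging_strike"]:
    balls, strikes, outcome = update_count(balls, strikes, pitch)
print(balls, strikes, outcome)   # 1 3 strikeout
```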

Scoring

  • A team earns a run by getting a runner around the bases: starting at home plate, moving around 1st, 2nd, and 3rd, and returning to home plate.
  • Batters get on base by being walked (ball 4), getting a hit and reaching base before being put out, being hit by a pitch, and some other “exotics”. There are 7 ways to get on base.

Determining The Right Question

Now that you understand some of the fundamentals of the game of baseball, let’s dive into a few questions — after all this paper is about analysis. Recalling the questions I outlined above in the “Time to Answer” section you will remember that these questions typically present themselves due to some business problem that is reactive in nature OR a hypothesis that might be proactively explored. Either way these questions rarely come out of thin air, rather they are almost always driven by existing business context.

We are now going to enter the business of baseball and journey through the process of assessing the right questions to analyze. If you have been a passive fan of the game, think what it might be like for a pitcher making a decision about what pitch to throw next. Or if you are a more hardcore fan, think from a coach's perspective about which pitcher should be pulled from the bullpen to face an upcoming batter. Answering these questions can get pretty complex pretty quickly, so let's start simple.

Given that the ultimate goal (where the goal is winning) of baseball is to score more runs than the other team — we can pragmatically ask the question — how does a team score more runs than another team? There are other goals of baseball like having fun and generating revenue. Both are typically correlated with winning so let’s focus on runs.

Note: There are teams who focus on attendance, fiscal prudence, and entertainment value over simply winning. Maybe later we can look at some data to test this statement.

So how does one team score more runs than another team? There are many philosophies, religions, strategies in this space to which I direct you to “Google It!”, but for this document I’ll simply break them down into two simple and independent buckets:

  • Score runs (at a rate greater than the opposing team)
  • Prevent runs (at a rate greater than the opposing team)

These sound super obvious — in fact almost insulting, but if you think about the interplay here you quickly get to the tension, a.k.a. the beauty, of baseball. The team in the field is 100% out there to prevent the batting team from scoring. And the batter is 100% focused on hitting or influencing a play to create and/or score a run. The point is that you really have to do both, and do them consistently over the course of a game, to be successful.

Scoring Runs

There are numerous strategies on scoring runs. One of the most well known modern strategies was birthed in the early 2000’s by the Oakland A’s — it was called Moneyball. The movie is good, the book is great, the concept is brilliant. In essence the A’s strategy focused on finding financially viable (read: undervalued due to traditional valuation concepts) players with solid on base percentages and slugging percentages. The goal was to build a scrappy (read: a team that they could afford) team that could get on base. If you can get on base you can score runs. They also looked for players who made fewer errors. They were successful with this question path and model. This was the right question path at the right time.

If you follow the Moneyball path, you enter a world of many, many, many more questions — specifically around how you get on base. There are three common ways to get on base: a hit, a walk, or a hit by pitch. The latter is something that we will ignore for now due to its low frequency in the game.

Focusing on hits and walks creates more questions. What types of hits get a player on base? What players are capable of drawing more walks? Are certain hitters better at getting on base in specific situations than others?

Note: There are several formal analysis domains for baseball, including the most notable, Sabermetrics. I pull some concepts from Sabermetrics in this document, but for now, if this is new to you, just know that there are numerous approaches, philosophies, and models out there to help guide one down a path of question enlightenment.

This is where things get fun, wild, and, well, super nerdy, because we can keep going down a logic path to come up with questions to explore how to score runs by focusing on getting on base. For now we will stop and move on to thinking about preventing runs.

Preventing Runs

In the same vein as scoring runs there are numerous strategies for preventing runs. The most common technique is acquiring stellar defensive position players to support a deep and versatile pitching staff. Optimum run prevention is the combination of good pitching with a good defense behind it. The challenge here is that it is really hard (read: costly) to build up an entire staff of above average pitchers and fielders.

Note: The same economic challenges arise if you wanted to build a team of all top star offensive players (read: guys who can hit to score runs and get on base).

For defense — let’s look at pitching. Since a starting pitcher in modern times has 4 days of rest between games, teams need to strategically build up their pitching rotation to get the right matchups. You want to have your best pitchers facing the best hitting teams. The difficulty lies in that over the course of a season pitcher performance can vary wildly. You never really know how a pitcher might perform on a given day. Maybe he ate some bad tacos.

There is another element to run prevention besides pitching — the defense that plays behind the pitcher. Statistically, over the last century of baseball, the batting average on balls in play has wavered minimally. In essence, across all batters, a ball put in play has roughly a ⅓ chance of becoming a hit. The connection between pitcher performance and ball-in-play outcomes is effectively locked.

Note: I encourage you to read Big Data Baseball. There isn't a movie yet, but I hope there will be, as it is an even more nerdy version of Moneyball that tells the story of how the Pittsburgh Pirates turned their 20-year losing streak around through data analysis. As a native of Da Burgh and fellow data nerd, this book made me smile.

This statistical glacier sparked a round of analytical controversy in the early 2000s which focused on shifting players in the field and exploiting other techniques like pitch framing to convert hits into outs and balls into strikes. The human element (the imperfection of home plate umpires) has been widely leveraged by teams over the past 7 to 10 years. Terms like "stealing strikes" are now commonplace in baseball. These concepts were brought to the field in the early 2010s by the Pittsburgh Pirates, who through deep statistical analysis targeted catchers that could best frame a pitch that was technically a ball into a called strike. They also started dynamically shifting their defense based on pitcher-to-hitter matchups and game context.

Again, this is where things get exciting, because a question like "Does effectively framing pitches prevent runs from being scored?" has actual analytical and outcome evidence that the concept works. This was the right question path, at the right time.

I will revisit the concept of determining the "right question" later in this document. For now I'd simply like you to keep in mind that the path to determining the right question rests with first defining a clear purpose. In the 2011 Pirates' case, their purpose was to seek out ways to prevent runs with the talent they already had on the team, by creating outs by literally altering space — focusing on outs, which led to creating strikes through framing, and anticipating where the ball is going to be hit.

Note: A crucial factor in driving successful analysis is having the right subject matter expertise involved in the question formulation process. If you are ignorant of a particular domain, seek out experts who can help you. If you are an expert in the domain, seek out expertise that may be counter to your focus or bias. As you learn, seek out others and try to teach them what you are learning. You will learn something valuable each time, like discovering you were missing a key factor.

Note: I followed the above advice. Even though I had some basic baseball knowledge, I was no expert in the game. I reached out to two key friends, one who has been a sport (baseball and basketball) commentator and analyst for the last 20 years and the other who played in the minor leagues for several years. After a few white boarding sessions, plus some beers and pizza, we had a method to focus on and that spurred a ton of tightly clustered “What-if” questions. In essence being on the same page really helps, especially if it is based on sound expertise.

Focus on Pitching

For Project “Uecker” we decided to mainly focus on pitching for the following reasons:

  • During the postseason and in particular the World Series pitchers are used and managed differently.
  • During the regular season, a coach may leave a struggling pitcher in the game to save his bullpen for the next game. In the World Series that pitcher will probably get pulled, but there is an impact on the bullpen and game management.
  • During the regular season American League teams play the majority of their games in American League parks; therefore, the designated hitter rule is in effect and thus most pitchers don't bat. Consequently, when they do have to bat, they are mostly order killers, because they will almost certainly get out. The winner of the All-Star game has home field advantage. This year the American League won the game 4–2, so the Indians have home field advantage. Therefore, the Indians play 2 games at home, then at least 2, possibly 3, games away, and then possibly 2 games at home. Those away games will impact their lineup. The same goes for the National League team when at an American League park — they need to use a roster player as a designated hitter, which is something they normally don't have to do in the regular season.
  • They are under way more pressure in the World Series which can impact emotional management, adrenalin, clarity of thinking, and ability to execute.
  • They bring stress into the series. These teams have just played 162 regular season games, plus the division and championship series. This creates a lot of stress.
  • They create stress on the field as the stakes are way higher. One mistake and the game could be out of reach and the season over.
  • They are in foreign territory. Granted, some of the players may have played in the visiting park before, but it's nothing like the regular season where you play your division rivals to the point of nausea. The lighting is different, and they may not like the outfield setup.
  • We removed assessing defensive strength and execution, except for the pitcher, because we are assuming that defensively each team, specifically their coaching staff has crafted a defensive game plan that accounts for shifting and positional alignment on an at bat and even pitch basis.
  • We backed into this assumption for two reasons: 1) We didn't have access to defensive alignment data; therefore we could not focus on defensive strength for fielders beyond basic stats like errors for the fielders in play. 2) Even if we had the alignment data, we felt that the likelihood of a player not doing what he was supposed to do at this point in the season was lower, although he can still make an error. In essence, if the fielders are in the right place (and they should be), the main focus comes down to pitcher execution.

The Harry Doyle Method

With pitching as the focus of our question and analysis domain, we came up with a 'simple' way to look at pitching — we call it the Harry Doyle method, which is based on a Vaughn Score and a Haywood Score.

Vaughn Score — The Vaughn score is a reflection of a pitcher's ability to manage the count, maintain location control, and drive overall at bat outcomes. Each area is scored and factored and then aggregated into a single number. Higher numbers are an indication that the pitcher is better at staying ahead of the count and not giving up walks and hits.

Haywood Score — The Haywood score is a reflection of how much stress a pitcher is under on a per pitch basis. The main factors for the calculation are scoring threat (runners on base + batter) * batter strength factored over outs, count, and inning.
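As a rough illustration of the idea (the weights and scaling below are placeholders of my own, not the project's actual formula), a Haywood-style per-pitch stress score might be sketched like this:

```python
def haywood_stress(runners_on, batter_strength, outs, balls, strikes, inning):
    """Toy per-pitch stress score: scoring threat scaled by batter strength,
    then adjusted for outs, count, and inning. All weights are illustrative."""
    scoring_threat = runners_on + 1             # runners on base plus the batter
    threat = scoring_threat * batter_strength   # batter_strength e.g. seasonal OPS

    out_factor = (3 - outs) / 3.0               # fewer outs -> more stress
    count_factor = 1.0 + (balls - strikes) * 0.1    # behind in the count -> more stress
    inning_factor = min(inning, 9) / 9.0        # later innings -> more stress

    return threat * out_factor * count_factor * inning_factor

# Bases loaded, a .900 OPS hitter, nobody out, 3-1 count, bottom of the 9th:
print(round(haywood_stress(3, 0.900, 0, 3, 1, 9), 3))   # 4.32
```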

With these basic stats you can start looking at groupings and outliers. I deliberately removed the player, outcome type, etc. information so as to avoid confusion.

Or we could look at something more concrete like this, which is Corey Kluber’s and Jon Lester’s count management score for their recent 30 games. This data really speaks to you. With a simple fit line you can see the overall difference in performance and quickly pick out anomalies.

A key part of the Vaughn Score is pitch zone management, specifically looking at how well a pitcher avoids hot zones for the hitter. For each at bat, we look back in time for all of the hits for the current batter and even split them by fastball and off-speed buckets. From there we build an odds ratio to compute the zones which the pitcher should avoid. Then as pitches occur during the at bat we can score each pitch based on safe zones and hot zones.
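A minimal pandas sketch of that hot-zone idea might look like the following, assuming a DataFrame of the batter's historical pitches with hypothetical columns zone, is_fastball, and is_hit (these are not the Sportradar field names):

```python
import pandas as pd

def hot_zones(batter_history: pd.DataFrame, fastball: bool, top_n: int = 3):
    """Return the zones a pitcher should avoid for this batter and pitch-type bucket.

    For each zone we compare the batter's odds of a hit in that zone against his
    odds of a hit in the bucket overall (a simple odds ratio). Assumed columns:
    zone (int), is_fastball (bool), is_hit (bool) -- placeholders, not Sportradar names.
    """
    bucket = batter_history[batter_history["is_fastball"] == fastball]
    overall = bucket["is_hit"].mean()
    overall_odds = overall / (1 - overall) if overall < 1 else float("inf")
    overall_odds = overall_odds or 1e-9      # guard against a zero-hit history

    ratios = {}
    for zone, grp in bucket.groupby("zone"):
        p = grp["is_hit"].mean()
        zone_odds = p / (1 - p) if p < 1 else float("inf")
        ratios[zone] = zone_odds / overall_odds

    # The zones with the highest odds ratios are the "hot" zones to avoid.
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]

def score_pitch(zone: int, hot: list) -> int:
    """+1 for a pitch kept out of the hot zones, -1 for a pitch thrown into one."""
    return -1 if zone in hot else 1
```

Aggregating those per-pitch scores over an at bat or a game is then one way to roll zone management up into a Vaughn-style number.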

From Method To Data — Sportradar

The core data for Project Uecker was licensed through Sportradar. Here I explain why I chose to use their data and review at a high level the schema of the data.

There are tons of places to find baseball data. I outlined them at the end of the document in the appendix section — "Where to Find Data, Analysis and Punditry for Baseball", but I chose to work with Sportradar for the following reasons:

  • They provide play-by-play feeds for each game, which include lineup changes and pitch-by-pitch outcome data. This fit nicely with my macro goal to look at pitching and hitting performance at a finer-grained level than just the at bat.
  • Their data is organized in well documented schema.
  • Accessing the data is through simple REST APIs that return XML or JSON.
  • Their live feed is very close to real time, in most cases well ahead of the live broadcast television feed. I wanted to do some pitch-by-pitch analysis as live games progressed.
  • They provide numerous aggregates and metadata services beyond play-by-play, like box scores, rosters, player profiles, stats, and venue information, that can be joined to enrich the respective datasets.
  • They were open to letting me re-publish my derivatives so that others can easily play and hack around.
  • Developers (anyone) can get a free trial key to hack on any data (baseball and beyond). If folks wanted to take the process forward they easily could.

Note: The usage of the BigQuery baseball dataset is bound by the following usage terms: Copyright Sportradar LLC. Access to data is intended solely for internal research and testing purposes, and is not to be used for any business or commercial purpose. Data are not to be exploited in any manner without express approval from Sportradar. Display of data must include the phrase, “Data provided by Sportradar LLC,” and be hyperlinked to www.sportradar.com.

Read Everything

After settling on Sportradar as my main data source, I started to dig into the details of what was available via their API. As you can see there are many different types of data.

Note: You can do the same by heading to http://developer.sportradar.us/io-docs. Simply sign-up to get a free access token and you are in, albeit with limited API call quota, but you can do a lot with the free token.

I read every feed schema to get an understanding of what everything looked like.

Note: Read everything you can, assimilate and think about how you might use the data. You never know what you will miss, especially if you didn’t look for it in the first place. Even if you know what you are looking for — take some cycles to explore additional data sources, you might find something that is gold now or possibly later.

I first spent time looking at pre-canned statistics in seasonal splits and in player profiles. That data was too coarse-grained, but I did note that I wouldn't have to calculate any of those stats myself, which might be useful later. Then I jumped into the play-by-play feed, as this is where the pitch and outcome data lives.

The play-by-play feed is a pitch-by-pitch feed — more specifically it contains each event for every at bat in the game. For each data feed you can get either JSON or XML. I live in the JSON world, due to PTXE disorder (post-traumatic-XML-exposure) from the early days of SOAP.

Here is a sample of the root of the play-by-play JSON feed.

Here is a snippet of a pitch within an at bat within an inning half.

The graphic above describes the first pitch of the at bat, which was an SL in zone 1 at 84 MPH, from pitcher 37bde… to hitter cf981…, and the outcome was a kKL. Cool! But what the heck are zone 1, kKL, and SL? Baseball people would guess right that SL is a slider and kKL is a strike looking; however, I needed a process that is consistent and not based on guesses. Plus, I needed to know what other mappings were possible.

Fortunately, Sportradar has a cleanly documented feed schema and mapping document. I read the document and was able to understand each mapping. As an example pitch zone 1 is high and outside left (looking at the plate from the pitcher’s perspective).
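Putting those pieces together, here is a heavily simplified, hypothetical rendering of such a pitch event as a Python dict. The field names and nesting are illustrative only, not the actual Sportradar schema.

```python
# Hypothetical, simplified pitch event -- field names are illustrative only.
pitch_event = {
    "inning": 1,
    "inning_half": "T",                # top of the inning
    "at_bat": {
        "pitcher_id": "37bde...",      # truncated ids, as in the sample above
        "hitter_id": "cf981...",
        "events": [
            {
                "type": "pitch",
                "pitch_type": "SL",    # slider
                "zone": 1,             # high and outside left, per the mapping doc
                "speed_mph": 84,
                "outcome": "kKL",      # strike looking
            },
        ],
    },
}
```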

After getting my head wrapped around what data was available and understanding its basic schema, I set off on processing the data.

Sizing and Processing Needs

At this point, I knew I would have to implement some ETL processes to perform these joins, as some of them weren't simple lookups; others were more complicated, like: what is the throwing hand for a given playerId, and how many pitches did he throw in his last 7 appearances?

Next, I started thinking about data size, specifically how much data will I possibly be addressing and how this might impact my ETL process and downstream query processing. I had access to 5 years (2012–2016) of data so that is 12,150 games (2430 games per season * 5) and then throw in the postseason.

A game file contains anywhere from 300–500 events, so the total is just under 5M events. Keep in mind these are 5M highly nested events. Nested or not, once you do the data type sizing across all events, this nets out (~a few GB) to be too big to, say, fit into Excel or Google Sheets, or to reasonably load into memory using straight-up Python or Java on a local dev environment.
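The back-of-the-envelope math looks roughly like this (the ~1 KB per event figure is a placeholder of mine for the sizing exercise):

```python
seasons = 5
games_per_season = 2430                    # 30 teams * 162 games / 2
games = seasons * games_per_season         # 12,150 regular season games
events_per_game = 400                      # midpoint of the 300-500 range
total_events = games * events_per_game     # ~4.86M events, "just under 5M"

bytes_per_event = 1_000                    # placeholder: ~1 KB per nested event
approx_size_gb = total_events * bytes_per_event / 1e9
print(total_events, round(approx_size_gb, 2))   # 4860000 4.86
```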

I exited my data familiarity stage with two new work items:

  • Implement ETL processes over the play-by-play data
  • Decide on then deploy a query engine to store my data and enable me to query it

Next stage was to start picking tools.

Choosing The Right Tools

I had three key decisions to make for my core workflow tools. Here I review what I wanted from each tool/service and why I made each decision.

  1. ETL
  2. Query Engine
  3. Analysis Environment

Note: You might be tempted to work this problem in ordinal order and start with ETL, but I started with the Query Engine, because all roads lead there. Your ETL processes target the query engine and your analysis environment(s) target the query engine. It's also where your data ultimately lives in the long run, so if you can, start there.

Before I started “buying” tools — I made three top level decisions that guided each tool selection:

  • The tools needed to be in the cloud. I wanted to make sure that each step was as accessible and shareable as possible so I could collaborate easily with others.
  • Note: If you are a data processing engineer or data analyst and you keep (read: hoard) raw data on your local laptop or desktop that others may find valuable — it's time to change your ways.
  • I would use Google Cloud Platform as my cloud of choice — after all I work on the team and it’s pretty awesome for data processing and analysis.
  • The tools needed to be portable. I wanted to make sure that if I wanted to take parts or all of the workflow someplace else, like run the ETL from the laptop or move the query engine elsewhere I could do that.

Query Engine (BigQuery)

For the query engine I chose BigQuery as it's super easy to get started, it has SQL syntax support, it's relational but also supports nested data, it's completely serverless, it's massively scalable, and it's cost effective. BigQuery is just as capable and effective with small data as it is with big data.

You can access the 2016 pitch-by-pitch data I used for this project at: https://bigquery.cloud.google.com/dataset/bigquery-public-data:baseball.
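As a quick smoke test from Python, something like the query below counts pitches by pitch type. The table and column names are my best recollection of the public dataset's wide-table schema, so verify them in the BigQuery UI before relying on this.

```python
from google.cloud import bigquery

client = bigquery.Client()   # assumes a GCP project with the BigQuery API enabled

sql = """
SELECT pitchTypeDescription, COUNT(*) AS pitches
FROM `bigquery-public-data.baseball.games_wide`
WHERE pitchTypeDescription IS NOT NULL
GROUP BY pitchTypeDescription
ORDER BY pitches DESC
"""

for row in client.query(sql).result():
    print(row.pitchTypeDescription, row.pitches)
```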

Let’s dive a bit deeper into why I chose to use BigQuery.

Query Syntax (SQL)

In the spirit of portability I looked at what query language I wanted to use and the impacts on upstream tooling and collaboration. I thought about using Mongo, which I have used heavily for other projects and since the data is already in a JSON schema, I was a little tempted. I also looked at Cloud Datastore, or BigTable since I have easy access to them — but each would limit the shareability of my queries. I decided that I needed a query engine that supports SQL, hopefully standard SQL, so that a broad group of people could share queries or at a minimum parse them.

Note: NoSQL engines are amazing, fun, and powerful. Some even have SQL parsers, but if they are not SQL compliant, things can get frustrating when spinning up new developers or migrating to a new service.

Schema (Really Wide)

My focus on SQL syntax crossed off the NoSQL options, but there were other reasons why I went the relational route. Mainly, I knew I wanted to build 1 or 2 really wide tables so I could avoid having to do any joins. In essence, I wanted as much context of a game as possible to be housed within the same row. Since BigQuery is a columnar store this model works really well and greatly reduced the complexity of my queries.

Note: You might choose a NoSQL database like Cloud Bigtable or Cloud Datastore over BigQuery if your data is rapidly mutating or Cloud SQL if you need transactional support. In my case the data is slowly moving and there are few updates. BigQuery did recently announce update support, so in a data warehousing case like this you can load and then later update or delete data without having to perform bulk re-loads.

Deployment (Serverless)

Even though I decided to target my tools to run in the cloud, this decision does not explicitly solve any cost or management problem; rather it shifts the economics and management elsewhere, and the deployment challenge remains.

In the cloud — I now have more freedom to deploy whatever I want, but I still have the burden of deployment & management and the cost liability — especially if I build static, non-scaling resources.

For example, deploying a single 4-core server to run MySQL for a month has a fixed-rate cost, and as you scale up so does your cost, in most cases hopefully linearly. The issue here isn't as much about cost as it is about utilization and therefore waste. If your utilization over a month is, say, 10%, why would you want to pay for 100% of the resources?

This is a long standing value proposition of the cloud. You can deploy what you need, when you need it, but unfortunately folks are still building fixed capacity clusters of resources that are under utilized. They do this because dynamically scaling resources, especially for databases and data analysis engines, is really hard, and if you don't manage the capacity-to-demand curve right, bad things happen. Thus, the model becomes: over provision and rationalize the waste.

This was true for my case. I wanted a query engine that was serverless, so I would only use resources when I needed them, and it needed to be completely "hot". "Hot" meaning that I didn't have to wait for the engine to load my data and get ready to process my queries. This is a key distinction between serverless and on-demand. An on-demand model helps reduce waste, but it can introduce lag while it's getting "hot". A serverless environment, when implemented correctly, provides hot processing, completely on demand.

Scalability (Dynamic)

Even though I was only starting with a few GB of data, I thought about data expansion and lots of queries. What if I loaded 50 years of data? What if I got access to additional data? What if I needed to run 25 queries at once? In any case, I didn't want to worry about how to scale the query engine.

Note: I have had a few heated discussions with folks about the use of BigQuery for small data. Their arguments are mainly that they perceived BigQuery as only helpful with really big data, but when you add up the value as described above BigQuery can be a really strong tool for smaller data. Plus — you have the built in scalability as you grow.

Cost (Free Up To 1TB)

BigQuery provides a free tier which gives you up to 1TB of query processing per month, and then $5 per TB after that. With the baseball data you can run thousands of queries (using only the columns you really need) and still stay within the free cap.

Note: Query price is a function of the data size of the columns addressed across all rows, regardless of your WHERE clause. This means you should be judicious about which columns you use in your queries.
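The free-tier math works out roughly as follows (the per-query scan size is an assumption of mine for illustration):

```python
free_tier_bytes = 1 * 1000**4          # 1 TB of query processing per month
columns_scanned_mb = 200               # assume ~200 MB of columns touched per query
queries_per_month = free_tier_bytes / (columns_scanned_mb * 1000**2)
print(int(queries_per_month))          # 5000 queries before leaving the free tier
```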

BigQuery storage is the same cost as Google Cloud Storage. My guidance is that if you are going to query it, store the data in BigQuery. It doesn't cost you any more money; in fact you can save money by storing the data only in BigQuery. If for some reason you need to move to Cloud Storage or elsewhere — exports are free.

Note: BigQuery currently bills storage on uncompressed data size. Therefore, if you have lots of data, e.g. 100's of terabytes, you may save money on storage by only storing the needed data in BigQuery. I encourage you to do the math on what pain level you would tolerate. At small sizes the net cost impact may be less than the mechanics of moving only certain sets of data back and forth.

Tooling Integration

BigQuery has seamless integration with Cloud Dataflow for both ingestion and query execution and Cloud Datalab has inline API support to make interacting with BigQuery super easy. I’ll talk more about this later.

ETL (Cloud Dataflow)

For my ETL processes I chose to use Cloud Dataflow for four main reasons:

  • I didn't want to spend any time deploying mini clusters to run my ETL jobs. I wanted to spend as much time as possible writing code, not babysitting servers.
  • I wanted something that integrated well with BigQuery.
  • I wanted to make sure I could develop locally, but easily deploy my real jobs to cloud based resources.
  • I write a decent amount of Java code and wanted to make sure I could rip out some of my core feed parsing code if I needed to run it outside of Dataflow.

Portable

Cloud Dataflow is the basis for the incubating Apache Beam project.

If you are new to parallel processing and data pipelining read the below snippet from the Apache Beam home page. In my case, I needed embarrassingly parallel ETL and I wanted it to be portable. This was a good fit.

Apache Beam is an open source, unified programming model that you can use to create a data processing pipeline. You start by building a program that defines the pipeline using one of the open source Beam SDKs. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

Beam is particularly useful for Embarrassingly Parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. You can also use Beam for Extract, Transform, and Load (ETL) tasks and pure data integration. These tasks are useful for moving data between different storage media and data sources, transforming data into a more desirable format, or loading data onto a new system.

Note: To be clear I did not use the Apache Beam SDK; however, the delta between the Cloud Dataflow SDK and the current Beam SDK is minimal. I could port it with ease.

Fully Managed

Cloud Dataflow is a fully managed service. Fully managed means that the service handles allocating, deploying, scheduling, optimizing, and monitoring all of the resources needed to run a pipeline.

This is a snippet of a few finished Dataflow jobs that I ran while building the system. I ran about 100 of them while doing my work. I'd write some code to parse the games, then run the ETL job, and then inspect my output with queries in BigQuery. The key here is that sometimes I needed to parse 1000s of files, other times just a few.

As data size grows the job clock time gets longer; however, if really needed, I could ask the service to use more resources, which it would automatically allocate and deploy.

Input Scheduling and Sharding

Cloud Dataflow provides a wide array of support for sources and sinks including GCS, BigQuery, BigTable, and Cloud Datastore. In my case I had 1000’s of game files sitting on Google Cloud Storage.

During job execution Dataflow handles the reading and the sharding of input across the workers allocated for my job. I was using 10 to 25 single core workers.

Note: These workers only run during the life of the job. Worker allocation is completely on-demand. This really helps minimize cost, because there is no under-utilization. This also helps in cases where, say, I need to process 1M files and need it done within a clock-time window. You simply ask the service to use more resources.

When my job starts, Dataflow (after building out the worker pool) scans the input path and handles the binding of input (work) to my workers. I didn't have to worry about how to parallelize the work — it's done for me.

BigQuery Sink

Cloud Dataflow has optimized source and sink integration with BigQuery. I used this API to write all of my ETL output directly to BigQuery, as well as read tables and queries from BigQuery to drive my transformations.

Here is a snippet of code that shows how easy it is to bind my final ETL output to a BigQuery schema; Dataflow then does the heavy lifting of connecting to BigQuery and loading the data. I don't have to worry about bundling, batching, or error handling — just build my rows and go!
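The project's snippet is Java against the Cloud Dataflow SDK; as a rough equivalent sketch in the Apache Beam Python SDK, with a hypothetical bucket, table, and schema, and assuming the game events have already been flattened to one JSON object per line, it might look like this:

```python
import json
import apache_beam as beam

def game_event_to_row(line):
    """Turn one raw JSON event (one line of a flattened game file) into a BQ row."""
    event = json.loads(line)
    return {
        "game_id": event.get("game_id"),       # hypothetical field names
        "pitch_type": event.get("pitch_type"),
        "pitch_zone": event.get("zone"),
        "pitch_speed": event.get("speed_mph"),
        "outcome": event.get("outcome"),
    }

with beam.Pipeline() as p:
    (p
     | "ReadGameFiles" >> beam.io.ReadFromText("gs://my-bucket/games/*.json")
     | "ParseEvents" >> beam.Map(game_event_to_row)
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
           "my-project:baseball.pitch_events",
           schema=("game_id:STRING,pitch_type:STRING,pitch_zone:INTEGER,"
                   "pitch_speed:FLOAT,outcome:STRING"),
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```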

Code Reuse

I wanted to make sure that I could move my core transformation code around if I needed to execute elsewhere or possibly compose it with other code. Below is a snippet of code I wrote to parse base runner location. Notice that there is no Dataflow code.
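As an illustration of that same framework-free idea in Python (the runner representation and field name here are hypothetical), such a helper might look like:

```python
def runner_locations(runners):
    """Map a list of runner events to a compact base-state string like '1B,3B'.

    `runners` is assumed to be a list of dicts with an 'ending_base' field
    (0 = scored/home, 1-3 = first through third). Plain Python only, no
    Dataflow/Beam types, so it can be unit tested or reused anywhere.
    """
    base_names = {1: "1B", 2: "2B", 3: "3B"}
    occupied = sorted({r["ending_base"] for r in runners if r.get("ending_base") in base_names})
    return ",".join(base_names[b] for b in occupied) or "EMPTY"

# Runners on first and third, another runner just scored:
print(runner_locations([{"ending_base": 1}, {"ending_base": 3}, {"ending_base": 0}]))  # 1B,3B
```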

Functions like this, among many others, are wrapped in a transform called GameWriter, and the final call in GameWriter is to the BigQuery sink.

Monitoring

Beyond job resource management, Dataflow provides real time monitoring. As I submitted my job(s), I'd jump over to the monitoring UI to look at progress and drill into any errors. The UI focuses on the logical flow of the pipeline, not the underlying control plane and parallelization scheme, which helps me focus on what is important: the flow of the data through my transforms and processing steps.

Note: This specific job was pretty small — only targeting ~1GB of data. Here I added a side input to join player information with the play-by-play information. Cloud Dataflow can scale to terabytes, even petabytes, of data. It works just as well with big data, and the same managed service benefits apply; in fact the managed aspects become more valuable the more you need to scale.

Awesome. With this, I now have my raw play-by-play data parsed, flattened, enriched, and loaded into BigQuery. Let’s pick a tool to analyze some baseball!

Analysis Environment (Cloud Datalab)

For my mainline analysis work I chose to use Cloud Datalab — mainly because it is super fast to setup and get running, it has built in support for connectivity to BigQuery and GCP services, and has a wide array of Python module support to help with visualization and more complex processing.

Setting Up

Datalab can run locally or in the cloud. I chose to run locally, mainly because I like to have access to the rest of my local tools. Running locally is super easy: install Docker, pull the pre-canned latest Datalab container, and you are off and hacking.

Once the container is running you can jump into your localhost notebooks view and start coding. This is my local set of notebooks. Looks like I need to update my container image <grin>.

Note: Source control with Datalab is “built-in” as it uses the local file system for the underlying storage of notebooks. Simply bind git to your desired notebook path(s) and you are good to go.

There are some excellent built-in tutorials and samples, like the BigQuery ones here.

BigQuery Integration

Within a notebook, Datalab provides excellent magic cell support for BigQuery. Here is a snippet of a query that targets pitches for a pitcher.

From there I can call this module with parameter(s), in this case the pitcher I want to analyze. The awesome thing here is that I get built-in support to transform the BigQuery response into a pandas DataFrame. The column mapping, batching, etc. is all done for me.
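As a rough stand-in for that notebook cell, using the google-cloud-bigquery client directly rather than Datalab's cell magic, and with hypothetical table and column names, getting one pitcher's pitches into a DataFrame looks something like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT pitch_type, pitch_zone, pitch_speed, outcome, balls, strikes
FROM `my-project.baseball.pitch_events`        -- hypothetical table and columns
WHERE pitcher_id = @pitcher_id
ORDER BY game_date, event_sequence
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("pitcher_id", "STRING", "37bde...")]
)

# The client handles paging and column mapping; we land straight in pandas.
df = client.query(sql, job_config=job_config).to_dataframe()
print(df.head())
```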

Now that I am in DataFrame land I can start rocking on combining processing logic and performing some basic analysis.

Note: Even though the data is highly denormalized and enriched, there are new metrics and stats that I'd like to build. This is in essence a prelude to flexible feature development. I made a conscious decision to inject common metric development into the denormalized data and then use downstream logic in my notebooks to create new metrics. If I discovered that a new metric is really more appropriate for the denormalized source in BigQuery, I could 1) go back to my Cloud Dataflow ETL job and compute it there OR 2) add it to my BigQuery query and have it do the work either inline or in a UDF.

Process and Visualize

Ok. We have a DataFrame, let’s write some logic. This is basic code to see if a pitcher threw a pitch down the gut on a hitter’s count. I have some wrapper code that reads the starting count and pitch zone from the pitch record in the DataFrame which then calls this function.
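A rough stand-in for that kind of check (treating zone 5 as "down the gut" and defining a hitter's count as any count where the hitter is ahead are my assumptions here):

```python
def down_the_gut_on_hitters_count(balls, strikes, zone):
    """Flag a pitch thrown down the middle while the hitter is ahead in the count.

    Assumptions: zone 5 is the middle of the strike zone, and a "hitter's count"
    is any count where the hitter is ahead (1-0, 2-0, 3-0, 2-1, 3-1).
    """
    hitters_counts = {(1, 0), (2, 0), (3, 0), (2, 1), (3, 1)}
    return (balls, strikes) in hitters_counts and zone == 5

print(down_the_gut_on_hitters_count(2, 0, 5))   # True  -- a pitch the pitcher may regret
print(down_the_gut_on_hitters_count(0, 2, 5))   # False -- that's a pitcher's count
```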

And then after I run all of my scoring functions, which are aligned back to each pitch, I can run some aggregations. This highlights the trade-offs you want to consider about where you build your aggregations. In this case it was easier for me to do this inline with Python and pandas vs. pushing it down to a SQL statement.
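For example, once each pitch row carries its scores, a per-game mean is a one-liner in pandas (the column names are hypothetical):

```python
# df has one row per pitch with hypothetical columns: game_id, count_control_score, zone_score
per_game = (
    df.groupby("game_id")[["count_control_score", "zone_score"]]
      .mean()
      .reset_index()
)
```

The same rollup could be pushed down into SQL with a GROUP BY; keeping it in pandas just made it easier to iterate on the scoring functions themselves.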

And then finally create some plot from the DataFrame and hopefully get some insight.

A statement like that produces the scatter plot described below (I left out the polyfit statements for simplicity).

This graph shows the mean count control score by game for two pitchers over their most recent 30 games. Higher is better. Both of these guys are amazing pitchers. If I were to plot a 3rd and poor performing pitcher you’d really see distribution and variance differences.

This graphic shows that Kluber has a slight advantage in control score and that as the season progressed he increasingly performed better than Lester. If you were to look at the standard deviation you'd see that Lester is also a bit more volatile, whereas Kluber is a bit more consistent.

This data is just an indication of performance; it does not tell the whole story. In fact Lester had a better ERA than Kluber for the year and a slightly better winning percentage, but this is where the analysis gets deeper. What if we took out the games where Kluber got shelled, or factored out weak run production by his offense, or looked at his days of rest or the hotness of the other team's batters? Could we definitively prove that one pitcher is better than the other? The questions are endless.

For now I think I will watch the games and see how things play out. Along the way I will post some new analysis as we see the game “talking to us” through the data.

Note: After the World Series is over I will clean up the key notebooks and post them to GitHub for others to hack on. There isn't anything magical inside of them, but what they do provide is an on-ramp to deeper analysis. Most of the hard work has been done.

Deeper Learning

You might be asking, “Where is the machine learning aspect of this analysis?”. Great question.

Note: Cloud Datalab has awesome support for TensorFlow and CloudML.

When I started this project a few weeks back my initial focus was building a model to predict out vs. on-base outcomes for each at bat. I violated most of the rules outlined in the “Time to Answer” section.

  • I built the simplest (read: naive) ETL process to start seeding basic data to build my features — mainly with the basic observation data.
  • Then I built a few regression-driven models and started training and testing. The process is relatively straightforward — due to the tight integration between Datalab and BigQuery, and then between Datalab, Dataflow, and Cloud ML to help build my graph, train it, and evaluate it.
  • My accuracy was horrible. Fail!
  • It turns out that my ETL process was producing bad data (due to my coding error) and subsequently producing bizarre results. Fortunately, I had my Dataflow code on hand. I spent time building better error handling and verification logic, re-ingested all of the data, and got back to it.
  • My accuracy increased, but was still poor. I was about 4% better than a fair coin flip. Just guessing OUT every time would give me better results (see the quick math after this list).
  • This is why I built this process and wrote this paper. My features were too granular and in some cases just plain garbage. I didn't have a process to truly understand and explore the complex relationships in the data. And for baseball in particular, the real valuable data (read: source of features) is within the intricate context, influence, and weighting between factors.
  • After the dust settles I plan to get back to building my models, hopefully armed with the "Right Question" to analyze and a trove of potential features to win the game.
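That coin-flip comparison is just arithmetic. Assuming a league-wide on-base rate of roughly .320 (an approximation on my part), a majority-class "always predict OUT" baseline already sits near 68% accuracy, so a model at ~54% is not saying much:

```python
league_obp = 0.320                     # rough league-wide on-base rate (assumption)
always_out_accuracy = 1 - league_obp   # just predict OUT every time
coin_flip_plus_4 = 0.50 + 0.04         # where my early model landed
print(round(always_out_accuracy, 2), round(coin_flip_plus_4, 2))   # 0.68 0.54
```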

Note: "There's one word that describes baseball: 'You never know.'" — Joaquin Andujar (former MLB pitcher). Maybe after some work we might :-)

Wrapping Up

Wow. This is a long post… If you made it here — awesome! To wrap up I encourage you to try the following:

  • If you don’t already have a Google Cloud Platform project, go set one up — it’s easy and it’s free to get started.
  • From there start hacking on the BigQuery public baseball dataset — it’s easy and it’s free to get started.
  • Finally, fire up Cloud Datalab locally and start creating your Harry Doyle method. It’s free to run locally and easy to get started.

I leave you with a classic Harry Doyle quote, “We’re in the top of the 9th inning leading 10–7, bases loaded, two down, and Rick Vaughn has come on to try to nail it down against Felipe Aguilar, a dangerous right-handed batter. Here’s the pitch.”

Good luck with your hacking.

Appendix

Where to Find Data, Analysis and Punditry for Baseball

There is an entire business and sub-culture around analyzing baseball: the games, the players, the financial transactions, and its impact on society. I encourage you to read real analysis, beyond the Harry Doyle method. Here are some links to help you find more knowledge.

  • Baseball Information Solutions, which focuses on hardcore collection of data, deep analysis, and distribution of baseball statistics. I find even their news feed fascinating. If I owned an MLB team I'd probably be asking them "to take my money".
  • Baseball Prospectus a massive gold mine of articles, blogs, analysis, stats, depth charts, and fantasy data for MLB and other professional leagues. I could spend all day reading their content.
  • The Society For American Baseball Research which focuses on developing and researching empirical analysis techniques for baseball. If someone talks about sabermetrics or throws out the phrase sabermetrician — you are more than likely dealing with someone who gets baseball.
  • Baseball Reference, founded by Sean Forman, whose goal is to provide the "… easiest-to-use, fastest, and most complete sports statistics anywhere. If we have some fun in the process, that's good too." It's one of the key statistical rosetta stones for baseball.
  • Of course — Major League Baseball, the governing and managing body for American baseball, which beyond shaping the business of baseball is at the forefront of data collection, media collection, and analysis and fun replay.
  • ESPN.com — the world leader in sports. They are always cooking up new and fun ways to look at data, like streak analysis, park factors, and scoring a "Top Game". Their Game Score method is akin to the Harry Doyle method.
  • FanGraphs a vibrant online community of sophisticated fanatics and pundits sharing analysis, hypotheses, and well thought out opinions with high signal data and sometimes even graphs.
  • Retrosheet a site dedicated to collecting and hosting play-by-play and box score files dating back to the early 1900’s. It’s so fun to dive deep into the old box scores, such a fantastic archive.
  • Sean Lahman, journalist and author, who provides the Baseball Archive, a more approachable route into the Baseball DataBank data.
  • SportingCharts, a visualization-driven site that quickly parses through a ton of stats to help an end user get to the meat of a team's state.
  • Sportradar.com, a global sports data and sports content provider. They focus on building efficient, easy-to-use tooling that provides accurate data at low latency. Sportradar has exclusive agreements with the NFL, NHL, and NASCAR and relationships with 65 other global sports leagues.
