Magic Count with a BigQuery

Note: special thanks to co-author Steve Sandmeyer aka Sandy

Welcome to Game Two of the Fall Classic. Baseball can be a funny game. Prior to the series starting, everyone got into a lather (including us) over the “epic” game one pitching matchup between Clayton Kershaw (LAD) and Chris Sale (BOS). The result? Neither ace got out of the fifth inning and a total of 12 runs were scored as Boston opened the series with an 8–4 victory.

In this post we continue our analysis of the World Series using Google Cloud with a focus on hitting and a little tool called BigQuery and our lingua franca SQL.

The Red Sox cruised to a 108-win regular season and a Major League best 5.4 runs per game. They’ve been plating runs even more frequently during the postseason (10 games), outscoring opponents 64 to 39 so far — with a whopping 32 of those runs scoring with 2 outs. Try to sneak a fastball by them? Good luck. Mookie Betts hit .384 vs the fastball this season, ranking second in all of Major League Baseball. JD Martinez hit .366 vs the fastball, ranking third. (Interesting note: the other dugout has the baseball’s best hitter against the fastball: Matt Kemp of the Dodgers, who hit .388 vs the fastball, and homered in the second inning of last night’s game on — you guessed it — a fastball).

The Count loves numbers especially the number 1

With runners in scoring position, the Red Sox slashed .289/.379/.493 during the regular season. The .289 BA and .379 OBP were tops in baseball. But they are making a mockery of that in the playoffs, slashing an absurd .365/.461/.647 with RISP thus far. Boston’s offensive romp through the playoffs thus far has been nothing short of majestic.

There are many ways to define “best”: runs scored, batting average, etc. However, for our analysis, instead of looking at traditional top line stats, we have been looking at more specific situational metrics — count specific for example. We already know Boston has the best offense in baseball — but where do they inflict the most damage? In which situation?

Is there a “magic count?” and where might the players and team rank against others based on count?

For this analysis, we needed actionable pitch level data to create traditional stats like batting, slugging and on-base percentages, but specifically at the count level where the action occurs. Baseball is not a big data problem, but it is also not a small data problem. Wrangling 2.5 GB of data on a laptop, say in pandas, is not very efficient. So for this use case, we are using BigQuery to help execute transformations and push down all of the needed aggregations.

NOTE: You can hack on similar data in BiqQuery by heading over to our public datasets hosted on Google Cloud. After the World Series we will post new data from 2018.

The rows in this table (this is my project) contain the thumbprint for every pitch in the regular and postseason since 2015. With this data we can create stats relative to game context. Building these game state stats typically require ordering and for that we can employ standard SQL techniques like partitioning to create ordinal sequences like current count.

SELECT
*,
MAX(CONCAT(CAST(balls AS STRING), “-”, CAST(strikes AS STRING))) OVER(PARTITION BY sched.game_id, inning, half_inning, hitter_id ORDER BY pitch_number ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS starting_count_raw,
ROUND(IF(ab>0, hits/ab, 0), 3) as BA,
ROUND((hits + bb + hbp)/(ab + bb+ hbp + sac),3) as OBP,
ROUND(IF(ab>0, total_bases/ab, 0), 3) as SLG
FROM
(…)

We use these queries as base views or our targeted analysis. In this case, we wanted to calculate stats at the count level and for that we relied on RANK() specifically:

SELECT
*,
RANK() OVER(PARTITION BY starting_count ORDER BY BA DESC) as BA_rank,
RANK() OVER(PARTITION BY starting_count ORDER BY OBP DESC) as OBP_rank,
RANK() OVER(PARTITION BY starting_count ORDER BY SLG DESC) as SLG_rank
FROM
(..)

With a bit of SQL magic we can create player level ranks for stats by count. Here, Andrew Benintendi is ranked ninth putting the ball in play on an 0–2 count with a .259 batting average

(count when AB or PA ended and pa_count > 40)

So what does this look like when we group by team? Below are the Red Sox rankings for the 2018 regular season. Can you find the number 1 — like the Count?

A sign of a good offense is one that doesn’t let opportunities slip away. Every baseball team has an advantage when hitting ahead in the count. The Red Sox just happen to be better than most, ranking in the top 4 slash lines throughout MLB on 2–0 and 2–1 counts.

Interestingly, while the numbers might not look aesthetically pleasing, Boston owns the 2–2 count compared to the rest of MLB, ranking number one across the board with a slash line of .225/.227/.366. As for the playoffs? Other worldly. Jumping on first pitches (.333/.350/.667) 0–0 and demolishing the 2–1 count (.435/.435/.783)

Postseason (not including last night’s game)

By comparison, here were the Dodgers during the regular season. Los Angeles is an aggressive fastball hitting team that excels by hunting early in the count — evidenced by the success on 0–0 and 0–1 counts. They remain aggressive even behind in the count (1–2).

The awesome part of the workflow here is that we don’t have to extract data elsewhere for transformation and analysis AND the majority of these queries run in a few seconds thanks to the power of BigQuery. It’s great for big and small data, for good counts and bad counts.

Lately, I have been pushing my team to migrate more and more aggregations out of pandas and into SQL. This way we can share the value of the aggregations more easily AND we are saving a ton of time not having these run as one off pieces of code. Find something interesting — create a view! This post is glimpse into a bit our hacking. If you are new to SQL hopefully you can see some of the magic when using a serverless data warehouse like BigQuery. If nothing else — hopefully you can see just how good the Red Sox inside the count.