Visualizing Pitcher Clusters: A Next OnAir Digital Experience

By Ramzi BenSaid
Published in baseballongcp · 18 min read · Jul 29, 2020

As fans of data analysis as well as sports, our Google Cloud developer relations data analytics team was disappointed by the curveball that crossed our collective plate in early March. After years of highlighting a variety of sports data work at Google Cloud NEXT, we had been excited to showcase our new capacity as the Official Cloud Partner and Official Cloud Data and Analytics Partner of Major League Baseball. Fortunately, after months of patience and cooperation, we’re pleased to share a digital-first experience in conjunction with Next OnAir.

For this experience, we focused on a pitcher’s pitching process (by which we mean strategy, mechanics, and delivery). If you’ve ever had the feeling that two of your favorite (or least favorite!) pitchers tend to pitch the same way, you can now visualize it. In the demo, we’ve rendered each pitcher season in an interactive, navigable 3D space, where pitchers exist in a cluster near those who have shown similar tendencies when delivering the ball the standard 60 feet and 6 inches to home plate.

Each pitcher’s season has been assigned to a particular cluster. When hovering over any pitcher, those in the same cluster will be highlighted as well. You can examine clusters, search for your favorite pitchers, filter by some important performance indicators, and go back in time five seasons.

Data Availability and Ingest

Our first step, unsurprisingly, was to gather the data. Casual fans probably know that baseball is a stat-heavy sport, but the latest ballpark tracking technology is far more sophisticated than the box scores of yesteryear. MLB’s five-year-old Statcast, a combination of two different tracking systems installed in every stadium, covers practically any event in a game. Statcast includes things like distance covered by fielders, the amount of time it takes a catcher to get the ball out of his glove when trying to pick off a runner, how long it takes a batter to get out of the box after making contact with the ball, and any other obscure datum you might imagine. If it happens on a baseball field, it’s probably in Statcast.

Pitcher stats such as ERA (earned run average) and WHIP (walks plus hits per inning pitched) have been around for years, but we wanted to see what specific tendencies might surface using metrics that wouldn’t have been available before Statcast. We wanted to look at metrics like release point, spin rate, and velocity of particular pitch types in order to dig into the nuts and bolts of a pitcher’s process more than a pitcher’s performance.

Using Dataflow (Google Cloud’s fully managed data processing service for Apache Beam), we ingested five seasons of Statcast data from MLB’s API into BigQuery, which served as our data warehouse for this project. With over one hundred data points per pitch ingested into BigQuery, we could start thinking about how to frame this problem. After exploring a few possibilities, we decided to look at each pitcher-season in aggregate for all pitchers with a minimum of 100 batters faced on the season. This gave us 2,542 pitcher-seasons, or approximately 508 pitchers per season, to cluster.
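To make the 100-batters-faced cutoff concrete, here is a minimal sketch in plain Python. It is not the Dataflow or BigQuery pipeline described above: the record layout, the function names, and the toy data are all hypothetical, and the only value taken from the article is the threshold itself.

```python
from collections import defaultdict

MIN_BATTERS_FACED = 100  # threshold stated in the article


def batters_faced_by_pitcher_season(plate_appearances):
    """Count plate appearances for each (pitcher, season) pair."""
    counts = defaultdict(int)
    for pitcher, season, _batter in plate_appearances:
        counts[(pitcher, season)] += 1
    return dict(counts)


def qualifying_pitcher_seasons(plate_appearances, minimum=MIN_BATTERS_FACED):
    """Keep only the pitcher-seasons that meet the batters-faced minimum."""
    counts = batters_faced_by_pitcher_season(plate_appearances)
    return sorted(key for key, n in counts.items() if n >= minimum)


# Hypothetical toy data: (pitcher, season, batter_id) per plate appearance.
# Pitcher "A" faces 120 batters in 2019, pitcher "B" only 40.
toy = [("A", 2019, i) for i in range(120)] + [("B", 2019, i) for i in range(40)]
qualified = qualifying_pitcher_seasons(toy)
print(qualified)
```

Only pitcher "A" survives the filter here. Applied to five seasons, the same idea yields the 2,542 qualifying pitcher-seasons, which works out to 2,542 / 5 ≈ 508 per season.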

Feature Engineering and Clustering

We stayed in BigQuery to compute all of the aggregate calculations of interest, leveraging materialized views on top of the raw data we had ingested from the MLB API. We created a massively wide table with hundreds of season-long metrics relating to when and how pitchers throw particular pitches, as well as what happens after the ball has left their fingertips.
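As a rough illustration of what one row of such a wide table might look like, the sketch below rolls pitch-level rows up into per-pitch-type usage shares and average velocities. The real work was done with SQL aggregates in BigQuery. The column names, the two features chosen, and the sample pitches here are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical pitch-level rows: (pitcher, season, pitch_type, velocity_mph).
# "FF" (four-seam) and "SL" (slider) are used only as illustrative codes.
pitches = [
    ("Pitcher A", 2019, "FF", 95.0),
    ("Pitcher A", 2019, "FF", 96.0),
    ("Pitcher A", 2019, "SL", 86.0),
    ("Pitcher A", 2019, "FF", 94.0),
]


def pitcher_season_features(rows):
    """Build one wide feature dict per (pitcher, season) from pitch-level rows."""
    velocities = defaultdict(lambda: defaultdict(list))
    for pitcher, season, pitch_type, velocity in rows:
        velocities[(pitcher, season)][pitch_type].append(velocity)

    features = {}
    for key, by_type in velocities.items():
        total = sum(len(vals) for vals in by_type.values())
        wide = {}
        for pitch_type, vals in sorted(by_type.items()):
            wide[f"{pitch_type}_usage"] = len(vals) / total
            wide[f"{pitch_type}_avg_velocity"] = sum(vals) / len(vals)
        features[key] = wide
    return features


features = pitcher_season_features(pitches)
print(features[("Pitcher A", 2019)])
```

Each additional metric (spin rate, release point, break, and so on) would add more columns per pitch type, which is how a table like this grows to hundreds of features.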

We then used K-means clustering via BigQuery Machine Learning to assign pitchers to particular clusters. With BQML, training a K-means model is as simple as running a SQL command:

CREATE OR REPLACE MODEL models.kmeans_19
OPTIONS(
    model_type='kmeans'
    , num_clusters=19
    , kmeans_init_method='KMEANS++'
    , standardize_features=TRUE
    , early_stop=FALSE
    , max_iterations=15
    , distance_type='EUCLIDEAN'
) AS
SELECT * EXCEPT (psid)
FROM clustering.pitcher_cluster_stats_season_table
WHERE pas > 99
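For readers curious about what those options ask for, the following is a small pure-Python sketch of K-means with standardized features, k-means++ seeding, Euclidean distance, and a fixed 15 iterations. It is an illustration of the general technique, not BigQuery ML's implementation, and the two-feature toy data (velocity and spin rate) is invented for the example.

```python
import math
import random


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(rows, k, rng):
    """k-means++ seeding; assumes at least k distinct rows."""
    centroids = [list(rng.choice(rows))]
    while len(centroids) < k:
        # Rows far from every chosen centroid are more likely to be picked.
        weights = [min(sq_dist(r, c) for c in centroids) for r in rows]
        centroids.append(list(rng.choices(rows, weights=weights, k=1)[0]))
    return centroids


def kmeans(rows, k, max_iterations=15, seed=0):
    """Lloyd's algorithm run for a fixed number of iterations (no early stop)."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(rows, k, rng)
    labels = [0] * len(rows)
    for _ in range(max_iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical features per pitcher-season: [four-seam velocity, spin rate].
toy = [[95.0, 2400.0], [96.0, 2450.0], [94.5, 2380.0],
       [88.0, 2000.0], [87.5, 1950.0], [89.0, 2050.0]]
labels, centroids = kmeans(standardize(toy), k=2)
print(labels)
```

Standardizing first matters because spin rate is measured in the thousands while velocity sits below 100; without it, spin would dominate every distance calculation.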

We trained dozens of combinations of K-means models using different feature sets, numbers of clusters, methods of initializing clusters, and other factors. This process happened iteratively using Google Colab Notebooks. Once the models were trained, we could do some initial model and cluster analysis in the BigQuery UI, where we could examine things like training loss, cluster size, and feature distribution.
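Two of the diagnostics mentioned here, cluster size and a distance-based loss, are simple to compute by hand. The sketch below assumes the loss of interest is the mean squared distance from each observation to its assigned centroid; the rows, labels, and centroids are made-up values rather than output from any of our models.

```python
from collections import Counter


def cluster_sizes(labels):
    """Number of observations assigned to each cluster."""
    return dict(Counter(labels))


def mean_squared_distance(rows, labels, centroids):
    """Average squared Euclidean distance from each row to its assigned centroid."""
    total = sum(
        sum((x - c) ** 2 for x, c in zip(row, centroids[label]))
        for row, label in zip(rows, labels)
    )
    return total / len(rows)


# Hypothetical standardized features, cluster labels, and centroids.
rows = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
labels = [0, 0, 1]
centroids = [[0.0, 1.0], [10.0, 10.0]]

sizes = cluster_sizes(labels)
loss = mean_squared_distance(rows, labels, centroids)
print(sizes, loss)
```

A lower loss means tighter clusters, but it almost always falls as the number of clusters grows, which is one reason the final choice also leaned on baseball interpretability.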

After much discussion about where we expected particularly unique pitchers to show up in this clustering problem, we settled on a model that we felt balanced insight from our unique features with interpretability from a baseball perspective.

Visualizing Results

It’s one thing to arrive at good cluster results in a table; it’s quite another to render the results of this high-dimensional clustering problem in a comprehensible space. To accomplish this, we turned to TensorFlow Transform (TFT) for a Principal Component Analysis (PCA). Using the same Colab Notebook where we’d fit and explored dozens of models earlier, we pulled cluster results and fed the features into a TFT pipeline that reduced them to three components. Translation? X, Y, and Z coordinates for our final application.
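To show what reducing to three components means in practice, here is a minimal PCA sketch in plain Python. It deliberately swaps TFT for a hand-rolled power iteration with deflation so that it runs without any dependencies. The four features and six rows are hypothetical, and a real pipeline would typically standardize the features first.

```python
import math


def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec


def pca(rows, n_components=3):
    """Project rows onto their top principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        value, vec = top_eigenvector(cov)
        components.append(vec)
        # Deflation: remove the direction just found so the next pass finds the next one.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(c * v for c, v in zip(comp, row)) for comp in components]
            for row in centered]


# Hypothetical features: [velocity, spin rate, release height, whiff rate].
toy = [[95.0, 2400.0, 6.1, 0.30], [96.0, 2450.0, 6.3, 0.25],
       [88.0, 2000.0, 5.5, 0.40], [91.0, 2200.0, 5.9, 0.10],
       [93.0, 2100.0, 6.0, 0.35], [89.5, 2300.0, 5.6, 0.20]]
coords = pca(toy, n_components=3)
for x, y, z in coords:
    print(round(x, 2), round(y, 2), round(z, 2))
```

Each output row is one pitcher-season's position in the 3D scene. The first axis captures the most variance, the second the next most, and so on.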

Understanding the Clusters and Analysis

There are a few important points to keep in mind when exploring the application.

  • The clusters’ presentation in space relates strictly to the feature engineering and results, and has nothing to do with the physical space over home plate, or the trajectory of a particular pitch.
  • The clusters persist across the seasons. The characteristics of the clusters that we describe below are true over time.
  • Pitchers’ processes can change over time. You may notice that a pitcher’s cluster changes across the seasons.
  • An observation is the occurrence of a pitcher-season in a cluster.
  • A number in parentheses indicates the number of times a pitcher has shown up in a cluster.
  • The values in the pitch usage tables will rarely total 100%, as we chose to leave out pitch types that clusters rarely throw.
  • A number in brackets indicates the cluster’s rank in terms of usage of a particular pitch.
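The parenthesized counts in the lists that follow can be thought of as a simple tally of pitcher-seasons per cluster. A minimal sketch, using invented pitchers and cluster assignments rather than our actual results:

```python
from collections import Counter, defaultdict

# Hypothetical (pitcher, season, cluster) assignments, one per pitcher-season.
assignments = [
    ("Pitcher A", 2015, 7), ("Pitcher A", 2016, 7), ("Pitcher A", 2017, 8),
    ("Pitcher B", 2015, 7), ("Pitcher B", 2016, 7), ("Pitcher B", 2017, 7),
]


def appearances_by_cluster(rows):
    """Map each cluster to a Counter of how many seasons each pitcher spent in it."""
    counts = defaultdict(Counter)
    for pitcher, _season, cluster in rows:
        counts[cluster][pitcher] += 1
    return counts


counts = appearances_by_cluster(assignments)
for cluster in sorted(counts):
    summary = ", ".join(f"{name} ({n})" for name, n in counts[cluster].most_common())
    print(f"Cluster {cluster}: {summary}")
```

In this toy data, Pitcher A changes clusters in 2017, which is the kind of season-to-season movement noted in the list above.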

While we don’t have recordings of our Hangouts sessions analyzing cluster stats and debating the central themes of each, for our fellow baseball nerds we do have some analysis of our final clusters. Without further ado, the clusters:

Cluster 1: Sinkers, Sliders, and Spin

The majority of this cluster consists of late-inning relievers (from both the left and right side) who are essentially two-pitch dominant (sinkers and sliders) without relying on the four-seam fastball as their primary pitch. Not only does this cluster throw the most sinkers to begin with, it also possesses the greatest pitch break on its sinker, as well as the second-highest velocity (speed), whiff rate, and spin rate specific to sinkers, which illustrates the adage of “pitching to your strengths.” The slider provides a nice complementary pitch with a different break plane, and it’s an equally strong pitch for this cluster: it has the second-highest slider break and fourth-highest whiff rate of any cluster.

  • Most Frequent: Joe Smith, Steve Cishek, Oliver Perez, TJ McFarland, Tony Watson, Zack Britton (5)
  • Notables: Sergio Romo, Matt Albers

Cluster 2: Not All Fastballs Are Equal

This cluster of right-handed pitchers includes both starters and middle/long relievers. The common theme is a fastball repertoire that is relatively diverse across four-seam, two-seam, and cut fastballs. In fact, two of these three fastballs rank in the top three of any cluster in nasty factor. This cluster mixes three different fastball velocities as primary pitches. It’s also unique in that these pitchers rely on less conventional pitches as their secondary pitches: think knuckle-curve in lieu of a traditional curve and slider. It’s an effective tactic for this crew: the ‘plus’ knuckle-curve ranks #1 in break and #3 in spin rate.

  • Most Frequent: James Shields, Trevor Bauer (4), Dillon Gee, Vance Worley, Phil Hughes (3)
  • Notables: Clay Buchholz, Mat Latos

Cluster 3: In, Out, Up, and Down

This cluster includes right-handed starters whose four-seam fastball is a secondary pitch to their cutter. These pitchers demonstrate a unique combination of velocity (speed) and downward movement, so it’s not surprising that this cluster owns the greatest spin rate (sink) of two-seam fastballs and the fifth-greatest break. This cluster represents a wide variety of pitches without drastic changes in velocity, relatively speaking. Their success (or failure) depends on movement.

  • Most Frequent: Jeff Samardzija (3)
  • Notables: Dan Haren, Alfredo Simon, Masahiro Tanaka, Yu Darvish

Cluster 4: Splendid Splitters

While there are a few starters mixed in here based on pitch repertoire alone, this group consists mainly of right-handed set-up men and closers. What makes this cluster unique? Not all late-inning plus-strikeout relievers rely solely on the four-seam fastball as a predominant pitch. Sure, it’s a ‘plus’ pitch for them, but this cluster shows a penchant for the splitter as pitch 1A. Let’s be clear: these splitters aren’t just effective, they’re practically diabolical. These 22 pitchers collectively own not only the greatest frequency of splitters, but also the highest whiff rate of any cluster as well as the second-highest break length on splitters. And did we mention the four-seam fastball ranking #5 in velocity? Yikes.

  • Most Frequent: Kevin Gausman (5), Kirby Yates, Oliver Drake, Luis Garcia, Koji Uehara
  • Notables: Jonathan Papelbon, Pedro Strop, Edward Mujica

Cluster 5: The Knuckleheads

Simply put: the knuckleball needed a home by itself in our three-dimensional rendering and clustering of Statcast data. Granted, veteran knuckleballer R.A. Dickey retired before the 2018 season, but he’s still part of our data corpus dating back to 2015. We also wanted to give a home to Steven Wright, the only active knuckleballer left (though he did not meet our minimum threshold for inclusion in 2019). Confession: we had a tough time leaving MLB Savant’s complete collection of at-bat footage for this crew, with so many good choices. Check out the movement on this knuckleball; it’s too good to pass up.

  • Most Frequent: Steven Wright (4), R.A. Dickey (3)

Cluster 6: Variety is the Spice of Life

This group of predominantly right-handed pitchers (two-thirds of whom are starters) keeps a diverse repertoire of five effective pitches with movement, without deploying traditional curveballs or changeups in their offspeed portfolio (the knuckle-curve is a good example, though it’s not used too frequently). Their higher-velocity repertoire with movement includes a splitter and a two-seam fastball to complement the four-seam. There are lots of ways to make a quality pitch in this cluster, and while there’s not a ton of slider and/or knuckle-curve usage, this group knows how to make them effective when they do throw them.

  • Most Frequent: Matt Shoemaker, Jeremy Jeffress (3), Nathan Eovaldi, Jake Odorizzi (2)
  • Notables: Homer Bailey, Tyler Mahle, Ricky Nolasco, Drew Pomeranz

Cluster 7: Two-Seam Fastball Served Here

This cluster represents predominantly right-handed starting pitchers who utilize the two-seam fastball more than any other cluster. Everything still comes off the four-seam fastball, but the two-seamer is definitely pitch 1A for this group. The slider, curveball, and changeup are used in roughly even proportions, and all serve as effective secondary pitches (#4 in slider speed). Despite the frequent use of the two-seam pitch, this cluster still includes a more traditional repertoire of fastball/slider/curve/changeup as a whole.

  • Most Frequent: Rick Porcello, Ivan Nova, Kyle Gibson, Sonny Gray, Marcus Stroman, Julio Teheran, Mike Foltynewicz, Stephen Strasburg, Aaron Sanchez (5)
  • Notables: Zack Greinke, Johnny Cueto, Jacob deGrom, Lance Lynn, Matt Harvey, Yovani Gallardo

Cluster 8: Keep the Ball Down

This cluster prefers mixing its pitches up with sinkers, curveballs, and cutters rather than relying on the four-seam fastball. There’s a lot of downward movement in this cluster, which explains a high ground ball rate (6th highest). Everything comes off the sinker (#5 in velocity, #4 in whiff rate, #4 in spin rate). The other pitches in the portfolio largely depend on how they are set up by the ‘plus’ sinker as the primary pitch.

  • Most Frequent: Jon Lester, Kyle Hendricks, Adam Wainwright, Corey Kluber, Josh Tomlin, Felix Hernandez, Jake Arrieta (5)
  • Notables: Noah Syndergaard, Doug Fister, Tanner Roark, Bronson Arroyo

Cluster 9: Throw the Kitchen Sink at ‘Em

This is a fun cluster — rarely will you see a straight pitch. With a near even distribution of starters and relievers, these folks love their sinkers, cutters, splitters, and sliders, and throw in a knuckle-curve for good measure. This cluster is majority right-handed, demonstrates hard movement (see those main pitches), and also has a four-seam fastball mixed in.

  • Most Frequent: Masahiro Tanaka (4), David Robertson, Hisashi Iwakuma, Tony Barnette, Mike Pelfrey (3)
  • Notables: Jarred Cosart, Edinson Volquez, Brandon McCarthy, A.J. Burnett

Cluster 10: A Cluster Unto Itself

After great debate, we felt Junichi Tazawa belonged in a cluster all by himself. Keep in mind that the corpus of data goes back five seasons: although Tazawa didn’t pitch in MLB last year (and pitched only sparingly in both 2017 and 2018), his forkball and curveball usage and spin rate make him unique. In fact, he’s the only pitcher in the last five years to throw a forkball.

Cluster 11: Speed Trap (Fast and Slow)

While this cluster sets up its repertoire with a traditional four-seam fastball, it’s far more telling that the changeup is the second most frequent pitch. Switching speeds is always an effective way to keep opposing hitters off balance. The result? A #4 whiff rate for that changeup. Mix in the cutter and two-seamer and you’ve got a variety of both movement and velocity in this portfolio.

  • Most Frequent: Marco Estrada (5), George Kontos, Luke Weaver, Michael Wacha (4)
  • Notables: Yusmeiro Petit, Cole Hamels, Max Scherzer, Madison Bumgarner, Justin Verlander, Eduardo Rodriguez, Mike Fiers, Dallas Keuchel, David Price

Cluster 12: Keep it Simple

Similar to the previous cluster, this group utilizes the changeup with great frequency (more than any other cluster, even the above). What makes this cluster unique is the classic “old school” repertoire of fastball/slider/curve/change. Granted, the changeup is a more frequent weapon for this group, but this is a classic case of a four-seam fastball setting up secondary pitches. This is a relatively large and balanced group: split nearly in half between right- and left-handed pitchers, and split in thirds between starters, middle relievers, and late-inning setup/closers.

  • Most Frequent: Blake Snell (4), Blaine Hardy, Drew Smyly, Luis Severino, Erik Goeddel, Chris Devenski, Matt Strahm, Tyler Glasnow, Sean Newcomb, Luis Avilan, Adam Conley, Jerry Blevins, Tyler Thornburg, Adam Morgan (3)
  • Notables: Tommy Milone, Joakim Soria, Tommy Kahnle, Clayton Kershaw

Cluster 13: Turn Out the Lights

For these (almost exclusively) set-up guys and closers, their fastball power makes them almost entirely strikeout pitchers (more than 9 strikeouts per 9 innings pitched). These are pitchers that miss a lot of bats with their fastball (#1 in four-seam velocity and #1 in four-seam whiff rate). When the ball is put in play, it will often come in the form of a pop out (#1 in pop-up rate).

  • Most Frequent: Cody Allen (5), Sean Doolittle (4), Craig Kimbrel (3)
  • Notables: Corey Knebel, Chad Green, Dellin Betances

Cluster 14: A One-Two Punch

When it comes to a traditional two-pitch portfolio of a fastball/slider combo, this cluster is it. Ninety-seven percent of this cluster consists of middle- and late-inning relievers, who often are two-pitch throwers. The four-seam fastball has major velocity (#2 in speed), resulting in lots of missed bats (#4 in whiff rate), while the slider complements that pitch with a significant spin rate (#2). As far as repertoire goes, this is classic old-school pitching.

  • Most Frequent: Ken Giles, Shawn Kelley, Andrew Miller (5), Cam Bedrosian, Jake McGee, Anthony Swarzak, Will Smith (4)
  • Notables: Josh Hader, Addison Reed, Tony Cingrani, Boone Logan, Tony Sipp, Chris Archer, Aroldis Chapman, Heath Hembree

Cluster 15: Three’s Company

At first glance, this cluster seems to be a traditional four-seam fastball and slider combination. But after a closer look, we can see the two-seam fastball is the distinguishing part of the equation. These pitchers move the ball from the same release point on three alternating planes (four-seam, slider, two-seam) a total of 76% of the time, which can be devastating to batters. The consistent frequency of this trio also sets up the softer offspeed arsenal (curveball and changeup) for greater success. Notably, this is the biggest cluster.

  • Most Frequent: Rich Hill, Gio Gonzalez, J.A. Happ (5), Mike Clevinger, Jon Gray, Jose Alvarez, Francisco Liriano, Danny Duffy, Wei-Yin Chen, Drew VerHagen, Jordan Zimmermann (4)
  • Notables: Dallas Keuchel, Clayton Kershaw, Ervin Santana, Justin Verlander, Chris Archer

Cluster 16: For the Moose

Hall-of-Famer Mike Mussina would be proud of this cluster. After all, he rode his knuckle-curveball to a career that ended in Cooperstown. The tandem of a four-seam fastball backed by a ‘plus’ knuckle-curve right behind it harkens back to his playing days. Sure, Moose didn’t throw the cutter, two-seam fastball, or slider as much as this cluster does, but the portfolio is reminiscent enough to merit the name.

  • Most Frequent: Santiago Casilla, Tyler Clippard, Jason Vargas, Drew Pomeranz, Dominic Leone (4)
  • Notables: Fernando Abad, David Price, Joe Kelly, Lance McCullers Jr, Brad Peacock, Shane Bieber, Gerrit Cole

Cluster 17: Don’t Blink

Viewed across the full portfolio, not just the four-seam fastball, this cluster has higher-velocity offerings as a whole. Neither the curveball nor the changeup is prominent within this cluster (<10% of the repertoire). Expect some heat (four-seam) with some sink to it (two-seam), a downward off-speed splitter, and a tight slider mixed in. These pitchers like velocity and movement, rather than outright deception and speed change (unless it’s a splitter).

  • Most Frequent: Junior Guerra, Luis Perdomo, Wade Davis (3)
  • Notables: Jake Faria, Pedro Strop, Jorge De La Rosa, Shohei Ohtani, Jake Odorizzi

Cluster 18: Roller Coaster Ride

There aren’t a lot of flat, straight pitches (four-seam fastballs) in this cluster, but there are plenty of sinkers (#2 in break), cutters (#2 in spin), and knuckle-curves (#1 in spin). These pitchers exhibit tons of movement, tons of spin, a strong break, and a devastating changeup (#1 in break). There’s no upper echelon velocity, relatively speaking, so this group relies on deception, movement, and change of speed in their game. Simply put, most of these pitchers haven’t thrown a flat straight pitch two times in a row since they were in high school.

  • Most Frequent: Mike Leake, Anibal Sanchez (5), Trevor Cahill, Zack Godley (3)
  • Notables: Jeremy Hellickson, Odrisamer Despaigne, Jarred Cosart, Wander Suero

Cluster 19: A Few Tricks up the Sleeve

This cluster is dominated by right-handed pitchers with ‘plus’ four-seam fastballs who utilize non-traditional pitches for their offspeed and movement. The absence of a traditional curveball and slider (frequency less than 10%) is notable; they are replaced by a ‘plus’ knuckle-curve and a splitter on the vertical plane, and a cutter on the horizontal plane. When these pitchers are at their best, they’re living on the corners of the strike zone.

  • Most Frequent: Mark Melancon (5), Brandon Workman (3)
  • Notables: Alex Cobb, Dellin Betances, Zach Putnam, Jorge De La Rosa, Brandon Morrow, Touki Toussaint

Cluster 20: Look Out Below…

In an era where launch angle seems to be everything for opposing hitters, the sinker has become a primary weapon for pitchers across a wide variety of talent levels. To oversimplify: it’s difficult for a batter to get lift and launch on a ball that drops downward at a fast rate (aka a sinker). This cluster definitely fits the bill with its devastating, hard-sinking sinkers (#3 speed, #3 spin, #3 whiff rate, #5 break), complemented by a changeup that dies downward and a slider on a downward plane. There’s a bit more representation of lefties in this cluster, and a near-even split of starters, middle relievers, and late-inning relievers.

  • Most Frequent: Steven Matz (5), Jeanmar Gomez, Derek Holland, Jimmy Nelson, Brad Ziegler, CC Sabathia (4)
  • Notables: Dan Otero, Hector Santiago, Jose Quintana, Matt Albers, Edinson Volquez, Jeff Locke, Trevor Cahill

Cluster 21: The Power Trio

When you throw the four-seamer nearly half the time (and you’re #3 out of all clusters in velocity), back it with a slider 25% of the time, and your two-seamer is your third most frequent pitch — you’re throwing with velocity at greater frequency, and not as much finesse as a whole. Despite being four-seam dominant, movement is actually a big factor here because of the two-seam portfolio (#2 spin, #2 whiff rate, #4 speed). No true off-speed pitches here other than the slider, and in many cases even that has some velocity to it as well.

  • Most Frequent: Daniel Hudson, Hansel Robles (5), Carlos Rodon, Brandon Maurer, Chris Sale, Ryan Pressly, Hunter Strickland, John Axford, Roberto Osuna, Juan Nicasio (4)
  • Notables: Joaquin Benoit, Ervin Santana, Adam Ottavino, Patrick Corbin, Joakim Soria, Heath Hembree, Matt Belisle, Tommy Kahnle, Carl Edwards Jr, Trevor Rosenthal

Cluster 22: Balance is Everything

This cluster includes some pretty strong four-seam dominant starters (almost exclusively from the right side) that essentially use the two-seam fastball and slider as options 1A and 1B in their repertoire. From there, they mix in a ‘plus’ slider (#5 in break) and a curveball (#4 in break), giving them a well-rounded portfolio with plenty of options. It’s one that lends itself favorably to starting pitchers, so it’s no surprise that starters make up 96% of this cluster.

  • Most Frequent: Charlie Morton (4), Taijuan Walker, Ubaldo Jimenez (3)
  • Notables: Homer Bailey, Jake Odorizzi, Colin Rea, Ricky Nolasco, Zack Wheeler, Alfredo Simon, Jeff Samardzija, Nathan Eovaldi, Yu Darvish, Jorge De La Rosa

Cluster 23: Fast and Furious

This cluster shows its strength in hard throwing: lots of four-seam fastballs and cutters, as well as high-velocity sinkers and sliders. While there’s not a ton of speed variance, there is a lot of movement, even with these hard throws. These pitchers switch things up with five different pitches thrown more than 10% of the time, and most at the highest velocity relative to their pitch type.

  • Most Frequent: Kenley Jansen (5), Josh Fields, Alex Wood, Jeurys Familia, Will Harris, Bryan Shaw (4)
  • Notables: James Paxton, Yonny Chirinos, Dellin Betances, Brad Peacock, Craig Kimbrel

Play Ball

We’re glad we had the chance to bask in the past five years of pitching data, explore the patterns of some of our favorite pitchers and share it with you. As this young (and certainly unique) season continues, we’ll be playing with the visualizer and dreaming about the match-ups to come. Enjoy the rest of the Next OnAir offerings this summer!
