Roger Federer serving. Credit: Woojin Kim

Pondering Point Patterns

Statistics and Storytelling

Earlier this month Nikita Taparia and Jeff Sackmann launched the Tennis Data Storytelling Challenge, a data visualization competition which draws upon the growing dataset being made available by the Match Charting Project. In my opinion this challenge sits at the nexus of a number of pursuits that are trending sharply upwards:

Visualization challenges themselves are even becoming a bit of a thing. Information is Beautiful hosts regular challenges, and there are recent data visualization challenges directly related to sports: The Major League Data Challenge 2015.

Data Visualization boasts a growing number of international conferences and startups, and open source toolkits 100% focused on data visualization are proliferating rapidly.

Sports Analytics are more visible than ever, with major brands (IBM, SAP, InfoSys, Intel, ESPN) marketing their number crunching prowess, prestigious conferences, journals, apps for exploring sports stats (fantasy leagues and betting), and ever more sports/activity sensors with captive audiences poring over charts and graphs of their personal performance peaks. Professional sports data is even being made available to the public, at least by the NBA.

As I understand it, the Match Charting Project was inspired, in large part, by a dearth of publicly available data related to tennis. Jeff gave an eloquent talk early in 2015 about the sad state of tennis analytics. There is plenty of data being captured, by systems such as Hawk-Eye and IBM SlamTracker, but that data is not generally accessible, and its most notable use is to “enhance viewer experience,” which rarely translates into profound or even useful understanding of the dynamics of play.

The stark reality is that there are very few people who have a good grasp of how the kind of data being generated by Hawk-Eye can be used, and what it should be used for… tennis coaches, apparently in contrast to the professional coaching staff in other sports, are unusually skeptical, even at the highest level, where extensive datasets can be purchased. I’ve heard stories of coaches for top players being surprised by what the data can reveal…

But even those who spend time dreaming of what can be achieved by analyzing all the data that can be produced from tennis matches are unsure what will really be useful, at least to coaches and players. (There’s no question that snazzy graphics are useful for entertainment and corporate branding!)

Without publicly available data and the passion of individuals who are driven to investigate questions on their own, freed from paid-for studies and the ‘tyranny of experts’, the pace of exploration of these questions would likely be glacial (at least in the nuanced sense that word used to convey).

But now, thanks to the Match Charting Project, we do have publicly available, crowdsourced, open data. And questions which can be pursued with this data set are not hard to formulate. Jeff has posted a long list of topics, many directly related to the Tennis Data Storytelling Challenge; there is no shortage of experts expounding on tennis tactics and pattern analysis, in numerous on-line and in-print publications, that can be usefully tested against real-world data.

The Match Charting Project (MCP) itself is even a subject of study. Stephanie Kovalchik, who now works for the Australian Open analytics team, has written a thoughtful piece for researchers, and even interviewed one of the top contributors of match data.

It’s my opinion that efforts such as the Tennis Data Storytelling Challenge can only help this crowdsourced effort to mature. I’ve done a bit of analysis of the MCP dataset in my effort to create some tools to aid the challenge, and it’s clear that, apart from return-of-service, the depth of rally shots is not something that is being charted, though it is certainly possible to do so using the existing MCP spreadsheet. This limits the questions that can be addressed at this point, but that’s not necessarily a bad thing… because there are so many questions that can be pursued, and because it is possible that the pursuit of what is possible now can better focus the future efforts of the project, both in terms of what data is important to capture and how that data might best be captured.

“Storytelling is Serious Business”

I included “Narrative Networks” in my list of relevant upward-trending pursuits not because I believe that submissions to the Tennis Data Storytelling Challenge will necessarily be describable as such but rather for the evocative qualities of the phrase. I could just as well have referenced “Narrative Visualization” (the work of UW Interactive Data Lab and the Nvis2015 conference are certainly relevant to the theme of the challenge); but I want to draw attention here to the fact that more attention is being drawn to the power of narratives in shaping our perception of reality.

“Narrative Networks” is a phrase used by DARPA; I noted a few years ago a recruitment effort focusing on academics with backgrounds in linguistics and sociology (I was trying to grok Actor-Network theory at the time). Storytelling with sports data certainly doesn’t carry the gravity of the psychological and social issues being grappled with by government researchers, but storytelling is serious business, and the same set of skills and suite of tools apply to both endeavors.

In a similar vein, I think it’s worth noting that there is great overlap between those who do sports data visualization and financial / economic statistics. John Burn-Murdoch is a ‘Data Journalist’ at the Financial Times who seems to post as often about sports as other issues. An example close to home: the “Points-to-Set” chart at originated with the work of Francis X. Diebold, a U Penn Economics professor, and Glenn Rudebusch, Executive Vice President and Director of Research at the Federal Reserve Bank of San Francisco.

I don’t mean to imply that the application of sports analytics to the marketing efforts of large corporations isn’t serious business. But I do want to suggest that an individual’s pursuit of visual storytelling using data from any sport can contribute to the development of a set of skills which are more broadly applicable, and that “Open Data” and “Open Source” tools for analyzing and visualizing that data, while they may not counter the assault upon the “attentional commons”, can play a significant role in enabling individuals to direct their own attention and contribute to public narratives.

All of this is a long-winded way of saying that I deem the “Open” attributes of the Tennis Data Storytelling Challenge to be important. Some of the most interesting and award-winning Data-driven Journalism is being driven by accessible data from “Open Government” initiatives, and there’s no reason that some of the most interesting and, dare I say, profound insights into the “patterns of play” in tennis shouldn’t come from an open, crowdsourced effort.

I expect that most of those who have signed up for the Challenge already have some experience with Data Visualization, and perhaps I’m preaching to the choir, but I want to put it out there that it’s possible to participate in the challenge even if you don’t have a background in Data Visualization or any of the tools traditionally associated with Visual Storytelling. Seven months is a long time (the contest runs through August 2016). It may not be a stellar example, but I started TennisVisuals less than a year ago with zero knowledge of JavaScript/D3/Mongo/Node (not that those tools are better suited than any others to visualize MCP data… in fact, they’re not on the short list for ‘true’ data scientists) and a 30-year gap since my classes with Tufte during my college days. Of course there are far shorter paths to sexy statistical representations of numbers derived from sports!

As a sponsor of the Tennis Data Storytelling Challenge I have a vested interest in seeing as many participants as possible. I can’t help in crafting a story or a visual, but I do want to offer help to make the data as accessible as possible (while the data is all freely available here, shots from each point still need to be parsed and aggregated based on the questions being pursued). At the conclusion of the competition I’ve offered to host the winning visualizations and tie them to the live and growing database of matches such that the analyses can continue beyond the cutoff date, indefinitely.
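To give a rough sense of what "parsed and aggregated" can mean in practice, here is a minimal Python sketch that counts the shots in each charted point and averages rally length per server. It rests on simplifying assumptions that should be checked against the Match Charting Project's own documentation before use: each point is taken to be a string whose first character codes the serve, and every later shot is taken to begin with one of the stroke letters in the hypothetical `STROKE_LETTERS` set. The function names, the sample strings, and the players "A" and "B" are invented for illustration and are not taken from the real dataset.

```python
import re
from collections import defaultdict

# ASSUMPTION: these letters each mark the start of one rally shot.
# Check the set against the MCP charting instructions before relying on it.
STROKE_LETTERS = "fbrsvzopuylmhijktq"
STROKE = re.compile(f"[{STROKE_LETTERS}]")


def rally_length(point: str) -> int:
    """Count the shots in one point string: the serve plus each later stroke."""
    if not point:
        return 0
    # ASSUMPTION: the first character codes the serve; everything after it
    # describes the rally. Only stroke letters are counted as shots.
    return 1 + len(STROKE.findall(point[1:]))


def mean_rally_length(points):
    """Average rally length per server from (server, point_string) pairs."""
    totals = defaultdict(lambda: [0, 0])  # server -> [total shots, points played]
    for server, point in points:
        totals[server][0] += rally_length(point)
        totals[server][1] += 1
    return {server: shots / count for server, (shots, count) in totals.items()}


# Hypothetical sample points, not real match data.
sample = [("A", "4f8b3f1*"), ("A", "6*"), ("B", "5b2n#")]
print(mean_rally_length(sample))
```

The same pattern of parsing each point into shots and then grouping by whatever the question calls for (server, opponent, set, score) should carry over to most questions the dataset can support, with the parsing step replaced by one that follows the full notation.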

Tennis players are actors in a social network of fellow competitors, and every exchange of shots in a rally can be seen as a conversation that is part of a narrative encompassing each new achievement in the sport. Do some players try to have the same conversation with every opponent they face? Do some have a more expansive vocabulary and select specific patterns of play every time they face specific opponents? Are some players able to change the conversation when it’s not going their way? Are there ‘truthy’ stories that sports journalists and experts spin to keep the banter lively and the audience engaged? Can statistical analysis reveal whether it makes any sense to be asking these questions?

If you haven’t already, please join the conversation!

And feel free to contact me on Twitter: @TennisVisuals, or email me: info-at-tennisvisuals-dot-com.