Sports Analytics and Data Science

Top 10 Technologies for Sports Analytics

Coolkingsingh
The Sports Scientist
9 min read · Jun 3, 2020

This article is aimed at undergraduate and graduate students who want to build a career at the intersection of sports and analytics. I will give a brief introduction, followed by the technologies and then the skills.

Introduction

To perform analytics effectively in sports, we must first understand sports: the industry, the business, and what happens on the fields and courts of play. We need to know how to work with data: identifying data sources, gathering data, and organizing and preparing them for analysis. We also need to know how to build models from data. Data do not speak for themselves. Useful predictions do not arise out of thin air. It is our job to learn from data and build models that work. With that in mind, I am going to introduce some of the technologies and skills that are useful in this work.

Technologies

For a practicing sports analyst and data scientist, there are considerable advantages to being technically inclined. It pays to be multilingual, with some understanding of both R and Python.

Python

What is it?

Guido van Rossum, a fan of Monty Python, released version 1.0 of Python in 1994. This general-purpose language has grown in popularity in the ensuing years. Many systems programmers have moved from Perl to Python, and Python has a strong following among mathematicians and scientists. Many universities use Python as a way to introduce basic concepts of object-oriented programming. An active open-source community has contributed more than fifty-seven thousand Python packages.

How to use it?

Doing data science with Python means gathering programs and documentation from GitHub and staying in touch with organizations like PyCon, SciPy, and PyData. At the time of this writing, the Python programming environment consists of more than fifty-seven thousand packages. There are large communities of open-source developers working on scientific programming packages like NumPy, SciPy, and scikit-learn. There is the Python Software Foundation, which supports code development and education. Useful general references for learning Python include Chun (2007), Beazley (2009), Beazley and Jones (2013), Lubanovic (2015), Slatkin (2015), and Sweigart (2015).

Why do we use it?

Python is essentially used by a sports analyst to perform the following:

i. Building data pipelines to collect or transform data from databases and other sources. Examples include web scraping and ETL (extract, transform, load).

ii. Data manipulation and pre-processing to clean the data

iii. Applying the data science methods described later in this article to gain key tactical insights
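
As a minimal sketch of the first two uses, the snippet below parses a small, made-up CSV export of player statistics, drops incomplete rows, and converts text fields to numbers. The column names and values are hypothetical, and only the standard library is used; in practice many analysts would reach for a library such as pandas instead.

```python
import csv
import io

# Hypothetical raw export: inconsistent spacing, a missing value, mixed case.
RAW_CSV = """player,team,minutes,points
 A. Smith ,LAL,34,27
B. Jones,bos,,18
C. Lee,BOS,29,22
"""

def clean_rows(raw_text):
    """Parse CSV text, normalize fields, and skip rows with missing numbers."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        minutes = row["minutes"].strip()
        points = row["points"].strip()
        if not minutes or not points:
            continue  # drop incomplete records rather than guessing values
        cleaned.append({
            "player": row["player"].strip(),
            "team": row["team"].strip().upper(),
            "minutes": int(minutes),
            "points": int(points),
        })
    return cleaned

rows = clean_rows(RAW_CSV)
for r in rows:
    # Points per minute as a simple derived metric.
    print(r["player"], r["team"], round(r["points"] / r["minutes"], 2))
```

Dropping incomplete rows is only one possible policy; depending on the analysis, imputing or flagging missing values may be more appropriate.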

R

What is it?

Designed by Ross Ihaka and Robert Gentleman, R first appeared in 1993. R provides specialized tools for modeling and data visualization. It represents an extensible, object-oriented, open-source scripting language for programming with data. It is well established in the statistical community and has syntax, data structures, and methods similar to its precursors, S and S-Plus. Contributors to the language have provided more than five thousand packages, most focused on traditional statistics, machine learning, and data visualization. R is among the most widely used languages in data science, but it is not designed as a general-purpose programming language.

How to use it?

Doing data science with R means looking for task views posted with the Comprehensive R Archive Network (CRAN). We go to R-Forge and GitHub. We read package vignettes and papers in The R Journal and the Journal of Statistical Software. The R programming environment consists of more than five thousand packages, many of them focused on modeling methods. Useful general references for learning R include Matloff (2011), Lander (2014), and Wickham (2015). Venables and Ripley (2002), although written with S/S-Plus in mind, remains a critical reference in the statistical programming community.

Why do we use it?

R is essentially used by a sports analyst to perform the following:

i. Data manipulation and pre-processing to clean the data

ii. Applying the data science methods described later in this article to gain key tactical insights

Figure: An example of how sports data is used and transformed using the programming language R. Source: Miller, T. W. (2016). Sports Analytics and Data Science.

Tableau, Power BI

What are they?

Data visualization tools.

How to use them?

These tools support winning strategies with nimble, easy-to-build graphics. You can create charts, heat maps, line graphs, and more. On the field, teams use these tools to identify the most valuable players, develop their abilities, and build balanced teams. Behind the scenes, tools such as Tableau help them streamline operations, engage fans, and stay relevant.

Why do we use them?

To communicate with management, players, and coaches. Sports analysts create visualizations to explain their insights quickly and effectively, using tools such as Tableau and Power BI.

Cloud and IoT

Databases and analytics information systems are distributed across many computers in clusters or clouds. We carry the smallest of computers in our pockets. Watches, wearables, and data collection devices abound. Microprocessor chips and sensors contribute to the glut of data in what is sometimes called the Internet of Things.

A sports analyst needs to understand how to leverage the power of the cloud for analysis. This starts with understanding the service models on offer and how to use them: infrastructure as a service (IaaS), platform as a service (PaaS), software as a service (SaaS), and database as a service (DBaaS).

Biometrics

Another frontier data source for player and team performance is locational and biometric devices. These include GPS devices, radio frequency devices, accelerometers, and other types of biometric sensors. One vendor of these tools, for example, is Catapult Sports, which developed GPS and accelerometer-based devices in Australia. Zebra Technologies offers a radio-frequency identification (RFID) tag for location data that is being explored by a few professional teams. Adidas offers the miCoach system (including GPS and biometric sensors), which was adopted by all US Major League Soccer teams in 2013. Several English Premier League teams use similar devices in practices, but they are not yet allowed in game use. Some NFL teams (e.g., the Buffalo Bills) and NBA teams (e.g., the San Antonio Spurs) use GPS devices in practices, the only setting in which their respective leagues have approved them. The locational devices are most frequently used to assess the total activity (miles or kilometers run, steps taken, average speed) undertaken by players in a game or practice.
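
To make the "total activity" idea concrete, here is a rough sketch of how distance covered and average speed might be derived from position samples. The sample format (seconds, x, y in meters) and the values are assumptions for illustration; real devices typically report latitude and longitude at a fixed sampling rate, and vendors apply their own filtering before computing such metrics.

```python
import math

# Hypothetical tracking samples: (seconds elapsed, x in meters, y in meters).
samples = [
    (0.0, 0.0, 0.0),
    (1.0, 3.0, 4.0),
    (2.0, 6.0, 8.0),
    (4.0, 6.0, 18.0),
]

def total_distance(points):
    """Sum the straight-line distance between consecutive position samples."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(points, points[1:])
    )

def average_speed(points):
    """Total distance divided by elapsed time, in meters per second."""
    elapsed = points[-1][0] - points[0][0]
    return total_distance(points) / elapsed if elapsed > 0 else 0.0

distance_m = total_distance(samples)
speed_mps = average_speed(samples)
print(f"distance: {distance_m:.1f} m, average speed: {speed_mps:.1f} m/s")
```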

Databases (SQL and NoSQL)

Databases address unstructured and semi-structured text as well as numerical data. Sports analysts employ NoSQL document stores as well as spreadsheets and relational databases. Increasingly, data science provides methods for data exploration and discovery that help businesses benefit from large information stores. In a data-intensive, data-driven world, searching and selecting data have become as important as sampling. Gathering and making use of information is what sports analysts and data scientists do every day. They understand information technology as well as statistical modeling. They work with data. They understand databases as well as box scores. They know about object-oriented programming and play-by-play logs. Examples include MongoDB, PostgreSQL, and Excel.
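
As a small illustration of the relational side, the sketch below stores a few box-score rows in an in-memory SQLite database and queries per-player averages. The table name, columns, and values are made up for the example; SQLite is used here only because it ships with Python, and the same SQL would look similar against PostgreSQL.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE box_scores (game_id INTEGER, player TEXT, points INTEGER)"
)
conn.executemany(
    "INSERT INTO box_scores VALUES (?, ?, ?)",
    [
        (1, "Player A", 21),
        (1, "Player B", 14),
        (2, "Player A", 30),
        (2, "Player B", 9),
    ],
)

# Average points per player across games, highest first.
query = """
    SELECT player, AVG(points) AS avg_points
    FROM box_scores
    GROUP BY player
    ORDER BY avg_points DESC
"""
averages = conn.execute(query).fetchall()
conn.close()

for player, avg_points in averages:
    print(player, avg_points)
```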

Real-time analytical technologies

Apache Spark: To work with real-time data streams, Apache Spark provides a scalable environment for processing data and running jobs. The stream could be video or any other kind of data.

Apache Kafka: A distributed messaging and streaming platform. Data captured by cameras, for example, can be streamed into Kafka and then read by Apache Spark for processing.
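
The producer and consumer pattern behind this kind of pipeline can be sketched without either tool. The snippet below is not Kafka or Spark code: it swaps in a plain in-memory queue for the message broker and a simple loop for the stream processor, purely to show the shape of the data flow. The event fields and values are hypothetical.

```python
import queue
import threading

# Stand-in for a message broker topic: a thread-safe in-memory queue.
events = queue.Queue()
STOP = None  # sentinel marking the end of the stream

def producer():
    """Publish hypothetical tracking events, as a camera feed might."""
    for player, speed in [("A", 6.1), ("B", 4.8), ("A", 7.3), ("B", 5.2)]:
        events.put({"player": player, "speed_mps": speed})
    events.put(STOP)

top_speed = {}

def consumer():
    """Read events as they arrive and keep a running top speed per player."""
    while True:
        event = events.get()
        if event is STOP:
            break
        player = event["player"]
        top_speed[player] = max(top_speed.get(player, 0.0), event["speed_mps"])

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(top_speed)
```

The key design point is the decoupling: the producer does not need to know who reads the events, and the consumer updates its results incrementally instead of waiting for a complete data set.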

Skills

I will now introduce the skills required to work with these technologies.

Data Visualization

In communicating with management, sports analysts need to go beyond formulas, numbers, definitions of terms, and the magic of algorithms. Sports analysts need the ability to translate complex models into simple, straightforward language that others can understand.

One experienced analyst for an NFL team noted:

“I use data visualizations — simple stuff — to try to improve group decision-making for the college draft. There are dozens of key “measurables” for players and it’s difficult for the evaluators to digest all of the trade-offs between them (e.g., ‘this guy is very quick, but lacks size/strength.’) I provide them a color coded 1-pager that provides a “visual conjoint analysis” of sorts. I try to get these visualization tools in at the front of the decision process, and hopefully get some data-based discussions going.”

Example of a one-page visual conjoint analysis. Source: https://infogram.com/nba-vs-wnba-1g0gmj9zz8o121q

Data Science Methods

I will now introduce what might be the most important skill that is required to be a sports analyst and data scientist.

Inferential Statistics: Consider, for example, player salaries in MLB, the NBA, and the NFL as of August 2015. Player salary distributions are positively skewed, so the mean sits well above the median. The mean salary across NFL players is around $1.7 million, but the median is $630 thousand. The mean salary across NBA players is around $5.1 million, with a median salary of $2.8 million. The mean salary across MLB players is around $4.1 million, with a median of $1.1 million.
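
The gap between mean and median is easy to reproduce. The salaries below are invented (they are not the league figures quoted above) and are chosen only to show how a few very large contracts pull the mean away from the median.

```python
import statistics

# Hypothetical salaries in millions of dollars: many modest contracts, a few stars.
salaries = [0.5, 0.6, 0.6, 0.7, 0.9, 1.2, 2.0, 4.5, 12.0, 25.0]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"mean:   {mean_salary:.2f}M")
print(f"median: {median_salary:.2f}M")
# A mean well above the median suggests a long right tail (positive skew).
print("positively skewed:", mean_salary > median_salary)
```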

Mathematical Programming: There are myriad applications of mathematical programming in sports. We need to pick players for teams and determine when and where to use players. Choices are subject to constraints such as salary caps, the number of players on a roster, and the number of players in the lineup. Many problems in sports management analytics involve allocating scarce resources, maximizing revenue, or minimizing costs subject to constraints. Mathematical programming models are deterministic, with known, fixed parameters in the objective function and constraints. But for practical purposes, it is unreasonable to assume that parameters are known and fixed.
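
A toy version of the roster problem shows the structure: maximize projected value subject to a salary cap and a roster size. The player pool, salaries, values, and cap below are all made up, and the sketch simply enumerates every possible roster, which only works for tiny pools. Real problems of this kind are usually handed to an integer programming solver.

```python
from itertools import combinations

# Hypothetical player pool: (name, salary in $M, projected value).
players = [
    ("P1", 10, 30),
    ("P2", 8, 22),
    ("P3", 6, 18),
    ("P4", 5, 12),
    ("P5", 3, 9),
]
SALARY_CAP = 20   # total salary allowed
ROSTER_SIZE = 3   # players to select

def best_roster(pool, cap, size):
    """Try every roster of the given size and keep the best one under the cap."""
    best, best_value = None, -1
    for roster in combinations(pool, size):
        cost = sum(p[1] for p in roster)
        value = sum(p[2] for p in roster)
        if cost <= cap and value > best_value:
            best, best_value = roster, value
    return best, best_value

roster, value = best_roster(players, SALARY_CAP, ROSTER_SIZE)
print([p[0] for p in roster], value)
```

Note that the salaries and projected values are treated as fixed inputs here, which is exactly the deterministic assumption the paragraph above calls into question.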

Classical and Bayesian Statistics: While the classical approach treats parameters as fixed, unknown quantities to be estimated, the Bayesian approach treats parameters as random variables. In other words, we can think of parameters as having probability distributions representing our uncertainty about the world. The Bayesian approach takes its name from Bayes’ theorem, a famous theorem in statistics. In addition to making assumptions about population distributions, random samples, and sampling distributions, we can make assumptions about population parameters. In taking a Bayesian approach, our job is first to express our degree of uncertainty about the world in the form of a probability distribution and then to reduce that uncertainty by collecting relevant sample data.
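
One common textbook illustration of this idea is the Beta-Binomial model, sketched below for a player's free-throw probability. The prior parameters and the shot counts are hypothetical; the point is only to show a prior distribution being updated by sample data and the resulting drop in uncertainty.

```python
# Prior belief about a player's free-throw probability: Beta(alpha, beta).
# Beta(7, 3) is a hypothetical prior centered on 70 percent.
prior_alpha, prior_beta = 7.0, 3.0

# Hypothetical sample data: 18 makes in 20 attempts.
makes, attempts = 18, 20

# Conjugate update: add successes to alpha and failures to beta.
post_alpha = prior_alpha + makes
post_beta = prior_beta + (attempts - makes)

prior_mean = prior_alpha / (prior_alpha + prior_beta)
post_mean = post_alpha / (post_alpha + post_beta)

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return (a * b) / ((a + b) ** 2 * (a + b + 1))

prior_var = beta_variance(prior_alpha, prior_beta)
post_var = beta_variance(post_alpha, post_beta)

print(f"prior mean:     {prior_mean:.3f}")
print(f"posterior mean: {post_mean:.3f}")
print("uncertainty reduced:", post_var < prior_var)
```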

Regression and Classification: Much of the work of data science involves a search for meaningful relationships between variables. We look for relationships between pairs of continuous variables using scatter plots and correlation coefficients. We look for relationships between categorical variables using contingency tables and the methods of categorical data analysis. We use multivariate methods and multi-way contingency tables to examine relationships among many variables. And we build predictive models.
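
For a pair of continuous variables, the correlation coefficient and a simple least-squares line can be computed in a few lines. The minutes and points below are invented for illustration, and the functions are written out by hand to show the arithmetic; a statistics library would normally do this for you.

```python
import math

# Hypothetical data: minutes played and points scored for six players.
minutes = [10, 15, 20, 25, 30, 35]
points = [4, 7, 9, 13, 14, 19]

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

r = pearson_r(minutes, points)
intercept, slope = fit_line(minutes, points)
predicted = intercept + slope * 28  # predicted points for 28 minutes played
print(f"r = {r:.3f}, points ~ {intercept:.2f} + {slope:.3f} * minutes")
print(f"predicted points at 28 minutes: {predicted:.1f}")
```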

Text and Sentiment Analysis: Sentiment analysis (also called opinion mining or emotion AI) refers to a set of applications of natural language processing, computational linguistics, and text mining that aim to extract subjective information from user-generated content such as blog comments and social media posts. Sports offer plenty of such material. In football, for example, we live in the era of the “experts”: deep analysis and special guests, former coaches and former players, who weigh countless possibilities. This generation of pundits analyzes corner kicks, sketches on blackboards, and, between questions and answers, assembles lineups and predicts results. It is the intellectual age of football, and it has had a notable influence on managers, coaches, and footballers. Nobody escapes excessive criticism.
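
The simplest form of sentiment analysis is a word-list (lexicon) approach, sketched below. The word lists and fan comments are made up, and this is a deliberately naive method: it ignores negation, sarcasm, and context, all of which matter in real fan commentary.

```python
import re

# Tiny hypothetical sentiment lexicon; real lexicons contain thousands of terms.
POSITIVE = {"great", "brilliant", "win", "solid", "love"}
NEGATIVE = {"poor", "awful", "lose", "sloppy", "hate"}

def sentiment_score(text):
    """Count positive minus negative words; above zero leans positive."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

comments = [
    "Brilliant pass and a great finish, love this team",
    "Sloppy defending again, awful set pieces",
    "The match kicks off at eight",
]
scores = [sentiment_score(c) for c in comments]
for comment, score in zip(comments, scores):
    print(score, comment)
```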

Time Series Data: An analyst can work with time series data, using past sales to predict future sales, noting overall trends and cyclical patterns in the data. Exponential smoothing, moving averages, and various regression and econometric methods may be used with time series data.
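
Two of the methods named above, moving averages and exponential smoothing, are short enough to write out directly. The weekly sales figures and the smoothing parameter are hypothetical, and the sketch uses simple exponential smoothing, which does not model trend or seasonality.

```python
# Hypothetical weekly ticket sales (in thousands).
sales = [20, 22, 19, 25, 27, 24, 30]

def moving_average(series, window):
    """Average of each consecutive run of `window` observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = [series[0]]  # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

ma3 = moving_average(sales, window=3)
ses = exponential_smoothing(sales, alpha=0.5)
# With simple exponential smoothing, the last smoothed value serves as the
# one-step-ahead forecast.
forecast = ses[-1]
print([round(v, 2) for v in ma3])
print(round(forecast, 2))
```

A larger alpha makes the forecast react faster to recent weeks; a smaller alpha gives a smoother series that is slower to follow changes.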

Social Network Analysis: Teams, apart from competing on the fields and courts, interact as businesses and cooperate in leagues. There are player trades between teams and many communications among teams. Professional sports present a potentially rich domain for social network research, thinking of teams as economic agents. Another way of using social network analysis in sports would be to consider patterns of interaction among players and coaches of a team.
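
For interactions among players, a pass network is a natural starting point. The sketch below counts passes made and received from a made-up pass log and picks out the most involved player, a basic degree measure. The positions and passes are hypothetical, and fuller analyses would use a graph library and richer centrality measures.

```python
from collections import Counter

# Hypothetical pass log: (passer, receiver) pairs from one match.
passes = [
    ("GK", "CB"), ("CB", "CM"), ("CM", "ST"), ("CM", "RW"),
    ("RW", "CM"), ("CM", "ST"), ("CB", "CM"), ("ST", "CM"),
]

out_degree = Counter(passer for passer, _ in passes)      # passes made
in_degree = Counter(receiver for _, receiver in passes)   # passes received

players = sorted(set(out_degree) | set(in_degree))
involvement = {p: out_degree[p] + in_degree[p] for p in players}
hub = max(involvement, key=involvement.get)

for p in players:
    print(p, out_degree[p], in_degree[p])
print("most involved:", hub)
```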

Computer Science

This is the most basic skill required. It includes:

  1. Programming knowledge: basic programming principles such as object-oriented programming (OOP) and loops, and how languages are translated and executed by machines.
  2. Systems understanding: how the different systems in an application work together, how they communicate, what role each system plays, and why it matters.

This is some of the basic knowledge that a good sports analyst should have.

Thank you for reading this article. I hope you now have a better understanding of what it takes to be a good sports analyst and what the role involves.

Some common job roles include:

Sports Data Analyst

Sports Information Director

Sports Performance Analyst

Sports Management

Sports Statistician

References

[1] Barlow, J. M. (2015). Data Analytics in Sports. O’Reilly.

[2] Davenport, T. H. (2014). Analytics in Sports: New Science of Winning. SAS Institute.

[3] Miller, T. W. (2016). Sports Analytics and Data Science.
