Transforming NHL Play-Level Data for Data Science

Transforming Complex Play-by-Play Action into Familiar Big Data Formats for the Python Community

Jason Blahovec
Jan 29, 2024

In the fast-paced world of hockey, every pass, shot, and save holds a story waiting to be told. The National Hockey League (NHL) generates an immense volume of play-level data that captures the game in its rawest form. However, much of this data's potential remains untapped, because its complex, unstructured format poses a significant barrier to analysis.

To give data scientists more streamlined access to this complex data, I would like to introduce NHL-Insights-Hub. This GitHub repository is designed to support in-depth analytics on NHL play-level data. The project transforms data from the NHL API (version released 2023–11) into structured formats tailored for Python developers and data scientists.

The Data Pipeline: Simplifying Complexity

The solution to unlocking the value of NHL play-level data lies in a two-part data pipeline. The first script, ingest_nhl_play_html.py, interacts with the NHL's public API, retrieving detailed play-by-play data from API endpoints like this one. This script extracts each play in the document so that the data is accurately captured for further processing.
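To make the ingestion step more concrete, here is a minimal sketch of fetching one game's play-by-play document and storing it untouched for later conversion. It is not the code from ingest_nhl_play_html.py. The URL pattern, the function names, and the output file naming are all assumptions made for illustration, so check them against the API documentation and the repository before relying on them.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# ASSUMPTION: URL pattern for the 2023-11 NHL API, used here for illustration only.
ASSUMED_URL_TEMPLATE = "https://api-web.nhle.com/v1/gamecenter/{game_id}/play-by-play"


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return the response body decoded as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def ingest_game(
    game_id: int,
    out_dir: Path,
    fetch: Callable[[str], str] = fetch_text,
    url_template: str = ASSUMED_URL_TEMPLATE,
) -> Path:
    """Fetch one game's play-by-play document and save it unmodified to disk."""
    url = url_template.format(game_id=game_id)
    raw = fetch(url)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file naming; the raw payload is kept as-is for the next stage.
    out_path = out_dir / f"play_by_play_{game_id}.raw"
    out_path.write_text(raw, encoding="utf-8")
    return out_path
```

Passing the fetch function as a parameter keeps the download separate from the storage logic, which makes it possible to exercise the function offline with a stub instead of a live request.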

Following this, the html_to_sparse_parquet.py script takes over, converting the initially extracted HTML data into a more manageable sparse Parquet format (information on the output schema can be found here). This step is crucial for transforming the data into a form that's more suitable for analysis, making it easier for data scientists to work with.

Making Data More Accessible

This data pipeline makes NHL play-level data more accessible to the data science community. Once converted to Parquet, the data is more straightforward to handle, analyze, and store, even at large volumes spanning many NHL seasons. This accessibility opens up numerous possibilities for in-depth analysis and research in sports analytics.

Technical Insights into the Pipeline

The development of this data pipeline required careful consideration of several technical challenges. The ingest_nhl_play_html.py script, for instance, not only retrieves data from the NHL's API but also ensures that the data is accurately and efficiently processed. This involves parsing complex HTML structures and extracting relevant play-by-play information without losing any critical details.

The subsequent transformation process carried out by the html_to_sparse_parquet.py script is equally intricate. Converting HTML data to sparse Parquet format involves structuring the data in a way that optimizes both storage and query performance. This format is particularly beneficial for handling the dataset's sparse nature, where many fields may have missing or null values, typical in play-level sports data.
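The idea of a sparse layout can be illustrated with a short sketch. It is not the code from html_to_sparse_parquet.py, and the field names in the sample plays are hypothetical. Each play is flattened into one row, every row gets the union of all columns, and fields that a given play type does not carry are filled with None. The Parquet write assumes the third-party pyarrow package is installed, so it is imported only inside the function that needs it.

```python
from typing import Any, Dict, Iterable, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_sparse_rows(plays: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Give every row the union of all columns, filling absent fields with None."""
    flat_rows = [flatten(play) for play in plays]
    columns = sorted({col for row in flat_rows for col in row})
    return [{col: row.get(col) for col in columns} for row in flat_rows]


def write_parquet(rows: List[Dict[str, Any]], path: str) -> None:
    """Write the rows to a Parquet file (requires the pyarrow package)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)


# Hypothetical plays: a faceoff and a shot carry different detail fields.
plays = [
    {"eventId": 1, "type": "faceoff", "details": {"winningPlayerId": 10}},
    {"eventId": 2, "type": "shot", "details": {"shootingPlayerId": 20, "shotType": "wrist"}},
]
rows = to_sparse_rows(plays)
```

Here the faceoff row ends up with None in the shot-specific columns and the shot row with None in the faceoff-specific column. Columnar formats such as Parquet store runs of nulls compactly, which is why this wide, mostly empty layout stays practical.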

Throughout the pipeline's development, optimizing for efficiency and scalability was a priority. Given the vast amount of data generated by the NHL, the scripts are designed to process and convert data in batches, which reduces memory usage and improves processing speed.
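Batch processing of this kind can be sketched with a small generator. This is an illustration of the general technique rather than the repository's implementation, and the game IDs and batch size are made up. Only one batch is held in memory at a time, so peak memory depends on the batch size and not on the number of games.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items without materializing the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical game IDs; each batch could be converted and written as its own file.
game_ids = range(2023020001, 2023020008)
batch_sizes = [len(batch) for batch in batched(game_ids, batch_size=3)]
# Seven games in batches of three give sizes [3, 3, 1].
```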

Applications and Opportunities

The transformed NHL play-level data opens a wide array of opportunities for data scientists and sports analysts. With the data in a more accessible format, it becomes easier to apply a range of analytical techniques, from basic statistical analyses to more complex machine learning models. Researchers can explore trends, identify patterns, and even predict future game outcomes based on historical data.

Potential applications include performance analysis, where individual player statistics can be examined in unprecedented detail, and team strategy development, where insights drawn from play-by-play data inform tactical decisions.
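As a small example of player-level analysis, the sketch below counts shots per shooter from flattened play rows. The column names ("type", "details.shootingPlayerId") and the sample rows are hypothetical stand-ins, not the pipeline's actual output schema, so adjust them to match the real columns.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def shots_per_player(rows: Iterable[Dict[str, Any]]) -> Counter:
    """Count shot events per shooter, skipping rows where the shooter is null."""
    counts: Counter = Counter()
    for row in rows:
        shooter = row.get("details.shootingPlayerId")
        if row.get("type") == "shot" and shooter is not None:
            counts[shooter] += 1
    return counts


# Hypothetical sparse rows: non-shot plays have no shooter.
sample_rows = [
    {"type": "shot", "details.shootingPlayerId": 20},
    {"type": "faceoff", "details.shootingPlayerId": None},
    {"type": "shot", "details.shootingPlayerId": 20},
    {"type": "shot", "details.shootingPlayerId": 31},
]
shot_counts = shots_per_player(sample_rows)
```

The explicit null check matters in a sparse dataset, because most rows will have no value in any given play-specific column.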

Collaboration and Exploration

The availability of this data pipeline invites collaboration among data scientists, sports analysts, and enthusiasts. By leveraging this resource, the community can develop innovative analytics tools, contribute to sports science research, and deepen our understanding of the game.

The dataset and pipeline documentation are accessible to those interested in exploring this rich dataset further. Whether for academic research, professional sports analytics, or personal interest projects, the possibilities are vast.

Conclusion

The development of this data pipeline is a step towards making NHL play-level data more accessible and usable for the data science community. By transforming complex raw data into a structured and analysis-friendly format, it lays the groundwork for a wide range of research and analytics projects.

Special thanks to Drew Hynes for his work in documenting the NHL API.
