Revealing the Hidden Insights of NFL Big Data Bowl 2023 through Univariate Data Visualization

Arunkumar N
Published in Variablz Academy
Dec 31, 2022

Let’s dive into some sports. It’s American football this time. This article will explore the power of univariate analysis in understanding and analyzing data. Univariate analysis is a statistical technique that analyzes the distribution of a single variable. By applying univariate analysis to the NFL Big Data Bowl 2023 dataset, we can gain valuable insights into the characteristics and patterns of the data.

This article will go through the techniques and approaches used in univariate analysis and how they can be applied to the NFL Big Data Bowl 2023 dataset to reveal hidden insights and trends. We hope this article will provide a solid understanding of the importance and applications of univariate analysis in data science.

Importing the required libraries

Here, I’ve used “pandas.set_option” so that pandas displays all the columns in the dataset instead of truncating the wide output.
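A minimal sketch of that setting, assuming the option in question is "display.max_columns" (the frame below is a toy stand-in, not the real dataset):

```python
import pandas as pd

# Show every column when a wide DataFrame is printed (None removes the limit).
pd.set_option("display.max_columns", None)

# Toy frame with 70 placeholder columns, standing in for the merged dataset.
wide = pd.DataFrame([range(70)], columns=[f"col_{i}" for i in range(70)])
print(wide.head())
```

With the default limit, pandas would replace the middle columns with "..." in the printed output.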

I have used the recently trending Kaggle NFL Big Data Bowl 2023 dataset. You can download it from here for reproducibility. I have already optimized and merged the dataset. You can refer to my Kaggle notebook if you would like to reproduce this article.

Reading the optimized file.

We have approximately 8 million rows and 70 columns.

From the info(), we can see that I have converted all float64 and int64 datatypes to float32 and int32, respectively.
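One way such a downcast could be done is sketched below. The toy frame and its column names (borrowed from the public dataset schema) are assumptions for illustration; the actual optimization code lives in the notebook mentioned above.

```python
import numpy as np
import pandas as pd

# Toy frame; column names are assumed, values are made up.
df = pd.DataFrame({
    "yardsToGo": np.array([10, 7, 3], dtype="int64"),
    "dis": np.array([0.12, 0.45, 0.03], dtype="float64"),
    "passResult": ["C", "I", "C"],
})

# Map each 64-bit numeric column to its 32-bit counterpart.
float_cols = df.select_dtypes(include="float64").columns
int_cols = df.select_dtypes(include="int64").columns
df = df.astype({
    **{col: "float32" for col in float_cols},
    **{col: "int32" for col in int_cols},
})

# info() now reports float32/int32 and a smaller memory footprint.
df.info(memory_usage="deep")
```

Halving the width of numeric columns roughly halves their memory use, which matters at around 8 million rows. The trade-off is reduced precision for floats and a smaller value range for integers.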

Customizing the Plot Fonts — Font dictionary

For customizing the size and color of axes labels, we use the fontdict option in Matplotlib. We can put this directly into code or store it in a variable to change it as needed.
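As a small sketch of this idea, the dictionary below holds hypothetical style values and is reused for every label:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical style values; change them in one place to restyle every label.
label_font = {"fontsize": 14, "fontweight": "bold", "color": "darkred"}

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel("x", fontdict=label_font)
ax.set_ylabel("y", fontdict=label_font)
ax.set_title("Reusable font settings", fontdict=label_font)
```

Storing the dictionary in a variable keeps all the plots consistent and avoids repeating the same keyword arguments.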

Box Plot

Through research and experimentation, I found that these columns contain a reasonable number of outliers. I chose numerical columns to see how their values are distributed, and I used the subplots option to show all the plots in a single view and reduce scrolling.

Since the dataset is large and the plots are drawn in a loop, it is safer to call gc.collect() in each iteration. It asks Python's garbage collector to free objects that are no longer referenced, which helps keep memory usage in check.

The option "plt.tight_layout()" automatically adjusts the spacing between the subplots, so there is no alignment problem.
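The loop described above could look roughly like this. It is a sketch on synthetic data: the column names are assumed from the public dataset schema, and plain Matplotlib box plots are used here, which may differ from the plotting call in the original notebook.

```python
import gc

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for four numerical columns (names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "preSnapHomeScore": rng.integers(0, 45, 500),
    "preSnapVisitorScore": rng.integers(0, 45, 500),
    "yardsToGo": rng.integers(1, 25, 500),
    "dis": rng.exponential(0.3, 500),
})

cols = list(df.columns)
fig, axes = plt.subplots(2, 2, figsize=(10, 6))

# One box plot per column, each in its own subplot.
for ax, col in zip(axes.ravel(), cols):
    ax.boxplot(df[col].to_numpy())
    ax.set_title(col)
    gc.collect()  # release unreferenced temporaries between iterations

# Adjust the spacing so titles and tick labels do not overlap.
fig.tight_layout()
```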

From the above box plots, we can see:

Both the home team and the visiting team score mostly between 4 and 18 points. The maximum score of both the home and visiting teams is around 38. We rarely see teams' scores above 40, which are outliers.

The number of yards to go for a first down is mostly in the range of 5–10. If this value is low, the offense is dominating and pushing forward. If this value is high, the defense has kept the offense from advancing.

We can see many outliers for the distance traveled from the prior time point. Values above about 0.8 yards are all flagged as outliers, and most players traveled between 0 and 0.5 yards.

KDE Plot

A kernel density plot is a graphical representation of the distribution of a continuous variable. It is a smoothed version of a histogram, which estimates the underlying probability density function of the data.

It gives us insights, which I will explain in the histogram part. Now let’s enjoy the plot. Some plots look like the mask of Batman 😅😅😅. When compared to other plots, this one took more time to process.

Histogram

A histogram is a graphical representation of a dataset that shows the frequency or proportion of the data falling within different intervals or bins. It is a valuable tool for visualizing the distribution of a dataset and identifying patterns or trends in the data.

In the histogram, the "bins" option sets how many intervals (bars) the values are grouped into. A sensible number of bins makes our plots look crisp and clean.
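A short sketch of the "bins" option, using Matplotlib's hist on synthetic weights (the values are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic player weights in pounds (made-up distribution).
rng = np.random.default_rng(2)
weights = rng.normal(245, 45, 2000)

fig, ax = plt.subplots()
# bins=20 splits the value range into 20 equal-width intervals.
counts, edges, _ = ax.hist(weights, bins=20, edgecolor="white")
ax.set_xlabel("weight (lb)")
ax.set_ylabel("count")
```

Twenty bins produce 21 bin edges, and the bar heights add up to the number of rows. Too few bins hide the shape of the distribution, while too many make it noisy.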

From the Histogram, we can see that,

Players are mostly in the middle of the field, as the counts of ‘x’ and ‘y’ peak at the middle values.

Most players fall in the 180–220 pound weight range, with another group around 320 pounds.

The orientation of the players and angle of player motion are mostly 90 degrees or 270 degrees, as the players of both teams are facing each other 180 degrees apart.

Count Plot

A count plot is a graph that displays each unique category's frequency of occurrences in a categorical dataset. It is often used to visualize the distribution of categorical data.

I want to share something I ran into while working on this plot. Everything about the plot was complete, except that the bars were not in order. I was thinking about how to sort the bars from maximum to minimum. Then I saw the "order" option, which takes the categories as a list in the desired order. We can’t write that list manually, since the categories differ from column to column. It occurred to me: since "order" accepts a list, why not build it with a list comprehension? I gave it a try, and it worked. I enjoyed that moment, since I applied what I had learned.
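The ordering trick can be sketched as follows. The helper name `descending_order` and the toy data are hypothetical; the list comprehension over `value_counts()` is one plausible reading of the approach described above, since `value_counts()` already returns categories sorted from most to least frequent.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy categorical data (made-up values; column names are assumptions).
df = pd.DataFrame({
    "offenseFormation": ["SHOTGUN"] * 6 + ["SINGLEBACK"] * 3 + ["EMPTY"] * 2 + ["I_FORM"],
    "passResult": ["C"] * 7 + ["I"] * 3 + ["S"] * 2,
})


def descending_order(series):
    """Return the categories of a column, most frequent first."""
    return [category for category in series.value_counts().index]


fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, df.columns):
    # The list is rebuilt per column, so no category names are hard-coded.
    sns.countplot(data=df, x=col, order=descending_order(df[col]), ax=ax)
fig.tight_layout()
```

`df[col].value_counts().index.tolist()` would give the same list without the comprehension.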

From the plots, we can see,

Most of the time, there are six defenders in the box where they can directly tackle the offense.

First down is the most frequent down in the data.

Most games were played at 13:00 Eastern time.

The shotgun formation is the most preferred offensive formation by the teams.

Most of the time, the pass result is "C-Completed."

The majority of the defenders cover man to man.

Count plots for many categories

Since ‘foul name’ has more than 25 categories, I have split it into the Top 5 and bottom 5 for clear visibility; otherwise, the labels would crowd into one another.
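A rough sketch of such a split is shown below. The counts are made up purely for illustration, the column name `foulName1` is an assumption, and horizontal bars are used here so that long foul names stay readable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts, for illustration only (not taken from the real data).
made_up_counts = {
    "Offensive Holding": 30, "Defensive Pass Interference": 25,
    "Defensive Holding": 22, "Roughing the Passer": 18,
    "Illegal Use of Hands": 15, "Illegal Substitution": 6,
    "Disqualification": 5, "Taunting": 3,
    "Illegal Contact": 2, "Low Block": 1,
}
foul_name = pd.Series(
    [name for name, n in made_up_counts.items() for _ in range(n)],
    name="foulName1",
)

# value_counts() sorts descending, so head/tail give the two ends.
counts = foul_name.value_counts()
top5, bottom5 = counts.head(5), counts.tail(5)

fig, (ax_top, ax_bottom) = plt.subplots(1, 2, figsize=(12, 4))
ax_top.barh(list(top5.index), top5.to_numpy())
ax_top.set_title("Top 5 fouls")
ax_bottom.barh(list(bottom5.index), bottom5.to_numpy())
ax_bottom.set_title("Bottom 5 fouls")
fig.tight_layout()
```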

From the Top 5 plot, we can see that,

Offensive holding, defensive pass interference, and defensive holding are the most common fouls committed by players.

From the bottom 5 plot, we can see that,

Within the bottom 5, the two most common fouls committed by players are illegal substitution and disqualification.

Word Cloud

A word cloud visualizes the most common words in a text or a group of texts. It is created by displaying the words in such a way that the size of each word is proportional to its frequency in the text.
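To make the sizing rule concrete, the sketch below computes the word frequencies and a font size proportional to each. The team list and the `max_size` value are hypothetical, and this shows only the counting step; a word cloud library would add its own layout and scaling on top.

```python
from collections import Counter

# Made-up list of team abbreviations, standing in for a text column.
teams = ["KC", "MIA", "KC", "TEN", "TB", "KC", "MIA", "BUF"]

# Count how often each word occurs.
freq = Counter(teams)

# Scale font sizes in proportion to frequency (hypothetical maximum of 60 pt).
max_size = 60
max_count = max(freq.values())
font_sizes = {word: max_size * count / max_count for word, count in freq.items()}

for word, size in sorted(font_sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: {size:.0f} pt")
```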

From this word cloud, we can see,

Teams like KC, MIA, TEN, and TB have played the most games.

Conclusion

Overall, this univariate analysis provides a valuable starting point for further exploration and analysis of the data and will be a helpful resource for beginners. Still, relationships, comparisons, and multivariate analyses are yet to be covered, and we will see that in the upcoming article.

My Kaggle notebooks on the NFL Big Data Bowl 2023 Optimization and Univariate analysis

For the code that optimizes and merges the data, you can use my GitHub repo.

For more data science insights, connect with me on LinkedIn.

https://www.linkedin.com/in/arunkumar-data-scientist/

Thanks & Regards

Arun Kumar
