NBA Data: A Live-Example of Data Cleansing

TKP
Published in Analytics Vidhya
7 min read · Jan 8, 2020
What you can do with a little elbow grease and Pandas

Summary

This post walks through a data cleansing exercise in Python that covers a range of topics, using the Pandas library to work through NBA player game logs.

Agenda:
1. Looping through a list of files and merging them into a single DataFrame.
2. Building custom functions to cleanse mixed string/numeric data.
3. Addressing missing/incomplete source data and using Pandas SQL-like merge functionality to combine different datasets.
4. Focusing your quality assurance (QA) workflow using Pandas tools such as aggregation and selectors.
5. Building a dashboard using Pandas, Panel, and Altair (the header image).
6. Using Pandas for row-wise calculations with itertuples.

Skip to the bottom to see the gist with all the details. Since the details are in the code itself, I will speak to the above topics conceptually to keep this from getting too lengthy.

Background

First, the problem statement: how do I QA 30 years of NBA player game logs using data from an undocumented API?

Excel (and VBA when necessary) is usually my go-to for analyzing most datasets. When that doesn’t do the trick, the next step would usually be to add MS-Access to the workflow for smaller datasets or query a SQL database for larger ones. In this situation, though, Excel won’t do the trick, especially because of its arcane date format and its poor handling of time calculations (among other things). I know Excel will coerce certain formats in ways that later create issues unless I do a bunch of preprocessing in VBA, which isn’t time efficient (see https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel). Excel is great for modeling cashflows and what-if analysis, but it is not my go-to tool for data cleansing.

This gave me a good excuse to use Python for a project that interests me.

Installation

Going through this tutorial will require the following libraries/modules in your notebook:

1. Pandas and Numpy;
2. Datetime and time;
3. Panel + Altair + HTML (from IPython) for the visuals;
4. OS (optional if you want to hardcode filepaths);
5. Logging (optional);
6. Xlwings (optional); and
7. Webbrowser (optional).

Getting to the Point

Looping through a set of files

Python makes looping pretty straightforward and the OS module helps make working with files and directories more efficient.

1. os.getcwd() will get your root directory and store it in a string.
2. os.listdir() will let you get a list of files in a directory. Put your path in () and store these files in a list.
3. Now you can loop the list with a for loop where each element is a file. You can use an if statement if you only want to open certain files.

for element in directory_list:
    if element.endswith('.csv'):   # if element meets a condition
        df = pd.read_csv(element)  # do something with it
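Putting the three steps together, here is a minimal, runnable sketch of merging a folder of CSV files into a single DataFrame. The demo setup at the top (a temp folder with two tiny "log" files) is illustrative; in the real project the folder would contain the downloaded NBA logs.

```python
import os
import tempfile
import pandas as pd

# Demo setup only: write two small CSV "game log" files into a temp folder.
root = tempfile.mkdtemp()
pd.DataFrame({'player': ['Harden'], 'pts': [30]}).to_csv(
    os.path.join(root, 'logs_2018.csv'), index=False)
pd.DataFrame({'player': ['James'], 'pts': [28]}).to_csv(
    os.path.join(root, 'logs_2019.csv'), index=False)

files = os.listdir(root)          # 2. list of files in the directory
frames = []
for element in files:             # 3. loop the list
    if element.endswith('.csv'):  # only open CSV files
        frames.append(pd.read_csv(os.path.join(root, element)))

# Merge the individual files into one DataFrame.
combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # 2 rows, one per file
```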

Custom Functions for your DataFrame rows or columns using .apply

Sometimes you need to apply a custom function to the elements in DataFrame. The Pandas .apply() method makes this straightforward. In my case I needed to normalize the minutes played in each game for every player because sometimes it was a string and sometimes it was an integer, but to illustrate the concept it’s better to use a simple example.

def custom_function(x):
    x = str(x ** 2) + ' day(s)'  # do something to x
    return x

Let’s say my function squares x and labels it, as above. To use this on a series in a DataFrame, I can apply it as follows:

df['series_adj'] = df['series'].apply(custom_function)

That’s it! If ‘series’ contained the values 1, 2, 3, 4, then ‘series_adj’ would become ‘1 day(s)’, ‘4 day(s)’, ‘9 day(s)’, ’16 day(s)’. Note that if .apply() is used on a DataFrame you can apply the function to either rows or columns by setting the “axis” parameter.
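A self-contained version of the above, including a row-wise DataFrame.apply with the axis parameter (the column names here are illustrative):

```python
import pandas as pd

def custom_function(x):
    return str(x ** 2) + ' day(s)'  # square x, then tag it with a label

df = pd.DataFrame({'series': [1, 2, 3, 4]})

# Series.apply: element-wise over one column.
df['series_adj'] = df['series'].apply(custom_function)

# DataFrame.apply with axis=1: the function receives each row as a Series.
df['row_sum'] = df[['series']].apply(lambda row: row.sum(), axis=1)

print(df['series_adj'].tolist())
# ['1 day(s)', '4 day(s)', '9 day(s)', '16 day(s)']
```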

Merging DataFrames (to Deal with Missing Data)

When I looked into the NBA data, most of it looked good, but there were still data issues, which I later confirmed were in the source data itself. One of the key fields I needed was the day the game took place, which fortunately was available in a different table.

  1. I had to see if I could pull the data from somewhere else and link it to my source data using a common key. Always make sure you have a way to join datasets before going any further.
  2. The next step is relatively straightforward and pretty intuitive for anyone who has ever joined tables using SQL (http://www.sql-join.com/ is a tutorial on joins). Note that by default Pandas assumes an inner join and tries to guess the keys, but it’s more prudent to be explicit.
merged_dataframe = dataframe1.merge(dataframe2,
    left_on='DataFrame1_key',
    right_on='DataFrame2_key',
    how='left')  # or 'right', 'inner', 'outer'
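For instance, a left join of a (hypothetical) player log table onto a separate table that carries the game date might look like this; the table and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical player logs that are missing the game date.
logs = pd.DataFrame({
    'game_id': [1, 1, 2],
    'player': ['Harden', 'Paul', 'James'],
    'pts': [30, 12, 28],
})

# Separate table that does carry the date, keyed on the same game id.
games = pd.DataFrame({
    'gid': [1, 2],
    'game_date': ['2019-10-24', '2019-10-25'],
})

# Left join: keep every log row, pull in the date where the keys match.
merged = logs.merge(games, left_on='game_id', right_on='gid', how='left')
print(merged['game_date'].tolist())
# ['2019-10-24', '2019-10-24', '2019-10-25']
```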

Adding Aggregation to the Toolbox

In some cases it’s relatively easy to isolate data issues. For example, I know every player should have a name (i.e., it can’t be Null/empty). Ideally that is all you need to screen for; practically speaking, that’s almost never the case. It’s entirely possible that a player never takes a shot during a game or doesn’t hit any of his shots (not everyone is as good as LeBron).

Common sense, however, dictates that while a player could go scoreless in a game, if a game takes place someone will score. This is where aggregation and selectors can come in handy.

By aggregation, I mean taking the different players and putting them together using a set of common characteristics. For example, I could group all of the players on the same team for a specific game; I ended up grouping all of the players in a specific game together, focusing on counting stats.

Left: ungrouped; right: grouped.

In Pandas, grouping elements in a DataFrame is pretty straightforward (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).

grouper = DataFrame.groupby([list_of_grouper_columns])
grouper = df.groupby(['col1', 'col2'])

This is how you could create a selector variable, here called ‘grouper’. Note that if you are only using one column for grouping the [] is not required; however, I would recommend getting in the habit of using a list even when there is only one element, to keep things consistent.

You can then use the grouper variable for aggregation. You can also do this in one line if desired.

grouper[[list_of_numeric_columns_to_aggregate]].sum()  # aggregation method
grouper[['col1', 'col2']].sum()

In the code, sum(), mean(), median(), etc. are the aggregation methods. You could also use the .agg() method for more granular aggregation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html).

This, in addition to DataFrame selectors, is invaluable when using Pandas (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).
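Tying this back to the QA use case, here is a sketch (with made-up logs) of flagging games where nobody scored, which common sense says should never happen:

```python
import pandas as pd

# Hypothetical player logs: game 2's rows sum to zero points, a red flag.
logs = pd.DataFrame({
    'game_id': [1, 1, 2, 2],
    'player': ['Harden', 'Paul', 'James', 'Davis'],
    'pts': [30, 12, 0, 0],
})

grouper = logs.groupby(['game_id'])  # group every player in the same game
totals = grouper[['pts']].sum()      # aggregate the counting stats per game

# DataFrame selector: keep only the suspicious games for review.
suspect = totals[totals['pts'] == 0]
print(suspect.index.tolist())  # [2]
```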

Making a simple dashboard

I’ll write up a tutorial on making an interactive dashboard in a future post, but wanted to show how you can make a simple dashboard using the Panel library (See https://panel.pyviz.org/user_guide/Components.html for more information on Panel).

This is a high-level summary on how Panel works for dashboarding, using a corkboard for comparison:
1. Panel creates a container. This is like the corkboard.
2. The container holds your digital objects like a chart, table, or photos.
3. You group these objects in rows, columns or both.
4. You place them in the Panel.

This allows you to add widgets to make the charts interactive and even build applications that can be shared with anyone who has access to a browser.

Panel allows you to make your own digital corkboards

Using Pandas and itertuples to loop through a DataFrame by Row

Doing a calculation within a row is usually pretty easy in a DataFrame. If James Harden played 30 minutes and scored 30 points, I could just divide the points column by the minutes column. But what if I wanted to do something that spans rows, like figure out how his scoring changes based on the number of days of rest between games? In that case, for each specific game, I need to look at the previous game to work out how many days of rest he had.

In Excel this would be relatively straightforward. I could just filter for James Harden, sort by date, and put in a formula as per the screenshot.

Not going to fly in Pandas

This, however, isn’t going to work on a massive dataset, and even if it does, you would likely need to turn on manual calculation and would still run into potential performance issues. But how would you do this in Pandas? You can’t really do B3-B2 like in Excel. There are a few methods but, once you get the hang of it, itertuples works like a charm. It yields a tuple for each row of your DataFrame, where row[index] accesses a specific position: row[0] is the DataFrame index, row[1] is the first column, row[2] is the second, and so on.

So say I wanted to loop through a DataFrame to compare something across rows. The example below would print ‘match’ if column 4 in the current row is the same as column 4 in the previous row.

last = None
for row in df.itertuples():
    if last is not None:       # no comparison possible for the first row
        if row[4] == last[4]:  # column 4 of the current vs. previous row
            print('match')
    last = row
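Applied to the actual question, here is a sketch (with made-up logs; the column names are illustrative) of computing days of rest between consecutive games for a player:

```python
import pandas as pd

# Hypothetical game log for one player.
logs = pd.DataFrame({
    'player': ['Harden'] * 3,
    'game_date': pd.to_datetime(['2019-10-24', '2019-10-26', '2019-10-30']),
    'pts': [30, 25, 40],
})
logs = logs.sort_values('game_date')  # rows must be in date order

rest_days = []
last = None
for row in logs.itertuples():
    if last is not None and row.player == last.player:
        # days between this game and the previous one, minus the game day
        rest_days.append((row.game_date - last.game_date).days - 1)
    else:
        rest_days.append(None)  # first game: nothing to compare against
    last = row

logs['rest_days'] = rest_days
print(rest_days)  # [None, 1, 3]
```

Note the attribute-style access (row.player, row.game_date): itertuples yields namedtuples, so columns with valid identifier names can be read by name instead of position.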

So that’s it! We’ve gone through the agenda.

Here is a link to the gist with the actual project file.

https://github.com/tkpca/Python-Scripts/tree/master/Data%20Cleansing%20Example
