cuDF for Data Scientists: Functions for Data Wrangling

Mohammed R. Osman
Jan 5, 2023


Reference guide made to help people transition from Pandas to NVIDIA’s RAPIDS.AI cuDF package.

Purpose

Think of this as a one-stop shop for data wrangling help when using cuDF.

I hope to make this a living document that people can bookmark and use to quickly and easily find help with both RAPIDS.AI cuDF and standard pandas. The focus will be on loading, organizing, and manipulating data on the GPU. The list of functions is based on personal experience with the most common or important pandas data wrangling functions. Functions that analyze, predict, or are used in machine learning are outside the scope. This post should have everything you need to take raw data into the GPU as a DataFrame, clean and organize it, and output it in other forms.

While there is a serious effort to make the transition from pandas to cuDF as pain-free as possible, there can be important differences. Covering them all goes beyond the scope of one post, so I'll link to specific articles and try to point out things to look out for. For comparison's sake, links to both the cuDF and pandas docs are provided, along with examples (and some personal commentary).

As RAPIDS evolves, expect this post to go through multiple revisions as mistakes are fixed, new content added from suggestions, and new cuDF capabilities are integrated.

Setup

Function descriptions will use the following format:

Notice that Pandas insert() has more options available and cuDF has a ‘nan_as_null’ option

1. Instead of going alphabetical, the functions are arranged in groupings or families of functions.

2. The actual function is defined, usually word for word as written by the PyData or RAPIDS teams. Personal notes follow afterwards.

3. The format of how the function is used, with its arguments, is written out. Usually the cuDF and pandas formats are exactly the same, but not always, and this is a good way of being alerted to that. Links to both the official cuDF and pandas docs are provided so you can dig into the functions and see examples.

4. An example with code, provided on a case-by-case basis.

5. A comment explaining what is going on and what insight was gained from the example.

cuDF Functions

cuDF DataFrame:

The fundamental basis of managing data. Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels.

cuDF: (data=None, index=None, columns=None, dtype=None, nan_as_null=True)
pandas: (data=None, index=None, columns=None, dtype=None, copy=None)
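
To make the formats concrete, here's a minimal sketch (with made-up data, not the NFL set) of building a small DataFrame on the GPU; the pandas call is identical apart from the module name.

# Build a small DataFrame on the GPU from a dict of columns
import cudf

gdf = cudf.DataFrame({'team': ['NE', 'BUF', 'MIA'], 'wins': [10, 11, 9]})
print(gdf)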

agg:

Aggregate using one or more operations over the specified axis. When DataFrame.agg is called with a single operation/function, a Series is returned. With several operations/functions, a DataFrame is returned.

cuDF: (aggs, axis=None); note: axis is not yet supported
pandas: (func=None, axis=0, *args, **kwargs)

After creating the DataFrame, agg() was used to condense it down.

agg() is a very flexible tool, this is just a small look into what it can do.
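
A minimal sketch of both return types, using toy data:

# A single operation returns a Series; a list of operations returns a DataFrame
import cudf

df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
print(df.agg('sum'))           # Series
print(df.agg(['sum', 'min']))  # DataFrame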

Using functions across the dataframe without loops (apply, applymap, map, apply_rows, groupby().transform(), items, scatter_by_map). Unfortunately, as of this writing I am not an expert in these functions myself, so only minimal sketches are included for some of them.

apply (apply, groupby.apply):

apply() is highly bespoke in the RAPIDS.AI packages; please look at the linked doc before using it. apply works in parallel, so always use it instead of for loops where possible.

Apply a function along an axis of the DataFrame. apply relies on Numba to JIT compile func. Thus the allowed operations within func are limited to those supported by the CUDA Python Numba target. For more information, see the cuDF guide to user defined functions.

Support for use of string data within UDFs is provided through the strings_udf RAPIDS library. Supported operations on strings include the subset of functions and string methods that expect an input string but do not return a string. Refer to caveats in the UDF guide referenced above.

cuDF: (func, axis=1, raw=False, result_type=None, args=(), **kwargs)
pandas: (func, axis=0, raw=False, result_type=None, args=(), **kwargs)

Using the user-defined function over the DataFrame row by row
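
As a hedged sketch of the idea (toy columns, not the NFL data): the UDF below only uses numeric operations, which is what the Numba/CUDA target supports.

# Row-wise UDF: apply() JIT-compiles the function with Numba and runs it on the GPU
import cudf

df = cudf.DataFrame({'yards': [5, -2, 12], 'penalties': [0, 1, 0]})

def net_yards(row):
    return row['yards'] - 10 * row['penalties']

df['net'] = df.apply(net_yards, axis=1)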

apply_chunks:

Transform user-specified chunks using the user-provided function. (Untested)

cuDF: (func, incols, outcols, kwargs=None, pessimistic_nulls=True, chunks=None, blkct=None, tpb=None)

applymap:

Apply a function to a DataFrame elementwise. This method applies a function that accepts and returns a scalar to every element of a DataFrame. (I couldn't get this to work in cuDF.)

cuDF: (func: Callable[[Any], Any], na_action: Optional[str] = None, **kwargs)
pandas: (func, na_action=None, **kwargs)

Example using applymap() in Pandas

apply_rows:

Apply a row-wise user defined function. Not available in pandas

cuDF: (func, incols, outcols, kwargs, pessimistic_nulls=True, cache_key=None)

Example from RAPIDS doc

groupby().transform()

Passes a function to the whole series and returns a series. Applies an aggregation, then broadcasts the result to the group size. Shares built-in functions with agg(). The pandas version uses the Cython engine by default but can be switched to Numba.

cuDF: groupby(categorical)[column].transform('function')
pandas: (func, *args, engine=None, engine_kwargs=None, **kwargs)

Using groupby().transform() on the DataFrame puts information per category within the rows
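
A minimal sketch, with toy data, of broadcasting a group mean back to every row of its group:

# Each row gets the mean of its own group
import cudf

df = cudf.DataFrame({'team': ['A', 'A', 'B', 'B'], 'yards': [5, 7, 3, 9]})
df['team_mean_yards'] = df.groupby('team')['yards'].transform('mean')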

items:

Iterate over column name and series pairs. Used to take chunks of the data, count the elements, and then reassemble the counts. cuDF, pandas. There is a better write-up at PyData.

items() example in Pydata.org done in cuDF

map (Series):

Map values of a Series according to an input correspondence. Used for substituting each value in a Series with another value, which may be derived from a function, a dict, or a Series. Values that are not found in the dict are converted to NaN. Default values in dicts are currently not supported. Think of map() as taking a dataframe, telling it the function you want to apply, and the inputs it needs. cuDF's map() functionality may seem more limited than the pandas version; this has to do with how cuDF stores data and interacts with the 'object' datatype.

cuDF: (arg, na_action=None)
pandas: (arg, na_action=None)

Utilizing a dictionary to switch values in the DataFrame with map()

Be careful about what kinds of data you use with map():

Using cupy.arange() to generate an array of numbers instead of inputting a list caused a complicated situation
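
A minimal sketch of the dictionary substitution described above (toy values); note that anything missing from the dict becomes null:

# Substitute values via a dict; unmatched values turn into <NA>
import cudf

teams = cudf.Series(['NE', 'BUF', 'NE', 'MIA'])
coded = teams.map({'NE': 0, 'BUF': 1})  # 'MIA' is not in the dict -> <NA>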

scatter_by_map:

Scatter to a list of dataframes. Uses map_index to determine the destination of each row of the original DataFrame. Not available in pandas. (Not tested)

cuDF: (map_index, map_size=None, keep_index=True, **kwargs)

assign binning values (cut, qcut):

cut:

Bin values into discrete intervals. Can be used to convert continuous values into categorical. Is being debugged in cuDF as of writing.

cuDF: (x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = ‘raise’, ordered: bool = True)
pandas: (x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates=’raise’, ordered=True)

cut() created 4 categorical values or bins from a range of numbers
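
A minimal sketch of binning a plain list of numbers into intervals:

# Continuous values become categorical intervals
import cudf

binned = cudf.cut([21, 25, 33, 41, 58], bins=[20, 30, 40, 60])
print(binned)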

qcut:

Quantile-based discretization function. Cuts with percentages of the dataframe. Use qcut() to create a numerical categorical variable that separates the DataFrame for analysis. Not available in cuDF as of writing.

pandas: (x, q, labels=None, retbins=False, precision=3, duplicates=’raise’)

Categorizing the DataFrame into 25% sections

columns (columns, list):

Gives the columns in the dataframe. columns truncates; use list when you need the full rundown.

columns:

Returns the column labels as an Index. cuDF, pandas

Using dataframe options you can specify when output truncates. In general use list when you want everything.

list:

The built-in Python list() applied to a DataFrame or Series

List shows everything as a vertical output.

Combining dataframes (concat, join, merge).

Honestly, I use the pydata.org guide all the time.

concat:

Concatenate DataFrames, Series, or Indices row-wise

cuDF: (objs, axis=0, join=’outer’, ignore_index=False, sort=None)
pandas: (objs, *, axis=0, join=’outer’, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Putting two DataFrames together vertically using concat()
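
A minimal sketch of the vertical stacking shown above, with toy frames:

# Stack two frames row-wise; ignore_index renumbers the result
import cudf

top = cudf.DataFrame({'player': ['Jones'], 'yards': [22]})
bottom = cudf.DataFrame({'player': ['Allen'], 'yards': [14]})
both = cudf.concat([top, bottom], ignore_index=True)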

join:

Join columns with another DataFrame on an index or on a key column. Usually used when putting data together by an index. Join is often useful when the data files are unlabeled and don't have any common columns but use the same index.

cuDF: (other, on=None, how=’left’, lsuffix=’’, rsuffix=’’, sort=False)
pandas: (other, on=None, how=’left’, lsuffix=’’, rsuffix=’’, sort=False, validate=None)

merge:

Merge (GPU) DataFrame objects by performing a database-style join operation by columns or indexes. You are able to join using more than one column.

cuDF: (right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, how=’inner’, sort=False, lsuffix=None, rsuffix=None, indicator=False, suffixes=(‘_x’, ‘_y’))
pandas: (right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(‘_x’, ‘_y’), copy=True, indicator=False, validate=None)

Using the merge() function to combine two DataFrames with 2 identifying columns in common
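
A minimal sketch of a two-key merge like the one above, using hypothetical gameId/playId columns:

# Database-style inner join on two identifying columns
import cudf

plays = cudf.DataFrame({'gameId': [1, 1, 2], 'playId': [1, 2, 1], 'yards': [5, 8, 3]})
grades = cudf.DataFrame({'gameId': [1, 1, 2], 'playId': [1, 2, 1], 'grade': [0.7, 0.9, 0.5]})
df = plays.merge(grades, on=['gameId', 'playId'])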

crosstab:

Compute a cross tabulation of two (or more) factors. Computes a frequency table of the factors unless an array of values and an aggregation function are specified.

cuDF: No documentation is available and some technical issues occur; it works for numerical data but not string/object. Use the pandas crosstab instead.
pandas: (index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name=’All’, dropna=True, normalize=False)

In the crosstab table, ‘a’ here is the y axis and the ‘b’ & ‘c’ are the x axis.

drop:

Drop specified labels from rows or columns.

cuDF: (labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors=’raise’)
pandas: (labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors=’raise’)

Could’ve also used ‘inplace=True’ to save the dataframe with dropped columns

Dealing with duplicate data (duplicated, drop_duplicates)

duplicated:

Return boolean Series denoting duplicate rows. Not available as of writing in cuDF.

cuDF: Unavailable
pandas: (subset=None, keep=’first’)

drop_duplicates:

Return DataFrame with duplicate rows removed, can optionally consider specified subset of columns.

cuDF: (subset=None, keep=’first’, inplace=False, ignore_index=False)
pandas: (subset=None, *, keep=’first’, inplace=False, ignore_index=False)

Only one play from each game is retained. Shows that there were 122 games played in 8 weeks.

explode:

Transform each element of a list-like to a row, replicating index values. Basically, if a value is a series or list, break the list down into separate values.

cuDF: (column, ignore_index=False)
pandas: (column, ignore_index=False)

Three categories in DataFrame converted into two using explode(). Note the index remaining the same.
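
A minimal sketch with a toy list column:

# Each list element becomes its own row; the index value is replicated
import cudf

df = cudf.DataFrame({'play': [1, 2], 'defenders': [[11, 24], [33]]})
df.explode('defenders')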

Finding values (eval, filter, query, groupby.get_group)

eval:

Evaluate a string describing operations on DataFrame columns. eval and query are closely connected in functionality by sharing kwargs.

cuDF: (expr: str, inplace: bool = False, **kwargs)
pandas: (expr, *, inplace=False, **kwargs)

Created a DataFrame from a subset of columns, used eval() to create a new column with custom data, then dropped missing values. Based on what we've found, bad things happen to QBs on 6% of plays. But that's including all the run plays. I found the number to be really low…

filter:

Subset the dataframe rows or columns according to the specified index labels. Does not filter a dataframe on its contents, applied to the labels of the index. Currently unavailable in cuDF.

pandas: (items=None, like=None, regex=None, axis=None)

query:

Query with a boolean expression using Numba to compile a GPU kernel. As of this writing, cuDF's query is more limited than the pandas version: it does not handle categorical or string/object variables. You will get an error saying: "query only supports numeric, datetime, timedelta, or bool dtypes." To deal with this, convert the categorical variable into ordinal values or use one-hot encoding. Query and eval are closely connected in functionality by sharing kwargs. Please be aware that in cuDF, query doesn't have the 'inplace' option. To make changes to the dataframe, you have to assign the result, i.e. write 'df = df.query(expr)'.

cuDF: (expr, local_dict=None)
pandas: (expr, *, inplace=False, **kwargs)

Simple query() filtering all the acceleration values over 4
query() using index values
Info function of query() can be used to see value counts leftover
query() can use multiple criteria at the same time
query() can even use criteria using values outside of the DataFrame using the ‘@’
cuDF’s query() doesn’t use objects or string variables, here each team is given a number identifier
Using the dictionary, string team values were replaced with numerical values, and query() can be used
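
A minimal sketch of the patterns above on toy columns, including the '@' reference (in cuDF the external value is passed through local_dict):

# Filter rows with a boolean expression compiled to a GPU kernel
import cudf

df = cudf.DataFrame({'team_code': [0, 1, 0], 'accel': [3.2, 4.5, 5.1]})
fast = df.query('accel > 4')
limit = 4.0
fast_too = df.query('accel > @limit', local_dict={'limit': limit})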

groupby.get_group:

Construct DataFrame from group with provided name. Not available in cuDF as of writing. I’m including it here just to show an alternative to query available to Pandas users.

pandas: (name, obj=None)

DataFrame filtered down to Pass Blocking plays where Offensive Linemen are in standard “Pass Pro” or “PP”

Frequency (count, unique, nunique, value_counts, groupby.ngroups):

count:

Count non-NA cells for each column or row.

cuDF: (axis=0, level=None, numeric_only=False, **kwargs)
pandas: (axis=0, level=None, numeric_only=False)

unique:

Return unique values

nunique:

Count number of distinct elements in specified axis. Return Series with number of distinct elements. Can ignore NaN values.

cuDF: (axis=0, dropna=True)
pandas: (axis=0, dropna=True)

value_counts:

Return a Series containing counts of unique rows in the DataFrame

cuDF: (subset=None, normalize=False, sort=True, ascending=False, dropna=True)
pandas: (subset=None, normalize=False, sort=True, ascending=False, dropna=True)

The difference between unique() and nunique() was that one Null value
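
A minimal sketch of all three on a toy Series with one null:

# unique() keeps the null; nunique() drops it by default
import cudf

s = cudf.Series(['pass', 'run', 'pass', None])
print(s.unique())        # includes <NA>
print(s.nunique())       # 2, since dropna=True
print(s.value_counts())  # counts per distinct value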

groupby.ngroup:

Number each group from 0 to the number of groups minus 1. Groups are usually arranged alphabetically. This is the enumerative complement of cumcount. Note that the numbers given to the groups match the order in which the groups would be seen when iterating over the groupby object, not the order they are first observed. This is a fast way of coding categorical values into numerical ones. This function is analogous to factorize() applied to groupby().

cuDF: (ascending=True)
pandas: (ascending=True)

In cuDF, query only supports numeric, datetime, timedelta, or bool dtypes. Use ngroup() to code string or object datatypes into numerical.

groupby.ngroups:

Pandas only as of this writing. Use this function as a way of finding out how many unique values a column holds based on other categories. It’s a faster, simpler alternative to using query() + unique(). It can also be used exactly like groupby.ngroup().

Let’s look at QB drop backs. There were 8 different types and ngroup was used to code them.
Numerically coded, now analysis can be done using QB drop backs, most of which were ‘Traditional’ (6)

Coding data from categorical to numerical (factorize, get_dummies):

“category” data type:

cuDF and pandas can compute DataFrame columns with the 'object' datatype as coded numbers while still showing the original input. To do this, change the data type from "object" to "category." The advantages of doing this are that it is easy, fast, and allows the data to be included in statistical processes. The downside is that the user has to figure out the numbering system (usually sorted alphabetically) later.

factorize:

Encode the object as an enumerated type or categorical variable. It creates two outputs: a list of encoded values and a StringIndex or "code book" that shows what the numbered codes represent. This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize() is available both as a top-level function and as a method. cuDF has 'na_sentinel' and pandas has 'use_na_sentinel'; they are equivalents. If True, the sentinel -1 will be used for NaN values. If False, NaN values will be encoded as non-negative integers and the NaN will not be dropped from the uniques of the values. This is a good function for quickly coding categorical values into ordinal values. An alternative to factorize() is groupby.ngroup().

cuDF: (na_sentinel=-1)
pandas: (values, sort=False, na_sentinel=[OLD Avoid], use_na_sentinel=[default=True], size_hint=None)

factorize() creates two outputs: an array and a StringIndex. Save them as two separate lists then add a column or replace the original column with the array. Save the StringIndex for interpreting the data afterwards
Where players lined up in the play, code book has unique values column ranked alphabetically -1.
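
A minimal sketch of capturing both outputs on toy labels:

# codes is an array of integers; uniques is the 'code book' of original labels
import cudf

positions = cudf.Series(['TE', 'QB', 'TE', 'WR'])
codes, uniques = positions.factorize()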

get_dummies:

Returns a dataframe whose columns are the one hot encodings of all columns in the original dataframe or dataframe column

cudf: (df, prefix=None, prefix_sep=’_’, dummy_na=False, columns=None, cats=None, sparse=False, drop_first=False, dtype=’uint8')
pandas: (data, prefix=None, prefix_sep=’_’, dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

get_dummies() is crucial in dealing with categorical variables
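
A minimal sketch of one-hot encoding a toy column, mirroring how it's used on the NFL data later in this post:

# Each distinct value becomes its own 0/1 column
import cudf

df = cudf.DataFrame({'down': [1, 2, 1, 3]})
dummies = cudf.get_dummies(df.down, prefix='down')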

groupby:

Group using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. Can be very different to use compared to pandas so I’m linking the RAPIDS doc on the groupby functions a second time. Unlike Pandas, cuDF uses sort=False by default to achieve better performance, which does not guarantee any particular group order in the result.

cuDF: (by=None, axis=0, level=None, as_index=True, sort=False, group_keys=False, squeeze=False, observed=False, dropna=True)
pandas: (by=None, axis=0, level=None, as_index=True, sort=True, group_keys=_NoDefault.no_default, observed=False, dropna=True)

Examples of groupby() done after reading Soner Yıldırım’s excellent post.

Grouping the DataFrame by where players lined up, making a table of the mean stats for bad plays, and then sorting that table by the worst play: allowing the QB to be sacked
Same as the example above but using agg()
Nonsensical example of using groupby() with agg(). cuDF doesn’t sort without being told to.
Doing a groupby of (1st) Player’s official position, then breaking further down to the type of block used on the play. The results given are the min, mean, and max, in that order. Screencap of a portion of the readout.
groupby() using the ‘dropna=True’ to see if the means change from sorting out missing values
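
A minimal sketch of the split-apply-combine pattern with toy data (note the explicit sort at the end, since cuDF doesn't sort groups by default):

# Group, aggregate with multiple functions, then sort the aggregated table
import cudf

df = cudf.DataFrame({'position': ['T', 'T', 'G', 'G'], 'sacks': [1, 0, 1, 1]})
out = df.groupby('position').agg({'sacks': ['mean', 'max']}).sort_index()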

The following groupby() examples are Pandas-only:

DataFrame using groupby() on two categories and agg() for custom columns
groupby() used with as_index set to False. The DataFrame now has its own indexing, which intersperses categories with missing values. In this case it shows players on the defensive side.
Grouping by position, then block type, the table is sorted by the 10 highest sack rates using nlargest(). I'm guessing a QB gave the ball to another player to throw then tried to block a rusher. It didn't work. The same thing is done to the right, using sort_values() and head().
groupby() with unique() yields a table with all the unique values in each grouping. Saving this table as a DataFrame can be very useful for organizing & coding the data and using explode() later on.
groupby() with the nunique() to find the number of unique values after dividing into groups

Quick dataframe information (describe, head & tail, info, len, shape)

describe:

Get descriptive statistics. Will truncate with many columns, use specific functions if you need data on all the columns.

cuDF: (percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
pandas: (percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

head & tail:

Get the first (or last, with tail; default = 5) rows of the dataframe.

info:

This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

cuDF: (verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
pandas: (verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None, null_counts=None)

Knowing how much VRAM a DataFrame is taking is crucial

len:

Counts how many values are in a set/series/dict; applied to a dataframe it gives the number of rows. Used outside of the dataframe (example: len(df.columns)).

shape:

Returns a tuple representing the dimensionality of the DataFrame.

shape is used often just to check how large the DataFrame is and whether join/merge/concat() worked correctly

Directly access data (at, get_loc, insert, iloc, isin, loc, xs)

insert:

Add a column to DataFrame at the index specified by loc.

cuDF: (loc, name, value, nan_as_null=None)
pandas: (loc, column, value, allow_duplicates={bool, optional, default lib.no_default})

New column with values from a UDF shows more than half the time DEs outplayed Tackles

get_loc:

Get integer location, slice or boolean mask for requested label. For indexes, find what index value a column or row is. Necessary when a DataFrame is converted into a numpy array or other formats where column/row identifiers are stripped out.

cuDF: (key, method=None, tolerance=None)
pandas: (key, method=None, tolerance=None)

iloc:

Select values by position, using the index or another integer based identifier. May also be used with a boolean array. Use like (x,y) coordinate data [row, column]. cuDF; pandas

iloc() is not as often used as loc() but can be important when locating outliers

isin:

Whether each element in the DataFrame is contained in values. cuDF; pandas

isin() can be used further for replacing or changing values

loc / at:

Select rows and columns by label or boolean mask. cuDF; pandas

Used describe() to see how many values were left after using loc()

xs:

Return cross-section from the Series/DataFrame, uses key argument to select data at a particular level of a MultiIndex.

cuDF: [Unavailable as of writing]
pandas: (key, axis=0, level=None, drop_level=True)

Missing Values (dropna, fillna, isna, isnull, isnull().sum, nans_to_nulls, notna, notnull)

dropna:

Drop rows (or columns) containing nulls from a Column.

cuDF: (axis=0, how=’any’, thresh=None, subset=None, inplace=False)
pandas: (*, axis=0, how= {‘any’, ‘all’}, thresh= {int, optional}, subset=None, inplace=False)

nflId had 48,614 missing values which were removed

fillna:

Fill null values with a value or a specified method. Using the 'bfill' method is the same as backfill.

cuDF: (value=None, method=None, axis=None, inplace=False, limit=None)
pandas: (value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=None)

fillna() is good to use when doing numerical coding; '9999999' is used here as the missing value
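
A minimal sketch of both fill styles on a toy Series:

# Fill nulls with a sentinel value, or propagate the previous valid value
import cudf

s = cudf.Series([1.0, None, 3.0])
filled = s.fillna(9999999)
ffilled = s.fillna(method='ffill')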

interpolate:

Interpolate data values between some points. Currently, only 'linear' is supported, which ignores the index and treats each value as equally spaced.

cuDF: (method=’linear’, axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)
pandas: (method=’linear’, *, axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)

interpolate() is a good way of generating new values for indexes and identifiers.

isna / isnull:

Identify missing values. cuDF, pandas; isnull: cuDF; pandas

isna().sum() is a good way of finding how much of a problem missing values are.

notna / notnull:

Identify non-missing values. cuDF, pandas; notnull: cuDF; pandas

notna() tells you how many datapoints aren’t missing values

Pivot tables (melt, pivot, pivot_table, stack, unstack, wide_to_long):

melt:

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set. wide_to_long() can also be used in pandas.

cuDF: (frame, id_vars=None, value_vars=None, var_name=None, value_name=’value’, col_level=None)
pandas: (id_vars=None, value_vars=None, var_name=None, value_name=’value’, col_level=None, ignore_index=True)

Created a DataFrame then used melt() to isolate ‘tax’ values
When you add to ‘value_vars’, melt() lengthens the DataFrame
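
A minimal sketch of unpivoting a toy frame, keeping 'item' as the identifier:

# Wide to long: each value_vars column becomes (variable, value) rows
import cudf

df = cudf.DataFrame({'item': ['a', 'b'], 'price': [1.0, 2.0], 'tax': [0.1, 0.2]})
long_df = cudf.melt(df, id_vars=['item'], value_vars=['tax'])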

pivot:

Return reshaped DataFrame organized by the given index and column values

cuDF: (data, index=None, columns=None, values=None)
pandas: (*, index=None, columns=None, values=None)

Large DataFrame created for illustration
Pivot table created using an ‘index’ and ‘columns’. The third column of values makes up the values on the chart to create a 3 dimensional table.
If you put the ‘columns’ input under ‘[ ]’, then the pivot table expands and becomes 2 dimensional
Multiple values under ‘columns’ greatly expands the 2 dimensional pivot table

pivot_table:

Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns. More visually appealing than pivot().

cuDF: (data, values=None, index=None, columns=None, aggfunc=’mean’, fill_value=None, margins=False, dropna=None, margins_name=’All’, observed=False, sort=True)
pandas: (values=None, index=None, columns=None, aggfunc=’mean’, fill_value=None, margins=False, dropna=True, margins_name=’All’, observed=False, sort=True)

DataFrame created and then turned into a pivot table
Same as above, but 'fill_value' was used to replace null values with 0.
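
A hedged sketch with toy data, assuming your cuDF version ships cudf.pivot_table (it is top-level, so the frame is passed as the first argument):

# Mean yards per team/down cell; nulls replaced with 0
import cudf

df = cudf.DataFrame({'team': ['A', 'A', 'B'], 'down': [1, 2, 1], 'yards': [5, 8, 3]})
pt = cudf.pivot_table(df, values='yards', index='team', columns='down',
                      aggfunc='mean', fill_value=0)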

stack:

Stack the prescribed level(s) from columns to index. Return a reshaped Series. Basically take away a column while still retaining the data.

cuDF: (level=-1, dropna=True)
pandas: (level=-1, dropna=True)

stack() could be useful in organizing wide DataFrames

unstack:

Pivot one or more levels of the (necessarily hierarchical) index labels. Pivots the specified levels of the index labels of df to the innermost levels of the column labels of the result. unstack() doesn't change the pivot table back to a regular DataFrame; it changes the orientation of the data.

cuDF: (df, level, fill_value=None)
pandas: (level=-1, fill_value=None)

Change the ‘level’ argument to change the orientation of the data presented even further.

wide_to_long:

Wide panel to long format. A pandas-only melt() alternative.
pandas: (df, stubnames, i, j, sep=’’, suffix=’\\d+’)

rank:

Compute numerical data ranks (1 through n) along axis. By default, equal values are assigned the average of the ranks of those values.

cuDF: (axis=0, method=’average’, numeric_only=None, na_option=’keep’, ascending=True, pct=False)
pandas: (axis=0, method=’average’, numeric_only=_NoDefault.no_default, na_option=’keep’, ascending=True, pct=False)

rank() used to find the best plays in terms of yardage.

rename:

Alter column and index labels. Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is.

cuDF: (mapper=None, index=None, columns=None, axis=0, copy=True, inplace=False, level=None, errors=’ignore’)
pandas: (mapper=None, *, index=None, columns=None, axis=None, copy=None, inplace=False, level=None, errors=’ignore’)

resample (asfreq):

Convert a time series to a specified frequency. Very necessary when working with time series data. Pandas' asfreq() function is not available in cuDF; use resample() instead. As of this writing I had some problems using the apply function with resample and am looking into it.

cuDF: (rule, axis=0, closed=None, label=None, convention=’start’, kind=None, loffset=None, base=None, on=None, level=None, origin=’start_day’, offset=None)
pandas: (rule, axis=0, closed=None, label=None, convention=’start’, kind=None, loffset=None, base=None, on=None, level=None, origin=’start_day’, offset=None, group_keys=_NoDefault.no_default)

Created a DataFrame containing time series minute bins. Then resampled the frequency into 15 min bins
If you resample upwards, cuDF will create the time data values but leave the values blank. The user should fill those empty values using fillna() or interpolate().
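
A hedged sketch of downsampling a toy minute-level series into 15-minute bins (the index is built through pandas here, which I'm assuming the cuDF constructor accepts):

# Resample a datetime index and aggregate each bin
import pandas as pd
import cudf

idx = pd.date_range('2021-09-12 13:00', periods=60, freq='T')  # one row per minute
ts = cudf.DataFrame({'speed': list(range(60))}, index=idx)
binned = ts.resample('15T').mean()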

Exchanging values (mask, replace, where)

mask:

Replace values where the condition is True. Use it when you want to change values with boolean conditions; conditions can be chained using '&'.

cuDF: (cond, other=None, inplace=False)
pandas: (cond, other=nan, *, inplace=False, axis=None, level=None)

replace:

Replace values given in ‘to_replace’ with alternate value

cuDF: (to_replace=None, value=None, inplace=False, limit=None, regex=False, method=None)
pandas: (to_replace=None, value=_NoDefault.no_default, *, inplace=False, limit=None, regex=False, method={‘pad’, ‘ffill’, ‘bfill’})

I’m a Patriots fan.

where:

Where condition is True, keep the original value. Where False, replace with corresponding value from other.

cuDF: (cond, other=None, inplace=False)
pandas: (cond, other={scalar, Series/DataFrame, or callable}, *, inplace=False, axis=None, level=None, errors=’raise’, try_cast=_NoDefault.no_default)

where() is a good way of exchanging values
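
A minimal sketch contrasting where() and mask() on a toy Series:

# where() keeps values satisfying the condition; mask() replaces them
import cudf

s = cudf.Series([1, 4, 7])
kept = s.where(s > 3, 0)    # [0, 4, 7]
swapped = s.mask(s > 3, 0)  # [1, 0, 0]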

Indexing (index, MultiIndex, names, reset_index, reindex, reindex_like, set_index, sort_index, truncate).

Not too many examples will be provided for this section because index functions are pretty niche, until you need them. Fantastic write-up from PyData here. As stupid as it sounds, you'll find yourself messing with indexes much more than you'd ever expect.

index:

The basic object storing row labels for all cuDF objects. The options for manipulating a dataframe index using cuDF are extensive, especially when combining dataframes or cleaning data. It's worth a look at the official documents.

cuDF: (data=None, dtype=None, copy=False, name=None, tupleize_cols=True, nan_as_null=True, **kwargs)
pandas

Not the most useful output but could help when hunting down missing values

MultiIndex:

A multi-level or hierarchical index. Provides N-Dimensional indexing into Series and DataFrame objects. Good post on advanced uses for MultiIndex / advanced indexing here.

cuDF: (levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, **kwargs)
pandas: (levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, verify_integrity=True)

index.names

Rename the index in a DataFrame using the index.names attribute. I find naming the index and using reset_index() to be good practice during data filtering, merging/joining, and modification. Remember to drop the old index if it's not relevant to the new DataFrame.

Let’s look at the 2022 Rookie Running Backs. After uploading a database on all the Rookies, and specifying relevant columns, I sorted for Rushing Yards
The DataFrame’s index reset to the new sort. The first index value is moved from ‘0’ to ‘1’
Changed the name of the index to ‘Rushing Rank’ and called for the top 10 Rookie rushers in the 2022 season.

reset_index:

Reset the index of the DataFrame, or a level of it.

cuDF: (level=None, drop=False, inplace=False, col_level=0, col_fill=’’)
pandas: (level=None, *, drop=False, inplace=False, col_level=0, col_fill=’’, allow_duplicates= {bool, optional, default lib.no_default}, names=None)

Taking a random sample of 1000 data points from the entire DataFrame.
reset_index() is used here to clean the sample DataFrame and to create randomized DataFrames
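
A minimal sketch of cleaning up after a sort, as in the screenshots above:

# Sort, then renumber; drop=True throws away the old index instead of keeping it as a column
import cudf

df = cudf.DataFrame({'yards': [8, 3, 12]})
df = df.sort_values('yards', ascending=False).reset_index(drop=True)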

reindex:

Conform DataFrame to new index. Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False. I mostly use this to reorder DataFrame columns & rows alphabetically.

cuDF: (labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=<NA>, limit=None, tolerance=None)
pandas: (labels=None, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)

reindex() used to organize DataFrame columns alphabetically

reindex_like:

Conform DataFrame to new index. Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

cuDF: [unavailable as of writing]
pandas: (other, method=None, copy=True, limit=None, tolerance=None)

set_index:

Return a new DataFrame with a new index

cuDF: (keys, drop=True, append=False, inplace=False, verify_integrity=False)
pandas: (keys, *, drop=True, append=False, inplace=False, verify_integrity=False)

sort_index:

Sort object by labels (along an axis).

cuDF: (axis=0, level=None, ascending=True, inplace=False, kind=None, na_position=’last’, sort_remaining=True, ignore_index=False, key=None)
pandas: (*, axis=0, level=None, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’, sort_remaining=True, ignore_index=False, key=None)

sort_index() used here to take away some of the randomness of the selection

sample:

Return a random sample of items from an axis of the object. If reproducible results are required, a random number generator or seed can be provided via the random_state parameter so the same samples are drawn each time.

cuDF: (n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
pandas: (n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

Two ways of taking samples, by explicitly choosing how many or a fraction of the entire DataFrame

sorting values (sort_values, nlargest & nsmallest, groupby.nth, truncate):

sort_values:

Sort by the values along either axis

cuDF: (by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’, ignore_index=False)
pandas: (by, *, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’, ignore_index=False, key=None)

nlargest & nsmallest:

Return the first n rows with the largest values in columns, in descending order (ascending for nsmallest). The columns that are not specified are returned as well, but not used for ordering. I'm including these two functions because they save you from writing out "sort_values(columns, ascending=False).head(n)" and are computationally faster.

cuDF: (n, columns, keep=’first’)
pandas: (n, columns, keep=’first’)

How the hell did a team get to the point where they needed 39 yards for 1st down? [Sack]
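
A minimal sketch of the shortcut on toy data:

# Top 2 rows by yards, without the sort_values().head() dance
import cudf

df = cudf.DataFrame({'play': [1, 2, 3], 'yards': [12, 39, 5]})
top2 = df.nlargest(2, 'yards')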

groupby.nth:

For use either after creating a DataFrame using groupby() or chained directly onto a groupby() call. Take the nth row from each group if n is an int, otherwise a subset of rows. Can be used either as a call or with index notation. dropna is not available with index notation. Index notation accepts a comma-separated list of integers and slices.

Pandas only: if dropna, will take the nth non-null row; dropna is either 'all' or 'any', which is equivalent to calling dropna(how=dropna) before the groupby. Even though they are not specified in the docs, you can use the following args: slice, dropna (='any'), as_index (=True or False).

cuDF: (n)
pandas: (n, dropna=None)

Looking at the 100th row of each position group

value formats (astype, dtypes, select_dtypes, to_datetime)

astype:

Cast the object to the given dtype.

cuDF: (dtype, copy=False, errors=’raise’, **kwargs)
pandas: (dtype, copy=True, errors=’raise’)

Changing the datatype of a column. Key in reducing the DataFrame size by downgrading datatypes
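
A minimal sketch of the downgrade trick mentioned in the caption:

# float64 -> float32 halves the memory the column occupies
import cudf

df = cudf.DataFrame({'x': [1.0, 2.0, 3.0]})
df['x'] = df['x'].astype('float32')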

dtypes:

Find the data types in a dataframe.

select_dtypes:

dtypes gives the format/type of data. select_dtypes returns a subset of the DataFrame’s columns based on the column dtypes.

cuDF: (include=None, exclude=None)
pandas: (include=None, exclude=None)

Looking at the ‘float64’ columns. Can be converted to integers to gain efficiency

For reference on how data types work in python:

from the Python Data Science Handbook by Jake VanderPlas

to_datetime (astype('datetime64[ns]')):

Convert argument to datetime. cuDF likes its dates in the "Year-Month-Day" format; "Month/Day/Year" has to get converted. Converting dates (in multiple columns) into datetime can also be done using the astype() function. As of writing there are some bugs with the 'dayfirst' arg in cuDF, so avoid changing it to 'True.' Some have found the astype() approach faster than to_datetime(), but I haven't tested it myself.

cuDF: (arg, errors=’raise’, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=’ns’, infer_datetime_format=False, origin=’unix’, cache=True)
pandas: (arg, errors=’raise’, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin=’unix’, cache=True)

Using to_datetime()
Using astype()
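
A minimal sketch of both conversion routes on toy date strings:

# Parse strings to datetime64 with to_datetime() or astype()
import cudf

s = cudf.Series(['2021-09-12', '2021-09-19'])
d1 = cudf.to_datetime(s, format='%Y-%m-%d')
d2 = s.astype('datetime64[ns]')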

Description of Dataset Used for Illustrative Purposes

The dataset used for all the examples was provided by the NFL on Kaggle for their 2023 Data Bowl competition. I love the NFL, so I thought it'd be fun to use that dataset for this post.

The data provided was a combination of general NFL player & game descriptions (players.csv, games.csv), the first 8 weeks of last season (week1.csv, etc.), NFL NextGen data (plays.csv), and ProFootball Focus scouting reports on each game (pffScoutingData.csv).

The competition was focused on offensive and defensive line play, so it had lots of data on players along the line of scrimmage and what they did on each play.

Generating NFL Databowl 2023 DataFrame in cuDF

The Kaggle data files can be fetched with the Kaggle API, downloaded directly as zip files (using wget and zipfile, etc.) into pandas, or just downloaded to your directory. I did the last, just to keep things simple and to allow me to rerun the Jupyter notebook over and over again.

For this exercise all we need is to load the cuDF module. I included the others in the code just to show other things RAPIDS users can use.

#import pandas as pd
#import numpy as np

import cudf

#import cupy as cp
#import cuml
#import cugraph as cg
#import cuspatial as cs
#import cuxfilter as cf
#import cusignal as csig
#import cucim as cc

# data viz
#import matplotlib.pyplot as plt
#from matplotlib import rcParams
#import seaborn as sns

Load all the files the NFL Databowl competition organizers provided. Using cuDF, all these files are loaded directly into the GPU's VRAM. As long as all processing is kept inside the GPU using RAPIDS, the computer's CPU, RAM, and disk storage will be minimally involved.

# Load given datasets

games = cudf.read_csv('games.csv')
pffScoutingData = cudf.read_csv('pffScoutingData.csv')
players = cudf.read_csv('players.csv')
plays = cudf.read_csv('plays.csv')
week1 = cudf.read_csv('week1.csv')
week2 = cudf.read_csv('week2.csv')
week3 = cudf.read_csv('week3.csv')
week4 = cudf.read_csv('week4.csv')
week5 = cudf.read_csv('week5.csv')
week6 = cudf.read_csv('week6.csv')
week7 = cudf.read_csv('week7.csv')
week8 = cudf.read_csv('week8.csv')

The first 8 weeks of NFL NextGen stats from the 2021 season have to be put together. The concat() function will do that work.

Notice that as soon as I’m done with a dataframe, it is deleted immediately to conserve VRAM. This is not necessary if your dataset is small relative to your total VRAM capacity but becomes crucial as you get near the limit.

# Combine the 8 weeks of data

weeks = [week1, week2, week3, week4, week5, week6, week7, week8]
weeks = cudf.concat(weeks)
del[week1, week2, week3, week4, week5, week6, week7, week8]

Edit: the following alternative, suggested in feedback, is faster and more flexible:

# Load weekly data and concat in one line (note the ".csv" suffix on each filename)

weeks = cudf.concat([cudf.read_csv("week" + str(x) + ".csv") for x in range(1, 9)])

Using the merge() function, combine the weekly data frame with the PFF scouting numbers.

# Combine weekly data with PFF Scouting data

df = weeks.merge(pffScoutingData, on = ['gameId', 'playId', 'nflId'])
del(weeks, pffScoutingData)

Add game information to the data frame.

# Add in the basic Games data in front

df = games.merge(df, on ='gameId')
del(games)

Attach more detailed information on what happened in each play

# Add the details of the plays at the end of the dataframe

df = df.merge(plays, on = ['gameId', 'playId'])
del(plays)

Our primary data frame now has all the game data for the first half of the 2021 NFL season. Now attach information about the players.

# Add in Player data, putting it in front of the dataframe

df = players.merge(df, on = 'nflId')
del(players)

The base database is now complete. We have all the information the contest providers have given us in an organized, singular data frame. There are many categorical and ordinal values throughout. To do statistical analysis on these columns, we use one-hot encoding to split everything into 0 (False) and 1 (True). This will vastly increase the number of columns the data frame will contain.

# One Hot Encoding of variables with multiple categories using cuDF

officialPosition = cudf.get_dummies(df.officialPosition)
down = cudf.get_dummies(df.down)
passResult = cudf.get_dummies(df.passResult)
offenseFormation = cudf.get_dummies(df.offenseFormation)
personnelO = cudf.get_dummies(df.personnelO)
personnelD = cudf.get_dummies(df.personnelD)
dropBackType = cudf.get_dummies(df.dropBackType)
pff_passCoverage = cudf.get_dummies(df.pff_passCoverage)
pff_passCoverageType = cudf.get_dummies(df.pff_passCoverageType)
pff_role = cudf.get_dummies(df.pff_role)
pff_blockType = cudf.get_dummies(df.pff_blockType)
foulName1 = cudf.get_dummies(df.foulName1)
pff_positionLinedUp = cudf.get_dummies(df.pff_positionLinedUp)

Here we meet our first difference between pandas and cuDF, but it's a very minor one. In pandas, when you combine dataframes that happen to have the same column names but are treated as unique, they are treated as repeats: pandas simply changes the names of the columns (like adding an '_x' suffix).

Rapids AI’s cuDF does not do this.

It will stop the kernel and give you a warning. If you see columns with the same name in data frames you are combining, give each repeat column a unique name.

# Renaming columns to avoid issues with repeats

down.columns = ['SP/NaN', 'FirstD', 'SecondD', 'ThirdD', 'FourthD']
officialPosition.columns = ['Center', 'CornerBack', 'DBack', 'DEnd', 'DTackle', 'FullBack', 'FreeSafety', 'Guard', 'InsideLB', 'LineBacker', 'MiddleLB', 'NoseTackle', 'OutsideLB', 'QuarterBack', 'RunningBack', 'StrongSafety', 'Tackle', 'TightEnd', 'WideReciever']
passResult.columns = ['Complete','Incomplete', 'Interception', 'Scramble', 'Sack']

Now adding the one-hot encoded information to the primary data frame should go problem-free.

# Join One Hot Encoded info into the main dataframe

df = cudf.concat([df, passResult, down, officialPosition, pff_positionLinedUp, pff_role, pff_blockType, dropBackType, offenseFormation, personnelO, personnelD, pff_passCoverage, pff_passCoverageType, foulName1], axis=1)
del([down, officialPosition, pff_positionLinedUp, pff_role, pff_blockType, passResult, dropBackType, offenseFormation, personnelO, personnelD, pff_passCoverage, pff_passCoverageType, foulName1])

And that's it: we have our complete database for analysis.

There are about 8 million rows and 290 columns, taking up around 8 GB of VRAM. Pretty damn big. Using RAPIDS, it all gets done in less than 5 seconds, and almost all of that time is spent moving the initial csv files from the SSD to the GPU.

From here on, the database will be used to illustrate how to use the most common pandas functions for preparing (data munging) data sets for data science and machine learning.

In the vast majority of cases, cuDF and pandas are like for like. There will be explicit notes where there are differences.

Thanks for reading.

EXTRA: Simplified NFL Database Using Pandas

Since I plan on posting this on the Kaggle discussion boards and most people will want to use regular old pandas, here’s how to create the same data set, minus a few columns and the information in the weekly csvs. If your PC can handle all of it (32+ GB RAM), take out the # marks in the code.

cd ~/Dropbox/NFL_Databowl_2023

# Load up modules

# data manipulation
import pandas as pd


# Load up the data files from NFL NextGenStats

games = pd.read_csv('games.csv')
pffScoutingData = pd.read_csv('pffScoutingData.csv')
players = pd.read_csv('players.csv')
plays = pd.read_csv('plays.csv')
#week1 = pd.read_csv('week1.csv')
#week2 = pd.read_csv('week2.csv')
#week3 = pd.read_csv('week3.csv')
#week4 = pd.read_csv('week4.csv')
#week5 = pd.read_csv('week5.csv')
#week6 = pd.read_csv('week6.csv')
#week7 = pd.read_csv('week7.csv')
#week8 = pd.read_csv('week8.csv')
# Drop Unnecessary data from games, players and plays dataframes
games = games.drop(['season', 'gameDate', 'gameTimeEastern', 'homeTeamAbbr', 'visitorTeamAbbr'], axis=1)
players = players.drop(['height', 'weight', 'birthDate', 'collegeName'], axis=1)
plays = plays.drop(['playDescription', 'quarter', 'possessionTeam', 'defensiveTeam', 'yardlineSide', 'gameClock', 'preSnapHomeScore', 'preSnapVisitorScore', 'absoluteYardlineNumber'], axis=1)
# Merge games, pffScoutingData, and players into a single dataframe 'df'
df = games.merge(pffScoutingData, on = 'gameId')
df = players.merge(df, on = 'nflId')
df = df.merge(plays, on = ['gameId','playId'])

del[plays, players, games, pffScoutingData]

# combine the 8 playing weeks of data into 'weeks'
# weeks = pd.concat([week1,week2,week3,week4,week5,week6,week7,week8])

# merge it with the dataframe
# df = pd.merge(df, weeks, on='gameId', how='outer')

# delete unused dataframes
#del[weeks, week1,week2,week3,week4,week5,week6,week7,week8]
# Let's encode Position and Down

officialPosition = pd.get_dummies(df.officialPosition)
down = pd.get_dummies(df.down)

# One Hot Encoding of variables with multiple categories using Pandas

passResult = pd.get_dummies(df.passResult)
offenseFormation = pd.get_dummies(df.offenseFormation)
personnelO = pd.get_dummies(df.personnelO)
personnelD = pd.get_dummies(df.personnelD)
dropBackType = pd.get_dummies(df.dropBackType)
pff_passCoverage = pd.get_dummies(df.pff_passCoverage)
pff_passCoverageType = pd.get_dummies(df.pff_passCoverageType)
pff_role = pd.get_dummies(df.pff_role)
pff_blockType = pd.get_dummies(df.pff_blockType)
foulName1 = pd.get_dummies(df.foulName1)
pff_positionLinedUp = pd.get_dummies(df.pff_positionLinedUp)



# Join all the codes to main dataframe

df = pd.concat([df, down, officialPosition, pff_role, pff_blockType, passResult, dropBackType, offenseFormation, personnelO, personnelD, pff_passCoverage, pff_passCoverageType, foulName1, pff_positionLinedUp], axis=1)
# Save Work

df.to_csv('NFL_2023_Databowl_Data_with_codes.csv')
