cuDF for Data Scientists: Functions for Data Wrangling
Reference guide made to help people transition from Pandas to NVIDIA’s RAPIDS.AI cuDF package.
Purpose
Think of this as a one-stop shop for data wrangling help with cuDF.
I hope to make it a living document that people can bookmark and use to quickly & easily find help using both RAPIDS.AI cuDF and standard Pandas. The focus will be on loading, organizing, and manipulating data on the GPU. The list of functions is based on personal experience as the most common or important pandas data wrangling functions. Functions that analyze, predict, or are used in machine learning are outside the scope. This post should have everything you need to take raw data into the GPU as a DataFrame, clean & organize it, and output it in other forms.
While there is a serious effort to make the transition from pandas to cuDF as pain-free as possible, there can be important differences. Covering them all goes beyond the scope of one post, so I'll link to specific articles and try to point out things to look out for. For comparison's sake, links to both cuDF and pandas docs will be provided, and examples (and some personal commentary) included.
As RAPIDS evolves, expect this post to go through multiple revisions as mistakes are fixed, new content is added from suggestions, and new cuDF capabilities are integrated.
Table of Contents
cuDF Functions
- cuDF DataFrame
- agg
- Using functions across the dataframe without loops (apply, applymap, map, apply_rows, groupby().transform(), items, scatter_by_map)
- Assign binning values (cut, qcut)
- columns (columns, list)
- Combining dataframes (concat, join, merge)
- crosstab
- drop
- Dealing with duplicate data (duplicated, drop_duplicates)
- explode
- Finding values (eval, filter, query, groupby.get_group)
- Frequency (count, unique, nunique, value_counts, groupby.ngroup / ngroups):
- Coding data from categorical to numerical: (“category” data type, factorize, get_dummies)
- groupby
- Quick dataframe information (describe, head & tail, info, len, shape)
- Directly access data (at, get_loc, insert, iloc, isin, loc, xs)
- Pivot tables (melt, pivot, pivot_table, stack, unstack, wide_to_long):
- rank
- rename
- resample
- Exchanging values (mask, replace, where)
- Indexing (index, MultiIndex, names, reset_index, reindex, reindex_like, set_index, sort_index, truncate)
- sample
- sorting values (sort_values, nlargest, nsmallest, groupby.nth)
- value formats (astype, dtypes, select_dtypes, to_datetime)
Extras:
- Description of Dataset Used for Illustrative Purposes
- Generating the NFL Databowl 2023 DataFrame in cuDF
- Simplified NFL Database Using Pandas
Setup
Function descriptions will use the following format:
Arrow 1: Instead of going alphabetical, the functions are arranged in groupings or families of functions.
Arrow 2: The actual function will be defined, usually word for word as written by Pydata or RAPIDS teams. Personal notes are written here afterwards.
Arrow 3: The format of how the function is called, with its arguments, is written out. Usually the cuDF and pandas formats are exactly the same, but not always; this is a good way of being alerted to that. Both cuDF and pandas official docs links are provided so you can dig into the functions and see examples.
Arrow 4: Example provided with code, on a case-by-case basis.
Arrow 5: Comment provided to explain what is going on and what insight was gained in the example.
cuDF Functions
cuDF DataFrame:
The fundamental basis of managing data. Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels.
cuDF: (data=None, index=None, columns=None, dtype=None, nan_as_null=True)
pandas: (data=None, index=None, columns=None, dtype=None, copy=None)
agg:
Aggregate using one or more operations over the specified axis. When DataFrame.agg is called with a single operation/function, a Series is returned; with several operations/functions, a DataFrame is returned.
cudf: (aggs, axis=None); note: axis not yet supported
pandas: (func=None, axis=0, *args, **kwargs)
agg() is a very flexible tool; this is just a small look into what it can do.
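For example, a minimal sketch with made-up data (the column names are illustrative; the same calls work in cuDF by swapping pandas for cudf):

```python
import pandas as pd  # substitute cudf on a GPU

df = pd.DataFrame({"team": ["A", "A", "B"], "yards": [5, 7, 3]})

# A single operation returns a Series
total = df[["yards"]].agg("sum")
print(total["yards"])  # 15

# Several operations return a DataFrame, one row per function
summary = df[["yards"]].agg(["min", "max"])
print(summary.loc["max", "yards"])  # 7
```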
Using functions across the dataframe without loops (apply, applymap, map, apply_rows, groupby().transform(), items, scatter_by_map). Unfortunately, as of this writing I am not an expert in these functions, so examples are limited.
apply (apply, groupby.apply):
apply is highly bespoke in the RAPIDS.AI packages; please look at the linked doc before using it. apply works in parallel, so always use it instead of for loops where possible.
Apply a function along an axis of the DataFrame. apply relies on Numba to JIT compile func. Thus the allowed operations within func are limited to those supported by the CUDA Python Numba target. For more information, see the cuDF guide to user defined functions.
Support for use of string data within UDFs is provided through the strings_udf RAPIDS library. Supported operations on strings include the subset of functions and string methods that expect an input string but do not return a string. Refer to caveats in the UDF guide referenced above.
cuDF: (func, axis=1, raw=False, result_type=None, args=(), **kwargs)
pandas: (func, axis=0, raw=False, result_type=None, args=(), **kwargs)
apply_chunks:
Transform user-specified chunks using the user-provided function. (Untested)
cuDF: (func, incols, outcols, kwargs=None, pessimistic_nulls=True, chunks=None, blkct=None, tpb=None)
applymap:
Apply a function to a Dataframe elementwise. This method applies a function that accepts and returns a scalar to every element of a DataFrame. (Couldn’t get to work)
cuDF: (func: Callable[[Any], Any], na_action: Optional[str] = None, **kwargs)
pandas: (func, na_action=None, **kwargs)
apply_rows:
Apply a row-wise user defined function. Not available in pandas
cuDF: (func, incols, outcols, kwargs, pessimistic_nulls=True, cache_key=None)
groupby().transform()
Passes a function to the whole series and returns a series: apply an aggregation, then broadcast the result to the group size. Shares built-in functions with agg(). The pandas version uses the Cython engine by default but can be switched to Numba.
cudf: groupby(categorical)[column function applied].transform(‘function’)
pandas: (func, *args, engine=None, engine_kwargs=None, **kwargs)
items:
Iterate over column names and series pairs. Used to take chunks of the data, count the elements, and then reassemble the counts. cuDF, pandas. Better write up in Pydata.
map (Series):
Map values of a Series according to an input correspondence. Used for substituting each value in a Series with another value that may be derived from a function, a dict, or a Series. Values that are not found in the dict are converted to NaN. Default values in dicts are currently not supported. Think of map() as taking a dataframe, telling it the function you want to apply, and the inputs it needs. cuDF's map() functionality may seem more limited than the Pandas version; this has to do with how cuDF stores data and interacts with the 'object' datatype.
cuDF: (arg, na_action=None)
pandas: (arg, na_action=None)
Be careful about what kinds of data you use on maps():
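A short sketch in pandas syntax with invented labels (cuDF's Series.map takes the same arguments, but, as noted, does not support dict default values):

```python
import pandas as pd

positions = pd.Series(["QB", "WR", "TE"])
# Values missing from the dict ("TE" here) become NaN
codes = positions.map({"QB": 1, "WR": 2})
print(codes.tolist())  # [1.0, 2.0, nan]
```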
scatter_by_map:
Scatter to a list of dataframes. Uses map_index to determine the destination of each row of the original DataFrame. Not available in pandas. (Not tested)
cuDF: (map_index, map_size=None, keep_index=True, **kwargs)
assign binning values (cut, qcut):
cut:
Bin values into discrete intervals. Can be used to convert continuous values into categorical. Is being debugged in cuDF as of writing.
cuDF: (x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = ‘raise’, ordered: bool = True)
pandas: (x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates=’raise’, ordered=True)
qcut:
Quantile-based discretization function. Cut with percentages of the dataframe. Use qcut() to create a numerical categorical variable that separates the DataFrame for analysis. Not available in cuDF as of writing.
pandas: (x, q, labels=None, retbins=False, precision=3, duplicates=’raise’)
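A pandas-only sketch with made-up numbers, splitting a column into four equal-count buckets:

```python
import pandas as pd

yards = pd.Series([1, 3, 5, 7, 9, 11, 20, 40])
# labels=False returns the bucket number (0-3) instead of interval labels
quartile = pd.qcut(yards, q=4, labels=False)
print(quartile.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```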
Combining dataframes (concat, join, merge).
Honestly, I use the pydata.org guide all the time.
concat:
Concatenate DataFrames, Series, or Indices row-wise
cuDF: (objs, axis=0, join=’outer’, ignore_index=False, sort=None)
pandas: (objs, *, axis=0, join=’outer’, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
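A minimal sketch with two hypothetical weekly frames (the same call exists as cudf.concat):

```python
import pandas as pd  # substitute cudf on a GPU

wk1 = pd.DataFrame({"playId": [1, 2], "yards": [4, 8]})
wk2 = pd.DataFrame({"playId": [3], "yards": [12]})
# Stack rows; ignore_index rebuilds a clean 0..n-1 index
both = pd.concat([wk1, wk2], ignore_index=True)
print(len(both))  # 3
```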
join:
Join columns with another DataFrame on index or on a key column. Usually used when putting DataFrames together using an index. join is often useful when the data files are unlabeled and don't have any common columns but share the same index.
cuDF: (other, on=None, how=’left’, lsuffix=’’, rsuffix=’’, sort=False)
pandas: (other, on=None, how=’left’, lsuffix=’’, rsuffix=’’, sort=False, validate=None)
merge:
Merge (GPU) DataFrame objects by performing a database-style join operation by columns or indexes. You are able to join using more than one column.
cuDF: (right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, how=’inner’, sort=False, lsuffix=None, rsuffix=None, indicator=False, suffixes=(‘_x’, ‘_y’))
pandas: (right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(‘_x’, ‘_y’), copy=True, indicator=False, validate=None)
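A sketch of a multi-column join, mirroring the gameId/playId keys used later in this post (the data values are made up):

```python
import pandas as pd  # cudf.DataFrame.merge takes the same core arguments

plays = pd.DataFrame({"gameId": [1, 1], "playId": [10, 11], "yards": [5, 2]})
scouting = pd.DataFrame({"gameId": [1, 1], "playId": [10, 11], "grade": [0.7, 0.4]})
# Join on more than one column at once
df = plays.merge(scouting, on=["gameId", "playId"])
print(df.columns.tolist())  # ['gameId', 'playId', 'yards', 'grade']
```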
crosstab:
Compute a cross tabulation of two (or more) factors. Computes a frequency table of the factors unless an array of values and an aggregation function are specified.
cuDF: No documentation available and some technical issues occur but it works for numerical data, not string/object. Use pandas crosstab instead.
pandas: (index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name=’All’, dropna=True, normalize=False)
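A pandas sketch with invented labels:

```python
import pandas as pd

position = pd.Series(["QB", "WR", "WR", "QB"], name="position")
team = pd.Series(["A", "A", "B", "A"], name="team")
# Frequency table of the two factors
tab = pd.crosstab(position, team)
print(tab.loc["QB", "A"])  # 2
```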
Dealing with duplicate data (duplicated, drop_duplicates)
duplicated:
Return boolean Series denoting duplicate rows. Not available as of writing in cuDF.
cuDF: Unavailable
pandas: (subset=None, keep=’first’)
drop_duplicates:
Return DataFrame with duplicate rows removed, can optionally consider specified subset of columns.
cuDF: (subset=None, keep=’first’, inplace=False, ignore_index=False)
pandas: (subset=None, *, keep=’first’, inplace=False, ignore_index=False)
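A quick sketch with made-up rows (the cuDF signature is the same):

```python
import pandas as pd  # substitute cudf on a GPU

df = pd.DataFrame({"player": ["a", "a", "b"], "team": ["A", "A", "B"]})
deduped = df.drop_duplicates()                  # keep='first' by default
by_team = df.drop_duplicates(subset=["team"])   # consider only one column
print(len(deduped), len(by_team))  # 2 2
```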
Finding values (eval, filter, query, groupby.get_group)
eval:
Evaluate a string describing operations on DataFrame columns. eval and query are closely connected in functionality by sharing kwargs.
cuDF: (expr: str, inplace: bool = False, **kwargs)
pandas: (expr, *, inplace=False, **kwargs)
filter:
Subset the dataframe rows or columns according to the specified index labels. Does not filter a dataframe on its contents, applied to the labels of the index. Currently unavailable in cuDF.
pandas: (items=None, like=None, regex=None, axis=None)
query:
Query with a boolean expression using Numba to compile a GPU kernel. As of this writing, cuDF's query is more limited than the pandas version: it does not handle categorical or string/object variables. You will get an error saying: “query only supports numeric, datetime, timedelta, or bool dtypes.” To deal with this, convert the categorical variable into ordinal values or use one-hot encoding. query and eval are closely connected in functionality by sharing kwargs. Please be aware that in cuDF, query doesn't have the 'inplace' option; to make changes to the dataframe, you have to reassign: df = df.query(expr).
cuDF: (expr, local_dict=None)
pandas: (expr, *, inplace=False, **kwargs)
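A sketch using numeric columns only (the restriction cuDF imposes), with the reassignment pattern since cuDF has no inplace option; the column names are invented:

```python
import pandas as pd  # substitute cudf on a GPU for numeric columns like these

df = pd.DataFrame({"down": [1, 2, 3, 4], "yards": [10, 3, 7, 1]})
# Reassign the result back to df to keep the filtered frame
df = df.query("down <= 2 and yards > 5")
print(df["down"].tolist())  # [1]
```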
groupby.get_group: (name, obj=None)
Construct DataFrame from group with provided name. Not available in cuDF as of writing. I’m including it here just to show an alternative to query available to Pandas users.
pandas: (name, obj=None)
Frequency (count, unique, nunique, value_counts, groupby.ngroup / ngroups):
count:
Count non-NA cells for each column or row.
cuDF: (axis=0, level=None, numeric_only=False, **kwargs)
pandas:(axis=0, level=None, numeric_only=False)
unique:
Return unique values
nunique:
Count number of distinct elements in specified axis. Return Series with number of distinct elements. Can ignore NaN values.
cuDF: (axis=0, dropna=True)
pandas: (axis=0, dropna=True)
value_counts:
Return a Series containing counts of unique rows in the DataFrame
cuDF: (subset=None, normalize=False, sort=True, ascending=False, dropna=True)
pandas: (subset=None, normalize=False, sort=True, ascending=False, dropna=True)
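A one-liner sketch on a Series of invented position labels (the same method exists on cuDF Series):

```python
import pandas as pd

s = pd.Series(["WR", "RB", "WR", "WR"])
counts = s.value_counts()   # sorted by frequency, descending, by default
print(counts.to_dict())     # {'WR': 3, 'RB': 1}
```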
groupby.ngroup:
Number each group from 0 to the number of groups - 1, usually arranged alphabetically. This is the enumerative complement of cumcount. Note that the numbers given to the groups match the order in which the groups would be seen when iterating over the groupby object, not the order they are first observed. This is a fast way of coding categorical values into numerical ones. This function is analogous to factorize() applied to groupby().
cuDF: (ascending=True)
pandas: (ascending=True)
groupby.ngroups:
Pandas only as of this writing. Use this function as a way of finding out how many unique values a column holds based on other categories. It’s a faster, simpler alternative to using query() + unique(). It can also be used exactly like groupby.ngroup().
Coding data from categorical to numerical (“category” data type, factorize, get_dummies):
“category” data type:
cuDF and Pandas can compute DataFrame columns with the 'object' datatype as coded numbers while still showing the original input. To do this, change the data type from “object” to “category.” The advantages of doing this are that it is easy, fast, and allows the data to be included in statistical processes. The downside is that the user has to figure out the numbering system (usually sorted alphabetically) later.
factorize:
Encode the object as an enumerated type or categorical variable. It creates two outputs: a list of encoded values and a StringIndex or code book that shows what the numbered codes represent. This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize() is available both as a top-level function and as a Series method. cuDF has 'na_sentinel' and Pandas has 'use_na_sentinel'; they are equivalents. If True, the sentinel -1 will be used for NaN values. If False, NaN values will be encoded as non-negative integers and will not be dropped from the uniques of the values. This is a good function for quickly coding categorical values into ordinal values. An alternative to factorize() is to use groupby.ngroup().
cudf: (na_sentinel=-1)
pandas: (values, sort=False, use_na_sentinel=True, size_hint=None); the older na_sentinel argument is deprecated, avoid it.
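A quick sketch of the two outputs, on invented labels (the same call exists as cudf.factorize):

```python
import pandas as pd

codes, uniques = pd.factorize(pd.Series(["TE", "QB", "TE"]))
print(codes.tolist())  # [0, 1, 0] -- codes in order of first appearance
print(list(uniques))   # ['TE', 'QB'] -- the "code book"
```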
get_dummies:
Returns a dataframe whose columns are the one hot encodings of all columns in the original dataframe or dataframe column
cudf: (df, prefix=None, prefix_sep=’_’, dummy_na=False, columns=None, cats=None, sparse=False, drop_first=False, dtype=’uint8')
pandas: (data, prefix=None, prefix_sep=’_’, dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
groupby:
Group using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. It can be very different to use compared to pandas, so I'm linking the RAPIDS doc on the groupby functions a second time. Unlike Pandas, cuDF uses sort=False by default to achieve better performance, which does not guarantee any particular group order in the result.
cuDF: (by=None, axis=0, level=None, as_index=True, sort=False, group_keys=False, squeeze=False, observed=False, dropna=True)
pandas: (by=None, axis=0, level=None, as_index=True, sort=True, group_keys=_NoDefault.no_default, observed=False, dropna=True)
Examples of groupby() done after reading Soner Yıldırım’s excellent post.
The following groupby() examples are Pandas-only:
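For instance, a minimal split-apply-combine sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B"], "yards": [5, 7, 3]})
# Split by team, apply mean, combine into one Series
means = df.groupby("team")["yards"].mean()
print(means.to_dict())  # {'A': 6.0, 'B': 3.0}
```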
Quick dataframe information (describe, head & tail, info, len, shape)
describe:
Get descriptive statistics. Will truncate with many columns, use specific functions if you need data on all the columns.
cuDF: (percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
pandas: (percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
head & tail:
Get the first (or last; default = 5) rows of the dataframe.
info:
This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
cuDF: (verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
pandas: (verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None, null_counts=None)
len:
Count how many values in a set/series/dict, applied to a dataframe gives rows. Used outside of the dataframe (example: len(df.columns))
shape:
Returns a tuple representing the dimensionality of the DataFrame.
Directly access data (at, get_loc, insert, iloc, isin, loc, xs)
insert:
Add a column to DataFrame at the index specified by loc.
cuDF: (loc, name, value, nan_as_null=None)
pandas: (loc, column, value, allow_duplicates={bool, optional, default lib.no_default})
get_loc:
Get integer location, slice or boolean mask for requested label. For indexes, find what index value a column or row is. Necessary when a DataFrame is converted into a numpy array or other formats where column/row identifiers are stripped out.
cudf: (key, method=None, tolerance=None)
pandas: (key, method=None, tolerance=None)
iloc:
Select values by position, using the index or another integer based identifier. May also be used with a boolean array. Use like (x,y) coordinate data [row, column]. cuDF; pandas
isin:
Whether each element in the DataFrame is contained in values. cuDF; pandas
loc / at:
Select rows and columns by label or boolean mask. cuDF; pandas
xs:
Return cross-section from the Series/DataFrame, uses key argument to select data at a particular level of a MultiIndex.
cuDF: [Unavailable as of writing]
pandas: (key, axis=0, level=None, drop_level=True)
Missing Values (dropna, fillna, isna, isnull, isnull().sum, nans_to_nulls, notna, notnull)
dropna:
Drop rows (or columns) containing nulls from a Column.
cuDF: (axis=0, how=’any’, thresh=None, subset=None, inplace=False)
pandas: (*, axis=0, how= {‘any’, ‘all’}, thresh= {int, optional}, subset=None, inplace=False)
fillna:
Fill null values with a value or a specified method. The 'bfill' method is the same as 'backfill'.
cuDF: (value=None, method=None, axis=None, inplace=False, limit=None)
pandas: (value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=None)
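A small sketch with an invented Series, showing a constant fill and a forward fill (pandas syntax; cuDF's fillna takes the same value/method arguments):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])
print(s.fillna(0).tolist())  # [1.0, 0.0, 3.0] -- fill with a constant
print(s.ffill().tolist())    # [1.0, 1.0, 3.0] -- forward fill
```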
interpolate:
Interpolate data values between some points. Currently, only 'linear' is supported, which ignores the index and treats each value as equally spaced.
cuDF: (method=’linear’, axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)
pandas: (method=’linear’, *, axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)
isna / isnull:
Identify missing values. cuDF, pandas; isnull: cuDF; pandas
notna / notnull:
Identify non-missing values. cuDF, pandas; notnull: cuDF; pandas
Pivot tables (melt, pivot, pivot_table, stack, unstack, wide_to_long):
melt:
Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set. Also wide_to_long() can be used in pandas.
cuDF: (frame, id_vars=None, value_vars=None, var_name=None, value_name=’value’, col_level=None)
pandas: (id_vars=None, value_vars=None, var_name=None, value_name=’value’, col_level=None, ignore_index=True)
pivot:
Return reshaped DataFrame organized by the given index and column values
cuDF: (data, index=None, columns=None, values=None)
pandas:(*, index=None, columns=None, values=None)
pivot_table:
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns. More visually appealing than pivot().
cuDF: (data, values=None, index=None, columns=None, aggfunc=’mean’, fill_value=None, margins=False, dropna=None, margins_name=’All’, observed=False, sort=True)
pandas: (values=None, index=None, columns=None, aggfunc=’mean’, fill_value=None, margins=False, dropna=True, margins_name=’All’, observed=False, sort=True)
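A pandas-style sketch with made-up data (cudf.pivot_table takes the DataFrame as its first argument, otherwise the same):

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B"],
                   "down": [1, 2, 1],
                   "yards": [4, 6, 8]})
# Rows = team, columns = down, cells = mean yards (MultiIndex-backed)
table = pd.pivot_table(df, values="yards", index="team",
                       columns="down", aggfunc="mean")
print(table.loc["A", 1])  # 4.0
```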
stack:
Stack the prescribed level(s) from columns to index. Return a reshaped Series. Basically take away a column while still retaining the data.
cudf: (level=-1, dropna=True)
pandas: (level=-1, dropna=True)
unstack:
Pivot one or more levels of the (necessarily hierarchical) index labels. Pivots the specified levels of the index labels of df to the innermost levels of the column labels of the result. unstack() doesn't change the pivot table back to a regular DataFrame; it changes the orientation of the data.
cuDF: (df, level, fill_value=None)
pandas: (level=-1, fill_value=None)
wide_to_long:
Wide panel to long format. Pandas only melt() alternative.
pandas: (df, stubnames, i, j, sep=’’, suffix=’\\d+’)
rank:
Compute numerical data ranks (1 through n) along axis. By default, equal values are assigned the average of the ranks of those values.
cuDF: (axis=0, method=’average’, numeric_only=None, na_option=’keep’, ascending=True, pct=False)
pandas: (axis=0, method=’average’, numeric_only=_NoDefault.no_default, na_option=’keep’, ascending=True, pct=False)
rename:
Alter column and index labels. Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is.
cuDF: (mapper=None, index=None, columns=None, axis=0, copy=True, inplace=False, level=None, errors=’ignore’)
pandas: (mapper=None, *, index=None, columns=None, axis=None, copy=None, inplace=False, level=None, errors=’ignore’)
resample (asfreq):
Convert a time series to a specified frequency. Very necessary when working with time series data. Pandas' asfreq() function is not available in cuDF; use resample() instead. As of this writing I had some problems using the apply function with resample and am looking into it.
cuDF: (rule, axis=0, closed=None, label=None, convention=’start’, kind=None, loffset=None, base=None, on=None, level=None, origin=’start_day’, offset=None)
pandas: (rule, axis=0, closed=None, label=None, convention=’start’, kind=None, loffset=None, base=None, on=None, level=None, origin=’start_day’, offset=None, group_keys=_NoDefault.no_default)
Exchanging values (mask, replace, where)
mask:
Replace values where the condition is True. Use this when you want to change values using boolean conditions; conditions can be chained with boolean operators.
cuDF: (cond, other=None, inplace=False)
pandas: (cond, other=nan, *, inplace=False, axis=None, level=None)
replace:
Replace values given in 'to_replace' with an alternate value.
cuDF: (to_replace=None, value=None, inplace=False, limit=None, regex=False, method=None)
pandas: (to_replace=None, value=_NoDefault.no_default, *, inplace=False, limit=None, regex=False, method={‘pad’, ‘ffill’, ‘bfill’})
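A one-line sketch with invented team codes (the dict form maps to_replace to value; the same call works on cuDF Series):

```python
import pandas as pd

s = pd.Series(["STL", "LA", "STL"])
out = s.replace({"STL": "LA"})  # dict maps old value -> new value
print(out.tolist())  # ['LA', 'LA', 'LA']
```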
where:
Where condition is True, keep the original value. Where False, replace with corresponding value from other.
cuDF: (cond, other=None, inplace=False)
pandas: (cond, other={scalar, Series/DataFrame, or callable}, *, inplace=False, axis=None, level=None, errors=’raise’, try_cast=_NoDefault.no_default)
Indexing (index, MultiIndex, names, reset_index, reindex, reindex_like, set_index, sort_index, truncate).
Not too many examples will be provided for this section because index functions are pretty niche, until you need them. Fantastic write up from PyData here. As stupid as it sounds, you’ll find yourself messing with indexes much more than you’d ever expect.
index:
The basic object storing row labels for all cuDF objects. The options for manipulating a dataframe index using cuDF are extensive, especially when combining dataframes or cleaning data. It's worth a look at the official documents.
cuDF: (data=None, dtype=None, copy=False, name=None, tupleize_cols=True, nan_as_null=True, **kwargs)
pandas
MultiIndex:
A multi-level or hierarchical index. Provides N-Dimensional indexing into Series and DataFrame objects. Good post on advanced uses for MultiIndex / advanced indexing here.
cuDF: (levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, **kwargs)
pandas: (levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, verify_integrity=True)
index.names
Rename the index in a DataFrame using the index.names attribute. I find naming the index and using reset_index() to be good practice during data filtering, merging/joining, and modification. Remember to drop the old index if it's not relevant to the new DataFrame.
reset_index:
Reset the index of the DataFrame, or a level of it.
cuDF: (level=None, drop=False, inplace=False, col_level=0, col_fill=’’)
pandas: (level=None, *, drop=False, inplace=False, col_level=0, col_fill=’’, allow_duplicates= {bool, optional, default lib.no_default}, names=None)
reindex:
Conform DataFrame to new index. Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False. I mostly use this to reorder DataFrame columns & rows alphabetically.
cuDF: (labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=<NA>, limit=None, tolerance=None)
pandas: (labels=None, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)
reindex_like:
Conform the DataFrame to match the index of another object. Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
cuDF: [unavailable as of writing]
pandas: (other, method=None, copy=True, limit=None, tolerance=None)
set_index:
Return a new DataFrame with a new index
cuDF: (keys, drop=True, append=False, inplace=False, verify_integrity=False)
pandas: (keys, *, drop=True, append=False, inplace=False, verify_integrity=False)
sort_index:
Sort object by labels (along an axis).
cuDF: (axis=0, level=None, ascending=True, inplace=False, kind=None, na_position=’last’, sort_remaining=True, ignore_index=False, key=None)
pandas: (*, axis=0, level=None, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’, sort_remaining=True, ignore_index=False, key=None)
sample:
Return a random sample of items from an axis of the object. If reproducible results are required, specify the random_state parameter (a seed or a random number generator) to get the same samples each time.
cuDF: (n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
pandas: (n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
sorting values (sort_values, nlargest & nsmallest, groupby.nth):
sort_values:
Sort by the values along either axis
cuDF: (by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’, ignore_index=False)
pandas: (by, *, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’, ignore_index=False, key=None)
nlargest & nsmallest:
Return the first n rows with the largest values in columns, in descending order (ascending for nsmallest). The columns that are not specified are returned as well, but not used for ordering. Including these two functions because they save you from writing out sort_values(columns, ascending=False).head(n) and are computationally faster.
cuDF: (n, columns, keep=’first’)
pandas: (n, columns, keep=’first’)
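A sketch of the equivalence with made-up data (the cuDF call is identical):

```python
import pandas as pd  # substitute cudf on a GPU

df = pd.DataFrame({"player": ["a", "b", "c"], "yards": [9, 30, 12]})
top2 = df.nlargest(2, "yards")
# Same rows as df.sort_values("yards", ascending=False).head(2)
print(top2["player"].tolist())  # ['b', 'c']
```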
groupby.nth:
For use either after creating a DataFrame using groupby() or chained onto the end of a line utilizing groupby(). Take the nth row from each group if n is an int, otherwise a subset of rows. Can be either a call or an index. Index notation accepts a comma-separated list of integers and slices; dropna is not available with index notation.
Pandas only: if dropna is set, nth will take the nth non-null row; dropna is either 'all' or 'any'. This is equivalent to calling dropna(how=dropna) before the groupby. Even though it is not specified in the docs, you can also use the following args: slice, dropna (='any'), as_index (=True or False).
cuDF: (n)
pandas: (n, dropna=None)
value formats (astype, dtypes, select_dtypes, to_datetime)
astype:
Cast the object to the given dtype.
cuDF: (dtype, copy=False, errors=’raise’, **kwargs)
pandas: (dtype, copy=True, errors=’raise’)
dtypes:
find the data types in a dataframe
select_dtypes:
dtypes gives the format/type of data. select_dtypes returns a subset of the DataFrame’s columns based on the column dtypes.
cuDF: (include=None, exclude=None)
pandas: (include=None, exclude=None)
For reference on how data types work in python:
to_datetime (astype(‘datetime64[]’)):
Convert argument to datetime. cuDF likes its dates in the “Year-Month-Day” format; “Month/Day/Year” has to get converted. Converting dates (in multiple columns) into datetime can also be done by using the astype() function. As of writing there are some bugs with the 'dayfirst' arg in cuDF, so avoid changing it to 'True.' Some have found using the astype() function faster than to_datetime() but I haven't tested it myself.
cuDF: (arg, errors=’raise’, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=’ns’, infer_datetime_format=False, origin=’unix’, cache=True)
pandas: (arg, errors=’raise’, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin=’unix’, cache=True)
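A short sketch with made-up dates already in the cuDF-friendly ISO format (cudf.to_datetime accepts the same core arguments):

```python
import pandas as pd  # substitute cudf on a GPU

s = pd.Series(["2021-09-09", "2021-09-12"])  # Year-Month-Day
dates = pd.to_datetime(s, format="%Y-%m-%d")
print(dates.dt.year.tolist())  # [2021, 2021]
```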
Description of Dataset Used for Illustrative Purposes
The dataset used for all the examples were provided by the NFL on Kaggle for their 2023 Data Bowl competition. I love the NFL so thought it’d be fun to use that dataset for this post.
The data provided was a combination of general NFL player & game descriptions (players.csv, games.csv), the first 8 weeks of last season (week1.csv, etc.), NFL NextGen data (plays.csv), and ProFootball Focus scouting reports on each game (pffScoutingData.csv).
The competition was focused on offensive and defensive line play, so had lots of data on players along the line of scrimmage and what they did each play.
Generating NFL Databowl 2023 DataFrame in cuDF
The Kaggle data files can be downloaded via the Kaggle API, downloaded directly as zip files (using wget and zipfile, etc.) into pandas, or simply downloaded to your directory. I did the last just to keep things simple and allow me to rerun the jupyter notebook over and over again.
For this exercise all we need is to load the cuDF module. I included the others in the code just to show other things Rapids users can use.
#import pandas as pd
#import numpy as np
import cudf
#import cupy as cp
#import cuml
#import cugraph as cg
#import cuspatial as cs
#import cuxfilter as cf
#import cusignal as csig
#import cucim as cc
# data viz
#import matplotlib.pyplot as plt
#from matplotlib import rcParams
#import seaborn as sns
Load all the files the NFL Databowl competition organizers provided. Using cuDF, all these files are loaded directly into the GPU's VRAM. As long as all processes are kept inside the GPU using Rapids, the computer's CPU, RAM, and disk storage will be minimally involved.
# Load given datasets
games = cudf.read_csv('games.csv')
pffScoutingData = cudf.read_csv('pffScoutingData.csv')
players = cudf.read_csv('players.csv')
plays = cudf.read_csv('plays.csv')
week1 = cudf.read_csv('week1.csv')
week2 = cudf.read_csv('week2.csv')
week3 = cudf.read_csv('week3.csv')
week4 = cudf.read_csv('week4.csv')
week5 = cudf.read_csv('week5.csv')
week6 = cudf.read_csv('week6.csv')
week7 = cudf.read_csv('week7.csv')
week8 = cudf.read_csv('week8.csv')
The NFL NextGen stats for the first 8 weeks of the 2021 season have to be put together. The concat() function will do that work.
Notice that as soon as I’m done with a dataframe, it is deleted immediately to conserve VRAM. This is not necessary if your dataset is small relative to your total VRAM capacity but becomes crucial as you get near the limit.
# Combine the 8 weeks of data
weeks = [week1, week2, week3, week4, week5, week6, week7, week8]
weeks = cudf.concat(weeks)
del week1, week2, week3, week4, week5, week6, week7, week8
Edit: the following alternative, suggested in feedback, is faster and more flexible:
# Load weekly data and concat in one line
weeks = cudf.concat([cudf.read_csv(f"week{x}.csv") for x in range(1, 9)])
Using the merge() function, combine the weekly data frame with the PFF scouting numbers.
# Combine weekly data with PFF Scouting data
df = weeks.merge(pffScoutingData, on = ['gameId', 'playId', 'nflId'])
del weeks, pffScoutingData
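One thing worth knowing here: merge() defaults to an inner join, so any tracking row without a matching PFF record is silently dropped. A toy pandas illustration with made-up values (the tracking data also follows the football itself, which has no nflId); cuDF's merge behaves the same way:

```python
import pandas as pd  # cudf.DataFrame.merge takes the same arguments

# Made-up tracking rows; the NaN nflId stands in for the football itself
tracking = pd.DataFrame({"gameId": [1, 1, 1], "playId": [10, 10, 10],
                         "nflId": [100.0, 101.0, float("nan")]})
scouting = pd.DataFrame({"gameId": [1, 1], "playId": [10, 10],
                         "nflId": [100.0, 101.0],
                         "pff_role": ["Pass", "Pass Block"]})

# Default inner join: the unmatched football row is silently dropped
merged = tracking.merge(scouting, on=["gameId", "playId", "nflId"])
print(len(merged))  # 2
```

If you wanted to keep the unmatched rows, pass how='left' (or how='outer') instead.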
Add game information to the data frame.
# Add in the basic Games data in front
df = games.merge(df, on ='gameId')
del games
Attach more detailed information on what happened in each play.
# Add the details of the plays at the end of the dataframe
df = df.merge(plays, on = ['gameId', 'playId'])
del plays
Our primary data frame now has all the game data for the first half of the 2021 NFL season. Now attach information about the players.
# Add in Player data, putting it in front of the dataframe
df = players.merge(df, on = 'nflId')
del players
The base database is now complete. We have all the information the contest providers have given us in one organized data frame. There are many categorical and ordinal values throughout. To do statistical analysis on these columns, we use one-hot encoding to turn each category into its own 0 (False) / 1 (True) column. This will vastly increase the number of columns in the data frame.
# One Hot Encoding of variables with multiple categories using cuDF
officialPosition = cudf.get_dummies(df.officialPosition)
down = cudf.get_dummies(df.down)
passResult = cudf.get_dummies(df.passResult)
offenseFormation = cudf.get_dummies(df.offenseFormation)
personnelO = cudf.get_dummies(df.personnelO)
personnelD = cudf.get_dummies(df.personnelD)
dropBackType = cudf.get_dummies(df.dropBackType)
pff_passCoverage = cudf.get_dummies(df.pff_passCoverage)
pff_passCoverageType = cudf.get_dummies(df.pff_passCoverageType)
pff_role = cudf.get_dummies(df.pff_role)
pff_blockType = cudf.get_dummies(df.pff_blockType)
foulName1 = cudf.get_dummies(df.foulName1)
pff_positionLinedUp = cudf.get_dummies(df.pff_positionLinedUp)
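As a quick illustration of what each of those calls produces, here is get_dummies on a toy column (made-up values; the real officialPosition column has many more categories):

```python
import pandas as pd  # cudf.get_dummies works the same way

# Toy stand-in for the officialPosition column
positions = pd.Series(["QB", "WR", "QB", "T"])
dummies = pd.get_dummies(positions, dtype="int8")  # int8 keeps memory down
print(list(dummies.columns))  # one 0/1 column per category: ['QB', 'T', 'WR']
```

Each row gets a 1 in exactly one of the new columns, which is what lets us run numeric analysis on what was categorical data.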
Here we meet our first difference between pandas and cuDF, but it's a very minor one. When pandas combines dataframes that happen to share column names, it treats them as repeats and simply renames the columns (adding a suffix like '_x').
RAPIDS' cuDF does not do this.
It will stop the kernel and raise an error. If you see columns with the same name in the data frames you are combining, give each repeated column a unique name.
# Renaming columns to avoid issues with repeats
down.columns = ['SP/NaN', 'FirstD', 'SecondD', 'ThirdD', 'FourthD']
officialPosition.columns = ['Center', 'CornerBack', 'DBack', 'DEnd', 'DTackle', 'FullBack', 'FreeSafety', 'Guard', 'InsideLB', 'LineBacker', 'MiddleLB', 'NoseTackle', 'OutsideLB', 'QuarterBack', 'RunningBack', 'StrongSafety', 'Tackle', 'TightEnd', 'WideReceiver']
passResult.columns = ['Complete','Incomplete', 'Interception', 'Scramble', 'Sack']
Now adding the one-hot encoded information to the primary data frame should go problem-free.
# Join One Hot Encoded info into the main dataframe
df = cudf.concat([df, passResult, down, officialPosition, pff_positionLinedUp, pff_role, pff_blockType, dropBackType, offenseFormation, personnelO, personnelD, pff_passCoverage, pff_passCoverageType, foulName1], axis=1)
del down, officialPosition, pff_positionLinedUp, pff_role, pff_blockType, passResult, dropBackType, offenseFormation, personnelO, personnelD, pff_passCoverage, pff_passCoverageType, foulName1
And that’s it: we have our complete database for analysis.
There are about 8 million rows and 290 columns, taking up around 8 GB of VRAM. Pretty damn big. Using RAPIDS, it all gets done in less than 5 seconds, and almost all of that time is spent moving the initial csv files from the SSD to the GPU.
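If you want to verify the footprint on your own machine, memory_usage() reports bytes per column in both libraries (on cuDF it reflects GPU memory). A minimal pandas sketch with a toy frame:

```python
import pandas as pd  # the same call exists on cudf.DataFrame

df = pd.DataFrame({"a": range(1000), "b": [0.5] * 1000})

# deep=True also counts object/string payloads, not just the pointers
total_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{total_mb:.3f} MB")
```

Checking this after each merge is a cheap way to spot the step that pushes you toward your VRAM limit.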
From here on, the database will be used to illustrate the most common pandas functions for preparing (munging) data sets for data science and machine learning.
In the vast majority of cases, cuDF and pandas work like for like. There will be explicit notes where there are differences.
Thanks for reading.
EXTRA: Simplified NFL Database Using Pandas
Since I plan on posting this on the Kaggle discussion boards and most people will want to use regular old pandas, here’s how to create the same data set, minus a few columns and the information in the weekly csvs. If your PC can handle all of it (32+ GB of RAM), remove the # marks in the code.
%cd ~/Dropbox/NFL_Databowl_2023  # Jupyter magic: move to the folder holding the csv files
# Load up modules
# data manipulation
import pandas as pd
# Load up the data files from NFL NextGenStats
games = pd.read_csv('games.csv')
pffScoutingData = pd.read_csv('pffScoutingData.csv')
players = pd.read_csv('players.csv')
plays = pd.read_csv('plays.csv')
#week1 = pd.read_csv('week1.csv')
#week2 = pd.read_csv('week2.csv')
#week3 = pd.read_csv('week3.csv')
#week4 = pd.read_csv('week4.csv')
#week5 = pd.read_csv('week5.csv')
#week6 = pd.read_csv('week6.csv')
#week7 = pd.read_csv('week7.csv')
#week8 = pd.read_csv('week8.csv')
# Drop Unnecessary data from games, players and plays dataframes
games = games.drop(['season', 'gameDate', 'gameTimeEastern', 'homeTeamAbbr', 'visitorTeamAbbr'], axis=1)
players = players.drop(['height', 'weight', 'birthDate', 'collegeName'], axis=1)
plays = plays.drop(['playDescription', 'quarter', 'possessionTeam', 'defensiveTeam', 'yardlineSide', 'gameClock', 'preSnapHomeScore', 'preSnapVisitorScore', 'absoluteYardlineNumber'], axis=1)
# Merge games, pffScoutingData, and players into a single dataframe 'df'
df = games.merge(pffScoutingData, on = 'gameId')
df = players.merge(df, on = 'nflId')
df = df.merge(plays, on = ['gameId','playId'])
del plays, players, games, pffScoutingData
# combine the 8 playing weeks of data into 'weeks'
# weeks = pd.concat([week1,week2,week3,week4,week5,week6,week7,week8])
# merge it with the dataframe
# df = pd.merge(df, weeks, on=['gameId', 'playId', 'nflId'], how='outer')
# delete unused dataframes
# del weeks, week1, week2, week3, week4, week5, week6, week7, week8
# Let's encode Position and Down
officialPosition = pd.get_dummies(df.officialPosition)
down = pd.get_dummies(df.down)
# One Hot Encoding of variables with multiple categories using Pandas
passResult = pd.get_dummies(df.passResult)
offenseFormation = pd.get_dummies(df.offenseFormation)
personnelO = pd.get_dummies(df.personnelO)
personnelD = pd.get_dummies(df.personnelD)
dropBackType = pd.get_dummies(df.dropBackType)
pff_passCoverage = pd.get_dummies(df.pff_passCoverage)
pff_passCoverageType = pd.get_dummies(df.pff_passCoverageType)
pff_role = pd.get_dummies(df.pff_role)
pff_blockType = pd.get_dummies(df.pff_blockType)
foulName1 = pd.get_dummies(df.foulName1)
pff_positionLinedUp = pd.get_dummies(df.pff_positionLinedUp)
# Join all the codes to main dataframe
df = pd.concat([df, down, officialPosition, pff_role, pff_blockType, passResult, dropBackType, offenseFormation, personnelO, personnelD, pff_passCoverage, pff_passCoverageType, foulName1, pff_positionLinedUp], axis=1)
# Save Work
df.to_csv('NFL_2023_Databowl_Data_with_codes.csv')
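One caveat with to_csv: by default it also writes the index as an unnamed first column, which comes back as 'Unnamed: 0' when the file is reloaded. Passing index=False avoids that. A small self-contained check (using a temp file, not the real NFL data):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"gameId": [1, 2], "playId": [10, 20]})

path = os.path.join(tempfile.mkdtemp(), "out.csv")
df.to_csv(path, index=False)   # skip the index column
df2 = pd.read_csv(path)
print(list(df2.columns))       # ['gameId', 'playId'] -- no 'Unnamed: 0'
```

cuDF's to_csv takes the same index argument, so the tip carries over to the GPU version unchanged.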