Fast Data Pipelining with String UDFs

Brandon B. Miller
RAPIDS AI
Nov 28, 2022
A graph of quantum entanglement from string theory in physics is shown as a sly joke about strings.
Strings can lead to significant entanglements wherever you find them. © BBC 2021.

Ever wish you could apply your own custom functions to string data columns in cuDF? Well, now you can! cuDF recently introduced support for string data within user defined functions through the Series and DataFrame apply APIs. This means that you can write logic that works on a single string in Python, and cuDF will map it across a whole column (or columns) of data, subject to a few constraints. This added flexibility should let you prototype things faster and, in some cases, even reuse existing Python logic that you already have handy.

User Defined Functions

As a reminder, a user defined function (UDF) can be used in cuDF to apply a custom transformation to one or more input columns, producing a single output column. For example, let’s imagine you have a series containing temperatures in Celsius, and you want to convert them to Fahrenheit:

# Create a series with some temperatures, which might be null
import cudf
celsius = cudf.Series([0, 100, cudf.NA])

# Functions destined for Series.apply behave like a lambda with one return value,
# i.e. one element in, one element out
def ctof(temp):
    # null case
    if temp is cudf.NA:
        return cudf.NA
    else:
        return 1.8 * temp + 32

fahrenheit = celsius.apply(ctof)
print(fahrenheit)
# 0 32.0
# 1 212.0
# 2 <NA>
# dtype: float64

To be clear, the same result can be obtained with the following code:

import cudf
celsius = cudf.Series([0, 100, cudf.NA])
fahrenheit = celsius * 1.8 + 32

print(fahrenheit)
# 0 32.0
# 1 212.0
# 2 <NA>
# dtype: float64

However, the approach that uses a standard Python function to encapsulate the logic may be advantageous in several scenarios:

  1. You happen to already have the function ctof written, tested, and available where you work
  2. You are in a situation where it’s easier to express a transformation in terms of a single row of data
  3. You have some complicated logic that can benefit from being fused into a single CUDA kernel for performance reasons

The machinery that makes this work is built on top of a Python package called Numba, which can translate a certain subset of Python code into GPU code. cuDF uses this functionality in a compilation pipeline to parallelize the execution of a UDF across cuDF data structures, like Series and DataFrame. However, UDFs that involve Python strings are not included in the subset of Python that Numba natively knows how to translate. This means that if you have a Series or DataFrame containing string data and try to apply a UDF to it, you’ll get a nasty error. Recently, however, this has started to change.

String UDFs in cuDF

Beginning in 22.10, cuDF provides experimental support for string data within UDFs through the optional strings_udf conda package. cuDF works the same with or without the package. When it is installed, however, certain string operations become available inside UDFs. Inside a UDF, a single data element of a string column may be treated like a standard Python string. For example, its len may be taken:

import cudf
sr = cudf.Series(["a", "bcd", "efgh"])

def string_udf(st):
    return len(st)

result = sr.apply(string_udf)
print(result)
# 0 1
# 1 3
# 2 4
# dtype: int32

Initially, only string operations and methods that return non-string data are supported within the body of the UDF. For instance, len is allowed, because it maps a string to an integer. Another example is startswith, which maps two strings to a boolean value. upper is an example of functionality that is not yet provided, because it maps a string to a new string. Operations on strings that return a new string are in development.
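To make the distinction concrete, here is a minimal plain-Python sketch that sorts a few str operations by whether they return a new string. It runs on the CPU and only inspects Python return types; it does not query cuDF for what is actually supported. The names operations and returns_string are hypothetical and used only for illustration.

```python
# Plain-Python illustration (no GPU needed): classify str operations by return type.
# This does NOT ask cuDF what it supports; it only checks what Python returns.
sample = "cuDF strings"

# Hypothetical table of operations mentioned in this post
operations = {
    "len": lambda s: len(s),
    "startswith": lambda s: s.startswith("cu"),
    "count": lambda s: s.count("s"),
    "contains": lambda s: "DF" in s,
    "upper": lambda s: s.upper(),
}

def returns_string(op, value):
    """Return True if applying op to value yields a str."""
    return isinstance(op(value), str)

non_string_ops = sorted(
    name for name, op in operations.items() if not returns_string(op, sample)
)
string_ops = sorted(
    name for name, op in operations.items() if returns_string(op, sample)
)

print(non_string_ops)
print(string_ops)
```

Operations in the first group map a string to an integer or boolean, which is the category described above as initially supported; those in the second group produce a new string.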

A complete list of supported operations can be found in the user guide to UDFs. Our plan for the future is to support the full spectrum of functions and methods associated with Python’s str class. The goal is that, from a user’s perspective, strings in rows where a UDF is applied behave exactly as Python strings would, and that the same methods are supported. There are exceptions, such as printing, that are not available due to the nature of the execution environment. As you start working with string UDFs, attempting to use unsupported features will usually result in a missing attribute error. It’s also worth noting that external/importable libraries such as the regex library are not usable within UDFs.

Use It Today!

String UDFs are great for creating complex filters or categorizations of string data. For example, here’s a quick and easy to write function for rudimentary sentiment analysis of reviews of deep dish pizza restaurants in Chicago:

import cudf
df = cudf.DataFrame({
    "restaurant": [
        "lou malnatis",
        "lou malnatis",
        "pequods",
        "giordanos",
        "giordanos"
    ],
    "review": [
        "great! delicious! amazing!",
        "cheese was good, sauce was disappointing",
        "the very best in chicago",
        "amazing, I would eat this every day forever",
        "loved it, hated the price"
    ]
})

def pizza_good_or_bad(row):
    review = row["review"]

    # Filter invalid reviews
    if review is cudf.NA or len(review) < 10:
        return -1

    positives = (
        review.count("good") +
        review.count("great") +
        review.count("delicious") +
        review.count("best") +
        review.count("love") +
        review.count("amazing")
    )
    negatives = (
        review.count("terrible") +
        review.count("bad") +
        review.count("disappointing") +
        review.count("hate")
    )

    result = positives - negatives

    # exclamations!
    if "!" in review:
        return result * 1.5
    else:
        return result

df['sentiment'] = df.apply(pizza_good_or_bad, axis=1)
# Average only the numeric sentiment column; the review text cannot be averaged
result = df.groupby('restaurant')[['sentiment']].mean()
print(result)
# sentiment
# restaurant
# pequods 1.00
# lou malnatis 2.25
# giordanos 0.50

As we can see from the above, Lou Malnatis has the best pizza in Chicago.
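Because the body of a UDF like this is ordinary Python, the scoring logic can be sanity-checked on the CPU before handing it to cuDF. Below is a minimal sketch under a few assumptions: a plain dict stands in for a row, None stands in for cudf.NA, and score_review, POSITIVE_WORDS, and NEGATIVE_WORDS are hypothetical names. The condensed loop over word tuples is a CPU-side convenience and is not claimed to compile inside a cuDF UDF.

```python
# CPU-only sketch: None stands in for cudf.NA and a dict stands in for a row.
POSITIVE_WORDS = ("good", "great", "delicious", "best", "love", "amazing")
NEGATIVE_WORDS = ("terrible", "bad", "disappointing", "hate")

def score_review(row):
    review = row["review"]

    # Filter invalid reviews: nulls and very short strings
    if review is None or len(review) < 10:
        return -1

    positives = sum(review.count(word) for word in POSITIVE_WORDS)
    negatives = sum(review.count(word) for word in NEGATIVE_WORDS)
    result = positives - negatives

    # Exclamation marks amplify the score
    return result * 1.5 if "!" in review else result

rows = [
    {"review": "great! delicious! amazing!"},
    {"review": "loved it, hated the price"},
    {"review": "meh"},
    {"review": None},
]
scores = [score_review(row) for row in rows]
print(scores)
```

Checking a handful of rows this way, including a null and a too-short review, makes it easier to trust the per-row logic before it is mapped across a whole column.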

Kernel Compilation Considerations

One advantage of the above approach is that Numba compiles the entire operation into a single kernel launch. Replicating the result with a sequence of DataFrame operations would require several kernel launches and would materialize intermediate results along the way. Building the kernel does carry an initial compilation overhead, but with the UDF approach it is only incurred on the first launch: cuDF caches the compiled kernel, so subsequent launches of the same string UDF in the same Python session should be much faster. This leaves room to experiment with both approaches when developing pipelines, and there are likely scenarios where a UDF performs well relative to the equivalent DataFrame approach, even outside the fast-prototyping context where UDFs often come up.

Null Data

One key difference between cuDF’s results and those of pandas is worth highlighting: nulls do not halt the execution of a function. In pandas, if the data contains nulls and one attempts to execute len on a null, an error results. In cuDF, the null propagates and execution continues. The same is true for any string function that involves nulls: if the input is null, the output is null. For instance, NA.startswith(other) == NA in cuDF, whereas pandas raises an error that prevents the result from being computed.
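The propagation rule can be sketched in plain Python. The snippet below is a CPU-only illustration of the semantics described above, not cuDF code: None stands in for cudf.NA, and masked_len and masked_startswith are hypothetical helpers that return a null whenever an input is null.

```python
# CPU-only sketch of null propagation; None stands in for cudf.NA.
NA = None

def masked_len(value):
    """Length of a string, or NA if the input is NA."""
    return NA if value is NA else len(value)

def masked_startswith(value, prefix):
    """startswith that yields NA if either operand is NA."""
    if value is NA or prefix is NA:
        return NA
    return value.startswith(prefix)

data = ["abc", NA, "xyz"]
lengths = [masked_len(v) for v in data]
flags = [masked_startswith(v, "a") for v in data]

# For contrast, calling plain len on the null raises instead of propagating
try:
    [len(v) for v in data]
    plain_len_failed = False
except TypeError:
    plain_len_failed = True

print(lengths, flags, plain_len_failed)
```

The masked helpers keep processing the remaining rows and leave a null where the input was null, while the unguarded version stops at the first null with a TypeError.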

Limitations

Some things to keep in mind that were mentioned above, but bear repeating here:

  • Functions that return strings are not usable yet inside any part of the UDF
  • A separate package is required to be installed to use this feature
  • Only the standard library of methods and functions provided by python is supported, with a few exceptions

Conclusion

Support for string data inside UDFs through apply is an ongoing effort inside cuDF, and an initial set of features is now available. These features should make it possible both to rapidly prototype workflows when string data is present and to make better use of existing functions and business logic, which may be directly reusable by cuDF without modifying any code. The resulting data should match the exact result that would have been obtained by sequentially applying the UDF across the rows of the input data in pandas.

We hope you try this out today and give us feedback on GitHub. Follow us on Twitter, @RAPIDSai, for the latest updates in the world of RAPIDS.
