Announcing RAPIDS 22.10
Serving cake, pip, and Windows at our fourth birthday party!
It’s time to celebrate another year of RAPIDS accelerating data science and driving advances everywhere. A once small team building a passion project has evolved into a community that spans the globe. And with this release, we have some exciting announcements that make it easier to use RAPIDS no matter where you’re working!
As we enter our fifth year we expand our effort to accelerate data science for everyone, starting with some massive updates that make RAPIDS available to even more developers!
Making RAPIDS more accessible
RAPIDS pip packages
We heard the community requests and we are pleased to announce that RAPIDS now has experimental packages available through pip, Python’s most popular package manager! This makes it easier than ever to install RAPIDS into your environment. For more info on RAPIDS and pip, check out our dedicated pip webpage.
RAPIDS runs on Windows via WSL!
With 58% of Python developers using Windows (JetBrains), supporting this community has been a goal of ours. With this release, we are excited to announce that RAPIDS is supported on Windows 11 through Windows Subsystem for Linux 2! For more details on how to get started, check out the instructions.
As well as making RAPIDS more accessible, we continue to expand the functionality and performance of the RAPIDS libraries! Check out some key updates from the 22.10 release below!
IO and Serialization Enhancements across the board
Reading and writing data and models just got better!
Experimental: New nested JSON reader.
The cuDF read_json function now accepts JSON data with arbitrary nesting as input. This new experimental functionality allows you to read in more complex JSON files, retaining the important structural information, all while leveraging the performance of GPUs. Here are a couple of examples of it in action:
import cudf

# Nested lists and structs in a single JSON array
json_str = '''[{"list": [0,1,2], "struct": {"k":"v1"}},
{"list": [3,4,5], "struct": {"k":"v2"}}]'''
cudf.read_json(json_str, engine='cudf_experimental')
        list       struct
0  [0, 1, 2]  {'k': 'v1'}
1  [3, 4, 5]  {'k': 'v2'}

# JSON Lines input: one record per line
json_str = '{"a": [{"k1": "v1"}]}\n{"a": [{"k2":"v2"}]}'
cudf.read_json(json_str, engine='cudf_experimental', lines=True)
                            a
0  [{'k1': 'v1', 'k2': None}]
1  [{'k1': None, 'k2': 'v2'}]
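In the second result, every struct in column a ends up with the same set of keys, and keys missing from a record are filled with None. The CPU-only sketch below imitates that output shape with Python's json module. It is an illustration only, not how cuDF implements the reader, and union_struct_keys is a hypothetical helper name.

```python
import json

def union_struct_keys(json_lines, column):
    """Parse JSON Lines text and pad every struct in `column` to a shared key set."""
    rows = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
    # Collect struct keys in first-seen order across all records.
    keys = []
    for row in rows:
        for struct in row[column]:
            for key in struct:
                if key not in keys:
                    keys.append(key)
    # Rebuild each struct with every key present, filling gaps with None.
    return [[{k: s.get(k) for k in keys} for s in row[column]] for row in rows]

json_str = '{"a": [{"k1": "v1"}]}\n{"a": [{"k2":"v2"}]}'
padded = union_struct_keys(json_str, "a")
```

Here padded holds one list of structs per input line, each carrying both k1 and k2.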
Zstandard compression support
When working with Parquet or ORC data, you can read and write files with cuDF using Zstandard compression by setting the compression='ZSTD' flag (you’ll need to set the NVCOMP_POLICY to ALWAYS):
df.to_parquet('f.pq', compression='ZSTD')
Reading text with cuDF is now 40 times faster!
Over the last few releases cuDF has worked to speed up the existing read_text function. You should see speedups across the board with this function, with the biggest wins on single-character delimiters.
Treelite updated to v3.0
We’ve updated Treelite support in cuML to version 3.0. This means you can save any gradient boosted decision tree or random forest model to Treelite and ensure that it will be readable in any future Treelite 3.x version.
Nearest Neighbor Algorithm enhancements
Out-of-sample prediction and soft clustering in HDBSCAN
cuML’s HDBSCAN implementation now includes the approximate_predict() method, which predicts clusters for unseen data points, and the all_points_membership_vectors method, which carries out soft clustering. Soft clustering provides probabilities that a data point belongs to one, many, or no clusters at all. Both new features showed 100x+ speedups on medium-sized datasets. Stay tuned for more details in upcoming blogs!
Updates to the RAFT library for computational primitives
This release also comes with a new state-of-the-art implementation of the approximate nearest neighbors algorithm IVF-PQ. This implementation won the recent Big-ANN benchmarks competition and we are in the process of integrating it into the FAISS library, which will also enable its integration with popular libraries like Milvus.
With the 22.10 release, the RAFT team has started rolling out a cleaner, easier-to-use, and more stable public API based on the C++ mdspan (multi-dimensional span), a C++ analog to NumPy’s ndarray in Python. Most RAFT dense APIs have been converted to accept mdspan.
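For readers who have not met mdspan before, the core idea is a non-owning, multi-dimensional view over a flat buffer. The Python sketch below shows that idea for the 2-D, row-major case. RowMajorSpan2D is a hypothetical name made up for this illustration; it is not part of RAFT or of the C++ mdspan interface.

```python
from array import array

class RowMajorSpan2D:
    """Non-owning 2-D view over a flat buffer, using row-major layout."""

    def __init__(self, buffer, rows, cols):
        if len(buffer) < rows * cols:
            raise ValueError("buffer too small for requested extents")
        self._buf = buffer            # the view never copies the data
        self.extents = (rows, cols)

    def _offset(self, i, j):
        rows, cols = self.extents
        if not (0 <= i < rows and 0 <= j < cols):
            raise IndexError("index out of range")
        return i * cols + j           # map (i, j) to a flat position

    def __getitem__(self, idx):
        return self._buf[self._offset(*idx)]

    def __setitem__(self, idx, value):
        self._buf[self._offset(*idx)] = value

flat = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
view = RowMajorSpan2D(flat, rows=2, cols=3)
view[1, 0] = 40.0   # writes through to the underlying buffer
```

Because the view only stores a reference plus extents, passing it around is cheap, and any write is visible to every other holder of the same buffer.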
New distance metrics for DBSCAN, UMAP & t-SNE
DBSCAN now supports cosine distance and the t-SNE and UMAP algorithms support a much expanded set of distance metrics, including (but not limited to) cosine, correlation, Manhattan, and Hellinger.
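As a quick reference, here are textbook definitions of the four metrics named above, written in plain Python. This is a sketch for intuition only: the function names are ours, and the GPU implementations may differ in details such as normalization or input handling.

```python
import math

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    """One minus the cosine of the angle between x and y (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def correlation(x, y):
    """Cosine distance between the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)
```

For example, manhattan([1, 2], [4, 6]) sums the differences 3 and 4, and hellinger reaches its maximum of 1 when the two distributions share no support.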
Hello to String UDFs!
String inputs for User Defined Functions
For those times where you can’t express your desired DataFrame transformation easily in terms of columnar operations, User Defined Functions (UDFs) are there for you! You could already use UDFs on a range of numeric inputs, but with cuDF 22.10 we are excited to announce experimental support for string inputs for UDFs. For now, you’ll have to install the strings_udf package like so:
conda install -c rapidsai strings_udf -y
With the package installed, you can use strings in your UDF input. Take a look at the example below as a starting point.
import cudf

df = cudf.DataFrame({'word': ['apple', 'banana', 'carrot'],
                     'letter': ['a', 'n', 'e']})
df.apply(lambda x: x['word'].count(x['letter']))
0    1
1    2
2    0
dtype: int32
So whether you’re running complex NLP pipelines or just analytics on text data, you’ll be able to define your own UDFs that map from strings and return numeric types.
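Since the body of a string UDF is ordinary Python, one convenient habit is to write it as a named function and check it on the CPU first. The sketch below does that with plain dicts standing in for DataFrame rows; count_letter is a hypothetical name, and the final comment describes an assumed usage rather than something run here.

```python
def count_letter(row):
    """Count how often row['letter'] occurs in row['word']."""
    return row['word'].count(row['letter'])

# Plain dicts that mimic the rows of the DataFrame in the example above.
rows = [
    {'word': 'apple', 'letter': 'a'},
    {'word': 'banana', 'letter': 'n'},
    {'word': 'carrot', 'letter': 'e'},
]
cpu_counts = [count_letter(r) for r in rows]

# Assumption: with cuDF and strings_udf installed, the same function can be
# passed to df.apply(count_letter, axis=1) in place of the lambda.
```

The CPU result matches the counts in the GPU output shown earlier.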
Multi GPU Data Processing Updates
Dask-SQL now on GPUs
Dask-SQL on GPUs is now generally available. With its ability to run on both GPU and CPU with no need for modification, you can now develop Dask-SQL workflows locally, then seamlessly deploy to GPUs when you need that extra acceleration.
1.7x faster DGL RGCN now available based on cuGraph
The cuGraph team has worked on an improved version of the relational graph convolutional network (RGCN) model, released by DGL. This version gives a 1.7x speed-up in end-to-end training.
New multi-GPU algorithms in cuGraph
The cuGraph team continues to expand the set of algorithms supported on multi-GPU architectures. The 22.10 release includes implementations of Jaccard, Overlap, and Sørensen similarity, as well as random walks, all for unweighted graphs.
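For unweighted graphs, the three similarity measures compare the neighbor sets of two vertices. The single-machine sketch below uses the standard set-based definitions; it is meant to show what the numbers mean, not how cuGraph computes them, and the helper names are ours.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def jaccard(adj, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|"""
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def overlap(adj, u, v):
    """|N(u) & N(v)| / min(|N(u)|, |N(v)|)"""
    return len(adj[u] & adj[v]) / min(len(adj[u]), len(adj[v]))

def sorensen(adj, u, v):
    """2 * |N(u) & N(v)| / (|N(u)| + |N(v)|)"""
    return 2 * len(adj[u] & adj[v]) / (len(adj[u]) + len(adj[v]))

adj = build_adjacency([(0, 1), (0, 2), (1, 2), (1, 3)])
scores = (jaccard(adj, 0, 1), overlap(adj, 0, 1), sorensen(adj, 0, 1))
```

Vertices 0 and 1 share one neighbor (vertex 2) out of four distinct neighbors overall, which is where the three scores come from.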
General data processing updates!
New pre-processing methods for feature engineering in cuML
This cuML release includes three new pre-processing methods: you can now use PowerTransformer, QuantileTransformer, and KernelCenterer as part of your GPU-accelerated machine learning pipelines. Additionally, the TargetEncoder() preprocessor has been updated so that you can now choose between encoding with the median or the mean.
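To illustrate the difference between mean and median encoding, here is a deliberately simple target-encoding sketch in plain Python. It replaces each category with a statistic of the target values seen for that category. The function and its stat parameter are names chosen for this sketch, and it omits safeguards a production encoder would typically add, such as fold-based fitting and smoothing.

```python
from collections import defaultdict
from statistics import mean, median

def target_encode(categories, targets, stat="mean"):
    """Replace each category with the mean or median target of that category."""
    aggregate = {"mean": mean, "median": median}[stat]
    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)
    table = {category: aggregate(values) for category, values in groups.items()}
    return [table[category] for category in categories]

cats = ["a", "a", "a", "b", "b"]
y = [1, 2, 9, 4, 6]
encoded_mean = target_encode(cats, y, stat="mean")
encoded_median = target_encode(cats, y, stat="median")
```

Category a has one outlying target (9), so its mean encoding is pulled up to 4 while its median encoding stays at 2. That robustness to outliers is the usual reason to prefer the median.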
New features, more compatibility and documentation improvements in cuSpatial
The cuSpatial team has been busy. Stay tuned for a full blog post covering their extensive updates from this release in detail. Here are some highlights:
New Distance Functions:
This release introduces pairwise_point_linestring_distance, which accepts a pair of GeoSeries as input, either of which can be a multigeometry series:
import cuspatial
from shapely.geometry import LineString, Point, MultiPoint

points = cuspatial.GeoSeries([Point(0.0, 0.5)])
bounds = cuspatial.GeoSeries(
    [LineString([(-1, 0), (1, 0), (0, 1)])])
cuspatial.pairwise_point_linestring_distance(points, bounds)
0    0.353553
dtype: float64
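The 0.353553 result can be checked by hand on the CPU. The sketch below computes the distance from a point to a linestring as the minimum distance to any of its segments, using a clamped projection onto each segment. It is a plain-Python cross-check with helper names of our own, not cuSpatial's implementation.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)   # degenerate segment
    # Project p onto the segment's line, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_linestring_distance(p, coords):
    """Minimum distance from p to any segment of the linestring."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(coords, coords[1:]))

d = point_linestring_distance((0.0, 0.5), [(-1, 0), (1, 0), (0, 1)])
```

The point is 0.5 away from the first segment but only about 0.3536 from the second, so the second segment determines the result.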
This release also adds a function that accelerates computing the nearest points between pairs of points and linestrings (analogous to shapely.ops.nearest_points). It also returns index information about where the closest point lies on the linestring.
import cuspatial
from shapely.geometry import LineString, Point, MultiPoint, MultiLineString

points = cuspatial.GeoSeries([Point(5.0, 4.0)])
lines = cuspatial.GeoSeries([MultiLineString([
    [(1.0, 0.5), (2.0, 2.4), (3.8, 3.6)],
    [(6.0, 4.8), (3.7, 3.8)]])])
cuspatial.pairwise_point_linestring_nearest_points(points, lines)
   point_geometry_id  linestring_geometry_id  segment_id                 geometry
0                  0                       1           0  POINT (4.86645 4.30715)
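To show how the index columns relate to the geometry, the CPU sketch below searches every segment of every part of the multilinestring and keeps the closest projected point. It returns the part index, the segment index within that part, and the nearest point. This is a plain-Python cross-check with our own helper names, not cuSpatial's algorithm.

```python
import math

def nearest_on_segment(p, a, b):
    """Return (distance, nearest_point) from p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    t = 0.0 if seg_len_sq == 0.0 else ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))                 # clamp to the segment
    nx, ny = ax + t * dx, ay + t * dy
    return math.hypot(px - nx, py - ny), (nx, ny)

def nearest_on_multilinestring(p, parts):
    """Return (part_id, segment_id, nearest_point) for the closest segment."""
    best = None
    for part_id, coords in enumerate(parts):
        for segment_id, (a, b) in enumerate(zip(coords, coords[1:])):
            dist, point = nearest_on_segment(p, a, b)
            if best is None or dist < best[0]:
                best = (dist, part_id, segment_id, point)
    _, part_id, segment_id, point = best
    return part_id, segment_id, point

parts = [
    [(1.0, 0.5), (2.0, 2.4), (3.8, 3.6)],
    [(6.0, 4.8), (3.7, 3.8)],
]
part_id, segment_id, nearest = nearest_on_multilinestring((5.0, 4.0), parts)
```

The closest point lies on the first (and only) segment of the second part, which lines up with linestring_geometry_id 1 and segment_id 0 in the output above.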
Improved compatibility:
The cuSpatial team continues to improve compatibility of the library with other tools and frameworks. The new cuspatial.read_polygon_shapefile function enables you to read shapefiles directly to the GPU; you can call cuspatial.from_geopandas to convert data from GeoPandas and use GeoSeries.to_geopandas to convert it back; and you can now initialize a GeoDataFrame using the GeoDataFrame({'name': Series}) dictionary notation, improving compatibility with GeoPandas and cuDF.
Documentation updates:
Finally, using and understanding cuSpatial has never been easier. The Python documentation has received a major rewrite, and now includes a Python User Guide that provides examples of using all cuSpatial functions. The new Python Developer Guide provides detailed coverage of library design, how to set up a development environment, and how to contribute to cuSpatial. There is also a new C++ Developer Guide, which covers design and developer guidelines for the cuSpatial C++ and CUDA layer.
Wrapping up
The RAPIDS team is committed to bringing the power of the GPU to data science workloads across environments, use cases, and new needs as they come up. This release prioritizes expanding the community and making our existing libraries better. Thanks for joining us on our journey.
We talked about this release and more during our RAPIDS talks at GTC in September 2022. Whether you’re just starting out with RAPIDS, or you’re a seasoned community member, you’re not going to want to miss John Zedlewski’s Deep Dive into RAPIDS for accelerated Data Science and Data Engineering.
As always, we’re looking forward to hearing from you on how you are using the new capabilities in RAPIDS. You can reach us on GitHub, follow us on Twitter, and check out our documentation and getting started resources.