Time Moves Forward — RAPIDS Hits Its Three-Year Milestone
October 15, 2018, is a day that some of us on the RAPIDS team, and some team alumni, will remember for the rest of our lives. That was the day RAPIDS went live at GTC Munich. The vision for RAPIDS grew out of two facts: the PyData ecosystem was becoming the lingua franca of data science, and then-recent developments in neural network methods, whose models quickly grew to billions of parameters, demonstrated the potential for GPUs to deliver enormous speedups on data-centric problems. RAPIDS was built by data scientists, for data scientists.
Release 21.10 continues to work toward the vision of making GPU speed available to everyone, whether you are quickly experimenting with engineering features for a model, doing ETL for the world’s most data-intensive enterprises, solving NLP problems, or offering a revolutionary end-to-end suite of tools for doing cyber security data science.
Let’s get into the RAPIDS core library updates.
RAPIDS cuDF (DataFrames)
For everyone who has to work with time formats, cuDF adds a host of features to make your life easier no matter what speed you're going. These include Series.dt.is_leap_year and days_in_month, support for groupby.rolling variance and standard deviation, groupby first and last aggregations, nulls in the time series generator, and Series.ceil() on datetime series.
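cuDF deliberately mirrors the pandas API, so a quick sketch using pandas shows the same calls; on a machine with a GPU and cuDF installed, swapping the import for cudf runs the equivalent GPU code (the sample data here is made up, and cuDF's datetime ceil is spelled slightly differently from the pandas dt.ceil shown below).

```python
import pandas as pd  # cuDF follows this API; `import cudf` provides the GPU version

ts = pd.Series(pd.to_datetime(["2020-02-29 10:17", "2021-03-01 23:59"]))
print(ts.dt.is_leap_year.tolist())   # [True, False] -- 2020 is a leap year
print(ts.dt.days_in_month.tolist())  # [29, 31]
print(ts.dt.ceil("D"))               # round each timestamp up to the next day

df = pd.DataFrame({"key": ["a", "a", "a", "b", "b"],
                   "val": [1.0, 2.0, 4.0, 3.0, 3.0]})
# per-group rolling variance and standard deviation
print(df.groupby("key")["val"].rolling(2).var())
print(df.groupby("key")["val"].rolling(2).std())
# per-group first and last aggregations
print(df.groupby("key")["val"].agg(["first", "last"]))
```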
RAPIDS cuML (Machine Learning)
The value of time, your time as a data scientist, has always been core to the mission of the cuML team. This release includes several speed-ups to existing methods. GLM is faster thanks to an improved eigendecomposition algorithm. Random Forest gains a Poisson impurity criterion and has been refactored to be faster than ever. Exact Nearest Neighbors is faster via a two-dimensional Random Ball Cover algorithm. This release also graduates the hierarchical DBSCAN (HDBSCAN) implementation from experimental to fully supported status. Check out the associated HDBSCAN blog for more details.
Beyond improvements to existing algorithms, there are new features and methods. ARIMA now supports missing observations and padding; Random Forest adds vector leaf prediction; support for categorical variables has been added to the Forest Inference Library (FIL); cuML rounds out its Naive Bayes capabilities with the addition of Categorical Naive Bayes; and there are three new distance metrics: Kullback–Leibler divergence, Jensen–Shannon divergence, and the Russell–Rao coefficient.
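cuML evaluates these metrics on the GPU; their CPU counterparts in SciPy make the definitions concrete (this is a sketch of the math, not the cuML API, and the probability vectors are made-up examples). Note that SciPy's jensenshannon returns the Jensen–Shannon distance, the square root of the divergence.

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon, russellrao

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = entropy(p, q)             # Kullback-Leibler divergence D(p || q)
js = jensenshannon(p, q) ** 2  # square to recover the JS divergence from the distance

# Russell-Rao operates on boolean vectors: (n - number of shared Trues) / n
u = np.array([True, True, False, False])
v = np.array([True, False, True, False])
rr = russellrao(u, v)
print(kl, js, rr)
```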
RAPIDS cuGraph (Graph Analytics)
The cuGraph team is releasing pylibcugraph, a Python library that lets cuGraph serve as a backend to other Python libraries (e.g., CuPy). The Sorensen coefficient is now available to users, and the team has continued to improve memory use and performance, along with general code clean-up.
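For a pair of vertices, the Sorensen coefficient compares their neighbor sets: twice the size of the overlap divided by the sum of the two set sizes. A plain-Python sketch of that definition follows (the adjacency list is hypothetical example data; cuGraph's actual API computes this on the GPU over many vertex pairs at once).

```python
def sorensen(neighbors_u, neighbors_v):
    """Sorensen coefficient: 2 * |N(u) & N(v)| / (|N(u)| + |N(v)|)."""
    u, v = set(neighbors_u), set(neighbors_v)
    if not u and not v:
        return 0.0
    return 2.0 * len(u & v) / (len(u) + len(v))

# Adjacency list for a small example graph (hypothetical data)
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
# neighbors of 0 are {1,2,3}, neighbors of 2 are {0,1,3}; overlap is {1,3}
print(sorensen(adj[0], adj[2]))  # 2*2 / (3+3) = 0.666...
```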
RAPIDS Node.js (Visualization)
We’re excited to have Allan Enemark presenting “GPU Accelerating Node.js with the Node-RAPIDS Data Science Framework” at NodeConf in mid-October. He and his team have continued to make strides in bringing the speed of GPUs to the Node community since the initial release in July. If you use Node, or know others who do, spread the word about this amazing new tool.
A lot of data science and statistics is about making a good guess about what will happen in the future. As we look forward to celebrating the four-year anniversary of RAPIDS this time next year, our best guess is that the tools and software will be more comprehensive and faster. You can join us on this journey! Request features or (even better!) contribute your thinking and code on GitHub. Follow us on Twitter. Check out the RAPIDSFire podcast. I’ll talk to you in December. Until then, keep moving forward.