Run Analytic Queries on Cassandra with the Power of GPUs for ML & AI
Author: Alexander Cai
Most organizations keep high-speed transactional data in fast NoSQL databases like Apache Cassandra®. With its elastic scalability and peer-to-peer architecture, Cassandra is an ideal store for this kind of workload.
But how do you extract insights from this data using analytics? Traditionally, to obtain insights from Cassandra data, you’d use massively parallel processing analytics systems like Apache Spark that run on central processing units (CPUs).
However, today’s analytics ecosystem is quickly embracing artificial intelligence (AI) and machine learning (ML) techniques whose computation relies heavily on graphics processing units (GPUs). So where does that leave us?
In this post, we explore a cutting-edge approach for processing data stored in Cassandra by parsing it directly into GPU device memory using three main technologies:

- sstable-to-arrow, a tool that reads Cassandra's on-disk SSTable files directly
- Apache Arrow, a language-independent columnar in-memory data format
- RAPIDS, NVIDIA's suite of GPU-accelerated data science libraries
This approach helps you reach insights faster with minimal setup, and it makes it easy to migrate existing analytics code written in Python. Thanks to these three technologies, you can now analyze your Cassandra data on a GPU, faster and with far less friction, for machine learning and artificial intelligence use cases.
What is Apache Cassandra?
Cassandra is a distributed, open-source NoSQL database. Because it can handle terabytes or even petabytes of mission-critical data with zero downtime, global companies like Netflix, Uber, and Spotify use Cassandra.
Cassandra distributes any amount of data and replicates it across multiple nodes, making it extremely reliable and fault-tolerant. If you’re building large applications that require massive volumes of data, your data will always be available and easily accessible on Cassandra.
Cassandra 4.0, released in 2021, is by far the most stable and extensively tested major release of Cassandra to date. Thanks to various innovations and enhancements, Cassandra 4.0 delivers improved scaling of operations, significant performance and security improvements, and reduced costs.
Although Cassandra is famous for its performance, it can take some time and effort to set up initially. DataStax, one of the leading Cassandra experts, addresses this head-on with DataStax Astra DB, a serverless Cassandra-as-a-service.
Astra DB simplifies cloud-native Cassandra application development and reduces deployment time from weeks to minutes. You can now focus on building and deploying cloud-native applications without needing to install, operate, and scale Cassandra. It’s also completely free up to 80 GB storage and 20 million operations monthly.
Simply create a free Astra DB account or log in with your GitHub or Google accounts to deploy a Cassandra NoSQL database in minutes.
Analyzing data on Cassandra
Cassandra is great for transactional queries that read one or a few rows really, really quickly. But if you want to run analytics, statistics, or machine learning across your whole dataset, Cassandra isn't natively built for that.
The traditional answer is Apache Spark, which distributes your computation across multiple nodes so you can scale and analyze your data efficiently. However, Spark takes some effort to set up, and you'd need to rewrite existing analytics code that wasn't written for Spark.
Alternatively, you could use pandas, NumPy, and scikit-learn, given that Python is the de facto standard for machine learning nowadays. But can you run these existing analytics workflows using Python libraries on Cassandra data?
One approach is to fetch the data from your Cassandra table using a driver, and then run analytics on that data. But if you have a large dataset, querying the entire dataset is computationally expensive. Plus, there’s a risk of slowing down your transactional operations when you query the Cassandra cluster, leading to bad user experiences.
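To make that trade-off concrete, here is a minimal sketch of the driver-based approach. The keyspace, table, and column names are hypothetical; the DataStax Python driver calls are shown commented out, with a stand-in for the rows they would return:

```python
# Sketch of the driver-based approach. The keyspace, table, and column names
# are hypothetical; the driver calls are commented out and replaced with a
# stand-in for the rows they would return.
import pandas as pd

# from cassandra.cluster import Cluster
# session = Cluster(["127.0.0.1"]).connect("store")
# rows = session.execute("SELECT user_id, amount FROM orders")  # scans the whole table!

# Stand-in for the rows a driver query would return:
rows = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 1, "amount": 5.0},
    {"user_id": 2, "amount": 20.0},
]

df = pd.DataFrame(rows)
totals = df.groupby("user_id")["amount"].sum()  # per-user spend: {1: 15.0, 2: 20.0}
```

The analytics part is trivial; the problem is the commented-out query, which forces Cassandra to scan the whole table while it is also serving live traffic.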
Can we extract data from Cassandra without impacting our Cassandra cluster? Let’s find out.
Under the hood: how Cassandra stores data
When you write to Cassandra, data first lands in in-memory tables called memtables. To persist that data for long-term storage, Cassandra flushes it to disk in the form of Sorted String Tables, or SSTables.
An SSTable is a persistent file format that takes the in-memory data stored in memtables, orders it for fast access, and stores it on disk in a persistent, ordered, immutable set of files.
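The memtable-to-SSTable flow can be illustrated with a toy sketch in pure Python. This is not Cassandra's actual on-disk format, just the core idea: writes accumulate in an in-memory map, then get flushed to an immutable, key-sorted run that supports fast binary-search lookups.

```python
# Toy illustration of the memtable -> SSTable flow (not Cassandra's real
# on-disk format): writes land in an in-memory map, then get flushed to an
# immutable, key-sorted run of (key, value) pairs.
import bisect

memtable = {}                      # in-memory writes
memtable["user:42"] = "alice"
memtable["user:7"] = "bob"

# "Flush": persist as an ordered, immutable run.
sstable = sorted(memtable.items())

def sstable_get(table, key):
    """Binary-search the sorted run, the way an SSTable index allows."""
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None

print(sstable_get(sstable, "user:7"))   # bob
```

Because the flushed run is sorted and immutable, reads never need to scan the whole file, and Cassandra can merge runs in the background without rewriting data in place.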
Cassandra parses the SSTables to retrieve the data and then sends it to you, whether you query with CQL or a Cassandra driver. But this puts load on your Cassandra database, which isn't something we want.
Is it possible to fetch data directly from the SSTables without going through Cassandra? I spent most of the summer working on this, and it's not so easy. Here are a few diagrams of what these table files look like internally:
After reading through the documentation and the code, I figured out how to parse everything out. Although this is difficult, it’s possible.
The solution: sstable-to-arrow
sstable-to-arrow fetches data directly from the SSTables without burdening your Cassandra database. The big-picture goal of the tool is to enable GPU-accelerated analytic queries on Cassandra data, letting you do more analysis with the data and opening the path to future developments. sstable-to-arrow is completely open source, and you can read the source code on GitHub.
sstable-to-arrow reads SSTables directly without going through Cassandra and transforms the data into an Apache Arrow format. With sstable-to-arrow, you can easily query for the data you need and easily migrate existing Python analytics code using RAPIDS.
Why should you use RAPIDS?
RAPIDS is a suite of open-source libraries for data analytics that enables end-to-end data science pipelines to run entirely on a GPU. It's built on CUDA, NVIDIA's parallel-computing platform and programming model for writing GPU code and making the most of the hardware's power.
RAPIDS takes common AI and ML APIs like pandas DataFrames and scikit-learn and makes GPU-accelerated equivalents available (see Figure 2). An example is cuDF, the RAPIDS counterpart to pandas: while pandas works on the CPU, cuDF performs many of the same operations on the GPU.
RAPIDS can make migrating your existing code really, really easy if that code is already written with pandas or scikit-learn, both of which have equivalents in RAPIDS.
GPUs for machine learning
GPUs can run data analytics and machine learning training much faster than current CPUs can. Data science, and particularly machine learning, requires a lot of parallel computation, and a GPU has far more cores than a CPU and can run orders of magnitude more operations in parallel, which is exactly what machine learning and artificial intelligence workloads need.
For use cases like data transformation and data wrangling especially, the larger the dataset, the bigger the speedup you'll see from a GPU library like cuDF in RAPIDS. For machine learning, GPU libraries like cuML can give you better performance and faster results at lower cost than scikit-learn or Spark MLlib.
RAPIDS uses the Apache Arrow in-memory format to move data on and off GPU-accessible memory (device memory) without serializing and deserializing it.
What is Apache Arrow?
Apache Arrow is a column-oriented in-memory data format that delivers efficient, fast data interchange with the flexibility to support complex data models. Arrow serves two main purposes: efficient data analytics and a standardized cross-language in-memory representation.
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.
It also comes with an inter-process communication (IPC) mechanism used to transfer an Arrow record batch (i.e. a table) between processes. The IPC format is identical to the in-memory format, eliminating any extra copying or deserialization costs while providing extremely fast data access.
The data science community has grown enormously over the last decade, producing many different tools for data analysis. CPU-based tools have their shortcomings, and Arrow tries to address at least some of them. APIs and applications built on top of Arrow can also avoid pain points that users face, such as costly data interchange between tools or running out of memory on large datasets.
If you want to take advantage of GPUs for data analytics, you can get started with RAPIDS and Arrow for free, with a gentle learning curve. The same steps you'd perform on a pandas DataFrame can be done on a cuDF DataFrame.
How does it all come together?
sstable-to-arrow reads the data from SSTables and transforms it into the Arrow format so it can be used on the GPU. Once the data is in Arrow format, the RAPIDS ecosystem does all the work of getting it onto the GPU for you.
In Figure 4, if you wanted to run the pandas code on a GPU, all you'd need to do is swap out pandas for cuDF in the import statement and use the rest of the API in the same way.
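Here is a minimal sketch of that swap, with made-up column values. This version runs on the CPU with pandas; on a machine with RAPIDS installed, you would only change the import, as the comment shows:

```python
# The pandas -> cuDF migration is typically a one-line import swap. This
# runs on the CPU with pandas; with RAPIDS installed, change the import to
# `import cudf as pd` and the rest of the code stays the same.
import pandas as pd  # with RAPIDS: import cudf as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "petal_length": [1.4, 1.3, 5.1],
})

# Same groupby/mean call in both libraries.
means = df.groupby("species")["petal_length"].mean()
print(means)
```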
If you want to automate this process, run it under a cloud-based orchestration system, or run it on a GPU in the cloud, all you need to change is that one line, with no other logic adjustments. Let's get started with these tools!
Hands-on: Getting started with sstable-to-arrow
In this hands-on exercise, you'll use sstable-to-arrow on your own Cassandra data. If you don't have Cassandra SSTables handy, you can use the sample data bundled in our GitHub repository. In brief, you point sstable-to-arrow at a directory of SSTable files, let it parse them and serve the data in Arrow format, and then read that data from a Python client into a pandas or cuDF DataFrame.
Using sstable-to-arrow with RAPIDS
By now, you may have tried sstable-to-arrow on your own machine. But if you don't have a CUDA-compatible machine, here's a brief demo of how sstable-to-arrow can help get data onto the GPU using RAPIDS.
If you've tinkered with machine learning, you might be familiar with the Iris dataset. It contains four features (the length and width of sepals and petals) measured on 50 samples from each of three species of Iris flowers, 150 samples in all. These measurements were originally used to build a linear discriminant model to classify the species.
Scikit-learn, a CPU-based Python machine learning library, already includes a couple of examples of simple classification with the Iris dataset. We'll use sstable-to-arrow to show how you might take the data from a Cassandra table and load it onto the GPU using RAPIDS. See this process in action in this YouTube tutorial.
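For reference, a CPU-only scikit-learn classifier on Iris looks like the sketch below. cuML mirrors scikit-learn's estimator API (for example, cuml.neighbors.KNeighborsClassifier), so the GPU version of this pattern is largely an import swap:

```python
# Classify the Iris dataset with scikit-learn on the CPU. cuML mirrors this
# estimator API, so the RAPIDS version is largely an import swap
# (e.g. cuml.neighbors.KNeighborsClassifier).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```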
In the end, the data points in the Iris dataset are classified into the three species, shown as blue, yellow, and brown clusters, all on the GPU.
In practice, you'll have multiple SSTables that correspond to a single Cassandra table. Unfortunately, sstable-to-arrow doesn't yet deduplicate rows across SSTables, but you can do it manually with SQL on the client side. BlazingSQL, a Python library that runs SQL queries on the GPU, is a useful tool for this. We demonstrate this process in this YouTube tutorial.
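If you'd rather stay in a dataframe API instead of SQL, client-side deduplication can be sketched as follows: keep the most recent write per primary key, using per-cell write timestamps. The column names and values here are hypothetical:

```python
# Client-side deduplication sketch: when several SSTables hold versions of
# the same row, keep the most recent write per primary key. Column names
# and timestamps are hypothetical.
import pandas as pd

rows = pd.DataFrame({
    "key":      [1, 2, 1],
    "value":    ["old", "only", "new"],
    "write_ts": [100, 120, 200],   # per-cell write timestamps
})

latest = (
    rows.sort_values("write_ts")          # oldest first...
        .drop_duplicates("key", keep="last")  # ...so "last" keeps the newest
        .set_index("key")
)
print(latest.loc[1, "value"])  # new
```

The same sort-then-keep-latest logic is what a SQL `ROW_NUMBER()`-style dedup query would express on the GPU side.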
To sum up, sstable-to-arrow is a data analytics tool that brings the RAPIDS ecosystem and Cassandra ecosystem together. With sstable-to-arrow, you can analyze your data from Cassandra on a GPU without putting an extra load on your Cassandra clusters. To learn more about sstable-to-arrow, check out this blog post and this GitHub repo.
If you'd like to find more tutorials on Cassandra, check out our YouTube channel and Cassandra certification courses on DataStax Academy. Tag us on Twitter if you have any questions, or join the DataStax Community, the Stack Overflow for Cassandra.
Don’t forget to follow the DataStax Tech Blog to get notified about more developer stories like this.
Resources
- Apache Cassandra®
- Astra DB
- sstable-to-arrow GitHub
- Apache Arrow
- Apache Spark
- DataStax Enterprise
- YouTube tutorial: Analyzing Cassandra Data with GPUs
- Analyzing Cassandra Data using GPUs, Part 1
- Analyzing Cassandra Data using GPUs, Part 2
- DataStax YouTube Channel
- Apache Cassandra certification courses
- DataStax Academy