Cassandra Data Loading: 8 Tips for Loading Data into Astra DB

Want to know the easiest way to load a large amount of data into DataStax Astra DB, the Cassandra-as-a-service, quickly? In this blog, you’ll learn eight helpful tips on how to make the most of DataStax Bulk Loader, a widely used command line tool for loading and unloading data from Cassandra and Astra DB.

The most commonly asked Apache Cassandra® and DataStax Astra DB question is: What is the easiest way to load large amounts of data into Astra DB quickly? The answer is the DataStax Bulk Loader.

The DataStax Bulk Loader tool (dsbulk) is a command line tool for loading and unloading data from Cassandra and Astra DB. In this blog, we’ll expand on the documentation we provide for dsbulk with eight tips from the DataStax engineering team to help you optimize the data loading process.

If you haven’t installed dsbulk yet, you can set up the tool using the following commands:

curl -LO https://downloads.datastax.com/dsbulk/dsbulk.tar.gz

Then, unpack the downloaded distribution:

tar -xzvf dsbulk.tar.gz
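
To confirm the install, run dsbulk from the extracted directory (the version number in the path below is an example; substitute the version you actually downloaded):

dsbulk-1.10.0/bin/dsbulk --version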

To learn more about dsbulk setup, take a look at our documentation.

Tip #1: Run the DSBulk Loader on a virtual machine

While running your migration, we recommend using a virtual machine (VM) in the same region as your database to decrease latency and increase throughput (number of rows you can load per second).

DSBulk can be easily installed on a VM using the installation commands above. We strongly recommend using a virtual machine instead of running DSBulk directly on your laptop.
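
If you’re on GCP, a minimal sketch of creating such a VM with the gcloud CLI looks like this (the instance name, machine type, and zone are example values; pick a zone in the same region as your database):

gcloud compute instances create dsbulk-loader --machine-type=e2-standard-4 --zone=us-east1-b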

Tip #2: Load data directly from AWS S3 or Google Cloud Storage

For data that doesn’t fit on a single machine’s hard drive, or even just to leverage the convenience of cloud object storage, dsbulk can load large amounts of data directly from AWS S3 or Cloud Storage on Google Cloud Platform (GCP).

Load a single CSV file hosted on GCP by passing dsbulk a file URL:

dsbulk load -url https://storage.googleapis.com/bucket/filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret

Load multiple CSVs hosted on GCP by passing dsbulk a urlfile, a file that lists the URL of each CSV to load (a sample urlfile follows the command):

dsbulk load --connector.csv.urlfile https://storage.googleapis.com/bucket/files.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret
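
The urlfile itself is just a plain text file with one URL per line, for example (the bucket and file names here are placeholders):

https://storage.googleapis.com/bucket/file1.csv
https://storage.googleapis.com/bucket/file2.csv
https://storage.googleapis.com/bucket/file3.csv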

Tip #3: The DSBulk Loader works well with Astra DB

To connect to Astra DB you need a Secure Connect Bundle (SCB) and an application token. You can download the SCB and obtain your application token from the DataStax Astra DB web console.

dsbulk works with Astra DB out of the box: pass your SCB to the -b flag, your client ID to the -u flag, and your client secret to the -p flag.
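
For example, loading a local CSV into Astra DB might look like this (the file name, keyspace, and table are placeholders):

dsbulk load -url filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret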

Tip #4: Dealing with rate limits

Astra DB’s default rate limit is 4,096 ops/second. Once you’ve hit the limit, you’ll get the following message from the server: “rate limit reached”.

The message appears because Astra DB caps the throughput for free databases. If you want more throughput, upgrade to a pay-as-you-go Astra DB plan.
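
One way to stay under the cap on a free database is to throttle dsbulk just below it, using the maxPerSecond setting covered in Tip #6 below (4,000 here is an example value chosen to sit under the default limit):

dsbulk load -url filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --dsbulk.executor.maxPerSecond 4000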

Tip #5: DSBulk tool pooling options

Astra DB works better with more client connections, so set the number of connections to 16 in the Java driver when you run dsbulk. To do so, add the following flag to your dsbulk command:

--driver.advanced.connection.pool.local.size 16
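
Put together, a load command with the larger connection pool might look like this (same placeholders as in the earlier examples):

dsbulk load -url filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --driver.advanced.connection.pool.local.size 16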

Tip #6: Tuning DSBulk

Performance tuning is about understanding the bottlenecks in a system and removing them to improve performance. But what is performance? In the case of bulk loading, we optimize for throughput (as opposed to latency) because the goal is to get as much data into the system as fast as possible. This is different from a traditional Cassandra operational environment, where we might optimize for query latencies.

For a deeper dive into the relationship between latency and throughput (under concurrency) take a moment to review Little’s Law.
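
As a quick worked example of Little’s Law (throughput = concurrency ÷ latency): with 1,000 queries in flight and an average latency of 10 ms, the theoretical ceiling is 1,000 ÷ 0.010 s = 100,000 rows per second; if latency doubles to 20 ms at the same concurrency, that ceiling halves to 50,000.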

In practice, as we try to push data faster with DSBulk (the client), we may see latencies increase on Astra DB (the server). If we don’t, that’s a sign that we still have plenty of database capacity and can continue to increase the rate in DSBulk. If, on the other hand, your latencies are increasing without an increase in throughput, you may have to wait for your database to autoscale, or open a support request to get better performance.

DSBulk throughput can be controlled with a few different flags:

  1. --maxConcurrentQueries
  2. --dsbulk.executor.maxPerSecond
  3. --dsbulk.executor.maxInFlight

All three of these flags control the same thing (target client throughput). They just do so by three different means. So remember to pick only ONE. The documentation recommends tuning maxConcurrentQueries because it’s technically the most efficient. However, we find that maxPerSecond is easier for users to understand, so we recommend it for almost all scenarios.

To keep a closer eye on the client-side latencies, use the --report-rate flag. You can also watch the database-side latencies in your Astra DB Health tab.
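
As a sketch, a tuned load capped at 5,000 rows per second with progress reported every 10 seconds might look like this (5,000 is an arbitrary example rate, not a recommendation):

dsbulk load -url filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --dsbulk.executor.maxPerSecond 5000 --report-rate 10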

Tip #7: Handling errors

If your bulk load is pushing the system to its limits, you may want to configure errors and retries so that your job doesn’t just stop when it hits too many errors. Note that DSBulk logs any failed inserts in the logs directory, and you can re-process any missed queries in a subsequent run.

Set the maximum number of retries before a row is considered failed with --driver.advanced.retry-policy.max-retries, and the maximum number of failed rows before the whole job stops with --dsbulk.log.maxErrors.
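
For example, the run below retries each failed write up to three times and aborts only after 100 rows fail for good. The failed rows land in a load.bad file under the logs directory, which you can feed straight back into a second run; the timestamped directory name below is a made-up example, so check your own logs directory for the real one:

dsbulk load -url filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --dsbulk.log.maxErrors 100 --driver.advanced.retry-policy.max-retries 3

dsbulk load -url logs/LOAD_20220101-120000-000000/load.bad -k ks -t table -b ~/scb.zip -u client_id -p client_secret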

Tip #8: Onboarding engineers

Need additional help with your data load? No problem. We’ve got a team of engineers working around the clock, five days a week. Click the chat icon in the bottom right corner of the Astra portal to start a chat and get immediate help from an engineer. All you’ve got to do is let them know how much data you have and your deadline for loading it.

The Final Command

Here’s what your command might look like with all the options set:

dsbulk load -url https://storage.googleapis.com/bucket/filename.csv -k ks -t table -b ~/scb.zip -u client_id -p client_secret --driver.advanced.connection.pool.local.size 16 --dsbulk.executor.maxPerSecond 10000 --dsbulk.log.maxErrors 100 --driver.advanced.retry-policy.max-retries 3 --report-rate 10

Conclusion

Loading very large datasets into Astra DB can be a breeze if you follow the best practices in this article. We hope you find these tips helpful.

If you prefer to learn about DSBulk via video, check out this quick overview from Steven Smith.

Need additional help loading your data into Cassandra or Astra? Reach out to us at hello@datastax.com.

Follow the DataStax Tech Blog for more developer stories. Check out our YouTube channel for free tutorials and follow DataStax Developers on Twitter for the latest news in our developer community.

Resources

  1. Astra DB
  2. DataStax Bulk Loader
  3. Apache Cassandra®
  4. AWS S3
  5. Google Cloud Storage
  6. YouTube Tutorial: Offline Migration to Astra DB Using DSBulk
  7. DataStax Community Platform
  8. DataStax Academy
  9. DataStax Certifications
  10. DataStax Workshops
  11. DataStax Developers on Twitter


We’re huge believers in modern, cloud native technologies like Kubernetes; we are making Cassandra ready for millions of developers through simple APIs; and we are committed to delivering the industry’s first and only open, multi-cloud serverless database: DataStax Astra DB.
