How to use Apache Cassandra’s Stress Tool: Using YAML files — Part 2
Author: Jaqueline Caamal & Mike Uc
Contrary to popular belief, psychiatrists and psychologists, among others, are not the only ones who have to deal with stress. With the right tools, programmers and software engineers can apply calculated amounts of “stress” to push any digital application to its limit, or even beyond it. The use of this so-called ‘stress tool’ on Apache Cassandra is explained in this article (we really think you should read it before continuing). Today, we will take the conversation to another level: using YAML files with the Apache Cassandra stress tool.
See it this way: learning the basics to use a stress tool is like teaching your application to walk…
…while doing it with YAML files would be like taking it for a run.
Stress Tests Specific to Your Application. AKA, Cassandra on Steroids
Sometimes we want to measure how a specific schema performs, to find out whether it will scale in our application. For example, when the database grows rapidly, how long can data insertion remain stable? The goal of using YAML files with the Apache Cassandra stress tool is to push your own schema to the limit, so you can see how efficient it is ‘under stress’ and decide whether to adopt it in your project.
Take Twitter, for instance, which uses Apache Cassandra for analytics; Facebook, for inbox search; or Reddit, which gets the best of it as a persistent cache. Practically any application that works with big data (banking, e-commerce, etc.) can benefit from having YAML files do their magic on Cassandra.
But, what are the YAML files?
YAML is a data-serialization language designed to be human-friendly and to work with modern programming languages such as Perl, Python, PHP, Ruby, and JavaScript. It was created to work well for common use cases such as configuration files, log files, interprocess messaging, cross-language data sharing, object persistence, and debugging of complex data structures. YAML has a consistent model to support generic tools. In this test (that is what you came to see, right?), we will use it with the Apache Cassandra stress tool.
Want to learn more about YAML? Check this out.
JSON vs YAML
Both JavaScript Object Notation (JSON), which is more commonly used in databases, and YAML aim to be human-readable data interchange formats. However, YAML can be viewed as a natural superset of JSON, offering improved human readability and a more complete information model. That makes it a good fit for describing a Cassandra model (the keyspace, table definitions, and query definitions).
While JSON is more popular and many APIs use JSON only (let’s face it, it is really intuitive), YAML demands a deeper knowledge of its structure in exchange for perfect order. Putting it into fancy words: JSON is chaotic (and you can be drawn to chaos if it delivers faster search results, as is the case), but YAML gives you the structure and control you always dreamed of for your application.
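To make the "superset" point concrete, here is a small sketch with made-up data (the item and its fields are purely illustrative). The same structure is written twice: once in YAML's usual block style and once in JSON syntax, which a YAML parser should also accept.

```yaml
# Block style: the usual way to write YAML
item:
  name: shirt
  sizes:
    - S
    - M

# Flow style: the same data written as JSON, still valid YAML
item_as_json: {"name": "shirt", "sizes": ["S", "M"]}
```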
Let’s create a YAML file!
To show you how a YAML file works with Apache Cassandra, we’ve made a YAML file containing the following essential parts:
DDL
The DDL section is the first part of the file. Here you define the keyspace and the table, each one written as a CQL CREATE statement.
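As a rough sketch of what this section can look like: the keyspace, table, and column names below are assumptions made up for a clothing catalog, not the contents of our actual file. The keys (keyspace, keyspace_definition, table, table_definition) are the ones the stress tool reads from a profile.

```yaml
# Hypothetical names: "catalog", "clothes" and the columns are illustrative only.
keyspace: catalog
keyspace_definition: |
  CREATE KEYSPACE catalog
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

table: clothes
table_definition: |
  CREATE TABLE clothes (
    brand text,
    item_id uuid,
    name text,
    stock int,
    PRIMARY KEY (brand, item_id)
  );
```

A replication factor of 1 is only reasonable for a local, single-node test; adjust it to match the cluster you are stressing.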
Column Distributions
In the columnspec section, we can describe the different distributions that we will use for each column. You can set the distributions according to the type of data or the distribution that you would see in real life.
These are the distributions we’re going to use:
- GAUSSIAN(min..max,stdvrng): A gaussian/normal distribution, where mean=(min+max)/2, and stdev is (mean-min)/stdvrng.
- UNIFORM(min..max): A uniform distribution over the range [min, max].
- FIXED(val): A fixed distribution, always returning the same value.
The size is the length of the field. For the text data type, we’ve used a uniform distribution from 5 to 10, which means the tool will generate text strings of 5 to 10 characters.
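A minimal columnspec sketch follows, assuming a hypothetical table with the columns brand (partition key), item_id (clustering column), and name. The column names and the chosen ranges are illustrative assumptions; only the 5 to 10 character text size comes from our file.

```yaml
# Hypothetical columns; the ranges are examples, not tuned values.
columnspec:
  - name: brand
    size: uniform(5..10)         # text of 5 to 10 characters
    population: uniform(1..100)  # values drawn from 100 possible seeds
  - name: item_id
    cluster: fixed(20)           # 20 clustering rows per partition
  - name: name
    size: gaussian(5..30,3)      # mean = (5+30)/2 = 17.5, stdev = (17.5-5)/3 ≈ 4.17
```

The comment on the last line works through the GAUSSIAN formula from the list above, so you can check what a given min, max, and stdvrng will produce.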
Insert
In this part, we specify how the data is inserted during the stress process. We set partitions: fixed(1), which means that each batch writes to a fixed number of partitions, in this case exactly one.
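Here is a sketch of an insert section. Only partitions: fixed(1) is taken from our file; the select and batchtype values are assumptions added to show the other settings this section usually carries.

```yaml
insert:
  partitions: fixed(1)   # each batch writes to exactly one partition
  select: fixed(1)/1     # assumed: visit all of the partition's generated rows
  batchtype: UNLOGGED    # assumed: batch type used for the generated statements
```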
DML
Here you can place any CQL query you need, each one under its own name. Keep in mind that these queries are executed with the values previously inserted.
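A sketch of a queries section is shown below. The query name singleclothes is the one we run later; the CQL text, table name, and column names are assumptions for a hypothetical clothes table and will differ from your schema.

```yaml
queries:
  singleclothes:
    # Hypothetical table and columns; each ? is bound to a generated value.
    cql: SELECT * FROM clothes WHERE brand = ? AND item_id = ? LIMIT 1
    fields: samerow   # take all bound values from the same generated row
```

The name under queries is what you pass to ops() on the command line.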
Running the stress test
So far, we’ve given some high-performance sneakers to our application. Now, we are going to run the commands below to start the test.
- Inserts:
$ tools/bin/cassandra-stress user profile=tools/catalog-stress.yaml ops\(insert=1\)
Given only these parameters, the stress tool will execute the inserts starting with 4 threads and keep increasing the thread count until throughput stops improving. You will see something like this:
- Queries:
$ tools/bin/cassandra-stress user profile=tools/catalog-stress.yaml ops\(singleclothes=1\)
In our query, we select a single clothing item (our test is simulating a clothing store) and the output will be similar to the insert command.
As in the previous article, the complete explanation of each result can be found in the following link: Cassandra-stress data description.
Conclusion
Managing data in a perfectly structured way, without the messy part, can make a world of difference in your business. Imagine having to find Wally (the one from the children’s puzzle books) every time you want to find specific data in your application. Must be tiring, right? That’s why we’ve run this test using YAML files with the Cassandra stress tool, and presented it with a clothing catalog scenario as a quick ‘how to’ so you can manage them.
Along the way, there are other types of metrics you can estimate (for queries as well as inserts), and you have the option to run different types of queries with, for example, a separate set of values. Everything, perfectly organized.
Remember: running a stress test on Apache Cassandra with YAML files lets you push your schema to the limit, so you can make sure it will run smoothly in your application. Let Cassandra do the dirty job, so you don’t have to.
If you have questions or suggestions, please leave a comment below.