September 14, 2015
As many of you may know, Cassandra Summit is fast approaching! In preparation, the team at Cask decided to integrate Cassandra and CDAP.
Apache Cassandra is an open-source behemoth of a project and one of the most popular databases in the world. It distributes data around a token ring: each host, or node, is assigned a portion of that ring, called a token range. As a NoSQL database, it stores information in column families and supports so-called “wide rows.” It excels at scalability, though write operations, and especially read operations, can be costly because Cassandra relies on “eventual consistency.”
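If you have not seen a wide row before, here is a minimal sketch using the DataStax Java driver and a hypothetical demo.events table: every entry sharing a partition key hashes to the same token, so a single partition can accumulate many clustering-ordered rows on the nodes that own that token range.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class WideRowExample {
    public static void main(String[] args) {
        // Connect to a local node; a real cluster would list several contact points.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            // sensor_id is the partition key; event_time is a clustering column.
            // All readings for one sensor land in one partition: a "wide row".
            session.execute("CREATE TABLE IF NOT EXISTS demo.events ("
                    + "sensor_id text, event_time timestamp, reading double, "
                    + "PRIMARY KEY (sensor_id, event_time))");
        }
    }
}
```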
Since Apache Cassandra is such a popular database, we decided to integrate it as a batch source and sink, and as a realtime sink, for ETL pipelines in the Cask Data Application Platform (CDAP).
Apache Cassandra originally used Thrift for client communication. Recently, however, the project has shifted toward the Cassandra Query Language, or CQL. As a result of this change, we decided to use the newer cql3 package of Hadoop input and output formats for the integration, aiming to use CqlBulkOutputFormat to minimize the overhead of write operations.
I quickly discovered, however, that the hadoop/cql3 package included a bug that made it incompatible with Hadoop 2. As a workaround, I tried the older Thrift-based output formats, namely BulkOutputFormat, even though they had been deprecated. But since Thrift clients cannot recognize tables created through CQL, I quickly eliminated this output format as a viable solution.
More research revealed that one version, Apache Cassandra 2.1.0, did not include the bug that made the hadoop/cql3 package incompatible. Unfortunately, this version did not include CqlBulkOutputFormat, so I used CqlOutputFormat instead, ultimately prioritizing compatibility with CQL over minimizing strain on the cluster. Luckily, this version worked with few additional pitfalls. I similarly used CqlInputFormat to integrate Cassandra as a batch source.
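To make the moving parts concrete, here is a minimal sketch of how these formats are wired into a plain Hadoop 2 job, assuming the Cassandra 2.1.0 hadoop/cql3 API and the hypothetical demo.events table from above (the CDAP plugin performs the equivalent configuration for you):

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.cassandra.hadoop.cql3.CqlOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobSetup {

    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "cassandra-cql3-example");
        Configuration jobConf = job.getConfiguration();

        // --- Batch source: scan CQL rows out of demo.events ---
        ConfigHelper.setInputInitialAddress(jobConf, "127.0.0.1");
        ConfigHelper.setInputPartitioner(jobConf, "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(jobConf, "demo", "events");
        // How many CQL rows to fetch per page while scanning a token range.
        CqlConfigHelper.setInputCQLPageRowSize(jobConf, "100");
        job.setInputFormatClass(CqlInputFormat.class);

        // --- Batch sink: write reducer output back through CQL ---
        ConfigHelper.setOutputInitialAddress(jobConf, "127.0.0.1");
        ConfigHelper.setOutputPartitioner(jobConf, "Murmur3Partitioner");
        ConfigHelper.setOutputColumnFamily(jobConf, "demo", "events");
        // CqlOutputFormat binds each reduce value to the '?' placeholders of this
        // UPDATE fragment; the primary-key columns come from the reduce key, a
        // Map<String, ByteBuffer> of column name to value.
        CqlConfigHelper.setOutputCql(jobConf, "UPDATE demo.events SET reading = ?");
        job.setOutputFormatClass(CqlOutputFormat.class);

        return job;
    }
}
```

The trade-off described above is visible here: CqlOutputFormat issues individual CQL UPDATE statements per row, whereas CqlBulkOutputFormat would stream SSTables directly to the cluster, so we pay some extra write overhead in exchange for CQL compatibility.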
For you, as a user of Apache Cassandra, this means that you can now read from or write to Cassandra through CDAP. Along the way, I discovered some limitations of these versions of CqlInputFormat and CqlOutputFormat that may help you use the integration more efficiently. Most notably, when Apache Cassandra is used as a source, at least one mapper is created for every token in the token ring. You can set the num_tokens property in the cassandra.yaml configuration file to a number of tokens more appropriate for the amount of data you anticipate; note that this property can only be changed before data is written to Cassandra. Other quirks are noted in the documentation for using Cassandra with CDAP.
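As an illustration, Cassandra 2.1 defaults to 256 virtual nodes per host, which translates into hundreds of mappers per node when scanning; on a small test cluster you might lower that before loading any data. The value below is purely illustrative:

```yaml
# cassandra.yaml (set before the node joins the ring and data is written)
num_tokens: 16
```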
To install the Apache Cassandra plugin, you must use the cdap-plugins repository. More information about installing the plugins in that repository can be found in the README file for cdap-plugins.
Hopefully, this new feature will allow Apache Cassandra users to see the power of CDAP: you can use any of CDAP’s existing transforms (or your own custom transforms) to shape data after reading it from, or before writing it to, Cassandra. For CDAP users, this feature gives you the opportunity to explore the power and scalability of Apache Cassandra in your CDAP applications.
If you are interested in learning more about CDAP, check out cdap.io or reach out to the CDAP User forum.
The team from Cask will be at Cassandra Summit next week (September 22 to September 24). Come visit us at Booth 120 to see what we are up to with Cassandra. And be sure to check out our CEO, Jonathan Gray, giving a talk on CDAP: Application Development for Cassandra and Hadoop.