All you need to know about Apache Kudu

Zuleny Estaba · Published in Analytics Vidhya · 5 min read · Mar 2, 2020

In this article I will describe and simplify everything you should know about Apache Kudu, assuming you already know the basic terminology and have worked with distributed systems. I will distill what I have learned from working with Kudu and from reading and researching different sources.


Why did Cloudera create Apache Kudu?

A Kudu cluster stores tables that look like the tables you are used to from relational databases (SQL).

Reasons why I consider that Kudu was created:

1. To fill the gaps that HDFS and HBase leave in streaming workloads: more and more companies need to control streaming processes (I consider pure batch processes obsolete). Unlike HBase, Kudu can scan large volumes of data in real time more efficiently.

2. Compared to HBase, Kudu still allows operations on individual records while also offering faster scans.

3. Large volumes of data remain difficult to manage, and data volumes keep growing. Hadoop and other platforms can be complex for some teams.

4. Development teams are changing, and many use cases require focusing on data generated in real time for real-time analysis.

Kudu architecture essentials

  • Apache Kudu is still growing, and I think it has not yet reached the maturity needed to exploit its full potential.
  • Kudu is integrated with Impala and Spark, which is fantastic.
  • Creating tables in Kudu is very similar to creating tables in a relational database: you must define one or more primary key (PK) columns, and the PK cannot be changed once the table is created (see the sketch after this list).
  • Unlike other databases, Apache Kudu has its own storage layer where it keeps the data. That is, the table's data cannot be browsed in HDFS, since Kudu stores it in its own file system; it can only be accessed through SQL queries. On the other hand, the table structure can be tied to an HDFS path, which mainly happens when we create external tables; if the table is deleted completely, we can go to the HDFS path defined in the ‘create table’ and retrieve the creation structure.
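As a minimal sketch (assuming the Impala integration and made-up table and column names), a managed Kudu table with its mandatory primary key, and an external table that only maps an existing Kudu table, could look like this:

    -- Managed Kudu table: the primary key columns must be declared up front
    -- (and listed first); the PK cannot be altered after creation.
    CREATE TABLE sensor_readings (
      sensor_id BIGINT,
      reading_time TIMESTAMP,
      metric STRING,
      value DOUBLE,
      PRIMARY KEY (sensor_id, reading_time)
    )
    PARTITION BY HASH (sensor_id) PARTITIONS 4
    STORED AS KUDU;

    -- External table: Impala only stores the mapping; dropping it leaves the
    -- underlying Kudu table and its data untouched. The underlying name may be
    -- prefixed (e.g. impala::db.table) depending on the setup.
    CREATE EXTERNAL TABLE sensor_readings_ext
    STORED AS KUDU
    TBLPROPERTIES ('kudu.table_name' = 'sensor_readings');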

Partitioning

As for partitioning, Kudu is a bit complex at this point and it can become a real headache. The choice of partitioning type will always depend on how our table will be queried and loaded. Defining the right partitioning will save us many problems in the future, mainly when data volume grows and read and write latencies start to be affected.

So how do I choose partitioning?

Kudu has two types of partitioning: range partitioning and hash partitioning.

Partitioning by range:

Range partitioning gives good access for time-based queries over the intervals defined in the table. To partition by range, we must define in our table the time periods or intervals that each range will contain. What happens if we insert records into a range that does not exist in our table? Those records will not be inserted; the writes are rejected.
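A minimal sketch of a range-partitioned table in Impala DDL (table name, columns and date ranges are illustrative; here the time period is modeled as a string date column):

    -- Range partitioning on a date column: each partition covers one month.
    CREATE TABLE metrics_by_range (
      host STRING,
      metric STRING,
      event_day STRING,        -- e.g. '2020-01-15'
      value DOUBLE,
      PRIMARY KEY (host, metric, event_day)
    )
    PARTITION BY RANGE (event_day) (
      PARTITION '2020-01-01' <= VALUES < '2020-02-01',
      PARTITION '2020-02-01' <= VALUES < '2020-03-01'
    )
    STORED AS KUDU;
    -- An insert with event_day = '2020-05-15' falls outside every defined
    -- range and is rejected.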

Partitioning by range. Source: https://kudu.apache.org/docs/kudu_impala_integration.html

It is important to note that range partitioning can distribute writes well: for a given period, inserted data is spread across all the servers, each one holding a subset of sensors that report roughly the same amount of metrics. If one range contains more sensors than another, the partitions become unbalanced and performance is affected. With hash partitioning this problem would not occur.

Partitioning by hash:

Hash partitioning is done on the PK columns defined for the table. Kudu assigns each record to one of a fixed number of buckets by applying a hash function to the values of the specified columns, so there will be no imbalance between partitions.
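A minimal sketch of hash partitioning (illustrative names; note that the number of buckets is fixed when the table is created):

    -- Hash partitioning on the primary key: rows are spread across 8 buckets,
    -- so writes do not concentrate on a single tablet.
    CREATE TABLE metrics_by_hash (
      sensor_id BIGINT,
      metric STRING,
      value DOUBLE,
      PRIMARY KEY (sensor_id, metric)
    )
    PARTITION BY HASH (sensor_id, metric) PARTITIONS 8
    STORED AS KUDU;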

I recommend, as a good practice in tables that require it, maintaining both types of partitioning (hash and range). Combining the two gives good write performance (because writes are hashed across multiple partitions) and good read performance (large scans can be executed in parallel across multiple servers).
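A hedged sketch of this combination, again with illustrative names, where hash buckets spread the writes and range partitions keep time-based scans parallel and prunable:

    -- Hash + range: each row lands in a (hash bucket, date range) tablet.
    CREATE TABLE metrics_hash_and_range (
      sensor_id BIGINT,
      event_day STRING,        -- e.g. '2020-01-15'
      metric STRING,
      value DOUBLE,
      PRIMARY KEY (sensor_id, event_day, metric)
    )
    PARTITION BY HASH (sensor_id) PARTITIONS 8,
    RANGE (event_day) (
      PARTITION '2020-01-01' <= VALUES < '2020-04-01',
      PARTITION '2020-04-01' <= VALUES < '2020-07-01'
    )
    STORED AS KUDU;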

The performance of a table can be good with 8 hash buckets, while the same table with 15 partitions can perform worse, and with 24 partitions it can show better write and read times again. Tuning partitioning is confusing and, as I said before, it will always depend on each case.

Partitioning by hash. Source: https://kudu.apache.org/docs/kudu_impala_integration.html
Partitioning by range and hash. Source: https://kudu.apache.org/docs/kudu_impala_integration.html

Can we partition dynamically?

Dynamic partitioning does not exist in Kudu, which seems to me a major limitation, since in a production environment data volume will always grow, and the tables are limited to 10 GB of data storage. Therefore, in a data-growth scenario, data from these tables will need to be migrated, which may take time (there are different migration alternatives).
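That said, range partitions can at least be added or dropped by hand as the data grows. A sketch, assuming the hypothetical range-partitioned table defined earlier and the Impala ALTER TABLE syntax for Kudu:

    -- Add a new range before data for it starts arriving.
    ALTER TABLE metrics_hash_and_range
      ADD RANGE PARTITION '2020-07-01' <= VALUES < '2020-10-01';

    -- Drop an old range; note that this also deletes the data stored in it.
    ALTER TABLE metrics_hash_and_range
      DROP RANGE PARTITION '2020-01-01' <= VALUES < '2020-04-01';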

Finally, I leave you with a series of considerations to keep in mind when working with Kudu:

  • Char, Varchar, Date and Array types are not allowed in Kudu.
  • A Kudu table cannot have more than 300 columns.
  • Column names must not exceed 256 characters and must be valid UTF-8 strings.
  • An individual column value (cell) can only store up to 64 KB.
  • If a new node is added to the cluster, Kudu will not automatically rebalance existing tablets onto it; you must manually move tablet replicas, or drop and re-create the tables.
  • No support for Sqoop
  • Kudu administrators can enable Hive support for data migration.
  • Queries must be made through the Impala engine, so there is an absolute dependence on it.

I hope you liked the article and that it was helpful.

Please leave me your opinion.
