Multitenancy for Hadoop: Namespaces — Part II

Published in cdapio, Apr 19, 2019

June 18, 2015

Bhooshan Mogal is a Software Engineer at Cask, where he is working on making data application development fun and simple. Before Cask, he worked on a unified storage abstraction for Hadoop at Pivotal and personalization systems at Yahoo.

We introduced the concept of namespaces and how they help bring multitenancy to Apache Hadoop in a previous blog post. We also briefly introduced the use of namespaces in CDAP, leaving out the implementation details. In this post, we’ll discuss some of the requirements that influenced the design of namespaces in CDAP and cover selected implementation details.

The main aim of namespaces in CDAP is to provide application and data isolation. The same application or dataset should be able to exist independently in multiple namespaces. This has several advantages, such as providing quality of service and security guarantees for different workloads, users, or teams. Some of these are common use cases when building solutions on the Hadoop ecosystem; for example, isolating the workloads of multiple customers in their data pipelines, or isolating the workloads of different units within an organization.

Some CDAP-specific requirements for namespace functionality include:

A namespace must have an identifying name, a description, as well as a set of properties. These properties will be accessible for a deployed application. They will also be used in the future to define namespace characteristics such as quotas.
Namespaces should not impose overhead on users who do not need multitenancy. For example, a user could have a single-tenant environment that does not require isolation, or could be testing a CDAP application in a CI environment, where verifying the correctness of the application matters more than its behavior in isolation. For such cases, CDAP should provide a default namespace.
Namespaces should propagate to storage providers, whereby data in a namespace would be isolated from other namespaces.
Existing CDAP installations should be able to upgrade to a newer version containing namespaces in an automated manner, with reasonable downtime.
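
To make the first two requirements concrete, here is a minimal Python sketch of what a namespace record and a default-namespace fallback could look like. The class name, field names, and the literal "default" are assumptions made for illustration; they are not taken from the CDAP code base.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed identifier for the built-in namespace; illustrative only.
DEFAULT_NAMESPACE = "default"


@dataclass(frozen=True)
class NamespaceMeta:
    """Hypothetical namespace record: a name, a description, and properties."""
    name: str
    description: str = ""
    # Free-form properties; could later carry settings such as quotas.
    properties: Dict[str, str] = field(default_factory=dict)


def resolve_namespace(requested: Optional[str] = None) -> str:
    """Fall back to the default namespace when the caller does not name one."""
    return requested if requested else DEFAULT_NAMESPACE
```

With a fallback like this, a single-tenant user never has to mention a namespace at all, while a multitenant user passes one explicitly.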

Key Design Decisions

First, namespaces must live within a single CDAP instance, rather than requiring one CDAP instance per namespace. This is achieved by partitioning a single CDAP instance’s metadata. In this way, namespaces provide logical (and not physical) isolation.

Next, a namespace becomes a property of an application at deploy and run time. In other words, application development should be namespace-agnostic. While developing the application, users should not have to be aware of the namespace that the app will be deployed in. The namespace only becomes relevant at deployment time and afterward. As a result, users should be able to deploy the exact same application to different namespaces and have it exist independently in each.
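
One way to picture this decision is a registry keyed by the pair (namespace, application name), so the application itself carries no namespace. The sketch below is an assumption-laden illustration in Python; `AppRegistry`, the application name, and the jar names are all hypothetical.

```python
from typing import Dict, Tuple


class AppRegistry:
    """Hypothetical deployment registry: the namespace is supplied at deploy time."""

    def __init__(self) -> None:
        # Key is (namespace, app name), so the same app name can exist in many namespaces.
        self._apps: Dict[Tuple[str, str], str] = {}

    def deploy(self, namespace: str, app_name: str, artifact: str) -> None:
        self._apps[(namespace, app_name)] = artifact

    def lookup(self, namespace: str, app_name: str) -> str:
        return self._apps[(namespace, app_name)]


registry = AppRegistry()
# The same application is deployed twice, independently, in two namespaces.
registry.deploy("team_a", "PurchaseApp", "purchase-1.0.jar")
registry.deploy("team_b", "PurchaseApp", "purchase-2.0.jar")
```

Because the namespace is only part of the key and not part of the application, nothing in the application code has to change between the two deployments.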

In the first iteration, CDAP system services are not namespaced. The same set of system services such as the transaction service, dataset service, Kafka, etc., serve all namespaces. For metrics and logging, we use the same Kafka topics for all namespaces. In the future, however, there may be a requirement to isolate some of these services.

Namespaces are not hierarchical. A namespace cannot ‘contain’ another namespace. Another key design decision was to reuse namespace implementations in storage providers wherever they were available.

Implementation Details

To achieve logical isolation, the metadata of all entities in CDAP contains an extra property: a namespace. When storing metadata in HBase, this meant prefixing the row key of each record with a namespace. This prefix can serve as a primary index for fast lookups using prefix scans.
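
The row-key prefixing idea can be sketched as follows. This is an in-memory Python illustration that uses a sorted list in place of an HBase table; the separator byte and the key layout (namespace, entity type, entity ID) are assumptions, not CDAP's actual key format.

```python
from bisect import bisect_left
from typing import List, Tuple

SEP = b"\x00"  # assumed separator between key parts


def row_key(namespace: str, entity_type: str, entity_id: str) -> bytes:
    """Build a row key whose leading component is the namespace."""
    return SEP.join(p.encode("utf-8") for p in (namespace, entity_type, entity_id))


def prefix_scan(rows: List[Tuple[bytes, str]], namespace: str) -> List[str]:
    """Return values of all rows in one namespace; rows must be sorted by key."""
    prefix = namespace.encode("utf-8") + SEP
    # Jump to the first key >= prefix, then read until the prefix stops matching.
    start = bisect_left(rows, (prefix,))
    values = []
    for key, value in rows[start:]:
        if not key.startswith(prefix):
            break
        values.append(value)
    return values


rows = sorted([
    (row_key("default", "app", "Purchase"), "default/Purchase"),
    (row_key("team_a", "app", "Purchase"), "team_a/Purchase"),
    (row_key("team_a", "dataset", "history"), "team_a/history"),
    (row_key("team_ab", "app", "X"), "team_ab/X"),
])
```

Including the separator in the scan prefix matters: without it, a scan for "team_a" would also pick up rows that belong to "team_ab".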

To achieve data isolation, a CDAP namespace is propagated down to storage providers. On HDFS, each namespace has its own directory at the root level. This directory is set up when the namespace is created. Application as well as dataset jars are deployed in subdirectories of their namespace. This gives users the freedom to deploy different versions of the same application or dataset in different namespaces. As far as data is concerned, FileSet datasets are stored in subdirectories of their namespaces. Application logs are also collected in subdirectories of their namespaces.
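
As a rough illustration of this layout, the Python sketch below builds per-namespace paths. The root directory "/cdap" and the subdirectory names "apps", "data", and "logs" are hypothetical placeholders chosen for the example; the post does not spell out the real directory names.

```python
from pathlib import PurePosixPath

# Hypothetical root under which each namespace gets its own directory.
CDAP_ROOT = PurePosixPath("/cdap")


def namespace_dir(namespace: str) -> PurePosixPath:
    """Directory set up when the namespace is created."""
    return CDAP_ROOT / namespace


def app_jar_path(namespace: str, jar_name: str) -> PurePosixPath:
    # Jars live under the namespace, so each namespace can hold its own version.
    return namespace_dir(namespace) / "apps" / jar_name


def fileset_path(namespace: str, dataset: str) -> PurePosixPath:
    return namespace_dir(namespace) / "data" / dataset


def log_path(namespace: str, app_name: str) -> PurePosixPath:
    return namespace_dir(namespace) / "logs" / app_name
```

Since every path starts with the namespace directory, two namespaces can hold a jar or dataset with the same name without colliding.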

Apache HBase added support for namespaces in v0.96. However, in its first release with namespaces (v2.8), CDAP supported HBase 0.94 as well. As a result, two different namespacing techniques were used: one that used table name prefixes (for HBase 0.94) and another that delegated to HBase namespaces (for HBase 0.96 and above). As CDAP already contained compatibility modules (the hbase-compat-* modules) for different HBase versions, we simply added a version-specific implementation of namespaces in each compatibility module.
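
The two techniques can be sketched as a pair of naming functions selected by HBase version. This Python sketch is only an illustration: the "." separator used for the prefix scheme and the function names are assumptions, while "namespace:table" is the qualified table name form that HBase itself uses from 0.96 onward.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.94.27'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def prefixed_table_name(namespace: str, table: str) -> str:
    # HBase 0.94 has no namespaces, so the namespace is folded into the table name.
    # The '.' separator is an assumption made for this sketch.
    return f"{namespace}.{table}"


def namespaced_table_name(namespace: str, table: str) -> str:
    # HBase 0.96+ qualifies tables as 'namespace:table'.
    return f"{namespace}:{table}"


def table_name(hbase_version: str, namespace: str, table: str) -> str:
    """Pick the naming scheme the way a version-specific module would."""
    if parse_version(hbase_version) >= (0, 96):
        return namespaced_table_name(namespace, table)
    return prefixed_table_name(namespace, table)
```

Keeping the choice behind one function mirrors the idea of a per-version compatibility module: callers ask for a table in a namespace and never see which scheme is in effect.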

Apache Hive has a concept of databases which can be viewed as analogous to namespaces; hence, CDAP namespaces translate one-to-one to databases in Hive.
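
A minimal Python sketch of this one-to-one mapping is shown below. It assumes, purely for illustration, that the Hive database is named exactly after the CDAP namespace; the actual naming rule is not given in this post. The generated statement is ordinary HiveQL.

```python
def hive_database(namespace: str) -> str:
    """Assumed mapping: the Hive database carries the namespace's own name."""
    return namespace


def create_database_ddl(namespace: str) -> str:
    # CREATE DATABASE IF NOT EXISTS is standard HiveQL; backticks quote the identifier.
    return f"CREATE DATABASE IF NOT EXISTS `{hive_database(namespace)}`"


def qualified_table(namespace: str, table: str) -> str:
    """Refer to a table inside the database that backs the namespace."""
    return f"`{hive_database(namespace)}`.`{table}`"
```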

For all CDAP entities that are namespaced, the REST APIs have a namespace ID component, almost always as the first component in the API path (the metrics APIs take the namespace as a query parameter instead). This fits in nicely with the RESTful hierarchical paradigm, since namespaces can be viewed as containing the apps and data in them.
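
The shape of such paths can be sketched in Python as follows. The "/v3" version prefix, the "/v3/metrics/query" path, and the query parameter names are assumptions made for this example and should be checked against the CDAP REST documentation before use.

```python
from urllib.parse import quote, urlencode


def entity_path(namespace: str, *segments: str) -> str:
    """Build a path with the namespace ID as the leading component."""
    parts = ["v3", "namespaces", namespace, *segments]
    # Percent-encode each segment so unusual IDs cannot break the path.
    return "/" + "/".join(quote(p, safe="") for p in parts)


def metrics_query(namespace: str, metric: str) -> str:
    """Metrics-style request: the namespace travels as a query parameter."""
    # The parameter names here are hypothetical.
    return "/v3/metrics/query?" + urlencode({"namespace": namespace, "metric": metric})
```

For example, `entity_path("team_a", "apps", "PurchaseApp")` yields `/v3/namespaces/team_a/apps/PurchaseApp`, which reads naturally as "the app PurchaseApp inside the namespace team_a".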

In the CDAP CLI, users select a namespace once and, from then on, all operations execute in the selected namespace. This way, users do not have to remember to specify a namespace with every command, which is especially useful for use cases that are satisfied within a single namespace. The CDAP CLI exposes a ‘use namespace <namespace-id>’ command for selecting a namespace.
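
The behavior described here amounts to keeping a "current namespace" in the CLI session. The Python sketch below illustrates that idea only; `CliSession`, its output format, and the "list apps" command used in the test are placeholders, and only the ‘use namespace <namespace-id>’ command comes from the text above.

```python
class CliSession:
    """Hypothetical CLI session that remembers the selected namespace."""

    def __init__(self, default_namespace: str = "default") -> None:
        self.current = default_namespace

    def execute(self, line: str) -> str:
        tokens = line.split()
        # 'use namespace <namespace-id>' switches the session's namespace.
        if len(tokens) == 3 and tokens[:2] == ["use", "namespace"]:
            self.current = tokens[2]
            return f"now using namespace {self.current}"
        # Every other command runs implicitly in the current namespace.
        return f"[{self.current}] {line}"


session = CliSession()
```

A command issued before any ‘use namespace’ runs in the default namespace; after the switch, the same command text runs in the newly selected one.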

The interaction with the CDAP UI is similar to the CDAP CLI. Users select a namespace using a namespace drop-down menu, and all operations occur within the selected namespace.

Namespaces changed the way that both data and metadata are stored in CDAP. This presented a major challenge when upgrading an older version of CDAP to a newer version with namespaces. Rewriting existing data is an expensive process that can delay the upgrade/migration and cause significant downtime. Also, storage providers such as HBase currently do not support an easy way to rename tables.

To ease the upgrade process, we made certain changes in a backward-compatible manner: for example, we chose to store and read from the HBase default namespace in the exact same manner that we did in earlier versions of CDAP that did not have namespaces. Hence, the newer CDAP version could read older data as-is, without requiring rewriting/copying. Of course, there were certain exceptions to this rule, for which we provided migration tools. So, in most cases, only existing metadata had to be rewritten into a new format to work with namespaces, which reduced upgrade time significantly. The CDAP upgrade tool was enhanced to make this possible. At the crux, it reads existing CDAP metadata and stores it in a newer format, as part of the default namespace.
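
The core of the metadata rewrite can be sketched as follows. This Python illustration assumes that old metadata is keyed by entity ID alone and that the new format is keyed by (namespace, entity ID); both formats and the function name are hypothetical stand-ins for what the upgrade tool actually does.

```python
from typing import Dict, Tuple


def upgrade_metadata(
    old_records: Dict[str, dict],
    default_namespace: str = "default",
) -> Dict[Tuple[str, str], dict]:
    """Rewrite pre-namespace metadata so it belongs to the default namespace."""
    upgraded: Dict[Tuple[str, str], dict] = {}
    for entity_id, meta in old_records.items():
        new_meta = dict(meta)  # copy, so the old record is left untouched
        new_meta["namespace"] = default_namespace
        upgraded[(default_namespace, entity_id)] = new_meta
    return upgraded


old = {"PurchaseApp": {"type": "app"}, "history": {"type": "dataset"}}
new = upgrade_metadata(old)
```

Only the small metadata records are rewritten here; the data they describe stays where it is, which is what keeps the upgrade time short.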

Upgrading an existing system to one with namespaces later taught us that you should account for such features with a wide impact early in the architecture. Fortunately, as multitenancy was considered in CDAP from an early stage, there were some provisions for it in the design, making the eventual implementation somewhat easier.

We hope you enjoyed this two-part blog on namespaces. Stay tuned to blog.cask.co for more interesting blogs covering a wide range of topics in distributed data processing systems. Feel free to reach out to us for any questions and consider helping us to develop the platform.

Written by cdapio