Expanding CDAP Horizons

Terence Yim
Jun 17 · 4 min read

Since the inception of CDAP, the goal has consistently been to be an open platform for building and operating data applications. As an open platform, it has been our mission to support multiple environments, be it a laptop, an on-premises data center, or the cloud. We strongly believe that providing a consistent user experience and consistent APIs across all of these environments helps our users innovate faster.

CDAP started with a sandbox SDK that runs on a laptop, and a distributed CDAP that runs on Apache Hadoop, with the guarantee that they provide consistent behavior and user experience for all CDAP applications across varied environments. Support for compute profiles and cloud runtimes has enabled users to select the environment to run their applications in, be it an on-premises Hadoop cluster or the cloud. With this capability, user workloads are no longer restricted to running in the same Hadoop cluster as CDAP itself. This greatly increases the flexibility, stability, and isolation of the runtime environment for all applications.

CDAP in Kubernetes

Starting with CDAP 6.0, we added the ability to run CDAP in Kubernetes. This capability allows CDAP to execute entirely inside a Kubernetes cluster. We chose Kubernetes because it is the most widely used resource orchestration system in the cloud. With CDAP supporting both Hadoop YARN and Kubernetes, we can provide the same set of features and the same experience to our users across a broad set of environments.

Figure 1. CDAP services in Kubernetes

To support CDAP in Kubernetes, we introduced a new open source project, the CDAP Operator, to provide an easy deployment solution. By using the CDAP Operator, new instances of CDAP can be easily deployed and managed through custom resources defined by a Custom Resource Definition in Kubernetes. The CDAP controller constantly observes changes to CDAP resources and reacts to them. When a new CDAP resource is deployed to the Kubernetes cluster, the CDAP controller deploys a combination of Deployments, StatefulSets, and Services to the cluster to start all the necessary CDAP services. In addition to the CDAP Operator, CDAP itself has been enhanced to support running system applications inside the Kubernetes cluster to provide more interactive experiences in the user interface.
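The observe-and-react behavior described above is the standard Kubernetes controller reconcile pattern. The sketch below illustrates the idea in Python with invented names and fields; the real CDAP Operator is written in Go against the Kubernetes API, so treat this only as a conceptual model.

```python
# Illustrative sketch of the reconcile pattern a controller follows:
# compare the desired state (the CDAP custom resource) with the child
# objects that already exist, and return what still needs to be created.
# All field names here are hypothetical.

def reconcile(desired_cdap, existing_children):
    """Decide which child Kubernetes objects (Deployments, StatefulSets,
    Services) to create for a desired CDAP custom resource."""
    wanted = []
    for svc in desired_cdap["services"]:
        # Stateful services (e.g. ones with local storage) become
        # StatefulSets; stateless ones become Deployments.
        kind = "StatefulSet" if svc.get("stateful") else "Deployment"
        wanted.append({"kind": kind, "name": svc["name"]})
        # Each service also gets a Kubernetes Service for discovery.
        wanted.append({"kind": "Service", "name": svc["name"]})
    have = {(c["kind"], c["name"]) for c in existing_children}
    return [w for w in wanted if (w["kind"], w["name"]) not in have]
```

Because the function is driven purely by the diff between desired and actual state, re-running it after the objects exist yields nothing to create, which is what makes the controller loop safe to run continuously.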

Figure 2. CDAP Operator Logical Flow

By integrating with the Kubernetes API, we also removed the hard dependency on Apache ZooKeeper, which was used for service discovery and task coordination for CDAP on Hadoop. To further simplify the deployment stack, we also enhanced the log service so that it can collect logs without a hard dependency on Kafka.
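One reason the ZooKeeper dependency could be dropped is that Kubernetes Services already give every CDAP system service a stable, predictable DNS name. The helper below shows the idea; the `cdap` name prefix, namespace, and port are assumptions for illustration, not CDAP's actual conventions.

```python
# Sketch of Kubernetes-native service discovery: instead of looking up a
# service endpoint in ZooKeeper, derive its cluster-internal DNS address
# from naming conventions. Prefix, namespace, and port are hypothetical.

def service_address(service, namespace="default", prefix="cdap", port=11015):
    """Return the cluster-internal DNS address for a CDAP system service."""
    return f"{prefix}-{service}.{namespace}.svc.cluster.local:{port}"
```
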

Storage Engine for CDAP

The capability to operate in Kubernetes allows CDAP to operate in non-Hadoop environments. However, before CDAP 6.0, it only supported Apache HBase as the storage engine for production use, which may not be easily available in all environments. We realized that we needed to broaden the list of supported storage engines to make CDAP accessible to different ecosystems. In CDAP 6.0, we decoupled CDAP storage from Apache HBase and introduced a new Storage SPI, together with implementations for LevelDB, PostgreSQL, and Apache HBase. We also added a new Metadata SPI with two implementations, one for Elasticsearch and one for Apache HBase. More storage engine implementations will become available in the future, based on the needs of the user community.
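The value of an SPI is that the rest of the platform programs against a narrow interface while backends remain interchangeable. Here is a minimal sketch of that shape in Python; the real CDAP Storage SPI is a Java interface, and the names below are invented for illustration.

```python
# Hypothetical sketch of a storage SPI: platform code depends only on the
# abstract interface, and each storage engine (LevelDB, PostgreSQL, HBase)
# supplies its own implementation. Names are illustrative, not CDAP's.

from abc import ABC, abstractmethod

class StructuredTable(ABC):
    """Narrow storage contract that platform code programs against."""

    @abstractmethod
    def write(self, key, value): ...

    @abstractmethod
    def read(self, key): ...

class InMemoryTable(StructuredTable):
    """Stands in for a LevelDB/PostgreSQL/HBase-backed implementation."""

    def __init__(self):
        self._rows = {}

    def write(self, key, value):
        self._rows[key] = value

    def read(self, key):
        # Return None for missing keys, mirroring a row lookup miss.
        return self._rows.get(key)
```

Swapping storage engines then means registering a different `StructuredTable` implementation, with no change to the code that calls `write` and `read`.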

Hybrid Cloud

The addition of Kubernetes support, together with flexible storage options, enables CDAP to provide the same user interface and a portable set of artifacts, plugins, and programmatic APIs, wherever our users operate CDAP. This helps our users build hybrid infrastructures that span on-premises clusters and different cloud providers, with a unified experience provided by CDAP. We believe this will enable a whole new set of possibilities for data infrastructure and data applications. Our users will no longer be limited to operating in a single environment; instead, they can choose the environment that fits best, based on concerns such as security, cost, and ease of operation.

Figure 3. Conceptual Mockup for Hybrid Data Pipeline

In the future, CDAP may allow our users to split a data pipeline such that part of it executes in an on-premises cluster while the rest executes in the cloud. For example, a lightweight on-premises stage could remove sensitive information before sending the data to the cloud for the heavy lifting of data processing.
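The split described above can be pictured as partitioning an ordered list of pipeline stages by where each stage should run. This is purely a conceptual sketch: the stage names and the `env` tag are made up, and CDAP does not expose this exact API today.

```python
# Conceptual sketch of splitting a hybrid pipeline by execution
# environment. Each stage carries a hypothetical "env" tag saying where
# it should run; the function groups stages into per-environment segments.

def split_pipeline(stages):
    """Partition ordered pipeline stages into on-prem and cloud segments."""
    parts = {"on-prem": [], "cloud": []}
    for stage in stages:
        parts[stage["env"]].append(stage["name"])
    return parts
```
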

Conclusion

We believe that making CDAP operable in diverse environments will help our users build modern data infrastructures and applications. Our users can expect a consistent set of features and a consistent user experience from CDAP, regardless of where it runs. The applications they build are interoperable between different environments and can cooperate across them.

Last but not least, as open source software, it is always important for us to keep growing our community. The inclusion of Kubernetes support presents a huge opportunity to further expand our developer and user base.

cdapio

CDAP is a 100% open-source framework for building data analytics applications

Written by Terence Yim
Software Engineer. Passionate about distributed systems, big data, and open source software.

