CDAP 3.3.0 is out — check out what’s new!

Published in cdapio, Apr 23, 2019

January 21, 2016

Ali Anwar is a software engineer at Cask, where he is building on the Cask Data Application Platform (CDAP). Prior to Cask, Ali attained his undergraduate degree in Computer Science from the University of California, Berkeley.

I am very excited to announce the release of version 3.3.0 of the Cask Data Application Platform (CDAP). This release brings new functionality and improvements to CDAP Metadata and Cask Hydrator, and improves the overall installation experience. It also adds support for CDH 5.5.

CDAP Metadata Improvements

CDAP allows users to annotate CDAP components with rich metadata. We introduced basic support for metadata and lineage in previous releases. In this release, we have added the ability to automatically annotate CDAP components with system properties and tags. For instance, whenever a Dataset is added, its type and schema fields are recorded automatically. We have also enhanced metadata search. Together, these two features make it easy for users to look for entities in CDAP, for example: “Which Datasets deployed in CDAP have SSN fields in their schema?” or “Which Datasets are accessible through MapReduce?”
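
To make the idea concrete, here is a minimal in-memory sketch of automatic annotation plus search. It is not CDAP's metadata API: the `Entity` class, the `annotate_dataset` and `search_by_schema_field` functions, and the sample datasets are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical stand-in for a CDAP entity and its system metadata."""
    name: str
    kind: str                                        # e.g. "dataset"
    properties: dict = field(default_factory=dict)   # system properties
    tags: set = field(default_factory=set)           # system tags


def annotate_dataset(name, dataset_type, schema_fields):
    """Mimic automatic annotation: record the dataset type and schema fields."""
    return Entity(
        name=name,
        kind="dataset",
        properties={"type": dataset_type, "schema": list(schema_fields)},
        tags={dataset_type},
    )


def search_by_schema_field(entities, field_name):
    """Return names of datasets whose schema has the field (case-insensitive)."""
    wanted = field_name.lower()
    return [
        e.name
        for e in entities
        if e.kind == "dataset"
        and wanted in (f.lower() for f in e.properties.get("schema", []))
    ]


# Made-up sample catalog; dataset names and types are assumptions.
catalog = [
    annotate_dataset("customers", "table", ["id", "name", "ssn"]),
    annotate_dataset("tweets", "table", ["id", "text", "lang"]),
    annotate_dataset("employees", "keyValueTable", ["id", "SSN", "dept"]),
]

ssn_datasets = search_by_schema_field(catalog, "SSN")
print(ssn_datasets)  # ['customers', 'employees']
```

Because the schema fields are captured when the dataset is created, the search needs no manual tagging by the user, which is the point of combining the two features.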

What’s new with Cask Hydrator?

With this release, we have significantly enhanced Cask Hydrator with several new features. ETL pipelines in Hydrator now support directed acyclic graphs (DAGs), which allow non-linear execution of pipeline stages. For instance, users can write raw data to one destination while simultaneously forking the flow and doing additional processing before writing to a different destination. The diagram below highlights one such example, in which all raw tweets are stored in a table called tweets, while those whose text is in English are stored in a different table.
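
The forked tweet pipeline can be sketched as a small graph of stages. This is only an illustration of the DAG idea, not Hydrator's configuration format or scheduler: the stage names and the `execution_order` helper are hypothetical, and the ordering uses a standard topological sort (Kahn's algorithm).

```python
from collections import defaultdict, deque

# Hypothetical stage names and connections; not Hydrator's actual config format.
connections = [
    ("twitter_source", "tweets_sink"),          # raw tweets go straight to a table
    ("twitter_source", "english_filter"),       # fork: the same records are also filtered
    ("english_filter", "english_tweets_sink"),  # English tweets land in a second table
]


def execution_order(edges):
    """Return one valid stage order for a DAG using Kahn's algorithm."""
    downstream = defaultdict(list)
    indegree = {}
    for src, dst in edges:
        indegree.setdefault(src, 0)
        indegree.setdefault(dst, 0)
        downstream[src].append(dst)
        indegree[dst] += 1

    # Stages with no incoming connections can run first.
    ready = deque(stage for stage, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in downstream[stage]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("pipeline contains a cycle, so it is not a DAG")
    return order


order = execution_order(connections)
print(order)
# ['twitter_source', 'tweets_sink', 'english_filter', 'english_tweets_sink']
```

The source stage has two outgoing connections, which is exactly what a strictly linear pipeline cannot express.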

Schema validation during pipeline deployment has also been improved, which gives users early feedback by checking correctness while the pipeline is being published. A number of plugins have been added in this release as well: a Python transform plugin, Hive source and sink plugins, and Kafka producers, to name a few. We have also added experimental support for running a pipeline on either MapReduce or Spark: users can set the “engine” property when configuring the pipeline to choose between the two. This feature is available only via the CLI in 3.3.
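
As a rough illustration of the “engine” property, the sketch below sets it on a simplified pipeline configuration. Only the property name comes from this post. The config layout, the `with_engine` helper, and the exact value strings `"mapreduce"` and `"spark"` are assumptions, so check the Hydrator documentation for the real format.

```python
import copy

# Assumed value strings; the post only says the choice is MapReduce or Spark.
SUPPORTED_ENGINES = {"mapreduce", "spark"}


def with_engine(pipeline_config, engine):
    """Return a copy of a pipeline config dict with its "engine" property set."""
    engine = engine.lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    updated = copy.deepcopy(pipeline_config)  # leave the caller's config untouched
    updated["engine"] = engine
    return updated


# Hypothetical, heavily simplified pipeline config.
base_config = {"name": "tweet_pipeline", "stages": ["twitter_source", "tweets_sink"]}
spark_config = with_engine(base_config, "Spark")
print(spark_config["engine"])  # spark
```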

Installation Improvements

CDAP 3.3.0 also gives users an improved installation experience: the CDAP Master service can now check for prerequisites. For instance, during startup, CDAP checks file system permissions, the availability of components such as YARN and HBase, and resource availability. If any of the prerequisites is not met, CDAP fails to start and reports an appropriate message.
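
The fail-at-startup pattern described above might look roughly like the following. This is a generic sketch, not CDAP Master's implementation: `PrerequisiteError`, the check functions, and the memory figures are hypothetical, and real checks for YARN or HBase availability are omitted.

```python
import os
import tempfile


class PrerequisiteError(RuntimeError):
    """Raised when one or more startup checks fail."""


def check_writable(path):
    """Check that the process can write to the given directory."""
    ok = os.path.isdir(path) and os.access(path, os.W_OK)
    state = "writable" if ok else "not writable"
    return ok, f"directory {path} is {state}"


def check_min_memory(available_mb, required_mb):
    """Check that enough memory is available (values supplied by the caller)."""
    ok = available_mb >= required_mb
    return ok, f"{available_mb} MB available, {required_mb} MB required"


def run_startup_checks(checks):
    """Run every check, then refuse to start if any of them failed."""
    failures = []
    for name, check in checks:
        ok, message = check()
        if not ok:
            failures.append(f"{name}: {message}")
    if failures:
        raise PrerequisiteError("; ".join(failures))


workdir = tempfile.mkdtemp()  # stand-in for a directory the service needs
run_startup_checks([
    ("filesystem", lambda: check_writable(workdir)),
    ("resources", lambda: check_min_memory(available_mb=4096, required_mb=1024)),
])
print("all prerequisites satisfied")
```

Running all checks before raising means a single failed start reports every unmet prerequisite at once instead of one per attempt.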

We have also added a number of platform enhancements: dynamic dataset instantiation, which allows users to load datasets dynamically at runtime; a smarter way for MapReduce programs to consume multiple partitions of datasets; support for CDH 5.5; and more.

I encourage you to check out CDAP 3.3.0 and give it a go. We look forward to your feedback, suggestions, and comments to help us keep improving the platform. Engage with the CDAP community and help us build CDAP!
