CDAP 4.2: Interactive Spark, enhanced self-service Data Preparation and more

Published in

cdapio

4 min read · Apr 24, 2019

June 7, 2017

Sreevatsan Raman is the Head of Engineering at Cask where he is driving the company’s engineering initiatives. Prior to Cask, Sree designed and implemented big data infrastructure at Klout and Yahoo!

I am very happy to announce the general availability of Cask Data Application Platform (CDAP) 4.2. Over the last few months, we have been focusing on enhancing the user experience and usability of the product. CDAP 4.2 comes with several features that give new users a great first-five-minute experience: enhanced self-service Data Preparation capabilities, an improved interactive Apache Spark experience in CDAP data pipelines, and Change Data Capture (CDC) from SQL Server and Oracle. In addition, several new platform enhancements enable event-driven data schedules, broaden distribution support to EMR 5.x, and much more. In this post, we will take a look at a cross section of the latest and greatest in CDAP 4.2.

Interactive Spark

We have significantly enhanced the Spark experience with CDAP in this release. Starting with 4.2, CDAP users can choose between Spark 1.x and Spark 2.x on their cluster, which gives them a broader choice of Spark versions. Administrators can pin the cluster's default Spark version to any of the supported versions.

With the current release, users can also use Spark interactively with CDAP. We have provided new plugins that allow users to port existing legacy Spark applications and run them as part of data pipelines simply by copying and pasting the code. Furthermore, users can add custom code to a Spark compute plugin, which lets them apply dynamic, on-the-fly transformations to any data that the data pipeline reads, in batch or in real time. This feature gives users greater flexibility and faster turnaround when porting or writing complex Spark code, while still enjoying the richer platform capabilities of CDAP, including metadata, audit, lineage, and security.

Enhanced Self Service Data Prep

In CDAP 4.2, we have focused on giving our users a great first-five-minute experience with CDAP through a number of enhancements to Data Preparation. Users can browse their HDFS or local file system, pick up data from files, and apply numerous directives to it in the UI. They can also browse databases and bring in data for interactive transformation with a few clicks.
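As a rough illustration of what applying directives to data means, the sketch below treats a recipe as an ordered list of small transformation steps run against each record. The step names (`rename`, `drop`, `uppercase`) and the `run_recipe` helper are hypothetical and written only for this example; they are not the CDAP Data Preparation API or its directive syntax.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[Record], Record]


def rename(old: str, new: str) -> Step:
    """Hypothetical step: rename column `old` to `new`."""
    def step(rec: Record) -> Record:
        return {(new if key == old else key): value for key, value in rec.items()}
    return step


def drop(column: str) -> Step:
    """Hypothetical step: remove `column` from the record."""
    def step(rec: Record) -> Record:
        return {key: value for key, value in rec.items() if key != column}
    return step


def uppercase(column: str) -> Step:
    """Hypothetical step: upper-case the value of `column` if it is present."""
    def step(rec: Record) -> Record:
        out = dict(rec)
        if column in out:
            out[column] = out[column].upper()
        return out
    return step


def run_recipe(records: List[Record], recipe: List[Step]) -> List[Record]:
    """Apply every step of the recipe, in order, to each record."""
    result = []
    for rec in records:
        for step in recipe:
            rec = step(rec)
        result.append(rec)
    return result


if __name__ == "__main__":
    rows = [{"fname": "ada", "tmp": "x"}]
    recipe = [rename("fname", "name"), drop("tmp"), uppercase("name")]
    print(run_recipe(rows, recipe))
```

Each step returns a new record instead of mutating its input, so the order of steps in the recipe is the only thing that determines the result.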

Event Driven Schedules

Previous releases of CDAP added time-based scheduling for CDAP programs. Users can now schedule CDAP programs to start based on data availability, specifically on partitions of incoming data in HDFS. Users can specify constraints such as starting a workflow only when at least five partitions are available, or restricting the number of concurrently executing workflows to five. Users can also limit the time window in which a schedule can execute, so as to avoid peak usage hours. For the complete set of advanced scheduling features, refer to the scheduler documentation.
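The way these constraints combine can be sketched in a few lines of Python. This is only a model of the decision logic described above, not the CDAP scheduler API: the `ScheduleConstraints` class, its field names, and the 22:00 to 06:00 off-peak window are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class ScheduleConstraints:
    """Hypothetical constraint set; names and defaults are illustrative only."""
    min_partitions: int = 5            # new partitions required before starting
    max_concurrent_runs: int = 5       # upper bound on workflows running at once
    window_start: time = time(22, 0)   # assumed start of the off-peak window
    window_end: time = time(6, 0)      # assumed end of the off-peak window


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` is in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_start(new_partitions: int, running: int, now: time,
                 constraints: ScheduleConstraints) -> bool:
    """Start the workflow only when every constraint is satisfied."""
    return (new_partitions >= constraints.min_partitions
            and running < constraints.max_concurrent_runs
            and in_window(now, constraints.window_start, constraints.window_end))


if __name__ == "__main__":
    c = ScheduleConstraints()
    print(should_start(new_partitions=5, running=0, now=time(23, 0), constraints=c))
    print(should_start(new_partitions=5, running=0, now=time(12, 0), constraints=c))
```

In this model, a trigger that arrives with too few partitions, too many running workflows, or outside the allowed window simply does not start a run; what a real scheduler does with such a trigger (wait, retry, or drop it) is outside the scope of the sketch.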

Change Data Capture for EDW Offload

The last release of CDAP provided out-of-the-box capabilities for EDW offloading, where users can move data from Enterprise Data Warehouses (EDW) into Apache Hadoop using batch data pipelines. In the new 4.2 release, we have enhanced our EDW offloading solution with real-time change data capture from SQL Server and Oracle, which provides a low-latency path for making data available in Hadoop for analysis. The CDC solution is available from Cask Market, which means that users can now very easily create Spark Streaming pipelines to bring data into Hadoop with a few clicks.
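For readers new to change data capture, the core idea is that the source database emits a stream of insert, update, and delete events, and the target replays them in order to stay in sync. The sketch below shows that replay step against an in-memory table. The event shape `(operation, primary_key, row)` and the `apply_changes` function are assumptions made for this illustration, not the format or API of the CDAP CDC plugins.

```python
from typing import Any, Dict, Iterable, Tuple

# A change event is (operation, primary_key, row). This shape is an assumption
# made for the sketch, not the format emitted by the CDAP CDC plugins.
ChangeEvent = Tuple[str, Any, Dict[str, Any]]
Table = Dict[Any, Dict[str, Any]]


def apply_changes(table: Table, events: Iterable[ChangeEvent]) -> Table:
    """Replay change events, in order, onto a copy of a table keyed by primary key."""
    result = {key: dict(row) for key, row in table.items()}
    for op, key, row in events:
        if op == "insert":
            result[key] = dict(row)
        elif op == "update":
            # Merge the changed columns into the existing row (creating it if absent).
            result.setdefault(key, {}).update(row)
        elif op == "delete":
            result.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


if __name__ == "__main__":
    snapshot = {1: {"name": "a", "qty": 1}}
    changes = [
        ("insert", 2, {"name": "b", "qty": 2}),
        ("update", 1, {"qty": 9}),
        ("delete", 2, {}),
    ]
    print(apply_changes(snapshot, changes))
```

Because events are applied strictly in arrival order, the ordering guarantees of the change stream determine whether the target converges to the same state as the source.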

Would you like to see CDC in action? Join Sagar Kapare for a live webinar on 6/28 covering Realtime Change Data Capture with Spark Streaming.

Data Pipeline Enhancements

This new release has a number of new plugins to process data from Azure Event Hub, Azure Blob Store and Azure Data Lake Store.

Distro Support

CDAP 4.2 users now have a greater choice of distributions on which to install CDAP. We have added support for EMR 5.x (5.0 to 5.3). In addition, we have added support for Hive 2.x, so that users of CDAP datasets can leverage LLAP for faster SQL access to those datasets.

Give CDAP 4.2 a spin by downloading it from here, and reach out to the CDAP User Google Group if you have any questions.