Announcing GA Release of CDAP 4.3 — Use Cases, Features and Capabilities
August 30, 2017
We would like to thank all our users and customers for the great conversations we have had around use cases, the challenges you face with operationalizing a data lake and/or building data analytics solutions, and your candid feedback on CDAP usability. These interactions are invaluable and we always love hearing from you. You have offered a lot of insights to our product team on how to make CDAP even better.
In this blog, we will describe the enhancements we made in the latest release of CDAP, after internalizing your feedback. We will also describe some new frameworks and tools we are offering for use with CDAP.
Here is a short list of what is available in CDAP 4.3:
- Data Preparation
  - User Defined Directives (UDD) for extending the capabilities of Data Prep
  - Restricted Directives and Directive Aliasing, giving IT Administrators control over directive access
- Data Pipelines
  - Conditions in Pipelines for ETL / ELT Developers
  - Triggered Pipelines for cross-team, cross-pipeline interconnectivity to build complex workflows
  - Improved Pipeline Studio for ETL / ELT Developers
  - Upgrading Pipelines for ETL / ELT Developers
  - Custom Icons and Labels for Plugins for Pipeline Plugin Developers
  - Pipeline Operational Insights for DevOps and Operations teams
- Governance & Security
  - Apache Ranger Integration for IT Security Teams
- Data Science
  - PySpark & Spark DataFrame Support for Data Scientists
In addition to the new capabilities in CDAP 4.3, we are also offering additional capabilities built on top of CDAP:
- Microservices Framework for IoT and real-time use cases
- Distributed Rules Engine for Business Analysts
For those who would like to read further, we have compiled details below for each of the features and improvements that made it into CDAP 4.3 GA, as well as details on the new frameworks and tools. We would love to hear from you about how the new CDAP 4.3 feature set is helping solve your problems. If you are new to CDAP and would like to take it for a spin, download it from here. You can also create an instance on the AWS or Azure Cloud Sandboxes for CDAP. If you run into an issue or have a question, you can chat with the CDAP Community on Slack or, if you prefer email, you can use the CDAP User Google Group.
Data Preparation
User Defined Directives for Data Preparation
… How can we extend the functionality of Data Preparation? We would like to add a directive that performs custom data transformations …
User Defined Directives (UDD) allow you to extend the capabilities of the CDAP Data Preparation tool. Users develop, deploy, and manage UDDs within CDAP through a set of APIs.
UDDs are built upon the CDAP Plugin Framework. The UDD framework includes a Java API for building new directives and a testing rig for exercising directives within a unit testing framework such as JUnit. More information on how to build a UDD is available here.
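As a taste of the API, here is a minimal sketch of a custom directive that reverses the text in a column, mirroring the kind of example found in the example-directive repository referenced below. Package, interface, and method names follow the Wrangler UDD API as documented around CDAP 4.3; verify them against the version you build for.

```java
import java.util.List;

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.wrangler.api.Arguments;
import co.cask.wrangler.api.Directive;
import co.cask.wrangler.api.DirectiveExecutionException;
import co.cask.wrangler.api.DirectiveParseException;
import co.cask.wrangler.api.ExecutorContext;
import co.cask.wrangler.api.Row;
import co.cask.wrangler.api.parser.ColumnName;
import co.cask.wrangler.api.parser.TokenType;
import co.cask.wrangler.api.parser.UsageDefinition;

@Plugin(type = Directive.Type)
@Name("text-reverse")
@Description("Reverses the text value of a column.")
public class TextReverse implements Directive {
  private String column;

  // Declares the directive's grammar: text-reverse <column>
  @Override
  public UsageDefinition define() {
    UsageDefinition.Builder builder = UsageDefinition.builder("text-reverse");
    builder.define("column", TokenType.COLUMN_NAME);
    return builder.build();
  }

  // Called once with the parsed arguments, before any rows are processed.
  @Override
  public void initialize(Arguments args) throws DirectiveParseException {
    this.column = ((ColumnName) args.value("column")).value();
  }

  // Called per batch of rows; reverses the string in the target column.
  @Override
  public List<Row> execute(List<Row> rows, ExecutorContext context)
      throws DirectiveExecutionException {
    for (Row row : rows) {
      int idx = row.find(column);
      if (idx != -1 && row.getValue(idx) instanceof String) {
        String value = (String) row.getValue(idx);
        row.setValue(idx, new StringBuilder(value).reverse().toString());
      }
    }
    return rows;
  }

  @Override
  public void destroy() {
    // Nothing to clean up.
  }
}
```

Once deployed, the directive can be invoked in a Data Prep recipe just like a built-in one, e.g. `text-reverse :address`.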
Get started with building directives by cloning the example-directive project on GitHub, or watch a video tutorial.
Restricted Directives and Directive Aliasing
… We would like to restrict which directives are available to users, and we would like to rename directives to match our organization's data processing jargon …
This new feature allows organizations to define a whitelist of the directives accessible to users within the organization, as well as to rename (alias) directives.
Both the restriction list and directive aliases are configured using REST APIs; applying security policies to those APIs further restricts who can make changes to the list.
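As a sketch of what such a call could look like, the snippet below posts an exclusion-and-aliasing configuration to the Data Prep service using Java 11's HTTP client. The endpoint path and JSON shape here are illustrative assumptions rather than the documented contract, so check the reference linked below for the exact API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConfigureDirectives {
  public static void main(String[] args) throws Exception {
    // Hypothetical configuration: hide one directive and alias another.
    // The actual JSON shape is defined in the CDAP Data Prep documentation.
    String config = "{"
        + "\"exclusions\": [\"parse-as-fixed-length\"],"
        + "\"aliases\": {\"mask-ssn\": \"mask-number\"}"
        + "}";

    // Hypothetical endpoint on the Data Prep service, reached through the
    // CDAP router (default port 11015); substitute the documented path.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:11015/v3/namespaces/default/"
            + "apps/dataprep/services/service/methods/config"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(config))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```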
More information on how you can configure the restriction list or rename directives is available here.
Data Pipelines
Conditional Branching in Workflows
… Oftentimes we have cases where we would like to check a condition based on previous stages of the workflow, to decide which paths of the workflow should be triggered …
With Conditions in pipelines, users can now control the flow of an executing workflow based on a boolean expression. This feature allows users to conditionally execute parts of a pipeline or to perform an early termination of the pipeline.
Boolean expressions can use runtime arguments, counters from previous stages of the pipeline, and global information such as the pipeline name, logical time offset, and so on. For example, a condition could route execution down an error-handling branch when an upstream stage emits no output records.
More information on how you can use this capability in your data pipeline is available here.
Triggered Workflows
… We are looking for ways to avoid creating large monolithic workflows by breaking them into small, manageable, and functionally separated workflows. We also need a way to define dependencies between the workflows …
Triggering workflows based on the execution status of other workflows provides additional control and flexibility to build more complex and manageable solutions using CDAP.
Available in the CDAP UI under Pipelines, this feature allows setting triggers that execute a pipeline based on the execution status of other deployed pipelines. Setting dependencies between pipelines has been simplified through an easy-to-use UI, and can also be done using REST APIs, as sketched below.
In combination with Conditions, even more complex triggering behavior can be configured.
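Under the hood, a trigger can be thought of as a schedule whose triggering event is another program's run status. The sketch below creates such a schedule through the CDAP REST API; the endpoint and payload fields shown are assumptions modeled on the lifecycle APIs, so verify them against the CDAP reference documentation before relying on them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateTrigger {
  public static void main(String[] args) throws Exception {
    // Hypothetical payload: run the "cleanse" pipeline's workflow whenever
    // the "ingest" pipeline's workflow completes. Field names are
    // illustrative; consult the CDAP schedules REST reference.
    String schedule = "{"
        + "\"name\": \"runAfterIngest\","
        + "\"program\": {\"programName\": \"DataPipelineWorkflow\","
        + "              \"programType\": \"WORKFLOW\"},"
        + "\"trigger\": {"
        + "  \"type\": \"PROGRAM_STATUS\","
        + "  \"programId\": {\"namespace\": \"default\","
        + "                  \"application\": \"ingest\","
        + "                  \"program\": \"DataPipelineWorkflow\"},"
        + "  \"programStatuses\": [\"COMPLETED\"]"
        + "}}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:11015/v3/namespaces/default/"
            + "apps/cleanse/schedules/runAfterIngest"))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(schedule))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```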
Improved Pipeline Studio
… Building pipelines in CDAP is at times cumbersome; we really wish that mistakes could be forgiven with undo and redo, and that complex graphs were easier to trace …
We have observed many users of CDAP very closely and tried to understand the challenges they face while building pipelines in the Studio. We then prioritized our findings and scheduled them across multiple releases. This release takes the usability of the CDAP Pipeline Studio to the next level.
The CDAP Pipeline Studio has been redesigned with the goal of making the most relevant information easily available. Undo/redo capabilities allow you to easily correct mistakes and give you the freedom to experiment. Selecting and tracing edges in complex graphs has been simplified. Processing metrics are now shown on nodes, and interacting with them surfaces more detailed graphs and processing statistics.
Download the latest CDAP from here to try out the new Pipeline Studio.
Pipeline Upgrade
… We have a number of pipelines and plugins that were built using an older version of CDAP; is there any way we can migrate them seamlessly to newer plugin versions in CDAP? …
Previously, importing pipelines built with an older version of CDAP required a laborious conversion process.
With this release, CDAP provides an improved import flow with interactive conversion to the newer CDAP version. The import automatically maps plugins; for plugins that cannot be found, it offers the option to install them from Cask Market, if available.
To access this feature, navigate to CDAP > Pipeline > (+) > Create Pipeline > Import.
Custom Icon and Label Support for Pipeline Plugins
… We have built a new plugin for a pipeline, and we would like to add a custom icon and have a human-readable label for the plugin ….
Presentation matters: users locate plugins visually in the Studio, so an icon that represents a plugin or its functionality is a critical first step in making the plugin usable.
To enhance plugin usability, CDAP now lets plugin developers package custom icons with the plugins they develop. Including the cdap-maven-plugin in a project simplifies the build process and bundles the icons alongside the plugin.
More information on how to include the cdap-maven-plugin in a plugin project is available here.
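For orientation, here is what the relevant section of a plugin project's pom.xml might look like. The coordinates, goal, and configuration keys follow the cdap-maven-plugin as published around this release, but treat them as illustrative and confirm them against the plugin's README.

```xml
<!-- Illustrative build configuration for a pipeline plugin project.
     Verify the plugin version, goal, and configuration keys against
     the cdap-maven-plugin README for your CDAP version. -->
<plugin>
  <groupId>co.cask</groupId>
  <artifactId>cdap-maven-plugin</artifactId>
  <version>1.0.0</version>
  <configuration>
    <cdapArtifacts>
      <!-- Parent artifacts this plugin can run inside. -->
      <parent>system:cdap-data-pipeline[4.3.0,5.0.0)</parent>
    </cdapArtifacts>
  </configuration>
  <executions>
    <execution>
      <id>create-artifact-config</id>
      <phase>prepare-package</phase>
      <goals>
        <!-- Generates the plugin JSON, including icon and widget files. -->
        <goal>create-plugin-json</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```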
Pipeline Operational Insights
… Our operations team would like insights into historical runs of the pipelines, the ability to compare different runs of a pipeline, and a view of the operational statistics of each run …
Each run of a CDAP Pipeline aggregates a large number of operational metrics that can be used for understanding the performance of a pipeline. The metrics can be used to diagnose performance and data issues.
This release introduces visualization of the performance metrics for pipelines. You can investigate metrics by looking at trends across previous runs of a pipeline, and you can inspect each individual node to understand processing times, records processed, and more.
This feature is accessible from the deployed pipeline view in the UI, under “Summary”. Statistics for each node are available by clicking the node in the pipeline to understand its performance.
Governance and Security
Apache Ranger Integration — GA
… We use Hortonworks Data Platform (HDP) and would like to use Apache Ranger integration as our centralized security framework …
CDAP is dedicated to providing enterprise-grade security to its users. We believe integration with Apache Ranger will allow customers to secure data and applications using granular privileges to meet security compliance.
Apache Ranger integration includes the ability to grant granular privileges through the administrative panel, allowing administrators to manage all privileges on CDAP entities in one place, together with other services such as HDFS, Hive, etc.
Apache Ranger integration is provided as a CDAP Security Extension. Learn more about how to integrate CDAP with Apache Ranger here.
Data Science
PySpark & Spark DataFrame Support
… We have data scientists who would like to use PySpark and Spark Dataframe in pipelines for building models or transforming data using SQL …
The previous release of CDAP already included improvements to Spark integration (read more about it here), and this release adds further enhancements: integration with PySpark and Spark DataFrames.
The CDAP PySpark plugin integration allows developers and data scientists to use Python to perform data analytics on the Spark framework. In addition, Spark SQL is supported in Spark Compute plugins through CDAP's production-ready, enterprise-grade integration.
The PySpark plugin for CDAP is available in Cask Market.
New Frameworks and Tools
Microservices Framework
… We would like to create a loosely connected graph of processing (instead of a monolithic pipeline), where each node in the graph has an independent lifecycle and executes a specific function on the events it receives …
Whether you want to consume metadata events from CDAP to publish metadata to Apache Atlas or Cloudera Navigator, or read device telemetry data from Amazon SQS, Apache Kafka, MapR Streams, or WebSockets, you can now deploy and configure Microservices to process them.
Microservices is a new framework, introduced for use with CDAP, for building loosely connected graphs of data processing. The framework includes Java-based APIs for building new Microservices, combined with CDAP's capabilities for providing an operational, managed, and secure environment. The Microservices framework achieves communication isolation through a Channel Framework that allows Microservices to bind to different channel types, such as Amazon SQS, WebSocket, MapR Streams, MQTT, Kafka, and the Transactional Messaging System (TMS, internal to CDAP).
Microservices for CDAP provide the following capabilities: asynchronous event processing, real-time processing, high availability, autonomy and loose coupling, independent scaling, and a truly message-driven, low-latency, high-throughput design.
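Because the framework is licensed separately (see below), the sketch that follows is purely hypothetical: every type and method name in it is invented to illustrate the channel-bound, event-driven shape described above, and none of it is the actual Microservices API.

```java
// Purely hypothetical sketch of a microservice that binds to channels and
// filters events. All class, interface, and method names here are invented
// for illustration; the real API is available under license from Cask.
public class TelemetryFilter extends AbstractMicroservice {

  @Override
  public void configure(MicroserviceConfigurer configurer) {
    // Consume device telemetry from a Kafka channel and publish the
    // filtered stream to CDAP's internal TMS channel; other channel types
    // include SQS, WebSocket, MapR Streams, and MQTT.
    configurer.consumeFrom(ChannelType.KAFKA, "device-telemetry");
    configurer.publishTo(ChannelType.TMS, "filtered-telemetry");
  }

  @Override
  public void process(Event event, Emitter emitter) {
    // Forward only readings above a threshold; drop everything else.
    double reading = Double.parseDouble(event.get("reading"));
    if (reading > 100.0) {
      emitter.emit(event);
    }
  }
}
```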
The Microservices framework is not open source and requires a separate license from Cask. If you would like to learn more or see a customized demo, please contact Cask.
Distributed Rules Engine (DRE)
… We have business analysts that would like to specify rules for data transformations or enforce policies during ingestion of vendor data. These analysts are not code savvy and would like to have an easier way to specify conditional actions without having to write code ….
The Distributed Rules Engine is a sophisticated if-then-else statement interpreter that runs natively on big data systems like Spark and Hadoop. It provides an alternative computational model for transforming your data, while empowering business users to specify and manage the transformations and policy enforcements.
With DRE, business users can write, manage, deploy, execute, and monitor business data transformations and policy enforcements. DRE provides a Business Rule Repository, a Business Rule Editor, and a Rule Execution Core for executing the business rules.
Distributed Rules Engine is available under a separate license from Cask. If you would like to see a customized demo, please contact Cask.
You can try out this latest version of CDAP by downloading the CDAP Local Sandbox or spinning up an instance of the CDAP Cloud Sandbox on AWS or Azure. CDAP is also available to install in distributed environments. Reach out to us via chat or email should you have any questions or issues, or just want to give us your valuable feedback!