CDAP 3.5 — Enterprise Security, Drag-and-Drop Spark Streaming, and much more!

cdapio
Apr 23, 2019

August 25, 2016

Sagar Kapare is a Software Engineer at Cask, where he builds software to simplify data application development. Prior to Cask, he worked on a high-performance digital messaging platform at StrongView Systems.

I am very excited to announce the release of Cask Data Application Platform (CDAP) version 3.5. The focus of CDAP 3.5 is security, with a number of significant new capabilities added to the platform, in addition to major improvements to the CDAP extensions Cask Hydrator and Cask Tracker.

CDAP 3.5 introduces authorization to the platform, with support for fine-grained, role-based access control of CDAP entities and integration with Apache Sentry. In addition, CDAP and CDAP Apps can now be run as a specific user, and sensitive configurations like login credentials can be stored encrypted. Cask Hydrator adds support for Spark Streaming pipelines and for pipelines with multiple inputs and extensive join capabilities. Cask Tracker introduces data usage analytics and improved metadata management capabilities. Read on to learn more about these and other new features added in CDAP 3.5.

Enterprise-Grade Security

Fine-grained Authorization

Working with customers, Cask has learned the importance of security and isolation of data in the enterprise. Among the features most requested by CDAP users in large enterprises has been the ability to perform granular access control within the platform. CDAP 3.5 introduces support for fine-grained, role-based authorization on CDAP entities such as Namespaces, Applications, and Datasets. Roles and permissions can be managed using CDAP REST APIs, CDAP CLI, or Cloudera Manager Hue (part of CDH). With this feature, administrators can grant specific users specific levels of access for various actions within CDAP, for instance creating a namespace, deploying applications, or starting and stopping programs.
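To make the idea of role-based authorization on entities concrete, here is a minimal sketch in plain Python. It is not CDAP's authorization API: the `Authorizer` class, the entity strings, and the action names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class Authorizer:
    """Toy role-based access control: users hold roles, roles hold privileges."""

    def __init__(self):
        # role -> set of (entity, action) pairs; names are illustrative only
        self.role_privileges = defaultdict(set)
        # user -> set of role names
        self.user_roles = defaultdict(set)

    def grant(self, role, entity, action):
        """Give a role the right to perform one action on one entity."""
        self.role_privileges[role].add((entity, action))

    def add_role(self, user, role):
        """Assign a role to a user."""
        self.user_roles[user].add(role)

    def is_allowed(self, user, entity, action):
        """A user is allowed if any of their roles holds the exact privilege."""
        roles = self.user_roles.get(user, set())
        return any(
            (entity, action) in self.role_privileges.get(role, set())
            for role in roles
        )


auth = Authorizer()
auth.grant("analyst", "dataset:sales.orders", "READ")
auth.grant("operator", "application:sales.purchases", "EXECUTE")
auth.add_role("alice", "analyst")
auth.add_role("bob", "operator")
```

Because privileges are attached to individual entities, a grant on one dataset implies nothing about other datasets or about the enclosing namespace, which is the sense in which the control is fine-grained.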

Secure Store for Configuration

CDAP 3.5 also introduces support for the safe storage and usage of sensitive data like passwords and access keys using the CDAP Secure Store. The store can be accessed through CDAP Programs and Hydrator Pipelines. Only authorized users can get access to the Secure Store. In-Memory and Standalone modes use the file-based JCEKS storage provider while distributed mode uses Hadoop KMS as a storage provider.

Secured Impersonation

Secured impersonation in CDAP 3.5 allows a superuser to perform operations on behalf of another user. For example, Programs can be executed on the cluster as any configured user and Dataset operations are performed as the user who submitted the Program. Administrators can configure a specific user at a Namespace level to run programs and perform dataset operations.

Cask Hydrator

Drag-and-Drop Spark Streaming

Cask Hydrator now supports real-time pipelines that run using Spark Streaming. Users can take advantage of Spark Streaming capabilities such as windowing and computing aggregates while still reusing existing transformations and sinks.
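For readers new to windowing, the following sketch shows the general idea of a windowed aggregate over micro-batches. It is plain Python rather than Spark Streaming or Hydrator code, and the function name `windowed_counts` and the sample data are assumptions made for illustration.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length):
    """Yield per-key counts over the most recent `window_length` batches.

    Each incoming batch slides the window forward by one batch.
    """
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    for batch in batches:
        window.append(Counter(batch))
        total = Counter()
        for counts in window:
            total.update(counts)
        yield dict(total)


# Four micro-batches of event keys, aggregated over a window of two batches.
batches = [["a", "b"], ["a"], ["c", "a"], ["b"]]
results = list(windowed_counts(batches, window_length=2))
```

Each emitted result covers only the batches currently inside the window, so keys that stop arriving eventually disappear from the aggregate.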

Joins & Actions

Data is usually normalized across multiple sources in order to minimize data redundancy. Join plugins added to Cask Hydrator allow users to join data from multiple datasets and support a rich set of join semantics including inner and outer joins.
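The difference between inner and outer join semantics can be illustrated with a small plain-Python sketch. This is not the Hydrator join plugin; the `join` function, its `how` parameter, and the sample records are hypothetical.

```python
def join(left, right, key, how="inner"):
    """Join two lists of records on `key`.

    how="inner" keeps only left rows with a match on the right;
    how="left_outer" also keeps unmatched left rows as they are.
    """
    # Index the right side by key so each left row is matched in one lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    result = []
    for row in left:
        matches = index.get(row[key], [])
        for match in matches:
            result.append({**row, **match})
        if not matches and how == "left_outer":
            result.append(dict(row))
    return result


customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
orders = [{"id": 1, "total": 30}]

inner = join(customers, orders, key="id")
left_outer = join(customers, orders, key="id", how="left_outer")
```

With the inner join, the customer without an order is dropped; with the left outer join, that customer is retained without the order fields.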

The new Action plugin type for Cask Hydrator introduced in this release allows users to combine data flow and control flow in the same pipeline. Out of the box, users will be able to use SSH, Database, and HDFS Actions. The SSH Action allows users to execute arbitrary code on any desired machine as part of the pipeline. The Database Action allows users to execute a database command as part of the pipeline, and the HDFS Action allows file operations such as mv and delete to be performed on an HDFS cluster.

Runtime Arguments and Macros

Hydrator pipelines now support runtime arguments. This allows users to reuse the same pipeline to read from different sources, write to different sinks, and so on. Macro substitution also allows users to change the behavior of the pipeline on a per-run basis. Unlike runtime arguments, however, macros provide an extra level of flexibility. At runtime, all plugin fields configured with macro syntax are parsed and substituted. This syntax can be interspersed with other text, so pipeline operators can form complex combinations of macros and text, such as configuring a hostname as: “${address}:${port}/${path}”. Additionally, “macro functions” can be used to perform extra logic before a substitution, such as computing the logical start time of a pipeline or looking up information from secure storage. For example, the image below shows some sample configuration for the Database plugin. Macro-enabled fields are marked in the UI. The username field uses the secure store to get the name of the user, while the importQuery configuration uses the logicalStartTime as part of the where clause.
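The substitution behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not CDAP's macro parser: the `substitute` function, the in-memory `secure_store`, the sample argument values, and the use of Python `strftime` patterns for the date format are all assumptions, and nested macros are not handled.

```python
import re
from datetime import datetime, timezone

# Matches ${name} or ${name(arg)}; nested macros are out of scope for this sketch.
MACRO = re.compile(r"\$\{(\w+)(?:\(([^)]*)\))?\}")


def substitute(text, arguments, functions):
    """Replace every macro in `text` with a runtime argument or function result."""

    def replace(match):
        name, arg = match.group(1), match.group(2)
        if arg is not None:
            # ${name(arg)}: a macro function runs extra logic before substitution
            return functions[name](arg)
        # ${name}: plain lookup in the runtime arguments
        return arguments[name]

    return MACRO.sub(replace, text)


# Hypothetical stand-ins for the secure store and the run's logical start time.
secure_store = {"db_user": "etl_reader"}
logical_start = datetime(2016, 8, 25, tzinfo=timezone.utc)

functions = {
    "secure": lambda key: secure_store[key],
    "logicalStartTime": lambda fmt: logical_start.strftime(fmt),
}
arguments = {"address": "db.example.com", "port": "5432", "path": "sales"}

host = substitute("${address}:${port}/${path}", arguments, functions)
user = substitute("${secure(db_user)}", arguments, functions)
query = substitute(
    "SELECT * FROM orders WHERE day = '${logicalStartTime(%Y-%m-%d)}'",
    arguments,
    functions,
)
```

Running the same pipeline with a different `arguments` map or a different logical start time changes the resolved hostname and query without touching the pipeline configuration itself.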

Hydrator Plugins

Many new plugins are also available in CDAP 3.5. We are introducing new sources such as COBOL copybook, XML, Excel, and FTP to allow users to load data into Cask Hydrator. Plugins such as XML Parser, Normalizer, and Denormalizer provide additional functionality to transform data, and a new Solr sink was added by popular demand.

Cask Tracker

Data Usage Analytics

Cask Tracker now provides richer data insights and answers important questions about your data, such as what data is being accessed, which programs are using the data, and what is popular on your cluster. We are also introducing the beta version of the Tracker Meter, which measures the overall quality and dependability of the data based on profiling and social metrics.

Metadata Taxonomy

On top of the existing tagging and metadata features of the platform, the Tracker UI now supports adding and managing Tags and Properties for Datasets directly. In addition, the notion of Preferred Tags has been introduced, allowing users to upload and manage a set of company-specific tags for data and highlighting these tags throughout the interface.

In addition to all the awesome new features above, we have also added a number of other enhancements to the CDAP platform: dataset and logging performance improvements, support for CDH 5.8, and the ability to run long-running custom actions as part of a Workflow. Full details of everything in CDAP 3.5 are documented in the release notes.

I encourage you to download CDAP 3.5.0 and give it a try. We look forward to your feedback, suggestions, and comments to help us continually improve the platform. And please don’t hesitate to engage with the CDAP community and help us build CDAP!
