
Validation Framework for CDAP Plugins

CDAP provides an interactive UI for building data pipelines that apply code-free transformations on data. A CDAP data pipeline is a directed acyclic graph whose nodes are plugins and whose connections represent the flow of data between them. Each plugin in the pipeline is configured by providing configuration properties along with its input and output schema.

A CDAP data pipeline solving a real-world use case can easily contain ten or more nodes. While building such pipelines, developers can provide invalid plugin configurations or schemas. For example, the BigQuery sink plugin can be given an output schema that does not match the underlying BigQuery table, or the GCS source can be given an invalid bucket name.

When a deployed pipeline fails due to an invalid configuration, a common development flow is to check the logs to identify the invalid configuration, fix it in a clone of the pipeline, and run it again. This iterative process increases pipeline development time; imagine how long it takes for a pipeline with 10-20 nodes. Along the way, pipeline developers have to dig through technical logs, stack traces, or even the code base to diagnose problems such as a NullPointerException.

In the upcoming CDAP 6.1.1 release, a new validation framework has been introduced to fail fast and collect all validation failures up front. The framework also exposes a FailureCollector API to surface contextual error messages in the CDAP UI.

Let's see what this framework provides and how plugins can use it to surface contextual error messages.

The validation framework should be able to collect multiple error messages in order to provide a better user experience while building the pipeline. To do that, the framework exposes the FailureCollector APIs below:

Error collection using the FailureCollector API
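As a rough, self-contained sketch of the collector's contract (a simplified stand-in for illustration — the real FailureCollector and ValidationFailure interfaces live in the CDAP ETL API and carry richer cause information):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: a failure carries a message and an optional corrective action.
class ValidationFailure {
  final String message;
  final String correctiveAction;

  ValidationFailure(String message, String correctiveAction) {
    this.message = message;
    this.correctiveAction = correctiveAction;
  }
}

class FailureCollector {
  private final List<ValidationFailure> failures = new ArrayList<>();

  // Records a failure without throwing, so validation keeps going and
  // every problem is reported in a single pass.
  ValidationFailure addFailure(String message, String correctiveAction) {
    ValidationFailure failure = new ValidationFailure(message, correctiveAction);
    failures.add(failure);
    return failure;
  }

  // Called at the end of validation: throws only if failures were collected.
  void getOrThrowException() {
    if (!failures.isEmpty()) {
      throw new IllegalStateException(failures.size() + " validation failure(s)");
    }
  }
}
```

The key design point is that addFailure() returns rather than throws, which is what lets a plugin report every invalid property at once instead of one per deploy attempt.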

CDAP plugins override the configurePipeline() method, which is used to configure the stage at deploy time. The same method is called through the validation endpoint as well. To collect multiple validation failures, the FailureCollector is exposed through the stage configurer so that configurations can be validated in this method. Sample usage of the FailureCollector API looks like this:

Adding ValidationFailures to FailureCollector
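A minimal, self-contained sketch of that flow — the stub interfaces below stand in for the CDAP types (getFailureCollector() and addFailure() mirror the real API shape; the config class and property names are illustrative):

```java
class Sketch {
  // Minimal stand-ins for the CDAP ETL API types involved.
  interface FailureCollector {
    void addFailure(String message, String correctiveAction);
  }

  interface StageConfigurer {
    FailureCollector getFailureCollector();
  }

  // Hypothetical source config; "bucket" is an illustrative property.
  static class MySourceConfig {
    String bucket;

    // Validate every property, recording failures instead of throwing,
    // so the user sees all problems in one pass.
    void validate(FailureCollector collector) {
      if (bucket == null || bucket.isEmpty()) {
        collector.addFailure("Bucket name must be specified.",
                             "Provide a valid GCS bucket name.");
      }
    }
  }

  static void configurePipeline(StageConfigurer configurer, MySourceConfig config) {
    FailureCollector collector = configurer.getFailureCollector();
    config.validate(collector);
    // In the real API, the framework throws a ValidationException after
    // this method returns if any failures were recorded.
  }
}
```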

A validation failure is made up of three components:

  • Message — the validation error message
  • Corrective action — an optional action the user can take to correct the error
  • Causes — one or more causes of the validation failure. Each cause can have multiple attributes, which are used to highlight the relevant sections of the plugin in the UI.
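As a rough mental model of those three components (class and field names here are illustrative, not the actual CDAP API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of a validation failure: a message, an optional
// corrective action, and a list of causes. Each cause is a bag of
// attributes that the UI can use to highlight the offending section.
class ValidationFailure {
  final String message;
  final String correctiveAction;                  // may be null
  final List<Map<String, String>> causes = new ArrayList<>();

  ValidationFailure(String message, String correctiveAction) {
    this.message = message;
    this.correctiveAction = correctiveAction;
  }

  // A cause is attached as attributes, e.g. {"stageConfig": "bucket"}.
  ValidationFailure withCause(Map<String, String> attributes) {
    causes.add(attributes);
    return this;   // fluent, so several causes can be chained on one failure
  }
}
```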


In the BigQuery source, if the bucket config contains invalid characters, a new validation failure is added to the collector with a `stageConfig` cause attribute as below:

import java.util.regex.Pattern;

Pattern p = Pattern.compile("[a-z0-9._-]+");
if (!p.matcher(bucket).matches()) {
  collector.addFailure("Allowed characters are lowercase characters, numbers, '.', '_', and '-'.",
                       "Bucket name should only contain allowed characters.")
    .withConfigProperty("bucket");
}

While a ValidationFailure allows plugins to attach a cause with arbitrary attributes, the ValidationFailure API also provides various utility methods to create validation failures with common causes that highlight the appropriate UI sections. Below is the list of common causes and their associated plugin usage:

1. Stage config cause

Purpose: Indicates an error in a stage property

Scenario: User has provided an invalid bucket name for the BigQuery source plugin


collector.addFailure("Allowed characters are lowercase characters, numbers, '.', '_', and '-'.",
  "Bucket name should only contain allowed characters.")
  .withConfigProperty("bucket");

2. Plugin not found cause

Purpose: Indicates a plugin-not-found error

Scenario: User is trying to use a plugin/JDBC driver that has not been deployed


collector.addFailure("Unable to load JDBC driver class 'com.mysql.jdbc.Driver'.",
  "Jar with JDBC driver class 'com.mysql.jdbc.Driver' must be deployed.")
  .withPluginNotFound("driver", "mysql", "jdbc");

3. Config element cause

Purpose: Indicates an error with a single element in the list of values for a given config property

Scenario: User has provided a field to keep in the projection transform that does not exist in the input schema


collector.addFailure("Field to keep 'non_existing_field' does not exist in the input schema.",
  "Field to keep must be present in the input schema.")
  .withConfigElement("keep", "non_existing_field");

4. Input schema field cause

Purpose: Indicates an error in an input schema field

Scenario: User is using the BigQuery sink plugin, which does not support record type fields


collector.addFailure("Input field 'record_field' is of unsupported type.",
  "Field 'record_field' must be of primitive type.")
  .withInputSchemaField("record_field", null);

5. Output schema field cause

Purpose: Indicates an error in an output schema field

Scenario: User has provided an output schema field that does not exist in the BigQuery source table


collector.addFailure("Output field 'non_existing' does not exist in table 'xyz'.",
  "Field 'non_existing' must be present in table 'xyz'.")
  .withOutputSchemaField("non_existing", null);

Cause Associations

While validating plugin configurations, a single validation failure can have multiple causes. Below are a few examples of associated causes:

Example 1:

The database source has username and password as co-dependent properties. If a password is provided but the username is not, the plugin can add a single validation failure with two causes, as below:

collector.addFailure("Missing username",
  "Username and password must be provided.")
  .withConfigProperty("username")
  .withConfigProperty("password");

Example 2:

The projection transform received incompatible input and output schemas for a field, such that the input field cannot be converted to the output field. In that case, a new validation failure can be created with two different causes, as below:

collector.addFailure("Input field 'record_type' cannot be converted to string.",
  "Field 'record_type' must be of primitive type.")
  .withConfigProperty("convert")
  .withInputSchemaField("record_type");


The validation framework reduces pipeline development time by failing fast and providing contextual error messages. It is available in the upcoming CDAP 6.1.1 release to provide a better user experience. Try out CDAP today, and if you would like to explore challenges like this, consider contributing to CDAP.




CDAP is a 100% open-source framework for building data analytics applications.

Vinisha Shah

Software Engineer, Google
