Validation Framework for CDAP Plugins

Vinisha Shah
Jan 6, 2020 · 4 min read

CDAP provides an interactive UI for building data pipelines that apply code-free transformations to data. A CDAP data pipeline is a directed acyclic graph whose nodes are plugins and whose connections represent data flow. Each plugin in the pipeline is configured by providing configuration properties along with input and output schemas.

A CDAP data pipeline solving a real-world use case can easily contain ten or more nodes. While building such pipelines, developers can provide invalid plugin configurations or schemas. For example, the BigQuery sink plugin can have an output schema that does not match the underlying BigQuery table, or the GCS source can have an invalid bucket name.

When a deployed pipeline fails due to an invalid configuration, a common development flow is to check the logs to identify the invalid configuration, clone the pipeline, fix it, and run it again. This iterative process increases pipeline development time; imagine how long it would take to build a pipeline with 10-20 nodes. During this process, pipeline developers also have to dig through technical logs, stack traces, or even the code base to identify problems such as a NullPointerException.

In the upcoming CDAP 6.1.1 release, a new validation framework has been introduced to fail fast and collect all validation failures. The framework also exposes a FailureCollector API to surface contextual error messages in the CDAP UI.

Let's see what this framework provides and how it can be used in plugins to surface contextual error messages.

The validation framework should be able to collect multiple error messages in order to provide a better user experience while building a pipeline. To do that, the framework exposes the FailureCollector APIs below:

FailureCollector.java

Error collection using the FailureCollector API
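The shape of the contract can be sketched as follows. This is a simplified, self-contained stand-in for CDAP's FailureCollector (the real interface lives in io.cdap.cdap.etl.api.validation); the method names `addFailure` and `getOrThrowException` follow the real API, but this toy in-memory implementation is for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for CDAP's ValidationFailure: an error message
// plus a suggested corrective action.
class ValidationFailure {
  final String message;
  final String correctiveAction;

  ValidationFailure(String message, String correctiveAction) {
    this.message = message;
    this.correctiveAction = correctiveAction;
  }
}

// Illustrative stand-in for CDAP's FailureCollector contract.
class SimpleFailureCollector {
  private final List<ValidationFailure> failures = new ArrayList<>();

  // Records a failure and returns it, so causes can be chained onto it.
  ValidationFailure addFailure(String message, String correctiveAction) {
    ValidationFailure failure = new ValidationFailure(message, correctiveAction);
    failures.add(failure);
    return failure;
  }

  // Called once validation is done: throws if any failures were collected,
  // so all problems surface in a single pass instead of one at a time.
  void getOrThrowException() {
    if (!failures.isEmpty()) {
      throw new IllegalStateException(failures.size() + " validation failure(s)");
    }
  }

  List<ValidationFailure> getFailures() {
    return failures;
  }
}
```

The key design point is that `addFailure` does not throw; failures accumulate, and the single `getOrThrowException` call at the end reports them all together.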

CDAP plugins override the configurePipeline() method, which is used to configure the stage at deploy time. The same method is called through the validation endpoint as well. In order to collect multiple validation failures, the FailureCollector API is exposed through the stage configurer so that configurations can be validated in this method. Sample usage of the FailureCollector API looks like this:

Adding ValidationFailures to FailureCollector
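The flow can be sketched as below. `FailureCollector` and `StageConfigurer` here are simplified, self-contained stand-ins for the CDAP interfaces of the same names, and `ToyBucketSource` is a hypothetical plugin; only the calls used in the sketch are modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for CDAP's FailureCollector: collects messages,
// throws once at the end if anything was collected.
class FailureCollector {
  private final List<String> messages = new ArrayList<>();

  FailureCollector addFailure(String message, String correctiveAction) {
    messages.add(message);
    return this;
  }

  void getOrThrowException() {
    if (!messages.isEmpty()) {
      throw new IllegalArgumentException(
          messages.size() + " validation failure(s): " + messages);
    }
  }
}

// Minimal stand-in for CDAP's StageConfigurer, which hands the
// collector to the plugin.
class StageConfigurer {
  private final FailureCollector collector = new FailureCollector();

  FailureCollector getFailureCollector() {
    return collector;
  }
}

// Hypothetical source plugin that validates its config in
// configurePipeline(), collecting every problem before throwing once.
class ToyBucketSource {
  private final String bucket;

  ToyBucketSource(String bucket) {
    this.bucket = bucket;
  }

  void configurePipeline(StageConfigurer configurer) {
    FailureCollector collector = configurer.getFailureCollector();
    if (bucket == null || bucket.isEmpty()) {
      collector.addFailure("Bucket name must be specified.",
                           "Provide a bucket name.");
    } else if (!bucket.matches("[a-z0-9._-]+")) {
      collector.addFailure("Bucket name '" + bucket + "' contains invalid characters.",
                           "Use only lowercase letters, numbers, '.', '_', and '-'.");
    }
    // Throw once at the end so all failures are reported together.
    collector.getOrThrowException();
  }
}
```

Because the collector defers throwing until the end of configurePipeline(), a plugin with several invalid properties reports all of them in one validation pass.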

A validation failure is made up of three components: an error message describing the failure, a corrective action suggesting a fix, and one or more causes pointing at what triggered the failure.

Example:

In the BigQuery source, if the bucket config contains invalid characters, a new validation failure is added to the collector with a stage config cause attribute, as below:

Pattern p = Pattern.compile("[a-z0-9._-]+");
if (!p.matcher(bucket).matches()) {
  collector.addFailure(
      "Allowed characters are lowercase letters, numbers, '.', '_', and '-'.",
      "Bucket name should only contain allowed characters.")
    .withConfigProperty("bucket");
}

While a ValidationFailure allows plugins to add a cause with arbitrary attributes, the ValidationFailure API also provides various utility methods to create validation failures with common causes that can be used to highlight the appropriate UI sections. Below is the list of common causes and their associated plugin usage:

1. Stage config cause

Purpose: Indicates an error in the stage property

Scenario: User has provided an invalid bucket name for the BigQuery source plugin

Example:

collector.addFailure(
    "Allowed characters are lowercase letters, numbers, '.', '_', and '-'.",
    "Bucket name should only contain allowed characters.")
  .withConfigProperty("bucket");

2. Plugin not found cause

Purpose: Indicates a plugin not found error

Scenario: User is trying to use a plugin/JDBC driver that has not been deployed

Example:

collector.addFailure(
    "Unable to load JDBC driver class 'com.mysql.jdbc.Driver'.",
    "Jar with JDBC driver class 'com.mysql.jdbc.Driver' must be deployed.")
  .withPluginNotFound("driver", "mysql", "jdbc");

3. Config element cause

Purpose: Indicates an error in a single element of the list of values for a given config property

Scenario: User has provided a field to keep in the Projection transform that does not exist in the input schema

Example:

collector.addFailure(
    "Field to keep 'non_existing_field' does not exist in the input schema.",
    "Field to keep must be present in the input schema.")
  .withConfigElement("keep", "non_existing_field");

4. Input schema field cause

Purpose: Indicates an error in an input schema field

Scenario: User is using the BigQuery sink plugin with an input schema that contains a record field, which is not a supported type

Example:

collector.addFailure(
    "Input field 'record_field' is of unsupported type.",
    "Field 'record_field' must be of primitive type.")
  .withInputSchemaField("record_field", null);

5. Output schema field cause

Purpose: Indicates an error in an output schema field

Scenario: User has provided an output schema field that does not exist in the BigQuery source table

Example:

collector.addFailure(
    "Output field 'non_existing' does not exist in table 'xyz'.",
    "Field 'non_existing' must be present in table 'xyz'.")
  .withOutputSchemaField("non_existing", null);

Cause Associations

While validating plugin configurations, a single validation failure can have multiple causes. Below are a few examples of associated causes:

Example 1:

The Database source has username and password as co-dependent properties. If a password is provided but the username is not, the plugin can add a single validation failure with two causes, as below:

collector.addFailure(
    "Missing username.",
    "Username and password must both be provided.")
  .withConfigProperty("username")
  .withConfigProperty("password");

Example 2:

The Projection transform received incompatible input and output schemas for a field, such that the input field cannot be converted to the output field. In that case, a new validation failure can be created with two different causes, as below:

collector.addFailure(
    "Input field 'record_type' cannot be converted to string.",
    "Field 'record_type' must be of primitive type.")
  .withConfigProperty("convert")
  .withInputSchemaField("record_type");

Summary

The validation framework reduces pipeline development time by failing fast and providing contextual error messages. It will be available in the upcoming CDAP 6.1.1 release to provide a better user experience. Try out CDAP today, and if you would like to explore such exciting challenges, consider contributing to CDAP.

cdapio

