Validation Framework for CDAP Plugins

Vinisha Shah
Jan 6

CDAP provides an interactive UI for building data pipelines that apply code-free transformations to data. A CDAP data pipeline is a directed acyclic graph whose nodes are plugins and whose edges represent the flow of data between them. Each plugin in the pipeline is configured by providing configuration properties along with its input and output schemas.

A CDAP data pipeline solving a real-world use case can easily contain ten or more nodes. While building such pipelines, developers can provide invalid plugin configurations or schemas. For example, a BigQuery sink plugin may be given an output schema that does not match the underlying BigQuery table, or a GCS source may be given an invalid bucket name.

When a deployed pipeline fails due to invalid configurations, the typical development flow is to check the logs to identify the invalid configuration, then clone the pipeline, fix it, and run it again. This iterative process increases pipeline development time; imagine how long it takes to debug a pipeline with 10–20 nodes this way. Worse, pipeline developers have to dig through technical logs, stack traces, or even the code base to track down problems such as a NullPointerException.

In the upcoming CDAP 6.1.1 release, a new validation framework has been introduced to fail fast and collect all validation failures at once. The framework also exposes a FailureCollector API to surface contextual error messages in the CDAP UI.

Let's see what this framework provides and how plugins can use it to surface contextual error messages.

To provide a better user experience while building a pipeline, the validation framework must be able to collect multiple error messages rather than stopping at the first one. To that end, the framework exposes the following FailureCollector API:

FailureCollector.java

Error collection using the FailureCollector API
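The original embedded code is not reproduced here, so the following is a hypothetical sketch of the FailureCollector contract, with a trivial in-memory implementation for illustration. The real interface lives in the io.cdap.cdap.etl.api.validation package, and its exact signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of CDAP's FailureCollector contract.
interface FailureCollector {
  // Records a failure with a message and an optional corrective action,
  // and returns it so the caller can attach causes fluently.
  ValidationFailure addFailure(String message, String correctiveAction);

  // Throws if any failures were collected; otherwise a no-op. Plugins call
  // this after validation so all failures surface in a single pass.
  void getOrThrowException();
}

class ValidationFailure {
  final String message;
  final String correctiveAction;

  ValidationFailure(String message, String correctiveAction) {
    this.message = message;
    this.correctiveAction = correctiveAction;
  }
}

// Minimal in-memory collector used for illustration only.
class SimpleFailureCollector implements FailureCollector {
  final List<ValidationFailure> failures = new ArrayList<>();

  @Override
  public ValidationFailure addFailure(String message, String correctiveAction) {
    ValidationFailure failure = new ValidationFailure(message, correctiveAction);
    failures.add(failure);
    return failure;
  }

  @Override
  public void getOrThrowException() {
    if (!failures.isEmpty()) {
      throw new IllegalStateException(failures.size() + " validation failure(s)");
    }
  }
}
```

The key design point is that addFailure records the problem and keeps going, so one validation pass can report every misconfiguration at once.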

CDAP plugins override the configurePipeline() method, which configures the stage at deploy time. The same method is also called through the validation endpoint. To collect multiple validation failures, the FailureCollector API is exposed through the stage configurer so that configurations can be validated in this method. Sample usage of the FailureCollector API looks like this:

Adding ValidationFailures to the FailureCollector
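Since the original sample is not shown above, here is a hypothetical sketch of the pattern with stand-in types. In a real plugin the collector is obtained from the stage configurer inside configurePipeline(); the config class and property names below are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the framework's collector; a real plugin obtains one from
// the stage configurer inside configurePipeline().
class MockCollector {
  final List<String> messages = new ArrayList<>();

  MockCollector addFailure(String message, String correctiveAction) {
    messages.add(message);
    return this;
  }
}

// Hypothetical plugin config showing the validate-everything pattern:
// every problem is reported, not just the first one encountered.
class GcsSourceConfig {
  String bucket;
  String path;

  void validate(MockCollector collector) {
    if (bucket == null || bucket.isEmpty()) {
      collector.addFailure("Bucket must be specified.", "Provide a bucket name.");
    }
    if (path == null || path.isEmpty()) {
      collector.addFailure("Path must be specified.", "Provide a path to read from.");
    }
  }
}
```

With two missing properties, a single validation pass yields two failures instead of failing on the first.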

A validation failure is made up of three components:

  • Message — the validation error message
  • Corrective action — an optional action the user can take to correct the error
  • Causes — one or more causes of the validation failure. Each cause can carry multiple attributes, which the UI uses to highlight the relevant sections of the plugin.
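The three components above can be modeled roughly as follows. This is an illustrative sketch only, not the real io.cdap.cdap.etl.api.validation.ValidationFailure class; the "stageConfig" attribute name matches the cause attribute mentioned in the example below.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A cause is a bag of string attributes that the UI maps to highlights.
class Cause {
  final Map<String, String> attributes = new HashMap<>();

  Cause addAttribute(String attribute, String value) {
    attributes.put(attribute, value);
    return this;
  }
}

// Illustrative model of a validation failure's three components.
class Failure {
  final String message;          // what went wrong
  final String correctiveAction; // optional: how the user can fix it
  final List<Cause> causes = new ArrayList<>();

  Failure(String message, String correctiveAction) {
    this.message = message;
    this.correctiveAction = correctiveAction;
  }

  // Convenience methods like withConfigProperty are sugar over a cause
  // with a well-known attribute that the UI knows how to highlight.
  Failure withConfigProperty(String property) {
    causes.add(new Cause().addAttribute("stageConfig", property));
    return this;
  }
}
```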

Example:

In the BigQuery source, if the bucket config contains invalid characters, a new validation failure is added to the collector with a `stageConfig` cause attribute, as below:

Pattern p = Pattern.compile("[a-z0-9._-]+");
if (!p.matcher(bucket).matches()) {
  collector.addFailure(
      "Allowed characters are lowercase letters, numbers, '.', '_', and '-'.",
      "Bucket name should only contain allowed characters.")
      .withConfigProperty("bucket");
}

While a ValidationFailure allows plugins to attach a cause with arbitrary attributes, the API also provides utility methods for creating validation failures with common causes, which the UI uses to highlight the appropriate sections. The common causes and their plugin usage are listed below.

withConfigProperty

Purpose: Indicates an error in a stage configuration property

Scenario: The user has provided an invalid bucket name for the BigQuery source plugin

Example:

collector.addFailure(
    "Allowed characters are lowercase letters, numbers, '.', '_', and '-'.",
    "Bucket name should only contain allowed characters.")
    .withConfigProperty("bucket");

withPluginNotFound

Purpose: Indicates that a plugin was not found

Scenario: The user is trying to use a plugin or JDBC driver that has not been deployed

Example:

collector.addFailure(
    "Unable to load JDBC driver class 'com.mysql.jdbc.Driver'.",
    "Jar with JDBC driver class 'com.mysql.jdbc.Driver' must be deployed.")
    .withPluginNotFound("driver", "mysql", "jdbc");

withConfigElement

Purpose: Indicates an error in a single element of a list-valued config property

Scenario: The user has provided a field to keep in the Projection transform that does not exist in the input schema

Example:

collector.addFailure(
    "Field to keep 'non_existing_field' does not exist in the input schema.",
    "Field to keep must be present in the input schema.")
    .withConfigElement("keep", "non_existing_field");

withInputSchemaField

Purpose: Indicates an error in an input schema field

Scenario: The user is using the BigQuery sink plugin, which does not support record fields

Example:

collector.addFailure(
    "Input field 'record_field' is of unsupported type.",
    "Field 'record_field' must be of primitive type.")
    .withInputSchemaField("record_field", null);

withOutputSchemaField

Purpose: Indicates an error in an output schema field

Scenario: The user has provided an output schema field that does not exist in the BigQuery source table

Example:

collector.addFailure(
    "Output field 'non_existing' does not exist in table 'xyz'.",
    "Field 'non_existing' must be present in table 'xyz'.")
    .withOutputSchemaField("non_existing", null);

While validating plugin configurations, a single validation failure can also have multiple causes. Below are a few examples:

Example 1:

The database source has username and password as co-dependent properties. If a password is provided but the username is not, the plugin can add a single validation failure with two causes, as below:

collector.addFailure(
    "Missing username.",
    "Username and password must both be provided.")
    .withConfigProperty("username").withConfigProperty("password");

Example 2:

The Projection transform received incompatible input and output schemas for a field, such that the input field cannot be converted to the output field. In that case, a single validation failure can be created with two different causes, as below:

collector.addFailure(
    "Input field 'record_type' can not be converted to string.",
    "Field 'record_type' must be of primitive type.")
    .withConfigProperty("convert").withInputSchemaField("record_type", null);

Summary

The validation framework reduces pipeline development time by failing fast and providing contextual error messages. It is available in the upcoming CDAP 6.1.1 release. Try out CDAP today, and if you would like to explore challenges like this one, consider contributing to CDAP.

Written by Vinisha Shah, Software Engineer, Google

CDAP is a 100% open-source framework for building data analytics applications.