A Definitive Guide to Wrangler User Defined Directives (UDD)

Nitin Motgi
Sep 3 · 5 min read
CDAP Wrangler User Interface

CDAP Wrangler makes it delightful to transform, cleanse, standardize, harmonize, and enrich data, and to run data quality (DQ) checks, in a code-free manner within data pipelines[1]. While Wrangler provides a ton of built-in functions and Directives to manipulate data, there will always be gaps. To fill those gaps, Wrangler provides an extensible framework, User Defined Directives (UDD), that lets you define custom directives for manipulating data.

UDDs are similar to User-defined Functions (UDFs), which have a long history of usefulness in SQL-derived languages and other data processing and query systems. While SQL is rich in expressiveness, there is just no way anyone can anticipate everything a user wants to do. Thus, the custom UDF has become commonplace in our data manipulation toolbox. Similarly, to support customization, Wrangler provides the ability to build your own functions for manipulating data through UDDs.

In this blog, we take you through the steps on how to build, test and deploy a Wrangler UDD.

Concepts

Directive

A Directive is a single data manipulation instruction, specified to either transform, filter, or pivot a single record into zero or more records. Each visual interaction on the data adds a Directive to the Recipe. You can also use the SDK to build custom Directives.

Recipe

A Recipe is a collection of one or more Directives. The Recipe is then executed by a Transform type plugin within a pipeline.

Record

A Record is a collection of field names and field values.

Column

A Column is a data value of any of the supported Java types, one for each record.
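
To make these concepts concrete, here is a minimal sketch in plain Java. It is not the Wrangler API: the names ConceptSketch, SketchDirective, record, uppercase, dropIfEmpty, and runRecipe are hypothetical stand-ins, and a record is modeled as a simple map of field names to values. It only illustrates the idea that a directive turns each record into zero or more records and that a recipe applies directives in order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-ins for the concepts above; none of this is the Wrangler API.
class ConceptSketch {

  // A directive maps a list of records to a new list of records.
  interface SketchDirective
      extends Function<List<Map<String, Object>>, List<Map<String, Object>>> { }

  // A record: field names mapped to field values, kept in insertion order.
  static Map<String, Object> record(String field, Object value) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put(field, value);
    return row;
  }

  // Transform: one record in, one record out, with a string field upper-cased.
  static SketchDirective uppercase(String field) {
    return rows -> {
      List<Map<String, Object>> out = new ArrayList<>();
      for (Map<String, Object> row : rows) {
        Map<String, Object> copy = new LinkedHashMap<>(row);
        Object value = copy.get(field);
        if (value instanceof String) {
          copy.put(field, ((String) value).toUpperCase(Locale.ROOT));
        }
        out.add(copy);
      }
      return out;
    };
  }

  // Filter: one record in, zero records out when the field is null or empty.
  static SketchDirective dropIfEmpty(String field) {
    return rows -> {
      List<Map<String, Object>> out = new ArrayList<>();
      for (Map<String, Object> row : rows) {
        Object value = row.get(field);
        if (value != null && !value.toString().isEmpty()) {
          out.add(row);
        }
      }
      return out;
    };
  }

  // A recipe: directives applied in order, each one feeding the next.
  static List<Map<String, Object>> runRecipe(List<SketchDirective> recipe,
                                             List<Map<String, Object>> rows) {
    List<Map<String, Object>> current = rows;
    for (SketchDirective directive : recipe) {
      current = directive.apply(current);
    }
    return current;
  }
}
```

Running a recipe of dropIfEmpty("name") followed by uppercase("name") over two records, one of them with an empty name, leaves a single record whose name is upper-cased.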

UDD APIs

User Defined Directives (UDDs), or Custom Directives, are an easy way for users to build custom directives and integrate them with Wrangler.

Building a custom directive involves implementing four simple methods:

  • define(): Defines how the framework should interpret the arguments.
  • initialize(): Invoked by the framework to initialize the custom directive with the parsed arguments.
  • execute(): Executes and applies your business logic for transforming the Row.
  • destroy(): Invoked by the framework to release any resources held by the directive.

@Plugin(type = Directive.TYPE)
@Name("MyDirective")
@Description("My directive's description")
@Categories(categories = {"category1", "category2"})
public final class MyDirective implements Directive {
  @Override
  public UsageDefinition define() {
    . . .
  }

  @Override
  public void initialize(Arguments args) throws DirectiveParseException {
    . . .
  }

  @Override
  public List<Row> execute(List<Row> rows, ExecutorContext context) {
    . . .
  }

  @Override
  public void destroy() {
    . . .
  }
}
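
The lifecycle implied by these four methods can be sketched in plain Java. The following is only an illustration under stated assumptions: MiniDirective, AppendSuffix, and run are hypothetical names, rows are plain strings, and the driver assumes the framework calls the methods in the order listed above (define, initialize, execute per batch, destroy). It does not use or reproduce the Wrangler API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins that mirror the four-method shape; not the Wrangler API.
class LifecycleSketch {

  interface MiniDirective {
    String define();                            // describes the expected arguments
    void initialize(Map<String, String> args);  // called once with parsed arguments
    List<String> execute(List<String> rows);    // called for each batch of rows
    void destroy();                             // called once to release resources
  }

  // Example directive that appends a suffix and records each lifecycle call.
  static final class AppendSuffix implements MiniDirective {
    final List<String> calls = new ArrayList<>();
    private String suffix = "";

    @Override
    public String define() {
      calls.add("define");
      return "append-suffix suffix";
    }

    @Override
    public void initialize(Map<String, String> args) {
      calls.add("initialize");
      suffix = args.getOrDefault("suffix", "");
    }

    @Override
    public List<String> execute(List<String> rows) {
      calls.add("execute");
      List<String> out = new ArrayList<>();
      for (String row : rows) {
        out.add(row + suffix);
      }
      return out;
    }

    @Override
    public void destroy() {
      calls.add("destroy");
    }
  }

  // Assumed driver: define, initialize, execute per batch, then always destroy.
  static List<String> run(MiniDirective directive, Map<String, String> args,
                          List<List<String>> batches) {
    directive.define();
    directive.initialize(args);
    List<String> out = new ArrayList<>();
    try {
      for (List<String> batch : batches) {
        out.addAll(directive.execute(batch));
      }
    } finally {
      directive.destroy();
    }
    return out;
  }
}
```

The try/finally in run is a design choice of the sketch: it guarantees destroy() runs even if execute() throws, which is the behavior you would want from any code that holds resources.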

UDD Annotation

A UDD is uniquely identified by a type, name, and the artifact (jar) it came from. Type and name are determined by annotations.

@Plugin(type = Directive.TYPE)
@Name("Directive")
@Description("One liner description of directive")
@Categories(categories = {"category1", "category2"})
public class MyDirective implements Directive {
  . . .
}

  • @Plugin: With type Directive.TYPE, declares the class a Directive
  • @Name: Identifies the directive with a unique name
  • @Description: Short description of the directive
  • @Categories: Specifies the categories the directive belongs to (optional)

Defining UDD Arguments

When a UDD requires arguments, define them in define().

@Plugin(type = Directive.TYPE)
@Name("Directive")
public class MyDirective implements Directive {
  @Override
  public UsageDefinition define() {
    UsageDefinition.Builder builder = UsageDefinition.builder("Directive");
    builder.define("param-1", TokenType.TEXT, Optional.TRUE);
    . . .
    return builder.build();
  }
}
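
To illustrate what declaring arguments as required or optional buys you, here is a small self-contained sketch. UsageSketch and its methods are hypothetical and are not Wrangler's UsageDefinition; token types are left out for brevity. It only shows the general idea: a declared parameter list can produce a usage string and can be checked against the arguments a user actually supplied.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a usage definition; not Wrangler's UsageDefinition.
class UsageSketch {
  private final String directive;
  // Parameter name mapped to "is optional", kept in declaration order.
  private final Map<String, Boolean> optionalByName = new LinkedHashMap<>();

  UsageSketch(String directive) {
    this.directive = directive;
  }

  // Declares a parameter; returns this so declarations can be chained.
  UsageSketch define(String name, boolean optional) {
    optionalByName.put(name, optional);
    return this;
  }

  // Renders a usage string, with optional parameters in square brackets.
  String usage() {
    StringBuilder sb = new StringBuilder(directive);
    for (Map.Entry<String, Boolean> entry : optionalByName.entrySet()) {
      sb.append(' ');
      sb.append(entry.getValue() ? "[" + entry.getKey() + "]" : entry.getKey());
    }
    return sb.toString();
  }

  // Lists the required parameters that are absent from the supplied arguments.
  List<String> missingRequired(Map<String, ?> supplied) {
    List<String> missing = new ArrayList<>();
    for (Map.Entry<String, Boolean> entry : optionalByName.entrySet()) {
      if (!entry.getValue() && !supplied.containsKey(entry.getKey())) {
        missing.add(entry.getKey());
      }
    }
    return missing;
  }
}
```

For example, new UsageSketch("my-directive").define("source", false).define("limit", true) describes one required and one optional parameter, and missingRequired reports "source" if only "limit" is supplied.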

Parameter Types

Directive Initialization

initialize() is invoked when the directive is initialized at runtime. Set up any expensive resources here, once, rather than on every call to execute().

@Plugin(type = Directive.TYPE)
@Name("Directive")
public class MyDirective implements Directive {
  @Override
  public void initialize(Arguments args) throws DirectiveParseException {
    ColumnName source = args.value("p1");
    NumericList numbers = args.value("p2");
    . . .
  }
}
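
The advice about expensive resources can be sketched with a plain Java class. InitOnceSketch is a hypothetical stand-in, not a Wrangler Directive: it compiles a regular expression once in initialize() and reuses the compiled Pattern for every row in execute(), instead of recompiling it per row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in showing "build once, reuse per row"; not the Wrangler API.
class InitOnceSketch {
  private Pattern pattern; // built once in initialize(), reused by execute()

  // Compiles the expression a single time, up front.
  void initialize(String regex) {
    this.pattern = Pattern.compile(regex);
  }

  // Keeps only the rows in which the precompiled pattern is found.
  List<String> execute(List<String> rows) {
    if (pattern == null) {
      throw new IllegalStateException("initialize() must be called before execute()");
    }
    List<String> kept = new ArrayList<>();
    for (String row : rows) {
      if (pattern.matcher(row).find()) {
        kept.add(row);
      }
    }
    return kept;
  }
}
```

The same pattern applies to anything costly to construct, such as lookup tables or client connections: build it during initialization and keep only the per-row work in execute().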

Transform Record / Row

execute() is invoked on a set of records or rows that need to be transformed. This is the API where rows are transformed based on the functionality of the directive.

@Plugin(type = Directive.TYPE)
@Name("Directive")
public class MyDirective implements Directive {
  . . .
  @Override
  public List<Row> execute(List<Row> rows, ExecutorContext context)
      throws DirectiveExecutionException, ErrorRowException {
    . . .
  }
}
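
As a rough illustration of the kind of logic that lives in execute(), the sketch below turns one input row into several output rows. It is written against plain maps rather than Wrangler's Row class, and ExecuteSketch with its column and delimiter parameters is hypothetical. Rows that do not carry a string value in the target column are passed through unchanged, which is one possible policy among several.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical execute-style logic on map-based rows; not the Wrangler API.
class ExecuteSketch {

  // Emits one output row per delimited value found in the given column.
  static List<Map<String, Object>> execute(List<Map<String, Object>> rows,
                                           String column, String delimiter) {
    List<Map<String, Object>> results = new ArrayList<>();
    for (Map<String, Object> row : rows) {
      Object value = row.get(column);
      if (!(value instanceof String)) {
        results.add(row); // nothing to split: pass the row through as-is
        continue;
      }
      // Quote the delimiter so it is treated literally, not as a regex.
      for (String part : ((String) value).split(Pattern.quote(delimiter))) {
        Map<String, Object> copy = new LinkedHashMap<>(row);
        copy.put(column, part.trim());
        results.add(copy);
      }
    }
    return results;
  }
}
```

Given a row with body set to "A,B,C" and col2 set to "X", splitting on "," yields three rows, each keeping col2 and carrying one of A, B, or C in body.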

UDD Testing

Wrangler provides a testing framework for directives, and tests can be integrated with any testing framework such as JUnit. The steps are as follows:

  • Include the Maven artifact io.cdap.wrangler:wrangler-test
<dependency>
  <groupId>io.cdap.wrangler</groupId>
  <artifactId>wrangler-test</artifactId>
  <version>${wrangler.version}</version>
  <scope>test</scope>
</dependency>
  • Create a TestRecipe instance
TestRecipe recipe = new TestRecipe();
  • Add Directives to the recipe
recipe.add("parse-as-csv :body");
recipe.add("drop :body");
recipe.add("my-directive :name");
  • Create a TestRows instance
TestRows rows = new TestRows();
  • Add test rows
rows.add(new Row("body", "A,B,C").add("col2", "X"));
rows.add(new Row("body", "X,Y,Z").add("col2", "C"));
  • Create a RecipePipeline instance to create the recipe execution pipeline
RecipePipeline pipeline = TestingRig.pipeline(MyDirective.class, recipe);

If the test needs to include multiple directives, specify an array of classes.

  • Execute the pipeline
List<Row> actuals = pipeline.execute(rows.toList());
  • Validate the rows returned
Assert.assertEquals(2, actuals.size());
Assert.assertEquals("val1", actuals.get(0).getValue("value"));

Full Example of UDD Testing

import java.util.List;
import org.junit.Assert;
import org.junit.Test;
// Wrangler imports (Row, RecipePipeline, TestingRig, TestRecipe, TestRows) go here.

public class SimpleUDDTest {
  @Test
  public void testSimpleUDD() throws Exception {
    TestRecipe recipe = new TestRecipe();
    recipe.add("parse-as-csv :body ',';");
    recipe.add("drop :body;");
    recipe.add("rename :body_1 :simpledata;");
    recipe.add("my-directive ...");

    TestRows rows = new TestRows();
    rows.add(new Row("body", "root,joltie,mars avenue"));
    RecipePipeline pipeline = TestingRig.pipeline(MyDirective.class, recipe);
    List<Row> actual = pipeline.execute(rows.toList());
    // The expected count depends on how many rows my-directive emits.
    Assert.assertEquals(2, actual.size());
  }
}

Bundling UDD

The UDD is bundled using the Apache Felix Maven Bundle Plugin. Specify the package of the UDD to be exposed in _exportcontents.

<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <version>3.5.1</version>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <Embed-Dependency>*;inline=false;scope=compile</Embed-Dependency>
      <Embed-Transitive>true</Embed-Transitive>
      <Embed-Directory>lib</Embed-Directory>
      <!--Only @Plugin classes in the export packages will be included as plugin-->
      <_exportcontents>my.company.directives.*;</_exportcontents>
    </instructions>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>bundle</goal>
      </goals>
    </execution>
  </executions>
</plugin>

Building a UDD

  • Build using Maven
mvn clean package
  • The packaged JAR and JSON are placed in the target/ directory
target/my-directive-1.0.0-SNAPSHOT.jar
target/my-directive-1.0.0-SNAPSHOT.json

Packaging UDD

  • User Defined Directives are packaged the same way as any other CDAP plugin
  • Package using the CDAP Maven Plugin (io.cdap:cdap-maven-plugin)
<plugin>
  <groupId>io.cdap</groupId>
  <artifactId>cdap-maven-plugin</artifactId>
  <version>1.1.0</version>
  <configuration>
    <cdapArtifacts>
      <parent>system:wrangler-transform[4.0.0,5.0.0)</parent>
      <parent>system:wrangler-service[4.0.0,5.0.0)</parent>
    </cdapArtifacts>
  </configuration>
  <executions>
    <execution>
      <id>create-artifact-config</id>
      <phase>prepare-package</phase>
      <goals>
        <goal>create-plugin-json</goal>
      </goals>
    </execution>
  </executions>
</plugin>

Deploying UDD

There are multiple ways to deploy a custom directive to CDAP. The two most popular are the CDAP CLI (command line interface) and the CDAP UI.

CDAP CLI

To deploy the directive through the CLI, start the CDAP CLI and use the load artifact command to load the plugin artifact into CDAP.

$ $CDAP_HOME/bin/cdap cli
cdap > load artifact my-directive-1.0.0-SNAPSHOT.jar config-file my-directive-1.0.0-SNAPSHOT.json

Alternatively, you can use the CDAP UI to deploy the directive.

Deploying Wrangler UDD using CDAP UI

Conclusion

Custom directives make it easy to extend the functionality of Wrangler. They also provide an easier way to standardize data transformations within an organization. So give it a try, and deploy one in CDAP as well as in Cloud Data Fusion.

cdapio

CDAP is a 100% open-source framework for building data analytics applications

Written by Nitin Motgi

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company’s long-term technology and driving company engineering initiatives.
