<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[cdapio - Medium]]></title>
        <description><![CDATA[CDAP is a 100% open-source framework for building data analytics applications - Medium]]></description>
        <link>https://medium.com/cdapio?source=rss----52cb94a9300a---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>cdapio - Medium</title>
            <link>https://medium.com/cdapio?source=rss----52cb94a9300a---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 10:05:35 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/cdapio" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Announcing CDAP 6.2.0 Release]]></title>
            <link>https://medium.com/cdapio/announcing-cdap-6-2-0-release-584f85e40775?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/584f85e40775</guid>
            <category><![CDATA[data-integration]]></category>
            <category><![CDATA[data-analytics]]></category>
            <category><![CDATA[sso]]></category>
            <category><![CDATA[cdap]]></category>
            <dc:creator><![CDATA[Edwin Elia]]></dc:creator>
            <pubDate>Mon, 01 Jun 2020 13:00:08 GMT</pubDate>
            <atom:updated>2020-06-01T15:42:34.631Z</atom:updated>
            <content:encoded><![CDATA[<p>On behalf of the CDAP community, it is my pleasure to announce the release of CDAP version 6.2.0. This release introduces Replication, an easy way to replicate changes from transactional databases into analytical data warehouses. It also enhances the Google Cloud Dataproc runtime provisioner to use the native Google Cloud Dataproc job APIs. Additionally, it includes a few improvements to the Pipeline Studio that enhance the user experience of building pipelines.</p><h3>Replication</h3><p>Replication allows users to create replication pipelines easily. The user interface guides users through the steps of configuring the source database and then selecting the tables and columns from the database to be replicated. Once users have finished adding the target configuration, the system will run an assessment of the configuration to determine whether there are any potential issues that need to be addressed before deploying the pipeline. The assessment stage also reports on possible issues during replication, including data type mappings between the source and target databases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jtTCMvw6nBRs5T3M" /><figcaption>Select tables and columns to replicate</figcaption></figure><h3>Google Cloud Dataproc Runtime Improvement</h3><p>Previously, the Google Cloud Dataproc runtime used SSH for job submission, which required port 22 to be open in the environment running CDAP. With this improvement, job submission uses the native Google Cloud Dataproc APIs, so port 22 no longer needs to be open.</p><h3><strong>Pipeline Studio Improvements</strong></h3><p>Users can now select multiple plugins by dragging and making selections. Once the plugins are selected, users can move, copy, or delete them. Additionally, right-clicking is now possible in the Pipeline Studio canvas. By right-clicking, users can add a new Wrangler connection or perform common actions such as zooming and aligning the plugins.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ClICNq9rgrmd1EIE" /><figcaption>Right click on the canvas to open the menu</figcaption></figure><p><a href="https://cdap.io/get-started">Download CDAP 6.2.0</a> today and take it for a spin! Also consider helping us develop the platform by <a href="mailto:cdap-user@googlegroups.com">reaching out to the community</a> with any comments, feedback, suggestions, or improvements, or by creating and following <a href="https://issues.cask.co/browse/CDAP">JIRA issues</a> and submitting <a href="https://github.com/cdapio/cdap">pull requests</a>.</p><p>For Hadoop distribution packages, you can build them from the following repositories:</p><ul><li><a href="https://github.com/cdapio/cm_csd">Cloudera</a></li><li><a href="https://github.com/cdapio/cdap-ambari-service">Ambari</a></li></ul><hr><p><a href="https://medium.com/cdapio/announcing-cdap-6-2-0-release-584f85e40775">Announcing CDAP 6.2.0 Release</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[CI/CD and Change Management for Pipelines — Part 3]]></title>
            <link>https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-3-be2e217b897f?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/be2e217b897f</guid>
            <category><![CDATA[gcp]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[ci-cd-pipeline]]></category>
            <category><![CDATA[data-integration]]></category>
            <category><![CDATA[cdap]]></category>
            <dc:creator><![CDATA[Tony Hajdari]]></dc:creator>
            <pubDate>Wed, 08 Apr 2020 13:15:40 GMT</pubDate>
            <atom:updated>2020-04-08T13:15:39.957Z</atom:updated>
            <content:encoded><![CDATA[<h3>CI/CD and Change Management for Pipelines — Part 3</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DPTdj9QxtjhZPhQ_Z7kNMw.png" /></figure><p>Welcome to the third installment of this four-part series. In the <a href="https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-1-1b4100aef66a">first article</a> I discussed some of the concepts related to continuous integration and testing. In the <a href="https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-2-a286e806c2f2">second article</a> we got into some hands-on examples for extracting pipelines from CDF/CDAP and used GitHub as a repository for storing pipelines and related artifacts.</p><p>In this article we’ll discuss the process of migrating artifacts from GitHub into a TEST, QA, or PROD environment, and explore automation options by leveraging the API more broadly.</p><p>Now that you have your pipelines checked into GitHub, deploying those pipelines onto another environment, like Cloud Data Fusion on GCP for example, is fairly straightforward. Once again, there are two ways we can accomplish this task: by using the UI in CDF/CDAP or by using the API. When dealing with one or two pipelines, using the UI is fairly easy and convenient, but when you have many pipelines to migrate this process can become cumbersome.</p><p>So, once again, it’s important to look at your checklist of the components a pipeline relies on (things like namespace preferences, custom plugins or UDDs, etc.) and plan your migration accordingly.</p><h3>GitHub Workflow</h3><p>I’ve decided to use a migration strategy that relies on named branches in GitHub that correlate to my CDF/CDAP environments. Thus, I have branches named development, test, qa, and so on. Any changes merged into the development branch can be merged into the test branch so that the pipelines in that branch can be tested on an environment that may resemble production.</p><h4>Merge Development to Test</h4><p>We start off by creating a pull request against the test branch so that our development work is merged to the test branch.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bVZCEQTIaHA_v9sX" /></figure><p>Next you create the pull request and provide a description for the PR.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rB6r3b9gOJipV-OM" /></figure><p>Once the PR has been created you can merge it to the test branch.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YJvGmzx3lWbsleY7" /></figure><p>Finish by clicking on the confirm merge button.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*W5P19TUIncHcfDRE" /></figure><p>Once merged you will see a confirmation. The test branch now reflects all the changes in the dev branch.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VUJWJ3EG6iliP9LI" /></figure><h3>Clone Test Branch</h3><p>With the development branch merged into test, I’m now ready to clone the test branch and deploy my pipelines to the TEST environment. I’m specifying that I want to clone the test branch with the “-b test” parameter. Once the repo is cloned I can navigate into the pipelines folder and see the pipelines available for testing.</p>
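<p>For reference, here’s what that clone command looks like, using the example repo from the previous article:</p><blockquote>git clone -b test https://github.com/vhajdari/cdap-pipelines.git</blockquote>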
<p>My shell also provides a visual cue as to which branch I’m rooted in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WCtTjX4kq2lRZwXd" /></figure><h3>Deploy Pipeline</h3><p>The procedure you use for deploying pipelines can vary when it comes to loading a number of pipelines in bulk, but the strategies to consider include using the CLI in combination with shell scripting, or using the REST API and writing a deployment utility in the language of your choice — similar to the export_pipelines.py utility I wrote for downloading all the pipelines from a CDAP instance, regardless of which namespace they belong to.</p><p>The CLI option is convenient because it’s included with your CDAP sandbox and it lets you automate the majority of the functions you need for your deployments. But, if you really want maximum flexibility and the ability to programmatically control all aspects of the deployment, configuration, validation, and execution phases, then you’ll want to use a full-fledged language like Python, Java, Go, or whatever programming language you fancy for working with REST APIs.</p><p>I have two pipelines that I need to deploy, based on what’s in the pipelines folder, and since I did not organize the pipelines into a namespace hierarchy I can deploy them as I like, into any namespace on the target system. But what if I had a few namespaces in my TEST or PROD environments — how would I know which namespace to deploy them to? Preferably the development environment should be configured to contain the same namespaces you intend to use in production, so that the export script can create folders with the namespace name and place the associated pipelines for that namespace under the respective folder.</p><p>As an example, here’s how you can deploy a pipeline via REST:</p><blockquote>curl -X PUT "http://localhost:11015/v3/namespaces/<strong>NAMESPACE_NAME</strong>/apps/<strong>Titanic_02</strong>" -d "@./Titanic_02-cdap-data-pipeline.json"</blockquote><p>If we need to loop through all the pipelines that need to be deployed, then we can write a script/program to replace the namespace name and the pipeline name in the URL as necessary.</p><p>To illustrate, in the example below I deploy the pipeline to a namespace called BAR.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lzhyqUXmSG38Q7sJ" /></figure><h3>Deployment Gotchas</h3><p>As I mentioned in the previous article, you’ll want to have a checklist of all the components that a pipeline relies on when you deploy it in another environment. Here’s what to look out for — especially if you will be writing your own automation utility.</p><h4>Namespaces</h4><p>As mentioned earlier, you can use the CDAP sandbox CLI to access a remote CDAP instance. To connect to a remote instance via the CLI, add the <strong><em>“--uri [IP_ADDRESS|HOSTNAME]”</em></strong> parameter to the cdap cli command. If you have added the CDAP executable to your path then the command would look like this:</p><blockquote>cdap cli --uri <a href="http://my_cdap_server.example.com:11015/">http://my_cdap_server.example.com:11015</a></blockquote><p>A CDF/CDAP instance can have any number of namespaces, and it’s probably a good idea to validate that the namespaces match from one environment to another.</p>
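<p>To automate that check, a small script can pull the namespace list from both instances and create whatever is missing. Here’s a rough sketch of the idea in Python (assuming the <strong>requests</strong> library and no authentication — adjust the hosts and add auth headers to match your environments):</p><pre>import requests

SOURCE = "http://localhost:11015"
TARGET = "http://my_cdap_server.example.com:11015"

def list_namespaces(base):
    # GET /v3/namespaces returns a list of namespace records
    resp = requests.get(base + "/v3/namespaces")
    resp.raise_for_status()
    return {ns["name"] for ns in resp.json()}

# Namespaces that exist on the source but not on the target
missing = list_namespaces(SOURCE) - list_namespaces(TARGET)
for name in sorted(missing):
    print("Creating missing namespace on target: " + name)
    requests.put(TARGET + "/v3/namespaces/" + name).raise_for_status()</pre>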
<p>Here’s how to get a list of namespaces using both the CLI and the REST API:</p><p><strong>CLI:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*y-iRMJpMdNT-3HdN" /></figure><p><strong>REST:</strong></p><blockquote>curl --request GET \</blockquote><blockquote>--url <a href="http://localhost:11015/v3/namespaces">http://localhost:11015/v3/namespaces</a></blockquote><p>Now that you know how to retrieve the list of namespaces from a CDAP environment, it’s just as easy to create a namespace. Using curl, the command looks like this:</p><blockquote>curl --request PUT \</blockquote><blockquote>--url <a href="http://localhost:11015/v3/namespaces/NAMESPACE_NAME">http://localhost:11015/v3/namespaces/<strong>NAMESPACE_NAME</strong></a></blockquote><p>Make sure to replace the namespace name with your own namespace. Once again, if you are scripting this then you can loop through namespace names, either from the folder hierarchy that was created by the export pipelines utility, or from a separate config file that you exported the namespace names into — which is what I’d recommend.</p><h4><strong>Preferences</strong></h4><p>The next thing to watch out for is preferences that may have been set for each namespace and/or pipeline. There are a number of levels where preferences can be set, so make sure to take a look at the official documentation <a href="https://docs.cdap.io/cdap/6.1.1/en/reference-manual/http-restful-api/preferences.html">here</a> for a better understanding. It’s important to at least capture any preferences related to a namespace, as this is where global settings may have been set for macro key:value pairs associated with a namespace.</p><p>Here’s how to retrieve preferences for a namespace via REST:</p><blockquote>curl --request GET \</blockquote><blockquote>--url <a href="http://localhost:11015/v3/namespaces/NAMESPACE_NAME/preferences">http://localhost:11015/v3/namespaces/<strong>NAMESPACE_NAME</strong>/preferences</a></blockquote><h4><strong>Plugins</strong></h4><p>Probably the biggest gotcha is the absence of a plugin in the target environment when you deploy a pipeline. Unlike the convenience of the UI, the REST API does not provide a single interface for handling all the plugin-related issues you may encounter with your pipelines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Lvq_DmbxNbHcQByB" /></figure><p>It is up to you to check whether a plugin exists on the target system and whether it is at the correct version, given your pipeline’s generation and the version of CDF/CDAP you are deploying to. Therefore, your plugin deployment utility has to perform all of the validation tasks in order for the pipeline to work properly on a target environment.</p><p>You need to start by checking the contents of the pipeline JSON for the version of CDAP and the list of plugins used in that pipeline, along with their respective versions. Next you need to check the target system to determine what plugins it has in any given namespace — again, different namespaces can have different user-scope plugins. Based on this information you can make a comparison of the source and target environments and decide how to resolve the discrepancies with versioning and missing plugins.</p>
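<p>Here’s a rough sketch of that comparison in Python. The field names follow the structure of an exported pipeline JSON and the artifact listing endpoint; treat them as illustrative and verify against your own exports:</p><pre>import json
import requests

TARGET = "http://localhost:11015"
NAMESPACE = "default"

# Collect the (name, version) pairs of plugin artifacts the pipeline uses
with open("Titanic_02-cdap-data-pipeline.json") as f:
    pipeline = json.load(f)
needed = {
    (s["plugin"]["artifact"]["name"], s["plugin"]["artifact"]["version"])
    for s in pipeline["config"]["stages"]
}

# Collect the artifacts present in the target namespace
resp = requests.get(TARGET + "/v3/namespaces/" + NAMESPACE + "/artifacts")
resp.raise_for_status()
present = {(a["name"], a["version"]) for a in resp.json()}

for name, version in sorted(needed - present):
    print("Missing artifact on target: %s %s" % (name, version))</pre>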
<p>Missing plugins would need to be uploaded, and if the plugin versions are newer or older than the source environment, then you need to either upload the matching versions as defined in your pipeline or modify the pipeline JSON so that the plugin versions match the versions on the target environment.</p><p>Yikes…that’s quite a lot! Don’t fret though, the documentation is available <a href="https://docs.cdap.io/cdap/6.1.1/en/reference-manual/http-restful-api/index.html">here</a>. When it comes to a production workflow, you’ll most likely want to keep your development environment identical to the rest of the environments (TEST &amp; PROD) so that you don’t have to deal with all of the versioning issues. This way you can just focus on missing plugins, as would be the case if you are using custom-built plugins or plugins from the hub. As before, you can use the CLI or REST API.</p><p>Let’s look at the CLI first. Make sure you are rooted at the correct namespace — “<strong>use namespace NAMESPACE_NAME</strong>”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*q8wVvcqcYtoxCJWz" /></figure><p>Once in the desired namespace, load the plugin artifacts:</p><blockquote>load artifact /path/to/plugin.jar config-file /path/to/plugin.json</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZJ35KVbhwf1oF1fS" /></figure><p>Example using REST API:</p><blockquote>curl -w"\n" -X POST "http://localhost:11015/v3/namespaces/FOO/artifacts/trash-plugin" \</blockquote><blockquote>-H 'Content-Type: application/octet-stream' \</blockquote><blockquote>-H "Artifact-Extends: system:cdap-data-pipeline[6.0.0-SNAPSHOT,7.0.0-SNAPSHOT)/system:cdap-data-streams[6.0.0-SNAPSHOT,7.0.0-SNAPSHOT)/system:cdap-etl-batch[6.0.0-SNAPSHOT,7.0.0-SNAPSHOT)/system:cdap-etl-realtime[6.0.0-SNAPSHOT,7.0.0-SNAPSHOT)" \</blockquote><blockquote>--data-binary @./trash-plugin-1.2.0.jar</blockquote><h3>Testing</h3><p>The whole point of cloning the test branch was to test our pipelines, so let’s get to it. As before, we want to automate this process as much as possible. So what do we need to make sure we can run a test?</p><p>The first thing to verify is that the data source you used in your tests will be available in your TEST environment. We provided the source data in the repo to automate the testing, so all you need to do is copy the file to the target environment if the test data is not already there. In this case the cloned repo has everything we need — if you cloned the test branch onto the TEST server, that is. In most scenarios this would not be the case, as the test environment may be remote to the system orchestrating the tests.</p><p>Most likely you will be using macros to define what source data you will be working with, so it makes sense to make these values dynamic and configure them with namespace preferences — see, this is why preferences are important!</p><p>Finally, if you have multiple pipelines to test, it would be far more practical to create a script to loop through all the pipelines to test rather than executing them one by one.</p><blockquote>curl -X POST "http://localhost:11015/v3/namespaces/default/apps/Titanic_02/workflows/DataPipelineWorkflow/start" \</blockquote><blockquote>-d '{ "input.path":"/Users/veton/code/misc/cdap_blog/cdap-pipelines", "output.path":"/Users/veton/code/misc/cdap_blog/cdap-pipelines" }'</blockquote>
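<p>Such a loop might look like the following Python sketch. The file layout and the side-car file holding each pipeline’s runtime arguments are hypothetical, not something prescribed by CDAP:</p><pre>import glob
import json
import os
import requests

TARGET = "http://localhost:11015"
NAMESPACE = "default"

for path in glob.glob("pipelines/*-cdap-data-pipeline.json"):
    app = os.path.basename(path).replace("-cdap-data-pipeline.json", "")
    # Hypothetical side-car file with the macro values for this pipeline
    args_file = path.replace("-cdap-data-pipeline.json", ".args.json")
    runtime_args = {}
    if os.path.exists(args_file):
        with open(args_file) as f:
            runtime_args = json.load(f)
    url = "%s/v3/namespaces/%s/apps/%s/workflows/DataPipelineWorkflow/start" % (
        TARGET, NAMESPACE, app)
    requests.post(url, json=runtime_args).raise_for_status()
    print("Started " + app)</pre>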
<p>In the Titanic_02 pipeline I used macros, so I can pass those parameters in as a JSON object to the API call. Once again, if we were to write a program or a script to loop through all the pipeline tests, we would extract the macro and preference values from a file and pass them in at runtime.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kGxjbnxB5WjbRlNQzLVsMg.gif" /></figure><p>As you can see, when we invoke the pipeline execution via the REST API the UI swings into action and gives visual feedback of the activity taking place. And of course, we can see the results of our test in the output destination specified by the macro variable.</p><h3>Conclusion</h3><p>In this article we learned how to use GitHub to promote pipeline artifacts from one branch to another to perform testing. You can repeat this process to promote your pipeline artifacts all the way through to production. Besides the git workflow for source control, we discussed techniques for automating your deployment process using the CLI and REST API.</p><p>You can get as complex as you need to in order to have a fully automated system for deployment and testing, and I leave it up to you to come up with some creative ideas on how to accomplish this goal. There is a lot that can be done by leveraging the API with programming languages like Python, Java, Go, and many others, and for general automation shell scripting works great. Experiment with some of these and see how much mileage you can get out of a simpler approach first.</p><p>As we look to wrap up this series, we’ll dive a little deeper into automation and take a look at Jenkins as a way to automate some of the tasks we’ve done manually so far. Stay tuned for the next article.</p><p>Until next time, stay safe and healthy, and mind your social distancing!</p><hr><p><a href="https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-3-be2e217b897f">CI/CD and Change Management for Pipelines — Part 3</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[CI/CD and Change Management for Pipelines — Part 2]]></title>
            <link>https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-2-a286e806c2f2?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/a286e806c2f2</guid>
            <category><![CDATA[gcp]]></category>
            <category><![CDATA[data-analytics]]></category>
            <category><![CDATA[git]]></category>
            <category><![CDATA[cicd]]></category>
            <category><![CDATA[cdap]]></category>
            <dc:creator><![CDATA[Tony Hajdari]]></dc:creator>
            <pubDate>Mon, 16 Mar 2020 13:01:01 GMT</pubDate>
            <atom:updated>2020-04-06T14:19:59.762Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>CI/CD and Change Management for Pipelines — Part 2</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KYUGcrr13eQP3B8M" /></figure><p>Welcome to the second article in this four-part series. In the <a href="https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-1-1b4100aef66a">first article</a> I discussed some of the concepts related to continuous integration and testing. In this article we’ll get into some hands-on examples for extracting pipelines from CDF/CDAP and use GitHub as a repository for storing pipelines and related artifacts.</p><p>I will cover the following topics in this article:</p><ul><li>Creating a checklist of all the artifacts you will need to test in a target environment</li><li>How to set up a GitHub project to house our pipeline artifacts</li><li>How to export pipelines from CDF/CDAP using the export tools in the UI and how to do it via the REST API</li><li>How to check in our development pipelines to GitHub</li><li>How to deploy a pipeline to an alternate environment using both the UI and the REST API</li></ul><h3><strong>Prepare a Checklist</strong></h3><p>A pipeline exported in JSON format from CDF/CDAP defines a <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">Directed Acyclic Graph</a> (DAG) containing the sequence of operations that were designed visually, the plugins that are used by that DAG, and some configuration information for other components of the ecosystem where the pipeline will be executed.</p><p>So, if you already have the exported pipeline JSON, what else might you need?</p><p>The exported pipeline JSON itself is enough to recreate the visual representation of the pipeline on another instance of CDF/CDAP, but you will also need all the configuration information from the source system, which is not contained in the pipeline itself. Therefore, it’s prudent to create a checklist of all the information you will need when promoting a pipeline from one environment to another.</p><p>Here’s all the information you need to take into consideration:</p><ul><li><strong>Pipeline</strong> — The exported JSON of the pipeline DAG.</li><li><strong>Plugins</strong> — These are the actual JAR files containing the code of the plugins themselves. Your pipeline might have different versions of the plugins or it may have custom plugins and UDDs that may not exist on the target system. If you used custom plugins and UDDs then you will need to transfer those artifacts as well.</li><li><strong>Datasets</strong> — There is a good chance that the dataset you used for testing locally on your CDAP Sandbox instance might not be the same as the one on the target system.</li><li><strong>Preferences</strong> — You can set preferences globally at the system level or at the namespace level, and if your pipelines use any of these preferences then you will want to ensure that the target environment is configured the same.</li><li><strong>Macros</strong> — In order to make your pipelines portable across environments it makes sense to set any field that will change dynamically, based on the environment that it will be executed on, as macros.</li></ul><p>I’ll discuss how to invoke a pipeline test on a target environment with the requisite system preferences and macro settings in the last article in the series. For now, make sure you stash your pipelines, plugins, and datasets into the git repository. 
Preferences and macros are key-value pairs and can be represented nicely in JSON format. This is the format I will use in later articles for storing and porting that information.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Oh1Npvp0BSJMlP6m" /></figure><h3><strong>GitHub Repo</strong></h3><p>If you are new to GitHub and unfamiliar with <strong><em>git</em></strong> commands, I highly recommend you read up on the topic from the multitude of sources available on the internet. This will not be a tutorial on Git, but you should be able to follow along with the git workflow. The scenario I will use in this project is a two-person team that works in tandem to both develop and review the work of the other party.</p><p>The GitHub repo for these examples can be found here: <a href="https://github.com/vhajdari/cdap-pipelines">https://github.com/vhajdari/cdap-pipelines</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SwAJmAYDOQTkoRM5" /></figure><p>You can start off by creating a new fork of the GitHub repo I set up for this project. Once you have forked the project you can clone it to your local machine.</p><blockquote>git clone <a href="https://github.com/vhajdari/cdap-pipelines.git">https://github.com/vhajdari/cdap-pipelines.git</a></blockquote><p>This will create a local folder named <strong>cdap-pipelines</strong> containing the contents of the git repo.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Yjq4I6-PtMvc9OBr" /></figure><p>I created another user in GitHub and forked the <a href="https://github.com/vhajdari/cdap-pipelines">https://github.com/vhajdari/cdap-pipelines</a> repo into the new account. Keep in mind that you will need to create your own repo in order to check in new files.</p><h3><strong>Export Pipelines</strong></h3><p>I developed a rudimentary pipeline with three stages that looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KwwnzaPyWeIMMr_0c_-tUw.png" /></figure><p>To export this pipeline from the UI, click on the <strong>Actions</strong> icon on the top right of the screen and select <strong>Export</strong>. This will bring up a new window that will let you inspect the pipeline JSON. Click <strong>Export</strong> to save the file. The pipeline JSON can now be added to your GitHub repo.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tfrlQnttI2Yh5ctaE28rog.png" /></figure><p><strong>Bulk Export Pipelines</strong></p><p>Exporting one pipeline at a time can become very cumbersome if you have a large number of pipelines that you need to export. Unfortunately there is no way to bulk export all your pipelines from the UI. So, how can we get around this problem? This is where the REST API comes to the rescue.</p><p>If you are familiar with <strong>curl</strong> or use tools like Postman or Insomnia, you can quickly get a handle on how the REST API works. To get a listing of all the pipelines in CDF/CDAP, you invoke the following HTTP request using the GET method:</p><blockquote><a href="http://localhost:11015/v3/namespaces/default/apps">http://localhost:11015/v3/namespaces/default/apps</a></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*TZS1-qtyZo6fi5Lx" /></figure><p>This gives you a listing of all the deployed pipelines in the default namespace. In this example I used Insomnia. 
The curl version of this request is:</p><blockquote>curl --request GET \</blockquote><blockquote>--url <a href="http://localhost:11015/v3/namespaces/default/apps">http://localhost:11015/v3/namespaces/default/apps</a> \</blockquote><blockquote>--header 'content-type: application/x-www-form-urlencoded'</blockquote><p>You will notice that this API request simply returns a listing of all the pipelines, but our goal was to export all the pipelines all at once! So, how can we accomplish that?</p><p>Well, once we have the list of the pipelines we can iterate over that list and extract the JSON of each pipeline. The API request to get the pipeline JSON for Titanic_01 is as follows:</p><blockquote>curl --request GET \</blockquote><blockquote>--url &#39;http://localhost:11015/v3/namespaces/<strong>default</strong>/apps/<strong>Titanic_01</strong>&#39;</blockquote><p>To iterate over a list of pipelines and extract the one we want, we would simply replace the last portion of the URL that has the pipeline name with the desired pipeline name.</p><p>This API returns the same JSON that we saw in the UI, and you can use the output option in <strong>curl</strong> to write out the content to a file. As you can ascertain from these API calls, we could easily script these HTTP requests to extract all the contents to the file system — and this is exactly what I have done with a simple Python script that extracts all pipelines from all namespaces and writes them out to disk. You can find the <a href="https://github.com/vhajdari/cdap-pipelines/blob/master/util/export_pipelines.py">export_pipelines.py</a> file in the <strong>util</strong> folder of the repo you cloned earlier.</p><p>Here’s the export script in action:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*x07M8AyeaZ5udGoU" /></figure><p>Once all your pipelines are exported you can copy the desired files to the pipelines folder in the git project. The Python script I mentioned here is by no means comprehensive, and you may find that there are lots of other things that you may want to extract as well, but if you are so inclined it’s a good way to get started for learning how to script with the REST API.</p>
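<p>The heart of such a script can be quite small. Here’s a rough sketch of the idea (assuming the <strong>requests</strong> library; for the real thing, see export_pipelines.py linked above):</p><pre>import json
import os
import requests

BASE = "http://localhost:11015/v3"

# Walk every namespace, list its apps, and save each app's JSON to disk
for ns in requests.get(BASE + "/namespaces").json():
    ns_name = ns["name"]
    out_dir = os.path.join("pipelines", ns_name)
    os.makedirs(out_dir, exist_ok=True)
    for app in requests.get("%s/namespaces/%s/apps" % (BASE, ns_name)).json():
        detail = requests.get(
            "%s/namespaces/%s/apps/%s" % (BASE, ns_name, app["name"])).json()
        out_path = os.path.join(out_dir, app["name"] + ".json")
        with open(out_path, "w") as f:
            json.dump(detail, f, indent=2)
        print("Exported " + out_path)</pre>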
<p>By the way, this is also how you can extract all the system preferences and some additional settings from the source environment, so that you can add those settings to a file to support CI/CD efforts down the road. For example, this is how to get all the system preferences:</p><blockquote>curl --request GET \</blockquote><blockquote>--url <a href="http://localhost:11015/v3/namespaces/default/preferences">http://localhost:11015/v3/namespaces/default/preferences</a></blockquote><h3><strong>Push to Git Repo</strong></h3><p>OK, now that you know how to extract pipelines both individually and in bulk, the next step is to check these pipelines into your git repository. Assuming you forked my GitHub repo to your own repo, you would clone your forked repo and work off of that. For this example I created a new branch for each pipeline I want to push to my repo.</p><p>Start off by configuring some global settings for your repo. This will help you avoid any pesky error messages when you attempt to push your code.</p><blockquote>git config --global user.email "you@example.com"</blockquote><blockquote>git config --global user.name "Your Name"</blockquote><p>Create a branch for your work. In this example I used <strong>titanic-01</strong> as the development branch for the first pipeline I checked in to git.</p><blockquote>git checkout -b titanic-01</blockquote><p>Once you’ve copied your pipeline JSON to the <strong><em>pipelines</em></strong> directory, add all the files in the project folder to source control.</p><blockquote>git add .</blockquote><p>You can check the state of the git repo at any time by running:</p><blockquote>git status</blockquote><p>Now you are ready to commit your changes. Make sure to add a comment so that the commit is documented.</p><blockquote>git commit -m "This is a message that will explain what this commit contains"</blockquote><p>Almost done. All that is left to be done is to push all the changes to the GitHub repository.</p><blockquote>git push origin titanic-01</blockquote><p>Granted, all of the steps of exporting the pipelines and checking them into git can be scripted as well, but that is left as an exercise for the reader, since no two teams operate the same way, let alone different enterprises. You can get as elaborate as you want for this process, but to minimize bugs it’s always best to keep things simple.</p><h3><strong>Pull Requests and Merges</strong></h3><p>When working in your own feature branch, you have the freedom to make all the changes you want locally and check in whatever you would like to be merged with the upstream project. In order for the upstream project to reflect the changes we pushed to our git repository, we need to create a pull request. You do this on the GitHub page.</p><p>Make sure to create the pull request against a branch that the maintainer is expecting to merge PRs into. In this case I’m creating a PR against the upstream development branch, and will leave it up to the maintainer to merge the development branch into the testing, QA, or master (production) branch. These will come into play later when we configure CI/CD.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6Iiav0aOtNqcc1IK" /></figure><p>The maintainer will review, and can comment on, accept, or reject the PR.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MkwCvLtiD2ynkJ9b" /></figure><h3>Deploy a Pipeline From Git</h3><p>Deploying a pipeline from a cloned GitHub repo is as simple as performing an import in CDF/CDAP. Once again, the UI lets you do imports one pipeline at a time, but it also takes care of a lot of the validation for you.</p><p>Pipeline validation includes things like:</p><ul><li>Checking if the pipeline already exists</li><li>Checking plugin versions and upgrading to the latest versions as necessary</li><li>Identifying missing plugins, with the ability to download them from the Hub.</li></ul><p>To import from the UI, locate the big green plus button. When you click it you will be presented with the following window:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*umBZAuDqYZoeBPWl6Pk3wA.png" /></figure><p>Click the <strong>Import</strong> button on the Pipeline card and select the pipeline JSON from your file system. 
The pipeline will then load into edit mode in the Studio, where you can continue updating it or deploy it.</p><p>Alternatively, here’s an example of how to deploy a pipeline using the API with curl:</p><blockquote><em>curl -X PUT "http://localhost:11015/v3/namespaces/</em><strong><em>namespace-id</em></strong><em>/apps/</em><strong><em>pipeline-name</em></strong><em>" -d "@/path/to/the/pipeline/JSON"</em></blockquote><p>Make sure to substitute <strong>namespace-id</strong> and <strong>pipeline-name</strong> with your own values. In my case the values are <strong>default</strong>, because I’m deploying it to the default namespace, and <strong>Titanic_01</strong>, which is the name I’ve given to this pipeline. The final component is the path to your pipeline JSON file. Don’t forget the <strong>@</strong> symbol before the path.</p><p><strong><em>A word of caution.</em></strong> Deploying a pipeline via the API does not provide any validation like you have in the UI. Therefore, any validation that needs to take place needs to be encapsulated in your deployment code. Similarly to what I did with the pipeline export script in Python, you would have to perform each validation step if you want to ensure that the pipeline will have all the requisite artifacts on the target system.</p><h3>Conclusion</h3><p>There you have it. Your pipelines are now in source control on GitHub. Of course, you don’t have to use GitHub for your VCS, but it is one of the most popular options in the open source community and has broad adoption among a majority of open source projects, including CDAP.</p><p>In this article we learned how to extract pipelines via the UI and REST, and how the <strong>export_pipelines.py</strong> script helps us export pipelines in bulk. I showed you how to check your pipelines into your repository with git, and how to create a PR to push your pipelines to the upstream repo for testing and promotion.</p><p>In the next article we’ll discuss the process for migrating artifacts from GitHub into TEST, QA, or PROD environments. We’ll dig a little deeper into automation options and discover how we can leverage the API more broadly.</p><p>Until next time, stay safe and healthy, and wash your hands!</p><hr><p><a href="https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-2-a286e806c2f2">CI/CD and Change Management for Pipelines — Part 2</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[CI/CD and Change Management for Pipelines — Part 1]]></title>
            <link>https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-1-1b4100aef66a?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/1b4100aef66a</guid>
            <category><![CDATA[ci-cd-pipeline]]></category>
            <category><![CDATA[cdap]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[continuous-integration]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <dc:creator><![CDATA[Tony Hajdari]]></dc:creator>
            <pubDate>Mon, 09 Mar 2020 13:01:00 GMT</pubDate>
            <atom:updated>2020-03-09T14:35:44.323Z</atom:updated>
            <content:encoded><![CDATA[<h3>CI/CD and Change Management for Pipelines — Part 1</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/0*UD048Upd9cfHmkOE" /><figcaption>Typical CI/CD process for data pipelines</figcaption></figure><p>Welcome to my latest series on continuous integration of data pipelines with Cloud Data Fusion (CDF) and/or CDAP. This will be a <strong>four-part series</strong> of articles where I’ll discuss the promotion process of data pipelines across multiple environments and all the tools and techniques that we’ll use along the way.</p><h4><strong>Change Control and CI/CD</strong></h4><p>Whenever we consider a development lifecycle in an enterprise setting, there are a number of gates that a product has to go through before being released to production. Typically we do development in a segregated development environment; most often this is our very own laptop. Artifacts that have completed the development phase would be transferred to a test environment to undergo both unit and integration testing. Some organizations have a more rigorous Quality Assurance (QA) process that requires that all artifacts that pass initial tests undergo further testing in order to be certified for release to production. Finally, all artifacts that have been tagged for promotion will be published to a production environment, put to use with production data, and secured accordingly.</p><p>There are two parts to this equation: governance and automation. Some or all of these activities can be automated with platforms like <a href="https://github.com/">GitHub</a> and similar systems, while others require explicit human intervention. The governance process for developing and promoting code to production is now commonplace and very well understood by enterprises, but how does this apply to data integration pipelines that are not necessarily code?</p><p>If we come from the programming world we most likely have been exposed to <a href="https://en.wikipedia.org/wiki/Version_control">Version Control Systems</a> like <a href="https://git-scm.com/">Git</a>. To automate the process of migrating artifacts from Git, most enterprises leverage CI/CD tools like <a href="https://jenkins.io/">Jenkins</a> or <a href="https://travis-ci.com/">Travis CI</a>. The human element in the governance process for change control becomes a simple workflow in GitHub, as in reviewing and merging a PR, but scheduling and prioritizing promotion of artifacts may still require approval and scheduling by committee.</p><p>In this article I’m not going to focus so much on the change control process and all the bureaucracy that may go with it; I’ll primarily discuss the mechanics of how a CI/CD workflow can be implemented for developing and promoting CDF/CDAP artifacts from development through production.</p><p>CI/CD has many moving parts, so I’ll break this topic down into multiple articles to focus on each area in greater detail as we go through this journey. 
The series of articles will be broken down as follows:</p><ul><li>In this first article we’ll cover the overall process and define the concepts.</li><li>In the second article we’ll focus on how we can extract artifacts from a CDF/CDAP environment and store them on GitHub.</li><li>In the third article we’ll discuss the process for migrating artifacts from GitHub into a TEST, QA, or PROD environment.</li><li>And finally, in the fourth article I’ll discuss how we can automate everything so that the whole CI/CD process can be invoked with Jenkins.</li></ul><h4><strong>What is CI/CD?</strong></h4><p>Often you will see the acronym CI/CD in reference to software development workflows. This stands for Continuous Integration and Continuous Delivery or Deployment. In simple terms this means we write some code and some tests for that code, and some automation system takes over once we’ve checked in our code to build it, run the unit and integration tests that were written, and deploy the resulting artifacts somewhere.</p><p>So why is CI/CD important? As you can imagine, most enterprises will have distributed teams working on different portions of a system that supports production goals; therefore automation for continuously building and delivering new features and functionality, as well as bug fixes, is highly valued. This also allows teams to be more agile and work on multiple tasks simultaneously.</p><p>Most important of all is that only artifacts that have been fully tested and validated make their way into production. CI/CD processes, and systems that enable those processes, are widely used to achieve this goal.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9jtGp46qQnsRO1Y8" /></figure><p>In the context of developing pipelines we don’t really have a build process in the traditional sense of compiling code. The development process itself is defining the logical flow and processing steps for the data pipeline using a visual development environment, and testing will typically happen in preview mode while in the pipeline development studio of CDF/CDAP. Further testing is then conducted by publishing the pipeline in a development environment that most likely has only a subset of data. Asserting the validity of the test requires that sources and sinks are checked for the expected number of records, and this can change based on which environment the tests are run on and the volume of data.</p><p>From an automation perspective we would look at leveraging the REST API of CDF/CDAP to deploy repeatable sets of tests along with preparing the execution environment.</p><h4><strong>Organizing Artifacts Into Namespaces</strong></h4><p>One powerful feature of CDF/CDAP is that you can organize your artifacts into separate namespaces. Each namespace can have its own artifacts, including pipelines, plugins, and applications, as well as its own preferences: a set of key-value pairs that can store information for things like database connections or folder paths.</p><p>As an example, you can have a DEV namespace that may be configured to point to a development database and a TEST namespace that would point to an entirely different database. Using this technique makes it easy to test your promotion process without having to maintain multiple CDF/CDAP instances.</p>
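<p>Those per-namespace connection settings are just preferences, which can be set through the Preferences REST API. For illustration, something along these lines should work (the key names here are made up):</p><blockquote>curl -X PUT "http://localhost:11015/v3/namespaces/DEV/preferences" \</blockquote><blockquote>-H 'Content-Type: application/json' \</blockquote><blockquote>-d '{ "db.host": "dev-db.example.internal", "data.path": "/data/dev" }'</blockquote>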
<h4><strong>Segregated CDF/CDAP Instances</strong></h4><p>At times it is absolutely essential that a production environment is segregated into its own distinct set of instances, with security and networking configured specifically for that purpose.</p><p>Oftentimes this means having a dedicated VPC and strict network ingress and egress rules, as well as very restrictive service accounts and user account controls.</p><p>In this context, CDF/CDAP may be just one component of a greater ecosystem of data storage and processing systems, and thus would need to be highly secured and would be required to work in a very restrictive environment. To effectively test such environments it would be necessary to have a similarly configured QA environment that mimics the same properties of the production environment, such that the integrated tests will be representative of the workloads that would be experienced in production.</p><h4><strong>Testing a Pipeline</strong></h4><p>When developing a pipeline locally in CDAP Sandbox you can test the pipeline in preview mode, and when you are ready to test with real datasets you can publish the pipeline and run it locally. This is an effective way to test that the pipeline logic is working well, but it only tests the pipeline with a limited scope of data, and without true parallelism. To test the pipeline at scale you really need to deploy and run it on an Apache Hadoop cluster like Dataproc with a sufficiently large input.</p><p>For quick tests across environments you will need to make sure that you have the following elements replicated to the target environment where you intend to do the testing:</p><ul><li><strong>Source Data</strong> — This data should be in a storage system reasonably equivalent to production, and be large enough to simulate your desired levels of scalability. We will use datasets from HDFS or GCS in CSV or Avro format for the purpose of this blog series.</li><li><strong>Pipeline</strong> — The JSON export of the pipeline DAG.</li><li><strong>Plugins</strong> — If you have any custom-developed plugins or plugins you’ve downloaded from the hub.</li><li><strong>Preferences</strong> — Key-value pairs used in that namespace for macro substitution.</li></ul><p>CDAP provides three ways to export a pipeline: directly from within the UI, via the CLI, or by using the API. I’ll show you two of those ways, the UI and the API. 
But, keep in mind that environment-specific configurations like preferences and deployed applications are not part of the pipeline export — more on this in the next article.</p><h4><strong>Part 1 — Wrap Up</strong></h4><p>In the next article I will cover the following:</p><ul><li>Creating a checklist of all the artifacts you will need to test in a target environment</li><li>How to set up a GitHub project to house our pipeline artifacts</li><li>How to export pipelines from CDF/CDAP using the export tools in the UI and how to do it via the REST API</li><li>How to check in our development pipelines to GitHub</li><li>How to deploy a pipeline to an alternate environment using both the UI and the REST API</li></ul><p>See you next time!</p><hr><p><a href="https://medium.com/cdapio/ci-cd-and-change-management-for-pipelines-part-1-1b4100aef66a">CI/CD and Change Management for Pipelines — Part 1</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Wrangler Functions Cheat Sheet]]></title>
            <link>https://medium.com/cdapio/wrangler-functions-cheat-sheet-cda98370ea89?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/cda98370ea89</guid>
            <category><![CDATA[data-analytics]]></category>
            <category><![CDATA[etl]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[java-expressions]]></category>
            <category><![CDATA[cdap]]></category>
            <dc:creator><![CDATA[Tony Hajdari]]></dc:creator>
            <pubDate>Tue, 11 Feb 2020 14:01:01 GMT</pubDate>
            <atom:updated>2020-02-11T14:01:01.730Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yySFdonI8vl72TJK" /><figcaption>Photo by <a href="https://unsplash.com/@davidoclubb?utm_source=medium&amp;utm_medium=referral">Dave Clubb</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>To really become a Ninja with Wrangler Directives you have to get to know all the functions that Wrangler supports. In this article I’m going to list out all the Wrangler functions with a short description of what each one does.</p><p>At the time of this writing the Wrangler code branch on GitHub is at version 4.1 for the latest release. The link to the Wrangler functions documentation can be found <a href="https://docs.cdap.io/cdap/current/en/user-guide/data-preparation/functions/index.html">here</a>.</p><p>In a <a href="https://medium.com/cdapio/advanced-cdap-directives-c10569724da0">previous article</a> I discussed how you can use JEXL expressions in your directives, and these functions are no different. This article will primarily focus on the functions themselves, so that you have a good understanding of what is available.</p><p>If you want to know what JEXL function types are available to Wrangler you can take a look at the code <a href="https://github.com/data-integrations/wrangler/blob/release/4.1/wrangler-core/src/main/java/io/cdap/wrangler/expression/EL.java">here</a>, but I’ll save you the trouble and list them here along with their associated classes.</p><p><strong>Function Prefix → Class Name</strong><br><em>null</em> → Global.class<br><em>date</em> → Dates.class<br><em>json</em> → JSON.class<br><em>dq</em> → DataQuality.class<br><em>ddl</em> → DDL.class<br><em>geo</em> → GeoFences.class</p><p>… and a few more that are not covered in this blog! The rest of the functions allow you to call methods from the corresponding Java utility classes like Math, String, etc. — more on this in future articles, so keep an eye out! But for the impatient, here’s how such a function would be used:</p><blockquote>set-column :rounded_price math:ceil(price)</blockquote><p><em>math</em> → Math.class<br><em>string</em> → StringUtils.class<br><em>strings</em> → Strings.class<br><em>escape</em> → StringEscapeUtils.class<br><em>bytes</em> → Bytes.class<br><em>arrays</em> → Arrays.class</p><h3>Using Functions</h3><p>Most commonly you will use these functions with a directive that can parse an expression, such as <strong>send-to-error</strong>.</p><blockquote><strong>send-to-error <em>exp:{&lt;condition&gt;}</em></strong></blockquote><p>So, in order to use an expression you put the prefix in the <strong>exp</strong> part of the expression, unless it’s the Global variety, in which case you leave it empty. 
This would take the form of: FUNCTION(), date:FUNCTION(), dq:FUNCTION(), json:FUNCTION(), and so on.</p><p>For example, a directive that uses the Data Quality function would look like this:</p><blockquote><em>send-to-error</em><strong> <em>dq:isCreditCard(Credit_Card_Number)</em></strong></blockquote><p>As you can see in this example, <strong>dq</strong> is the function prefix and <strong>isCreditCard()</strong> is the function that is part of the Data Quality class, with the “<strong>:</strong>” joining the prefix and the function name.</p><p>This function checks that the value in the column “<strong><em>Credit_Card_Number</em></strong>” is in fact a valid credit card number.</p><p>Conversely, you can check for the inverse condition by prefixing the function name with a “!” — this matches when the function evaluates to false.</p><blockquote><em>send-to-error </em><strong><em>!dq:isCreditCard(Credit_Card_Number)</em></strong></blockquote><p>The following screenshots illustrate what happens to the records with invalid credit card numbers that are filtered out.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B7YExBcXV2LQRhZj7ZYfSQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DWGW_E2giI_ZQmlYlTPNIg.png" /></figure><p>Additionally, you can also use these functions with the <strong>set-column</strong> directive, to create a new column based on an expression applied to one or more other columns.</p><blockquote><strong>set-column</strong><em> </em><strong><em>:column-name exp:{&lt;condition&gt;}</em></strong></blockquote><p>For example,</p><blockquote><em>set-column :Full_Name</em><strong><em> </em>string:concat(first, ' ', last)</strong></blockquote>
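<p>Putting a few of these together, a short recipe might look like this (the column names are made up for illustration):</p><blockquote>send-to-error !dq:isEmail(Email)</blockquote><blockquote>parse-as-simple-date :Birthday MM/dd/yyyy</blockquote><blockquote>set-column :Birth_Year date:YEAR(Birthday)</blockquote><blockquote>set-column :Full_Name string:concat(First_Name, ' ', Last_Name)</blockquote>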
<p>The following cheatsheet lists the functions along with the data types they accept as input. The source code for all these functions can be found <a href="https://github.com/data-integrations/wrangler/tree/release/4.1/wrangler-core/src/main/java/io/cdap/functions">here</a>.</p><h3>Global Functions</h3><p>In this section the functions do not have a prefix (<em>null</em> → Global.class), therefore you do not put anything before the function.</p><h4>Conversion Functions</h4><p><em>Example</em>: <strong>set-column :Phone1 toDouble(Phone)</strong></p><ul><li><strong>toDouble</strong>(String value) → Converts a String value to double.</li><li><strong>toFloat</strong>(String value) → Converts a String value to float.</li><li><strong>toLong</strong>(String value) → Converts a String value to Long.</li><li><strong>toInteger</strong>(String value) → Converts a String value to integer.</li><li><strong>toBytes</strong>(String value) → Converts a String value to a byte array.</li></ul><h4>String Utility Functions</h4><p><em>Example</em>: <strong>set-column Salutation concat("Mr. ", Name)</strong></p><ul><li><strong>concat</strong>(String a, String b) → Concatenates two strings without any separator in between.</li><li><strong>concat</strong>(String a, String delim, String b) → Concatenates two strings with a delimiter.</li><li><strong>coalesce</strong>(Object … objects) → Finds the first non-null object.</li><li><strong>rcoalesce</strong>(Object … objects) → Finds the last non-null object.</li><li><strong>format</strong>(String str, Object… args) → Formats the string in a way similar to Java’s String.format.</li><li><strong>padAtStart</strong>(String string, int minLength, char padChar) → Returns a string, of length at least minLength, consisting of string prepended with as many copies of padChar as are necessary to reach that length.</li><li><strong>padAtEnd</strong>(String string, int minLength, char padChar) → Returns a string, of length at least minLength, consisting of string appended with as many copies of padChar as are necessary to reach that length.</li><li><strong>repeat</strong>(String string, int count) → Returns a string consisting of a specific number of concatenated copies of an input string.</li><li><strong>unquote</strong>(String string) → Removes single or double quotes from a string if it’s quoted.</li></ul><h3>Data Quality Functions</h3><p>Data quality functions validate the data being passed in, so <strong>send-to-error </strong>or <strong>filter-*</strong> type directives are good options to use in combination with these functions.</p><p><em>Example</em>: <strong>send-to-error !dq:isEmail(Email)</strong></p><ul><li><strong>columns</strong>(Row row) → Given a row, finds the length of the row.</li><li><strong>hascolumn</strong>(Row row, String column) → Finds if the row has a column.</li><li><strong>inrange</strong>(double value, double lower, double upper) → Checks if the value is within the range.</li><li><strong>strlen</strong>(String str) → Returns the length of the string.</li><li><strong>isnull</strong>(Object object) → Checks if the object is null.</li><li><strong>isempty</strong>(String str) → Checks if the string is empty.</li><li><strong>isDate</strong>(String date) → Validates a date using the default Locale.</li><li><strong>isDate</strong>(String date, String pattern) → Validates a date using the specified pattern.</li><li><strong>isNumber</strong>(String str) → Checks if a value is a number.</li><li><strong>isBoolean</strong>(String str) → Checks if a value is a boolean.</li><li><strong>isEmpty</strong>(String str) → Checks if a value is empty.</li><li><strong>isDouble</strong>(String str) → Checks if a value is a double.</li><li><strong>isInteger</strong>(String str) → Checks if a value is an integer.</li><li><strong>isIP</strong>(String ip) → Checks if string is a valid IP address. 
Could be IPv4 or IPv6.</li><li><strong>isIPv4</strong>(String ip) → Checks if string is a valid IPv4 address.</li><li><strong>isIPv6</strong>(String ip) → Checks if string is a valid IPv6 address.</li><li><strong>isEmail</strong>(String email) → Checks if string is a valid email address.</li><li><strong>isUrl</strong>(String url) → Checks if string is a valid URL.</li><li><strong>isDomainName</strong>(String domain) → Checks if string is a valid domain name.</li><li><strong>isDomainTld</strong>(String domain) → Checks if string is a valid top-level domain.</li><li><strong>isGenericTld</strong>(String domain) → Checks if string is a valid generic top-level domain.</li><li><strong>isCountryTld</strong>(String domain) → Checks if string is a valid country top-level domain.</li><li><strong>isISBN10</strong>(String isbn) → Checks if string is a valid ISBN-10.</li><li><strong>isISBN13</strong>(String isbn) → Checks if string is a valid ISBN-13.</li><li><strong>isCreditCard</strong>(String cc) → Checks if string is a valid credit card number.</li><li><strong>isAmex</strong>(String cc) → Checks if string is a valid AMEX credit card number.</li><li><strong>isVisa</strong>(String cc) → Checks if string is a valid VISA credit card number.</li><li><strong>isMaster</strong>(String cc) → Checks if string is a valid MasterCard credit card number.</li><li><strong>isDiner</strong>(String cc) → Checks if string is a valid Diners Club card number.</li><li><strong>isDiscover</strong>(String cc) → Checks if string is a valid Discover card number.</li><li><strong>isVPay</strong>(String cc) → Checks if string is a valid V Pay card number.</li></ul><h3>Date Functions</h3><p>The majority of the functions here expect a <strong>ZonedDateTime</strong> as input, so make sure to parse the date field before using the functions.</p><p><em>Example</em>:</p><blockquote><strong>send-to-error !date:isDate(Birthday)</strong></blockquote><blockquote><strong>parse-as-simple-date :Birthday MM/dd/yyyy</strong></blockquote><blockquote><strong>set-column month_born date:MONTH_LONG(Birthday)</strong></blockquote><ul><li><strong>UNIXTIMESTAMP_MILLIS</strong>(ZonedDateTime date) → Converts a date to a long Unix timestamp in milliseconds.</li><li><strong>UNIXTIMESTAMP_SECONDS</strong>(ZonedDateTime date) → Converts a date to a long Unix timestamp in seconds.</li><li><strong>MONTH</strong>(ZonedDateTime date) → Extracts the month of the year from a ZonedDateTime.</li><li><strong>MONTH_SHORT</strong>(ZonedDateTime date) → Extracts a short month description from the date.</li><li><strong>MONTH_LONG</strong>(ZonedDateTime date) → Extracts a long month description from the date.</li><li><strong>YEAR</strong>(ZonedDateTime date) → Extracts only the year from the date.</li><li><strong>DAY_OF_WEEK</strong>(ZonedDateTime date) → Extracts the day of the week from the date.</li><li><strong>DAY_OF_WEEK_SHORT</strong>(ZonedDateTime date) → Extracts the day of the week from the date as short text.</li><li><strong>DAY_OF_WEEK_LONG</strong>(ZonedDateTime date) → Extracts the day of the week from the date as long text.</li><li><strong>DAY_OF_YEAR</strong>(ZonedDateTime date) → Extracts the day of the year from the date.</li><li><strong>ERA</strong>(ZonedDateTime date) → Extracts the era from the date.</li><li><strong>ERA_SHORT</strong>(ZonedDateTime date) → Extracts the era from the date as short text.</li><li><strong>ERA_LONG</strong>(ZonedDateTime date) → Extracts the era from the date as long text.</li><li><strong>DAYS_BETWEEN</strong>(ZonedDateTime date1, ZonedDateTime date2) → Returns the number of days between two dates.</li>
<li><strong>DAYS_BETWEEN_NOW</strong>(ZonedDateTime date) → Returns the number of days between the given date and now.</li><li><strong>SECONDS_TO_DAYS</strong>(int seconds) → Converts seconds to days.</li><li><strong>SECONDS_TO_HOURS</strong>(int seconds) → Converts seconds to hours.</li><li><strong>SECONDS_TO_MINUTES</strong>(int seconds) → Converts seconds to minutes.</li><li><strong>SECONDS_TO_WEEKS</strong>(int seconds) → Converts seconds to weeks.</li><li><strong>isDate</strong>(String value) → Checks if a column is a date column.</li><li><strong>isTime</strong>(String value) → Checks if the value passed is a date time.</li></ul><h3>Geo Fencing Functions</h3><p>Some Geo functions for good measure.</p><p><em>Example</em>: <strong>send-to-error !geo:inFence(latitude,longitude,body)</strong></p><ul><li><strong>inFence</strong>(double latitude, double longitude, String geofences) → Checks if the point is inside any of the given polygonal geofences, based on the winding number algorithm.</li><li><strong>isPointInside</strong>(Feature feature, Coordinates location) → Checks if the given location is inside the feature's geometry.</li><li><strong>isLeft</strong>(Coordinates vertex0, Coordinates vertex1, Coordinates gpC) → You'll have to look at the source code to figure this one out!</li></ul><h3>JSON Functions</h3><p>For those times when you need to work with JSON data in a column. Make sure to parse the JSON string first with the <strong>parse-as-json</strong> directive for the functions that expect a <strong>JsonElement</strong> as input.</p><p><em>Example</em>: <strong>set-column character_data json:parse(JSON)</strong></p><ul><li><strong>select</strong>(String json, String path, String …paths) → Selects elements from a JSON string.</li><li><strong>select</strong>(String json, boolean toLower, String path, String …paths) → Selects elements from a JSON string.</li><li><strong>select</strong>(JsonElement element, String path, String …paths) → Selects elements from a JSON element.</li><li><strong>select</strong>(JsonElement element, boolean toLower, String path, String …paths) → Selects elements from a JSON element.</li><li><strong>drop</strong>(String json, String field, String … fields) → Removes fields from a JSON, inline, by recursively iterating through the JSON to delete the one or more fields specified.</li><li><strong>drop</strong>(JsonElement element, String field, String … fields) → Removes fields from a JSON, inline, by recursively iterating through the JSON to delete the one or more fields specified.</li><li><strong>keysToLower</strong>(JsonElement element) → Lowercases the keys of the JSON, applying the transformation recursively.</li><li><strong>join</strong>(JsonElement element, String separator) → Joins JSON elements together with a separator.</li><li><strong>stringify</strong>(JsonElement element) → Converts a JSON element to a JSON string. This is equivalent to JSON.stringify().</li><li><strong>parse</strong>(String json) → Parses a column or string to JSON. This is equivalent to JSON.parse().</li><li><strong>parse</strong>(String json, boolean toLower) → Parses a column or string to JSON. This is equivalent to JSON.parse().</li></ul><h3><strong>Conclusion</strong></h3><p>This article listed all the functions available in Wrangler that can be used in directives supporting expressions. You may have noticed that the <strong>isDate</strong> function is available in multiple classes.
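For instance, both of the following directives are valid, but they call different classes (a sketch; <em>Signup_Date</em> is a hypothetical raw string column):</p><blockquote><strong>send-to-error !dq:isDate(Signup_Date)</strong></blockquote><blockquote><strong>send-to-error !date:isDate(Signup_Date)</strong></blockquote><p>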
The one you use will depend on the type of input you are validating and the logic you are implementing.</p><p>In future articles we’ll dive a little deeper to see how we can chain such functions together and use them in combination to really spruce up our Wrangler recipes.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cda98370ea89" width="1" height="1" alt=""><hr><p><a href="https://medium.com/cdapio/wrangler-functions-cheat-sheet-cda98370ea89">Wrangler Functions Cheat Sheet</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Announcing CDAP 6.1.1 Release]]></title>
            <link>https://medium.com/cdapio/announcing-cdap-6-1-1-release-ef35efd6fd96?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/ef35efd6fd96</guid>
            <category><![CDATA[cdap]]></category>
            <category><![CDATA[os]]></category>
            <category><![CDATA[lineage]]></category>
            <category><![CDATA[data-analytics]]></category>
            <category><![CDATA[validation]]></category>
            <dc:creator><![CDATA[Mo Eseifan]]></dc:creator>
            <pubDate>Mon, 03 Feb 2020 14:01:01 GMT</pubDate>
            <atom:updated>2020-02-03T14:01:01.408Z</atom:updated>
<content:encoded><![CDATA[<p>On behalf of the Cask Data Application Platform (CDAP) community, it is my pleasure to announce the release of CDAP version 6.1.1. This release includes a new, intuitive UI for Field Level Lineage (FLL), and extends FLL support to more plugins. It also introduces a new failure validation framework for plugins for early detection of configuration errors in pipelines. Additionally, you can now connect to your data in Azure Data Lake Storage (ADLS) and transform it visually in Wrangler.</p><h3>Field Level Lineage</h3><p>The new field level lineage feature allows users to better visualize the journey of individual fields in datasets. The updated design also makes relationships between fields easier to see by connecting fields in the UI, and users can analyze relationships and transformations by selecting specific fields. For more information on Field Level Lineage, see this blog post series: <a href="https://medium.com/cdapio/field-level-lineage-part-1-3cc5c9e1d8c6">part 1</a>, <a href="https://medium.com/cdapio/designing-field-level-lineage-part-2-b6c7e6af5bf4">part 2</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*y7gMnNOBTQO7vDp6" /><figcaption>Example of Field Level Lineage visualization</figcaption></figure><h3>Pipeline Improvements</h3><p>This release also introduces a new way of validating plugin configurations that allows us to detect and highlight errors within a pipeline, so that you can catch configuration errors sooner and build pipelines more efficiently. All the configuration fields with errors are highlighted once the user opens the plugin properties, and a helpful error message is also displayed below each field. This greatly reduces the amount of time spent configuring and validating pipelines. For more information on the validation framework, please see this <a href="https://medium.com/cdapio/validation-framework-for-cdap-plugins-fe646757cabc">blog post</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/973/0*zV_sgtr58KbRStZV" /><figcaption>Example of validation in the Joiner plugin</figcaption></figure><p>This release also adds new widgets that give users better controls to configure their pipelines more intuitively. These include ConfigurationGroup — a wrapper component that can hold other widgets. It allows for easier visual grouping of related fields. Additionally, developers can now use dynamic filters to define custom conditions for when certain fields should be displayed or hidden depending on the contents of other fields within the ConfigurationGroup. For example, show the ZipCode field if the value of the Country field equals “USA”.</p><h3>Platform Enhancements</h3><p>This release contains performance improvements to several areas across the platform. The most notable improvements can be seen on the pipeline list page and from within the Studio. These improvements have significantly improved the performance of the UI under heavy load.</p><p><a href="https://cdap.io/get-started/">Download CDAP 6.1.1 today</a> and take it for a spin!
Also consider helping us develop the platform by <a href="mailto:cdap-user@googlegroups.com">reaching out to the community</a> with any comments, feedback, suggestions, or improvements or by creating and following <a href="https://issues.cask.co/browse/CDAP">JIRA issues</a> and submitting <a href="https://github.com/cdapio/cdap">pull requests</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ef35efd6fd96" width="1" height="1" alt=""><hr><p><a href="https://medium.com/cdapio/announcing-cdap-6-1-1-release-ef35efd6fd96">Announcing CDAP 6.1.1 Release</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to debug your plugin]]></title>
            <link>https://medium.com/cdapio/how-to-debug-your-plugin-f5d41dba6805?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/f5d41dba6805</guid>
            <category><![CDATA[cdap]]></category>
            <category><![CDATA[intellij]]></category>
            <category><![CDATA[plugins]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[big-data]]></category>
            <dc:creator><![CDATA[Tony Hajdari]]></dc:creator>
            <pubDate>Mon, 27 Jan 2020 14:01:01 GMT</pubDate>
            <atom:updated>2020-01-27T14:01:01.313Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tlFVc7Jkfn3ThePhFPv9rQ.jpeg" /></figure><p>In a previous <a href="https://medium.com/cdapio/getting-started-with-cdap-plugin-development-bcd21cc7ae66">article</a> I showed you how to get started with plugin development in CDAP. But, wouldn’t it be nice if we can attach a debugger to a deployed plugin and see how it performs inside of a pipeline with actual data?</p><p>I’m going to show you how to attach a debugger to a CDAP sandbox instance so that we can inspect the data that is being processed by the plugin. This will help us get a better sense of what the plugin is doing, and all the other interesting bits that we can observe from the objects we’ll have access to.</p><p>It’s always handy to be able to see what data is being passed along from a previous stage in a pipeline and how that data is being passed to the next stage in the pipeline. When you build a plugin it’s always best to start simple and layer on additional functionality as you go. This article builds on the knowledge from the previous plugin development article and once again I’ll use the <a href="https://github.com/data-integrations/example-transform.git"><strong>example-transform</strong></a> plugin for illustration.</p><h4><strong>Clone and Build the Plugin</strong></h4><p>First, clone the <a href="https://github.com/data-integrations/example-transform.git"><strong>example-transform</strong></a> plugin from the Git repo.</p><blockquote>git clone <a href="https://github.com/data-integrations/example-transform.git">https://github.com/data-integrations/example-transform.git</a></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tE2-0vwCR7ZEp2EW" /></figure><p>Open up the <strong>pom.xml</strong> in IntelliJ as a project — we’ll come back to it later.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V5nvyNjVutVcgC7i" /></figure><p>At this point we won’t be making any changes to the source code. We just want to build the project as is and deploy the resulting plugin to CDAP. You can build it manually in the terminal or use Maven inside of IntelliJ to build the project. The command to use is:</p><blockquote>mvn clean package</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EU5LIVdY7Fn3GisT" /></figure><p>Once the build finishes you will find the plugin artifacts in the resulting target directory as illustrated.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mYgg3JK_hdJUNt48" /></figure><h4><strong>Start CDAP in Debug Mode</strong></h4><p>In order for us to debug the plugin we must first start CDAP in debug mode so out debugger has something to attach to. To enable the debugging agent simply add the following command argument to your normal cdap start command:</p><blockquote>— enable-debug</blockquote><p>The full comand would thus be:</p><blockquote>cdap sandbox start — enable-debug</blockquote><p>Optionally, if you want to use a port other than the default port 5050, simply add the port number after the enable debug command.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Gt-lupIII4cTJ8Ak" /></figure><p>Take note of the following line in the output.</p><blockquote><strong><em>Remote debugger agent started on port 5005</em></strong></blockquote><p>This indicates that the debugger agent service has been started on port 5005. 
We will see this port referenced again when we attach the debugger to it. Now that CDAP is running in debug mode, deploy the plugin to CDAP and build a quick pipeline that passes some data on to this example-transform. Keep in mind that this is a transform plugin, so you will find it within the Transform section of the Pipeline Studio. Below is a screencap that illustrates this process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*phCYIRJPOHofkrZRHlBEXA.gif" /></figure><h3><strong>Start Debugger in IntelliJ</strong></h3><p>With the plugin now added to our pipeline we are ready to start debugging the plugin code with IntelliJ while the pipeline is running. We will run the pipeline in preview mode so that we can iterate over deployment of the plugin and observe the effects of any changes we make.</p><p>Open up the source file for the ExampleTransformPlugin class, navigate to the transform method, and put a breakpoint somewhere in that method. From the IntelliJ menu select <strong>Run → Attach to Process</strong>. This will open up a dialog showing the process listening on 5005. Select the Java process and connect the debugger.</p><p>Switch over to the Pipeline Studio, put it in preview mode, and run it. Switch back to IntelliJ and voilà: your breakpoint has been reached and you can now inspect the data coming from the Wrangler stage. Clicking on the <strong>fields</strong> variable will show you that you have an object with 12 elements. You can now inspect each of the elements within this object and see what results it produces. As you step over each iteration of the loop you will see the records being submitted in sequence through the pipeline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ztnn3tQhlm_vA6sd_96c7w.gif" /></figure><p>While you're in debug mode the preview for the pipeline will continue to run until you stop the debugger or the pipeline preview.</p><h4>Conclusion</h4><p>Debugging is an essential part of plugin development. Being able to pause the pipeline in particular blocks of code helps us better understand what is happening within the code, how the plugin accesses all the other information that is passed along in the DAG, and the objects we have access to. In particular you can see the StructuredRecord object with the input schema and the fields it contains, so that you can determine how to deal with these records in your code. Hopefully you now have a good handle on how to proceed with debugging your plugin and can use the debugger to create plugins with less effort.</p><p>Happy coding!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ADr6GhxcIpm44G4RZCcJ4w.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f5d41dba6805" width="1" height="1" alt=""><hr><p><a href="https://medium.com/cdapio/how-to-debug-your-plugin-f5d41dba6805">How to debug your plugin</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deploying and Running CDF pipelines with REST]]></title>
            <link>https://medium.com/cdapio/deploying-and-running-cdf-pipelines-with-rest-1d469a243bd0?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/1d469a243bd0</guid>
            <category><![CDATA[cdap]]></category>
            <category><![CDATA[rest-api]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[data-pipeline]]></category>
            <dc:creator><![CDATA[Tony Hajdari]]></dc:creator>
            <pubDate>Mon, 13 Jan 2020 16:00:41 GMT</pubDate>
            <atom:updated>2020-01-13T16:00:41.492Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YpTGz-zvD88w8EZZ" /><figcaption>Photo by <a href="https://unsplash.com/@matthew_t_rader?utm_source=medium&amp;utm_medium=referral">Matthew T Rader</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>By now you probably know that you run CDAP pipelines on Cloud Data Fusion (CDF), but did you know that you can also control your CDF instance remotely using the rest API. In this blog I will walk you through the process of deploying and starting a CDAP pipeline on CDF using only a REST client. We’ll use the handy <strong>curl</strong> utility since it is easy to work with and is available on most platforms.</p><p>Using the REST API you can do quite a lot with CDF. You can create an instance from scratch, retrieve information about running instances, deploy pipelines and run them, and all the sorts of things that DevOps folks get the warm and fuzzies over. The underlying CDAP API is well documented on the <a href="https://docs.cdap.io/cdap/current/en/reference-manual/http-restful-api/index.html">CDAP site,</a> so it’s probably a good idea to head over there for a quick glance at the types of things you can use the API to automate your CDF and CDAP workflows. What you won’t find on the CDAP site is the documentation around all the Googley things that you need to set up for GCP. You can find the reference documentation for the CDF REST API on the Google site <a href="https://cloud.google.com/data-fusion/docs/reference/cdap-reference">here</a>.</p><h3>Set up GCP</h3><p>The first thing you will need is the Google Cloud SDK, so go on and download it and install it by following the instructions <a href="https://cloud.google.com/sdk/docs/">here</a>. If you have not set up a CDF instance yet go ahead and do that as well. Instructions for setting up CDF are located <a href="https://cloud.google.com/data-fusion/docs/how-to/create-instance">here</a>. You can select either Basic or Enterprise edition. For this blog I’ll be using the Basic edition.</p><p>After you’ve downloaded and installed the <strong>gcloud</strong> SDK you will need to authenticate and generate a Google OAuth 2.0 token. On your terminal type the following command:</p><blockquote>$ gcloud auth login &lt;YOUR_ROJECT_ID&gt;</blockquote><p>Make sure to substitute &lt;YOUR_ROJECT_ID&gt; with the value from the ID column in the project ID list.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tjdXLlNcJ2vWobfm" /></figure><p>Entering this command will also launch your browser and take you through a typical authentication process for Google where you will be asked to provide a Google account (Gmail account) and grant access to the Google SDK. Once complete the terminal window with acquire the authentication token and tell you that you are now logged in with the user account you specified.</p><p>With authentication out of the way we can now set the environment variables that will contain the authentication token and the link to the CDF instance. 
Set the two environment variables to make the subsequent <strong><em>curl</em></strong> commands easier to work with.</p><blockquote>export AUTH_TOKEN=$(gcloud auth print-access-token)</blockquote><blockquote>export CDF_API=https://<strong>id</strong>-dot-<strong>region</strong>.datafusion.googleusercontent.com</blockquote><p>Take note that the <strong>CDF_API</strong> variable contains information from your CDF instance: the project ID and the region that CDF is deployed in. This URI may not be the most intuitive to construct, so you can go to the following <a href="https://cloud.google.com/data-fusion/docs/reference/rest/v1beta1/projects.locations.instances/list">page</a> to retrieve your link from the CDF API for <strong>projects.locations.instances.list</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WtzI8kz1jhZBm-h3" /></figure><p>Make sure to put <strong>projects/PROJECT_ID/locations/-</strong> in the <strong>parent</strong> field, substitute PROJECT_ID with your own, and click on the blue <strong>EXECUTE</strong> button. Your JSON response in the output should look as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kUcmstCSSmfT4LEIPzAT9A.png" /></figure><p>The part that interests us is the value associated with the <strong>apiEndpoint</strong> key. Copy this string and add it to the environment variable as illustrated below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LFooce55Ijfsfkim" /></figure><p>In my case the <strong>id</strong> is <strong><em>rest-demo-blog4rest-212908</em></strong> and the <strong>region</strong> is <strong><em>usw1</em></strong>.</p><h3>Create, Deploy and Run the Pipeline</h3><p>Let's build a pipeline using the CDAP sandbox and see how we can upload it to the CDF instance via REST. I created a simple pipeline and exported the JSON file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*D5ZQre0kw3K1Gsvd" /></figure><p>The pipeline itself is rudimentary and its purpose is simply to demonstrate that I can deploy the pipeline to CDF and run it. As a preparatory step I uploaded a CSV file to the GCS bucket for the sink. Validate and test the pipeline, and if all is good you can export it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5oNddCk2bgkd0gIT" /></figure><p>We now have a pipeline that we can deploy to CDF. The command to deploy the pipeline JSON with curl is as follows:</p><blockquote>curl -X PUT -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDF_API}/v3/namespaces/<strong>namespace-id</strong>/apps/<strong>pipeline-name</strong>" -d "@/path/to/the/pipeline/JSON"</blockquote><p>Make sure to substitute <strong>namespace-id</strong> and <strong>pipeline-name</strong> with your own values. In my case the values are <strong>default</strong>, because I'm deploying to the default namespace, and <strong>my-rest-pipeline</strong>, which is the name I've given to this pipeline. The final component is the JSON file comprising your pipeline that was exported earlier. Don't forget the <strong>@</strong> symbol before the path.</p><p>The command entered into the console should look like the illustration below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*im8Rkt9IlyjzY5Mr" /></figure><p>It may take a few moments for the command to complete, but once it does you will see a <strong>Deploy Complete</strong> message on the console.
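</p><p>If you want to confirm the deployment from the terminal as well, you can list the applications in the namespace over the same API. This is a quick sanity check, reusing the AUTH_TOKEN and CDF_API variables from earlier; the endpoint is part of the standard CDAP REST API:</p><blockquote>curl -X GET -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDF_API}/v3/namespaces/default/apps"</blockquote><p>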
You can now check your CDF instance and see that the pipeline does indeed show up in the list of pipelines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*L25uo1nUGoyhHt_-" /></figure><p>Open the pipeline and inspect it to make sure that everything looks in order, but don't run it manually just yet.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WHSJh2eyZTugQ7Lg" /></figure><p>OK, so now we have a pipeline deployed via the REST API. Next, let's issue the API command to run it.</p><p>The command to run the pipeline via REST is:</p><blockquote>curl -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDF_API}/v3/namespaces/<strong>namespace-id</strong>/apps/<strong>pipeline-name</strong>/workflows/DataPipelineWorkflow/start"</blockquote><p>Once again make sure to make the necessary substitutions in the command. The command entered into the console should look like the illustration below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PQXL43FBujOlwbge" /></figure><p>I added a timestamp for purposes of comparison, and as we can observe the timestamp that shows up in the CLI is identical to the one displayed in the pipeline list view. The pipeline will follow the typical lifecycle of <strong>Provisioning → Starting → Running → Success</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KUnlgAWR321NXlyg" /></figure><p>In the next illustration we can see the pipeline has entered the starting phase.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ayG-i-RbRbD9X_5F" /></figure><h3>Conclusion</h3><p>As you saw in this article, we can leverage the CDAP API with Cloud Data Fusion to manage a CDF instance, and we used some of the most rudimentary aspects of the API to deploy and run a pipeline remotely with <strong><em>curl</em></strong>. This of course is only the tip of the iceberg, and you can do significantly more with the API.</p><p>A future use to consider is infrastructure as code, where you can provision whole CDF environments and deploy pipelines to the newly created instances. Beyond this, you can monitor and manage your CDF instances in an automated fashion, which will surely put a smile on the faces of all those DevOps folks who have been waiting for this feature in GCP.</p><p>What use cases do you envision for the REST API?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NwvV2ZWJgfD6T_eg" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1d469a243bd0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/cdapio/deploying-and-running-cdf-pipelines-with-rest-1d469a243bd0">Deploying and Running CDF pipelines with REST</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Validation Framework for CDAP Plugins]]></title>
            <link>https://medium.com/cdapio/validation-framework-for-cdap-plugins-fe646757cabc?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/fe646757cabc</guid>
            <category><![CDATA[data-pipeline]]></category>
            <category><![CDATA[validation]]></category>
            <category><![CDATA[cloud-data-fusion]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[cdap]]></category>
            <dc:creator><![CDATA[Vinisha Shah]]></dc:creator>
            <pubDate>Mon, 06 Jan 2020 14:01:03 GMT</pubDate>
            <atom:updated>2020-01-06T14:01:03.014Z</atom:updated>
<content:encoded><![CDATA[<p>CDAP provides an interactive UI for building data pipelines that apply code-free transformations on data. A CDAP data pipeline is an acyclic graph composed of multiple plugins as its nodes, with the connections between them representing data flow. Each plugin in the pipeline can be configured by providing configuration properties and the input and output schema for the plugin.</p><p>A CDAP data pipeline solving a real-world use case can contain ~10 or more nodes in the graph. While building such pipelines, developers can provide invalid plugin configurations or schemas. For example, the BigQuery sink plugin can have an output schema that does not match the underlying BigQuery table, or a GCS source can have an invalid bucket name.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/761/0*fF6nl1Ld1GcY6Am-" /></figure><p>When a deployed pipeline fails due to invalid configurations, a common pipeline development flow is to check the logs to identify the invalid configuration, clone the data pipeline, and run it again. This iterative process increases data pipeline development time. Imagine how much time it would take to build a data pipeline with ~10–20 nodes. Also, during this process, pipeline developers have to dig through technical logs, stack traces, or even the code base in order to identify problems such as NullPointerExceptions.</p><p>In the upcoming CDAP release 6.1.1, a new validation framework has been introduced to fail fast and collect all the validation failures. This framework also exposes a FailureCollector API to surface contextual error messages in the CDAP UI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*g9lQ4cAMpOM1WwQJ" /></figure><p>Let's see what this framework provides and how it can be used in plugins to surface contextual error messages.</p><p>The validation framework should be able to collect multiple error messages in order to provide a better user experience while building the pipeline. To do that, the framework exposes a small FailureCollector API; its typical usage is sketched below.</p><h3>Error collection using the FailureCollector API</h3><p>CDAP plugins override the configurePipeline() method, which is used to configure the stage at deploy time. The same method is called through the validation endpoint as well. In order to collect multiple validation failures, the FailureCollector is exposed through the stage configurer so that configurations can be validated in this method.</p>
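<p>As a concrete sketch of that wiring, reconstructed against the 6.1.1 APIs (the <em>config.validate()</em> helper is illustrative, standing in for plugin-specific checks):</p><pre>@Override<br>public void configurePipeline(PipelineConfigurer pipelineConfigurer) {<br>  StageConfigurer stageConfigurer = pipelineConfigurer.getStageConfigurer();<br>  FailureCollector collector = stageConfigurer.getFailureCollector();<br>  // Plugin-specific checks add zero or more failures to the collector.<br>  config.validate(collector);<br>  // Throw a ValidationException if any failures were collected.<br>  collector.getOrThrowException();<br>}</pre>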
<h3>Adding ValidationFailures to FailureCollector</h3><p>A validation failure is made up of 3 components:</p><ul><li>Message — Represents a validation error message</li><li>Corrective action — An optional corrective action that represents an action to be taken by the user to correct the error situation</li><li>Causes — Represents one or more causes for the validation failure. Each cause can have more than one attribute. These attributes are used to highlight different sections of the plugin in the UI.</li></ul><p>Example:</p><p>In the BigQuery source, if the bucket config contains invalid characters, a new validation failure is added to the collector with a <em>stageConfig</em> cause attribute as below:</p><pre>Pattern p = Pattern.compile(&quot;[a-z0-9._-]+&quot;);<br>if (!p.matcher(bucket).matches()) {<br>  collector.addFailure(<br>    &quot;Allowed characters are lowercase characters, numbers, &#39;.&#39;, &#39;_&#39;, and &#39;-&#39;.&quot;,<br>    &quot;Bucket name should only contain allowed characters.&quot;)<br>    .withConfigProperty(&quot;bucket&quot;);<br>}</pre><p>While a ValidationFailure allows plugins to add a cause with any arbitrary attributes, the ValidationFailure API provides various util methods to create validation failures with <a href="https://github.com/cdapio/cdap/blob/develop/cdap-app-templates/cdap-etl/cdap-etl-api/src/main/java/io/cdap/cdap/etl/api/validation/CauseAttributes.java">common causes</a> that can be used to highlight the appropriate UI sections. Below is the list of common causes and associated plugin usage:</p><h4>1. Stage config cause</h4><p>Purpose: Indicates an error in the stage property</p><p>Scenario: User has provided an invalid bucket name for the BigQuery source plugin</p><p>Example:</p><pre>collector.addFailure(&quot;Allowed characters are lowercase characters, numbers, &#39;.&#39;, &#39;_&#39;, and &#39;-&#39;.&quot;, &quot;Bucket name should only contain allowed characters.&quot;).withConfigProperty(&quot;bucket&quot;);</pre><h4>2. Plugin not found cause</h4><p>Purpose: Indicates a plugin not found error</p><p>Scenario: User is trying to use a plugin/JDBC driver that has not been deployed</p><p>Example:</p><pre>collector.addFailure(&quot;Unable to load JDBC driver class &#39;com.mysql.jdbc.Driver&#39;.&quot;, &quot;Jar with JDBC driver class &#39;com.mysql.jdbc.Driver&#39; must be deployed.&quot;).withPluginNotFound(&quot;driver&quot;, &quot;mysql&quot;, &quot;jdbc&quot;);</pre><h4>3. Config element cause</h4><p>Purpose: Indicates a single element in the list of values for a given config property</p><p>Scenario: User has provided a field to keep in the projection transform that does not exist in the input schema</p><p>Example:</p><pre>collector.addFailure(&quot;Field to keep &#39;non_existing_field&#39; does not exist in the input schema.&quot;, &quot;Field to keep must be present in the input schema.&quot;)<br>  .withConfigElement(&quot;keep&quot;, &quot;non_existing_field&quot;);</pre><h4>4. Input schema field cause</h4><p>Purpose: Indicates an error in an input schema field</p><p>Scenario: User is using the BigQuery sink plugin, which does not support record fields</p><p>Example:</p><pre>collector.addFailure(&quot;Input field &#39;record_field&#39; is of unsupported type.&quot;,<br>  &quot;Field &#39;record_field&#39; must be of primitive type.&quot;)<br>  .withInputSchemaField(&quot;record_field&quot;, null);</pre><h4>5. Output schema field cause</h4><p>Purpose: Indicates an error in an output schema field</p><p>Scenario: User has provided an output schema field that does not exist in the BigQuery source table</p><p>Example:</p><pre>collector.addFailure(<br>  &quot;Output field &#39;non_existing&#39; does not exist in table &#39;xyz&#39;.&quot;,<br>  &quot;Field &#39;non_existing&#39; must be present in table &#39;xyz&#39;.&quot;)<br>  .withOutputSchemaField(&quot;non_existing&quot;, null);</pre><h4>Cause Associations</h4><p>While validating the plugin configurations, a validation failure can have multiple causes. Below are a few examples of associated causes:</p><p>Example 1:</p><p>A database source has username and password as co-dependent properties.
If username is not provided but password is, the plugin can add a single validation failure with 2 causes as below:</p><pre>collector.addFailure(&quot;Missing username.&quot;,<br>  &quot;Username and password must be provided.&quot;)<br>  .withConfigProperty(&quot;username&quot;).withConfigProperty(&quot;password&quot;);</pre><p>Example 2:</p><p>The projection transform received incompatible input and output schemas for a field, such that the input field cannot be converted to the output field. In that case a new validation failure can be created with 2 different causes as below:</p><pre>collector.addFailure(&quot;Input field &#39;record_type&#39; cannot be converted to string.&quot;, &quot;Field &#39;record_type&#39; must be of primitive type.&quot;).withConfigProperty(&quot;convert&quot;).withInputSchemaField(&quot;record_type&quot;);</pre><h3>Summary</h3><p>The validation framework reduces pipeline development time by failing fast and providing contextual error messages. It is available in the upcoming CDAP release 6.1.1 to provide a better user experience. Try out CDAP today, and if you would like to explore such exciting challenges, consider contributing to CDAP.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fe646757cabc" width="1" height="1" alt=""><hr><p><a href="https://medium.com/cdapio/validation-framework-for-cdap-plugins-fe646757cabc">Validation Framework for CDAP Plugins</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Getting started with CDAP plugin development]]></title>
            <link>https://medium.com/cdapio/getting-started-with-cdap-plugin-development-bcd21cc7ae66?source=rss----52cb94a9300a---4</link>
            <guid isPermaLink="false">https://medium.com/p/bcd21cc7ae66</guid>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[cdap]]></category>
            <category><![CDATA[open-source]]></category>
            <dc:creator><![CDATA[Tony Hajdari]]></dc:creator>
            <pubDate>Mon, 02 Dec 2019 14:01:01 GMT</pubDate>
            <atom:updated>2019-12-02T14:01:01.661Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LD9FLKRky_X8xVyt" /><figcaption>Photo by <a href="https://unsplash.com/@steve_j?utm_source=medium&amp;utm_medium=referral">Steve Johnson</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>One of my favorite features of CDAP is that the extensibility of the platform allows you to add new functionality yourself. If you need to perform a transformation, or need to source or sink data to or from a system that is not currently available in the plugin ecosystem, you can easily add your own plugin to provide that capability.</p><p>Getting started with plugin development is as simple as cloning one of the example plugins and modifying it to add your own implementation logic. In this article I will cover what you’ll need to know to get started with plugin development. I’ll discuss the key configuration elements that you need to be familiar with so you can iterate over the development of the plugin.</p><h3>Prerequisites</h3><p>When writing plugins for CDAP you will need a collection of tools to help you develop, organize, build, and run your code. For this we’ll be relying on a number of tools: <strong>Git, Java 8, IntelliJ IDE, </strong>and<strong> Maven</strong>.</p><p>Fist step is to make sure you have Java 8 SDK installed on your computer. To check which version of Java you have run the following command in a terminal:</p><blockquote><strong>java -version</strong></blockquote><p>If your computer has a newer version of Java you will need to install a version of Java 8 SDK. You can download this from the Java <a href="https://jdk.java.net/java-se-ri/8">web site</a><a href="https://jdk.java.net/java-se-ri/8,">,</a> and we’ll use it later in the Java IDE for you project.</p><p>If you are on Linux you can install it with your package manager, Yum or APT, for Centos/RedHat or Debian/Ubuntu, respectively. For installing Java on a Mac with Homebrew you can use the following commands:</p><blockquote>brew tap AdoptOpenJDK/openjdk</blockquote><blockquote>brew cask install adoptopenjdk8</blockquote><p>If you have the latest version of Java on your Mac and have issues changing the version of Java to 1.8 then refer to this<a href="https://stackoverflow.com/questions/21964709/how-to-set-or-change-the-default-java-jdk-version-on-os-x/24657630#24657630"> Stack Overflow article</a>.</p><p>For writing the actual plugin code we’ll use the IntelliJ Java IDE which you can download from <a href="https://www.jetbrains.com/idea/">here</a>.</p><p>Once our code is written, we’ll use Maven to build the plugin so that we can test it on CDAP. You can download Apache Maven from <a href="https://maven.apache.org/download.cgi">here</a>. After you’ve downloaded Maven extract it to a directory of your choice and set your PATH environment variable so that it is available globally in your terminal. 
In my case my <strong><em>.bash_profile</em></strong> file has the <strong><em>PATH</em></strong> variable set as follows:</p><blockquote>export PATH=$PATH:/Users/vetoni/apps/tools/apache-maven-3.6.2/bin</blockquote><p>With the environment set up we can now use Git to clone the example-transform plugin from the repo with the following command:</p><blockquote>git clone <a href="https://github.com/data-integrations/example-transform.git">https://github.com/data-integrations/example-transform.git</a></blockquote><p>Git will create a folder named <strong><em>example-transform</em></strong>, and in that folder you will find a <strong>pom.xml</strong> file. We will use this file to load up the project in IntelliJ. Launch IntelliJ, select <strong><em>open</em></strong>, and make sure to choose <strong><em>Open as Project</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/906/0*JQriPagggNs-sUsN" /></figure><p>After IntelliJ finishes loading all the dependencies you will see a list of directories created with the components that we'll be using in this tutorial.</p><p>The screen will look something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5JAYCcdyfXVfoEXM" /></figure><p>The plugin in its current state is ready to be built, so you can try building it to make sure that Java 8 and Maven are working happily together. First check to make sure that you have set your Java version to 1.8. From <strong><em>file</em></strong>, select <strong><em>project structure</em></strong> and take note of the Java version listed for the <strong><em>Project SDK</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/792/0*YHFmaw08L9adaPgZ" /></figure><p>If you don't see Java 1.8 listed in the dropdown you can add it to the project by clicking the <strong><em>New…</em></strong> button and selecting JDK.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Xx-pQoL9Za-dkQ7t" /></figure><p>Pick the location where you downloaded your Java 8 SDK earlier. On my machine Java 8 is located at the following location:</p><blockquote>/Library/Java/JavaVirtualMachines/jdk1.8.0_121.jdk/Contents/Home/</blockquote><p>You are now ready to test building the plugin. On the right-hand side of the screen select the <strong><em>Maven Projects</em></strong> tab and select <strong><em>package</em></strong> from the list of Lifecycle commands.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mjkBc4q6DDpu02HI" /></figure><p>This will invoke the Maven build, test, and package targets and create a new <strong><em>target</em></strong> directory with the JAR file containing the plugin — <strong>example-transform-1.1.0-SNAPSHOT.jar</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HHH0jsNCQzvotGAb" /></figure><h3>Project Structure</h3><p>Before getting started with the actual coding, let's take a look at the project structure for the plugin. The plugin has a number of files that need to be updated according to your specific needs.
Looking at the folder structure you can see the four directories that contain the files we need to modify.</p><ul><li><strong>docs</strong> → ExampleTransform-transform.md</li><li><strong>icons</strong> → ExampleTransform-transform.png</li><li><strong>src</strong> → ExampleTransformPlugin.java</li><li><strong>widgets</strong> → ExampleTransform-transform.json</li></ul><p>In each of these directories, with the exception of the src folder, you will see a file named with the following pattern:</p><blockquote>&lt;PluginName&gt;-&lt;pluginType&gt;.&lt;fileType&gt;</blockquote><p>Take note that the icons folder is optional, so when you first clone the repo you will not see the folder in your project structure. If you don't create the folder and provide your own 64x64 pixel icon, CDAP will provide a generic plugin icon for you. For this tutorial I'm going to supply my own plugin icon.</p><p>In the Java class you will notice there is a <strong><em>@Name</em></strong> annotation that matches the prefix of all the files used in the directories mentioned above. Pay close attention to the name of the plugin, as this is where you may run into issues later on as you refactor the code to change the plugin name.</p><blockquote><strong>NOTE:</strong> The plugin name, as specified via the @Name annotation, must match the file prefixes in the docs, icons, and widgets directories.</blockquote><p>At the root directory we also have the <strong>README.md</strong> file that provides information and any documentation you may want to provide to users, as well as the <strong>pom.xml</strong> file that is used to build the plugin artifacts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JjdUb5uRYTR6c-oI" /></figure><p>We are now ready to start customizing the <strong>ExampleTransform</strong> plugin and modifying it to suit our needs.</p><p>I'll start with some refactoring so that the plugin will reflect the intent of the implementation. First I'm going to modify the <strong>pom.xml</strong> and update the following properties with my own values.</p><blockquote>groupId → io.cdap.hydrator.plugins</blockquote><blockquote>artifactId → simple-mask</blockquote><blockquote>version → 1.0.0-SNAPSHOT</blockquote><blockquote>name → Simple Mask</blockquote><blockquote>exported-packages → io.cdap.hydrator.plugin.*</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JF7m1J34ZmBIlAHs" /></figure><p>For this plugin we are going to implement a super simple Rot13 Caesar cipher to obfuscate text data. This cipher simply takes some input text and jumbles it using the Rot13 algorithm. To help us with the actual implementation we're going to leverage a pre-existing library that provides us a Rot13 implementation, and we're simply going to add it to our pom.xml file.</p><p>Add the following XML snippet to your dependencies section.</p>
<pre>&lt;!-- https://mvnrepository.com/artifact/org.soulwing/rot13 --&gt;<br>&lt;dependency&gt;<br>  &lt;groupId&gt;org.soulwing&lt;/groupId&gt;<br>  &lt;artifactId&gt;rot13&lt;/artifactId&gt;<br>  &lt;version&gt;1.0.0&lt;/version&gt;<br>&lt;/dependency&gt;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mlXi4Jz3QmvpL5dv" /></figure><p>IntelliJ will download the dependencies and you will be able to use the cipher classes in your code.</p><h3>Refactor</h3><p>Now we can start refactoring the files I discussed earlier. In IntelliJ you can use Shift+F6 to rename the files, package (folder), and class to <strong>SimpleTransform</strong>. I chose to name it something more generic so I can come back to it in the future and add more Caesar ciphers. Later I will refactor this plugin once more to change its name.</p><p>After you refactor, the project files will look as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*L0uG2ExHLqgHDcuY" /></figure><p>Now's a good time to build the project once again to make sure that nothing is broken. From the terminal you can run the following command to build the project:</p><blockquote><strong>mvn clean package</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rQPUF1H5RsU_a3GQ" /></figure><p>When the build runs, it will scan the <strong>widgets</strong> and <strong>docs</strong> directories in order to build an appropriately formatted <strong>.json</strong> file under the <strong>target</strong> directory. This file is deployed along with your <strong>.jar</strong> file to add your plugin to CDAP.</p><p>We can see Maven creates a target directory where we can find the plugin artifacts:</p><p><strong>-&gt; simple-mask-1.0.0-SNAPSHOT.jar</strong></p><p><strong>-&gt; simple-mask-1.0.0-SNAPSHOT.json</strong></p><p>We can now deploy the plugin to see what it looks like in CDAP Studio. Launch CDAP Studio, click on the green plus icon, and upload the plugin as illustrated.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oblihjxEpME1UVy_" /></figure><h3>Time to Code</h3><p>We now have all the prep and setup work out of the way, and we have validated that the plugin compiles and loads into CDAP correctly. Now we can turn our attention to coding. In the next step we'll write the logic for the transform plugin that takes text as input and generates scrambled text as output. Along the way I will refactor the code as I go so that it matches my requirements, and will rename the plugin to <strong>ScrambleText</strong> so that its intent and purpose is absolutely clear.</p><p>All plugins have an API that includes a number of annotations as well as some configuration, initialization, validation, and, depending on the plugin type, a method that implements the plugin's core functionality. In this case, the majority of the implementation for a transform plugin is done in the <strong>transform()</strong> method.
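</p><p>To give a rough feel for where we are headed, here is a minimal sketch of such a <strong>transform()</strong> method. This is not the exact code from the screenshots below: it reuses the input schema and inlines a Rot13 helper so the sketch stays self-contained, whereas the tutorial pulls in the org.soulwing library for the cipher instead.</p><pre>@Override<br>public void transform(StructuredRecord input, Emitter&lt;StructuredRecord&gt; emitter) throws Exception {<br>  StructuredRecord.Builder builder = StructuredRecord.builder(input.getSchema());<br>  for (Schema.Field field : input.getSchema().getFields()) {<br>    Object value = input.get(field.getName());<br>    // Scramble string fields; pass everything else through untouched.<br>    builder.set(field.getName(), value instanceof String ? rot13((String) value) : value);<br>  }<br>  emitter.emit(builder.build());<br>}<br><br>// Inline Rot13 so the sketch is self-contained.<br>private static String rot13(String in) {<br>  StringBuilder out = new StringBuilder(in.length());<br>  for (char c : in.toCharArray()) {<br>    if (c &gt;= &#39;a&#39; &amp;&amp; c &lt;= &#39;z&#39;) {<br>      out.append((char) (&#39;a&#39; + (c - &#39;a&#39; + 13) % 26));<br>    } else if (c &gt;= &#39;A&#39; &amp;&amp; c &lt;= &#39;Z&#39;) {<br>      out.append((char) (&#39;A&#39; + (c - &#39;A&#39; + 13) % 26));<br>    } else {<br>      out.append(c);<br>    }<br>  }<br>  return out.toString();<br>}</pre><p>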
For further details on the various plugin types that you can build and how the API works, please refer to the <a href="https://docs.cdap.io/cdap/6.0.0/en/developer-manual/pipelines/developing-plugins/creating-a-plugin.html#H2614">CDAP documentation page</a>.</p><p>Replace this section of code…</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PwjMYb63K4OTZliv" /></figure><p>So that it looks like the following snippet.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-WncN3dv5VQfFZUM" /></figure><p>As you can see, we make use of the Rot13 class to convert the input string and set it as the output value for that record. IntelliJ should auto-import the class for you when you use it.</p><p>Next, we need to handle user input from the plugin configuration in CDAP Studio. Input that is collected from the user will dictate how the plugin needs to behave and what output it should generate. Input validation is another important function that needs to be performed, as you can never be sure that you will be getting valid data from a user. The degree to which you configure your widgets and how you handle user input is totally up to you and what features you want to include in your plugin. For this example we're going to keep it as simple as possible just to illustrate the concepts. For now make sure your config class looks like the following illustration:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*_2d7-omzILe6GlGg" /></figure><p>Now we can turn our attention to customizing the plugin configuration form so that it looks like the following illustration. Plugin presentation is configured in the widget JSON file. Refer to the CDAP documentation for all the <a href="https://docs.cdap.io/cdap/6.0.0/en/developer-manual/pipelines/developing-plugins/presentation-plugins.html">presentation widgets</a> that you can use in your plugins.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*B9V0xxdbytfPgzkT" /></figure><p>Replace the contents of the widgets JSON file with the following, and take note of the <strong>name</strong> used for the KeyValue widget, as this is the variable that is used in the <strong>Config</strong> section of the code to work with form inputs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wuEGZFVpseWBm5k5Jsme0w.png" /></figure><p>Right now the plugin doesn't do much validation; it only looks for the existence of an input field that will contain the text to be scrambled. It will take all the fields from the previous stage and apply the Rot13 algorithm to generate scrambled output text. At this point you'll need to either comment out the code in the test class, as it will fail to build, or skip the tests for now when running the Maven build.</p><p>Finally, time to test the plugin in CDAP Studio. At this stage of the plugin's evolution it should be able to take input from a node in a pipeline, scramble the text, and output the scrambled text to the next node in the pipeline. There is only one validation that the plugin performs: it checks for the existence of a <strong>scrambleText</strong> input from the plugin configuration. Even though the plugin does not actually do anything with this data, it needs to be provided in the plugin configuration in order for the plugin validation to succeed.</p><h3>Looking Ahead</h3><p>There is still more to be done! As an exercise, I leave it up to you to expand upon the code to add the form validation as well as select the input fields the plugin will operate upon; a possible starting point is sketched below.</p>
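<p>One possible shape for that starting point, as a sketch (the <strong>scrambleText</strong> name matches the widget input used in this tutorial; everything else is illustrative, not the finished plugin):</p><pre>@Override<br>public void configurePipeline(PipelineConfigurer pipelineConfigurer) {<br>  StageConfigurer stageConfigurer = pipelineConfigurer.getStageConfigurer();<br>  if (config.scrambleText == null || config.scrambleText.trim().isEmpty()) {<br>    // Fail at deployment time rather than at runtime.<br>    throw new IllegalArgumentException(&quot;The scrambleText property must be provided.&quot;);<br>  }<br>  // Propagate the schema so downstream stages know what to expect.<br>  stageConfigurer.setOutputSchema(stageConfigurer.getInputSchema());<br>}</pre>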
<p>Further enhancements you can consider would be to include additional ciphers that the user can select from the drop-down. You will also need to update the test class to match the inputs, transformations, and outputs you want this plugin to test.</p><h3>Summary</h3><p>In this article we reviewed the steps required to get started with plugin development. There are a number of example plugins that can be cloned from the <a href="https://github.com/data-integrations">data-integrations</a> GitHub repo that provide a quick start to plugin development. Using a sample plugin as a starting point for your custom plugin saves you from having to code all the boilerplate, and you'll have all the requisite classes and methods in place to give you a springboard into development.</p><p>There are over a dozen plugin types available in CDAP. Although we did not go into any details of the Plugin API itself, the plugin API is fairly well documented on CDAP's <a href="https://docs.cdap.io/cdap/6.0.0/en/developer-manual/pipelines/developing-plugins/index.html">documentation</a> page, so you can dive deeper into the API on your own and learn about all the different types of plugins you can build.</p><p>Testing your plugin is also an important part of the development process, but that's a topic for another blog article. In future articles we'll look at plugin validation and testing in greater detail.</p><p>Hopefully this has given you a sense of how to get started with developing plugins for CDAP. So jump right in and create a few cool plugins of your own. Happy coding!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bcd21cc7ae66" width="1" height="1" alt=""><hr><p><a href="https://medium.com/cdapio/getting-started-with-cdap-plugin-development-bcd21cc7ae66">Getting started with CDAP plugin development</a> was originally published in <a href="https://medium.com/cdapio">cdapio</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>