Introduction to CDAP Wrangler

Nitin Motgi
Jul 22 · 4 min read

Incomplete or messy datasets are a constant part of working with data. Data from varied sources is often unusable at first, but becomes usable once it has been transformed, mapped and cleansed. As a result, the majority of data scientists', data engineers' and analysts' time is devoted to transforming, cleansing and mapping data rather than to their actual goal: extracting insights and building analytics pipelines and models that leverage that data.

In summary, to take messy, complex data and make it usable for further analysis, you need to wrangle the data. Furthermore, if you need to operationalize the wrangling process, you need operational data pipelines that can be easily scheduled, managed and tracked for data governance.

To make it easy to operationalize data transformation, mapping and cleansing, CDAP introduced Wrangler. CDAP Wrangler is an Accelerator (a CDAP Application) that provides simpler ways to map, transform and harmonize data, apply data quality checks, and enrich data in a code-free, visual manner. Below is a screencast that quickly introduces the capabilities of CDAP Wrangler.

A Screencast (no voice) — Introduction to Wrangler

History

Data Pipelines were introduced in CDAP around 2014–2015. Users defined mappings using JavaScript, Python or Java. Despite the simplicity this provided, a large percentage of users expressed concerns about writing procedural code in a non-traditional environment. In addition to having difficulty debugging logic and identifying issues during development, they found the code hard to maintain and evolve.

Around that time, one of Cask's customers expressed interest in partnering to develop a new approach for defining data mappings and transformations. After collaborating with the customer for a few months, we collectively came up with a new mapping framework that satisfied the following requirements:

  • Support parsing of various data formats,

With those few basic requirements from the customer, Wrangler in CDAP was born!

Overview

CDAP Wrangler was never designed to be a full-fledged data preparation solution for business analysts. Its primary purpose is to provide simpler and easier ways to map and transform data interactively during the process of building data pipelines.

At its core, CDAP Wrangler is a transformation and mapping library with a built-in collection of pre-defined parsers, transformations and data quality checks. In addition to being a reusable library, it also includes:

  • A data pipeline Transform (used in CDAP Data Pipeline) for operationalizing transformations,

Wrangler is a CDAP Application built using public CDAP APIs. It is available on GitHub under the Apache 2.0 License.

Using CDAP Wrangler, users today can:

  • Parse data,
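To make the idea of a library of parsers, transformations and data quality checks more concrete, here is a minimal sketch in Python. It is not Wrangler's actual API or syntax: every name in it (parse_csv, trim, rename, require_non_empty, apply_steps) is hypothetical and chosen only for illustration. It assumes records are plain dictionaries and that each step either returns a record or returns None to drop it.

```python
import csv
import io
from typing import Callable, Dict, List, Optional

# Hypothetical types for this sketch; they are not part of CDAP Wrangler.
Record = Dict[str, str]
Step = Callable[[Record], Optional[Record]]


def parse_csv(text: str, columns: List[str]) -> List[Record]:
    """Parser: turn raw CSV text into records keyed by the given column names."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(columns, row)) for row in reader if row]


def trim(column: str) -> Step:
    """Transformation: strip surrounding whitespace from one column."""
    def step(record: Record) -> Optional[Record]:
        return {**record, column: record[column].strip()}
    return step


def rename(old: str, new: str) -> Step:
    """Mapping: move the value of column `old` to a column named `new`."""
    def step(record: Record) -> Optional[Record]:
        out = {k: v for k, v in record.items() if k != old}
        out[new] = record[old]
        return out
    return step


def require_non_empty(column: str) -> Step:
    """Quality check: drop any record whose column is empty."""
    def step(record: Record) -> Optional[Record]:
        return record if record[column] else None
    return step


def apply_steps(records: List[Record], steps: List[Step]) -> List[Record]:
    """Run each record through the steps in order; None means 'dropped'."""
    result = []
    for record in records:
        current: Optional[Record] = record
        for step in steps:
            current = step(current)
            if current is None:
                break
        if current is not None:
            result.append(current)
    return result


raw = "1, Alice ,alice@example.com\n2,Bob,\n"
records = parse_csv(raw, ["id", "name", "email"])
steps = [
    trim("name"),
    trim("email"),
    rename("email", "contact"),
    require_non_empty("contact"),
]
clean = apply_steps(records, steps)
print(clean)
```

In this sketch the second input row is dropped by the quality check because its email field is empty. Expressing the work as an ordered list of small, named steps is one way to keep a transformation easy to inspect and re-run, which is the kind of property an operational pipeline needs.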

Roadmap

The following are a few of the major features planned for integration with CDAP Wrangler within the next 24 months. The features below are in no particular priority order.

  1. Automatic mapping of data into a pre-defined semantic model,

Conclusion

CDAP Wrangler is a versatile tool that aids in building operational data integration pipelines by providing a visual, code-free environment to transform, cleanse and map data. Many exciting new features are planned for Wrangler over the next 24 months. We would love to hear which of the features described in the roadmap are of value to you. If you have any questions or concerns, please don’t hesitate to reach us on the CDAP Google Group or on Slack Chat.

cdapio

CDAP is a 100% open-source framework for building data analytics applications

Written by Nitin Motgi

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company’s long-term technology and driving company engineering initiatives.
