Introduction to CDAP Wrangler

Nitin Motgi
Jul 22 · 4 min read

Data practitioners routinely deal with incomplete or messy datasets. Data from varied sources is often unusable at first, but once it is transformed, mapped, and cleansed it becomes usable. The majority of data scientists’, data engineers’, and analysts’ time is devoted to transforming, cleansing, and mapping data rather than to their actual goal: extracting insights and building analytics pipelines and models that leverage that data.

In summary, to take messy, complex data and make it usable for further analysis, you need to wrangle it. Furthermore, if you need to operationalize the wrangling process, you need operational data pipelines that can be easily scheduled, managed, and tracked for data governance.

To make it easy to operationalize data transformation, mapping, and cleansing, CDAP introduced Wrangler. CDAP Wrangler is an Accelerator (a CDAP Application) that provides simpler ways to map, transform, and harmonize data, apply data quality checks, and enrich data in a code-free, visual manner. Below is a screencast that quickly introduces the capabilities of CDAP Wrangler.

A Screencast (no voice) — Introduction to Wrangler

History

Data Pipelines were introduced in CDAP around 2014–2015. Users defined mappings using JavaScript, Python, or Java. Despite the simplicity this provided, a large percentage of users expressed concerns about writing procedural code in a non-traditional environment. In addition to having difficulty debugging logic and identifying issues during development, they found the code hard to maintain and evolve.

Around that time, one of Cask’s customers expressed interest in partnering to develop a new approach for defining data mappings and transformations. After collaborating with the customer for a few months, we collectively came up with a new mapping framework that satisfied the following requirements:

  • Support parsing of various data formats,
  • Have the ability to specify a row and column data mapping without having to write procedural code,
  • Provide an extensible framework for defining new mappings,
  • Provide a user interface to apply transformations and mappings visually, and
  • Provide the ability to transfer a transformation or mapping recipe to data pipelines and back.

With those few basic requirements from the customer, Wrangler in CDAP was born!
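
To make the requirements above more concrete, here is a minimal sketch of what a declarative, extensible mapping framework can look like. It is not Wrangler's implementation or syntax: the directive names (rename, uppercase, drop), the DIRECTIVES registry, and the run_recipe helper are all hypothetical and exist only for illustration. The point is that a recipe kept as plain text needs no procedural code from the user and can be passed between a user interface and a pipeline unchanged.

```python
import shlex
from typing import Any, Callable, Dict

Row = Dict[str, Any]
Directive = Callable[..., Row]

# Hypothetical registry: directive name -> implementation.
DIRECTIVES: Dict[str, Directive] = {}


def directive(name: str) -> Callable[[Directive], Directive]:
    """Register a function under a directive name (the extension point)."""
    def register(func: Directive) -> Directive:
        DIRECTIVES[name] = func
        return func
    return register


@directive("rename")
def rename(row: Row, old: str, new: str) -> Row:
    """Move a value from column `old` to column `new`."""
    out = dict(row)
    out[new] = out.pop(old)
    return out


@directive("uppercase")
def uppercase(row: Row, column: str) -> Row:
    """Upper-case the text in one column."""
    return {**row, column: str(row[column]).upper()}


@directive("drop")
def drop(row: Row, column: str) -> Row:
    """Remove one column from the row."""
    return {k: v for k, v in row.items() if k != column}


def run_recipe(recipe_text: str, row: Row) -> Row:
    """Apply each non-empty line of a text recipe to a row, in order."""
    for line in recipe_text.splitlines():
        if not line.strip():
            continue
        name, *args = shlex.split(line)
        row = DIRECTIVES[name](row, *args)
    return row


# The recipe is plain text (made-up syntax), so it is easy to store and share.
recipe_text = """
rename fname first_name
uppercase first_name
drop tmp
"""
result = run_recipe(recipe_text, {"fname": "ada", "tmp": 1})
print(result)
```

Adding a new mapping in this sketch means registering one more function; the recipe text and the runner stay the same.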

Overview

CDAP Wrangler was never designed to be a full-fledged data preparation solution for business analysts. Its primary purpose is to provide simpler and easier ways to map and transform data interactively while building data pipelines.

At its core, CDAP Wrangler is a transformation and mapping library with a built-in collection of pre-defined parsers, transformations, and data quality checks. In addition to being a reusable library, it also includes:

  • A data pipeline Transform (used in CDAP Data Pipeline) for operationalizing transformations,
  • A CDAP Service for exposing capabilities through a REST API (used by the User Interface), and
  • An SDK (API and Testing Framework) for adding new transformations and mappings.

Wrangler is a CDAP Application built using public CDAP APIs. It is available on GitHub under the Apache 2.0 License.

Using CDAP Wrangler, users today can:

  • Parse data,
  • Define data mappings,
  • Transform data,
  • Change type of data,
  • Filter data,
  • Cleanse data,
  • Enrich data,
  • Format data, and
  • Apply data quality checks.
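
As a rough illustration of how several of these operations chain together, the sketch below parses CSV text and then applies a mapping, a cleansing step, a type change that doubles as a simple quality check, and a filter. It uses only the Python standard library and does not use Wrangler itself; the function names and the sample data are assumptions made up for this example.

```python
import csv
import io
from typing import Any, Callable, Dict, Iterable, List, Optional

Row = Dict[str, Any]
# A step takes one row and returns a new row, or None to drop the row.
Step = Callable[[Row], Optional[Row]]


def parse_csv(text: str) -> List[Row]:
    """Parse: turn raw CSV text into one dict per record."""
    return [dict(record) for record in csv.DictReader(io.StringIO(text))]


def rename(old: str, new: str) -> Step:
    """Map: move a value from one column name to another."""
    def step(row: Row) -> Optional[Row]:
        out = dict(row)
        out[new] = out.pop(old)
        return out
    return step


def trim(column: str) -> Step:
    """Cleanse: strip surrounding whitespace from a text column."""
    def step(row: Row) -> Optional[Row]:
        return {**row, column: str(row[column]).strip()}
    return step


def to_int(column: str) -> Step:
    """Change type: convert to int, dropping rows that fail (a quality check)."""
    def step(row: Row) -> Optional[Row]:
        try:
            return {**row, column: int(str(row[column]).strip())}
        except ValueError:
            return None
    return step


def keep_if(predicate: Callable[[Row], bool]) -> Step:
    """Filter: keep only the rows for which the predicate holds."""
    def step(row: Row) -> Optional[Row]:
        return row if predicate(row) else None
    return step


def apply_recipe(rows: Iterable[Row], recipe: List[Step]) -> List[Row]:
    """Run every step on every row, in order, skipping dropped rows."""
    result: List[Row] = []
    for row in rows:
        current: Optional[Row] = row
        for step in recipe:
            current = step(current)
            if current is None:
                break
        if current is not None:
            result.append(current)
    return result


# Made-up sample data: one clean row, one bad value, one row to filter out.
raw = "name,yrs\n  Ada ,36\nBob,unknown\nCy,17\n"
recipe = [
    rename("yrs", "age"),
    trim("name"),
    to_int("age"),
    keep_if(lambda r: r["age"] >= 18),
]
cleaned = apply_recipe(parse_csv(raw), recipe)
print(cleaned)
```

Each step is a small function that returns a new row or drops it, so steps can be reordered or reused without touching the others.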

Roadmap

The following are a few major features planned for integration with CDAP Wrangler within the next 24 months. They are listed in no particular priority order.

  1. Automatic mapping of data into a pre-defined semantic model,
  2. Ability to define, manage the lifecycle of, and share semantic models,
  3. Support Recipe (collection of Directives) creation, management, sharing and integration with data pipelines,
  4. Ability to profile full scale data and provide deeper insights for analysis,
  5. Merge concepts of Wrangler Connections and Data Pipeline Plugins,
  6. Support for conditional Recipe execution, including conditional Sub-Recipe execution,
  7. Multi-row support for windowed analysis,
  8. Support for joining with or looking up Slowly Moving Dimension (SMD) datasets,
  9. Macro support within Recipe,
  10. Support for aggregate functions, pivot/un-pivot, conditional splitting of rows,
  11. Role-Based Access Control (RBAC) support to control access to Directives and Recipes,
  12. Improvements in translating directive operations into a natural language for describing Field Level Lineage (FLL) for Data Governance,
  13. Improvements in automatic information type detection,
  14. Visual integration supporting various sampling and re-sampling methods, and
  15. Additional directives in the area of NLP, Date/Time Formatting, Masking, Redaction, and more.

Conclusion

CDAP Wrangler is a versatile tool that aids in building operational data integration pipelines by providing a visual, code-free environment to transform, cleanse, and map data. Many exciting new features are planned for Wrangler over the next 24 months. We would love to hear which of the features described in the roadmap are of value to you. If you have any questions or concerns, please don’t hesitate to reach us on the CDAP Google Group or on Slack.

cdapio

CDAP is a 100% open-source framework for building data analytics applications

Written by Nitin Motgi

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company’s long-term technology and driving company engineering initiatives.
