Hybrid data integration using CDAP

The data landscape is changing. Not only is the volume of data exploding, but the variety of data in terms of formats, locations and velocities is also compounding challenges for enterprises.

In this age of ever-increasing data, effective data integration is rapidly becoming key to digital transformation. A data integration strategy that can deal with vast volumes of data, in varied formats and locations, is essential to foster data-driven business decision making today. This is because a few broad trends in data integration are coming together to put more pressure on data professionals than ever before:

  1. Increasing volume, variety and velocity: More data from a variety of sources in more formats (structured and unstructured) is arriving quicker than ever before.
  2. Increasing, disparate data locations: Data no longer lives just in a single on-premises data center; it is scattered across multiple public clouds, private clouds, application-specific silos, organizational silos, and on-premises systems.
  3. Increasing demand and expectations: New use cases are emerging that create more demands on, and expectations from, data. IoT, streaming and machine learning mean that a large amount of previously untapped data is now critical.

The Data Deluge

The following key highlights emphasize the magnitude of data being generated today:

  • 10.7 billion devices, representing 27% growth over 2017 alone
  • 163 zettabytes, that is, 163 billion terabytes of data
  • Less than 1% of unstructured data is used in any form, and that includes data from sensors that people install specifically to collect it
  • Less than 50% of structured data (typically data that was generated specifically with decision making in mind) is used to make decisions.
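
The terabyte figure in the list above follows from decimal (SI) prefixes. As a quick check, assuming 1 zettabyte = 10^21 bytes and 1 terabyte = 10^12 bytes:

```python
# Unit check for the "163 zettabytes" figure, using decimal (SI) prefixes.
ZETTABYTE = 10**21  # bytes
TERABYTE = 10**12   # bytes

terabytes = 163 * ZETTABYTE // TERABYTE
print(f"{terabytes:,} TB")  # 163 followed by nine zeros, i.e. 163 billion TB
```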

The enterprise conundrum

These complexities, coupled with the data deluge, lead to many conundrums for CIOs and CDOs. For example, in the context of data location, the following two quotes represent one of the classical conundrums for leaders:

Cloud is a no-brainer. Cost effective, easy, secure, scalable. I can snap my fingers and offer an enterprise class service now

but at the same time:

Transition to cloud is hard. I can’t stop everything and migrate — these are mission-critical apps and data. Also, I will never move some of my workloads to the cloud due to privacy restrictions

Hybrid Data Integration

Classical data integration techniques will come up short in this new environment because they force users to make choices: on-premises or cloud, structured or unstructured, big or small, relational or NoSQL, and so on. A hybrid data integration strategy is the need of the hour, since it allows users to get value from their data without forcing technology choices on them upfront. It removes restrictions and dependencies, and gives them the flexibility to keep their business running in the midst of a rapidly changing data and technology landscape.

It’s not about forcing people to make technology choices upfront. It’s about giving them options, and the flexibility to choose a technology they are comfortable with, while modernizing at their own pace. An ideal data integration toolkit doesn’t force you to pick one technology OR the other, one location OR another; it allows you ALL options.

A hybrid data integration strategy that scales to handle all kinds of data, irrespective of size, format, location and intended use case, is key to success in this new environment. It should handle all the complexity of data environments while shielding users from it, so that they can focus on their core job of creating value from data. It should give users easy ways to discover, understand, process and track data, without worrying about the intricacies of environments, integrations and implementation. That’s exactly where a framework like CDAP finds its sweet spot.

CDAP, and its role in hybrid data integration

CDAP as a key enabler of hybrid data integration, irrespective of data locations, format and technology choices

CDAP is an open source framework for building data analytics applications. It provides abstractions so that developers, data scientists and business analysts can start deriving value from their data, irrespective of its size, shape, location, format or speed of arrival. It also provides a middleware fabric with platform capabilities such as security, metadata (for discovery and governance) and operations. For applications, this fabric is critical because it makes these capabilities available as a common substrate, so you don’t have to build them over and over again, which leads to standardization. For data, on the other hand, it plays a critical role by making all kinds of data, no matter where it is located, available to users through a single pane of glass. Let’s dive into two key capabilities of CDAP that make it stand out as a critical component of your modern hybrid data integration strategy.

CDAP’s key capabilities that make hybrid data integration a reality

Seamless data movement and integration

CDAP includes a built-in framework for data integration and movement, which allows users to create pipelines for easy data movement, transformation and integration using a graphical interface. Pipelines provide easy access to a variety of data, irrespective of its location, through simple configuration on a UI. Supported systems range from legacy ones such as mainframes, relational databases, flat files, data warehouses and enterprise applications to modern ones such as cloud services and streaming services. CDAP also provides hundreds of built-in transformations that can easily parse, cleanse, process, munge, map, filter and aggregate data. The intuitive user interface ensures that you can access and process data based on your understanding of business logic, without worrying about the details of its physical implementation. The framework then does the heavy lifting of translating your business logic into execution plans that can run on scalable systems such as Apache Hadoop, Apache Spark, Google Cloud Dataproc and Amazon EMR, with many more to come. Another benefit of such a framework is that even though your data may be processed using these different systems, you still get a standard operational view with logs, metrics, scheduling, etc. This makes operating your data pipelines in mission-critical environments much easier.
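
To make the idea of separating business logic from its physical execution more concrete, here is a minimal sketch in plain Python. It is not CDAP's API: the Pipeline class, the cleanse and keep_active steps, and the run_locally runner are all hypothetical names invented for illustration. The pipeline is only a description (a source, ordered transforms and a sink); a separate runner decides how to execute it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Step = Callable[[Iterator[Record]], Iterator[Record]]


@dataclass
class Pipeline:
    """Hypothetical logical plan: what to do, not how or where to run it."""
    source: Callable[[], Iterable[Record]]
    sink: Callable[[Iterable[Record]], None]
    transforms: List[Step] = field(default_factory=list)

    def then(self, step: Step) -> "Pipeline":
        # Append a transform and return self so steps can be chained.
        self.transforms.append(step)
        return self


def cleanse(records: Iterator[Record]) -> Iterator[Record]:
    # Trim surrounding whitespace from every string field.
    for r in records:
        yield {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}


def keep_active(records: Iterator[Record]) -> Iterator[Record]:
    # Filter: keep only records whose status is "active".
    return (r for r in records if r.get("status") == "active")


def run_locally(p: Pipeline) -> None:
    # One possible runner: stream records through each step in-process.
    # A different runner could map the same plan onto a cluster engine.
    stream = iter(p.source())
    for step in p.transforms:
        stream = step(stream)
    p.sink(stream)


rows = [{"name": " Ada ", "status": "active"},
        {"name": "Bob", "status": "inactive"}]
out: List[Record] = []

pipeline = Pipeline(source=lambda: rows, sink=out.extend).then(cleanse).then(keep_active)
run_locally(pipeline)
print(out)
```

Because the plan holds no execution details, swapping run_locally for another runner would not require touching the source, transforms or sink. That is the property the paragraph above attributes to pipelines, shown here only in miniature.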

Automated data discovery and traceability

While CDAP pipelines make data processing, movement and integration easier, CDAP’s metadata capabilities provide standardized access to data. This gives data stewards and compliance officers in enterprises the traceability they need through lineage and audit logs. It also benefits end users such as developers, data scientists and business analysts by providing them with automated data discovery, based on both technical and business metadata. Automated data discovery and lineage capabilities mean that users no longer need to know where their data is located, what format it is in, or, in many cases, the technical details of how it needs to be processed. Through a semantic layer, CDAP makes these disparate data assets available to users through a single pane of glass that shields them from the underlying complexity of the hybrid world.
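
As a rough illustration of how tags, lineage and an audit trail fit together, the following is a minimal sketch in plain Python. It does not use CDAP's metadata API; the MetadataCatalog class, its methods and the dataset names are hypothetical, chosen only to show discovery by tag and upstream lineage traversal across locations.

```python
from collections import defaultdict


class MetadataCatalog:
    """Hypothetical in-memory catalog: tags, lineage edges and an audit trail."""

    def __init__(self):
        self.tags = defaultdict(set)      # dataset -> business/technical tags
        self.upstream = defaultdict(set)  # dataset -> datasets it was derived from
        self.audit = []                   # append-only record of every change

    def tag(self, dataset, *tags):
        self.tags[dataset].update(tags)
        self.audit.append(("tag", dataset, tags))

    def record_lineage(self, source, target, program):
        self.upstream[target].add(source)
        self.audit.append(("lineage", source, target, program))

    def discover(self, tag):
        # Discovery: find datasets by tag, wherever they physically live.
        return sorted(d for d, t in self.tags.items() if tag in t)

    def lineage(self, dataset):
        # Traceability: walk upstream edges to collect every ancestor dataset.
        seen, stack = set(), [dataset]
        while stack:
            for parent in self.upstream.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


catalog = MetadataCatalog()
catalog.tag("mainframe.orders", "orders", "on-premises")
catalog.tag("cloud.orders_clean", "orders", "cloud")
catalog.record_lineage("mainframe.orders", "staging.orders", program="ingest")
catalog.record_lineage("staging.orders", "cloud.orders_clean", program="cleanse")

print(catalog.discover("orders"))
print(catalog.lineage("cloud.orders_clean"))
```

A user searching for "orders" finds both the on-premises and the cloud dataset without knowing where either lives, and a compliance officer can trace the cloud copy back to its mainframe origin from the same catalog.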

Key takeaways

Together, CDAP pipelines and metadata provide enterprises a modern, hybrid data integration framework that can help them succeed in this new age of data. When coupled with CDAP’s security features, they can help enterprises promote self-service data integration, while also giving IT the guardrails it needs to maintain data traceability, security and privacy. They provide enterprises the following key benefits:

  1. Unified access to any data irrespective of its location, format, and other characteristics; and
  2. A standardized data processing and movement framework across all kinds of data environments.

These benefits ensure that enterprises are not forced to make abrupt, ad-hoc technology decisions, and allow them the flexibility to modernize their infrastructure at their own pace, without disrupting their business.

To experience some of these benefits, download CDAP today and run it on your laptop, in your on-premises data center, or on your favorite public cloud. You can also try out Data Fusion, Google Cloud’s managed service for CDAP. We love your feedback, so feel free to reach out to us through our various community channels.




Bhooshan Mogal

Product Manager passionate about simplifying complex data technologies for the end user
