Overcoming Big Data Integration Challenges


August 23, 2017

Sreevatsan Raman is the Head of Engineering at Cask, where he drives the company’s engineering initiatives. Prior to Cask, Sree designed and implemented big data infrastructure at Klout and Yahoo!

Enterprise challenges

Hadoop has emerged as the leading technology to solve a number of big data use cases.

However, enterprises often need to piece together different technologies to solve their business problems. Each component in the Hadoop technology stack is infrastructure-focused and purpose-built to solve a particular set of problems. An enterprise that wants to solve a business use case (for example, a managed data lake) will need to spend a lot of time integrating these technologies into the solution it needs. Enterprises also face a talent gap: experts who understand how the various technologies fit together are scarce. The combination of these two issues often prevents enterprises from realizing quick business value from their big data investments.

There are point products on the market that promise to alleviate some of these shortcomings and cater to specific business problems. However, enterprises adopting these products simply shift the problem from infrastructure integration to product integration: because these point products do not cater to a wide range of use cases, enterprises often need to integrate several of them for their immediate needs, as well as to future-proof their stack.

Market Needs — An Enterprise-Ready Big Data Platform

What the market needs is an enterprise-ready big data solution that integrates the underlying infrastructure and provides a simplified middleware layer for building self-service solutions, bridging the IT/LOB gap by combining platform capabilities with product usability. Such a platform must support the full big data lifecycle, from data ingestion and exploration to data science and new LOB development.

Foundation For Data Integration Solutions

If your enterprise is undertaking a data integration project, here are the foundations of a good data integration solution, based on what we have seen:

A good big data integration solution should handle the following (a brief code sketch illustrating the first two points follows the list):

  • Data Variety: the ability to seamlessly process different data formats, handling both structured and unstructured data
  • Customizable Cleansing: support for different data cleansing mechanisms and data quality checks depending on the input data source
  • Different modes of delivery: handle real-time, batch, and streaming data sources
  • Timely access to data: provide data access to the LOB as fast as possible, including integrating new feeds into the data lake quickly and making the data available for data exploration and data science as soon as possible
  • Current and emerging use cases: support not only current but also emerging use cases
  • Data governance: all data integration efforts should comply with IT’s strict security and data governance requirements, enabling an enterprise-grade, production-ready solution
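
To make the first two points more concrete, here is a minimal Java sketch of how an ingestion layer might dispatch parsing and cleansing per feed. The interfaces and names (DataRecord, FeedParser, Cleanser, FeedProcessor) are purely illustrative assumptions, not APIs from CDAP or any other product:

```java
import java.util.Map;

/** Minimal record abstraction; a real platform would supply its own. */
interface DataRecord {
    Map<String, Object> fields();
}

/** Parses one raw payload (CSV, JSON, XML, ...) into a structured record. */
interface FeedParser {
    DataRecord parse(String rawPayload) throws Exception;
}

/** Per-feed cleansing and data quality checks, pluggable per input source. */
interface Cleanser {
    DataRecord cleanse(DataRecord record);          // e.g. trim fields, normalize dates
    boolean passesQualityChecks(DataRecord record); // e.g. required fields present
}

/** Pairs a format-specific parser with a feed-specific cleanser. */
class FeedProcessor {
    private final FeedParser parser;
    private final Cleanser cleanser;

    FeedProcessor(FeedParser parser, Cleanser cleanser) {
        this.parser = parser;
        this.cleanser = cleanser;
    }

    /** Returns a cleansed record, or null if it fails the quality checks. */
    DataRecord process(String rawPayload) throws Exception {
        DataRecord cleansed = cleanser.cleanse(parser.parse(rawPayload));
        return cleanser.passesQualityChecks(cleansed) ? cleansed : null;
    }
}
```

A CSV feed and a JSON feed would then each plug in their own FeedParser and Cleanser implementations, while the surrounding processing logic stays the same.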

Enter CDAP — A Unified Integration Platform

At Cask, we help companies overcome their data integration challenges by providing a runtime environment for big data ecosystem technologies, as well as a self-service-oriented, easy-to-use platform for solving a variety of big data use cases. The Cask Data Application Platform (CDAP) caters to a wide range of use cases including, but not limited to: self-service, managed data lakes, data preparation, data exploration, machine learning and data science, as well as building large-scale, production-ready applications. The platform components meet strict IT data security and governance needs and provide authentication, authorization, audit logging, metadata, and lineage, the cornerstones of a solid enterprise-grade, production-ready solution.

Use Case Deep Dive

Now let us take a look at a use case where Cask helped a customer solve a data feed processing problem, which included ingesting and processing several types of data feeds (XML, CSV, JSON, etc.). Each feed required a different data cleansing and data preparation step. At the end of the ingestion and preparation stage, the customer needed to land the data in a managed data lake.
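
To make the shape of the problem concrete, each feed can be thought of as a small declarative specification. The following Java sketch is hypothetical; the FeedSpec class and its fields are our own illustration, not the customer’s actual configuration:

```java
import java.util.List;

/** Illustrative description of one inbound feed and how it should be handled. */
class FeedSpec {
    final String name;                 // e.g. "orders-feed"
    final String format;               // "XML", "CSV", "JSON", ...
    final List<String> cleansingSteps; // e.g. "trim-whitespace", "normalize-dates"
    final List<String> qualityChecks;  // e.g. "non-null:order_id"
    final String targetDataset;        // where the prepared data lands in the data lake

    FeedSpec(String name, String format, List<String> cleansingSteps,
             List<String> qualityChecks, String targetDataset) {
        this.name = name;
        this.format = format;
        this.cleansingSteps = cleansingSteps;
        this.qualityChecks = qualityChecks;
        this.targetDataset = targetDataset;
    }
}
```

Onboarding a new feed then amounts to writing a new specification rather than hand-building another pipeline.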

So, what makes the problem challenging? The solution needed an efficient way to manage hundreds of feeds, the ability to add new feeds easily, and customizable cleansing and data quality checks on a per-feed basis. When the number of feeds runs into the hundreds, maintaining a separate ingestion pipeline for each one does not scale, and managing the feeds themselves becomes a problem. In addition, data ingestion had to comply with the strict security and governance requirements of the enterprise IT team.
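
One way to keep hundreds of feeds manageable, sketched here under our own assumptions (reusing the hypothetical FeedSpec above, not describing the customer’s actual implementation), is to drive a single parameterized ingestion template from a registry of feed specifications:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical registry driving one reusable ingestion template per feed. */
class FeedRegistry {
    private final List<FeedSpec> specs = new ArrayList<>();

    /** Onboarding a new feed is a registration step, not a new hand-built pipeline. */
    void register(FeedSpec spec) {
        specs.add(spec);
    }

    /** Instantiates the same ingestion template once for every registered feed. */
    void deployAll() {
        for (FeedSpec spec : specs) {
            System.out.printf("Deploying pipeline for feed %s (%s -> %s)%n",
                    spec.name, spec.format, spec.targetDataset);
            // A real deployment would instantiate the pipeline template here,
            // wiring in the feed's parser, cleansing steps, and quality checks.
        }
    }
}
```

With this approach, the number of artifacts to maintain grows with the number of feed specifications rather than with the number of hand-built pipelines, which is what makes hundreds of feeds tractable.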

The following video highlights key elements of the Cask solution to this problem.

CDAP provides an end-to-end solution for big data integration. Download CDAP or try it in the cloud today, and give it a spin!
