BEWARE! Your Data Lake is not only for today’s Business.

In today’s digital world, data is everything. Every organization is trying to find ways to slice and dice its data to stay relevant and competitive. While many data platforms are evolving quickly, Hadoop is today one of the major data platforms being implemented for computational analytics using the MapReduce or Spark frameworks. (Watch out for Flink.)

What happens in the 1st DL kickoff meeting?

In most of the implementations I have been involved in, the devops team’s focus is on the frameworks for the data pipeline (Spark/MR/Flink/Google Dataflow), the implementation languages (Java/Scala/Python) and the computational requirements (batch/realtime).

While these are critical for any data lake’s success, they are not enough to support your customer’s future needs, or your customer’s customers’ needs.

A DL is not only for today, but for the future too.

The purpose of the “data lake” is to enable your customer’s business to answer the “difficult questions”, a.k.a. insights, which are critical for introducing new services and products.

So, it is safe to assume your Data Lake is going to be there, and evolve, for a decade.

Now, are the data-pipeline framework, implementation language and computational requirements enough to keep the data fresh for a decade? Maybe not.

What more should be in your 1st DL kickoff meeting?

In my experience as an AWS/Hadoop SysOps Lead, below are some of the key questions that need to be answered before your 2nd meeting.

1. What are the data patterns from the source?

2. Is any cleaning required to improve computational efficiency?

3. What is the schema of the data? How does it support evolution? (See the schema-evolution sketch after this list.)

4. How can the data be stored using less storage? (Remember your AWS EBS usage bill ^_^)

5. What compression suits your needs? (See the storage and compression sketch after this list.)

6. What data encryption standards suit your needs?
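To make question 3 concrete, here is a minimal PySpark sketch of one common approach: landing data as Parquet and letting Spark reconcile an evolving schema at read time. The bucket path and field names are made up for illustration; the point is that a newly added column should not break reads over the older data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution-sketch").getOrCreate()

# Day 1: the source sends two fields. (Hypothetical s3a path.)
spark.createDataFrame(
    [("u1", "click")], ["user_id", "event"]
).write.mode("overwrite").parquet("s3a://my-datalake/events/day=1")

# Day 2: the source starts sending an extra optional field.
spark.createDataFrame(
    [("u2", "click", "mobile")], ["user_id", "event", "device"]
).write.mode("overwrite").parquet("s3a://my-datalake/events/day=2")

# mergeSchema reconciles the old and new Parquet schemas at read time;
# rows written before the change simply carry NULL for the new column.
df = spark.read.option("mergeSchema", "true").parquet("s3a://my-datalake/events")
df.printSchema()
```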
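For questions 4–6, the sketch below (again PySpark, with a hypothetical s3a path) shows the kind of knobs usually in play: a columnar format, a compression codec, and S3 server-side encryption configured through the s3a connector. The actual savings and the right codec should be benchmarked on your own data.

```python
from pyspark.sql import SparkSession

# spark.hadoop.* settings are forwarded to the Hadoop/s3a configuration;
# AES256 requests S3 server-side encryption for the objects this job writes.
spark = (
    SparkSession.builder
    .appName("storage-footprint-sketch")
    .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
    .getOrCreate()
)

# Hypothetical raw landing zone with verbose JSON events.
df = spark.read.json("s3a://my-datalake/raw/events/")

# Columnar Parquet + Snappy: cheap to (de)compress, usually a big win over raw JSON/CSV.
df.write.mode("overwrite").option("compression", "snappy") \
    .parquet("s3a://my-datalake/curated/events_snappy/")

# Gzip trades CPU for a smaller footprint; compare the two output sizes before committing.
df.write.mode("overwrite").option("compression", "gzip") \
    .parquet("s3a://my-datalake/curated/events_gzip/")
```

None of this is prescriptive; the point is that format, codec and encryption decisions belong in the first meeting, because they shape your storage bill and your compliance story for the life of the lake.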