Data Lake Can Help Break Data Silos

Published in Naukri Engineering · Aug 4, 2020 · 7 min read

By Arvind Heda

Data is the new currency

Data is the new oil in the Digital Economy

Data-driven Organizations

All of us have been hearing these quotes for quite some time now, and nobody can disagree that data has become the driving force of progress today. As organizations adopt new technologies and advance towards digitization, they experience this day after day.

Everyone today understands the importance of data and of being data-driven, but the challenge remains: how do they leverage this data to their advantage? How do they ensure that they are storing all the data and not letting this opportunity go to waste?

Historically, data was what users entered in some form or field, and applications would safely capture, validate, and store it in a secure DB for future access and action. Today, however, data is omnipresent: it is no longer limited to what users have entered; rather, every action and reaction of a user is a data point.

In this new context, data could be anything, for example:

  • Every user action on any channel
  • Intent or start of a potential transaction
  • A customer service call/email
  • User visits to a channel (with or without any activity)
  • User interactions outside the platform or with outbound communications like emails/SMS
  • Even user inactivity and absence on the platform is a data point!

All such signals are important data, but unfortunately they are captured by different application modules, at different frequencies, and in different sources/formats, so the data is often incompatible and results in building and managing disjoint Data Silos.

Organizations can still capture and store every bit of data, but the fact remains that it is stored in deep silos that they are never able to use as effectively as they might want to. Some of the common challenges they struggle with are:

  • Not everyone in the organization is aware of what data is being captured and available, as different groups may be working on these data points.
  • Data persists in different stores/formats and is optimized for specific views, which makes it very challenging to bring it all together in a single view.
  • Users end up defining alternate cuts and metrics that may be incompatible or overlapping, which could lead to wrong conclusions/decisions (a toy example follows the list).
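To make that last point concrete, here is a toy sketch (all field names and metric definitions are invented) of how two teams can compute "daily active users" over the same events and still disagree:

```python
events = [
    {"user": 1, "action": "login"},
    {"user": 1, "action": "page_view"},
    {"user": 2, "action": "page_view"},
]

# Team A's definition: only a login counts as activity.
dau_a = len({e["user"] for e in events if e["action"] == "login"})

# Team B's definition: any event at all counts as activity.
dau_b = len({e["user"] for e in events})

print(dau_a, dau_b)  # 1 2 -- same data, two conflicting "DAU" numbers
```

Both numbers are internally consistent, yet a dashboard showing them side by side would mislead anyone who assumes a single shared definition.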

Organizations today have successfully obtained data and built huge silos to store this mountain of information, but without the ability to use this data efficiently, they may be missing out on the true power of data and the endless possibilities it offers.

Data gets so fragmented across collection, definition, and ownership that stakeholders often find it too challenging to understand the data model, and they start looking at the problem from the other end. Organizations start from what they want to track, determine the set of data that may be required for their use case, and then pull all the relevant data into an intermediate system to perform transformations and finally publish the processed data. Looked at closely, this is similar to a standard ETL process and may lead them to a typical DWH-based solution.
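As a rough sketch of that top-down flow (with invented table and field names, not any particular system's schema): the extract step pulls only the slices one use case needs, the transform step aggregates them into a use-case-specific metric, and every signal that was not pulled is simply unavailable downstream.

```python
from collections import defaultdict
from datetime import date

# Extract: hypothetical rows pulled for one pre-chosen use case
# (weekly engagement); nothing else from the sources is brought along.
visits = [
    {"user_id": 1, "day": date(2020, 8, 3)},
    {"user_id": 1, "day": date(2020, 8, 4)},
    {"user_id": 2, "day": date(2020, 8, 4)},
]

def weekly_visit_counts(rows):
    """Transform: aggregate raw visits into the one metric this use case asked for."""
    counts = defaultdict(int)
    for row in rows:
        iso_year, iso_week, _ = row["day"].isocalendar()
        counts[(row["user_id"], iso_year, iso_week)] += 1
    return dict(counts)

# Load/publish the processed cut for reporting.
print(weekly_visit_counts(visits))
```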

However, the challenge is that this is a top-down approach: one picks a use case and pulls only the data required to implement it, but never explores the data to see what else is feasible with it.

Essentially, they are asking only the questions that they already know, and not even thinking about the possibilities that they do not know.

This is where organizations become limited in what they can do with their data, this new currency, as there is no way to even look at the data together, except for the partial cuts that may have been pulled out for specific use cases. These are precisely the deep silos where data is stored but lost, available but not usable, complete but yet incomplete.

Eventually, all stakeholders may end up defining their own copies and using them the way they want to, but these copies would lack consistency, correctness, correlation, and completeness.

Bridging Silos — The Concept of Data Lakes

This is where Data Lakes can help bridge these silos and get everyone on the same page. A data lake is defined as a clean, well-organized collection of data coming from multiple heterogeneous sources, such that:

  • It serves as a single source of truth for all cross-channel data requirements.
  • Data remains as close as possible to its original format.
  • Data remains fresh and in sync with all sources, either in real time or at a defined frequency.
  • It offers easy discovery and helps establish a shared definition of all entities and attributes.
  • It enables users to perform multi-channel correlation, exploration, and analytics on this data.

Thus, a data lake, by its definition alone, addresses some of the inherent challenges posed by silos. A lake can help overcome the common disadvantages of silos and help everyone converge on common data, definitions, and metrics and be more data-driven; at the same time, it requires a robust design and strategy so that the lake implementation can blend with the overall product architecture. Besides this, at a feature level, any data lake solution would need to address most of the aspects below (a small ingestion sketch follows the list):

  • High Volume: it should be highly scalable to cover the large volumes coming out of several sources, including detailed click streams.
  • Huge Variation: it should handle the different formats and structures of data coming from hundreds of different sources.
  • High Velocity: it should scale to collect large volumes at a high ingestion rate, i.e. billions of events per day.
  • Clean Catalog: it should offer a uniform way of storage/access/discovery via a standard catalog.
  • Easy Exploration: it should enable ML/analytics on the data to derive correlations and metrics.
  • Freshness/Completeness: it should have monitoring and auto-heal properties to ensure that its data remains fresh and updated, so that it can be trusted.
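As a minimal sketch of how ingestion can absorb this variety, assuming JSON events and an envelope whose fields (source, event_type, ingested_at, payload) are illustrative rather than a prescribed schema:

```python
import json
import uuid
from datetime import datetime, timezone

def to_lake_record(source: str, event_type: str, raw_event: dict) -> str:
    """Wrap a raw event in a uniform envelope without reshaping its payload,
    so the data stays close to its original format yet remains discoverable."""
    return json.dumps({
        "event_id": str(uuid.uuid4()),   # stable id for dedup/replay
        "source": source,                # which system emitted the event
        "event_type": event_type,        # catalog key for discovery
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": raw_event,            # original event, untouched
    })

# Two very different sources land in the same envelope:
print(to_lake_record("web", "page_view", {"url": "/jobs", "user": 42}))
print(to_lake_record("crm", "support_call", {"ticket": "T-9", "minutes": 12}))
```

Keeping the raw payload intact preserves the "close to the original format" property, while the uniform envelope gives the catalog something consistent to index.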

The Necessity of Data Governance

As organizations define a data lake, there is always a risk of going to the other extreme, where they end up putting everything in the lake without any defined policy for storage, indexing, cataloging, validation, and freshness. This leads to a new set of issues where users get everything in one place, but it is not really usable, as the data is either incomplete, inaccessible, or riddled with holes, and thus cannot be relied upon. This is a sign that the lake is turning into a data swamp, where everything is available but not in a reliable or predictable manner, losing all the advantages of having a data lake.

Hence, one may need to adopt basic data governance in the lake to ensure that the data is healthy, reliable, and usable. Some simple measures to ensure this (sketched after the list) are:

  • Everything ingested should have a meaning associated with it and a purge policy; the lake should not be used as a dumping ground for arbitrary data.
  • There should be a single, consistent way of defining entities, attributes, and metrics. This will ensure easy exploration and analytics.
  • Enforce the freshness and completeness of data, and manage cross-dependencies smartly, so that lake users do not get incomplete or inconsistent data.
  • Set up alerts, assertions, and auto-heal processes to ensure continuous and reliable processing.
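A minimal governance sketch, assuming a per-dataset catalog entry that records ownership, retention, and a freshness SLA (all names and thresholds here are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class CatalogEntry:
    name: str                  # single, shared definition of the dataset
    owner: str                 # someone accountable for its meaning
    retention_days: int        # purge policy: nothing stays in the lake forever
    freshness_sla: timedelta   # how stale the data may get before alerting
    last_loaded_at: datetime

    def is_fresh(self, now: datetime) -> bool:
        return now - self.last_loaded_at <= self.freshness_sla

entry = CatalogEntry(
    name="clickstream.page_views",
    owner="web-platform-team",
    retention_days=365,
    freshness_sla=timedelta(hours=1),
    last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=3),
)

if not entry.is_fresh(datetime.now(timezone.utc)):
    # In a real setup this would raise an alert or kick off an auto-heal job.
    print(f"ALERT: {entry.name} is stale; notify {entry.owner}")
```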

Is a Data Lake Your Need of the Hour?

A well-designed data lake can help organizations overcome data silos, but one question remains open: how does one know whether they need it or not? Every organization needs to assess this for themselves; e.g., one may NOT need a data lake if they are:

  • Primarily an offline business where all data is in a single system.
  • A single-channel business with all data in a central place.
  • A multi-channel business that does not care about past interactions but rather treats every transaction as a new interaction.

But if an organization is not in any of the categories above, if it cares about data and cross-channel correlation, or if its business is on a digital transformation journey, then it is very likely dealing with these data silos. Once it decides to go for a data lake, it still needs to devise a clear strategy to build and integrate the lake such that it takes care of the basic tenets of a data lake, and it also needs to ensure that the lake aligns well with the organizational structure, as that is equally critical to its adoption and success.

At Naukri, we work with hundreds of data sources with different grains, and it was always a challenge to correlate them or to maintain consistent metrics for all stakeholders. Just like any other organization at this scale, we had our own challenges with data silos: we used to spend a lot of time and effort correlating data and were still left with ambiguous metrics.

To overcome these challenges and to build a scalable system, we started developing a Data Lake solution. The journey had its own challenges, but what we ended up with is a highly scalable lake that can ingest any number of data sources, batch or real-time; can handle high volume, velocity, and variety; and, above all, is equipped to enrich events in real time before they actually hit the lake. Additionally, it enables us to blend multiple data sources and build high-level, actionable data models that help us with real-time actions, batch jobs, and reporting.
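As a minimal sketch of that in-flight enrichment (the profile store, field names, and the idea of a streaming consumer are assumptions for illustration, not our exact pipeline):

```python
# Hypothetical in-memory profile store; in production this could be a
# low-latency cache consulted by a streaming consumer.
USER_PROFILES = {42: {"segment": "active_seeker", "city": "Delhi"}}

def enrich(event: dict) -> dict:
    """Attach profile attributes to a raw event before it lands in the lake,
    so downstream models can blend sources without a second lookup."""
    profile = USER_PROFILES.get(event.get("user_id"), {})
    return {**event, **{f"user_{key}": value for key, value in profile.items()}}

raw = {"user_id": 42, "action": "job_view", "job_id": "J-1001"}
print(enrich(raw))  # the event reaches the lake already carrying segment/city
```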
