Data Platforms for 2020 & Beyond

Ramesh Hariharan
Better Data Platforms
11 min read · Dec 7, 2020

In the last few years, Data Lakes have utterly failed to deliver the promise of Intelligent Automation. We need to move beyond data lakes to something else. What does that look like?

Intelligent Automation has promised tremendous benefits to businesses: improving customer experience, delivering operational excellence, achieving cost reduction or a combination of these.

Intelligent automation leverages analytics, machine learning and automation to deliver hyper-personalized experiences to customers and empowers employees to make smart and thoughtful decisions.

To realize these benefits, in the last few years, companies have ramped up investments in Big Data & AI. This has only accelerated in 2020 after the onset of the pandemic.

Despite all these investments, companies are struggling to close the gap between promise and reality. They are not seeing results commensurate with the investments made.

Here are some statistics to back up these assertions.

Imagine that you are the VP of Data Platforms in FortuneCo. Your mandate is to create the foundational data infrastructure to enable data-driven transformation (or substitute your own key phrase in italics). At the beginning of 2020, you delivered the first version of an Enterprise Data Lake. Your team architected and built a world-class centralized data lake that ingests every possible dataset in the company, structured or unstructured, internal or external, ingested through batch or streaming. The solution stores and catalogs every possible byte of historical data. The solution is an architectural marvel.

However, to your dismay, your business partners are not using the data as much as they should. The “build it and they will come” approach seems to have failed, not to mention the millions of dollars spent on building the shiny new data platform.

Then the pandemic hit, and you believed that business users would become more data-driven. Sure, some users did come, and they used your data platform to make better decisions in the face of uncertainty. However, you are left wondering how to scale up the impact.

This situation is not an imaginary one. According to several studies, many senior executives believe that their companies are not yet fully leveraging data, analytics and AI in their business processes, in spite of significant investments in data platforms.

A company’s data platforms are critical to building Intelligent Automation. LatentView has been at the forefront of building data platforms that deliver the promise of Intelligent Automation. To make this happen, we believe that the next generation of data platforms needs to combine the strengths of data lakes and data warehouses but evolve beyond their limitations.

Today’s data lakes provide a deluge of raw data, but without the business context. Business users are unable to make effective use of this data to drive decisions. These datasets are created without a clear understanding of how they are to be used by their consumers. To democratize data access, these datasets need to be widely available, easily discoverable, understandable and usable, and versioned.

To leapfrog to the next generation, companies must do the following three things: 1) deliver domain-specific datasets, 2) focus on specific use cases, and 3) provide data as a service.

To see what this means, let’s take a brief look at the evolution of data platforms over the last 15 years, along with their successes and failures.

The Enterprise Data Warehouse (EDW)

In the 2000s and early 2010s, companies invested heavily in data and analytics capabilities through software solutions such as the Enterprise Data Warehouse (EDW) and centralized BI. The EDW is a centralized data platform, owned and managed by the IT team. It ingests data from multiple source systems, then cleanses, transforms and stores the data for analysis.

There are two broad ways of developing the EDW: a Corporate Information Factory that integrates all the data and then delivers it to data marts (the Inmon approach), or a collection of business-process-focused dimensional data marts that are “conformed” with each other and collectively serve as the EDW (the Kimball approach). Many enterprises adopt a hybrid approach.

Irrespective of the flavor adopted, the EDW helps businesses get answers to known questions (known-unknowns). This is achieved by explicitly modeling the data structure and enabling users to write SQL queries against this structure.

The teams managing the EDW, comprising business analysts and data engineers, capture significant enterprise domain knowledge and build it into their transformations. This enables the data warehouse to act as a single source of truth, maintain an audit trail of the data and serve as the authoritative source of information for analytics and decisions.

However, as companies become more digital, there are many reasons why the EDW fails to meet the needs of its consumers:

Data sourcing challenges

  • The EDW cannot ingest data from unstructured sources. Data needs to be structured before it is ingested for analysis
  • For very high volume data (such as IoT or clickstreams), the EDW cannot store the data at the lowest level of granularity. Data needs to be aggregated, as otherwise costs escalate and performance degrades quickly
  • Many times, the EDW cannot keep up with the ever-expanding sources of data, as this is managed by a centralized data management team

Data Processing challenges

  • The data pipelines that feed data into the EDW require a tremendous amount of management and scaling. Only 20% of the effort is spent on actual analytics. 80% goes to job scheduling, failure management, capacity provisioning, workload management, etc.
  • The architecture of the EDW makes it harder to avoid technical debt. Many times, the data pipeline jobs are not configuration-driven, resulting in a large amount of avoidable redundancy. Sometimes, to improve speed of delivery, engineering teams take shortcuts that result in tangled architectures. For example, they reuse intermediary output from other processes, which makes them undeclared consumers of these processes and makes those processes harder to change
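
To make the idea of configuration-driven jobs concrete, here is a minimal sketch in Python. The job names, dataset names and transforms are all hypothetical and are not taken from any particular EDW product. The point is only that each job declares its inputs in configuration, so anyone who reuses an intermediate output shows up as a declared consumer.

```python
# Hypothetical job configuration: every job declares its inputs explicitly,
# so consumers of an intermediate dataset are visible in one place.
JOBS = {
    "clean_orders": {"inputs": ["raw_orders"], "transform": "drop_cancelled"},
    "daily_revenue": {"inputs": ["clean_orders"], "transform": "sum_by_day"},
}


def drop_cancelled(rows):
    """Keep only orders that were not cancelled."""
    return [r for r in rows if r["status"] != "cancelled"]


def sum_by_day(rows):
    """Aggregate order amounts into one revenue row per day."""
    totals = {}
    for r in rows:
        totals[r["day"]] = totals.get(r["day"], 0) + r["amount"]
    return [{"day": d, "revenue": v} for d, v in sorted(totals.items())]


# Registry mapping the names used in the configuration to functions.
TRANSFORMS = {"drop_cancelled": drop_cancelled, "sum_by_day": sum_by_day}


def consumers_of(dataset, jobs=JOBS):
    """List the jobs that declare `dataset` as an input."""
    return sorted(name for name, cfg in jobs.items() if dataset in cfg["inputs"])


def run(jobs, datasets):
    """Run jobs once their declared inputs exist; fail on unresolved inputs."""
    results = dict(datasets)
    pending = dict(jobs)
    while pending:
        ready = [
            name for name, cfg in pending.items()
            if all(i in results for i in cfg["inputs"])
        ]
        if not ready:
            raise ValueError(f"unresolved inputs for jobs: {sorted(pending)}")
        for name in ready:
            cfg = pending.pop(name)
            inputs = [results[i] for i in cfg["inputs"]]
            results[name] = TRANSFORMS[cfg["transform"]](*inputs)
    return results
```

With this shape, asking `consumers_of("clean_orders")` before changing that job tells you who depends on it, which is exactly the information that is lost when teams quietly reuse intermediate outputs.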

Consumption challenges

  • Scaling up Machine Learning (ML) pipelines is harder with the EDW. ML jobs require experimenting with a large number of features. To move these into production, the feature transformations need to be translated into ETL jobs. Data scientists may want to create a huge number of features, but end up using only a subset of these. This process cannot be managed by a centralized team at scale
  • Analytics teams at the last mile have found a gap between the enterprise data delivered by the EDW and what they need for insights. In other words, they need more flexibility for ad-hoc analytics than a centralized data model provides. While this can partly be achieved through custom data models sourced from the EDW, these cannot mitigate the rigidity of the EDW’s centralized architecture

Change management challenges

  • Sometimes, organizations have multiple EDWs, resulting in confusion and a siloed view of the business. This can happen for several reasons. For example, mergers and acquisitions can leave two different business units with their own EDWs. As a result, the people running the business may not have an integrated view of the entire enterprise across the silos.
  • Changes in business logic take weeks or months to be implemented in EDWs, resulting in unacceptable latency for businesses competing in the digital era.

The Data Lake arose as a natural architectural pattern to complement the EDW’s focus on structured and explicitly modeled data.

The Monolithic Data Lake

In 2010, Pentaho CTO James Dixon coined the term “Data Lake” to refer to a single source of data that helps users answer both known questions and unknown questions that may arise in the future. The volume of data makes it difficult to store in a data warehouse. No schema is imposed up front, since the data is stored as is, in its raw form.

The data lake, as originally envisioned, complemented EDW’s approach to strictly modeling data, since not everything can be captured in a fixed schema. For example, clickstream data is usually delivered in JSON files, and the structure keeps changing over time depending on the changing nature of digital processes that generate this data.
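
A small Python sketch shows what applying a schema at read time looks like for this kind of drifting JSON. The event fields (`user`, `page`, `url`, `ts`, `device`) and the mapping format are made up for illustration; real clickstream payloads will differ.

```python
import json

# Hypothetical raw clickstream events, stored as-is. The shape drifts:
# "device" appears later, and "page" is renamed to "url".
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ts": 1}',
    '{"user": "u2", "page": "/cart", "ts": 2, "device": "mobile"}',
    '{"user": "u1", "url": "/checkout", "ts": 3, "device": "web"}',
]

# Schema applied when reading, not when writing:
# target field -> (candidate source keys in priority order, default value)
READ_SCHEMA = {
    "user": (["user"], None),
    "page": (["page", "url"], None),
    "ts": (["ts"], None),
    "device": (["device"], "unknown"),
}


def read_with_schema(lines, schema=READ_SCHEMA):
    """Parse raw JSON lines and project each event onto the read schema."""
    rows = []
    for line in lines:
        event = json.loads(line)
        row = {}
        for field, (keys, default) in schema.items():
            # Take the first candidate key present in the event, else the default.
            row[field] = next((event[k] for k in keys if k in event), default)
        rows.append(row)
    return rows
```

The raw files never change. Only `READ_SCHEMA` evolves as the events do, which is the flexibility a fixed warehouse schema cannot offer, and also the work that lands on every consumer of a raw data lake.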

Over time, enterprises started to build data lakes that ingested all the raw data from every source into a single store, rather than from a single source of data. They also extended the capabilities of data lakes to ingest real-time data. However, pooling everything in this way strips the data of its domain context and makes it difficult for end users to work with.

Initially, companies used Hadoop-based platforms to deploy the data lake. As cloud platforms became popular, companies shifted to cloud managed services for storage and computing.

The data lake has delivered on the promise of democratization of data access and served as a foundation for data-driven innovation and transformation. However, this has only been realized in pockets, not at enterprise scale.

Unfortunately, Data Lakes as developed and deployed today face severe challenges:

Monolithic monsters

  • The Data lake has become a place to store every type of raw data, without considering the business value or the data lifecycle
  • This leads to rapid escalation of spending without any business value

Lack of usability

  • The data lake fails to provide any context about the data. If someone wants to analyze the data, they have to manually apply a schema and numerous transformations
  • The data lake does not capture important changes driven within the organization (such as territory realignment, new ERP systems, M&A, etc.)

Not outcome focused

  • In this set-up, the data source owners are not incentivized to provide trustworthy and truthful data
  • The data lake engineers, with a limited understanding of the domain, are forced to provide data to downstream consumers without fully understanding the use cases

Poor adoption

  • Many data lakes were built on the premise of “build it; they will come”. They are built and maintained by hyper-specialized engineers without any thought to usability
  • Many times, when executives started in a new role, they created a data lake to quickly demonstrate impact, without fully understanding how to bring users on board with yet another technology

Bad for security

  • As a storehouse of raw data, the data lake provides unparalleled access to granular enterprise data, for those who can use it
  • This poses risks to privacy and security of data

The Domain-specific Data Services

As stated above, the next generation of data platforms needs to deliver domain-specific datasets, focus on specific use cases and provide data as a service.

Domain-specific Datasets

Rather than create a single monolithic data lake that aims to integrate all the data into a single physical location, enterprises should focus on creating a set of domain-specific data lakes that deliver critical datasets for each domain. The datasets are different from the internal operational data of the source systems.

This improves the quality of the data by ensuring that data engineers work closely with domain owners before the data is pushed to consumers. Domain owners can create multiple datasets that vary in scope and granularity to suit different audiences.

For example, in a retailer, the key domains could be online merchandizing, inventory, procurement, order management, customer management, etc. Each team can publish their own historical and real time datasets in one or more ways, depending on the way the consumers need it.

By way of illustration, the order management team can publish order and invoice transactions, as well as accumulating snapshots of the order fulfillment pipeline. The procurement domain can publish datasets that help in demand planning: procurement transactions by contract, vendor and product dimensions. The inventory team can publish datasets that follow different inventory models: periodic snapshots of inventory, or a record of every transaction that affects inventory, such as orders and returns.
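
The two inventory models are related: a periodic snapshot can be derived from the transaction record. The following Python sketch assumes a made-up transaction format of (day, SKU, quantity change); it is an illustration, not a description of any specific inventory system.

```python
from collections import defaultdict

# Hypothetical inventory transactions: (day, sku, quantity change).
TRANSACTIONS = [
    ("2020-12-01", "sku-1", +100),  # goods received
    ("2020-12-01", "sku-1", -30),   # customer orders
    ("2020-12-02", "sku-1", +5),    # customer return
    ("2020-12-02", "sku-2", +40),   # goods received
    ("2020-12-03", "sku-1", -20),   # customer orders
]


def periodic_snapshots(transactions):
    """Return {day: {sku: on_hand}} with end-of-day balances.

    Balances carry forward for SKUs with no movement on a given day.
    Days with no transactions at all do not get a snapshot in this sketch.
    """
    by_day = defaultdict(list)
    for day, sku, delta in transactions:
        by_day[day].append((sku, delta))

    on_hand = defaultdict(int)
    snapshots = {}
    for day in sorted(by_day):
        for sku, delta in by_day[day]:
            on_hand[sku] += delta
        snapshots[day] = dict(on_hand)  # copy the running balances
    return snapshots
```

A consumer who only needs end-of-day stock can take the snapshot dataset, while a consumer investigating shrinkage or returns needs the transaction-level dataset. Publishing both lets each audience pick the granularity it needs.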

The customer management team can ingest these datasets, transform and aggregate their data at a customer level, and create their own dataset for consumption. This could, for example, be a dataset that’s used to run loyalty programs, make recommendations or offer promotions.
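
As a rough sketch of that customer-level rollup in Python: the order fields and the loyalty threshold below are invented for illustration, and a real loyalty program would use richer rules.

```python
from collections import defaultdict

# Hypothetical order transactions as published by the order management team.
ORDERS = [
    {"customer_id": "c1", "order_id": "o1", "amount": 40.0},
    {"customer_id": "c2", "order_id": "o2", "amount": 15.0},
    {"customer_id": "c1", "order_id": "o3", "amount": 60.0},
]


def customer_summary(orders, loyalty_threshold=100.0):
    """Aggregate orders to one row per customer.

    `loyalty_threshold` is an assumed business rule, used only to show how a
    derived, domain-specific attribute can be added to the published dataset.
    """
    totals = defaultdict(lambda: {"orders": 0, "spend": 0.0})
    for order in orders:
        row = totals[order["customer_id"]]
        row["orders"] += 1
        row["spend"] += order["amount"]
    return {
        customer_id: {**row, "loyalty_eligible": row["spend"] >= loyalty_threshold}
        for customer_id, row in totals.items()
    }
```

The customer management team owns the definition of `loyalty_eligible`, which is the kind of domain knowledge that gets lost when consumers are handed only raw order rows.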

This approach ensures that the data platform team partners with each domain to publish the datasets, and is held responsible for the quality and timeliness of the data delivery.

Data as a Service (Data APIs)

Rather than create a point-to-point interface between producers and consumers, producers disassemble the monolith and deliver data through a set of independent API endpoints. There are several advantages to publishing datasets as a service.

Services can be scaled independently, depending on the needs of the consumers. This approach decouples producers and consumers and protects them from technological changes. It enables teams to adopt polyglot data management frameworks for their data products, based on whatever language and frameworks the team is most comfortable with. This also gels well with the trend of adopting a multi-cloud approach to managing data infrastructure.

To be most effective, the data service APIs must be discoverable by end users and easily understandable by data scientists, developers and data engineers. To enable easier change management, the APIs need to be versioned. Change is inevitable, since new data elements may be added or relationships between data elements may change. Versioning lets existing clients keep functioning while new clients take advantage of newer capabilities and data.
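
To illustrate discoverability and versioning together, here is a minimal in-process sketch in Python. The `DataService` class, its method names and the `orders` dataset are all hypothetical. A real deployment would sit behind HTTP or a similar protocol, which is left out here to keep the example self-contained.

```python
class DataService:
    """Toy registry of versioned dataset endpoints (illustrative only)."""

    def __init__(self):
        self._endpoints = {}  # (dataset, version) -> handler function
        self._docs = {}       # (dataset, version) -> description

    def publish(self, dataset, version, description):
        """Decorator that registers a handler under a dataset and version."""
        def decorator(fn):
            self._endpoints[(dataset, version)] = fn
            self._docs[(dataset, version)] = description
            return fn
        return decorator

    def catalog(self):
        """Discoverability: list every published dataset version."""
        return [
            {"dataset": d, "version": v, "description": self._docs[(d, v)]}
            for d, v in sorted(self._endpoints)
        ]

    def get(self, dataset, version):
        """Fetch a dataset at an explicit version."""
        try:
            handler = self._endpoints[(dataset, version)]
        except KeyError:
            raise LookupError(f"no such dataset/version: {dataset}/{version}") from None
        return handler()


service = DataService()

# Hypothetical backing data owned by the order management domain.
_ORDERS = [{"order_id": "o1", "amount": 40.0, "channel": "web"}]


@service.publish("orders", "v1", "Order id and amount")
def orders_v1():
    # v1 clients keep receiving exactly the fields they were built against.
    return [{"order_id": o["order_id"], "amount": o["amount"]} for o in _ORDERS]


@service.publish("orders", "v2", "Adds the sales channel")
def orders_v2():
    # v2 exposes the new data element without breaking v1 consumers.
    return [dict(o) for o in _ORDERS]
```

A client calling `service.get("orders", "v1")` is unaffected when `channel` is introduced in v2, and `service.catalog()` gives new users one place to find out what exists.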

Use-case focused Approach

Last but not least, domain teams take a use-case focused approach to building their data products. Traditionally, data management teams are siloed by function: ingestion, processing, delivery, analytics, etc.

A use case helps teams focus on outcomes rather than silos. Instead of working within one functional data management silo, a team builds everything that a given use case needs, across those functions.

Use cases must be tied to specific business outcomes. A use case does not necessarily have to involve all the steps from ingestion to analytics. For example, the use case could be: deliver real-time status updates of in-store item availability to customers. The inventory team publishes this dataset, and the online commerce team takes this data, creates a microservice and integrates it with the commerce platform.

Taking a use-case based approach involves creating a cross-functional team of data scientists, data engineers and business analysts who work together to define and implement use cases.

For example, consider a recommendation engine use case. The first step in building a recommendation engine involves computing similarity between customers and items based on past purchases. The next step is to embed the results from the recommendation engine into customer-facing applications, such as online retail and customer promotion emails. The third step is to measure the effectiveness of recommendations and constantly improve.
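
As a rough illustration of the first step, the sketch below computes item-to-item cosine similarity from purchase history and uses it to suggest items. This is one simple way to compute similarity, chosen here for brevity, and the customers and items are made up.

```python
from math import sqrt

# Hypothetical purchase history: customer -> set of items bought.
PURCHASES = {
    "alice": {"tea", "mug"},
    "bob": {"tea", "mug", "kettle"},
    "carol": {"kettle", "coffee"},
}


def item_similarity(purchases):
    """Cosine similarity between items, based on which customers bought them."""
    buyers = {}
    for customer, items in purchases.items():
        for item in items:
            buyers.setdefault(item, set()).add(customer)

    sims = {}
    for a in buyers:
        for b in buyers:
            if a != b:
                overlap = len(buyers[a] & buyers[b])
                sims[(a, b)] = overlap / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims


def recommend(customer, purchases, top_n=2):
    """Rank items the customer has not bought by similarity to what they own."""
    sims = item_similarity(purchases)
    owned = purchases[customer]
    scores = {}
    for (a, b), score in sims.items():
        if a in owned and b not in owned and score > 0:
            scores[b] = scores.get(b, 0.0) + score
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

The second and third steps, embedding the results in customer-facing applications and measuring their effectiveness, are where the cross-functional team matters: the same people who compute the scores need to see how the recommendations perform.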

Rather than focusing on tasks in a siloed manner, a use-case based approach ensures that teams focus on delivering the data product across the silos. This reduces friction, improves productivity and ensures better quality of the data product.

Pivoting to Domain-specific Data Services

There are many questions about how organizations can pivot to domain-specific datasets. Some have suggested radically disbanding centralized data management teams and reorganizing them into domain-specific teams. That may work well in theory, but may not be practical given organizational realities.

Coming back to our VP of Data Platforms at FortuneCo, here is where the organization made smart moves. Rather than blame the business team for not adopting the data platform, the VP’s team started to partner with the business by focusing on specific use cases. Rather than deliver a data lake, they pivoted to helping the business achieve concrete outcomes.

They reorganized themselves to align with the domains. They started pilot projects that deliver recommendation engines, ad-hoc analytics for revenue management, pricing optimization, etc. Each of these use cases requires specific datasets, analytical models, integration with operational systems, etc. They created cross-functional teams that work across these use cases.

The pilot projects started to deliver results and the VP was relieved. However, the next questions arise: how do I scale up this structure? How do I ensure I do not reinvent the wheel from a technology perspective? How do I ensure the privacy and security of my data assets?

Answers to these questions will have to wait for a future post.

Ramesh Hariharan
Better Data Platforms

I’m a data scientist who understands software engineering and architecture. I build cloud-scale, data intensive applications powered by Machine Learning models.