Data Platforms: The Past

Alexandre Beauvois
Alexandre Beauvois
Published in
8 min readMay 27, 2022

--

The Evolution of Data Platforms

This series of three articles (Data Platforms: Past, Present & Future) is a pedagogical experiment to help apprehend the data world as it is and its evolution. It’s dedicated to non-data experts.

In this first article, we’ll explore the transition in the way how data-driven companies handle medium-to-large data before 2015, up to how any company can do it in 2022.

Data stack — The Basics

The goal of any data stack is:

To help company leaders to make the right decisions at the right time.

It’s also the necessary path for building an AI model, which needs to be trained in supervised cases or calculated in non-supervised cases. Then this model could be used in human decision-making by assisting them and alerting them to risks and consequences.

At the end of the day, the aim is still to make the best decision and sequence the right (re)actions.

TDS vs MDS

The evolution of technologies and data regulations have fundamentally changed the way data stacks have been architected, and after 2014 (the advent of cloud warehouse), appeared what we call today the Modern Data Stack (MDS) or Data Platform (DP), to distinguish them from the Traditional Data Stack (TDS).

Traditional Data Stack (TDS)

There were five major parts:

Ingestion: import data sources (e.g. SaaS apps API or files) into your system and transform them.

Storage: copy the transformed data into a persistence solution.

Prepare: organize, aggregate, deduplicate, curate data

BI: query and visualize data via a business intelligence solution.

AI: learn and model from data via an artificial intelligence solution.

RELT: react to provided insights and use them as new data sources or update existing data sources.

Real-world case: E-Commerce (TDS 2015)

Company A sells shoes from its e-commerce application. They have different data sources. We’ll focus only on a few of them:

  • Ecommerce data (failed orders, product catalog)
  • CRM (contacts, customer interactions, e-mailings statistics)

Leaders need to estimate the impact of customer interactions on failed orders.

  • The data lead (or business analyst) should define the key indicators and how they should calculate them, to be understandable, insightful, and actionable.
  • They need the data lead to prepare the data by crossing sources (e.g., interactions by failed orders, impacts in sales breakdown by product)
  • Once transformed and aggregated, data should be accessible by the business intelligence solution (e.g., Looker, Tableau, Power BI…) and visualized into clear charts in dashboards.

Leaders now want these results for only the last 30 days

  • The data lead must redo the process as temporality changes aggregations

Leaders now want the whole data stack to comply with the GDPR before targeting EU markets.

  • Data governance must be able to filter all data like PII (Personal Identifiable Information) or other sensible data if the user doesn’t have permission to access it. This feature can be called RLS (Role Level Security) and having it implemented only on the BI side or visualization tools is not enough, this should be achieved transversely along the data pipelines.

TDS Pros & Cons

  • No Pros if your company needs to handle more data over time while regulations change
  • Rigidity: relies on infrastructures that can not scale without a lot of pain
  • Fragile complexity: a small change on a part can break many other parts
  • Slow: More data means hours, even days to get computation results
  • Labor: Many tasks made manually involve more mistakes, more time, and more human fatigue
  • Compliancy: Difficult to comply with changing data regulation rules (Even more with excel sheets)

MDS vs MDS as a Service

What happened around 2015 is the emergence of a kind of data grail: A technology that allows scaling almost infinitely in size and computation power, and all of this in the cloud. One of the best surprises lay in the low learning curve as data people can continue using SQL Standards (i.e. close to standards).

The most important impact relies on the new way to process data.

Decoupling the Extract Load (EL) from Transformation (T) means certain tools and technologies can Extract & Load (EL) while others can do transformational logic (T). Fantastic Cloud providers arose for EL (Airbyte, Fivetran, Hevo) with ready-to-use connectors on one hand, and innovative tools that save so much pain for Transformations (DBT) on the other hand.

The shift from TDS to MDS

The MDS saves time and money, a lot. The low and declining costs of cloud computing and storage continue to increase the cost savings of a modern data stack compared with TDS solutions (mostly on-premise).

What about the non-technical part of the data-driven mindset in 2022?

  • Data Culture: What is a Data-Driven Organization? Is the data maturity of a company only the addition of technical skills and a tech stack?
  • Processes: Think about data domains, not into silos, design from the objectives (KPI, dashboards), before implementing. Enter the DataOps vision.
  • Self-service: Think about all the gains you can get from existing solutions than could allow non-data engineers to start getting insight in several hours or days (see MDS as a Service below).
  • Legal: MDS also lowers the legal risks, as it gives more flexibility and accuracy to comply with changing data regulations. Your company partly delegates more legal and security towards cloud service providers. The example below shows part of a data stack dedicated to Privacy:
Part of a data stack dedicated to Privacy
Credits: https://www.bvp.com/atlas/how-data-privacy-engineering-will-prevent-future-data-oil-spills

Said that it is still an impressive system to set up:

https://jeremiahhansen.medium.com/beyond-modern-data-architecture-89c8199ac4f9

And there is one missing part in the MDS architecture above that is the most important and challenging: The data governance implementation and configuration.

Why is it the most challenging part when building and MDS (by far):

  • Many applications in this stack implemented different governance systems (redundancy).
  • Many applications are managing security in an incompatible way.
  • Many applications in this stack own their UI and UX, and the long learning curve to master them all.
  • You’ll end up with many points of failure, and each one can break the delivery of precious insights and miss out on opportunities.
  • The redundancy and feature gaps in the stack make it hard to get the expected results.
  • Compliance with changing legal restrictions is almost unachievable.
Credits: abeauvois + MDN (Modern Data Network community)
Credits: abeauvois + MDN (Modern Data Network community)

To sum it up, integrating good tools doesn’t mean you’ll get a good stack and expected results. It’s still a complex thing. It demands scarce human resources like data engineers and other high-level expertise and data maturity.

You deal with a high Total Cost of Ownership, and the integration of multiple non- complementary parts creates many configuration redundancies.

Governance, collaboration, and security are sometimes impossible between some sub-applications depending on your stack if you haven’t tried several combinations and built a unifying application level abstracting the underlying heterogeneity.

There is where the Modern Data Stack as a Service can help.

MDSaaS

Companies don’t want to host servers anymore if it’s not in their core business and mission. For this reason, IaaS was a game-changer as PaaS and SaaS. Since 2021, MDSaaS, or “managed MDS,” has also proved its efficiency in the market for more and more companies.

A comparison between IaaS, PaaS, SaaS and MDSaaS

When you want to make a quick POC, challenge your internal MDS with MDSaaS, or send a vast data pipeline that would overbook your busy MDS team, you can set it up within hours or days on an MDSaaS.

The Governance and Delivery parts of the modern data stack
Credit: https://selfr.io

You still have some work to do:

You only have to provide a configuration to the MDS as a Service
Insert a configuration file (roles, data sources, transformations, schedules times)

Conclusion

In a highly competitive business landscape and the need to adapt to new information at the right speed, the traditional data stack is not an ideal solution. The modern data stack is naturally replacing it over time, as it’s technologically superior by all means (scalability, automation, observability, standardization), and you get more with less pain than with TDS.

Said that it’s still a complex system to set up and manage. And decision-makers could ask themself the legit question: “Is it in our core mission to own and manage an MDS in 2022?”.

If no, or you’re not satisfied with your current TDS or MDS, then have a try on some MDSaaS providers like Selfr.io, CanvasApp, Octolis, 5x.io, or Serenytics. You’ll discover features that can only exist on all-in-one apps like real-time collaboration, no-code / low-code, the full lineage from start to end, and so on.

In the next articles, we’ll enter deeper into the Governance and Delivery pillars of an MDS.

Feel free to comment to improve this content. This article is for pedagogical purposes, it’s dedicated to non-data-experts, don’t hesitate to share it.

Want to chat about data stacks? Reach out to abeauvois.

If I managed to retain your attention to this point, leave a comment describing how this story made a difference for you, and what other topics about writing on Medium you’d like answers to.

References

--

--

Alexandre Beauvois
Alexandre Beauvois

Learning and sharing knowledge. Entrepreneur and software engineer since 19xx.