The Amazing Rise of the Old-school Modern Data Stack

Hlynur Magnusson
Published in Globant
7 min read · Sep 27, 2021
Charlie Chaplin in “Modern Times” (1936)

Within the ever-changing landscape of Data, Analytics, and AI, one of the biggest buzzwords today is the Modern Data Stack (MDS). This family of cloud-native tools and technologies aims to make it a lot easier to set up and work with a proper data platform. But what’s all the fuss about, and what are the components and characteristics of these technologies? To answer this, we will dive into a typical data stack (ingestion, storage, transformation, and output) and, hopefully, figure out what a modern version of it looks like and how it works.

Modern Ingestion

The first part of a data pipeline is all about connection: getting data from a source to a sink. These are the E and L parts of Extract-Transform-Load (ETL), a common way to describe a data pipeline. The connections are often made with custom code, but MDS focuses on ready-to-use connectors. These are usually offered in the form of NoOps SaaS (Software-as-a-Service), i.e. a service that the user does not have to maintain.

  • Fivetran and Stitch (now part of Talend) are examples of ingestion platforms that focus on delivering no-hassle, code-less data integration. They are also examples of the different pricing models these can have. Stitch charges by the number of rows ingested while Fivetran has a flat fee per connection.
  • Airbyte is a promising new open-source ingestion platform similar to the SaaS options above. It can run in containers on any of the cloud providers and is not tied to a vendor pricing model. It also offers more flexibility to customize connectors or build your own.
  • Some of the bigger data players have also started branching into this space. Delta Live Tables from Databricks offers a declarative way of creating ETL jobs. These can be written in Python or SQL and then run with the help of an underlying file notification system.
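
For contrast with the ready-to-use connectors above, here is a minimal sketch of what the custom-code route for the E and L steps can look like. It is an illustration only: the inlined CSV, the raw_orders table, and the use of an in-memory SQLite database as a stand-in for a real warehouse are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export, inlined so the sketch is self-contained.
SOURCE_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,20.00
3,alice,5.25
"""


def extract(raw_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def load(rows, conn):
    """Load: write the rows unchanged into a raw table in the sink."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
loaded = load(extract(SOURCE_CSV), conn)
print(f"loaded {loaded} rows")
```

Even this toy version leaves out scheduling, retries, schema changes, and incremental loads, which is the kind of maintenance the managed connectors promise to take off your hands.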

As there are so many different types of data sources, the list of integration services goes on, but let’s move on to the storage options.

Modern Storage

Storage, or data warehousing, is handled by a cloud data warehouse such as Amazon Redshift, Google Cloud’s BigQuery, or Snowflake. Delta Lake from Databricks also gets included in this club. These options are much cheaper and more accessible than older data warehouse solutions, which often required setting up specialized hardware on-premise. Cloud-based solutions can offer an on-demand pricing model where you only pay for what you use, which lowers the barrier to entry for data warehousing.

This is the central piece of MDS and, for many, a prerequisite and a catalyst for some of the other pieces. Without relatively cheap and accessible cloud data warehouses, tools like dbt (covered shortly) would most likely not be here today.

Modern Transformation

The transformation step is the T part of Extract-Transform-Load (ETL). Here the data is converted, cleaned, aggregated, etc., often using frameworks like Apache Spark and programming languages like Java, Scala, and Python. Sometimes, especially in data warehousing, the data is first loaded into the warehouse and then transformed. In this case, we talk about ELT (Extract-Load-Transform), which is quite common in MDS.
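
To make the ELT ordering concrete, here is a small sketch in which the raw data lands first and the transformation runs afterwards as SQL inside the engine. The table names and sample rows are made up, and an in-memory SQLite database stands in for a cloud data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# Load: raw data lands in the warehouse untouched (amounts are still text).
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("alice", "10.50"), ("bob", "20.00"), ("alice", "5.25")],
)

# Transform: the warehouse engine itself casts and aggregates the data.
conn.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_orders
    GROUP BY customer
    """
)

totals = conn.execute(
    "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
print(totals)
```

The raw table stays available after the transformation, so a new model can be built from it later without re-ingesting anything.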

The one dominant player in Modern Transformation is dbt (short for data build tool). It has both a paid cloud version and a free open-source version. dbt relies on SQL to transform data. In fact, there is no other option, as it delegates the heavy lifting to the actual data warehouse engine. dbt adds some “code-like” functionality to the stew, e.g. control flow such as conditionals and for loops, which plain SQL queries lack. This is done with templating (Jinja). It also comes with built-in features for testing and documentation, and its plain-text project files fit naturally into version control.
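
To give a feel for what a templated loop buys you, here is a rough sketch that generates one aggregate column per payment method. Note the assumptions: dbt does this with Jinja inside the SQL file itself, whereas this example uses plain Python string building as a stand-in, and the function, table, and column names are hypothetical.

```python
# Hypothetical list of payment methods to pivot into columns.
PAYMENT_METHODS = ["card", "cash", "voucher"]


def render_payments_model(methods):
    """Build one SUM(CASE ...) column per method, as a template loop would."""
    columns = [
        f"    SUM(CASE WHEN payment_method = '{m}' THEN amount ELSE 0 END)"
        f" AS {m}_amount"
        for m in methods
    ]
    return (
        "SELECT\n    order_id,\n"
        + ",\n".join(columns)
        + "\nFROM raw_payments\nGROUP BY order_id"
    )


sql = render_payments_model(PAYMENT_METHODS)
print(sql)
```

Adding a fourth payment method means changing one list rather than copying another CASE expression by hand, which is the kind of repetition the templating layer is meant to remove.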

Besides using dbt, there is the alternative of doing the transformation directly within the data warehouse. This would also be done with SQL, but without the added features of the modern tool. Later on, we will look more closely at the pros and cons of SQL.

Modern Output

The backers of MDS come mainly from the Analytics and Business Intelligence side, so the output or consumption part emphasizes visualization and analytics tools over Machine Learning. Still, AI and ML workflows can also plug into any data platform.

Besides more veteran tools like Tableau and Power BI (although Power BI is only 10 years old), examples of modern visualization tools include Mode and Looker. Both of these are SQL-based and easy to use. Looker (now part of Google) has its own descriptive language, LookML, which translates into SQL queries. The aim here is to hide the complexities of SQL, mainly for business users. Mode is more lightweight on the analytics side but stronger on collaboration, offering interactive and shareable reports written with SQL or Python/R notebooks.

Benefits of MDS

The first of the benefits of MDS is simplicity. Almost all of the technologies mentioned above promise some kind of “ease-of-use”, maybe a “code-less drag-and-drop”, “intuitive UI”, or “self-service”. This flattens the learning curve, which in turn can deliver a rapid set-up and eventually translate into lower costs.

The second one on the list is the use of SQL. Many of these solutions are SQL-based or even purely SQL. One of its benefits is, once again, greater simplicity. SQL is declarative, and arguably simpler to use and learn than a fully-fledged programming language. But the main benefit is probably its greater pool of potential users. For each data engineer fluent in Java or Python, we have several data analysts who could possibly set up similar data pipelines with SQL, given the right tools. This matters a great deal given the perpetual shortage of data engineers.
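
As a small illustration of what “declarative” means here, the sketch below computes the same per-customer totals twice: once by spelling out the loop, once by describing the result in SQL. The sample data is invented and an in-memory SQLite database plays the role of the SQL engine.

```python
import sqlite3
from collections import defaultdict

orders = [("alice", 10.5), ("bob", 20.0), ("alice", 5.25)]

# Imperative: spell out the loop and the bookkeeping yourself.
totals_imperative = defaultdict(float)
for customer, amount in orders:
    totals_imperative[customer] += amount

# Declarative: state the result you want and let the engine work out the steps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
totals_declarative = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)

print(dict(totals_imperative) == totals_declarative)
```

The SQL version says nothing about iteration order or intermediate state, which is a large part of why it tends to be easier to pick up for people who are not programmers by trade.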

Finally, using MDS requires lower upfront costs. This can be the difference between paying a no-strings-attached monthly bill for SaaS or buying and owning a big computing box. This benefit is also related to the skills-cost involved, e.g. between building a custom application with Scala or buying something off the shelf.

Drawbacks of MDS

The argument so far is certainly impressive: MDS can deliver results both faster and cheaper than the alternative traditional stack. But certainly, there are disadvantages.

One of them is, well, that same old SQL we were just praising above. Much has been said about the pros and cons of this forty-year-old technology (here is one interesting example of the cons). SQL is not a general-purpose programming language; it is a domain-specific language (DSL) built for the relational model. This brings some “programming” limitations and, in general, less flexibility and capacity to handle more complex data needs.

There is also a tooling problem. SQL does not have the same support for debugging, testing, and version control as most modern programming languages. Improving this toolset is one of the aims of frameworks like dbt.
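
One common way to close part of the testing gap is to express each data test as a query that selects the offending rows, so that an empty result means the test passes. This is roughly the idea behind the tests that tools like dbt add. The sketch below is a simplified illustration under that assumption, with made-up table and check names and SQLite standing in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "b@example.com")],
)

# Each check selects the offending rows; no rows means the check passes.
CHECKS = {
    "email_not_null": "SELECT customer_id FROM customers WHERE email IS NULL",
    "customer_id_unique": (
        "SELECT customer_id FROM customers "
        "GROUP BY customer_id HAVING COUNT(*) > 1"
    ),
    "customer_id_not_null": (
        "SELECT customer_id FROM customers WHERE customer_id IS NULL"
    ),
}


def run_checks(conn, checks):
    """Return a mapping of check name to number of failing rows."""
    return {name: len(conn.execute(sql).fetchall()) for name, sql in checks.items()}


results = run_checks(conn, CHECKS)
for name, failures in results.items():
    status = "PASS" if failures == 0 else "FAIL"
    print(f"{name}: {status} ({failures} failing rows)")
```

Because the checks are plain text, they can live in version control next to the transformations they guard.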

Another issue with MDS is handling sensitive data, such as Personally Identifiable Information (PII). Sensitive data must often be dealt with before it gets loaded into a data warehouse, so doing ELT might not be an option. Also, the different ways of anonymizing sensitive data may require more custom processing than offered with ready-to-use connectors.
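
As one example of such custom processing, the sketch below replaces an email address with a keyed hash before anything is loaded, so the raw value never reaches the warehouse. Keyed hashing is strictly pseudonymization and only one of several possible techniques. The secret key, field names, and sample rows are hypothetical, and SQLite again stands in for the warehouse.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical secret; in practice this would come from a secrets manager.
SECRET_KEY = b"replace-me"


def pseudonymize(value):
    """Replace a sensitive value with a keyed SHA-256 digest."""
    normalized = value.lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


source_rows = [
    {"email": "Alice@example.com", "amount": 10.5},
    {"email": "alice@example.com", "amount": 5.25},
]

# Transform before load: the raw email never reaches the warehouse.
safe_rows = [(pseudonymize(r["email"]), r["amount"]) for r in source_rows]

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
conn.execute("CREATE TABLE orders (customer_key TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", safe_rows)
stored = conn.execute("SELECT customer_key, amount FROM orders").fetchall()
print(len(stored), "rows stored without raw emails")
```

The same customer still maps to the same key, so joins and aggregations keep working downstream, but this step has to run before the load, which is why a pure ELT flow may not fit.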

Conclusion

It is almost certain that the MDS train will keep on rolling; we need only look at the current momentum of technologies like dbt and Airbyte. New roles like “Analytics Engineer” are already appearing in the data landscape, and we constantly hear about the “Rise of the Data Analyst”.

Although we have looked at a “full” data stack, from input to output, there is a lot more to it. New components and layers are being added to the stack. Reverse ETL covers the flow from the data warehouses into more specialized third-party systems. This also touches on the trend of Data Sharing. Then there is much talk about a Governing Layer, adding ETLG as yet another pipeline variant.

