Small but Mighty Data

Thibaut Gourdel
6 min read · Jul 16, 2024

In a previous article, ETL Engineering Trends in 2024, I listed a few data engineering trends, and among them was “Small but Mighty Data.” In this post, I will expand on my thoughts regarding the current state of Big Data and Small Data.

🔥 From the Ashes of Big Data

We’ve heard the term Big Data used in many ways since it was popularized in the early 2010s. At the time, Big Data was pitched as the solution for processing the enormous amounts of data coming from all the new sources born of the Cambrian explosion of e-commerce, IoT sensors, and more generally the digital age. It started with Hadoop and MapReduce and later evolved with Apache Spark. The core principle involves storing raw data in a data lake and processing it in a distributed fashion across multiple nodes.

While many companies genuinely needed these solutions, Big Data was often promoted even when there wasn’t a significant amount of data or when processing time wasn’t critical. Everything became a Big Data problem; enterprises could claim they were doing Big Data, and consulting firms could sell (expensive) Big Data implementations. Such is the life of hype cycles. (GenAI, we see you 👋. Is everything a chatbot problem?) Anyhow, a few years later, many were left disappointed by failed Big Data initiatives and by results that didn’t match the investment. A great read on the subject is the now-famous “Big Data is Dead” by Jordan Tigani.

Spark has, however, been widely adopted and used since then, for both Big Data and ML/AI workloads. For analytics use cases, simpler, more effective solutions have emerged through the Modern Data Stack and the good old data warehouse, now powered by the cloud. In parallel, and propelled by data science needs, new data exploration and analysis frameworks were developed, namely pandas, and later Polars and DuckDB. These libraries brought a simple yet powerful set of functions to extract and explore data quickly on a single machine.
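To make that concrete, here’s a minimal sketch of the kind of single-machine exploration these libraries enable (the file name and columns are made up for the example):

```python
import pandas as pd

# Load a local file and get a quick feel for the data.
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

print(df.shape)       # number of rows and columns
print(df.describe())  # summary statistics for numeric columns

# A quick aggregation: total revenue per country, top 10.
top_countries = (
    df.groupby("country")["amount"]
      .sum()
      .sort_values(ascending=False)
      .head(10)
)
print(top_countries)
```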

Data analysts and scientists without a team of data engineers to manage a Spark cluster and serve them data on a platter had to find their own pragmatic ways to crunch data on a laptop or a single machine. Many use cases could, in fact, be handled on a single machine if done right.

🤔 What’s Small Data Anyway?

Well, as often, it depends. If you’ve read this far, “small data”, as opposed to Big Data, could be defined as anything you can process with a single-node library such as pandas, whereas Big Data relies on distributed workloads, most likely using Spark. This is a simple and serviceable definition, yet a limited one. In recent years, the power of individual machines has increased dramatically, and cloud providers now give you access to literal beasts of machines. The data volumes and processing speeds you can achieve on a single machine today rival, and often exceed, most of the MapReduce/Spark clusters set up a decade ago. Polars, DuckDB, and even pandas via Modin, for example, can leverage all the cores of a machine. And if the data is too big to fit in memory, that’s not a blocker: they also offer spill-to-disk and out-of-core execution, among other techniques.
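As a rough sketch of what that looks like with Polars (paths and column names are invented, and the streaming flag is how recent Polars versions expose their out-of-core engine):

```python
import polars as pl

# Lazily scan Parquet files: nothing is loaded yet, Polars just builds
# an optimized query plan.
lazy = (
    pl.scan_parquet("events/*.parquet")
      .filter(pl.col("status") == "completed")
      .group_by("customer_id")
      .agg(pl.col("amount").sum().alias("total_spent"))
)

# Execute the plan on all available cores; the streaming engine processes
# the data in batches, so the dataset can be larger than RAM.
result = lazy.collect(streaming=True)
print(result.head())
```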

We can see that my previous definition is therefore limited. How should we define it, then? By volume? That could be it. As a rule of thumb, pandas’ creator, Wes McKinney, tells us that:

Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. Source

That is for pandas; more performant libraries such as Polars and DuckDB claim to be able to analyze and manipulate datasets anywhere from 1 to 10 times the size of the available memory. Polars and DuckDB use different techniques to handle larger-than-memory datasets. DuckDB, for instance, supports out-of-core execution, meaning it can work with datasets larger than the available memory by storing intermediate results on disk (at the cost of slower processing, depending on disk I/O performance). If we dare to give an estimate, some of the beefiest machines out there go up to 256GB of RAM (MacBook Pro 128GB, EC2 m7g.16xlarge 256GB), which means you could leverage single-node libraries to process anywhere from a few hundred GB to a few TB, depending on your configuration (particularly disk size).
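Here’s a minimal DuckDB sketch of that out-of-core behavior; the paths are placeholders, and the memory limit is set explicitly to show where spilling kicks in:

```python
import duckdb

con = duckdb.connect()  # in-memory database, nothing to deploy

# Cap DuckDB's memory usage and tell it where to spill intermediate results.
con.execute("SET memory_limit = '8GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")

# Aggregate a Parquet dataset that may well be larger than RAM:
# DuckDB streams through it and spills to disk when needed.
result = con.execute("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_parquet('warehouse/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""").df()

print(result)
```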

That is also assuming you need to process the data all at once, which might not be the case. Many data pipelines process data incrementally, meaning only a fraction of the data is handled at a time before landing in its destination. As a result, many use cases can be achieved with these libraries and techniques, potentially with more speed and simplicity and, more importantly, at a lower cost than full-fledged Spark clusters or push-down execution in cloud data warehouses.
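To illustrate the incremental pattern, here’s a sketch of a daily job that only touches the newest partition; the layout (a date-partitioned folder of Parquet files) is an assumption for the example:

```python
from datetime import date, timedelta
import duckdb

con = duckdb.connect()

# Only yesterday's partition is read and transformed; the historical data
# is never loaded.
day = (date.today() - timedelta(days=1)).isoformat()

con.execute(f"""
    COPY (
        SELECT order_id, customer_id, amount, order_date
        FROM read_parquet('landing/orders/date={day}/*.parquet')
        WHERE amount > 0
    )
    TO 'warehouse/orders_clean/date={day}.parquet' (FORMAT PARQUET)
""")
```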

🔍 Choosing the Right Path

Choosing the right approach should then depend on the following factors:

  • Volume of data (full dataset manipulation or incremental?)
  • Speed of processing (is the processing time-constrained by business requirements?)
  • Infrastructure you have access to and are ready to support (and pay for)
  • Your team’s skills (are they fluent in Spark (PySpark), Python, etc.?)

For many current use cases, I believe the complexity introduced over the past years can be greatly reduced, and costs saved, by going “small but mighty.” Take a look at Polars and DuckDB, and test whether you could leverage them for some of your workloads. If you’re processing lots of files, DuckDB could be a particularly good fit. Also check out Ibis, which lets you choose your backend engine and supports both single-node and distributed frameworks, giving you the flexibility to scale as you see fit with the same dataframe code, as sketched below. I also suggest taking a look at the Ibis article “Querying 1TB on a laptop with Python dataframes”.
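Here’s a hedged sketch of the Ibis approach (table and column names are invented): the expression runs on DuckDB locally, and pointing `con` at another supported backend, Spark for instance, leaves the dataframe code unchanged.

```python
import ibis

# A local, single-node backend today...
con = ibis.duckdb.connect()
orders = con.read_parquet("warehouse/orders/*.parquet", table_name="orders")

# ...while the expression below stays identical if `con` is a distributed
# backend instead.
expr = (
    orders.group_by("customer_id")
          .aggregate(total_spent=orders.amount.sum())
          .order_by(ibis.desc("total_spent"))
          .limit(10)
)

print(expr.execute())  # materializes the result as a pandas DataFrame
```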

The skills and resources you have available are also important to consider. If you can afford Databricks or equivalent Spark services and your team members already know Spark, it makes sense to go that route, even if not all your workloads require Spark. Databricks has recently released its serverless compute to simplify infrastructure management and offer more flexibility and scalability (auto-scaling). Similarly, Snowflake’s Snowpark offering lets you pay for what you use within its environment, so if you’re already deep into Snowflake, that might be the right solution for you. Additionally, Snowflake recently announced support for Pandas, powered by Modin, to scale your workloads in Snowflake environments. This offers a smooth migration route to start small and scale as you go with Snowflake.

As a note, this article doesn’t cover real-time data processing needs. Those use cases, such as real-time analytics or log and event processing at millions of records per second, call for specialized solutions like ClickHouse.

As you can see, many solutions exist today, and there isn’t a “right” or “wrong” approach. Weigh the resources and requirements you have. Nowadays, you have a plethora of choices.

At Amphi, we embraced the “small data” mindset and use pandas and DuckDB as the primary frameworks in Amphi ETL for the reasons mentioned above. Amphi is free and open source, so give it a try!


Thibaut Gourdel

I write about data engineering and ETL. I'm building Amphi, a low-code ETL for structured and unstructured data for the AI age.