ArcticDB vs. Pandas: Scaling to Production-Size Datasets Without Overloading Your RAM
Python has grown to dominate data science, and its Pandas package has become the go-to tool for data analysis. It is great for tabular data and, because it keeps everything in memory, comfortably handles files of up to about 1 GB if you have plenty of RAM. Within those size limits it is also good with time-series data, thanks to its built-in time-series functionality.
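To illustrate what that built-in support looks like, here is a minimal sketch using a synthetic daily series (not the datasets discussed later), showing the kind of resampling and rolling-window operations Pandas offers out of the box:

```python
import numpy as np
import pandas as pd

# Synthetic example: one year of daily "prices" on a datetime index
idx = pd.date_range("2023-01-01", periods=365, freq="D")
prices = pd.Series(
    np.random.default_rng(0).normal(100, 5, len(idx)), index=idx
)

# Built-in time-series conveniences: resampling and time-based rolling windows
weekly_mean = prices.resample("W").mean()      # weekly averages
rolling_30d = prices.rolling("30D").mean()     # 30-day rolling mean

print(weekly_mean.head())
```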
That being said, when it comes to larger datasets, Pandas alone might not be enough. And modern datasets are growing exponentially, whether they’re from finance, climate science, or other fields.
This means that, as of today, Pandas is a great tool for smaller projects and exploratory analysis. It is not great, however, when you face bigger tasks or want to scale into production fast. Workarounds exist, such as Dask, Spark, Polars, and chunking, but they bring additional complexity and bottlenecks of their own.
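Chunking, for example, means streaming a large file through Pandas in pieces instead of loading it all at once. A minimal sketch, assuming a hypothetical large_file.csv with a value column, might look like this:

```python
import pandas as pd

# Process a large CSV in fixed-size chunks to keep memory usage bounded.
# "large_file.csv" and the "value" column are placeholders for illustration.
total = 0.0
count = 0
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean value:", total / count)
```

It works, but you now have to write and maintain the bookkeeping yourself, which is exactly the kind of overhead that grows with every additional aggregation.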
I faced this problem recently. I wanted to see whether weather data from the past 10 years correlates with the stock prices of energy companies. The rationale is that there might be sensitivities between global temperatures and the stock-price evolution of fossil-fuel and renewable-energy companies. If such sensitivities were found, that would be a strong signal for Big Energy CEOs to start cutting their…