Data Science Collective

Advice, insights, and ideas from the Medium data science community


ArcticDB vs. Pandas: Scaling to Production-Size Datasets Without Overloading Your RAM

ArcticDB can help where Pandas hits its limits. Image generated with Leonardo AI

Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and comfortably handles data files of up to roughly 1GB, provided your machine has plenty of RAM, because it loads the entire dataset into memory. Within these size limits, it is also good with time-series data because it comes with some built-in support.

That being said, Pandas alone might not be enough for larger datasets. And modern datasets keep growing rapidly, whether they come from finance, climate science, or other fields.

This means that, as of today, Pandas is a great tool for smaller projects or exploratory analysis. It is not great, however, when you're facing bigger tasks or want to scale into production fast. Workarounds exist (Dask, Spark, Polars, and chunking are some of them), but they come with additional complexity and bottlenecks.
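To make the chunking workaround concrete, here is a minimal sketch using the `chunksize` parameter of `pd.read_csv`. The helper name `chunked_mean`, the column name `temperature`, and the tiny in-memory CSV are made up for illustration; a real dataset would live in a file on disk.

```python
import io

import pandas as pd


def chunked_mean(csv_source, column, chunksize=100_000):
    """Compute the mean of one column without loading the whole file.

    Hypothetical helper: reads `chunksize` rows at a time and keeps
    only a running sum and count in memory.
    """
    total = 0.0
    count = 0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


# Illustrative stand-in for a large file (assumed data, not from the article).
sample = io.StringIO(
    "date,temperature\n"
    "2024-01-01,1.0\n"
    "2024-01-02,2.0\n"
    "2024-01-03,3.0\n"
    "2024-01-04,6.0\n"
)
mean_temp = chunked_mean(sample, "temperature", chunksize=2)
print(mean_temp)
```

This works for a simple aggregate, but it also shows where the extra complexity comes from: anything that needs the whole dataset at once, such as a sort or a join, requires hand-written bookkeeping across chunks.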

I faced this problem recently. I wanted to see whether weather data from the past 10 years correlates with the stock prices of energy companies. The rationale is that the stock prices of fossil-fuel and renewable-energy companies might be sensitive to global temperatures. If one found such sensitivities, that would be a strong signal for Big Energy CEOs to start cutting their…

Published in Data Science Collective


Written by Ari Joury, PhD

Founder of Wangari. Sustainable finance & ESG-financial modeling. Get all articles 3 days in advance: https://wangari.substack.com
