Unlocking the Power of Pandas on Major Cloud Platforms

Thibaut Gourdel
4 min read · Jun 17, 2024

In my previous article, I discussed whether Pandas was suitable for developing ETL pipelines. To summarize, Pandas is a solid choice for ETL tasks thanks to its rich ecosystem, extensive documentation, and widely recognized API, provided you’re aware of its limitations and know how to address them. To go further, I’ll explore in this article the best ways to leverage Pandas for data engineering on major cloud providers: AWS, Azure, GCP, Snowflake, and Databricks.

Pandas is extremely flexible for exploring data on your laptop with notebooks or scripts. However, when it comes to setting up data workloads in production, you will likely deploy them on major cloud platforms. While Pandas is a universal Python package that can be installed anywhere, there are still specificities and interesting tools you could benefit from when using Pandas on major cloud platforms in terms of scalability, integration and useful abstraction.

🟧 Amazon Web Services

Let’s start with AWS, which is, in my view, the gold standard in terms of Pandas support. First, like most data platforms, AWS provides a notebook interface through Amazon SageMaker with support for common Python libraries, including Pandas. But AWS’ real strength when it comes to Pandas is the AWS SDK for pandas (formerly AWS Data Wrangler). Developed by AWS Professional Services teams, this extensive Python package extends the power of Pandas to AWS services such as Athena, S3, and Redshift, and, most importantly, supports the AWS Glue Catalog for centralized access to datasets. It provides a welcome abstraction for working with cloud services and for moving and transforming data between them. In addition, it supports scaling to large datasets with Ray or Modin. Clearly a must-have if you use Pandas on AWS. Additionally, a few services, such as AWS Lambda and EMR clusters, support executing Pandas code.
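As a sketch, a small pipeline step with the AWS SDK for pandas (the `awswrangler` package) might look like this. The bucket, database, and table names are placeholders, and the transform is kept as a plain-pandas function so it can be tested without AWS credentials:

```python
# Sketch of an ETL step with the AWS SDK for pandas (awswrangler).
# Bucket, database, and table names below are placeholders, not real resources.
import pandas as pd


def dedupe_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Pure-pandas transform: keep the latest row per order_id."""
    return (
        df.sort_values("updated_at")
          .drop_duplicates("order_id", keep="last")
          .reset_index(drop=True)
    )


def run_pipeline() -> None:
    # Imported here so the transform above stays testable without AWS access.
    import awswrangler as wr

    # Query Athena straight into a DataFrame (resolved via the Glue Catalog).
    df = wr.athena.read_sql_query(
        "SELECT order_id, amount, updated_at FROM orders",
        database="sales_db",
    )
    clean = dedupe_orders(df)

    # Write back to S3 as a Parquet dataset registered in the Glue Catalog.
    wr.s3.to_parquet(
        df=clean,
        path="s3://my-bucket/clean/orders/",
        dataset=True,
        database="sales_db",
        table="orders_clean",
        mode="overwrite",
    )
```

Keeping the transform separate from the I/O is a useful pattern here: the pandas logic can be unit tested locally, while `awswrangler` handles the cloud plumbing.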

🌐 Azure

Azure’s support for Pandas is less comprehensive than AWS’s. Azure Data Studio naturally allows for Pandas usage in its notebook interface, but Azure Synapse, their flagship enterprise analytics product, favors Spark, with only minimal support for Pandas. There are, however, bridges between Pandas and PySpark, notably the Pandas API on Spark, which lets you scale Pandas code to big data using an equivalent API. Overall, Pandas support across Azure’s services remains limited. Evidently, services that run standard Python, such as Azure Functions, can run Pandas, but there isn’t any purpose-built abstraction or library to ease the process (using regular Python libraries would be the way to go).
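To illustrate the bridge, here is a minimal sketch using the Pandas API on Spark (`pyspark.pandas`). It assumes a PySpark environment such as a Synapse Spark pool; the storage path and column names are made up for the example:

```python
# Sketch of the Pandas API on Spark (pyspark.pandas), which mirrors the
# pandas API on top of Spark DataFrames. Column names here are illustrative.
import pandas as pd


def top_categories(df, n: int = 3):
    """Works unchanged on a pandas DataFrame or a pandas-on-Spark DataFrame,
    since both expose the same groupby/sum/nlargest API."""
    return df.groupby("category")["revenue"].sum().nlargest(n)


def run_on_spark() -> None:
    # Imported here so the function above stays usable without Spark installed.
    import pyspark.pandas as ps

    psdf = ps.read_parquet("abfss://container@account.dfs.core.windows.net/sales/")
    print(top_categories(psdf))   # executes distributed on Spark
    local = psdf.to_pandas()      # pull a (small!) result back to plain pandas
```

The point of the equivalent API is exactly this: `top_categories` needs no changes to go from a laptop-sized pandas DataFrame to a distributed pandas-on-Spark one.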

☁️ Google Cloud

Google Cloud Platform’s support for Pandas is also relatively limited. First things first, the widely used Google Colab supports Pandas in its notebook interface. It’s also worth noting that Google Colab provides GPU runtimes, allowing you to supercharge your executions with cuDF’s zero-code-change acceleration for Pandas. Check out this recent Nvidia article showing up to 50x speedups over standard Pandas on Colab! But what about integration with Google’s crown jewel, BigQuery? Google provides a Python package, pandas-gbq, to read and write BigQuery tables with Pandas, which is welcome but limited. Beyond that, Pandas support on GCP comes down to this BigQuery package and, as with Azure, standard Python libraries to access other services.
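A sketch of round-tripping a BigQuery table with the pandas-gbq package might look like the following. The project, dataset, and table names are placeholders, and the transform is plain pandas so it can be tested without GCP credentials:

```python
# Sketch of reading from and writing to BigQuery with pandas-gbq.
# Project, dataset, and table names are placeholders.
import pandas as pd


def add_margin(df: pd.DataFrame) -> pd.DataFrame:
    """Pure-pandas transform: derive a margin column."""
    out = df.copy()
    out["margin"] = out["revenue"] - out["cost"]
    return out


def run_pipeline() -> None:
    # Imported here so the transform stays testable without GCP credentials.
    import pandas_gbq

    df = pandas_gbq.read_gbq(
        "SELECT revenue, cost FROM `my-project.sales.orders`",
        project_id="my-project",
    )
    pandas_gbq.to_gbq(
        add_margin(df),
        "sales.orders_with_margin",
        project_id="my-project",
        if_exists="replace",
    )
```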

❄️ Snowflake

Pandas support on Snowflake was quite limited before its latest release in June 2024. Snowflake recently announced the Snowpark Pandas API, which runs Pandas code through Modin (Snowflake acquired Ponder, the company behind Modin). This enables scalable Pandas code execution on the Snowflake platform, alongside their SQL and Snowpark APIs. This addition greatly enhances interoperability and code reuse, and is a significant improvement for Snowflake users wanting to use more Python. By the way, it’s also good to know you can use Pandas in Python UDFs (User-Defined Functions).
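As a hedged sketch of what the Snowpark Pandas API looks like in practice: the connection parameters and table names below are placeholders, and the transform is written in plain pandas style so it also runs on a Snowpark pandas DataFrame:

```python
# Sketch of the Snowpark Pandas API: pandas-style code executed on Snowflake
# via Modin. Connection parameters and table names are placeholders.
import pandas as pd


def flag_large(df, threshold: float = 1000.0):
    """Pandas-style transform that also runs on a Snowpark pandas DataFrame."""
    df = df.copy()
    df["is_large"] = df["amount"] > threshold
    return df


def run_on_snowflake() -> None:
    # Imported here so the transform stays testable without a Snowflake account.
    import modin.pandas as mpd
    import snowflake.snowpark.modin.plugin  # noqa: F401 (registers the backend)
    from snowflake.snowpark import Session

    connection_parameters = {"account": "...", "user": "...", "password": "..."}
    session = Session.builder.configs(connection_parameters).create()

    # read_snowflake uses the active session created above; the computation
    # below is pushed down to Snowflake rather than pulled into local memory.
    df = mpd.read_snowflake("SALES.PUBLIC.ORDERS")
    flag_large(df).to_snowflake("SALES.PUBLIC.ORDERS_FLAGGED", if_exists="replace")
```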

🧱 Databricks

Databricks, founded by the creators of Spark, naturally favors PySpark within its platform. However, given Pandas’ popularity, Databricks does support the Pandas API through the Pandas API on Spark. This means you can use the same pandas syntax, with minimal code changes, on top of PySpark DataFrames.
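For instance, you can switch a PySpark DataFrame into the pandas API and back. This sketch assumes a `spark` session is available, as it is in Databricks notebooks; the table names are placeholders:

```python
# Sketch of moving between PySpark and the Pandas API on Spark on Databricks.
# Assumes a `spark` session (available by default in Databricks notebooks).
import pandas as pd


def normalize(df):
    """Pandas-style column normalization; the same code works on pandas and
    pandas-on-Spark DataFrames."""
    df = df.copy()
    df["amount_norm"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()
    return df


def run_on_databricks(spark) -> None:
    sdf = spark.table("sales.orders")   # regular PySpark DataFrame
    psdf = sdf.pandas_api()             # same data, pandas API on top
    result = normalize(psdf)            # executes distributed on Spark
    result.to_spark().write.saveAsTable("sales.orders_norm")  # back to PySpark
```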

Pandas, the most popular data exploration and manipulation library in Python, benefits from widespread support and integration. However, when it comes to deploying Pandas code in the cloud and integrating with cloud services, support varies among major data platforms. Some platforms stand out, such as AWS with its AWS SDK for pandas and Snowflake with its native support for Pandas through Modin. Also, don’t hesitate to share more tips for using Pandas on the cloud in the comments.

Amphi ETL is an open-source low-code ETL tool that generates native Python code with Pandas. Simplify the development, maintenance and deployment of structured and unstructured data pipelines with Amphi!

Amphi ETL


Thibaut Gourdel

I write about data engineering and ETL. I'm building Amphi, a low-code, Python-based ETL tool for data manipulation and transformation.