Supercharging the Snowflake Python Connector with nanoarrow

Anurag Gupta Adam Ling Avani Chandrasekaran Dewey Dunnington Matt Topol Kae Suarez

The Snowflake Connector for Python provides an interface for transmitting data between Snowflake and a Python application, and it is one of the most popular Snowflake drivers. With millions of weekly downloads, it powers critical Snowflake customer integrations and pipelines. Today, we’re thrilled to announce the preview release of the nanoarrow-based Snowflake Connector for Python.

This new version of the connector is more compact and removes the hard dependency on a specific version of pyarrow, making it possible to load the library in resource-constrained environments with a much smaller footprint while keeping the same fast performance.

These improvements were made possible by the newly developed nanoarrow project. nanoarrow bundles the Arrow C Data and Arrow C Stream interfaces with a minimal set of helper functions, enabling deployment of the Apache Arrow format and streaming anywhere C is available.

Why nanoarrow integration?

Before the nanoarrow integration, the Snowflake Connector for Python had already enabled a number of use cases for developing Python applications. Some examples are:

  • Web applications that rely on frameworks such as Flask and Django and leverage Snowflake for data storage and retrieval.
  • Data pipelines that query and process data from Snowflake, for everything from traditional analytics to machine learning.
  • Snowflake libraries and tools, such as Snowflake SQLAlchemy, Snowpark Python, and SnowSQL.

We have listened to feedback from Snowflake developers and reduced the size of the connector so it can be deployed in resource-constrained environments. Previously, pyarrow added as much as 50 MB of code to the stack, yet in scenarios such as AWS Lambda the environment places a hard limit on how many bytes can be deployed in a package. This was the motivation to create a connector with an even smaller footprint using nanoarrow.

At just 350 KB, the nanoarrow C library was built with the Snowflake use case in mind: use Arrow-native data structures for data transport and support low-level manipulation of that data in an ever-expanding list of environments. The new Snowflake Connector for Python reduces the install size from 53.3 MB to 4.7 MB. And because the nanoarrow library is self-contained, the Snowflake Connector for Python can safely interact with any recent pyarrow release.

How does the nanoarrow-based Snowflake Connector for Python work?

The Snowflake Connector for Python uses the nanoarrow library to process data returned from Snowflake in the Arrow in-memory format and convert it into Python objects or pandas DataFrames.

When data is retrieved in the Arrow IPC format, the connector uses the nanoarrow IPC extension to parse the response into Arrow C Data interface structures (ArrowArray and ArrowSchema). The table of result data contains an ArrowSchema defining the name, type, and metadata for each column. Data is then extracted from the ArrowArray structures and converted from the Arrow in-memory format to Python objects according to Snowflake’s specification.

Key benefits of nanoarrow integration

1. Reduced disk space requirements with no more vendored Arrow libraries

2. Fits in AWS Lambda environment

Before the nanoarrow integration, we were unable to deploy the Snowflake Connector for Python in AWS Lambda: it exceeded the space limitation, leading to the deployment failure shown below.

Running "serverless" from node_modules
Deploying summitdemo2 to stage dev (us-west-2)
Using Python specified in "pythonBin": python
Packaging Python WSGI handler…
✖ Stack summitdemo2-dev failed to deploy (61s)
Environment: darwin, node 20.2.0, framework 3.31.0 (local) 3.31.0v (global), plugin 6.2.3, SDK 4.3.2
Credentials: Local, "user1" profile
Docs: docs.serverless.com
Support: forum.serverless.com
Bugs: github.com/serverless/serverless/issues
Error:
UPDATE_FAILED: ApiLambdaFunction (AWS::Lambda::Function)
Resource handler returned message: "Unzipped size must be smaller than 262144000 bytes (Service: Lambda, Status Code: 400, Request ID: xxxxx)" (RequestToken: xxxx, HandlerErrorCode: InvalidRequest)

Now with the nanoarrow integration, it gets deployed in the AWS Lambda environment successfully!

3. Unpins pyarrow dependency

Before the nanoarrow integration, snowflake-connector-python pinned its pyarrow dependency to version 10, because the vendored Arrow library was ported from pyarrow v10. After integrating the nanoarrow library, snowflake-connector-python no longer pins pyarrow to a specific version, which resolves version conflicts with other libraries that depend on different versions of pyarrow. Users can now stay up to date with the latest features from the pyarrow library.
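For example, a project can now declare a recent pyarrow alongside the connector in its requirements without a resolver conflict (the version numbers below are illustrative, not a recommendation):

```text
# requirements.txt (illustrative versions)
snowflake-connector-python
pyarrow>=12.0.0   # previously the connector's pin forced pyarrow v10
```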

Upcoming plans

When we announced during Snowflake Summit (in Las Vegas in June 2023) that the preview release of the nanoarrow-based Snowflake Connector for Python was coming in July, there was a lot of interest from the Snowflake developer community. We wanted to ensure that the nanoarrow library development would meet Snowflake users’ needs with enough feature coverage, a reduced package size, and a build process more efficient than the previous pyarrow-based connector’s. We have accomplished these goals, and we are excited to release this early preview to the community!

After the preview release, we will release the GA version, which by default uses nanoarrow for data consumption but also allows switching to the vendored Arrow library as a contingency plan. Once the connector attains a stable state, a pure nanoarrow-based GA version will be released. Because we replaced pyarrow with nanoarrow, we will also benchmark the connector’s performance before the GA release to validate that there are no regressions.

The next release of nanoarrow will focus on rounding out existing features and improving documentation based on the Snowflake developers’ experience implementing the connector.

The Snowflake and Voltron Data teams worked together on the release of the nanoarrow-based Snowflake Connector for Python, and we are excited about all the possibilities of this partnership. Voltron Data helps enterprises design and build composable data systems with open standards like ADBC, Arrow, Ibis, and Substrait, aligning perfectly with Snowflake’s philosophy: “We believe in open where open matters” (see Open Source at Snowflake).

Get started today!

We invite you to try out this preview release of the nanoarrow-based Snowflake Connector for Python, and we would love to hear your feedback at developers@snowflake.com.

The nanoarrow-based Snowflake Connector for Python can be installed from pypi with the command:

pip install snowflake-connector-python --pre

Additional resources

For more information on Snowflake Connector for Python, please visit:

https://docs.snowflake.com/en/developer-guide/python-connector/python-connector

GitHub repo at https://github.com/snowflakedb/snowflake-connector-python, or reach out to developers@snowflake.com

For more information on nanoarrow, please see the documentation at https://apache.github.io/arrow-nanoarrow/dev/index.html or reach out via GitHub: https://github.com/apache/arrow-nanoarrow/tree/main
