[Snowflake Summit 2024] Cheatsheet to Snowpark Pandas API

The Snowpark pandas API is a formidable addition to Snowflake’s capabilities, seamlessly bringing the ease and familiarity of pandas into the realm of data analytics. Data engineers and data analysts can readily incorporate this powerful functionality into their daily workflows, capitalizing on the robust features pandas provides.

Let’s explore the Snowflake Snowpark pandas API in detail.

Executive Summary

  • What Does the Snowpark Pandas API Do?
  • Benefits of Using the Snowpark Pandas API
  • Snowpark pandas DataFrame vs. Snowpark DataFrame
  • Snowpark pandas DataFrame vs. Native pandas DataFrame
  • Snowpark pandas DataFrame Limitations

What Does the Snowpark Pandas API Do?

  • The Snowpark pandas API lets you run your pandas code directly on your data in Snowflake.
  • With the Snowpark pandas API, you get the familiar pandas-native experience together with the scalability and security benefits of Snowflake.
  • It runs workloads natively in Snowflake by transpiling pandas operations to SQL, taking advantage of Snowflake’s parallelization, data governance, and security. A minimal getting-started sketch follows this list.
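
The sketch below uses placeholder connection parameters and placeholder table and column names (SAMPLE_TABLE, REGION, SALES); it is not an official example:

```python
import modin.pandas as pd                # Snowpark pandas is exposed through the Modin API
import snowflake.snowpark.modin.plugin   # registers the Snowflake backend for Modin
from snowflake.snowpark import Session

# Placeholder connection details; fill in your own account information.
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Read a Snowflake table as a Snowpark pandas DataFrame and use familiar
# pandas calls; the work is transpiled to SQL and runs inside Snowflake.
df = pd.read_snowflake("SAMPLE_TABLE")
print(df.groupby("REGION")["SALES"].sum())
```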

Benefits of Using the Snowpark Pandas API

  • The API offers Python developers a familiar interface: a pandas-compatible layer that runs natively in Snowflake.
  • API bridges the convenience of pandas with the scalability of mature data infrastructure.
  • Pandas can now run at Snowflake speed and scale by leveraging pre-existing query optimization techniques within Snowflake.
  • No code rewrites or complex tuning are required, so you can move from prototype to production seamlessly (see the sketch after this list).
  • Data does not leave Snowflake’s secure platform.
  • This feature leverages the Snowflake engine, and you do not need to set up or manage any additional compute infrastructure.
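
To make the “no code rewrites” point concrete: porting an existing pandas script typically amounts to swapping the import and adding the plugin import. A sketch, assuming an active Snowpark session (created as in the previous example) and a placeholder table name:

```python
# Before: client-side pandas
# import pandas as pd

# After: the same pandas API, backed by Snowflake. Assumes a Snowpark
# session is already active, as in the previous sketch.
import modin.pandas as pd
import snowflake.snowpark.modin.plugin   # registers the Snowflake backend

df = pd.read_snowflake("SALES")          # placeholder table; replaces e.g. pd.read_csv
print(df.describe())                     # unchanged pandas code, runs in Snowflake
```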

Snowpark pandas DataFrame vs. Snowpark DataFrame

  • DataFrames in Snowpark and pandas are semantically different.
  • Snowpark DataFrames are modeled after PySpark: they operate on the original data source, retrieve the most recently updated data, and do not maintain row order across operations.
  • Snowpark pandas DataFrames are modeled after pandas: they operate on a snapshot of the data, maintain order during operations, and allow order-based positional indexing.
  • The Snowpark pandas DataFrame API extends Snowpark and gives pandas users a familiar interface that eases migration and adoption; it is not a replacement for Snowpark.
  • You can convert between Snowpark DataFrames and Snowpark pandas DataFrames with the following operations:
    - to_snowpark_pandas()
    - to_snowpark()
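
A short sketch of a round trip between the two types; the table name and the index_label value are placeholders, and an active Snowpark session is assumed:

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()               # reuse the session created earlier
snowpark_df = session.table("SAMPLE_TABLE")  # placeholder table name

# Snowpark DataFrame -> Snowpark pandas DataFrame
snowpark_pandas_df = snowpark_df.to_snowpark_pandas()

# Snowpark pandas DataFrame -> Snowpark DataFrame. Snowpark DataFrames do
# not keep row order, so persist the pandas index as a column if you need it.
snowpark_df_again = snowpark_pandas_df.to_snowpark(index=True, index_label="ROW_POSITION")
```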

Snowpark pandas DataFrame vs. Native pandas DataFrame

  • Snowpark pandas and native pandas have similar DataFrame APIs with matching signatures and similar semantics.
  • Snowpark pandas respects the semantics described in the native pandas documentation as much as possible, but it uses the Snowflake computation and type system.
  • When native pandas executes on a client machine, it uses the Python computation and type system.
  • Like native pandas, Snowpark pandas also has the notion of an index and maintains row ordering.
  • Data types: Snowpark pandas is constrained by the Snowflake type system, which maps pandas objects to SQL by translating pandas data types to Snowflake SQL types. Most pandas types have a natural equivalent in Snowflake, but the mapping is not always one-to-one.

A native pandas DataFrame relies on NumPy and, by default, follows the NumPy and Python type system for implicit type casting and inference. For example, it treats booleans as integer types, so 1 + True returns 2.

A Snowpark pandas DataFrame, by contrast, maps NumPy and Python types to Snowflake types and uses the underlying Snowflake type system for implicit type casting and inference. For example, in accordance with Snowflake’s logical data types, it does not implicitly convert booleans to integer types, so 1 + True results in a type conversion error.
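
A small sketch of the difference. The documented behavior is simply a type conversion error; exactly how and when the error surfaces (it is raised by the underlying SQL query on evaluation) is an assumption here:

```python
import pandas as native_pd
import modin.pandas as pd                # Snowpark pandas; assumes an active session
import snowflake.snowpark.modin.plugin

# Native pandas follows NumPy rules: booleans coerce to integers.
native_pd.Series([1, 2]) + True          # -> Series([2, 3])

# Snowpark pandas follows Snowflake's logical types: BOOLEAN is not
# implicitly cast to NUMBER, so the same expression produces a type
# conversion error when the result is evaluated, instead of [2, 3].
pd.Series([1, 2]) + True                 # raises a type conversion error
```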

Native pandas DataFrames execute operations immediately and materialize results fully in memory after each operation. This eager evaluation can increase memory pressure, because data must be moved extensively within a single machine.

Snowpark pandas DataFrames provide the same API experience as pandas: they mimic pandas’ eager evaluation model while internally building a lazily evaluated query graph, which enables optimization across operations.
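
A sketch of what that means in practice (table and column names are placeholders, and an active Snowpark session is assumed):

```python
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

df = pd.read_snowflake("SAMPLE_TABLE")   # placeholder table

# These calls only extend the lazily evaluated query graph;
# no query has run in Snowflake yet.
filtered = df[df["SALES"] > 1000]
summary = filtered.groupby("REGION").agg({"SALES": "mean"})

# Evaluation is triggered when a result is actually needed, for example
# when the DataFrame is printed or pulled to the client as native pandas.
print(summary)                  # executes the combined, optimized query
local = summary.to_pandas()     # materializes the result locally
```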

Snowpark pandas DataFrame Limitations

Snowpark pandas has the following limitations:

  • Snowpark pandas provides no guarantee of compatibility with OSS third-party libraries. Starting with version 1.14.0a1, however, Snowpark pandas introduces limited compatibility for NumPy, specifically for np.where usage. For more information, see NumPy Interoperability.
  • When calling third-party library APIs with a Snowpark pandas DataFrame, Snowflake recommends converting it to a native pandas DataFrame by calling to_pandas() before passing it to the third-party call (see the sketch after this list).
  • Snowpark pandas is not integrated with Snowpark ML. When using Snowpark ML, we recommend converting the Snowpark pandas DataFrame to a Snowpark DataFrame using to_snowpark() before calling Snowpark ML.
  • Lazy Index objects are not supported. When dataframe.index is called, it returns a native pandas Index object, which requires pulling all data to the client side.
  • Not all pandas APIs have a distributed implementation yet in Snowpark pandas. For unsupported APIs, NotImplementedError is thrown. Operations that have no distributed implementation fall back to a stored procedure. For information about supported APIs, refer to the API reference documentation.
  • Snowpark pandas requires, and is only compatible with, pandas 2.2.1.
  • Snowpark pandas provides fast, zero-copy clone capability when creating DataFrames from Snowflake tables. However, several table types do not support zero-copy cloning and will materialize the data, which can be slow for large tables; examples include hybrid tables, Iceberg tables, external tables, and tables from shared databases.
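
For the third-party interoperability point above, a minimal sketch; matplotlib stands in for any OSS library, and the table and column names are placeholders:

```python
import modin.pandas as pd
import snowflake.snowpark.modin.plugin
import matplotlib.pyplot as plt          # stand-in for any third-party library

df = pd.read_snowflake("SAMPLE_TABLE")   # placeholder table; session assumed active

# Filter and aggregate in Snowpark pandas first so only the needed rows
# are pulled to the client, then convert with to_pandas() for the library.
local_df = df[df["SALES"] > 1000].to_pandas()

plt.hist(local_df["SALES"])
plt.show()
```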

About Me:

Hi there! I am Divyansh Saxena

I am an experienced Cloud Data Engineer with a proven track record in Snowflake Data Cloud technology, highly skilled in designing, implementing, and maintaining data pipelines, ETL workflows, and data warehousing solutions. With advanced knowledge of Snowflake’s features and functionality, I am a Snowflake Data Superhero and Snowflake SnowPro Core SME. Having spent most of my career in the Snowflake Data Cloud, I have a deep understanding of cloud-native data architecture and leverage it to deliver high-performing, scalable, and secure data solutions.

Follow me on Medium for regular updates on Snowflake best practices and other trending topics.

Also, I am open to connecting with data enthusiasts across the globe on LinkedIn:

https://www.linkedin.com/in/divyanshsaxena/
