Intro to Data Engineering with Snowpark for Python: Hands-on lab & FAQs

Intro to Data Engineering with Snowpark for Python: NYC Citi Bike analysis

This week Jeremiah Hansen and I presented a hands-on lab on how to use Snowpark for data engineering use cases. Many of you joined, and together we built an end-to-end data pipeline to analyze NYC Citi Bike data.

In case you missed it:

We also got some great questions from practitioners, and we want all Snowflake developers to benefit from them. So here we go!

Can you run Scala Spark code on Snowpark?

How do I find my organization name on Snowflake UI?

  • Select Admin -> Accounts in Snowsight UI. The organization name is listed above the account names.
  • More on that here

Who pays for the processing time of views on a shared database?

  • When using Snowflake data sharing, the provider pays for the storage cost of the data and the consumer pays for the compute costs. That is because there is only one copy of the data, stored in the provider’s account, while the compute warehouse runs in the consumer’s account.

Is there a way to collect the SQL statement generated by Snowpark? Is the SQL generated by Snowpark optimized?

  • The Snowflake SQL engine optimizes the queries generated by Snowpark, and you can see the actual SQL text and the query execution plan for yourself.
  • For every Snowpark API you run, you can check the corresponding SQL text in the Query history tab in Snowsight UI: Activity -> Query History.
  • You can also access the query history and SQL text programmatically through the Snowpark API:
with session.query_history() as history:  # record queries issued inside this block
    df.show(20)
history.queries  # list of QueryRecord objects with query IDs and SQL text

How do you know which Snowpark API corresponds to which SQL text in Query History?

  • You can set query tags on your DataFrame API calls and search by those query tags in Query History.
  • For example:
df.describe().show(statement_params={"QUERY_TAG": "test tag"})

I’m a data scientist. Are there any specific documentation pages I should look at?

Do you have to define the table structure in Snowflake before loading the Excel file, or will Snowflake pick the most appropriate data types/column names and build a table?

  • You can let Snowpark infer the schema (column names and data types) from the file, or explicitly define the schema yourself while reading the CSV; both approaches are sketched below.
  • For example, to infer the schema:
df = session.read.option("INFER_SCHEMA", True).csv("@mystage/testCSV.csv")
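If you prefer to define the schema explicitly, here is a minimal sketch (the column names and types below are hypothetical, chosen only for illustration):

from snowflake.snowpark.types import StructType, StructField, StringType, IntegerType

# Hypothetical columns for illustration only
schema = StructType([
    StructField("ride_id", StringType()),
    StructField("start_station_id", IntegerType()),
    StructField("end_station_id", IntegerType()),
])
df = session.read.schema(schema).csv("@mystage/testCSV.csv")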

Is there a sample SnowSQL config file I can look at?
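  • For reference, here is a minimal sketch of what a ~/.snowsql/config file typically looks like; the connection name and all of the values below are placeholders you would replace with your own:

# ~/.snowsql/config (placeholder values for illustration)
[connections.my_example_connection]
accountname = myorganization-myaccount
username = jdoe
password = <your_password>
rolename = sysadmin
warehousename = compute_wh
dbname = demo_db
schemaname = public

[options]
# Print elapsed time for each query
timing = True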

How is Snowflake task scheduling different from Airflow?

  • Airflow is a popular open-source job scheduler that many customers use with Snowflake; it can schedule activities against Snowflake as well as against other systems.
  • The Snowflake Task feature is Snowflake’s native job scheduler, and allows customers to schedule activities against Snowflake.

Is the Tasks API only available in Python?

Does the Snowpark task DAG have a visualizer?

  • Yes, you can see the DAG graph under Data -> Databases -> Tasks in the Snowsight UI.
  • Task information is displayed, including Task Details, Graph, and Run History sub-tabs.

Is there a way to install custom libraries to create a UDF?

  • You can use any third-party Python libraries available in the Snowflake Anaconda channel and package them as part of your user-defined function (UDF); see the sketch after this list.
  • One of the benefits of using Snowpark for Python is the thousands of open-source packages that are pre-installed in Snowpark so developers don’t need to worry about manual installs or missing dependencies.
  • Official Docs
  • List of supported 3rd party libraries in Snowflake
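As a minimal sketch (the UDF name and package choice here are just for illustration), you can declare a third-party dependency with the packages argument when registering a UDF:

from snowflake.snowpark.functions import udf

# numpy comes from the Snowflake Anaconda channel; no manual install needed
@udf(name="double_it", packages=["numpy"], replace=True, session=session)
def double_it(x: int) -> int:
    import numpy as np
    return int(np.multiply(x, 2))

session.sql("SELECT double_it(21)").show()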

Is there a way to test Snowpark code locally?

Event-driven scheduling means that if any new record arrives in a dynamic table, I’d like to run an aggregation job based on the new data that just arrived. Any idea how to do this?

  • Currently you can schedule tasks based on a time interval only. We plan to support a true event-driven schedule in the future.
  • Meanwhile, here is a workaround that uses streams and tasks together to achieve an event-driven schedule; a sketch follows below.
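As a minimal sketch (the table, stream, task, and warehouse names below are hypothetical), you can create a stream on the source table and a task that only runs its aggregation when the stream has captured new rows:

# Capture new rows arriving in the source table
session.sql("CREATE OR REPLACE STREAM raw_trips_stream ON TABLE raw_trips").collect()

# Run the aggregation only when the stream has data
session.sql("""
    CREATE OR REPLACE TASK aggregate_trips_task
      WAREHOUSE = compute_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_TRIPS_STREAM')
    AS
      INSERT INTO trip_aggregates
      SELECT start_station_id, COUNT(*) AS trip_count
      FROM raw_trips_stream
      GROUP BY start_station_id
""").collect()

# Tasks are created in a suspended state
session.sql("ALTER TASK aggregate_trips_task RESUME").collect()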

Can we export this activity to Grafana? Can we display all activity data in Tableau reports or Grafana?

  • Yes, you can monitor Snowflake from a variety of tools, including Tableau and Grafana. Snowflake provides access to rich sets of metadata that can be used for a variety of purposes; a query sketch follows the links below.
  • Export to Grafana
  • Export to Tableau
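For example, here is a minimal sketch of pulling recent query metadata with Snowpark (assuming your role has access to the SNOWFLAKE.ACCOUNT_USAGE share), which a Grafana or Tableau dashboard could then visualize:

# Recent query metadata from the ACCOUNT_USAGE views
df = (
    session.table("snowflake.account_usage.query_history")
           .filter("start_time > dateadd('day', -7, current_timestamp())")
           .select("query_id", "warehouse_name", "total_elapsed_time", "start_time")
)
df.show()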

Is snowflake.core.task available to use?

  • Yes, you can use the snowflake.core.task APIs after installing the snowflake package, for example with pip install snowflake; a sketch follows below.
  • PyPI repo
  • Hands-on Quickstart
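Here is a minimal sketch using the snowflake.core Task API (the database, schema, warehouse, and task names are hypothetical):

from datetime import timedelta
from snowflake.core import Root
from snowflake.core.task import Task

root = Root(session)  # reuse an existing Snowpark session

# Create a task that runs a SQL statement every hour
tasks = root.databases["demo_db"].schemas["public"].tasks
tasks.create(
    Task(
        name="my_refresh_task",
        definition="SELECT 1",
        warehouse="compute_wh",
        schedule=timedelta(hours=1),
    )
)

# Tasks are created suspended; resume to start the schedule
tasks["my_refresh_task"].resume()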

Does Snowpark support Apache Spark on Snowflake cloud?

I hope this was helpful. If you have more questions, or if you run into errors while building your project following the quickstart, feel free to reach out to the Snowflake developer community on Reddit or Stack Overflow, or comment on this post. Happy to help!

Thanks for Reading!

If you like my work and want to support me…

  1. The BEST way to support me is by following me on Medium.
  2. For data engineering best practices, and Python tips for beginners, follow me on LinkedIn.
  3. Feel free to give claps so I know how helpful this post was for you.
