Intro to Data Engineering with Snowpark for Python: Hands-on lab & FAQs

Intro to Data Engineering with Snowpark for Python: NYC Citi Bike analysis

This week Jeremiah Hansen and I presented a hands-on lab on how to use Snowpark for data engineering use cases. Many of you joined, and together we built an end-to-end data pipeline to analyze NYC Citi Bike data.

In case you missed it:

We also got some great questions from practitioners, and we want all Snowflake developers to benefit from them. So here we go!

Can you run Scala Spark code on Snowpark?

How do I find my organization name on Snowflake UI?

  • Select Admin -> Accounts in Snowsight UI. The organization name is listed above the account names.
  • More on that here

Who pays for the processing time of views on a shared database?

  • When using Snowflake data sharing, the provider pays for the storage cost of the data and the consumer pays for the compute costs. That is because there is only one copy of the data, stored in the provider’s account, while the compute warehouse runs in the consumer’s account.

Is there a way to collect the SQL statement generated by Snowpark? Is the SQL generated by Snowpark optimized?

  • The Snowflake SQL engine optimizes the queries generated by Snowpark, and you can see the actual SQL text and the query execution plan for yourself.
  • For every Snowpark API you run, you can check the corresponding SQL text in the Query history tab in Snowsight UI: Activity -> Query History.
  • You can also access the query history and SQL text programmatically through the Snowpark API:
with session.query_history() as history:  # record queries issued inside this block
    df.show(20)
history.queries  # list of QueryRecord objects with query IDs and SQL text

How do you know which Snowpark API corresponds to which SQL text in Query History?

  • You can set query tags on your DataFrame API calls and search by those query tags in Query History.
  • For example:
df.describe().show(statement_params={"QUERY_TAG": "test tag"})

I’m a data scientist. Are there any specific documentation pages I should look at?

Do you have to define the table structure in Snowflake before loading the Excel file, or will Snowflake pick the most appropriate data types/column names and build a table?

  • You can let Snowpark infer the schema (column names and data types) from the file, or explicitly define the schema yourself while reading the CSV; both approaches are sketched below.
  • For example, to infer the schema:
df = session.read.option("INFER_SCHEMA", True).csv("@mystage/testCSV.csv")
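If you prefer to define the schema explicitly, here is a minimal sketch (the column names and types below are hypothetical, chosen only for illustration):

from snowflake.snowpark.types import StructType, StructField, StringType, IntegerType

# Hypothetical columns for illustration only
schema = StructType([
    StructField("ride_id", StringType()),
    StructField("start_station_id", IntegerType()),
    StructField("end_station_id", IntegerType()),
])
df = session.read.schema(schema).csv("@mystage/testCSV.csv")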

Is there a sample SnowSQL config file I can look at?
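  • For reference, here is a minimal sketch of what a ~/.snowsql/config file typically looks like; the connection name and all of the values below are placeholders you would replace with your own:

# ~/.snowsql/config (placeholder values for illustration)
[connections.my_example_connection]
accountname = myorganization-myaccount
username = jdoe
password = <your_password>
rolename = sysadmin
warehousename = compute_wh
dbname = demo_db
schemaname = public

[options]
# Print elapsed time for each query
timing = True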

How is Snowflake task scheduling different from Airflow?

  • Airflow is a popular open-source job scheduler that many customers use with Snowflake; it can schedule activities against Snowflake as well as against other systems.
  • The Snowflake Task feature is Snowflake’s native job scheduler, and allows customers to schedule activities against Snowflake.

Is the Tasks API only available in Python?

Does the Snowpark task DAG have a visualizer?

  • Yes, you can see the DAG graph under Data -> Databases -> Tasks in the Snowsight UI.
  • Task information is displayed, including Task Details, Graph, and Run History sub-tabs.

Is there a way to install custom libraries to create a UDF?

  • You can use any third-party Python libraries available in the Snowflake Anaconda channel and package them as part of your user-defined function (UDF); see the sketch after this list.
  • One of the benefits of using Snowpark for Python is the thousands of open-source packages that are pre-installed in Snowpark so developers don’t need to worry about manual installs or missing dependencies.
  • Official Docs
  • List of supported 3rd party libraries in Snowflake
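As a minimal sketch (the UDF name and package choice here are just for illustration), you can declare a third-party dependency with the packages argument when registering a UDF:

from snowflake.snowpark.functions import udf

# numpy comes from the Snowflake Anaconda channel; no manual install needed
@udf(name="double_it", packages=["numpy"], replace=True, session=session)
def double_it(x: int) -> int:
    import numpy as np
    return int(np.multiply(x, 2))

session.sql("SELECT double_it(21)").show()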

Is there a way to test Snowpark code locally?

Event-driven scheduling means that if any new record arrives in a dynamic table, I’d like to run an aggregation job based on the new data that just arrived. Any idea how to do this?

  • Currently you can schedule tasks based on a time interval only. We plan to support a true event-driven schedule in the future.
  • Meanwhile, here is a workaround that uses streams and tasks together to achieve an event-driven schedule; a sketch follows below.
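As a minimal sketch (the table, stream, task, and warehouse names below are hypothetical), you can create a stream on the source table and a task that only runs its aggregation when the stream has captured new rows:

# Capture new rows arriving in the source table
session.sql("CREATE OR REPLACE STREAM raw_trips_stream ON TABLE raw_trips").collect()

# Run the aggregation only when the stream has data
session.sql("""
    CREATE OR REPLACE TASK aggregate_trips_task
      WAREHOUSE = compute_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_TRIPS_STREAM')
    AS
      INSERT INTO trip_aggregates
      SELECT start_station_id, COUNT(*) AS trip_count
      FROM raw_trips_stream
      GROUP BY start_station_id
""").collect()

# Tasks are created in a suspended state
session.sql("ALTER TASK aggregate_trips_task RESUME").collect()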

Can we export this activity to Grafana? Can we display all activity data in Tableau reports or Grafana?

  • Yes, you can monitor Snowflake from a variety of tools, including Tableau and Grafana. Snowflake provides access to rich sets of metadata that can be used for a variety of purposes; a query sketch follows the links below.
  • Export to Grafana
  • Export to Tableau
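For example, here is a minimal sketch of pulling recent query metadata with Snowpark (assuming your role has access to the SNOWFLAKE.ACCOUNT_USAGE share), which a Grafana or Tableau dashboard could then visualize:

# Recent query metadata from the ACCOUNT_USAGE views
df = (
    session.table("snowflake.account_usage.query_history")
           .filter("start_time > dateadd('day', -7, current_timestamp())")
           .select("query_id", "warehouse_name", "total_elapsed_time", "start_time")
)
df.show()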

Is snowflake.core.task available to use?

  • Yes, you can use the snowflake.core.task APIs after installing the snowflake package, for example with pip install snowflake; a sketch follows below.
  • PyPI repo
  • Hands-on Quickstart
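Here is a minimal sketch using the snowflake.core Task API (the database, schema, warehouse, and task names are hypothetical):

from datetime import timedelta
from snowflake.core import Root
from snowflake.core.task import Task

root = Root(session)  # reuse an existing Snowpark session

# Create a task that runs a SQL statement every hour
tasks = root.databases["demo_db"].schemas["public"].tasks
tasks.create(
    Task(
        name="my_refresh_task",
        definition="SELECT 1",
        warehouse="compute_wh",
        schedule=timedelta(hours=1),
    )
)

# Tasks are created suspended; resume to start the schedule
tasks["my_refresh_task"].resume()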

Does Snowpark support Apache Spark on Snowflake cloud?

I hope this was helpful. If you have more questions, or if you run into errors while building your project following the quickstart, feel free to reach out to the Snowflake developer community on Reddit or Stack Overflow, or comment on this post. Happy to help!

Thanks for Reading!

If you like my work and want to support me…

  1. The BEST way to support me is by following me on Medium.
  2. For data engineering best practices, and Python tips for beginners, follow me on LinkedIn.
  3. Feel free to give claps so I know how helpful this post was for you.
