Intro to Data Engineering with Snowpark for Python: Hands-on lab & FAQs
This week Jeremiah Hansen and I presented a hands-on lab on how to use Snowpark for data engineering use cases. A lot of you joined, and together we built an end-to-end data pipeline to analyze NYC Citi Bike data.
In case you missed it:
- You can watch the recording.
- You can build at your own pace by following this step-by-step quickstart.
We also had some great questions from practitioners, and we want all Snowflake developers to benefit from them. So here we go!
Can you run Scala Spark code on Snowpark?
- You can run Scala code in Snowpark. Snowpark supports Python, Scala, and Java APIs. If you are looking to convert existing Spark code to Snowpark, check out the Snowpark Migration Accelerator as well.
- More on Snowpark Scala
- Snowpark Migration Accelerator
- Snowpark Scala Official Docs
How do I find my organization name on Snowflake UI?
- Select Admin -> Accounts in the Snowsight UI. The organization name is listed above the account names. You can also look it up programmatically; see the sketch after this list.
- More on that here
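As a quick programmatic check (a minimal sketch, assuming you already have a Snowpark session), the CURRENT_ORGANIZATION_NAME context function returns the same value:

# Returns the organization name for the current account
session.sql("SELECT CURRENT_ORGANIZATION_NAME()").show()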
Who pays for the processing time of views on a shared database?
- When using Snowflake data sharing, the provider pays for the storage cost of the data and the consumer pays for the compute cost. That is because there is only one copy of the data, stored in the provider’s account, while the compute warehouse runs in the consumer’s account.
Is there a way to collect the SQL statement generated by Snowpark? Is the SQL generated by Snowpark optimized?
- The Snowflake SQL engine optimizes the generated SQL queries, and you can see the actual SQL text and the query execution plan for yourself.
- For every Snowpark API you run, you can check the corresponding SQL text in the Query history tab in Snowsight UI: Activity -> Query History.
- You can also access the query history and SQL text programmatically from the Snowpark API:
with session.query_history() as history:
    df.show(20)
print(history.queries)
How do you know which Snowpark API corresponds to which SQL text in Query History?
- You can set query tags on your DataFrame API calls and then search by those tags in Query History; a sketch of looking them up programmatically follows the example below.
- For example:
df.describe().show(statement_params={"QUERY_TAG": "test tag"})
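As a rough sketch (reusing the "test tag" tag from the example above), you could also look up the tagged queries through the INFORMATION_SCHEMA.QUERY_HISTORY table function; this query is illustrative, not the only way to do it:

tagged = session.sql("""
    SELECT query_id, query_text
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    WHERE query_tag = 'test tag'
""")
tagged.show()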
I’m a data scientist. Are there any specific documentation pages I should look at?
Do you have to define the table structure in Snowflake before loading the Excel file, or will Snowflake pick the most appropriate data types/column names and build a table?
- You can let Snowpark infer the schema (column names and data types) from the file, or you can define it explicitly when reading the CSV; a sketch of the explicit-schema variant follows the example below.
- For example:
df = session.read.option("INFER_SCHEMA", True).csv("@mystage/testCSV.csv")
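If you prefer to define the schema yourself, a minimal sketch could look like this (the stage path and column names here are hypothetical):

from snowflake.snowpark.types import StructType, StructField, StringType, IntegerType

# Explicitly declare the columns instead of relying on schema inference
schema = StructType([
    StructField("STATION_NAME", StringType()),
    StructField("TRIP_COUNT", IntegerType()),
])
df = session.read.schema(schema).csv("@mystage/testCSV.csv")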
Any sample SnowSQL config file I can look at?
How is Snowflake task scheduling different from Airflow?
- Airflow is a popular open source job scheduler that many customers use with Snowflake. Airflow allows customers to schedule activities against Snowflake as well as against other systems.
- The Snowflake Task feature is Snowflake’s native job scheduler, and allows customers to schedule activities against Snowflake.
Is the Tasks API only available in Python?
- You can create Snowflake Tasks using SQL DDL commands, or, if you are a Python developer, you can use the Python Tasks API as well (see the sketch after this list).
- Create Snowflake Tasks using SQL
- Snowflake Python Tasks API
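As a rough sketch of the Python route (the database, schema, task name, and SQL definition below are all hypothetical), creating a scheduled task could look like this:

from datetime import timedelta
from snowflake.core import Root
from snowflake.core.task import Task

root = Root(session)  # session is an existing Snowpark session
tasks = root.databases["my_db"].schemas["my_schema"].tasks

# Define a task that runs a SQL statement every hour
my_task = Task(name="my_hourly_task", definition="SELECT 1", schedule=timedelta(hours=1))
tasks.create(my_task)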
Does Snowpark DAG have a visualizer?
- Yes, you can see the DAG graph under Data -> Databases -> Tasks in the Snowsight UI.
- Task information is displayed, including Task Details, Graph, and Run History sub-tabs.
Is there a way to install custom libraries to create a UDF?
- You can use any 3rd party Python libraries available in the Snowflake Anaconda channel and package them as part of your user-defined function (UDF); see the sketch after this list.
- One of the benefits of using Snowpark for Python is the thousands of open-source packages that are pre-installed in Snowpark so developers don’t need to worry about manual installs or missing dependencies.
- Official Docs
- List of supported 3rd party libraries in Snowflake
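As a minimal sketch (the function name and the choice of numpy are just for illustration), a UDF that pulls a package from the Anaconda channel could look like this:

from snowflake.snowpark.functions import udf

@udf(name="array_mean", replace=True, packages=["numpy"])
def array_mean(values: list) -> float:
    import numpy as np  # resolved from the Snowflake Anaconda channel at runtime
    return float(np.mean(values))

# Example usage from SQL in the same session:
# session.sql("SELECT array_mean([1, 2, 3])").show()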
Is there a way to test Snowpark code locally?
- Snowpark now supports local testing of your transformation logic/code using pytest (see the sketch after this list).
- Official docs
- Hands-on blog to start testing locally
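A minimal pytest sketch: the double_amount function here stands in for whatever transformation your own pipeline defines, and local testing mode is enabled through the session config, so no Snowflake account connection is made:

import pytest
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

def double_amount(df):
    # The transformation under test (stand-in for your real pipeline code)
    return df.with_column("AMOUNT", col("AMOUNT") * 2)

@pytest.fixture
def session():
    # Local testing mode: runs entirely in-process
    return Session.builder.config("local_testing", True).create()

def test_double_amount(session):
    df = session.create_dataframe([[1], [2]], schema=["AMOUNT"])
    result = double_amount(df).collect()
    assert [row["AMOUNT"] for row in result] == [2, 4]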
Event-driven scheduling: if a new record arrives in a dynamic table, I’d like to run an aggregation job on the newly arrived data. Any idea how to do this?
- Currently, you can only schedule tasks on a time interval. We plan to support true event-driven scheduling in the future.
- Meanwhile, here is a workaround that combines streams and tasks to approximate an event-driven schedule; a sketch follows.
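As a rough sketch of the streams-plus-tasks workaround (the table, stream, task, and warehouse names below are all hypothetical), the task checks on a schedule but only does work when the stream has captured new rows:

# Track new rows arriving in the source table
session.sql("CREATE OR REPLACE STREAM raw_trips_stream ON TABLE raw_trips").collect()

# Run the aggregation only when the stream has data
session.sql("""
    CREATE OR REPLACE TASK aggregate_trips_task
      WAREHOUSE = my_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_TRIPS_STREAM')
    AS
      INSERT INTO trips_agg
      SELECT start_station_id, COUNT(*) AS trips
      FROM raw_trips_stream
      GROUP BY start_station_id
""").collect()

session.sql("ALTER TASK aggregate_trips_task RESUME").collect()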
Can we export this activity to Grafana? Can we display all activity data in tableau reports or Grafana?
- Yes, you can monitor Snowflake from a variety of tools, including Tableau and Grafana. Snowflake provides access to rich sets of metadata that can be used for a variety of purposes; a sketch of pulling some of that metadata follows this list.
- Export to Grafana
- Export to Tableau
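As one illustration of the metadata you could feed into such a dashboard, the SNOWFLAKE.ACCOUNT_USAGE share exposes views like WAREHOUSE_METERING_HISTORY; the aggregation below is just an example, not a prescribed report:

metrics = session.sql("""
    SELECT warehouse_name,
           DATE_TRUNC('hour', start_time) AS usage_hour,
           SUM(credits_used) AS credits
    FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
metrics.show()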
Is snowflake.core.task available to use?
- Yes, you can use the snowflake.core.task APIs after installing the snowflake package, for example with pip install snowflake.
- PyPI repo
- Hands-on Quickstart
Does Snowpark support Apache Spark on Snowflake cloud?
- You can use the Snowpark Python APIs, which are very similar to the PySpark APIs, for any transformations you need (a brief comparison is sketched below).
- If you need help converting your existing PySpark code to Snowpark so it runs on Snowflake, check out the Snowpark Migration Accelerator.
- Spark to Snowpark Conversion
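As a small, illustrative comparison (the table and column names here are hypothetical), the same transformation looks nearly identical in both APIs; mainly the import path and the snake_case method names change on the Snowpark side:

# PySpark:
#   spark.table("trips").filter(col("duration") > 60).groupBy("start_station").count()

# Snowpark:
from snowflake.snowpark.functions import col

df = (
    session.table("trips")
    .filter(col("DURATION") > 60)
    .group_by("START_STATION")
    .count()
)
df.show()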
I hope this was helpful. If you have more questions, or if you run into errors while building your project following the quickstart, feel free to reach out to the Snowflake developer community on Reddit or Stack Overflow, or comment on this post. Happy to help!