Boosting Data Team Productivity with Ascend and Snowpark for Python

At Summit 2022, Snowflake announced the public preview of Snowpark for Python, following an extensive private preview phase that included Ascend.io. With more than 90% of Ascend.io customers already using Python, it’s in the interest of our mutual customers to integrate this new capability into the Ascend experience and empower users with the best that Snowflake has to offer.

In this blog, we will step through a tutorial that shows what makes the partnership between Snowflake and Ascend such an advantage for data teams using Snowpark for Python. The tutorial also highlights key collaboration features unique to Ascend that let all data experts, from data analysts to engineers to scientists, link up their pipeline logic.

Quick Introductions

What is Snowpark?

Snowpark is a library with an API that developers can use for querying and processing data in the Snowflake Data Cloud. Snowpark’s key benefit is that it enables data teams to code in languages other than SQL — such as Scala, Java, and now Python — without moving data out of Snowflake and, therefore, take full advantage of its powerful capabilities through code.
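
As a quick illustration of what this looks like in practice, here is a minimal sketch that connects to Snowflake and runs a pushed-down DataFrame query with Snowpark for Python. The connection parameters and the table and column names are placeholders, not values from this tutorial.

from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

# Placeholder connection parameters; substitute your own account details.
session = Session.builder.configs({
    "account": "<your_account>",
    "user": "<your_user>",
    "password": "<your_password>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}).create()

# The filter and count are pushed down and executed inside Snowflake;
# no data leaves the Data Cloud.
df = session.table("MY_TABLE")
print(df.filter(F.col("AMOUNT") > 0).count())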

What is Ascend?

Ascend is a data automation platform that boosts productivity by unifying ingestion, transformation, orchestration, and delivery of data for any business use case. Data and analytics engineering are mission-critical processes that are hampered by loosely connected tools, and are ripe for consolidation. Ascend’s seamless automation spans all the steps of building and operating workflows end-to-end, and is available on Snowflake in AWS, GCP, and Azure.

Better Together: Snowflake, Python, and Ascend

The combination of Snowflake and Ascend.io has resulted in an integrated, high-performance platform to automate your data processing workflows, regardless of scale or complexity. When you use native Snowpark for Python libraries within the full Ascend automation context, the resulting transformations are seamlessly orchestrated by Ascend, which assures that your Snowflake-based data pipelines benefit from free ingestion and validation, guaranteed sequencing and lineage, deep job monitoring, and powerful pause/resume functions that save headaches, time, and money.

Ascend also augments Snowflake’s own data-sharing capabilities to bring the efforts of individual developers together like no other platform can. With Ascend’s unique pipeline sharing controls, developers can link the processing logic and orchestration of each others’ pipelines across teams, assuring end-to-end pipeline integrity with no additional code.

In today’s tutorial, one developer is responsible for creating and publishing clearly defined weather data, while an analytics team needs this weather data as input to perform analytics on taxi cab rides. Since both use Ascend to automate their data pipelines, the taxi analytics team can subscribe to the weather data pipeline with a simple mouse click. Not only does one receive the data from the other, but Ascend also assures that the two are orchestrated together, data updates cascade correctly through both, logic changes are propagated end-to-end, and error notifications reach all affected parties.

Here is the sequence of our tutorial today: a quick setup, followed by four steps in which we augment the raw taxi cab data with a simple data pipeline, share cleaned weather data between teams, connect to that weather data from the taxi pipeline, and analyze taxi cab rides in rainy weather.

Quick Setup

For those who play along:

  1. Let’s make sure you have a working Snowflake account and can log in with user credentials. Have your credentials ready for a later step. If you don’t have a Snowflake account, you can create a free trial account by signing up here. Then follow these instructions to properly provision your Snowflake account for Ascend to be able to interoperate fully and automate your Snowflake workloads on your behalf.
  2. This tutorial is based on the Ascend Cloud Developer tier, for which you can create a free account here. When you log into a new account, a brief onboarding process will automatically prompt you to enter your Snowflake credentials and connect your Snowflake account to your Ascend account. Configuring this connection is similar to the process described in these instructions. The onboarding process includes linking to the Snowflake tables containing your data, similar to loading external data described here.
  3. Go ahead and load the data for this tutorial into your Snowflake database directly as described by Snowflake here, or from your own storage, such as Amazon S3, as described by Ascend here. In our example, the taxi analytics team uses the public taxi cab ride datasets published by the New York Taxi & Limousine Commission at this link, and loads them into a table called “CLEAN_GREEN_CAB.” The weather data developer uses data published by the NOAA at this link, and loads it into a table called “Raw Sensor Data.” (A scripted sketch of this load step follows the list.)
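
If you prefer to script the load step rather than use the UIs, a rough Snowpark for Python sketch might look like the following, reusing the session from the earlier sketch. The stage name, file path, file format options, and column list are placeholders, not the actual schemas of the tutorial datasets.

# Hedged sketch of loading a CSV from an external stage into the taxi table.
# Stage name, file path, and columns are hypothetical placeholders.
session.sql("""
    CREATE OR REPLACE TABLE CLEAN_GREEN_CAB (
        PICKUP_DATETIME STRING,
        PASSENGER_COUNT INTEGER,
        FARE_AMOUNT FLOAT,
        TIP_AMOUNT FLOAT
    )
""").collect()

session.sql("""
    COPY INTO CLEAN_GREEN_CAB
    FROM @my_s3_stage/green_tripdata/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""").collect()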

Let’s get going!

Step 1: Augmenting Raw Taxi Cab Data With A Simple Data Pipeline

Let’s begin our example with the taxi analytics team, and build a simple data pipeline to prepare the taxi cab data for the analytics. We’ll start the pipeline with the initial dataset containing 4.68M rows. In Ascend, each data pipeline is called a Dataflow, and when displayed on the UI, the initial taxi analytics data pipeline looks something like this:

A simple dataflow viewed in the Ascend UI

In our example, the “CLEAN_GREEN_CAB” component on the UI is a Read Connector that represents the linked table with the initial datasets, for which a sample can be inspected directly in the Ascend UI.

Sample records of the taxi cab data viewed in the Ascend UI

The first processing step is a Transform component, where you can use Snowpark for Python to clean the data. You can add this component by selecting “Snowpark Python” from the left-side panel of the canvas, or by clicking the three vertical dots on the top right corner of the “CLEAN_GREEN_CAB” Read Connector.

Adding processing steps to a pipeline in the Ascend UI

This action lands you in the configuration screen to create this new component, prepopulated with “CLEAN_GREEN_CAB” as an input. In our example, we name the component “Augmented_CLEAN_GREEN_CAB”, and since we won’t need to partition the data, we select “Reduce all partitions to One” as the partition strategy. This simply instructs Ascend to generate a single physical table for the results of the transformation step.

Unlike other platforms that only provide their own generalized ETL dialect, specifying the component as a “Snowpark” transform unlocks native Snowpark for Python and all of its features. In our simple example, we add a new column called “pickup_date” by converting “pickup_datetime” from string to a date data type, and we add a new column named “cab_type” with “Super Green” as the value.

from typing import List
from snowflake.snowpark import Session, DataFrame
from snowflake.snowpark import functions as F

def transform(session: Session, inputs: List[DataFrame]) -> DataFrame:
    # The first (and only) input is the CLEAN_GREEN_CAB Read Connector.
    table = inputs[0]
    # Tag every ride as "Super Green" and derive a date column from the pickup timestamp.
    return table.with_column("cab_type", F.lit("Super Green")) \
                .with_column("pickup_date", F.date_trunc("day", F.to_timestamp("PICKUP_DATETIME")))

The next screenshot shows the actual configuration screen for our example.

The configuration panel of a Snowpark for Python Transform in the Ascend UI

The configuration screen also shows a “Component Pausing” option below the code (not visible in this screenshot). Should you want to keep the automation from immediately processing the code when you click the “CREATE” button, set this option to “Paused.” With the option set to “Running,” the Ascend automation immediately incorporates your code into the pipeline and creates the incremental Snowflake queries that are needed to update it. Ascend automation seamlessly issues these queries to Snowflake to create a new table and load its records, monitor the process, and update the UI to the “UP TO DATE” state once execution is finished.

The result is the simple, two-step pipeline we saw earlier:

The first Transform in the taxi analytics dataflow viewed in the Ascend UI

Click on the middle of the new Transform component to get a popup, then select the “Records” tab to see the new columns and sample data from the new underlying Snowflake table:

Validating a Transform in the Ascend UI

Since all of the data for Transform operations stays in Snowflake, you can also directly query the table in Snowsight to verify that it matches the Ascend view:

Validating data in the taxi analytics dataflow viewed in the Snowsight UI
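
If you would rather script this spot check than open Snowsight, a small Snowpark sketch like the one below works as well. The table name is illustrative; use whatever table Ascend created for your Transform.

# Spot-check the Transform output directly from Snowpark.
# "AUGMENTED_CLEAN_GREEN_CAB" is a stand-in for the table Ascend actually created.
session.sql("""
    SELECT CAB_TYPE, PICKUP_DATE, COUNT(*) AS RIDES
    FROM AUGMENTED_CLEAN_GREEN_CAB
    GROUP BY CAB_TYPE, PICKUP_DATE
    ORDER BY PICKUP_DATE
    LIMIT 10
""").show()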

It is that easy! You have successfully created your first data pipeline with Snowpark for Python in Ascend. With the components set to “Running,” Ascend’s built-in orchestration constantly monitors the source tables for any new data arrivals, and automatically processes them through the pipelines with no further action required from you.

Step 2: Sharing Data in Ascend Between Teams

Now let’s get into how multiple developers and data teams use Ascend to connect their data pipelines.

In our example, the weather data developer is responsible for cleaning up the raw weather data from NOAA, and then sharing it with other analytics teams, including the taxi analytics team. The weather data developer must share data in a way that

  1. provides controls over who gets access to the data,
  2. is always updated with new data arriving,
  3. is always updated per new data cleansing rules, and
  4. is easily discoverable by other data teams.

The weather data developer is also required to

  1. report on which teams are using the cleaned weather data,
  2. notify them of any errors,
  3. version all the cleansing logic in the central code repository, and
  4. resume data updates as soon as logic changes are applied, without generating any duplicate records.

Good thing they’re using Ascend, which automates every one of these requirements!

Cleaning Up and Sharing Weather Data

To get started, our weather data developer creates a separate Dataflow in Ascend and encodes the cleansing logic in a Transform called “Max Sensor Reading By Day” using Snowflake SQL and a configuration similar to the one used for the taxi cab data earlier in this example. They then terminate the Dataflow with a special Data Feed component, so that it looks like this:

The weather data Dataflow viewed in the Ascend UI
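
The SQL itself is not reproduced here, but as a rough sketch, an equivalent “max sensor reading by day” aggregation in Snowpark for Python might look like the following, using the same imports as the Transform shown earlier. The column names SENSOR_ID, READING, and OBSERVATION_TS are assumptions, not the actual schema of the NOAA data.

def transform(session: Session, inputs: List[DataFrame]) -> DataFrame:
    # inputs[0] is the "Raw Sensor Data" table; column names below are assumed.
    raw = inputs[0]
    return raw.with_column("reading_date", F.to_date(F.col("OBSERVATION_TS"))) \
              .group_by("SENSOR_ID", "reading_date") \
              .agg(F.max("READING").alias("max_reading"))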

The Data Feed component is for sharing the outputs of the data pipeline. To add that component, click the “Create Data Feed” button on the left-side panel of the UI, or click on the dots on the top right corner of the final Transform component, like this:

Creating a Data Feed sharing component in the Ascend UI

This action lands you in the configuration screen for this new component, prepopulated with “Max Sensor Reading By Day” as an input. Here you can name the Data Feed with an intuitive name that other teams will recognize, configure the access you would like others to have, and click “Create”. With that, the platform instantly publishes the Data Feed to the Ascend Data Feed Catalog, where other teams or users can discover it and subscribe to it to use as a source in their own data pipelines.

Configuring a Data Feed sharing component in the Ascend UI

That is all that is needed to share datasets between Dataflows! Ascend automatically fulfills all the requirements that were laid out for our thrifty weather data developer, so they can move on to the next project and deliver new business value!

Step 3: Connecting to Weather Data for Taxi Cab Ride Analysis

Let’s switch back to the taxi analytics data pipeline that the taxi analytics team was building, and use the cleaned weather data in the taxi cab ride analysis. We do this by subscribing to the weather Data Feed from inside the taxi cab data Dataflow. To find the weather data, click “Connect to Data Feed” on the left-side panel to access the Ascend Data Feed Catalog. You can search or browse the catalog, and if the weather data developer configured it with the right permissions for you, you can connect to the weather Data Feed. Note the Subscribe button, as well as the visible metadata that confirms this is the right dataset for your analytics:

Subscribing to a Data Feed in the Ascend UI

The Data Feed is a virtual construct that connects the automation logic of the two data pipelines and points to the actual table from the weather data pipeline in Snowflake. It does not make a physical copy or replicate it into another table. This way you are guaranteed to always have the latest data in your data pipeline, running in the right order with all the other linked data pipelines. Ascend automation handles all this coordination behind the scenes.

Step 4: Analyzing Taxi Cab Rides in Rainy Weather

Let’s wrap up our analysis of the impact of rainy weather on taxi cab ride tips in two more steps. We could probably have used Snowflake SQL for this, but let’s keep the spotlight on Snowpark for Python to join the “Augmented_CLEAN_GREEN_CAB” dataset with the clean weather Data Feed.

Here is the Transform with the code that filters out the taxi cab rides on rainy days:

Using Snowpark for Python to join a Transform and a Data Feed in Ascend
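
For reference, a minimal sketch of what such a join might look like in Snowpark for Python appears below. The second input is the subscribed weather Data Feed, and the weather column names PRECIPITATION and READING_DATE are assumptions rather than the feed’s actual schema.

def transform(session: Session, inputs: List[DataFrame]) -> DataFrame:
    rides = inputs[0]    # Augmented_CLEAN_GREEN_CAB
    weather = inputs[1]  # the subscribed weather Data Feed
    # Keep only days with recorded precipitation, then join on the ride's pickup date.
    rainy_days = weather.filter(F.col("PRECIPITATION") > 0)
    return rides.join(rainy_days, rides["pickup_date"] == rainy_days["READING_DATE"])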

The last step in the sequence is a Transform that analyzes the tip percentage by rider count, for these taxi cab rides on rainy days:

Using Snowpark for Python to add a filter to a data pipeline in Ascend
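
Again as a sketch only, the final aggregation might look something like this in Snowpark for Python, where TIP_AMOUNT, FARE_AMOUNT, and PASSENGER_COUNT are assumed column names from the taxi dataset.

def transform(session: Session, inputs: List[DataFrame]) -> DataFrame:
    # inputs[0] is the output of "Taxi Rides on Rainy Days"; column names are assumed.
    rainy_rides = inputs[0]
    return rainy_rides.filter(F.col("FARE_AMOUNT") > 0) \
                      .with_column("tip_pct", F.col("TIP_AMOUNT") / F.col("FARE_AMOUNT") * 100) \
                      .group_by("PASSENGER_COUNT") \
                      .agg(F.avg("tip_pct").alias("avg_tip_pct")) \
                      .sort("PASSENGER_COUNT")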

Using the records preview in the configuration panel for this final Transform, the data looks like this:

Record view of the output of the tip % analysis in the Ascend UI

Return to the overview of your taxi analytics data pipeline by closing the configuration panel, and the Ascend Dataflow UI looks like this:

Dataflow with rainy day taxi cab ride tip analysis pipeline in the Ascend UI

Note the record counts for each of the components as the data is processed through the pipeline. Any new data arriving in the “CLEAN_GREEN_CAB” data source in your data pipeline will automatically be processed incrementally, and increase the record counts handled by those components.

Similarly, any new data arriving in the “Raw Sensor Data” data source in the weather data pipeline will be processed there, then trigger the “Taxi Rides on Rainy Days” Transform in your Dataflow through the Data Feed. However, it will NOT cause a reprocessing of the taxi cab data, because Ascend only creates Snowflake jobs to handle changes and avoids reprocessing wherever possible. In addition, if the result of “Taxi Rides on Rainy Days” is the same as before, it will stop the sequence and NOT rerun “Tip % by Rider Count.” In our example, this could happen if weather data arrived for yesterday, but no taxi cab ride data for yesterday has arrived yet. The join would yield no new data for yesterday, so no downstream update is needed.

So what does all this look like in Snowflake? Behind the scenes, Ascend has created the objects in Snowflake and is constantly pushing the right jobs in the right order so that Snowflake updates the records correctly. Your Snowflake console view might look something like this:

Ascend-generated objects in the Snowflake UI
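
If you prefer to confirm this programmatically rather than in the console, a small sketch using the Snowpark session from earlier works too; the database and schema names are placeholders for the ones you connected to Ascend.

# List the tables Ascend created in the schema it manages (names are placeholders).
session.sql("SHOW TABLES IN SCHEMA MY_DATABASE.MY_ASCEND_SCHEMA").show()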

You’ve just completed a data pipeline in Ascend, and Ascend is running it correctly for all new data arrivals and through all the dependencies!

Final Thoughts on Snowpark for Python and Ascend

All the data engineers, analytics engineers, developers, data analysts, and data scientists in your organization can now harness the full power of Snowpark for Python in the context of Ascend’s unified, fully integrated data automation platform.

The combination of Ascend automation and Snowpark for Python avoids many seams in technology and data that result from piecemeal data pipeline architectures. Instead of writing lots of custom code papering over the gaps, you leapfrog into deep data workload automation with out-of-the-box functionality. Your developers and data teams can not only share data in Snowflake where it resides, but also collaborate easily on organizing and sequencing their pipeline logic along business principles instead of IT silos or technology boundaries. Behind the scenes, Ascend orchestrates the workloads within Snowflake, all while increasing security, governance, and observability.
