Improve data scientist productivity by 60% or more, with a little help from Snowflake

Source: Forbes https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/

Clean data is the foundation for accurate and meaningful data analysis. Data engineering teams are experts in cleaning, transforming, and preparing data for analysis. By utilizing their skills, data scientists can save valuable time and resources that can be better spent on actual analysis and driving business insights.

But where should this clean data be located and analyzed? Look no further than the data warehouse. Modern data warehouses provide a centralized location for storing and analyzing large amounts of data. By analyzing data directly in the warehouse, data scientists can easily access the data they need without time-consuming and potentially insecure data transfers.

“I need to use Python, I can’t pre-process the way I want in the warehouse”

You can! Pre-process your data using the packages you already know: Snowpark for Python includes Anaconda-provided packages, 1692 of ~7000 packages, which you can find here.

For example, you can use scikit-learn's RobustScaler class to scale numerical input features around the centre of the data using statistics that are robust to outliers, so extreme values don't dominate model training. As always, be mindful of the bias this can introduce.
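Here's a minimal sketch of that pre-processing step, assuming a pandas DataFrame of numerical features (the column names and values are purely illustrative):

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Illustrative feature table -- in practice this would come from your Snowflake data
df = pd.DataFrame({
    "order_value": [12.0, 15.5, 14.2, 980.0, 13.8],  # note the outlier
    "items":       [1, 2, 2, 40, 1],
})

# RobustScaler centres on the median and scales by the interquartile range,
# so the outlier influences the scaling far less than with standard scaling
scaler = RobustScaler()
scaled = scaler.fit_transform(df[["order_value", "items"]])

print(scaled)
```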

What about my precious notebooks?

Notebooks are not production-ready: code runs in separate blocks, which makes it much more difficult to identify bugs, especially when the data is changing. It can be difficult to stay organized when working with potentially disconnected code blocks, and Python notebooks typically don't integrate well with other tools, whereas a simple Python script can go a long way. Reproducibility is key: being able to run a script from start to finish in one go, changing only the parameters or data inputs.
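As a rough illustration, a reproducible script exposes its inputs as parameters instead of scattering them across cells; the file paths, column name, and threshold below are hypothetical:

```python
import argparse
import pandas as pd

def run(input_path: str, min_order_value: float) -> None:
    # The whole pipeline runs top to bottom in one go: read, clean, write
    df = pd.read_csv(input_path)
    cleaned = df[df["order_value"] >= min_order_value]  # hypothetical column
    cleaned.to_csv("cleaned_orders.csv", index=False)
    print(f"Wrote {len(cleaned)} rows")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reproducible cleaning step")
    parser.add_argument("--input", required=True, help="Path to the raw CSV")
    parser.add_argument("--min-order-value", type=float, default=0.0)
    args = parser.parse_args()
    run(args.input, args.min_order_value)
```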

Notebooks aren’t evil, they are for me to test things in, a safe space where I can break things, where I can walk before I can run. To date, I don’t think I’ve sent one of my Python Notebooks to a colleague — I might keep it that way. It might be something to do with the quality of my code 🙈.

Why should I change my way of working?

Increase the reach of your analysis. When data leaves the warehouse, it leaves a governed, secure environment that many users already have access to, and it becomes harder to protect. Meanwhile, few colleagues have access to whatever notebook tool you use, your desktop, or Amazon SageMaker if that's your platform of choice. Work that is visible only to a few individuals or specialized systems is dangerously less valuable to the broader business, so increasing the accessibility of your analysis is what makes it useful to all.

A data scientist who can share insights further across the business is a more valuable data scientist.

How do I increase my productivity?

Use the data as it is, directly where it lives. Snowpark brings external languages like Python, Java, and Scala to Snowflake. Python worksheets are now in public preview, so you can instantly try your Python code as a UDF (User Defined Function) before sharing it with others. When you are ready to publish/register a UDF, you can do this directly in Snowflake's UI, Snowsight, or from your local Python environment. Re-use that function across your data, giving access to all the users or services that need it, and even enabling real-time use cases!
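As a rough sketch, registering a permanent UDF from local Python with Snowpark could look like this; the connection details, stage, and function are placeholders, not a definitive recipe:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf

# Placeholder connection details -- substitute your own account, role, and warehouse
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Register a permanent UDF so other users and services can call it from SQL
@udf(name="scale_order_value", is_permanent=True, replace=True,
     stage_location="@my_stage", session=session)
def scale_order_value(value: float, median: float, iqr: float) -> float:
    # The same robust-scaling idea as above, applied one value at a time
    return (value - median) / iqr if iqr else 0.0
```

Once registered, anyone with the right privileges can call it from plain SQL, e.g. `SELECT scale_order_value(order_value, 14.2, 1.7) FROM orders` (table and constants illustrative).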

Time to be efficient. Save > 60% of Data Scientists’ time

By cleaning and organizing data with Python directly in Snowflake, data scientists can win back a large portion of that pie, maybe even the whole 60% if the data in Snowflake is already clean and organized.
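As a sketch of what that looks like in practice, the cleaning can be pushed down to Snowflake with the Snowpark DataFrame API instead of being pulled out into pandas; the table and column names below are illustrative:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, trim, upper

def clean_orders(session: Session) -> None:
    raw = session.table("RAW_ORDERS")  # illustrative source table

    cleaned = (
        raw
        .na.drop(subset=["ORDER_VALUE"])                      # drop rows missing the key column
        .filter(col("ORDER_VALUE") > 0)                       # remove impossible values
        .with_column("COUNTRY", upper(trim(col("COUNTRY"))))  # normalise text
        .drop_duplicates()
    )

    # All of the above executes inside Snowflake; only the result is materialised
    cleaned.write.save_as_table("CLEAN_ORDERS", mode="overwrite")
```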

Even the 4% of time spent refining algorithms can benefit from running in Snowflake, where the latest data always resides.

ℹ Productivity gains continue to stack as you bring your workloads closer together.

BONUS TOPIC: How can business users interact with data scientists’ work?

Reusable functions run directly on your data, so your work can be picked up and reused by others, saving even more of your time. Now you're nailing productivity! Running Python workloads inside the data cloud is often associated with data pipelines, but it also unlocks real-time use cases. Users can interact with components such as buttons and sliders, in whatever tool they choose, to build applications that give end-users control over a model's inputs and parameters and help them make better data-driven decisions.
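As a minimal illustration, here is a Streamlit-style sketch in which a slider drives a query against the hypothetical UDF registered earlier; the secrets key, constants, and overall pattern are assumptions, not a recommended implementation:

```python
import streamlit as st
from snowflake.snowpark import Session

# Assumes Snowflake connection parameters stored in Streamlit secrets
session = Session.builder.configs(st.secrets["snowflake"]).create()

st.title("Order value explorer")

# The end-user controls the model input with a slider
order_value = st.slider("Order value", min_value=0.0, max_value=1000.0, value=50.0)

# Call the hypothetical UDF registered earlier; a real app should bind parameters safely
query = f"SELECT scale_order_value({order_value}, 14.2, 1.7) AS scaled"
result = session.sql(query).collect()

st.metric("Scaled value", round(result[0]["SCALED"], 3))
```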

Some of my favourite examples:

Astrato: Real estate instant property estimate

Streamlit: Ad Spend ROI Prediction

Hex: Superstore Summer promotion

Astrato: An interactive Data App for Snowflake, utilising Snowpark for Python.

If you liked this blog post, check out some of the tools I love to work with and connect with me on Linkedin 🤝🏼
