Disk Detour: In-Memory Staging with Snowpark’s put_stream Function

Skip the save and land the file: an in-depth look at working with internal stages in Snowpark.


Snowflake offers several methods to upload and download files to and from internal stages. The command names mirror the HTTP request methods, like GET and PUT, and their behavior is comparable.

Note the word “internal”: that’s the stage type that permits file operations within Snowflake. Snowflake doesn’t support uploading or downloading files to or from external stages, although it’s still possible to carry out these tasks with the help of cloud service utilities.

Let’s briefly break down what’s happening here when we perform a regular PUT.

Diagram of a PUT.

We can see the typical client-server interaction (the server here being Snowflake). The client, likely an application or script running on your laptop or desktop, has some content that needs to be uploaded. Using a connector, the client can log in and, if all permissions are in order, upload a file that can be retrieved later with a GET or interacted with through Snowflake’s SQL dialect via COPY INTO or SELECT.

This is a simplified view; it’s essential to be mindful of the optional parameters the PUT command supports. They aren’t pertinent to this article, but they can certainly impact the interaction.
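To make the interaction concrete, here’s a minimal sketch of a regular PUT through Snowpark’s Python API. The connection_parameters dict is a placeholder for your own credentials, and @models is the internal stage used throughout this article.

```python
# A minimal sketch of a regular PUT with Snowpark, assuming a
# `connection_parameters` dict with your account credentials and an
# existing internal stage named @models.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>",   # placeholder credentials
    "user": "<user>",
    "password": "<password>",
}

session = Session.builder.configs(connection_parameters).create()

# Upload a local file to the @models internal stage; Snowpark gzips the
# file by default (auto_compress=True).
put_result = session.file.put("/tmp/model.joblib", "@models", overwrite=True)
print(put_result[0].status)   # e.g. "UPLOADED"
```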

So far, we’ve explored PUT as a way of taking a file from the client and loading it into an internal stage in Snowflake. What if there is no file? What if we want to write something we haven’t yet saved, such as a DataFrame or a machine learning model?

Why Write Twice?


I see a lot of examples that write files to a temporary directory only to execute a PUT so that the file is copied to a stage. We can streamline (pun intended) this with PUT_STREAM.

Let’s review two scenarios using the joblib library to serialize (dump) a machine-learning model. In machine learning, serialization allows us to save the trained model to disk or memory so that it can be reused later without retraining it from scratch.

Scenario 1 — PUT

  1. The model is serialized using joblib.dump(model, model_file), where model is the machine learning model to be serialized and model_file is the path of the file where the serialized model will be stored.
  2. The joblib.dump function saves the serialized model to the specified file location (model_file).
  3. The session.file.put function then uploads the serialized model file (model_file) to the internal stage, as sketched below. @models is the name of the internal stage where the model file will be stored.
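A sketch of Scenario 1, assuming the session object from the earlier PUT example. The toy model below is just a stand-in for a real trained model.

```python
import joblib
from sklearn.linear_model import LogisticRegression

# A stand-in for an already-trained machine learning model.
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

model_file = "/tmp/model.joblib"   # hypothetical local path

# Steps 1-2: serialize the model to a physical file on disk.
joblib.dump(model, model_file)

# Step 3: upload the file to the @models internal stage.
session.file.put(model_file, "@models", overwrite=True)
```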

Scenario 2 — PUT_STREAM

  1. A BytesIO object x is created. BytesIO is an in-memory stream used to store binary data.
  2. The model is serialized using joblib.dump(model, x), where model is the machine learning model to be serialized and x is the BytesIO object that stores the serialized data.
  3. The joblib.dump function saves the serialized model to the BytesIO object x.
  4. The session.file.put_stream function then uploads the serialized model data from the BytesIO object x to the internal stage, as sketched below. @models is the name of the internal stage where the model file will be stored.
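And a sketch of Scenario 2, again assuming the same session object and a toy stand-in model. Note that put_stream takes the full stage path, including the target file name.

```python
import io

import joblib
from sklearn.linear_model import LogisticRegression

# A stand-in for an already-trained machine learning model.
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

# Step 1: create an in-memory binary stream.
x = io.BytesIO()

# Steps 2-3: serialize the model straight into the stream; no file on disk.
joblib.dump(model, x)

# Rewind so the upload reads the payload from the beginning.
x.seek(0)

# Step 4: upload the stream to the @models internal stage.
session.file.put_stream(x, "@models/model.joblib", overwrite=True)
```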

Since the dump function provided by the joblib library accepts either a file object or a file path for its filename parameter, the two scenarios work almost interchangeably.

The main difference between the two lies in how the serialized model is staged before upload. Scenario 1 saves the serialized model to a physical file on disk (model_file) and then writes that file to the internal stage. Scenario 2, on the other hand, keeps the serialized model in memory in a BytesIO object (x) and writes the object directly to the internal stage.

Conclusion

There’s not really a right or wrong way to do this, but evaluating and understanding the different ways it can be done is worthwhile.

I came up with this trick while working out a quick way to write a Pandas DataFrame to a stage in another article, and I figured it would be worth sharing in greater detail here.
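For instance, here’s a minimal sketch of the same trick applied to a DataFrame, assuming the session object from earlier; the @mystage name is hypothetical.

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Render the DataFrame as CSV bytes entirely in memory.
buf = io.BytesIO(df.to_csv(index=False).encode("utf-8"))

# Stream the bytes straight to the stage; no temporary file involved.
session.file.put_stream(buf, "@mystage/df.csv", overwrite=True)
```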

I hope you enjoyed the read and found this helpful!


Tyler White
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

I constantly seek new ways to learn, improve, and share my knowledge and experiences. Solutions Architect @ Snowflake.