Safetensors: a simple, safe, and fast way to store and distribute tensors.

Souvik Mandal
10 min read · Jul 8, 2023


Safetensors is a new, simple format for storing tensors safely (as opposed to pickle) while remaining fast (zero-copy). Safetensors is really fast 🚀.

safetensors and ONNX serve different purposes. safetensors is a simple, safe, and fast file format for storing and loading tensors. It is a secure alternative to Python’s pickle utility, which is not secure and may contain malicious code that gets executed on load.

On the other hand, ONNX (Open Neural Network Exchange) is an open format for representing deep learning models. It allows you to save your model in a way that can be loaded by different deep learning frameworks, such as PyTorch, TensorFlow, Caffe2, etc. This makes it easier to share models between different frameworks.

In summary, safetensors is used for storing and loading tensors in a safe and fast way, while ONNX is used for sharing models between different deep learning frameworks. The same applies to other model-exchange formats.

I am personally interested in the speed-up. When we have a really large dataset and build a cache of it, it is important that we can load that cache very fast. Security also matters, but that aspect can be addressed with torch.load as well, as discussed below.

Installation

pip install safetensors huggingface_hub

Saving and Loading

We will try out different features and compare them. First, let's save some tensors:

import torch
from safetensors.torch import save_file

tensors = {
    "embedding": torch.zeros((2, 2)),
    "attention": torch.zeros((2, 3))
}
save_file(tensors, "model.safetensors")

Load tensors

from safetensors import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for k in f.keys():
        tensors[k] = f.get_tensor(k)  # loads the full tensor given a key
print(tensors)
# {'attention': tensor([[0., 0., 0.],
# [0., 0., 0.]], device='cuda:0'),
# 'embedding': tensor([[0., 0.],
# [0., 0.]], device='cuda:0')}

You might have noticed the framework argument. Since we are saving PyTorch tensors here, we use "pt".
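The same file can be read back through other frameworks as well: the library ships per-framework helpers such as safetensors.numpy. A minimal sketch (the model_np.safetensors filename is just an example) showing the NumPy variant, which mirrors the torch API:

import numpy as np
from safetensors.numpy import save_file as np_save_file, load_file as np_load_file

# Save plain NumPy arrays with the same key/value layout as the torch example
arrays = {
    "embedding": np.zeros((2, 2), dtype=np.float32),
    "attention": np.zeros((2, 3), dtype=np.float32),
}
np_save_file(arrays, "model_np.safetensors")

# Load everything back as NumPy arrays
loaded = np_load_file("model_np.safetensors")
print(loaded["embedding"].shape)  # (2, 2)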

Loading only part of the tensors:

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    tensor_slice = f.get_slice("embedding")
    vocab_size, hidden_dim = tensor_slice.get_shape()
    tensor = tensor_slice[:, :hidden_dim]  # change the :hidden_dim slice to load only part of the tensor

Lazy loading is the ability to load only some tensors, or part of tensors for a given file. This is possible with safetensors.

Lazy loading is really important when we have a large file containing many key-value pairs, for example a metadata cache for a large dataset. If we can load the value for a single key individually, it is more memory-efficient and faster; otherwise we would have to load the full file into memory just to inspect any one key.
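As a small sketch of that workflow (the cache.safetensors filename and the metadata keys are made-up examples), you can attach a string-to-string metadata dict at save time and later list keys or read that metadata without touching any tensor data:

import torch
from safetensors import safe_open
from safetensors.torch import save_file

# Build a toy cache and attach some user metadata to the file header
cache = {f"sample_{i}": torch.randn(512) for i in range(1000)}
save_file(cache, "cache.safetensors", metadata={"version": "1", "split": "train"})

with safe_open("cache.safetensors", framework="pt", device="cpu") as f:
    print(len(f.keys()))                # inspect keys without loading any tensors
    print(f.metadata())                 # {'version': '1', 'split': 'train'}
    sample = f.get_tensor("sample_42")  # load just the one entry we need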

Load a state dict into a model

from safetensors.torch import load_model, save_model

save_model(model, "model.safetensors")
# Instead of save_file(model.state_dict(), "model.safetensors")

load_model(model, "model.safetensors")
# Instead of model.load_state_dict(load_file("model.safetensors"))

Speed 🚀

CPU Speedup

The official docs report a 76.6x speed-up on CPU for loading GPT-2 weights on an Intel(R) Xeon(R) CPU @ 2.00GHz with Ubuntu 18.04.6 LTS. I tested the same script (the only change being that I ran it 500 times and took the median of both timings) on Colab and Kaggle. The speed-up on Colab is 12.6x and on Kaggle it is 9.30x.

import os
import datetime
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import torch
import numpy as np
from tqdm.auto import tqdm

sf_filename = hf_hub_download("gpt2", filename="model.safetensors")
pt_filename = hf_hub_download("gpt2", filename="pytorch_model.bin")

torch_load_times = []
st_load_time = []
for i in tqdm(range(500)):  # Run and compute the time 500 times.
    start_st = datetime.datetime.now()
    weights = load_file(sf_filename, device="cpu")
    load_time_st = datetime.datetime.now() - start_st
    st_load_time.append(load_time_st)

    start_pt = datetime.datetime.now()
    weights = torch.load(pt_filename, map_location="cpu")
    load_time_pt = datetime.datetime.now() - start_pt
    torch_load_times.append(load_time_pt)

speed_up = np.median(torch_load_times)/np.median(st_load_time) # take the median
print(f"on CPU, safetensors is faster than pytorch by: {speed_up:.1f} X")

Let’s see how the speed-up changes with the size of the checkpoint. We will create increasingly large tensors and save them.

sizes = [100, 1000, 10000, 20000, 30000]
file_size_tensor = []
file_size_st = []
speed_ups = []
for size in tqdm(sizes):
    tensor = torch.randn((size, size))
    tensors = {
        "embedding": tensor,
    }
    save_file(tensors, "model.safetensors")
    torch.save(tensors, "model.pt")
    file_size_tensor.append(os.path.getsize("model.pt") / (1024 * 1024))
    file_size_st.append(os.path.getsize("model.safetensors") / (1024 * 1024))

    torch_load_times = []
    st_load_time = []
    for i in tqdm(range(500), leave=False):
        start_st = datetime.datetime.now()
        weights = load_file("model.safetensors", device="cpu")  # load the file we just saved
        load_time_st = datetime.datetime.now() - start_st
        st_load_time.append(load_time_st)

        start_pt = datetime.datetime.now()
        weights = torch.load("model.pt", map_location="cpu")
        load_time_pt = datetime.datetime.now() - start_pt
        torch_load_times.append(load_time_pt)
    speed_up = np.median(torch_load_times) / np.median(st_load_time)
    speed_ups.append(speed_up)

With the above script, the checkpoint sizes are around [0.038, 3.815, 381.47, 1525.879, 3433.228] MB. The file sizes do not differ much between the .pt and safetensors formats. The speed-ups for these checkpoint sizes are [9.73, 9.73, 9.86, 9.79, 9.85] on Colab and [11.01, 11.0, 10.61, 10.48, 10.05] on Kaggle.

This speedup is due to the fact that this library avoids unnecessary copies by mapping the file directly.

I also wanted to compare the speed-up against .npy file loading. The speed-up is 87.8x on Colab and 103x on Kaggle when mmap_mode is None, and 2.0x and 2.9x respectively when mmap_mode="r".

Memory-mapped files use virtual memory, which is the address space that a process can access. When you memory-map a file, you are telling the operating system to map a region of virtual memory to the contents of the file. This does not mean that the whole file is loaded into physical memory at once. Instead, the operating system will load only the parts of the file that you access into physical memory, and unload them when they are no longer needed or when there is memory pressure. This is done transparently by the operating system using paging and page faults.
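For reference, this is what the mmap_mode variant mentioned above looks like; a minimal sketch showing that np.load with mmap_mode="r" returns a memory-mapped array, so only the slices you actually touch get paged into RAM:

import numpy as np

np.save("np_cache.npy", np.random.randn(400, 512, 512))

# Returns a numpy.memmap backed by the file; no data is read into RAM yet
arr = np.load("np_cache.npy", mmap_mode="r")
print(type(arr))  # <class 'numpy.memmap'>

# Only the pages backing this slice are faulted into physical memory
chunk = np.array(arr[10])  # copy one 512x512 slice into a regular array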

import os
import datetime
from safetensors.torch import load_file
import torch
import numpy as np
from tqdm.auto import tqdm

np.save("np_cache.npy", np.random.randn(400, 512, 512))
save_file({"data": torch.randn(400, 512, 512)}, "st_cache.safetensors")

np_filename = "np_cache.npy"
sf_filename = "st_cache.safetensors"

torch_load_times = []
st_load_time = []
for i in tqdm(range(500)):  # Run and compute the time 500 times.
    start_st = datetime.datetime.now()
    weights = load_file(sf_filename)
    load_time_st = datetime.datetime.now() - start_st
    st_load_time.append(load_time_st)

    start_pt = datetime.datetime.now()
    weights = np.load(np_filename)  # , allow_pickle=True
    load_time_pt = datetime.datetime.now() - start_pt
    torch_load_times.append(load_time_pt)

speed_up = np.median(torch_load_times)/np.median(st_load_time) # take the median
print(f"on CPU, safetensors is faster than npy by: {speed_up:.1f} X")

CPU Speed-up is really significant 🚀.

GPU Speed-up 🚗

The official docs mention a 2.1x speed-up for model loading on GPU. I ran the same experiment as before: on Colab the speed-up was 1.4x, and on Kaggle it was 1.5x.

torch_load_times = []
st_load_time = []
for i in tqdm(range(500)):  # Run and compute the time 500 times.
    start_st = datetime.datetime.now()
    weights = load_file(sf_filename, device="cuda:0")
    load_time_st = datetime.datetime.now() - start_st
    st_load_time.append(load_time_st)

    start_pt = datetime.datetime.now()
    weights = torch.load(pt_filename, map_location="cuda:0")
    load_time_pt = datetime.datetime.now() - start_pt
    torch_load_times.append(load_time_pt)

speed_up = np.median(torch_load_times)/np.median(st_load_time) # take the median
print(f"on GPU, safetensors is faster than pytorch by: {speed_up:.1f} X")

GPU speed-up with checkpoints of multiple sizes:

  • On Colab: [1.42, 1.35, 1.37, 1.35, 1.34]
  • On Kaggle: [1.25, 1.25, 1.25, 1.24, 1.24]

sizes = [100, 1000, 10000, 20000, 30000]
file_size_tensor = []
file_size_st = []
speed_ups = []
for size in tqdm(sizes):
    tensor = torch.randn((size, size))
    tensors = {
        "embedding": tensor,
    }
    save_file(tensors, "model.safetensors")
    torch.save(tensors, "model.pt")
    file_size_tensor.append(os.path.getsize("model.pt") / (1024 * 1024))
    file_size_st.append(os.path.getsize("model.safetensors") / (1024 * 1024))

    torch_load_times = []
    st_load_time = []
    for i in tqdm(range(500), leave=False):
        start_st = datetime.datetime.now()
        weights = load_file("model.safetensors", device="cuda:0")  # load the file we just saved
        load_time_st = datetime.datetime.now() - start_st
        st_load_time.append(load_time_st)

        start_pt = datetime.datetime.now()
        weights = torch.load("model.pt", map_location="cuda:0")
        load_time_pt = datetime.datetime.now() - start_pt
        torch_load_times.append(load_time_pt)
    speed_up = np.median(torch_load_times) / np.median(st_load_time)
    speed_ups.append(speed_up)

SafeTensors advantages

torch.load() unless weights_only parameter is set to True, uses pickle module implicitly, which is known to be insecure. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never load data that could have come from an untrusted source in an unsafe mode, or that could have been tampered with. Only load data you trust. — PyTorch Docs

PyTorch has introduced a weights_only parameter which loads only tensors, primitive types, and dictionaries. There is a proof-of-concept code-injection example here. SafeTensors does not have this problem.
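For completeness, a minimal sketch of the safer torch.load call (weights_only is available in recent PyTorch versions; pt_filename refers to the GPT-2 file downloaded earlier):

import torch

# Restricts unpickling to tensors, primitive types and containers;
# arbitrary pickled objects raise an error instead of executing code.
weights = torch.load(pt_filename, map_location="cpu", weights_only=True)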

Zero Copy Operation 0️⃣

Let’s understand zero-copy with an example of reading data from disk and sending it to a socket (this happens a lot in web applications). To complete this operation, the kernel first reads the data into user space.

Operating System terminologies:

User space and kernel space are two regions of virtual memory that are separated by the operating system to provide memory protection and hardware protection from malicious or errant software behavior.

User space is the memory area where application software and some drivers execute. Each user space process normally runs in its own virtual memory space, and cannot access the memory of other processes or the kernel space, unless explicitly allowed. User space processes can only interact with the kernel through system calls, which are a set of functions that the kernel exposes to user space.

Kernel space is the memory area where the operating system kernel, kernel extensions, and most device drivers run. Kernel space programs run in kernel mode, also called supervisor mode, which is a privileged mode that allows access to all CPU instructions and hardware resources.

Going back to the example: once the data is loaded into user space, the application makes another kernel call and the kernel writes the data to the socket. Each time data traverses the user-kernel boundary, it must be copied, which consumes CPU cycles and memory bandwidth.

Zero copy requests that the kernel copy the data directly from the disk file to the socket, without going through the application.

Traditional copy operation: besides the copies themselves, many context switches happen, which slows the process down. Source: Efficient data transfer through zero copy [4]
Zero-copy data transfer. Source: Efficient data transfer through zero copy [4]
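On Linux, this disk-to-socket path is exposed to Python through os.sendfile. A minimal sketch (the connection is assumed to be an already-accepted socket; this illustrates the syscall and is not part of safetensors):

import os
import socket

def send_file_zero_copy(conn: socket.socket, path: str) -> None:
    """Send a file over an established connection without copying it
    through user-space buffers (sendfile keeps the data in the kernel)."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            # The kernel moves bytes directly from the page cache to the socket
            sent += os.sendfile(conn.fileno(), f.fileno(), sent, size - sent)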

Now, coming back to our example, it is the same situation as before. First we ask the kernel to read the data; the kernel reads it and hands it to the user process (first copy). Then the user process asks the kernel to load the data into CPU memory, and while passing the data back to the kernel another copy is made. So we create a total of two copies while loading the data. The diagrams are the same, and the same number of context switches happens.

So, if you followed this correctly, you can guess that a zero-copy operation is not possible for our problem (ignoring the case where the user process has enough privilege to execute kernel commands directly), because we have to both read the data and load it. The only way we can do zero copy is if one of the two operations is not required.

A disk cache (cache memory) is a temporary holding area in the hard disk or random access memory (RAM) where the computer stores information that is used repeatedly. If you load too many different things from disk, other parts of the disk may be flushed from the cache, meaning they cannot be read again without slow disk access. Otherwise, the data stays in the disk cache, and in that case a zero-copy operation is possible.

Source: SafeTensors github

  • Safe: Can I use a randomly downloaded file and expect not to run arbitrary code?

  • Zero-copy: Does reading the file require more memory than the original file?

  • Lazy loading: Can I inspect the file without loading everything? And can I load only some tensors in it without scanning the whole file (distributed setting)?

  • Layout control: Lazy loading is not necessarily enough, since if the information about tensors is spread out in your file, then even if the information is lazily accessible you might have to access most of your file to read the available tensors (incurring many DISK -> RAM copies). Controlling the layout to keep fast access to single tensors is important.

  • No file size limit: Is there a limit to the file size?

  • Flexibility: Can I save custom code in the format and be able to use it later with zero extra code? (~ means we can store more than pure tensors, but no custom code)

  • Bfloat16: Does the format support native bfloat16 (meaning no weird workarounds are necessary)? This is becoming increasingly important in the ML world.
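The format itself is easy to inspect: a .safetensors file starts with an 8-byte little-endian integer giving the length of a JSON header, and the header describes each tensor's dtype, shape, and byte offsets into the rest of the file. A minimal sketch that parses just the header by hand (this follows the documented file layout rather than using the library):

import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 with the header length in bytes
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

header = read_safetensors_header("model.safetensors")
for name, info in header.items():
    if name != "__metadata__":  # optional user metadata entry
        print(name, info["dtype"], info["shape"], info["data_offsets"])

This is also what makes lazy loading and layout control cheap: a reader only needs the header to know exactly which byte range holds a given tensor.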

PyTorch Operations

Load a model in PyTorch.

from torchvision.models import resnet18

model_pt = resnet18(pretrained=True)

Save the state dict to a safetensors file, and load it back into a new model.

from safetensors.torch import load_model, save_model

# save the state dict
save_model(model_pt, "resnet18.safetensors")

# load the model without weights
model_st = resnet18(pretrained=False)
load_model(model_st, "resnet18.safetensors")

Run inference on a random input with both the initial model and the new model whose weights were loaded from safetensors.

img = torch.randn(2, 3, 224, 224)

model_pt.eval()
model_st.eval()

with torch.no_grad():
    print(torch.all(model_pt(img) == model_st(img)))  # tensor(True)

Torch shared tensors

  • PyTorch has something called shared tensors, which safetensors does not support. Let's see what shared tensors are with the example below:
from torch import nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(100, 100)
        self.b = self.a  # same weights as a

    def forward(self, x):
        return self.b(self.a(x))


model = Model()
print(model.state_dict().keys())
# odict_keys(['a.weight', 'a.bias', 'b.weight', 'b.bias'])
torch.save(model.state_dict(), "model.bin")
# This file is now 41k instead of ~80k, because a and b are the same weight,
# hence only one copy is saved on disk, with both `a` and `b` pointing to the
# same buffer
  • In the example above, a and b share the same weights/tensors. Because of this, saving does not create two copies of the same weights, which makes the checkpoint smaller.
  • Safetensors does not support them as of now.

Not all frameworks support them; for instance, TensorFlow does not. So if someone saves shared tensors in torch, there is no way to load them in a similar fashion, so we could not keep the same Dict[str, Tensor] API.

Note: Dict[str, torch.Tensor] is a dictionary that maps a name (key) to a torch.Tensor (value) in the state dictionary.
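As a small illustration (a sketch based on the Model class above), shared tensors can be detected by comparing data pointers, and safetensors.torch.save_model deduplicates the aliased weights, whereas a plain save_file of this state dict would complain about tensors sharing memory:

import torch
from safetensors.torch import save_model

model = Model()  # the Model defined above, where b is an alias of a

# Shared tensors point at the same underlying storage
print(model.a.weight.data_ptr() == model.b.weight.data_ptr())  # True

# save_model detects the aliasing and writes the weight only once
save_model(model, "model.safetensors")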

  • Lazy loading is easier to do without shared tensors.
with safe_open("model.safetensors", framework="pt") as f:
    a = f.get_tensor("a")
    b = f.get_tensor("b")

With this code it is impossible to “reshare” buffers after the fact: once we return the a tensor, we have no way to hand back the same memory when you ask for b.

That’s all for this post. Have a nice day.

Resources

  1. safetensors — GitHub (huggingface/safetensors)
  2. Safetensors documentation (huggingface.co)
  3. Compatibility with `torch.save()`? · Issue #65 · huggingface/safetensors · GitHub
  4. Efficient data transfer through zero copy — IBM Developer
  5. What is disk cache? — Definition — Computer Notes (ecomputernotes.com)
  6. Zero copy is not used when torch memory is sparse and/or has not been garbage collected · Issue #115 · huggingface/safetensors · GitHub
