PyTorch Tensors Explained
From Memory Usage to AutoGrad in PyTorch
PyTorch is one of the most important libraries for building new machine learning models. Developed at Meta, it led the way on dynamic computation graphs, which allow for much easier debugging. When you are creating architectures from scratch, PyTorch is arguably the easiest library to use.
This post explores the internal workings of PyTorch’s Tensor class, building upon Edward Z. Yang’s foundational blog post [1] and my own experience working with the library. Since practically every machine learning library has taken some inspiration from PyTorch, understanding the low-level choices made here will help you understand the others as well.
Let’s dive in!
Tensors
What is a Tensor?
Let’s begin from the most abstract point of view. A tensor is a mathematical concept that generalizes scalars, vectors, and matrices to n dimensions. This means that a one-dimensional array is a tensor, as is a 2x2 matrix, as is the three-dimensional (b, t, c) tensor used for Transformers.
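To make this concrete, here is a quick PyTorch sketch showing tensors of increasing rank (the shapes are arbitrary, chosen purely for illustration):

import torch

scalar = torch.tensor(3.14)     # 0-D tensor (a single scalar)
vector = torch.arange(4)        # 1-D tensor of shape (4,)
matrix = torch.zeros(2, 2)      # 2-D tensor of shape (2, 2)
batch = torch.randn(8, 16, 32)  # 3-D tensor, e.g. (batch, time, channels)

for t in (scalar, vector, matrix, batch):
    print(t.ndim, tuple(t.shape))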
Why does this type of variation matter for computer scientists?
In general, matrix operations are the most memory-expensive: a naive multiplication of two n×n matrices touches each input element n times, so you end up making far more memory transfers than for any operation on individual scalars. Thus, while we want to be efficient with how scalars are represented in memory, it is the representation of tensors that we really need to get right to avoid significant slowdowns.
While there are many different ways to lay out these tensors, I’m going to focus on the strided tensor, as this is the most common layout needed for Large Language Models (LLMs). If people are curious, I can do another blog post on the alternative layouts used in other ML workloads.
How Are Tensors Laid Out In Memory?
When a user creates a Tensor in PyTorch, they typically have something like the picture below in their head: a matrix laid out with a specific shape, with perhaps a specific value they want to access later on.
By default, PyTorch stores the values contiguously in memory in row-major order, like the diagram below (assuming each value is a 4-byte integer). Row-major order means that elements in the same row are stored contiguously (this is also the default for how low-level languages like C and Rust store matrices).
While this layout is simple, we need to store more data to ensure we don’t lose information. For instance, for the layout drawn above, we have no idea if the represented tensor has a shape of (4x1), (1x4), or (2x2). Naturally this makes a big difference in all of the operations we do. Let’s add in the shape to our diagram below.
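We can also see both pieces from Python, the flat row-major storage and the shape metadata (a quick sketch using arbitrary 2x2 values):

import torch

t = torch.tensor([[1, 2],
                  [3, 4]], dtype=torch.int32)

print(t.is_contiguous())  # True: the four values sit in one contiguous block
print(t.flatten())        # tensor([1, 2, 3, 4]) in row-major order
print(t.shape)            # torch.Size([2, 2]): the metadata that disambiguates the layout
print(t.element_size())   # 4 bytes per int32 element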
With the shape and the values within memory, now we have to figure out how to fetch specific entries. For this, we use a stride.
A stride is an array of values that correspond to each dimension in the tensor. We use these to determine how far to walk within the contiguous memory to find the exact value we need. To continue with our example:
With the right stride we will get all the correct values. The question is: how do you get the right stride? It is derived from the dimensions of your tensor, working backwards. For any contiguous tensor, the last dimension always has a stride of 1. To get the stride of each earlier dimension, multiply the size of the next dimension by that dimension’s stride. For our (2x2) example, the next dimension has size 2 and its stride is 1, so we get 2. The image below shows the general pattern.
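In code, you can read the strides straight off the tensor and check the arithmetic yourself (a small sketch using a contiguous 2x2 tensor):

import torch

t = torch.arange(4).reshape(2, 2)  # [[0, 1], [2, 3]] stored contiguously

print(t.stride())  # (2, 1): jump 2 elements per row step, 1 per column step
print(t[1, 0])     # tensor(2), fetched from flat offset 1*2 + 0*1 = 2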
This stride format is dynamic, allowing us to change the shape of the Tensor by simply changing the metadata. Whenever we want to adjust the shape of the tensor without copying the data, we call tensor.view(x, y) and, as long as the new dimensions hold the same total number of elements, we can now access our data through the new shape.
To wrap this up, let’s revisit our earlier example. After running view on the tensor to move it to a (4x1) matrix, we can see below that our values and memory addresses have stayed the same, while only our metadata was updated.
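A minimal sketch mirroring this (2x2) to (4x1) reshape confirms that only the metadata changes:

import torch

t = torch.arange(4).reshape(2, 2)
v = t.view(4, 1)                     # same storage, new metadata

print(t.data_ptr() == v.data_ptr())  # True: no data was copied
print(t.shape, t.stride())           # torch.Size([2, 2]) (2, 1)
print(v.shape, v.stride())           # torch.Size([4, 1]) (1, 1)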
Operations
How Do Tensors Get Operated On?
Now that we have our data set up, it’s time to look at how these tensors get operated on. From a high level, PyTorch has operations like torch.mm, which carries out matrix multiplication. While it may seem like a single function, in reality there are multiple implementations, selected based on the device the tensor lives on (CPU, CUDA, etc.) and the data type it holds (fp32, int, etc.).
This separation is also why you generally cannot operate on tensors held on different devices: each kernel is written for a single device, so there is no implementation to handle the mix and PyTorch raises a runtime error (a quick sketch below).
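Here is a small sketch of that failure mode (it only triggers if a CUDA device is actually available; the exact wording of the error message may vary between versions):

import torch

a = torch.randn(2, 2)                     # lives on the CPU
if torch.cuda.is_available():
    b = torch.randn(2, 2, device="cuda")  # lives on the GPU
    try:
        torch.mm(a, b)                    # no kernel handles this device mix
    except RuntimeError as err:
        print(err)                        # roughly: "Expected all tensors to be on the same device..."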
Looking at the Python code alone, you wouldn’t see the complexities and performance implications this choice has. Under the hood, PyTorch is implemented in C++, where there are two main ways to invoke functions: statically or dynamically.
In a static call, the function is known precisely at compile time — this makes it fast but would require PyTorch to make assumptions it isn’t designed to make (e.g. device type, type of Tensor, data type, etc). Put differently, only at runtime does PyTorch have all the info needed to determine which function to call. As a consequence, PyTorch relies on dynamic dispatch.
Dynamic dispatch has its costs, however. While the extra storage overhead of dynamic libraries is rarely an issue today, managing backwards compatibility is. When running older models, finding the correct combination of dependencies can be incredibly painful. This means that when you set up your PyTorch environment, you need to be extremely careful to pin down every relevant dependency; relying solely on the default pip wheel can lead to version mismatches and elusive bugs.
Once PyTorch determines which operation to run based on device and data type, it delegates to a low-level kernel — usually written in C++. Let’s now look into how these kernels are structured and how you can write your own.
Writing Kernels
The dispatches we talked about eventually connect to low-level kernels hand-written for PyTorch. These kernels are incredible (check out the ones for each compute type in this folder). If you want to write PyTorch kernels yourself, there are a few tools that make it easier.
First, the wrapper code that connects your kernel to the high-level PyTorch functions is automatically generated for you when you fill in the right schema (the code for the schema is defined here). This allows you to immediately connect your implementation to the corresponding high-level PyTorch operation.
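One way to see the result of this wiring from Python: a registered operator is reachable both through the friendly torch namespace and through the lower-level torch.ops.aten bindings, and both paths end up at the same kernel (a small sketch):

import torch

a = torch.randn(2, 3)
b = torch.randn(3, 4)

out_high = torch.mm(a, b)          # high-level API
out_low = torch.ops.aten.mm(a, b)  # the generated operator binding

print(torch.allclose(out_high, out_low))  # True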
Second, PyTorch provides macros to handle common behaviors. For example, the low-level TORCH_CHECK macro is useful for writing better debugging messages and guarding against crashes from invalid memory accesses. TORCH_CHECK itself can be thought of as an abstracted if-and-throw with a custom message:
TORCH_CHECK(self.dim()==1, "Expected 1-D tensor, got ", self.dim(), "-D tensor");
The last one I’ll cover is the TensorAccessor class. This low-level class wraps a tensor’s data pointer, dimensionality, and dtype, and it handles strides correctly out of the box, even when the underlying memory is not contiguous, making it a much nicer interface to work with than raw pointers.
// Assumes B is a 2-D float tensor, e.g. at::Tensor B = at::rand({3, 4});
auto accessor = B.accessor<float, 2>(); // 2-D float accessor over B's data
for (int i = 0; i < B.size(0); ++i) {
  for (int j = 0; j < B.size(1); ++j) {
    float val = accessor[i][j];         // stride-aware element access
    std::cout << val << "\n";
  }
}
Autograd
The last feature I’m going to cover here is autograd for Tensors.
This is perhaps the most important PyTorch feature for a machine learning library. Autograd (short for automatic gradients) frees the programmer from having to specify by hand how derivatives should be calculated through their model. Instead, by defining the forward pass clearly, the programmer can rely on autograd to figure out how the backward pass should be computed. PyTorch does this by storing extra metadata that tracks the operations happening during a forward pass, so it can work out the derivatives it needs for the backward pass.
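You can see this bookkeeping directly: any tensor produced from a tensor with requires_grad=True carries a grad_fn pointing back at the operation that created it (a minimal sketch):

import torch

x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()

print(y.grad_fn)  # a backward node (e.g. SumBackward0) recorded during the forward pass
y.backward()      # walks the recorded graph backwards
print(x.grad)     # tensor([2., 2., 2.])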
This metadata is used to build a directed acyclic graph of dependencies that shows PyTorch how operations relate to each other. It’s important to note that by default this compute graph is built at runtime (rather than at compile time). While that makes it very easy to debug and work with, it comes at a cost in speed. There have been attempts to reduce this overhead (notably torch.compile in PyTorch 2.0), but there are still major issues here [see this Google Doc outlining ‘gotchas’].
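For reference, opting into this looks like a one-line wrapper in PyTorch 2.x (a sketch; the Linear layer here is just a stand-in module):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
compiled = torch.compile(model)    # captures and optimizes the graph lazily, on the first call

out = compiled(torch.randn(1, 4))  # later calls reuse the compiled artifact
print(out.shape)                   # torch.Size([1, 2])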
Note that autograd’s extra bookkeeping is not small. If you are simply using your model for inference, keeping this metadata around will needlessly slow down your code. To get around this, we wrap the computations we don’t want tracked in a with torch.no_grad(): block, as shown below.
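For inference, that wrapping looks like this (a minimal sketch with a stand-in linear layer):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in model for illustration
x = torch.randn(1, 4)

with torch.no_grad():    # no graph metadata is recorded inside this block
    y = model(x)

print(y.requires_grad)   # False: there is nothing to backpropagate through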
Autograd Example
I’m going to walk you through a simple example that shows why autograd is a key feature for PyTorch.
The below shows us creating a simple Multi-Layer Perceptron (MLP) and then doing a quick stochastic gradient descent optimization run on it.
import torch
import torch.nn as nn
# Set seed for reproducibility
torch.manual_seed(0)
# Dummy data
x = torch.randn(5, 2) # 5 samples, 2 features
y = torch.randn(5, 1) # Target values
# Simple MLP: 2 -> 3 -> 1
model = nn.Sequential(
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 1)
)
# Loss and optimizer
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# One training step
y_pred = model(x)
loss = loss_fn(y_pred, y)
loss.backward() # autograd does all the work here!
optimizer.step()
Because autograd tracks all of the operations, we don’t have to worry about doing the derivatives of ReLUs or the linear layers (multiplication and addition) by hand. If we wanted to do this without PyTorch, it would look something like the below:
import numpy as np
np.random.seed(0)
# Dummy input and target
x = np.random.randn(5, 2) # 5 samples, 2 features
y = np.random.randn(5, 1) # 5 samples, 1 output
# Initialize weights
W1 = np.random.randn(2, 3) # (input_dim, hidden_dim)
W2 = np.random.randn(3, 1) # (hidden_dim, output_dim)
# Forward pass
z1 = x @ W1 # shape: (5, 3)
a1 = np.maximum(0, z1) # ReLU activation
y_pred = a1 @ W2 # shape: (5, 1)
# Compute loss (MSE)
loss = np.mean((y_pred - y)**2)
print(f"Loss before: {loss:.4f}")
# Backward pass (manual gradients)
# dL/dy_pred
grad_y_pred = 2 * (y_pred - y) / y.shape[0] # shape: (5, 1)
# dL/dW2 = a1^T @ grad_y_pred
grad_W2 = a1.T @ grad_y_pred # shape: (3, 1)
# dL/da1 = grad_y_pred @ W2^T
grad_a1 = grad_y_pred @ W2.T # shape: (5, 3)
# dL/dz1 = grad_a1 * ReLU'(z1)
grad_z1 = grad_a1 * (z1 > 0).astype(float) # shape: (5, 3)
# dL/dW1 = x^T @ grad_z1
grad_W1 = x.T @ grad_z1 # shape: (2, 3)
# Gradient descent step
lr = 0.01
W1 -= lr * grad_W1
W2 -= lr * grad_W2
# Forward again after update
z1 = x @ W1
a1 = np.maximum(0, z1)
y_pred = a1 @ W2
loss = np.mean((y_pred - y)**2)
print(f"Loss after: {loss:.4f}")
We quickly reach a point where keeping track of these operations is complex and error-prone.
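There is also a middle ground: when you need a derivative PyTorch doesn’t provide, torch.autograd.Function lets you hand-write the forward and backward of a single op and leave the rest of the graph to autograd (a minimal sketch using ReLU as the example op):

import torch

# Hand-written forward and backward for one op; autograd stitches it into the graph.
class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x > 0).to(grad_output.dtype)

x = torch.randn(5, requires_grad=True)
loss = MyReLU.apply(x).sum()
loss.backward()
print(x.grad)  # 1 where x > 0, 0 elsewhere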
The Future
PyTorch’s flexibility has made it the go-to library for prototyping cutting-edge models. But as models move from research to production, new challenges emerge: dependency conflicts, runtime overhead from dynamic graphs, and the need for hand-optimized kernels tailored to specific hardware.
This is where projects like Luminal (disclosure: I contribute to Luminal) explore a different approach. As an open-source framework, Luminal compiles entire models ahead-of-time, analyzing the full computational graph to eliminate redundant operations, fuse kernels automatically, and optimize memory layouts. By seeing the whole model upfront, Luminal can apply optimizations once — like rewriting matrix multiplications or statically allocating memory — that pay dividends across thousands of inferences.
For example, while PyTorch relies on handwritten CUDA kernels for performance (e.g., FlashAttention), Luminal generates kernels programmatically based on the model’s specific operations and target hardware. This avoids the complexity of maintaining hardware-specific kernel libraries, though it trades off some low-level control. The result is a single, lightweight binary with no external dependencies — ideal for reducing inference costs and increasing the speed of response.
PyTorch’s strength will always lie in its flexibility and ecosystem. But for teams deploying models at scale, tools like Luminal highlight the power of compilation-driven ML: sacrificing some interactivity for efficiency gains that compound over time.
The future of ML frameworks isn’t one-size-fits-all. By understanding PyTorch’s internals, you’re better equipped to choose — or build — the right tools for your next challenge.
[1] Yang, E., “PyTorch internals” (2019), blog.ezyang.com