Salsa Algorithm Explained

Ilya Lakhin
12 min readNov 18, 2023

--

Photo by Dennis Klein on Unsplash

Salsa is a Rust framework for arbitrary incremental computations, sometimes referred to as reactive computations or self-adjusted computations. It is utilized for semantic analysis in the Rust Compiler and the Rust Analyzer projects under the hood.

Imagine that you have a set of functions that take as inputs the results of other functions within this set or some data from the external environment, such as “user input.”

These functions don’t have any other side effects. If the inputs of a function don’t change, the output remains the same each time the function is called. Additionally, these functions are not recursive; no function within this set can use its own result as an input, either directly or indirectly through the results of other functions.

If these functions meet these two requirements, they are pure functions that form a graph without cycles. This graph becomes the subject of management for the incremental framework.

The goal of the incremental computations framework is to optimize computations over the managed graph of such functions.

The framework utilizes memoization of function results to reduce the need for subsequent re-computations. What’s more important, when the user input changes, the framework is capable of updating the minimum number of memoization caches in the function graph, which is, in fact, not a trivial task.

Additionally, incremental frameworks are typically demand-driven. They compute the necessary subsets of the function graph whenever we observe specific function results but leave the rest of the graph nodes intact. Therefore, we can make as many changes to the user input as needed; the framework will propagate these changes through the graph of functions only when we try to observe particular nodes.

Incremental computation frameworks have a wide range of applications. In particular, as mentioned above, they are excellent candidates for organizing the architecture of incremental semantic analysis in programming language compilers and source code analysis tools.

Moreover, they could be employed to optimize graphical user interface rendering. Specifically, React.js serves as an example of an incremental framework that minimizes DOM alterations based on changes in user input.

Perhaps the most classical example of incremental computations is found in Excel Spreadsheets. The inner algorithm of Excel is responsible for propagating changes in user input within the spreadsheet cells to other table cells with mathematical formulas that directly or indirectly depend on these changes.

Spreadsheets is an example of the incremental computations

Apart from Salsa, the Rust ecosystem offers several alternative implementations of incremental frameworks. It’s worth mentioning Anchors and Adapton. They exhibit notable API differences, as well as distinctions in the way user input changes are propagated.

Being such a crucial component of the Rust compiler core, I will specifically focus on the algorithm used in Salsa.

The Salsa documentation primarily concentrates on providing guidance for the end user API, but, in my opinion, it offers too little information about the inner algorithm. This limitation is understandable because providing a thorough explanation of all optimizations and specifics implemented on top of the core algorithm goes beyond the scope of typical API documentation.

In this article, I will attempt to explain the core algorithm of Salsa in its purest and most simplified form. I will intentionally steer clear of Salsa API terminology, opting instead to elucidate the algorithm in terms of Excel Spreadsheets. My aim is to make the explanation of the core idea more accessible to a broad audience not familiar with Salsa specifics.

However, towards the end of the article, I will briefly touch upon some important aspects of the actual implementation of Salsa.

The Algorithm

The algorithm requires a single data structure that we will call a Spreadsheet, where we will store two things:

  • A globally unique Version. This is simply an integer number that always increases and never decreases, initially set to 1 by default.
  • A set of nodes in the graph that we will refer to as Cells. For simplicity, we will assume that the cells reside in the Vec<Cell>, a vector defined upfront, and we will address specific cells by their integer index within this vector.

The primary purpose of Cells is to store memoized (cached) values of arbitrary computable types.

You can think of the cells much like the cells in an Excel spreadsheet, where they may have computation formulas that refer to other spreadsheet cells. Cells without any references are akin to cells with arbitrary user-input numbers. Without loss of generality, however, we will not distinguish between these two types of cells, considering that both define some formula (or function) with possibly zero arguments and a constant output.

Each Cell is comprised of the following components:

  • A set of Inputs, which is a set of indices of other cells that this cell refers to.
  • A function that computes the Output value of this cell from the values of the Input cells.
  • A possibly uninitialized output cache paired with the output’s hash.
  • A VerifiedAt integer number, initially set to 0.
  • An UpdatedAt integer number, initially set to 0.
The Cell Internals

Cell Inputs effectively define the edges of the Cell graph within the Spreadsheet. The system of “output -> input” edges must not form cycles among the graph nodes. Therefore, the Spreadsheet is a directed acyclic graph.

Typically, the computation functions inside the cell should calculate result values solely from the input cell values (from the outputs of the referred cells). In this scenario, such functions are pure functions without side effects. Their results remain unchanged if the values of the input cells do not change. The purity of these functions is an essential feature of the entire algorithm.

Certainly, the user input into the spreadsheet requires some of the cell’s functions to read values from the external environment. Functions of this nature are considered impure in this context.

Whenever we change the external state that a particular cell’s function relies on, we need to do two things:

  1. Increment the spreadsheet global Version number by one.
  2. Set this cell’s cache to an uninitialized state, effectively erasing the cache.

As we will see later, such a change will eventually trigger all necessary recomputations within the spreadsheet.

To observe the actual values of the spreadsheet cells, we have two procedures: the Fetching Procedure and the Verification Procedure. Both procedures are interconnected and may recursively initialize and, depending on the needs, update necessary subsets of the cell graph.

The Fetching Procedure returns the up-to-date output value of a particular сell:

  1. If the cell’s VerifiedAt number equals the Spreadsheet Version, we return the cached value as it is (in this case, the cache is always initialized).
  2. Otherwise, we run the Verification Procedure that initializes and/or updates the cache. After the verification completion, we return the cached value.
The Fetching Procedure

The Verification Procedure recursively updates a subset of the сell graph through a depth-first traverse, ensuring that each сell in the subgraph will be up to date:

  1. If the VerifiedAt number equals the Spreadsheet Version, this cell is considered verified, following the same rule as in the Fetching Procedure.
  2. If the cache is not initialized, we run the fetching procedure for each cell referred to from this one through the Inputs set. Then we pass the fetched values to the cell’s inner function, storing the function’s result in the cell’s cache. Finally, we set VerifiedAt and UpdatedAt numbers to the current Spreadsheet Version. Essentially, we initialize the cell with a fresh new memoized value.
  3. If the cache is initialized, we need to ensure that it’s Inputs didn’t change.
    – First, we recursively run the verification procedure for each input cell.
    – Then, we check if each input’s UpdatedAt number is less than or equal to the current cell’s VerifiedAt number.
    – If they are less than or equal, the cache is considered up to date. Since the inputs remain intact, we may conclude that the function doesn’t need to be rerun. We set the VerifiedAt number to the current Spreadsheet Version, leaving UpdatedAt intact, and finish the procedure.
    – Otherwise, we rerun the cell’s computation function. If the result of the function has the same hash as the one previously stored in the cell, we also consider the cell to be up to date. In this case, we only need to update the VerifiedAt number. However, if the old and new hashes are distinct, we raise the UpdatedAt number to the Spreadsheet Version.
The Verification Procedure

When fetching a value from the сell, the algorithm performs a depth-first traverse of the subgraph of a particular сell, skipping (potentially large) parts of the subgraph computed previously. The algorithm skips traversal by comparing VerifiedAt to the current global Spreadsheet Version, which is a fast pass optimization of the algorithm.

However, on the first call or when the Spreadsheet Version receives a fresh new value (which occurs after each user input), the algorithm will have to traverse the entire subgraph regardless. This is necessary to check each cell’s VerifiedAt and UpdatedAt numbers within the boundaries of the cells, figuring out affected cells and updating their caches accordingly.

Even though the traversal procedure is quite fast, the necessity of traversing the entire subgraph on each Spreadsheet Version upgrade is one of the drawbacks of this algorithm in its most trivial implementation.

The rationale for distinguishing between the VerifiedAt and UpdatedAt numbers is to minimize the potential amount of extra computation between the graph edges.

The UpdatedAt number represents the most recent version of the cell cache update. This value is always less than or equal to the cell’s VerifiedAt. We only use this number within the cell boundaries to compare the new UpdatedAt number of the referred cell with the old VerifiedAt number of the dependent cell. The referred cell’s UpdatedAt number may receive a newer Spreadsheet Version in the previous verification step, surpassing the old VerifiedAt number of the dependent cell that we haven’t updated yet. This event indicates that we should attempt to rerun the cell’s function since one of its inputs has been updated.

Dependent Cell updates when any of the Referred Cells receive new UpdatedAt number

Even if we rerun the cell’s function, we don’t update the cell’s cache immediately. We additionally test the old cached value against the function’s result by hash. If they match, we don’t update the cache, leaving the UpdatedAt number of the cell intact. As a result, the subsequent verification step will not observe any changes in this cell, and the propagation of dependent cell function invocations will stop.

To enhance understanding of the algorithm, let’s examine the following spreadsheet:

Spreadsheet at Version 1

Here, we have four cells: two of them return constant values (presumably from the user input), and the other two perform actual computations.

Spreadsheet cells, and their inputs

Initially, all cells are initialized with VerifiedAt/UpdatedAt numbers set to 0, have no caches, and the Spreadsheet’s Version is 1.

Let’s observe cell D :

  1. D’s VerifiedAt is 0 which is less than the current Version number 1 . As such, we initiate verification procedure for this cell.
  2. D’s cache is uninitialized. To compute the cache, we need to fetch it’s referred cells B and C cached values first.
  3. B does not have dependencies, so we can compute it’s value instantly(20), save it to B’s cache, and assign B’s VerifiedAt and UpdatedAt to the current Version, which is still 1.
  4. C undergoes the same procedure, initializing constant cell A and then computing its own value (which is 10+5=15). We are saving 15 to the C’s cache, assigning VerifiedAt and UpdatedAt as well.
  5. Finally, as we successfully fetched B and C output values, we can compute D’s value (20+15=35), save it to D’s cache, and assign D’s VerifiedAt/UpdatedAt numbers accordingly.

From now on, our spreadsheet table looks like this:

Spreadsheet at Version 1

The spreadsheet’s version is still 1. All cells are up to date. If we try to observe any other cell, the algorithm will return the cached value instantly as the VerifiedAt number of each of them equals the current spreadsheet version.

Now let’s change B’s function from =20 to =23. As this action was a side-effect to our spreadsheet state, we have to unset B’s cache and increase the spreadsheet global version by one.

Spreadsheet at Version 2

If we observe cell C, the algorithm detects that its VerifiedAt number is less than the current Version and starts the verification procedure for C:

  1. C has a cache, then it needs to verify referred cell A first.
  2. A‘s VerifiedAt number is out of date, but it has no dependencies, so the cell’s function doesn’t need to be invoked. We can set the VerifiedAt number to the current Version, leaving UpdatedAt intact.
  3. A’s UpdatedAt number is still 1 which is less than or equal to C’s VerifiedAt (also 1), so we don’t need to call C's function. We leave C’s cache and UpdatedAt number intact as well, but assign C’s VerifiedAt number to the current Version.

The algorithm completed the verification of a subgraph of the spreadsheet cells graph, refreshing the subgraph cells’ VerifiedAt numbers but leaving all cached values as they are since no actual changes in this subgraph have been detected during verification.

Spreadsheet at Version 2

From this point, observing the cell D involves the following steps:

  1. D’s VerifiedAt is out of date, entering D’s verification procedure.
  2. D depends on B and C.
    – B has no cache, so we initialize it as we did previously. Its cached value is going to be 23, and VerifiedAt and UpdatedAt numbers will be assigned to 2.
    – C has a cache, and its VerifiedAt number matches the current Version. As such, we don’t need to traverse C’s subgraph once again.
  3. Now, both B and C are verified. C has an UpdatedAt number (1) equal to D’s current VerifiedAt number (1), but B received a new UpdatedAt number (2) that exceeds the current D’s VerifiedAt number. D’s function needs to be rerun.
  4. D’s function returns a new value 38 (B+C=23+15=38). This value is distinct from the currently cached value. As such, the cache is going to be updated, and D’s VerifiedAt and UpdatedAt numbers will be assigned to the current Spreadsheet Version.
Spreadsheet at Version 2

The spreadsheet is fully verified, and its values are up to date in accordance with the user input.

Conclusion

I hope you have gained a comprehensive understanding of the fundamental algorithm used behind Salsa.

Of course, Salsa employs different terminology compared to what I used in my article: Databases, Revisions, Tracked Functions, Jars, Queries, etc. The framework extensively utilizes Rust macros to express the managed entities. However, in the end, these entities form a structure that resembles the data structure explained above.

Salsa implementation is designed to be used in a multi-threaded computational environment, and, of course, the actual data structures used inside Salsa are highly optimized for concurrent computations.

Salsa also provides several important features that are worth separate discussion. To name a few of them:

  • Cycle detection. If, by chance, the programmer expressed a cycle between the managed functions, Salsa provides a mechanism to detect and handle such situations.
  • Updates interruption. An application is capable of canceling unfinished graph updates in the middle of computations without damaging the entire structure.
  • Durability. Salsa provides a mechanism to put “barriers” between the managed functions such that the updates propagation algorithm would not check parts of the dependency graph that are supposed to be updated infrequently.

I found inspiration in several valuable resources that cover additional aspects and details on the topic of incremental computations, aspects that I did not address in my article. I would like to acknowledge and recommend the following sources for those interested in further exploration:

  • “A journey through incremental computation” talk by Raph Levien, providing a comprehensive introduction to incremental computations and their specific application to GUI development.
  • “Salsa Architecture Walkthrough” by Niko Matsakis, where Niko provides an insightful overview of Salsa’s internal algorithm and walks through key sections of the repository source code.
  • “How to recalculate a spreadsheet” by the author of the Anchors framework, offering a comprehensive exploration of the topic. The author guides readers through the history of incremental computations development and explains various approaches.

These resources contribute valuable insights beyond the scope of my article and offer a more in-depth understanding of incremental computations.

Ilya Lakhin.

--

--