What’s new with Solana’s transaction scheduler?

A deep dive into the recent improvements in Solana’s transaction scheduling, coming straight from the Solana core.

Harsh Patel
12 min read · Jan 22, 2024

Introduction

The transaction scheduler is a core component of the Solana validator client, responsible for block building and for ordering the transactions to be executed. In this post, we will dive deeper into the new improvements in Solana’s transaction scheduler.

First, to understand how Solana’s transaction scheduler works, we have to revisit the Banking stage. The Solana validator has a group of processes collectively called the Banking stage; whenever a validator node becomes the leader, the Banking stage is responsible for processing transactions. Within this suite of processes there is a specific component called the transaction scheduler. We’ll explore how this part of the validator schedules transactions and on what basis it does so.

A typical Solana transaction flows like this: a user submits a transaction by signing and confirming it in their wallet; the wallet talks to an RPC node in the backend, which forwards the transaction as network packets to the geographically closest validator; and that validator in turn forwards the packets to the current leader node. Before scheduling, a leader may receive a very large number of transactions as network packets. These packets, after going through the cleaning and verification process in the sigverify stage, need to be further filtered, deserialized and buffered into memory before they can be scheduled.

The Problem

In the old scheduler design, threads operate by accessing a common channel from which they retrieve transactions and place them into their individual queues. Once these transactions are in the local queues, they are organized according to their priority. During block production, every thread selects the top 128 transactions from its queue and tries to grab the necessary locks (read, write, or both) for them. If successful, the thread proceeds to execute, record, and finalize the transactions. If it’s unable to obtain the locks, the transaction is set aside to be retried later.

“The historic design had several drawbacks, mainly related to failed lock grabs. During competitive events, such as NFT mints, the top of the queue is often filled with conflicting transactions. Given the entry constraints, this means each thread would attempt to lock 128 conflicting transactions, but fail to lock 127 of them. This resulted in massive inefficiency as the majority of transactions would be skipped over to be retried later. Additionally, because the other threads were also likely to be grabbing the same locks, only one thread may have actually been doing useful work.” — Andrew Fitzgerald, Solana Labs

The old banking stage receives packets and buffers them into a container, so more time is spent on receiving transactions, grabbing locks on them, and retrying transactions whenever a thread fails to grab the locks. Note that scheduling only happens after the thread successfully grabs the locks on the accounts to be read from or written to, which means any failure to acquire locks leaves the scheduler idle and degrades performance.
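To make the inefficiency concrete, here is a minimal, self-contained sketch of the old pattern: each thread pops its top 128 transactions, tries to take the account locks, and pushes everything that failed back into its queue to retry later. All types and names here are illustrative stand-ins, not the actual banking-stage code.

```rust
use std::collections::{BinaryHeap, HashSet};

// Illustrative transaction: a priority plus the accounts it must write-lock.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Tx {
    priority: u64,
    write_accounts: Vec<&'static str>,
}

const BATCH: usize = 128;

// One iteration of the legacy per-thread loop (simplified): pop up to 128
// transactions, keep the ones whose locks we can take, retry the rest later.
fn legacy_iteration(queue: &mut BinaryHeap<Tx>, locked: &mut HashSet<&'static str>) -> Vec<Tx> {
    let mut scheduled = Vec::new();
    let mut retry = Vec::new();

    for _ in 0..BATCH {
        let Some(tx) = queue.pop() else { break };
        // If any account is already locked, the whole transaction is skipped.
        if tx.write_accounts.iter().any(|a| locked.contains(a)) {
            retry.push(tx); // set aside, retried on a later pass
        } else {
            locked.extend(tx.write_accounts.iter().copied());
            scheduled.push(tx);
        }
    }
    // Everything that failed to lock goes back into the queue: wasted work
    // when the top of the queue is full of conflicting transactions.
    for tx in retry {
        queue.push(tx);
    }
    scheduled
}

fn main() {
    // During a hot mint, most top-of-queue transactions touch the same account.
    let mut queue: BinaryHeap<Tx> = (0..200u64)
        .map(|i| Tx { priority: 1_000 - i, write_accounts: vec!["mint_state"] })
        .collect();
    let mut locked = HashSet::new();
    let scheduled = legacy_iteration(&mut queue, &mut locked);
    // Only one transaction locks "mint_state"; the other 127 are retried.
    println!("scheduled {}, left in queue {}", scheduled.len(), queue.len());
}
```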

The new block production mechanism

The Banking stage spawns a set of threads based on the configuration set by the validator operator. The default number of threads is six: two of them are reserved for processing vote transactions, while the remaining four process non-vote (i.e. user) transactions. As of this writing, there are two block production modes the Banking stage can operate in. The modes are opt-in, meaning node operators can choose which mode the Banking stage will run with:

  • BlockProductionMode::ThreadLocalMultiIterator
  • BlockProductionMode::CentralScheduler

ThreadLocalMultiIterator is the legacy block production mode without the new scheduler optimizations, so for brevity we’ll skip it. Instead, we’ll dive into how CentralScheduler efficiently schedules transactions across threads and what the new optimizations are.

The default block production mode is still ThreadLocalMultiIterator, to allow a smooth and graceful migration from the old scheduler to the new one after intensive testing and benchmarking.
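For illustration only, the thread split and the two opt-in modes can be pictured with the toy snippet below; the constants and enum mirror the numbers and names mentioned above, while the real agave configuration types and flags may differ.

```rust
// Illustrative defaults only; real agave configuration types may differ.
const TOTAL_BANKING_THREADS: usize = 6;
const VOTE_THREADS: usize = 2;                                        // dedicated to vote transactions
const NON_VOTE_THREADS: usize = TOTAL_BANKING_THREADS - VOTE_THREADS; // user transactions

#[derive(Clone, Copy, Default)]
enum BlockProductionMode {
    /// Legacy path: each worker pulls from a shared channel into its own queue.
    #[default]
    ThreadLocalMultiIterator,
    /// New path: a single scheduler thread feeds the non-vote workers.
    CentralScheduler,
}

fn main() {
    // The default remains the legacy mode until the migration is complete.
    match BlockProductionMode::default() {
        BlockProductionMode::ThreadLocalMultiIterator => {
            println!("legacy mode: {NON_VOTE_THREADS} independent non-vote workers")
        }
        BlockProductionMode::CentralScheduler => {
            println!("central scheduler feeding {NON_VOTE_THREADS} non-vote workers")
        }
    }
}
```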

The meat

The central scheduler is a collection of processes managed by the Scheduler Controller. This central “coordinator” oversees a variety of tasks, including deciding whether to consume or forward packets, managing the inflow and buffering of packets, and organizing batches of packets for the worker threads to process. Let’s dissect the scheduler controller object, focusing on its important components:

Definitions

  • Decision maker — The process responsible for deciding whether to consume packets or clear the queue, since the forwarding process has been removed in the new scheduler.
  • Packet deserializer — Also referred to as packet receiver, it receives and deserializes packets coming from the sigverify stage.
  • Prio graph scheduler — A priority graph which keeps track of chains of transactions and has a top-level priority function. It is a directed acyclic graph where an edge is only present between two nodes if one is the next-highest-priority node for a particular resource. Resources can be either read- or write-locked, with write locks being exclusive.
  • Transaction state container — A fixed-capacity data structure shared between the packet deserializer and the prio-graph scheduler. It stores the state of each transaction for the entire time the transaction is being processed by the scheduler.
  • ReadWriteAccountSets — A pair of hashsets or ordered containers that hold a collection of accounts that are locked either for reading or writing.
  • Sanitization — A broad range of checks verifying the integrity of each component of a transaction.

The Priority Graph and Scheduling

Scheduler::run(..) is called inside the banking stage; this kicks off the following steps in order (a simplified sketch of the loop follows the list):

  1. Start decision maker and produce a decision on whether to consume or forward packets.
  2. Process transactions based on that decision; this internally starts the priority graph scheduler and the consumer.
  3. Receive batches of executed transactions, which are consumed by the Consumer inside the scheduler.
  4. Receive the consumed packet results, i.e. the batches of transactions that have been consumed, processed and executed.
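A heavily simplified sketch of that loop is shown below. The type and method names are invented to mirror the four steps above and are not the actual agave SchedulerController API.

```rust
// Illustrative skeleton of the scheduler controller loop described above.
// None of these types are the real agave ones; they just mirror the steps.

enum Decision {
    Consume,
    Hold, // not currently the leader, keep buffering
}

struct SchedulerController;

impl SchedulerController {
    fn make_decision(&self) -> Decision { Decision::Consume }
    fn schedule_buffered_transactions(&mut self) { /* run the prio-graph scheduler, send batches to workers */ }
    fn receive_completed_work(&mut self) { /* take finished batches back from the workers */ }
    fn receive_and_buffer_packets(&mut self) { /* pull deserialized packets from sigverify into the container */ }

    fn run(&mut self, mut keep_running: impl FnMut() -> bool) {
        while keep_running() {
            // 1. Decide what to do with buffered packets this iteration.
            if let Decision::Consume = self.make_decision() {
                // 2. Schedule work onto the worker threads.
                self.schedule_buffered_transactions();
            }
            // 3./4. Collect results of work the workers have finished.
            self.receive_completed_work();
            // Receive fresh packets and buffer them for the next pass.
            self.receive_and_buffer_packets();
        }
    }
}

fn main() {
    let mut controller = SchedulerController;
    let mut iterations = 0;
    controller.run(|| { iterations += 1; iterations <= 3 });
}
```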

Let’s understand this algorithm with a simple example, then dive into the actual parameters used. The diagram below shows 16 transactions arranged in order in the priority queue. If the decision in steps 1 and 2 is “Consume”, the transactions are inserted into the priority queue inside the transaction state container. The container has a maximum capacity of 700,000 packets; these packets are buffered, chunked and converted into transactions before finally being inserted into the container. The priority queue is a MinMaxHeap, a tree-like structure that uses “bubble-up” operations to keep the highest-priority transactions at the top of the tree.
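To picture the container, here is a small sketch of a capacity-bounded priority queue. The validator uses a MinMaxHeap; a standard-library BTreeSet stands in for it here purely so the example runs without external crates, and the field names and capacity are illustrative.

```rust
use std::collections::BTreeSet;

// Illustrative priority id: ordered by priority first, then by id.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TransactionPriorityId {
    priority: u64,
    id: u64,
}

// Tiny stand-in for the transaction state container's priority queue.
// The validator uses a MinMaxHeap; a BTreeSet is used here only so the
// sketch runs with the standard library alone.
struct Container {
    capacity: usize,
    queue: BTreeSet<TransactionPriorityId>,
}

impl Container {
    fn with_capacity(capacity: usize) -> Self {
        Self { capacity, queue: BTreeSet::new() }
    }

    /// Insert a new id; when the container is full, the lowest-priority
    /// entry (old or new) is dropped so the best transactions are kept.
    fn push(&mut self, id: TransactionPriorityId) {
        self.queue.insert(id);
        if self.queue.len() > self.capacity {
            self.queue.pop_first(); // evict the cheapest transaction
        }
    }

    /// The scheduler always pops from the highest-priority end.
    fn pop_max(&mut self) -> Option<TransactionPriorityId> {
        self.queue.pop_last()
    }
}

fn main() {
    let mut container = Container::with_capacity(3); // real capacity: 700_000
    for (id, priority) in [(1u64, 10u64), (2, 50), (3, 5), (4, 99)] {
        container.push(TransactionPriorityId { priority, id });
    }
    // The priority-5 transaction was evicted; highest priority comes out first.
    while let Some(top) = container.pop_max() {
        println!("{top:?}");
    }
}
```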

An illustration of how the priority graph works

The graph has a fixed window size of 2048, termed the “look-ahead window size”, which means the scheduler looks ahead a certain number of transactions for read and write conflicts between the accounts specified in those transactions. In each iteration, the priority scheduler pops a certain number of transactions; in our example that number is 6, but in the actual implementation it is set to 128, which shrinks the window accordingly, i.e. in the first iteration the window size is reduced to 2048 − 128 = 1920 transactions. This batching of 128 transactions is just another optimization in the scheduling and look-ahead operations.

The window size is a tunable parameter used when popping transactions from the main queue, so the scheduler effectively has two queues: one built into the priority graph algorithm and the main queue. The graph “looks ahead” 2048 transactions, separates the non-conflicting transactions from the conflicting ones, and inserts them into its built-in priority queue; from there, 128 (now appropriately ordered) transactions at a time are scheduled onto the available threads. This is a dynamic process: once the top non-conflicting transactions have been scheduled onto threads for processing, new transactions are popped off the main queue, “looked ahead” and scheduled in the same way. Admittedly, this may take some imagination to visualize.
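The two-queue mechanic can be sketched roughly as follows; LOOK_AHEAD and BATCH mirror the 2048 and 128 mentioned above, and everything else is a simplified stand-in for the real prio-graph scheduler (conflict analysis is omitted here and illustrated in the next section).

```rust
use std::collections::{BinaryHeap, VecDeque};

const LOOK_AHEAD: usize = 2048; // look-ahead window size
const BATCH: usize = 128;       // transactions scheduled per pass

// A simplified pass: refill the look-ahead window from the main priority
// queue, then take one batch of the highest-priority transactions from it.
fn scheduling_pass(
    main_queue: &mut BinaryHeap<u64>, // priorities only, for illustration
    look_ahead: &mut VecDeque<u64>,   // stand-in for the prio-graph's window
) -> Vec<u64> {
    // Keep the window topped up to LOOK_AHEAD entries.
    while look_ahead.len() < LOOK_AHEAD {
        match main_queue.pop() {
            Some(tx) => look_ahead.push_back(tx),
            None => break,
        }
    }
    // Take the next batch from the front of the window. In the real scheduler
    // this is where conflict analysis reorders transactions into chains.
    let take = BATCH.min(look_ahead.len());
    look_ahead.drain(..take).collect()
}

fn main() {
    let mut main_queue: BinaryHeap<u64> = (0..5_000u64).collect();
    let mut look_ahead = VecDeque::new();
    let batch = scheduling_pass(&mut main_queue, &mut look_ahead);
    println!("scheduled batch of {}, window now holds {}", batch.len(), look_ahead.len());
}
```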

Transactions are interpreted as graph nodes. If the scheduler finds a read/write conflict between two or more transactions, those conflicting transactions become the only nodes connected by edges.

In the diagram above, tx.2 and tx.3 have a conflicting account, with tx.2 taking a read lock while tx.3 takes a write lock; the same is true for tx.4 and tx.6. The priority graph keeps track of these conflicting transactions in its state, and the scheduler uses this data to select the threads on which the transactions will be processed. Note that we assume tx.2 and tx.3 do not conflict with tx.4 and tx.6, otherwise all four transactions would have been scheduled on the same thread. Conflicting transactions are processed on the same thread, while non-conflicting transactions can be processed on different threads depending on the workload. This enables an even distribution of work without having to worry about inter-thread lock conflicts.

Scheduling conflicting transactions on the same thread.

We can observe how the conflicting transactions in our example are scheduled on the same thread for processing. Note that the ordering and graph structure in the example are simplified for our understanding; multiple transactions can also share a common conflicting transaction between them. The graph keeps track of non-conflicting high-priority transactions and of “chains” of sequential transactions based on their read and write locks, and schedules these independent chains onto different threads to achieve even better parallelization.
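One way to picture the chain building is the sketch below, which groups transactions into chains whenever they share an account with conflicting access. It ignores priorities and chain merging, and it does not use the actual prio-graph crate API; it is only meant to illustrate the idea.

```rust
use std::collections::{HashMap, HashSet};

struct Tx {
    id: u32,
    reads: HashSet<&'static str>,
    writes: HashSet<&'static str>,
}

// Two transactions conflict if either writes an account the other reads or writes.
fn conflicts(a: &Tx, b: &Tx) -> bool {
    a.writes.iter().any(|acct| b.reads.contains(acct) || b.writes.contains(acct))
        || b.writes.iter().any(|acct| a.reads.contains(acct) || a.writes.contains(acct))
}

// Group transactions into "chains": connected groups of conflicting transactions.
// Each chain must run sequentially; different chains can go to different threads.
fn build_chains(txs: &[Tx]) -> Vec<Vec<u32>> {
    let mut chain_of: HashMap<u32, usize> = HashMap::new();
    let mut chains: Vec<Vec<u32>> = Vec::new();
    for tx in txs {
        // Find an existing chain this tx conflicts with, else start a new one.
        let existing = txs.iter()
            .filter(|other| other.id != tx.id && chain_of.contains_key(&other.id))
            .find(|&other| conflicts(tx, other))
            .map(|other| chain_of[&other.id]);
        let chain = existing.unwrap_or_else(|| { chains.push(Vec::new()); chains.len() - 1 });
        chains[chain].push(tx.id);
        chain_of.insert(tx.id, chain);
    }
    chains
}

fn main() {
    let tx = |id: u32, reads: &[&'static str], writes: &[&'static str]| Tx {
        id,
        reads: reads.iter().copied().collect(),
        writes: writes.iter().copied().collect(),
    };
    // tx2 reads "mint", tx3 writes "mint"  -> same chain (same thread).
    // tx4 and tx6 conflict on "pool"       -> another chain.
    // tx1 touches neither account          -> its own chain (any thread).
    let txs = vec![
        tx(1, &[], &["wallet_a"]),
        tx(2, &["mint"], &[]),
        tx(3, &[], &["mint"]),
        tx(4, &[], &["pool"]),
        tx(6, &["pool"], &[]),
    ];
    println!("{:?}", build_chains(&txs)); // prints [[1], [2, 3], [4, 6]]
}
```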

Thread selection and load balancing mechanism

One thing to note is that the priority graph also manages its own priority queue in the form of a BinaryHeap. This queue only stores non-conflicting transactions that are ready to be executed on the worker threads. Worker threads are chosen based on the number of transactions already scheduled on them and on the pending transactions that are “in flight”, i.e. queued to be processed. Transactions are batched before being sent off to be scheduled and processed.

The scheduler also manages a ReadWriteAccountSet that stores the accounts that are read-locked and write-locked and compares them against the transactions popped off the graph’s main queue. This is used to verify whether transactions removed from the prio-graph queue touch accounts already locked by another thread.
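The lock bookkeeping can be sketched as a pair of sets, roughly as below. check_locks mirrors the idea of testing a popped transaction against accounts already locked by in-flight work; the names and signatures are illustrative, not the real ReadWriteAccountSet API.

```rust
use std::collections::HashSet;

// Illustrative stand-in for the ReadWriteAccountSet described above.
#[derive(Default)]
struct ReadWriteAccountSet {
    read_locked: HashSet<&'static str>,
    write_locked: HashSet<&'static str>,
}

impl ReadWriteAccountSet {
    /// A transaction passes only if its writes touch no locked account and
    /// its reads touch no write-locked account (write locks are exclusive).
    fn check_locks(&self, reads: &[&'static str], writes: &[&'static str]) -> bool {
        writes.iter().all(|a| !self.read_locked.contains(a) && !self.write_locked.contains(a))
            && reads.iter().all(|a| !self.write_locked.contains(a))
    }

    /// Record the locks once the transaction has been scheduled onto a thread.
    fn take_locks(&mut self, reads: &[&'static str], writes: &[&'static str]) {
        self.read_locked.extend(reads.iter().copied());
        self.write_locked.extend(writes.iter().copied());
    }
}

fn main() {
    let mut locks = ReadWriteAccountSet::default();
    assert!(locks.check_locks(&["oracle"], &["amm_pool"]));
    locks.take_locks(&["oracle"], &["amm_pool"]);
    // A second writer to "amm_pool" now fails the check and stays conflicting.
    assert!(!locks.check_locks(&[], &["amm_pool"]));
    println!("lock checks behave as expected");
}
```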

The scheduler maintains a global ThreadSet, i.e. the set of threads that can be used to schedule transactions; this thread set is determined by the number of non-vote transaction-processing threads set up in the validator configuration. The criteria for selecting threads (as documented in the code) are as follows:

  • If there are no locks, then all threads are schedulable.
  • If only write-locked, then only the thread holding the write lock is schedulable.
  • If only read-locked, a thread is write-schedulable only if it is the single thread holding all the read locks; otherwise, no threads are write-schedulable.
  • If only read-locked, all threads are read-schedulable.

Each transaction is popped from the main queue, and each account referenced by the transaction is scanned to determine whether it is to be read or written. The accounts that the transaction read- or write-locks are gathered, and each account is then evaluated against the pool of available threads. This involves calculating which threads are suitable for scheduling a read or a write on that individual account, taking into account its current lock status: no locks, only write locks, only read locks, or a mix of both. This assessment determines the ThreadSet that is eligible for the operation. If more than one thread is eligible, the scheduler picks the one with the least amount of work already scheduled on it.
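A simplified version of that per-account evaluation, using a bitmask as the ThreadSet and then picking the least-loaded eligible thread, might look like the following; the real agave implementation differs in its details, but the rules follow the list above.

```rust
use std::collections::HashSet;

// ThreadSet as a bitmask: bit i set => thread i is eligible.
type ThreadSet = u8;
const NUM_THREADS: usize = 4;
const ALL_THREADS: ThreadSet = (1 << NUM_THREADS) - 1;

// Current lock state of one account (illustrative, not the agave types).
struct AccountLocks {
    write_holder: Option<usize>,  // thread holding the write lock, if any
    read_holders: HashSet<usize>, // threads holding read locks
}

// Threads allowed to take a write lock on this account, per the rules above.
fn write_schedulable(locks: &AccountLocks) -> ThreadSet {
    if let Some(t) = locks.write_holder {
        1 << t                                              // only the write-lock holder
    } else if locks.read_holders.is_empty() {
        ALL_THREADS                                         // no locks at all
    } else if locks.read_holders.len() == 1 {
        1 << *locks.read_holders.iter().next().unwrap()     // the single reader only
    } else {
        0                                                   // multiple readers: nobody
    }
}

// Threads allowed to take a read lock: anyone, unless the account is write-locked.
fn read_schedulable(locks: &AccountLocks) -> ThreadSet {
    match locks.write_holder {
        Some(t) => 1 << t,
        None => ALL_THREADS,
    }
}

// Intersect the per-account sets for a whole transaction, then pick the
// least-loaded eligible thread based on its "in-flight" work.
fn select_thread(eligible: ThreadSet, in_flight: &[usize; NUM_THREADS]) -> Option<usize> {
    (0..NUM_THREADS)
        .filter(|&t| (eligible & (1 << t)) != 0)
        .min_by_key(|&t| in_flight[t])
}

fn main() {
    // Account already write-locked by thread 2: only thread 2 is eligible.
    let locks = AccountLocks { write_holder: Some(2), read_holders: HashSet::new() };
    let eligible = write_schedulable(&locks) & read_schedulable(&locks);
    let in_flight = [10, 3, 7, 1];
    println!("selected thread: {:?}", select_thread(eligible, &in_flight));
}
```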

In essence, this mechanism schedules read and write operations on accounts in a multithreaded environment. It ensures that operations land on threads without violating the accounts’ locking constraints, thereby managing concurrency and preventing conflicts like data races. This is crucial in a system where multiple threads might attempt to read or write the same account simultaneously.

Some notable improvements and benefits

  • The goal of the new scheduler is to eliminate failed account locking during competitive events like NFT mints or token launches. Compared to the older design, the new scheduler efficiently separates out batches of non-conflicting transactions by performing a “look-ahead” using the priority graph.
  • The legacy scheduler had an additional overhead: packets received and stored in temporary storage were popped just to decide whether they should be processed on the current thread. A key part of this process was deserializing and sanitizing the packets before locking the accounts for reading and writing. Sanitizing transactions in particular was resource-intensive, especially for versioned transactions that required resolving address-lookup tables to identify which accounts needed to be locked. As Andrew aptly puts it, “Anytime spent sanitizing is time spent not scheduling 😉”, referring to the idling of the scheduler. The overhead of sanitizing and then selecting transactions compounds during competitive events on Solana. The new scheduler eliminates this overhead by first collecting all the transactions that reach the scheduler within a 100 ms window, buffering and batching them into groups of 128, sanitizing them on demand, and only then inserting them into the transaction state container (a rough sketch of this flow follows the list).
  • The new scheduler also completely removes the forwarding process from the banking stage. The rationale is that almost 90% of the forwarded packets received by the next leaders were redundant: RPC nodes already forward transactions directly to the next leaders when the current leader cannot process the excess. Dropping forwarding significantly reduces egress and ingress costs and saves validator resources.
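As a rough illustration of the buffering flow described in the second bullet, incoming packets can be collected for a short receive window, chunked into groups of 128, and only sanitized right before insertion. Every name below is invented for illustration; the real receive window, sanitization checks and container insertion live in the agave code.

```rust
use std::time::{Duration, Instant};

const BATCH_SIZE: usize = 128;
const RECEIVE_WINDOW: Duration = Duration::from_millis(100);

// Illustrative stand-ins for deserialized packets and sanitized transactions.
struct DeserializedPacket { payload: Vec<u8> }
struct SanitizedTransaction { priority: u64 }

// Hypothetical sanitization: resolve lookup tables, verify structure, etc.
// Here it just rejects empty payloads to show that failures are dropped.
fn sanitize(packet: &DeserializedPacket) -> Option<SanitizedTransaction> {
    (!packet.payload.is_empty()).then(|| SanitizedTransaction { priority: packet.payload[0] as u64 })
}

// Collect packets for up to RECEIVE_WINDOW, then sanitize them in batches of
// 128 only when they are about to be inserted into the container.
fn receive_and_buffer(incoming: &mut dyn Iterator<Item = DeserializedPacket>) -> Vec<SanitizedTransaction> {
    let deadline = Instant::now() + RECEIVE_WINDOW;
    let mut buffered = Vec::new();
    while Instant::now() < deadline {
        match incoming.next() {
            Some(packet) => buffered.push(packet),
            None => break,
        }
    }
    buffered
        .chunks(BATCH_SIZE)
        .flat_map(|chunk| chunk.iter().filter_map(sanitize))
        .collect()
}

fn main() {
    let mut packets = (0..300u16).map(|i| DeserializedPacket { payload: vec![i as u8] });
    let transactions = receive_and_buffer(&mut packets);
    println!("buffered {} sanitized transactions", transactions.len());
}
```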

Insights

  • More efficient block packing leads to better use of Solana’s blockspace, and hence effectively more usable blockspace during high network traffic such as token launches or NFT mints; transactions unrelated to the mint or launch now have a much better chance of landing successfully.
  • More reliability in processing transactions that pay a higher priority fee, i.e. fewer failed priority transactions.
  • The new scheduler will pave the way for one of the major goals of Solana: Asynchronous block production. You can read more about it here.

Summary

The recent advancements in Solana’s transaction scheduler represent a significant leap forward in optimizing the blockchain’s performance, particularly during high-demand scenarios like NFT mints or token launches. The introduction of the Central Scheduler streamlines the transaction processing mechanism in a multitude of ways.

Primarily, the new scheduler addresses the inefficiencies of the historic design by implementing a more intelligent transaction selection process. The Priority Graph, a key feature of the new scheduler, effectively identifies and manages conflicts among transactions, ensuring that batches of non-conflicting transactions are processed efficiently. This look-ahead capability not only enhances throughput but also minimizes the time wasted on failed account locks and retries.

Furthermore, the elimination of the forwarding process in the banking stage is a strategic move to reduce redundancy and conserve valuable network and validator resources.

The new scheduler’s ability to batch transactions, perform on-demand sanitization, and intelligently allocate transactions to threads according to their lock status represents a sophisticated approach to transaction processing. By dynamically managing read and write locks in a multi-threaded environment, the scheduler effectively mitigates the risks of concurrency issues like data races.

Overall, these improvements in Solana’s transaction scheduler not only enhance the efficient use of Solana’s blockspace but also demonstrate a proactive approach to adapting to the challenges and demands of a rapidly evolving digital asset ecosystem. This development is a testament to Solana’s commitment to continual improvement and scalability, ensuring its position as a robust and efficient platform for decentralized applications.

A special thanks to Andrew Fitzgerald from Solana Labs for helping me understand the new scheduler; he’s the lead developer behind it, and I highly recommend following him on Twitter!

If you liked this and want to know more or connect, you can find me on Twitter: @harsh4786
