Micro Rollback in Bus-Connected Systems

CELSY PHILLIPS
Fault Tolerance in VLSI Circuits
Jun 7, 2021

Periodically checkpointing process states and rolling back to a previous state when an error is detected is a common technique for error recovery in distributed systems. If each process is checkpointed independently, rolling back one process may require rolling back a second process further in time, which in turn may cause a third process to roll back, and so on, leading to an uncontrolled domino effect. In the worst case, all processes in the system roll back to the state they had when the system was initialized. In the context of micro rollback, which is done at the level of hardware modules, the domino effect cannot occur in such a system. However, if the modules are connected in an arbitrary topology, where there are several independent communication paths between pairs of modules, the domino effect could potentially occur. Since the range of micro rollback is severely limited (a few cycles), this can make recovery impossible.
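The domino effect for independently checkpointed processes can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not from the original text: time is a single global integer, a checkpoint at time c captures every event up to c, and a receiver must roll back whenever it remembers a message whose send has been undone. The function name `recovery_line` and the example schedule are hypothetical.

```python
import math


def recovery_line(checkpoints, messages, failed, fail_time):
    """Return the checkpoint time each process ends up restoring.

    checkpoints: {process: sorted checkpoint times, always including 0}
    messages:    list of (sender, send_time, receiver, receive_time)
    A value of math.inf means the process did not have to roll back.
    """
    restore = {p: math.inf for p in checkpoints}
    # The failed process restores its latest checkpoint before the failure.
    restore[failed] = max(c for c in checkpoints[failed] if c <= fail_time)

    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            send_undone = sent > restore[sender]
            receive_kept = received <= restore[receiver]
            if send_undone and receive_kept:
                # The receiver remembers a message that was never sent:
                # it must restore a checkpoint taken before the receive.
                restore[receiver] = max(
                    c for c in checkpoints[receiver] if c < received
                )
                changed = True
    return restore


# Staggered checkpoints with messages crossing every checkpoint interval.
checkpoints = {"P": [0, 40, 80], "Q": [0, 20, 60]}
messages = [
    ("P", 10, "Q", 15),
    ("Q", 30, "P", 35),
    ("P", 50, "Q", 55),
    ("Q", 70, "P", 75),
]
line = recovery_line(checkpoints, messages, failed="Q", fail_time=90)
print(line)  # {'P': 0, 'Q': 0}
```

In this schedule a single failure in Q at time 90 drags both processes back to their initial state, because every checkpoint is separated from the next by a message that would otherwise be remembered without having been sent.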

Figure: Bus-Based Multi-Module System

At first glance, it appears that the domino effect can be a problem when micro rollback is used in common bus-connected systems (see the figure above). For example, in the system shown above, the following situation could occur:

(i) A rollback signal is raised in the main processor, which rolls back C cycles.

(ii) The main processor sends the FPU the number of transactions it must roll back.

(iii) The FPU rolls back, and its transducer determines that the MMU must also roll back because it interacted with the FPU during the last few cycles.

(iv) The MMU rolls back, and its transducer sends the main processor a number of transactions to roll back. As a result, the main processor is now required to roll back more than C cycles.

In a system where all modules are interconnected via a common bus, this problem can be solved by using bus transactions as a common logical clock. Bus transactions can be monitored by all the modules in the system and used for synchronization. There are two possible techniques for using bus transactions to achieve a consistent state following rollback:

1) Each module has one transducer which monitors generic bus transactions. Whenever a module detects an internal rollback signal, it converts the rollback distance to a number of bus transactions and puts that number on the bus. All the other modules read this number, convert it to an internal number of cycles, and roll back. The disadvantage of this method is that it generates unnecessary rollbacks: modules may roll back a certain number of system bus transactions even if they have not interacted with the rest of the system.

2) Each module has two transducers. The shift register (monitor) in the first transducer shifts every time a bus transaction is executed, and a one is shifted in if the bus transaction belongs to the module (a private bus transaction). The shift register in the second transducer shifts every cycle and shifts in a one if a private bus transaction is monitored. If a rollback signal is detected, the following conversion occurs:

Generic Bus Transactions → Private Bus Transactions → Internal Cycles

In this way, modules roll back only if necessary, but each module requires twice the amount of hardware. The delays are also doubled, which may make the implementation more critical for modules operating at high frequencies.
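The two techniques can be compared with a small software model of the shift registers. This is a rough sketch, not the hardware design from the reference: the class `Module`, its method names, the register depth, and the assumption of at most one bus transaction per internal cycle are all hypothetical choices made for illustration. Counting the ones in the newest entries models the conversion toward transactions, and locating the n-th most recent one models the conversion back to internal cycles.

```python
from collections import deque


def count_recent_ones(history, n):
    """Number of ones among the n most recent entries."""
    return sum(list(history)[-n:]) if n > 0 else 0


def age_of_nth_one(history, n):
    """How many entries, newest first, must be undone to undo n ones."""
    if n == 0:
        return 0
    seen = 0
    for age, bit in enumerate(reversed(history), start=1):
        seen += bit
        if seen == n:
            return age
    raise ValueError("rollback request exceeds the recorded history")


class Module:
    """Software model of one module's transducer shift registers."""

    def __init__(self, depth=16):
        self.cycle_any = deque(maxlen=depth)      # per cycle: any bus transaction?
        self.txn_private = deque(maxlen=depth)    # per bus transaction: private?
        self.cycle_private = deque(maxlen=depth)  # per cycle: private transaction?

    def tick(self, bus_transaction=False, private=False):
        """Advance one internal cycle (assumes at most one transaction per cycle)."""
        private = bus_transaction and private
        self.cycle_any.append(int(bus_transaction))
        self.cycle_private.append(int(private))
        if bus_transaction:
            self.txn_private.append(int(private))

    def cycles_to_generic(self, cycles):
        """Initiator side: internal cycles -> generic bus transactions."""
        return count_recent_ones(self.cycle_any, cycles)

    def rollback_technique_1(self, generic):
        """Generic bus transactions -> internal cycles."""
        return age_of_nth_one(self.cycle_any, generic)

    def rollback_technique_2(self, generic):
        """Generic -> private bus transactions -> internal cycles."""
        private = count_recent_ones(self.txn_private, generic)
        return age_of_nth_one(self.cycle_private, private)


# One (bus_transaction, private) pair per internal cycle, oldest first.
events = [(True, True), (False, False), (True, False), (True, False),
          (False, False), (True, True), (False, False), (True, False),
          (False, False)]
module = Module()
for bus_transaction, private in events:
    module.tick(bus_transaction, private)

for generic in (1, 3):
    print(generic,
          module.rollback_technique_1(generic),
          module.rollback_technique_2(generic))
```

In this example history, a request to undo one generic bus transaction costs two cycles under the first technique but none under the second, since the most recent transaction did not involve the module. A request to undo three costs six cycles versus four.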

References

  1. Tremblay, Marc & Tamir, Yuval. (2001). Fault-Tolerance for High-Performance Multi-Module VLSI Systems Using Micro Rollback.
