Understanding the Concept of Virtual Time Using the Time Warp Algorithm

Introduction

Gaurav Sarma
Apr 10 · 6 min read

What is virtual time and why do we need it?

As distributed systems have progressed and been adopted over the last decade, there have been numerous technologies in different segments like databases, caches, message queues, etc which are built on top of other frameworks which abstract away the difficulty of managing distributed systems. One of the most important and difficult things to manage in distributed systems is managing synchronicity using time.

Some common forms of synchronization techniques in distributed systems are block-resume, abortion-retry, lookahead-rollback. This post will cover the lookahead-rollback used by the Time Warp algorithm since, though not intuitive, it is leads to elegant and efficient solutions when the cons are compared for each of the techniques.

A virtual time system is a distributed system executing in coordination with an imaginary virtual clock that ticks virtual time. Virtual time itself is a global, one-dimensional, temporal coordinate system imposed on a distributed computation; it is used to measure computational progress and to define synchronization.

The main purpose is to have a single virtual time (which may also be in sync with the real time) across the system so that processes can always operate in an opaque manner when in actuality, it is an unpredictable entity.

All processes communicate with each other (either locally or remotely) via messages which mainly consist of 4 primary fields: sender, virtual send time, receiver, virtual receive time.

Some fundamental rules which must be observed for virtual time:

The virtual send time of each message must be less than its virtual receive time.

The virtual time of each event in a process must be less than the virtual time of the next event at that process.

It should also be noted that the virtual times of any event A and B should follow the above rules only if there are events which directly or indirectly have causality amongst A and B.

Lamport’s logical Clock

Lamport was one of the first to show that real time temporal order, causality between events had a strong connection to the concepts of relativity. He provides an algorithm which assigns ordered clock values to events once the execution of a distributed system starts. The Time Warp algorithm is an inverse of the Lamport alogrithm where they mainly have the time assigned to the event and rollback if any discrepancy is found

Reed’s pseudotime

Reed came up with the concept of pseudotime which seems similar to virtual time but is different in the sense that pseudotime where events are assigned multi-version timestamps which are used for concurrency control to understand the atomic time of the event in distributed systems where virtual time is more relative in nature such that it mainly deals with events which have happened before or after in comparison to others. Reed uses abortion-retry for his algorithm where there may be starvation, unlimited retries, deadlocks, etc.

Schneiders’s work

Schneider’s algorithm mainly consists of broadcasting also synchronized messages to all processes and not proceeding till acknowledgements are received from each of them. Keeping all synchronized events in their local memory, processes are able to make decisions locally about the order of events. However, the algorithm doesn’t scale with the requirements of today’s scenario where the broadcast messages and acknowledgements needs to be performed for tens of thousands of servers for millions of messages.

Time Warp algorithm

Each process maintains its own virtual clock which is changed only between events. Each process has 3 queues; the input message queue, output message queue and the state queue. Keeping in mind the virtual time rules mentioned in the previous section, the each event in the local queue of the process should always be in increasing order.

There are extremely probable cases where events with virtual time lesser than the current time in the process may arrive, which violates the rules of virtual time. In such cases, the time warp algorithm follows the lookahead-rollback algorithm, which results in rolling back to the point where the incoming message fits appropriately with the processes’ virtual time and remaining messages have to be replayed in the correct sequence.

The rollback of messages cannot be narrowed down to a single processor alone since the messages sent in the wrong sequence also affects other processes which maintain their own queues. Hence, there has to be a rollback for the error messages in other processes as well which are connected to each other. Other processes which don’t have any direct or indirect connection to the message with error sequence won’t be affected, thus leading to a lesser network rework effect.

For rollbacks, there is a concept of antimessages in the system which can also be used for other purposes. All fields of messages and antimessages are the same except for one field, which is the sign of the message. All messages which have been sent to other processes have a (+) sign and their antimessages have a (-) sign. Whenever a message is sent, the message is stored in the receiver’s input queue and the antimessage is stored in the sender’s output queue.

Whenever a message and antimessage exist in the same queue, both the messages cancel each other out and are therefore removed.

The messages and antimessages are created together and can exist in different queues.

Coming back to the rollback case where we had to rollback the sent messages to other processes to maintain the new correct order, the process generates and sends antimessages for all the incorrect messages sent to other processes which do the same locally as well, thus leading to ripple effect of efficient rollbacks. Even in cases where the antimessage arrives before the actual error message in another process, both the messages will be annihilated since the same queue cannot have both the messages ultimately bringing all processes to the correct order.

As we go through the above mechanism, we need to understand that all the steps listed above are happening in the local context of a process, which doesn’t have knowledge of a global virtual clock yet. Not having a global value doesn’t allow efficient memory management of the queues since we see that messages have to be kept in their queues indefinitely which can be a problem at scale.

We introduce the concent of Global Virtual Time (GVT), which at real time r, is the minimum of the following

  • All virtual times in all virtual clocks at r
  • Virtual send times which have been sent but have not been processed/received at time r.

From the above definition, it becomes evident that at any point in time in entire system, the GVT will also be the lowest value or the floor value in the system. The GVT symbolizes that the messages below the GVT have been processed in the correct order and can be forgotten about, which the local processes can use to clean up their queues.

The delay in calculating the GVT is the total delay in sending a broadcast message to the processes.

Conclusion

A common feedback of the Time Warp algorithm contains the argument that rolling back across thousands of processes may not be feasible in the real life use cases. However, there is a point to be made that the rollbacks are termed as exceptions in real life use cases and not the norm. Since processes folllow the temporal locality principle, it makes more sense that events would arrive in the actual order and the events which arrive in the past arrive in the recent past which means the rollbacks are lesser.

The only alternative to lookahead/rollback is for the process to be blocked (i.e., doing nothing) for the same length of real time as the lookahead computation, which is just as much of a “waste.”

Virtual time is strongly analogous to Virtual Memory, the same concept of memory management where the most optimal pages are kept in the main memory. There are numerous efficient algorithms which try to determine the pages which are kept in the main memory, where lookahead section of data files prefetch a few pages in blocks and keep as it’s the usual trend of the user. In cases where a required page is not found in the main memory, there is a page fault which swaps out the least optimal pages from main memory and replaces it with the required page which is again similar to the rollback mechanism. The analogy can be extended in various ways to suggest that virtual time can be implemented in an elegant and efficient manner in distributed systems.

I hope you liked the article. Please let me know if you have any queries regarding the article. Happy reading!!

References:

Geek Culture

Proud to geek out.

Sign up for Geek Culture Hits

By Geek Culture

Subscribe to receive top 10 most read stories of Geek Culture — delivered straight into your inbox, once a week. Take a look.

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store