Application of synchronized clocks in Distributed Systems

Introduction

8 min readDec 4, 2018

Most distributed systems make use of time in some form of another. For time to be useful, clocks on different nodes in the network need to be synchronized in some way. Protocols such as NTP ensure such synchronization within a certain margin of error. It is important to note that NTP like protocols only guarantee synchronization with a certain level of probability. There are scenarios when clocks might get out of synch. In addition, NTP servers are connected to real universal time clocks that are sampled to ensure that the time is close to real-time, again within a certain margin of error.

Since clocks can drift sometimes, synchronized clocks are useful in distributed systems for improved performance and only occasionally for ensuring correctness. In such cases, synchronized clocks act as a proxy for some message exchange between two nodes in a distributed system. One can make local decisions(relying on past information) without having to communicate with another node every time. Barbara Liskov’s paper covers some such practical applications of synchronized clocks. Let’s go over these applications.

At-most-once messages

Protocols such as SCMP, rely on at most once delivery of messages i.e. duplicates are not tolerated. For this to work, the receiver of the message would need to know if the sender has sent this message before. Receiver cannot potentially have infinite storage and hence cannot store every message from the sender. In such cases, synchronized time can be used to establish “recentness” and to discard all messages that are not recent as duplicates. While this approach risks rejecting some duplicate messages, it does so with very low probability as long as “recentness” has decent enough lifetime.

If both sender and the receiver have synchronized clocks, then sender can send the current time with the message to the receiver. Receiver stores a table of timestamps, per connection. The receiver can prune this table and remove entries that are older than the lifetime i.e. cur_time — lifetime. When messages are removed from the table the highest time is also stored called upper. When a new message lands on the receiver that has an entry in the table, timestamp is compared against the recorded timestamp in the table. If it is less than the stored timestamp then it is rejected as duplicate. If a connection does not have an entry in the table, then the timestamp of the message is compared against “upper”. If incoming timestamp is less than upper, then this message is likely very old and hence rejected as a duplicate with a very low probability of rejecting a previously unseen message. If incoming timestamp > upper, then it means that we have not seen a message this recent on this connection and hence a new entry is created in the connection table.

Another interesting variation of this is how to handle process crashes. One could potentially commit the connection timestamp table mentioned above, every so often to the disk. But that can be a little expensive. Another approximate but effective way to deal with this is to introduce another timestamp “latest”. On the receiver, “latest” is assigned to current_time + recovery_window(some seconds) and this latest is written to disk every few seconds. Any message later than “latest” is discarded or delayed. This is not too meaningful during normal operations. But it is useful during crash recovery. When the receiver crashes and comes back up after some time, it initializes “upper” to last “latest” that was committed to disk. Now any message with timestamp < upper is discarded as a potential duplicate, because only messages before “latest” could have been accepted before the crash. Good values of “recovery_window” ensure that not too many messages are rejected as potential duplicates and also minimizes the writes to disk.

Detecting validity of encrypted tokens

Systems such as Kerberos, use session tokens for mutual authentication between client and servers with the help of a trusted third-party that issues session tokens. Private keys are used to exchange encrypted information and messages are trusted as long as they can be verified. In such scenarios, it is critical to guard against tokens that are just left over at a shared workstation and where user hasn’t logged out. To avoid, perennial use of such tokens, each token also contains an expiry. If current_time at the server > expiry, then such tickets are rejected. If synchronized time wasn’t available, then the server would have to ask the token granting third party for validity of the tokens. Here server’s time acts as a proxy for the third-party time.

Another useful feature is to detect replays. In kerberos, an authenticator is created by encrypting client’s current time and it’s private key which is also shared with the server. Server can verify the authenticator be decrypting it. Server can also store all recent authenticators — this is similar to at-most-once messages. It can detect replays of the authenticator using similar schemes described in the last section.

Cache consistency

Cache consistency in a distributed system is about ensuring that modifications to objects on some node is visible on some other node so that all nodes have a consistent view of the modified object. This wouldn’t be an issue if every read and write went to a primary node or a server. But generally for better performance many cached replicas of the object are present on different nodes. Without synchronized clocks, messages would need to be exchanged for every operation to ensure that the object be read or written to, is the most recent one. Let’s see how time/clock based leases simplify this.

In a write through cache, where any write is reflected on the server write away. In such cases a time based lease can be given to all nodes caching this object. When a modification is attempted, the server will notify all clients holding the lease and ask them to give up the lease. When clients give up the lease, write goes through. When there are no writes, all the nodes that have valid lease(based on local time) can continue to use the object without asking the server.

In a write-behind cache, changes are only written back to the primary or the server asynchronously or after some time. So system can be more prone to inconsistent views and there is no well defined event if write due to the asynchrony. For this type of cache, leases can be modified to have read and write leases. The leases still have valid expiry times. Read leases on different nodes can be used without conflict, while write leases conflict with other leases. In this case, whenever a node asks for a write lease, other nodes are asked to relinquish their leases. This is how consistent views are ensured. In all other cases, whenever client uses a file, it checks for the validity using expiry of the lease.

One interesting discussion point on using clocks in leases(or cases mentioned before), is that clocks can drift. A machine can fall behind and have a slow clock or slow synchronization for some period e.g. a client with a slow clock may end up reading an stale/invalid cached version. This will continue while the clock catches up and lease expires eventually.

While this is not too terrible, it would be good to be able to ensure cache consistency in all cases. It can be done by adding a higher level primitive such as a transaction. Each object can have a version number and server stores version numbers along with the objects. Clients can store objects and will have leases still. But when changes are made, new versions along with objects are sent in a transaction to the server. Server checks for the correct transaction numbers from the client and aborts the transaction if client doesn’t have the right version number of objects. Server will also notify clients holding leases when some object is changed on the server. This helps with minimizing the aborted transactions. Here also clocks can drift, but the worst it can do is to cause an aborted transaction.

Replication

Consider a file system that uses primary-copy-replication technique. In this primary owns the files and creates replica on sub-majority backups. Sub-majority is one less than majority i.e. If there are 9 nodes in the system, then 4 backups are chosen because together with the primary they form a majority. This ensures that when many nodes go down and at least any half of nodes stay up, at least one of those nodes will have the needed file. When this situation occurs, the owner of the file may change. In normal course of operations, a log is maintained for recording change operations. Owner sends the log to backups and they append it to their logs and when acknowledgment is received from sub-majority, the operation is committed. Read operations are not interesting for the purposes of the log, because they don’t change any objects. This is problematic in case of a network partition. If the old owner is separated from the new owner via a partition, then old owner could conceivably keep on reading stale file, while the new owner is modifying and changing this file. This can be addressed via leases.

When owner assigns the file to sub-majority it also get leases from the sub-majority. While owner has leases from sub-majority, it can do the read operation unilaterally. During partition, owner is not able to communicate with these backups. So when some other backup becomes the new owner for the file, it waits for the lease given to previous owners to expire. Once the previous owner’s lease has expired, the new owner can change/own this file without having to worry about stale use at the previous owner.

Discussion on synchronized rates

Some algorithms may not need synchronized clocks and can use synchronized rates instead. As long as all nodes tick at the same rate, there may not be a need to have synchronized clocks at all the nodes. Consider the case of leases. In those cases instead of using expiry time, one could use time to expire. Time to expire can be added to local time to establish final expiry. When a lease is expiring at the client, it is in response to some message that it sent. So expiry can be established with regards to the initial message sent by the client to obtain the lease. For server to establish the lease, server’s reference time can be the local time at which it sent the response to the client. When there are such dependent events, synchronized rates suffice and clocks are not necessarily needed.

In some cases, rates are just not sufficient and synchronized clocks are necessary. Consider the case of at-most-once messages. In that case, just sending time to expire is not sufficient. On the receiver node, the time-to-expire has no meaning, because receiver node didn’t initiate any message with the sender. In such cases, expiration time i.e. synchronized clocks are necessary.

Conclusion

While this paper is quite old, concepts mentioned in this paper are still quite relevant. Synchronized clock and rate based algorithms provide an efficient mechanism for making local decisions without having to communicate with remote nodes.