All Things Clock, Time and Order in Distributed Systems: Logical Clock vs Google True Time

Kousik Nath · Published in Geek Culture · Mar 15, 2021 · 9 min read

In the last article, we looked deeply into logical clocks, especially vector clocks and version vectors, and analyzed how useful they are in real life for designing eventually consistent systems like Riak or Amazon DynamoDB. But not every eventually consistent system uses a vector clock.

How Cassandra Handles Ordering

Cassandra is a NoSQL distributed wide-column storage system which is also eventually consistent, yet it does not use any logical clock to order events. There is a reason for it:

Key-Value (KV) stores like Riak and DynamoDB are like big hash tables: they map an immutable key to a value. DynamoDB stores the value in a JSON-serialized format; Riak is content-type-agnostic and can store any type of value, but the client needs to provide the proper content type while storing the data. Let's consider a simple use case in DynamoDB:

Let's say we are storing user details; the object is stored as below in DynamoDB:

{
  "name": "kousik",
  "mobile": "9090909090",
  "email": "knath@test.com",
  "address": "xyz abc"
}

Q. What happens if you want to update the field mobile?
A.
The following sequence of operations happens in order to update any attribute in the stored data:

  1. You first read the data which fetches the full user object.
  2. The object attribute mobile is updated.
  3. Finally, the whole object is saved in the store.

Q. What happens if you want to update the email attribute?
A.
The same sequence of operations as described above happens.

Q. What if two different clients want to update mobile and email separately in a concurrent fashion?
A.
Even though each of them updates a single attribute individually, the whole object is read-modified-written, producing two potentially conflicting versions. Although mobile and email look like independent attributes that could be updated without any conflict, a conflict happens because each client fully overwrites the object itself.
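
To make the race concrete, below is a minimal Python sketch of two clients doing a read-modify-write against a toy in-memory store. The store, its API, and the interleaving are all hypothetical; they only illustrate why whole-object writes conflict.

import copy

# A toy in-memory KV store: get() returns the whole object, put() overwrites it.
store = {"kousik": {"name": "kousik", "mobile": "9090909090",
                    "email": "knath@test.com", "address": "xyz abc"}}

def get(key):
    return copy.deepcopy(store[key])

def put(key, obj):
    store[key] = obj

# Client A and client B both read the object before either one writes.
a_copy = get("kousik")
b_copy = get("kousik")

a_copy["mobile"] = "7893839393"        # A updates only mobile...
b_copy["email"] = "knath222@test.com"  # ...B updates only email.

put("kousik", a_copy)  # A writes the whole object back.
put("kousik", b_copy)  # B's write overwrites A's: the mobile update is lost.

print(store["kousik"]["mobile"])  # prints 9090909090, A's update vanished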

Cassandra bypasses such conflicts by enabling independent attribute updates even though they belong to the same object. When you create the above object in Cassandra, you issue the following command:

CREATE TABLE user (
    name text PRIMARY KEY,
    mobile text,
    email text,
    address text
);

INSERT and UPDATE requests happen as below:

INSERT INTO user (name, mobile, email, address)
VALUES ('kousik', '9090909090', 'knath@test.com', 'xyz abc');

UPDATE user SET email = 'knath222@test.com' WHERE name = 'kousik';

UPDATE user SET mobile = '7893839393' WHERE name = 'kousik';

Cassandra is designed to handle each column separately. You can update individual columns just as you do in traditional relational databases. Metadata like the last update time is maintained for each column, and an update to a particular column of some matching row affects only that column. Thus updates to columns are finer-grained.

Q. What happens if two different clients update the same column for the same key?
A.
Cassandra applies a Last Write Wins (LWW) strategy to resolve conflicting updates. Since updates are fine-grained, at the level of individual columns, it is much less likely in practice that all clients end up updating the same column concurrently; their updates tend to be distributed across columns. Thus Cassandra survives most conflicting updates even when clocks are only coarsely synchronized to NTP. Still, it's good practice to always keep the clocks synchronized to NTP with the highest possible accuracy; a good article on this can be found here.
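
To illustrate the idea, here is a minimal Python sketch of per-column LWW merging, assuming each column value carries the timestamp of its last write. The data layout and the merge function are simplifications for illustration, not Cassandra's actual storage format.

# Each column is stored as (value, write_timestamp); the merge keeps,
# column by column, whichever write carries the larger timestamp.
def lww_merge(row_a, row_b):
    merged = {}
    for col in row_a.keys() | row_b.keys():
        a = row_a.get(col, (None, -1))
        b = row_b.get(col, (None, -1))
        merged[col] = a if a[1] >= b[1] else b
    return merged

# Client A updated email at t=105; client B updated mobile at t=107.
replica_a = {"email": ("knath222@test.com", 105), "mobile": ("9090909090", 100)}
replica_b = {"email": ("knath@test.com", 90),     "mobile": ("7893839393", 107)}

print(lww_merge(replica_a, replica_b))
# {'email': ('knath222@test.com', 105), 'mobile': ('7893839393', 107)}
# Neither update is lost, because the conflict is resolved per column.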

Q. Still there is a chance that Cassandra loses data due to conflicting updates in the same column, right?
A.
Yes, technically it's possible. However, since updates are spread across columns, the impact should be smaller. Keeping the clocks synced to NTP through an appropriate daemon helps as well.

Okay! We have now seen different variants of eventually consistent systems in the previous and the current article. What about strongly consistent systems? They also have data versioning challenges. How do they solve them?

Google True Time

Getting strong consistency in a distributed system is a huge challenge. As we have already seen, time is fickle: a particular moment is simply not a constant that can be defined across machines. When we use timestamps in our code, we are accustomed to using a language-specific library to get an epoch or time-zone-based instant while committing a transaction, which gives us a false impression that time is constant. However, there is always some uncertainty; time across nodes is not the same, see this article to get a better idea. The uncertainty happens because:

  • Nodes are situated far from each other across different geographic locations, which induces communication latency across the network.
  • The network is not uniform, causing variable delays.
  • The physical quartz-based clocks sitting in the nodes drift at different rates, so they diverge from each other.

Q. So, if we manage to optimize the communication and network latency, should the uncertainty reduce?
A.
Possibly, but optimizing the public network is difficult. Why not create your own private network, where the network communication can be kept under control? This is exactly what Google does.

Google created a distributed SQL database called Spanner, which relies on something called True Time for very strong consistency of transactions across nodes. Google knows that time is uncertain, so True Time defines a bounded and small uncertainty window of time within which transactions cannot be definitively ordered. True Time works as a global time across Google datacenters.

True Time is expressed as a time interval [earliest, latest]. It exposes an API called TT.now() that returns this interval; the absolute time at which the call is made is guaranteed to lie within it. The uncertainty interval varies between 1 ms and 7 ms; note that the maximum uncertainty has a tight upper bound.

The APIs TT.before(t) and TT.after(t) take a timestamp as input and answer whether the given timestamp is definitely before or definitely after the current uncertainty interval; TT.earliest() and TT.latest() denote the two endpoints of that interval.

The relation between TT.earliest(), TT.latest(), and the absolute time of an event is:

TT.earliest ≤ Absolute Time of current event ≤ TT.latest
Figure 1, True Time API
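
Here is a minimal Python sketch of the True Time interface described above. The uncertainty bound, the local clock source, and the class shape are assumptions made for illustration; this is not Google's implementation.

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # the absolute time is guaranteed to be >= earliest
    latest: float    # ... and <= latest

EPSILON = 0.007  # assumed maximum uncertainty bound (7 ms)

def tt_now() -> TTInterval:
    """Stand-in for TT.now(): the true absolute time lies inside the interval."""
    t = time.time()  # a real implementation would use GPS/atomic references
    return TTInterval(t - EPSILON, t + EPSILON)

def tt_before(t: float) -> bool:
    """TT.before(t): True if t has definitely not yet arrived."""
    return t > tt_now().latest

def tt_after(t: float) -> bool:
    """TT.after(t): True if t has definitely passed."""
    return t < tt_now().earliest

ts = time.time() + 1.0
print(tt_before(ts), tt_after(ts))  # True False: one second ahead is still in the future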

True Time guarantees that if a transaction T1 commits before another transaction T2 starts, then T1’s commit timestamp is smaller than T2’s. In fact, this is one of the guarantees given by Google Spanner.

True Time is internal to Google datacenters only, and the beauty is that from whichever Google datacenter the APIs are called, the tight bound on the uncertainty always remains the same.

The million-dollar question is: how does Google guarantee such a tight upper bound?

Google pulls off this magic with a couple of tricks:

Optimized Infrastructure: Google's infrastructure runs on a specially designed private network. They have optimized the network over time; it has a lot of redundancy of connections across datacenters and failure-handling mechanisms built in. This does not mean network partitions don't happen or things don't go wrong; however, the possibility of such incidents and the communication latency are reduced a lot.

Using their own clocks: True Time does not rely on external NTP pools or servers. Rather, Google datacenters are equipped with GPS receivers and atomic clocks. See the picture below of such an installation:

Figure 2, Time master

Every datacenter across geographies has one or more time servers, or time masters. There are two kinds of time masters:

  • GPS Time Master: The majority of the time masters are GPS-based. These nodes are equipped with GPS receivers which receive GPS signals directly from satellites and interpret the current time from them. GPS antennas are installed in such time servers. GPS time masters are spread across datacenters to reduce the effect of antenna failures, signal interference issues, etc.
  • Armageddon Master: These masters are equipped with local atomic clocks. Atomic clocks are used as a supplement to GPS time masters in case satellite connections become unavailable.

All masters’ time references ( the atomic and GPS clocks ) are regularly compared against each other. Each master also cross-checks the rate at which its reference advances time against its own local clock, and evicts itself if there is substantial divergence. — Google Spanner paper

The client machines (the servers where applications actually run and which need True Time timestamps) run a daemon process that polls different time masters periodically (e.g., every 30 seconds).

Figure 3. Client Polling Architecture

To reduce the possibility of error from any single time master, as you can see in Figure 3, the client daemons poll time information from GPS and atomic time masters belonging to nearby as well as farther datacenters.
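
This polling interval is what keeps the uncertainty bounded. Between polls, the daemon must assume its local clock drifts at some worst-case rate; per the Spanner paper, an applied drift bound of 200 μs/s over the 30-second polling interval, plus roughly 1 ms of base uncertainty from communicating with the masters, yields the sawtooth between about 1 ms and 7 ms. A quick back-of-the-envelope check in Python:

# Worst-case drift bound and polling interval, as stated in the Spanner paper.
drift_rate = 200e-6      # 200 microseconds of assumed drift per second
poll_interval = 30.0     # the daemon polls the time masters every 30 s
base_uncertainty = 1e-3  # ~1 ms from communication with the masters

max_epsilon = base_uncertainty + drift_rate * poll_interval
print(round(max_epsilon * 1000, 3))  # 7.0 ms, matching the upper bound above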

Daemons apply a variant of Marzullo’s algorithm to detect and reject liars, and synchronize the local machine clocks to the non liars. — Google Spanner paper
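
For intuition, here is a small Python sketch of the core of Marzullo's algorithm: given the time intervals reported by several masters, it finds the smallest interval consistent with the largest number of sources, so that sources falling outside it can be rejected as liars. This illustrates the classic algorithm, not Google's actual daemon code.

def marzullo(intervals):
    """intervals: list of (earliest, latest) pairs, one per time source.
    Returns (sources_agreeing, (lo, hi)), the best agreement interval."""
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))  # interval begins: one more source agrees
        events.append((hi, -1))  # interval ends: one fewer source agrees
    # Sort by offset; at equal offsets, starts come before ends.
    events.sort(key=lambda e: (e[0], -e[1]))

    best, count, best_lo, best_hi = 0, 0, None, None
    for i, (offset, delta) in enumerate(events):
        count += delta
        if count > best:
            best, best_lo = count, offset
            best_hi = events[i + 1][0]  # the next event closes this region
    return best, (best_lo, best_hi)

# Three sources agree around 10.0; the fourth, a "liar", is far off.
print(marzullo([(9.9, 10.1), (9.95, 10.2), (9.8, 10.05), (12.0, 12.1)]))
# (3, (9.95, 10.05)) -- the liar's interval is rejected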

Cool! We now know how the True Time servers and clients interact with each other. However, how do applications overcome the uncertainty interval?

Google Spanner applies a very simple strategy: while committing a transaction, just wait for the uncertainty period to get over, i.e., wait at most 7 ms more before acknowledging the commit. Since all transactions wait out the uncertainty, the system guarantees very strong consistency from the customers' point of view (Google calls it external consistency: the strongest consistency level, stronger than the usual strong consistency).
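
Here is a minimal Python sketch of this "commit wait", reusing the hypothetical tt_now() helper from the True Time sketch above; the transaction model is invented purely for illustration.

import time

EPSILON = 0.007  # assumed 7 ms maximum uncertainty, as before

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def commit(apply_mutations):
    # Pick a commit timestamp no earlier than the latest possible current time.
    commit_ts = tt_now()[1]
    apply_mutations(commit_ts)
    # Commit wait: block until commit_ts is definitely in the past everywhere,
    # i.e. until TT.now().earliest has moved past it (roughly 2 x EPSILON here).
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    # Only now acknowledge the client; any transaction that starts after this
    # point is guaranteed to get a strictly larger timestamp.
    return commit_ts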

Q. Don't True Time time masters become a bottleneck at such a high scale?
A.
Theoretically, yes. But given that there is enough redundancy of time servers across geographies and in the internal communication system, it's reliable at Google scale.

Thus, on Google's internal networks, Google does not need to rely on external NTP servers or even logical clocks for partial ordering; rather, it can use True Time across its services to properly order events and transactions.

The Comparison With Logical Clocks

We saw in the second article of this series that logical clocks identify concurrent updates, but at the cost of a high number of conflicts, siblings, or extra metadata which maps siblings to their particular versions. We also saw that a Last Write Wins strategy can help us resolve conflicts, but at the cost of lost information, although Cassandra's approach is better at handling such situations.

Google True Time avoids all these issues, however, at the cost of communication infrastructure and clock installations. Google anyway has to keep optimizing its network to serve the massive scale at which it operates, but not every company can take the same approach to event and transaction ordering.

But it's absolutely good to know how different companies solve problems in very different and innovative ways.

The Success of True Time

Google relies on True Time across dozens of datacenters spread across continents, running tens of thousands of servers. Through Spanner, some extremely popular and heavily loaded services like AdWords (Google's most significant moneymaker), Gmail, Google Photos, and the Google Play store depend on True Time.

Amazon Time Sync Service

Inspired by Google True Time, AWS also manages its own fleet of atomic clocks and GPS clock receivers. Any EC2 instance can connect to these time references via NTP using the chrony daemon to get more accurate time, rather than connecting to external NTP pools or time servers over the public internet. More details can be found here. Leap second smearing is also handled by Amazon Time Sync Service.
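
For example, on an EC2 instance the service is reachable at the link-local address 169.254.169.123, and pointing chrony at it takes a single configuration line; the exact file path varies by distribution, and the options shown follow AWS's published guidance at the time of writing:

# In /etc/chrony.conf (or /etc/chrony/chrony.conf on some distributions):
server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4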

Problems with True Time

Google True Time looks like an alternative to logical clocks, since essentially both try to solve the ordering problem in their own way. However, as we have seen, it's not only about installing atomic and GPS clocks; rather, the communication infrastructure optimization is the game changer here.

Though GPS receivers and atomic clocks are possibly affordable, not every company has its own privately owned network; most rely heavily on AWS, Azure, GCP, or OCI over the public network, which essentially means communication overhead affects precise time calculation.

Even if companies run their own private networks, they probably won't take on the cost and burden of installing precise atomic and GPS clocks and optimizing their networks. On the other hand, Amazon Time Sync is a free service, but it's not very clear how accurate it is and whether it can be reliably used alone to order events.

So, is there any alternative to logical clocks or a True Time-like system which can still give us some sort of strong consistency without all these headaches?

We’ll explore that in the next article in the series, stay tuned!

References

  1. https://aphyr.com/posts/294-call-me-maybe-cassandra
  2. https://stackoverflow.com/questions/34898693/why-cassandra-cluster-need-synchronized-clocks-between-nodes
  3. https://blog.rapid7.com/2014/03/17/synchronizing-clocks-in-a-cassandra-cluster-pt-2-solutions/
  4. https://www.wired.com/2012/11/google-spanner-time/
  5. https://news.softpedia.com/news/Google-Operates-a-Massive-Worldwide-Database-with-GPS-and-Atomic-Clocks-310093.shtml
  6. https://amulya-bhatia.medium.com/cloud-spanner-almost-all-you-need-to-know-f1c1fa471df
