Time Synchronization in Distributed Systems

Gaurav Kumar Srivastava
stackspacearena
Published in
7 min readMar 17, 2024

Consider a Trading Platform where multiple traders across different geographical locations are executing trades on the platform. The order of execution is important to ensure consistency in such a system. Similarly transaction settlement rely on correct timestamps to ensure trade settlements are fair and accurate. This essentially means that all the servers handling trade orders and settlement should be in sync with respect to time.

Let us consider another example of a distributed system where multiple nodes collaborate to process customer orders. Each order consists of several steps, such as placing the order, verifying payment, and updating inventory. Let’s look at an example where the ordering of events is essential:

Order Placed(Event A): A customer places an order for a product on Node 1. This event involves creating a new order record in the system.

Payment Verification (Event B): The payment system on Node 2 verifies the payment for the order placed in Event A. This event confirms the financial transaction associated with the order.

Inventory Update (Event C): Once payment is verified, the inventory system on Node 3 updates the product inventory to reflect the purchase made in Event A.

In this example, the correct ordering of events is crucial to maintain consistency. Event B (Payment Verification) should not occur before Event A (Placing Order), as it depends on the existence of the order. Similarly, Event C (Inventory Update) should not happen before Event A or Event B, as it relies on both order placement and payment verification.

What is the time synchronization problem?

In distributed systems, a time synchronization problem refers to the challenge of ensuring that clocks across different nodes or machines are consistent with each other. This consistency is crucial for various tasks such as event ordering, coordination, scheduling, and logging within the distributed system. However, achieving perfect synchronization of clocks in a distributed environment is inherently difficult due to factors such as:

  1. Clock Drift:

Clocks in different machines may drift apart over time due to differences in hardware, temperature variations, or inaccuracies in crystal oscillators. Even small differences in clock rates can accumulate over time, leading to significant discrepancies.

2. Network Latency

Messages exchanged between nodes in a distributed system experience varying network latencies. These latencies can cause delays in message delivery, leading to inconsistencies in the observed timestamps at different nodes.

3. Measurement errors

Even with accurate hardware clocks, there’s always uncertainty in the time reported by each machine due to factors such as clock resolution and measurement errors.

4. Node Failures

Nodes in a distributed system may experience faults or failures, leading to transient or permanent inconsistencies in their clocks.

The time synchronization problem becomes particularly challenging when strict consistency requirements are necessary, such as ensuring the correct ordering of events in distributed transactions or coordinating real-time activities across multiple nodes.

*************************************************************

How do we solve this ?

A. Physical clock sync

To address the time synchronization problem, distributed systems employ various algorithms and protocols which are primarily used for synchronizing physical clocks in distributed systems. These algorithms focus on adjusting the clocks of individual machines or devices to achieve a common time reference across the network. Here’s how each of these algorithms deals with physical clock synchronization:

Network Time Protocol (NTP)

  • NTP is a widely used protocol for synchronizing clocks over a network.
  • It operates by adjusting the local clocks of machines based on time information obtained from reference time servers.
  • NTP aims to achieve accurate time synchronization between machines, typically with precision in the order of milliseconds or better.
  • It’s commonly used in scenarios where maintaining accurate time across a network of computers is essential, such as internet-based applications, network infrastructure, and server systems.
  • NTP runs on User Datagram Protocol (UDP), which in turn runs on IP.
NTP with high-precision timekeeping devices at top layer. (Source : Wikipedea)

Precision Time Protocol (PTP)

  • PTP is a more advanced protocol designed for achieving extremely high-precision clock synchronization, typically in the sub-microsecond range.
  • It utilizes hardware timestamps and sophisticated algorithms to compensate for network delays and asymmetries.
  • PTP is commonly used in applications where timing accuracy is critical, such as industrial automation, telecommunications, and scientific research.

Cristian’s Algorithm:

  • Cristian’s algorithm is a simple method for clock synchronization between a client and a time server in a distributed system.
  • It involves the exchange of timestamps between the client and server to estimate clock offsets and adjust the client’s clock accordingly.
  • While it’s not as precise as NTP or PTP, Cristian’s algorithm provides a basic mechanism for achieving clock synchronization in client-server architectures or systems with a centralized time source.

Berkeley Algorithm

  • The Berkeley algorithm is a clock synchronization algorithm that aims to synchronize the clocks of machines in a distributed system by adjusting them to a common time reference.
  • It operates by designating one node as the time server or master, which periodically polls the clocks of other nodes to determine their clock offsets and adjusts them accordingly.
  • The Berkeley algorithm is suitable for scenarios where a centralized time server can be designated and where moderate precision in clock synchronization is acceptable.

B. Logical Clock & Causal ordering

In distributed systems, logical clocks are used to order events based on their relative occurrence rather than relying on physical time. These clocks provide a way to establish a partial ordering of events across different processes, even if their physical clocks are not synchronized.

Causal ordering refers to the ordering of events based on their causal relationships. In a distributed system, events may be causally related if one event directly influences or depends on the occurrence of another event. Establishing causal ordering ensures that events are processed in a way that preserves causality, even in the absence of global clock synchronization.

Lamport’s Happened Before relationship:

For two events a and b,

a → b if a and b are events in the same process and a occurred before b,

or

a is a send event of a message m and b is the corresponding receive event at the destination process, or a → c and c → b for some event

Lamport timestamps and vector timestamps are approaches used in distributed systems to order events and deal with the synchronization of logical clocks rather than physical clocks. Here’s how they handle clock synchronization:

Lamport Timestamps

  • Lamport timestamps are based on logical clocks rather than physical clocks. Each process in the distributed system maintains a logical clock that represents the ordering of events.
  • When a process sends a message, it attaches its current Lamport timestamp to the message. Upon receiving a message, the receiving process updates its own logical clock to be greater than the maximum of its current timestamp and the timestamp received in the message, ensuring causal ordering of events.
An example of Lamport timestamps for events among processes running on different servers
  • Lamport timestamps do not require global clock synchronization. Instead, they rely on the ordering of events based on causality relationships, allowing processes to progress even if their clocks are not perfectly synchronized.
  • Use Cases: Lamport timestamps are useful in scenarios where strict global time synchronization is not feasible or necessary, such as in message-passing systems or distributed algorithms where causal relationships between events are important.

Vector Timestamps

  • Vector timestamps, also known as vector clocks, extend the concept of Lamport timestamps by maintaining a vector of timestamps instead of a single timestamp per process.
  • Similar to Lamport timestamps, each process maintains a vector of timestamps, where each entry corresponds to a process in the system. When a process sends a message, it increments its own entry in the vector. Upon receiving a message, the receiving process updates its vector by taking the maximum of each corresponding entry in its vector and the received vector.
  • Vector timestamps allow for more precise tracking of causality relationships and concurrency among events. They can distinguish between causally related events and concurrent events, providing finer-grained ordering.
  • Like Lamport timestamps, vector timestamps do not require global clock synchronization. They rely on the exchange of timestamp vectors between processes to establish event ordering.
  • Use Cases: Vector timestamps are particularly useful in distributed systems where causality relationships and concurrency need to be accurately tracked, such as in distributed databases, distributed file systems, or distributed version control systems.

*************************************************************

A peek into True Time

TrueTime is a concept used in distributed systems to provide a more accurate notion of time compared to traditional system clocks. In distributed systems, where multiple machines are communicating over a network, ensuring consistency in time across different nodes is crucial for various tasks such as logging, synchronization, and ordering events.

Google introduced TrueTime as part of its Spanner distributed database system. TrueTime relies on a combination of GPS and atomic clocks to provide a highly accurate time source. Here’s how it works:

  1. GPS Receivers: TrueTime utilizes GPS receivers to obtain timestamps from satellites. GPS provides a highly accurate time source, as the satellites are equipped with atomic clocks.
True time used by Google Spanner (Source : https://cloud.google.com/spanner)
  1. Atomic Clocks: In addition to GPS, TrueTime also employs atomic clocks. Atomic clocks are extremely precise timekeeping devices that use the vibrations of atoms to measure time.
  2. Clock Offset and Uncertainty: TrueTime doesn’t just provide a single timestamp but rather a range of possible times. It includes two values: a timestamp representing the current time and an uncertainty interval representing the possible error in the timestamp.
  3. Spanner’s Use of TrueTime: In Google Spanner, TrueTime is used for transaction ordering and consistency. Transactions are ordered based on the timestamps obtained from TrueTime, and the uncertainty interval helps ensure that transactions are correctly ordered even in the presence of network delays or clock inaccuracies.

By using TrueTime, distributed systems can achieve strong consistency guarantees across multiple nodes despite the challenges posed by network latency and clock synchronization issues. TrueTime provides a more reliable and accurate notion of time, which is essential for maintaining consistency and reliability in distributed systems.

*************************************************************

Sources : https://en.wikipedia.org/wiki/Clock_synchronization, https://cloud.google.com/spanner, https://lamport.azurewebsites.net/pubs/time-clocks.pdf

--

--