On Time

Time is deceptively complex. Rarely do we think of it, apart from passing moments of existential crises that we encounter throughout our lives. To software developers, it is perhaps one of the most dangerous traps. In software, you have a half-dozen switches that you must flip to decide on the particular flavor of time you’d like. Depending on your decisions time may freely travel backwards, forwards, or not at all. There are entire articles on the fallacies that programmers believe about time. Even apart from a programmer’s decisions, the decisions an administrator may make influence how time works. Time is hard.

For the majority of cases, being oblivious to time in a single program rarely results in issue. When time violates the assumption that it moves forward at a constant rate, we can consider it a corner case. Perhaps it’ll get the program into an inconsistent state, or crash it. The user might have to restart the program, but once they do, they’ll be back on their merry way.

If there’s some state involved — perhaps if these timestamps persist on disk — it might introduce new complications. Logs may print out of order, or worse, audit logs may be rendered invalid. These cases will typically repair themselves without user involvement. A worst case scenario might require the user to blow away their state, or restore from backup.

There are more serious situations in which a neglect of time can result in chaos. In sensitive applications, like health records, or financial systems, an improper understanding of time can be disastrous. As we design systems that are more sensitive to corner cases, such as drift and jitter, failure suddenly seems inevitable. I’m here to tell you that we can take back the uncertainty of time.

On Physical Time

Physical time is what most people consider as the concept of time. As mentioned earlier, it comes in a variety of flavors. Most people care only about local time, which is a type of time standard. These time standards dictate the rules by which to measure time. Local time is UTC (Coordination Universal Time), adjusted by a time zone. UTC itself doesn’t step forward at a constant rate, but is in fact based on yet another time standard, TAI (Atomic Time).

TAI is probably considered the purest type of time. TAI is computed by taking the weighted average of more than 300 atomic clocks, where the weight of each clock is derived from its stability. Each clock advances based on the frequency of vibration of individual molecules, making it extremely accurate. Unfortunately, since these clocks sit all around the world, we cannot determine the current TAI time. We can only determine it once we gather the measurements and do the calculus. In addition to this, there are certain truths about reality that make TAI time inconvenient to use day-to-day.

The real world doesn’t take 365 days to make an exact rotation. For this reason, we have leap years, but there are even smaller corrections that need to be made — leap seconds. These leap seconds are added to UTC on some schedule to ensure that UTC time lines up with the rotation of the Earth, taking into account the slight imperfections in the Earth’s orbit. This offset, of leap seconds, is what separates UTC from TAI.

Lastly, there is local time, which is a mess. Local time is UTC adjusted by some offset, which varies throughout the year based on daylight savings time (summer time). The rules to calculate these offsets are compiled into a database, called the tz database. This database is constantly changing depending on the geopolitical climate, policy changes, and whatever is happening in a given part of the world on any given day. Local time should never be stored in any state without proper metadata, even if it may quickly be rendered incorrect.

Wall Clock Time

In computers, physical time is wall clock time. It is obviously useful for measuring when events happen(ed) compared to the real world. However, keeping track of this type of time is quite complicated. It is measured by using a counter based on the frequency of an oscillator to determine that a period has passed. These ticks can be translated into units of time based on a known reference. Unfortunately, these oscillators tend to be quite unstable rendering them unusable over time. We can instead use atomic clocks — which leverage very stable oscillators — although embedding an atomic clock in every PC would be cost prohibitive. Due to the unstable nature of wall clocks, we cannot use them to order events, even on a single system, since they may act erratically. Even with the help of Atomic clocks, using wall clocks to keep track of events across systems is an exercise in futility.

To try to make wall clocks usable for tasks in which it is necessary to provide a global reference, we can employ NTP (Network Time Protocol). NTP synchronizes a host’s clock with others in a network. It uses a mechanism of tiering clocks that all eventually lead to reference time, a close relative of TAI. All clocks attempt to synchronize to this reference time. This forms a tree, distributing time throughout its leaves. Every node in this tree now has the same goal, allowing NTP, Stratum 0 time to be a global reference across machines.

Most systems that rely on wall clock time recommend using NTP, but unfortunately do not verify whether or not the system clock is synchronized, or whether it is sufficiently stable, and therefore accurate for use. This is because the typical APIs recommended to retrieve time do not care whether the clock is synchronized, or disciplined. clock_gettime, and gettimeofday are designed for speed and not correctness. Unfortunately, they do not provide enough fidelity to determine whether or not the value they return is usable.

The problem of getting a timestamp, and determining whether it can be used for a given purpose has existed long enough that people have tried to come up with a solution. Perhaps the most widely available solution is provided by RFC1589, “A Kernel Model for Precision Timekeeping.” Mechanisms exist in the Linux kernel by which to get this information with a specific API. These calls are rarely discussed, and even more rarely used.

adjtimex

The system call adjtimex(2) returns information about the time keeping standard in use. It returns a status code that states whether or not the clock is synchronized, and it sets the value of a buffer to the current time and some timex metadata that can be used to determine the usefulness of that timestamp. Such a call is more expensive than getting a trivial timestamp, but often, correctness must come at a cost. An alternative call, ntp_gettime(3), is somewhat faster, at the cost of providing less metadata about the provided timestamp.

struct timex {
int modes; /* Mode selector */
long offset; /* Time offset; nanoseconds, if STA_NANO
status flag is set, otherwise
microseconds */
long freq; /* Frequency offset; see NOTES for units */
long maxerror; /* Maximum error (microseconds) */
long esterror; /* Estimated error (microseconds) */
int status; /* Clock command/status */
long constant; /* PLL (phase-locked loop) time constant */
long precision; /* Clock precision
(microseconds, read-only) */
long tolerance; /* Clock frequency tolerance (read-only);
see NOTES for units */
struct timeval time; /* Current time */
long tick; /* Microseconds between clock ticks */
long ppsfreq; /* PPS (pulse per second) frequency
(read-only); see NOTES for units */
long jitter; /* PPS jitter (read-only); nanoseconds, if
STA_NANO status flag is set, otherwise
microseconds */
int shift; /* PPS interval duration
(seconds, read-only) */
long stabil; /* PPS stability (read-only);
see NOTES for units */
long jitcnt; /* PPS count of jitter limit exceeded
events (read-only) */
long calcnt; /* PPS count of calibration intervals
(read-only) */
long errcnt; /* PPS count of calibration errors
(read-only) */
long stbcnt; /* PPS count of stability limit exceeded
events (read-only) */
int tai; /* TAI offset, as set by previous ADJ_TAI
operation (seconds, read-only,
since Linux 2.6.26) */
};

If programs used the adjtimex or ntp_gettime API in lieu of other APIs to determine the current time, programs could make specific assurances to the user about time. The potential use cases are endless. For example, there are systems like Multi-Paxos that depend on time. They do not use wall clock time to establish global order, but instead for leader leases. These leader leases provide significant performance benefits to the system, but can compromise safety if they’re based on invalid assumptions. Even systems that are sloppy with their use of time can benefit from making sure that the actors in the system are using NTP and have a stable clock with bounded error.

A small step may be to check the status of time at the startup of a program. This may be a more acceptable step than changing the fundamental foundation, of a program and the APIs that a programit consumes. For that, a simple C program will do:

#include <stdlib.h>
#include <stdio.h>
#include <sys/timex.h>
#include <string.h>
#include <errno.h>

#define USEC_PER_MSEC 1000L
#define MAX_EST_ERROR_US 100L * USEC_PER_MSEC /* 100 millisecond */

int main() {
struct timex tx = {0};
int rc;
long long int error;

rc = adjtimex(&tx);
error = tx.esterror - MAX_EST_ERROR_US;

if (rc == TIME_BAD)
fprintf(stderr, "Time is marked as bad\n");
else if (rc == -1)
perror("adjtimex");
/* This is to check if NTP thinks the clock is unstable */
else if (error > 0)
fprintf(stderr, "Max error exceeded: %lld(usec)\n", error);
/* If NTP is down for ~16k seconds, the clock will go unsync */
else if (tx.status & STA_UNSYNC)
fprintf(stderr, "Time not in sync\n");
else
return 0;
return 1;
}

On Distributed Time

Let’s consider a real-world problem — Electronic Health Records. Electronic Health Records systems are notoriously fickle, and tend to face the most onerous audits. Additionally, they must integrate with dozens of external systems. For the sake of discussion, let’s say that a doctor generated an update on patient(1)’s chart, and a nurse, from a different workstation, generated an update on patient(1)’s chart for the same field. How do we resolve this simultaneity of events?

The simple problem of determining the order of events in is one of the core concepts of distributed systems engineering. This order can be dictated by the concept of a happens before relationship between events in a system. This relationship is not trivial to derive. Fortunately, several solutions have come about from the 50+ years of computer science. We can lean on these to solve the problem of creating a preorder.

Lamport Clocks

Logical time is a mechanism to capture the happens before relationship of events in a distributed system. Lamport Clocks are one of the earliest types of logical time. They work by timestamping each event. Every actor starts their local clock at 0. When an event is generated, they increment the number and affix that time to the event. When an event with a larger timestamp than their local is received, they set their local “clock” to the event’s timestamp, and add one.

Unfortunately, this has several problems. First of all, if we have multiple unrelated actors in a system generating events for the same piece of data simultaneously, we cannot determine which event came first. Lamport clocks are predicated on the idea that all actors in the system are participating in logical time. In the real world, there are external entities outside of the current frame of reference. These entities, such as browsers, can generate updates, and events.

Lamport Clock

In this diagram, we can see that Actor3 and Actor4 generate two unrelated updates for patient(1). Although these updates are independent, the logical timestamp for the latter update is smaller than the earlier one. Unfortunately, the Lamport Clock only tells us that if event B occurred after event A, the timestamp for event B will be greater than the of timestamp of event A.

Vector Clocks

Vector clocks are an extension of Lamport Clocks. They allow us to compare two clocks, and determine the order of events. If a given vector clock B strictly dominates vector clock A, it means that the event B strictly came after event A. If vector clock B does not strictly dominates vector clock A, and vector clock A does not strictly dominate vector clock B, then the events occurred simultaneously.

The way that vector clocks work is that rather than treating the clock as one single number, they instead expand it to a vector of Lamport Clocks. Each actor has its own Lamport clock in this vector, which only it is allowed to increment. We can determine whether vector B strictly dominates vector A, if for all actors in vector B, the entry in vector A is smaller than or equal to that of B’s and at least one is smaller than that of B’s. Also, events can be uniquely identified by their clocks as no two unique events will ever be tagged with the same clock.

Version Vector

Here, we see the same system as above, except using vector clocks instead of Lamport Clocks. We can see that the two simultaneous updates for the patient have the clocks {A1: 2, A4: 2} and {A3: 1}. Since neither of these vectors descend nor dominate from one another, we can determine that these events occurred simultaneously.

These clocks are not without their flaws. The Charron-Bost result shows that the cost of this capability requires O(N) metadata, where N is the number of actors in the system. Modern client-server systems have millions of participants; this has obvious problems.

Beyond Time?

There are clear issues with physical time — it is unstable and unreliable. Different problems occur with logical time — it’s complex, and it’s not easily consumable by the users of a system. Also, a fundamental aspect of logical time is that it requires two nodes to communicate in order to advance time. Requiring communication is incredibly clumsy, as we must know all of the participants that could be involved in advance of the transaction. Side-channels, such as browsers and queues, may be hidden from our time-keeping system, resulting in correctness being compromised.

Google made a splash with their Spanner paper, which highlighted the use of high accuracy clocks in their datacenters. Spanner’s time-keeping capabilities are provided via an API that goes by the name, “TrueTime.” TrueTime contains a timestamp with a tightly bounded error. This error value allows us not to only choose one timestamp, but instead it provides a smeared timestamp. We can use this smear to determine the newest possible value, or the oldest committed value. This information can be used to take a causal cut, or timestamp a transaction without requiring all participants communicate. TrueTime itself has been the topic of much discussion as Google has not confirmed its exact purpose.

As opposed to trying to implement an alternative to TrueTime, most engineers throw up their hands in surrender, thinking that the task would be impossible without the aid of Google’s atomic clocks. In fact, if we look at the API described in RFC1589, it gives us a particular TAI timestamp, and ε, the estimated error of that timestamp. This is the same API that exists in the kernel. The ε in commodity hardware may be in the 100s of microseconds, but it is a better strategy than giving up.

Recently, AugmentedTime, and Hybrid Logical Clocks have been proposed as a compromise between wall clock time, and logical time. Hybrid Logical Clocks take physical time, as the background reference for all events in a system, and combine it with logical time, capturing the causal information required to build a relationship of events. AugmentedTime, on the other hand, combines vector clocks and physical time to build a bounded clock based on ε. The aforementioned techniques allow us to avoid the data loss bugs which have become ever so common.

What now?

There are clear, incremental approaches to making sure that you’re using the right kind of time for your use case. As opposed to depending on the administrator, time should be a core concern of your application. Do not only recommend to the user to use NTP. Use the APIs provided to ensure that their time is of a high enough quality for the application at hand. Without this, we as engineers are only unnecessarily making ourselves vulnerable to further chaos. If you need causally correct time, implement one of the many tools available to you, depending on your needs. Perhaps, one day, HLCs, or AugmentedTime will simply become yet another API provided by the system itself, as seen in Google’s environment. Until then, we need to start with what is available.

Most of all, realize that all hope is not lost. The APIs and techniques described above are incredibly accessible, and have been available for years. By being proactive, and using the tools at our disposal we can avoid losing people’s health records, financial data, or even their lives. By neglecting time, we only signal that we do not value it.