Understanding Etcd3

Ahad Rana
Jun 10, 2020


Most of us have probably heard of Etcd, the distributed key-value store Kubernetes uses to manage its persistent state. Maintaining a coherent and consistent state, even in the face of node crashes, out-of-order messages, or network partitions, is one of the fundamental problems in distributed systems design. Etcd goes to great lengths to prevent split-brain scenarios and provides a highly consistent and fault-tolerant key-value store. Besides metadata storage, distributed systems can also use Etcd for service discovery and, most importantly, as a concurrency/coordination primitive. In this article, we’ll touch on how Etcd (version 3) goes about providing these guarantees and explore the high-level implementation details of its core APIs, with the hope that we can use this knowledge to leverage this tool in our own distributed systems projects.

Before we dive deeper into Etcd, we need to take a slight detour and discuss the CAP theorem. The CAP theorem states that distributed systems can have at most two out of three of the following properties: high levels of data consistency, high availability of access to the data, and tolerance of network partitions. Networks cannot be assumed to be reliable, so distributed systems need to account for network partitions. Hence, in terms of the CAP theorem, distributed systems can either be AP or CP. That is, they can either optimize for availability at the expense of risking data inconsistency during network partitions, or they can focus on consistency of data at the cost of availability.

Etcd is a strongly consistent system. It provides Linearizable reads and writes, and Serializable isolation for transactions. Expressed more precisely in terms of the PACELC theorem, an extension of the ideas behind the CAP theorem, Etcd is a PC/EC system: in the case of a partition it chooses consistency over availability, and in normal operation it chooses consistency over latency.

Etcd uses the Raft algorithm to achieve distributed consensus. Raft focuses on three specific areas: leader election, log replication, and the safety/correctness of the log. Raft nodes can be in one of three states: Leader, Follower, or Candidate. Raft tries to ensure that there is only ever one active leader at a time, and it maintains an append-only replicated log. All writes are sent to the leader, which replicates them to the follower nodes; only when a quorum of nodes confirms a write does Raft return a confirmation to the client. Each successful leader election is marked by a new, monotonically increasing Term number, and each entry in the replicated log is identified by its term and its position in the log.

Raft ensures the correctness of the log by guaranteeing some key safety properties. First, Raft maintains the Log Matching property, which states that if two logs contain an entry with the same index and term, then they are identical in all entries up through that index. Second, Raft maintains the Leader Completeness property, which states that if a log entry is committed in a given Term, then that entry will be present in the logs of the leaders for all subsequent terms. A final Raft safety property stipulates that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. Once an entry has been committed to the replicated log and accepted by a quorum of the nodes, the leader advances its CommitIndex. All the nodes in the cluster can then asynchronously apply all entries up to the CommitIndex position to their local state machines.
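
To make the quorum-commit rule concrete, here is a minimal, illustrative Go sketch of how a leader might advance its commit index once a majority of the cluster has persisted an entry. This is not Etcd’s (or Raft’s) actual implementation; the types and names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// LogEntry is identified by the term in which it was created and its
// position (index) in the replicated log.
type LogEntry struct {
	Term  uint64
	Index uint64
	Data  []byte
}

// Leader tracks, for each follower, the highest log index known to be
// replicated on that follower (matchIndex in the Raft paper).
type Leader struct {
	log         []LogEntry
	matchIndex  map[string]uint64
	commitIndex uint64
}

// maybeAdvanceCommit recomputes the commit index: the highest index that a
// majority of the cluster (leader included) has persisted. Real Raft adds one
// more safety check omitted here: the entry at the candidate index must have
// been created in the leader's current term.
func (l *Leader) maybeAdvanceCommit(clusterSize int) {
	indexes := []uint64{l.lastIndex()} // the leader always has its own entries
	for _, idx := range l.matchIndex {
		indexes = append(indexes, idx)
	}
	// Sort descending; the (quorum)-th largest value is replicated on at
	// least a quorum of nodes.
	sort.Slice(indexes, func(i, j int) bool { return indexes[i] > indexes[j] })
	quorum := clusterSize/2 + 1
	if len(indexes) >= quorum && indexes[quorum-1] > l.commitIndex {
		l.commitIndex = indexes[quorum-1]
	}
}

func (l *Leader) lastIndex() uint64 {
	if len(l.log) == 0 {
		return 0
	}
	return l.log[len(l.log)-1].Index
}

func main() {
	l := &Leader{
		log:        []LogEntry{{Term: 2, Index: 1}, {Term: 2, Index: 2}},
		matchIndex: map[string]uint64{"f1": 2, "f2": 1, "f3": 0, "f4": 0},
	}
	l.maybeAdvanceCommit(5) // 5-node cluster: leader + 4 followers
	// Prints 1: index 2 is only on the leader and f1, so the quorum (3 of 5)
	// commit point is index 1.
	fmt.Println("commit index:", l.commitIndex)
}
```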

Etcd’s data model supports key/value pairs, where both the key and value are represented as binary arrays or strings. There doesn’t appear to be a hard limit on key or value size, but by default the maximum Raft request payload is 1.5 MB. Etcd persists its state machine to disk in the form of a B-tree-based database (BoltDB). Persisting the state machine enhances Etcd’s resiliency. It also limits the potential size of the replicated log, since nodes can recover state by acquiring a snapshot of the database from the leader (something explicitly supported in the Raft protocol).
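
To give a feel for this kind of storage engine, here is a small stand-alone snippet using bbolt, the BoltDB fork that Etcd builds on. The file path, bucket, and keys are made up, and this is not Etcd’s actual on-disk schema.

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open (or create) a single-file, B-tree-backed database.
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// bbolt allows a single read-write transaction at a time...
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("kv"))
		if err != nil {
			return err
		}
		return b.Put([]byte("foo"), []byte("bar"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// ...alongside many concurrent read-only transactions.
	err = db.View(func(tx *bolt.Tx) error {
		fmt.Printf("foo = %s\n", tx.Bucket([]byte("kv")).Get([]byte("foo")))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```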

To enable concurrent access and support versioning, Etcd implements multi-version concurrency control (MVCC) on top of BoltDB. Every state mutation (e.g., Put, Delete, Txn) increments a global revision number in the database. Read transactions can target records at or below a specific revision number. The key part of the key-value pair is not used as the primary key in the database; instead, Etcd uses a revision as the primary key. A revision is a composite key consisting of the current major revision number and a minor sub-revision number, which records the relative position of the change within its atomic batch of changes. Values stored in the database contain the key-value pairs themselves, as well as metadata. BoltDB supports multiple concurrent read transactions but only a single read-write transaction, which meshes well with the constraints of the Raft-based state machine: mutations can only be applied serially, in a deterministic order.
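
Here is a sketch of that composite revision key, loosely modeled on Etcd’s mvcc package (treat the exact byte layout as an illustration rather than a contract). Because revisions are encoded big-endian, a B-tree range scan over them returns changes in commit order.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// revision is the composite primary key used in the backend database: main is
// the global revision incremented per committed transaction, and sub is the
// position of the change within that transaction's atomic batch.
type revision struct {
	main int64
	sub  int64
}

// toBytes encodes the revision as a sortable byte key: big-endian main
// revision, a separator byte, then big-endian sub revision.
func (r revision) toBytes() []byte {
	b := make([]byte, 17)
	binary.BigEndian.PutUint64(b[0:8], uint64(r.main))
	b[8] = '_'
	binary.BigEndian.PutUint64(b[9:17], uint64(r.sub))
	return b
}

func main() {
	// Two changes applied in the same atomic batch share a main revision and
	// differ only in their sub revision.
	fmt.Printf("%x\n", revision{main: 42, sub: 0}.toBytes())
	fmt.Printf("%x\n", revision{main: 42, sub: 1}.toBytes())
}
```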

Etcd’s database design raises the question: how do you access a value by its key, or do a range scan over a set of keys? To enable this obvious use case, Etcd maintains an in-memory B-tree index mapping each key to a data structure that contains the latest revision and the prior revision history. This index is rebuilt by doing a full database scan on startup and is updated during state transitions. A range scan then involves first walking the tree and doing a seek into the database for each target revision. The decision to have this type of index structure is probably driven by Etcd’s desire to support features such as Watch (discussed later). It seems evident, based on the database design, that Etcd needs to run on systems with SSDs or, better yet, NVMe-based local disks.
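
As a simplified stand-in for that index (Etcd uses an actual B-tree; a sorted slice illustrates the same lookup path), the sketch below maps each key to its latest revision, which then becomes the seek key into the revision-keyed database. All names here are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// keyMeta is what the in-memory index stores per key: the latest revision
// plus (in Etcd proper) the prior revision history, omitted here.
type keyMeta struct {
	key           string
	latestMainRev int64
}

// index is kept sorted by key so that range scans are an ordered walk.
type index []keyMeta

func (ix index) rangeScan(start, end string) []keyMeta {
	lo := sort.Search(len(ix), func(i int) bool { return ix[i].key >= start })
	hi := sort.Search(len(ix), func(i int) bool { return ix[i].key >= end })
	return ix[lo:hi]
}

func main() {
	ix := index{
		{"/app/a", 12},
		{"/app/b", 7},
		{"/app/c", 15},
		{"/zzz", 3},
	}
	// A range scan over the /app/ prefix yields, for each key, the revision
	// to seek to in the on-disk, revision-keyed database.
	for _, m := range ix.rangeScan("/app/", "/app0") {
		fmt.Printf("key %s -> seek revision %d in boltdb\n", m.key, m.latestMainRev)
	}
}
```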

The Etcd APIs fall into three functional groups: basic key-value APIs, lease-related APIs, and Watch APIs. The key-value APIs support GET, PUT, and DELETE, as well as Transactions.
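
The examples that follow use the official Go client. Here is a minimal sketch of connecting and doing a PUT and a GET, assuming a reachable cluster on localhost:2379 and the etcd 3.4-era import path (go.etcd.io/etcd/clientv3; newer releases use go.etcd.io/etcd/client/v3). The keys and values are made up.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Connect to a (presumed) local single-node cluster on the default port.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// PUT and GET through the key-value API.
	if _, err := cli.Put(ctx, "/app/config/feature-x", "enabled"); err != nil {
		log.Fatal(err)
	}
	resp, err := cli.Get(ctx, "/app/config/feature-x")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s (mod revision %d)\n", kv.Key, kv.Value, kv.ModRevision)
	}
}
```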

A sophisticated RANGE API can fetch single keys or ranges of keys. GETs in Etcd are not a simple matter of doing a bunch of concurrent B-tree lookups, because Etcd supports linearizable reads. Linearizability implies that once a write completes, all later reads should return the value of that write or the value of a later write. To accomplish this, the Etcd server servicing the GET contacts the leader to establish the latest CommitIndex position in the replicated log. The leader, in turn, sends out heartbeats to a quorum of nodes to validate its leadership; confirmation of its leadership establishes the validity of the CommitIndex, putting in place the conditions for a linearizable read. Once the origin server gets a response from the leader, it must wait for its local state machine to catch up to the leader’s CommitIndex. In other words, it must wait until its local database reflects all state changes up to at least the CommitIndex. Only then can the server do the necessary database reads and return a response to the client.
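
The Go client exposes this tradeoff directly: the default GET is linearizable, while an opt-in serializable read skips the leader round-trip in exchange for possibly stale data. A sketch, assuming an already-connected client like the one created above:

```go
package etcdexamples

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// readModes contrasts the two read modes on the same (made-up) key.
func readModes(ctx context.Context, cli *clientv3.Client) error {
	// Default read: linearizable. The serving node confirms the leader's
	// commit index (which itself requires a quorum heartbeat) before reading.
	if _, err := cli.Get(ctx, "/app/config/feature-x"); err != nil {
		return err
	}
	// Serializable read: served from the local node's current state machine.
	// Cheaper, but it may lag behind the latest committed write.
	_, err := cli.Get(ctx, "/app/config/feature-x", clientv3.WithSerializable())
	return err
}
```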

Unlike GET API calls, PUTs and DELETEs require the leader to write Raft messages to the replicated log. Only when these messages have been committed to a quorum of nodes is it safe for the Etcd server to return a confirmation to the client.
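
A small sketch of the write path from the client’s perspective, with the same assumptions as above; the call returns only after the quorum commit described here.

```go
package etcdexamples

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// deleteTree removes every key under a (made-up) prefix. The call returns
// only after the corresponding Raft entry has been committed by a quorum.
func deleteTree(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.Delete(ctx, "/app/old-config/", clientv3.WithPrefix())
	if err != nil {
		return err
	}
	fmt.Printf("deleted %d keys at revision %d\n", resp.Deleted, resp.Header.Revision)
	return nil
}
```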

Etcd version 3 introduces the Transactions API. Transactions allow you to compose a series of operations (GETs, PUTs, DELETEs) and execute them atomically, conditioned on the evaluation of a set of predicates against the current state of the key-value store. Because transactions are applied through the Raft log and executed against the state machine in a serial, deterministic order, they offer a Strict Serializability isolation level. They are probably one of the most powerful additions to the Etcd API. When used in combination with Etcd’s MVCC data model, they can serve as a building block for more sophisticated abstractions and have the potential to significantly reduce the code complexity of concurrent operations in client applications.
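
For example, the classic put-if-absent (compare-and-swap) pattern behind locks and leader election falls out of a single transaction. A sketch with made-up key names, assuming the same client setup as before:

```go
package etcdexamples

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// tryAcquire attempts to claim a (made-up) lock key only if no one has
// created it yet (CreateRevision == 0 means the key does not exist).
// The whole If/Then/Else evaluates atomically on the server.
func tryAcquire(ctx context.Context, cli *clientv3.Client, owner string) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision("/locks/job-runner"), "=", 0)).
		Then(clientv3.OpPut("/locks/job-runner", owner)).
		Else(clientv3.OpGet("/locks/job-runner")).
		Commit()
	if err != nil {
		return false, err
	}
	if !resp.Succeeded {
		// The compare failed, so the key exists; the Else branch fetched it.
		current := resp.Responses[0].GetResponseRange().Kvs[0].Value
		fmt.Printf("lock already held by %s\n", current)
	}
	return resp.Succeeded, nil
}
```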

Etcd’s Lease API allows the client to create leases with an explicit time-to-live (TTL). Clients can then attach keys in the Etcd key-value store to a lease. Once the lease expires, Etcd deletes all keys associated with it. Clients also have the option to revoke leases or renew them before their TTL expires. Lease creation and deletion are implemented as Raft messages; once they are committed to the replicated log, they are applied to all the state machines in the Etcd cluster. The leader monitors all active leases, but because leases are persisted in the state machine’s database, they can be restored on a newly elected leader after a leader election. Furthermore, since the Raft log is serial, all keys associated with a lease can be tracked on all nodes after lease creation. Lease renewals, however, are not applied via the Raft log; instead they are handled by the leader, which distributes the renewal state and new TTL to followers via RPC.
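
A typical service-registration sketch: grant a lease, attach a key to it, and keep it alive for as long as the process runs. Same client assumptions as before; the key, value, and TTL are made up.

```go
package etcdexamples

import (
	"context"
	"log"

	"go.etcd.io/etcd/clientv3"
)

// register advertises a (made-up) service endpoint under a 10-second lease.
// If this process dies and the keepalives stop, etcd deletes the key once
// the TTL expires, so stale registrations clean themselves up.
func register(ctx context.Context, cli *clientv3.Client) error {
	lease, err := cli.Grant(ctx, 10) // TTL in seconds
	if err != nil {
		return err
	}
	_, err = cli.Put(ctx, "/services/api/instance-1", "10.0.0.5:8080",
		clientv3.WithLease(lease.ID))
	if err != nil {
		return err
	}
	// KeepAlive renews the lease in the background for the life of ctx.
	ka, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}
	go func() {
		for resp := range ka {
			log.Printf("lease %d renewed, TTL now %ds", resp.ID, resp.TTL)
		}
		log.Print("keepalive channel closed; registration will expire")
	}()
	return nil
}
```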

The Watch API allows clients to set watches on a range of keys and receive guaranteed updates for any changes to those keys. Clients can specify the revision at which they want to start monitoring a key. By tracking the revision number of the last update they received, they can pick up any missed changes in the event of a client disconnect or an Etcd server crash. Unlike leases, watches are more ephemeral and must be re-established by the client after a disconnect. The design of the Etcd schema, with revision as the primary key for key-values in the database, allows servers to satisfy watches using sequential database scans instead of random-access reads.
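
A sketch of that pattern: watch a made-up prefix starting just after the last revision this client already processed, so nothing is missed across reconnects. Same client assumptions as before.

```go
package etcdexamples

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// watchFrom streams every change under a prefix, starting at lastSeenRev+1
// so no update is missed even if the client was disconnected for a while.
func watchFrom(ctx context.Context, cli *clientv3.Client, lastSeenRev int64) {
	ch := cli.Watch(ctx, "/app/config/",
		clientv3.WithPrefix(), clientv3.WithRev(lastSeenRev+1))
	for resp := range ch {
		for _, ev := range resp.Events {
			// Persist ev.Kv.ModRevision somewhere durable to resume from later.
			fmt.Printf("%s %q -> %q (revision %d)\n",
				ev.Type, ev.Kv.Key, ev.Kv.Value, ev.Kv.ModRevision)
		}
	}
}
```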

Etcd is a highly available, strongly consistent distributed key-value store that provides advanced features like transactions, watches, and leases. It can be used to provide robust concurrency control for distributed systems development. The rapid adoption of Kubernetes has focused a lot of engineering effort on making Etcd more robust at scale. Recent releases have focused on operational improvements such as cluster stability, for example by introducing the concept of Learner nodes. There has also been a lot of attention devoted to performance, such as the recent work to significantly improve concurrent read performance. Etcd compares favorably to other systems in this space, including Zookeeper and Consul. If you are building a distributed system and need to store runtime state or provide robust concurrency primitives for coordination, Etcd is worth looking into as a solution.

Of course, like all complex systems, there are tradeoffs involved in achieving the high levels of availability and consistency Etcd provides. For example, Etcd’s read and write paths are not particularly lightweight. Furthermore, the Raft state machine is essentially single-threaded, so a single Etcd cluster cannot take full advantage of additional cores. There are also limits to the size of database an Etcd cluster can support. When building systems with Etcd, it is best to keep these constraints in mind: minimize the state you store in it, and limit the frequency with which you need to read or update it. This is just an introductory post, and there is a lot of work involved in operationalizing Etcd, including addressing security and data-integrity concerns. I’ll try to cover some of these topics in future posts.
