Optimistic, scalable ACID transactions
The bounty offered (1000 €) is awarded to the first person disproving a fundamental aspect of the framework described in this paper.
This article is the first in a series describing a way to implement a fully non-blocking, yet strongly consistent, distributed datastore which:
- Provides strong, ANSI SQL-compatible consistency guarantees;
- Is distributable across multiple computers, datacentres and even geographical locations;
- Can have a resilient, fault-tolerant architecture, where even failures of multiple different datacentres can allow the operation of the datastore to continue seamlessly, depending on the configuration.
This is presented through the idea that sets of resulting data from multiple computers do not necessarily have to be consistent immediately; instead, they can be filtered before the results are presented to the client. Looked at this way, the problem can be reduced to making sure that the filtering algorithm references an atomic event which can be deterministically compared to others, yielding the same answer anywhere in the system.
The document outlines:
- A fully optimistic, lockless conflict resolution mechanism and write process, which allows the results of write operations (potentially writing to multiple computers) to be exposed atomically to readers.
- An architecture where the internal state of each computer can be reconstructed from the same ordered set of inputs. This allows the whole system (which is the sum of these computers) to be reliably distributed, persisted and generally be made more resilient.
This is different from current systems, where writes either use some form of two-phase commit algorithm (which is inherently blocking), lack transaction ordering semantics (and are thus unable to provide strong consistency guarantees) or fail to provide proper isolation between transactions.
Another aspect of current database systems is that their internal state is protected by certain mechanisms so that writes and reads can be finished according to the required consistency guarantees. In this system, the internal state of a computer is considered to be the result of an ordered list of input messages and, using a different definition of time, it does not matter when a computer physically processes its queue of inputs, since the next instruction will always come from the end of its input queue. This means that two computers will always hold the same internal state once this list has been consumed by both.
The above means that there can be no race conditions or other sources of ambiguity anywhere in the system, which mandates that computers only communicate with each other using asynchronous methods (since race conditions are inherently non-deterministic).
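The determinism argument above can be sketched in a few lines. The `Replica` class and the message format below are illustrative assumptions of ours, not part of the specification: two replicas that consume the same ordered list of input messages end up with identical internal state, regardless of when each physically processes its queue.

```python
# Sketch: deterministic state from an ordered input list. The class and the
# message format are illustrative assumptions, not part of the specification.

class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, message):
        # Each input message deterministically transforms the state.
        op, key, value = message
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

inputs = [("put", "a", 1), ("put", "b", 2), ("delete", "a", None)]

r1, r2 = Replica(), Replica()
for msg in inputs:          # replica 1 processes its queue immediately
    r1.apply(msg)
for msg in inputs:          # replica 2 processes later, but in the same order
    r2.apply(msg)

assert r1.state == r2.state == {"b": 2}
```

The physical moment of processing never appears in the computation, only the order of the inputs does.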
This document does not aim to specify the requirements for implementing such a system; it only aims to show that such a system is possible. Some optimisations are suggested, others hinted at, and still others omitted. This is because the goal of these articles is to provide a first proof of concept that such a system could be built and maintained, even considering the stringent requirements of the software industry. Opportunities for optimisation will be mentioned in later chapters, the appendix and in footnotes.
In the following, we describe a theoretical framework which can be used to implement a distributed, linearly scalable and strongly consistent, SQL-compatible datastore. We define these terms as:
Distributed: The system maintains its information on multiple computers.
Linearly scalable: If additional computers are introduced into the system, its capacity will roughly increase by the ratio of the total capacity of the new computers compared to the total sum of all capacity of the computers utilised in the system before they were added.
For example, if the system used 100 computers previously and 10 new computers are introduced, capacity will grow by about 10%.
Strongly consistent: A global ordering exists between each piece of information. The order between these is referred to as “earlier,” “later” or “at the same time.” The information implementing this ordering is called time (this definition of time is not the same as its equivalent meaning in English). The system will only return results which contain all relevant information introduced earlier or at the same time compared to a chosen point in time (and does not contain any information introduced later than this time).
To achieve the goals described above, the following components are specified:
- A number of clocks ordered into a hierarchical tree structure, advancing independently from each other. On this structure, an algorithm exists which defines global timestamps. Any two timestamps can be compared to each other to decide which happened later, earlier or at the same time.
- An algorithm which guarantees that each piece of information will be readable and can be evaluated using a certain timestamp and that information introduced later cannot be a part of the results.
- A solution which guarantees that if there are conflicting modifications, there is a system-wide consensus regarding which modifications (if any) have been accepted and which were rejected. More formally: if version is defined to be the value of a piece of information at a given time, then each version can only have at most one successive version.
We also give technical proposals for a possible implementing infrastructure that enables:
- Fail-safe, redundant communication between computers in the distributed system,
- And latency mitigation.
Comparison with SQL
The system described here allows for the implementation of an SQL-compatible database. The SQL standard allows for:
- A transaction to be refused for any reason, at the discretion of the server;
- The database to escalate the defined isolation level without asking for permission from the client.
The system described here guarantees that it follows the industry’s ACID properties, which are:
- Atomicity: Transactions are introduced into the datastore atomically, and other transactions will also acquire any information introduced by them atomically.
- Consistency: The redundant constraint information introduced by a transaction will only become available at the same time as all the other information entered by the transaction, and there can never be two simultaneous parallel transactions committed which would together violate a constraint. Thus, it is always possible for the system to evaluate the validity of a redundant constraint record either at the time the information is introduced or at commit, depending on how the index or constraint is implemented.
- Isolation: Transactions cannot read or write each other’s half-finished states.
- Durability: Information already accepted by the system cannot be lost at a later time.
As implied, this system does not solve the existing issues posed by the SQL-standard, such as phantom reads, phantom writes or the necessity of using SELECT FOR UPDATE statements for signalling write-protection of a set of information when that is needed.
The term statement means roughly the same in this document as it does in the SQL standard: a read and/or write instruction from the client, which might result in some information being presented back to it.
Distributed hierarchical clocks
The following covers the first point of the components listed in the Implementation section.
A potentially globally distributable hierarchical clock is described in this chapter. Its goal is to allow computers whose data are closely correlated to have a fast and efficient information exchange mechanism and still allow for a system which could potentially be grown to a global scale in which even remote computers could participate in transactions with strong consistency guarantees and relatively quick response times.
The way these “clocks” are arranged is immaterial from the perspective of the specification. Each clock could be hosted on a specific computer, or the clocks could be distributed even further. The essence of this chapter is that an algorithm can be established which draws up a strong ordering between the values assumed by a set of distributed clocks.
Groups and local clocks
Consider a number of computers which send packets of information called clock token-messages between each other in a circle. A set of computers such as this is called a group. The message contains a natural number which is incremented by one by each computer before passing it on to the next in the chain. In such a setup each computer (as long as it knows its position in the chain) knows which numbers it will receive and which numbers it will send on. The information which is unknown to the computers is the time which will elapse between two token-messages. Since the clock itself is composed of all the participating computers, the group clock is essentially synonymous with the group. The only distinction is a fine one, in that when the group clock is mentioned, a specific ability of the group to create versions is being discussed.
The time window between processing two token-messages is called a period.
- Periods which have finished before the latest token-message was received (i.e., each one except for the last one) are called closed periods.
- The latest period, which will be closed by processing the next token-message, is called an open period.
For example, in a group which contains three computers, the first computer is going to assume the following values: 0 = 3*0, 3 = 3*1, 6 = 3*2 … m = 3*k, where m is the assumed value and k is the number of the period, counted from the start of the message-exchange mechanism. Since each computer knows the latest number, it also knows the next one. For example, a period on this computer is the time elapsed between processing the messages containing the numbers 3 and 6; in other words, the time it takes for all the computers in the chain to complete a whole round of message-passing.
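The arithmetic above can be sketched as follows (the function name is ours, purely for illustration):

```python
# Illustrative sketch: the token values processed by one computer in a group
# of `group_size` computers passing a single incrementing token in a circle.
# The computer at 0-based `position` receives position, position + n, ...

def assumed_values(position, group_size, rounds):
    """Token values processed by the computer at `position` over `rounds` rounds."""
    return [position + group_size * k for k in range(rounds)]

# First computer in a three-computer group, as in the example above:
print(assumed_values(0, 3, 3))  # [0, 3, 6]
```

Each computer can thus predict every value it will ever receive; only the timing of the token-messages is unknown to it.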
The system never needs to look ahead of the currently open period in any group. This means that if the set of computers needs to change within a group, the new number of computers in the group could potentially be passed along in the message, so that the newly introduced or removed computer can start/stop participating in the message-exchange once the token-messages have gone full circle and reached the node that initiated the start/stop process.
A computer’s local clock can be described with the following two attributes:
- What was the value of the last message received?
- How many computers will be present in the group in this round?
The period last closed by the receipt of a message is called the group clock value. This is therefore not a global value. To identify it, one needs to know which group is being discussed.
Group clock values are also referred to as versions. Since each period identifies a group clock value, the terms open and closed versions are also used to mean the group clock value identified by the open or closed period.
Given a group of computers, we can associate new groups of computers to each. The computer in the original group will be referred to as the parent or parent computer, the new group of computers associated to it will be called the child or child group. These child groups will have a clock of their own, which will advance independently from the parent’s clock. This establishes a tree hierarchy of both computers and groups (as parent computers are also members of groups).
We define the top of the hierarchy to mean the set of computers which have no parent computers and the bottom or leaf as those which have no children.
It is assumed that communication between computers on the bottom of the hierarchy is orders of magnitude faster than it is between the ones at the top of it.
An ancestor computer is either the parent of a computer or group, or an ancestor of that parent. An ancestor group is a group an ancestor computer belongs to. A common ancestor group is a group which is an ancestor group of both of the two specified computers or groups. The lowest common ancestor group or lowest common group clock is the first common ancestor group of two specified computers or groups when searching from the bottom to the top (more formally: the common ancestor group which has the highest number of ancestors). The root group is the only group which has no parent and is therefore the top of the hierarchy.
The lowest common ancestor group should always be computable by any computer based on two known versions and the references to their group clocks.
The root group is a common ancestor group to any two groups in the system.
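A minimal sketch of computing the lowest common ancestor group, assuming each version carries the chain of groups from the root down to its own group (this representation is an assumption of ours, not mandated by the specification):

```python
def lowest_common_ancestor_group(path_a, path_b):
    """Each path lists group names from the root group down to the group itself.
    The lowest common ancestor is the deepest group present in both paths."""
    common = None
    for ga, gb in zip(path_a, path_b):
        if ga != gb:
            break
        common = ga
    return common

# Hypothetical hierarchy: root -> A -> A2 and root -> B.
print(lowest_common_ancestor_group(["root", "A", "A2"], ["root", "B"]))  # root
print(lowest_common_ancestor_group(["root", "A", "A2"], ["root", "A"]))  # A
```

Because the shared ancestors always form a prefix of both root-paths, the search never has to leave the two known chains, which is what makes the lowest common ancestor locally computable.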
From the child group, a message called value query message containing:
- The name of the sending computer and
- A group clock value created by it (which can be either closed or open at this point)
can be sent to the parent computer.
When the parent processes the received message:
- If no value of the parent’s clock has been associated with the received one, it returns the currently open period. In the same atomic operation, it also assigns the open period’s value to all child-group clock values which are higher than the highest value already queried, up until and including the value present in the message.
- If the received group clock value already has an associated parent-clock value, it returns that parent-clock value.
We refer to this relation as a child group clock version being associated with the parent-clock version. Associations of versions are bidirectional. Every child version is associated with exactly one parent version; however, one parent version can be associated with multiple child versions. The association relation orders the versions of parent and child groups into a tree. We also use the term transitively associated when describing a chain of associations (potentially involving any number of elements) in a sequence of parent-child relationships.
As an example, take a child group from which we send value query messages with the values 21, 22 and 23 to the parent. Suppose that there are 3 computers in the parent’s group, of which the parent computer is the first one. Let’s say the parent receives the values 22 and 21, in this order. When it received 22 in the message, its current open period was 3, so it will associate the child’s clock values of 21 and 22 with its own clock’s value 3. Once it receives the second message, it will immediately answer 3, since that is a value that has already been associated. Let us also assume that by the time it receives the message containing 23, it is on its next open period (and so 3 is now a closed period), so it will associate 6 with the value 23.
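The parent’s side of this exchange can be sketched as follows. The class and field names are illustrative, and the starting value of 20 as the highest previously queried value is an assumption made so that the example above is reproduced exactly:

```python
class ParentClock:
    """Sketch of the parent computer's value query handling (names illustrative)."""

    def __init__(self, open_period):
        self.open_period = open_period  # value of the currently open period
        self.assoc = {}                 # child clock value -> parent period
        self.highest = -1               # highest child value queried so far

    def value_query(self, child_value):
        if child_value in self.assoc:
            return self.assoc[child_value]   # already associated: answer directly
        # In one atomic step, associate every not-yet-queried value up to and
        # including child_value with the currently open period.
        for v in range(self.highest + 1, child_value + 1):
            self.assoc[v] = self.open_period
        self.highest = child_value
        return self.open_period

p = ParentClock(open_period=3)
p.highest = 20              # assume 20 was the highest value queried earlier
print(p.value_query(22))    # 3 -- associates 21 and 22 with period 3
print(p.value_query(21))    # 3 -- already associated, answered immediately
p.open_period = 6           # the parent has moved on to its next open period
print(p.value_query(23))    # 6
```

The same range-filling step also covers the later example in this article: if, with period 6 open, the next message carried 27, the values 24 through 27 would all be associated with 6 as well.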
Weak and strong ordering
The association described above can be used to implement a global ordering between any two versions. In the following chapter, we’ll describe two strategies to decide if a version comes before, after or at the same time as another one.
We define the at or before operator recursively as follows: if x and y are two versions in the clock hierarchy, X is the group clock of x and Y is the group clock of y, then x is at or before y if
- x and y are versions on the same group clock (X = Y) and x is the same as or earlier than y (x <= y), or
- Y is an ancestor group of X and there is an x’ transitively associated with y such that x <= x’.
This algorithm defines a top-down search for everything that is either equal to or lower than a chosen version. This is called a weak ordering because no strict order can be established using this algorithm between the elements within the set. This method is also referred to as the weak ordering strategy.
Using this, we can draw up a set of versions which must have been created at the same time, or before the specified version, without having to go deeper into the hierarchy to explore their detailed order.
The strong ordering strategy is:
- If two versions are from the same group, their values are directly compared.
- If the two versions are from different groups, their associated values are compared on their lowest common group clock. They are guaranteed to have an order on a common ancestor if they originated from two groups where neither is an ancestor of the other, since they cannot have the same parent computer.
- If, on the lowest common group clock, one version is transitively associated with the other and therefore no order could be established this way, the one which is lower in the hierarchy (i.e., the one which has more ancestors) is the lower one.
The strong ordering is stricter than the weak ordering by virtue of the third point. Together, these rules guarantee a deterministic order between any two versions.
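Assuming each version carries its chain of (group, value) associations from its own group up to the root, the strong ordering strategy can be sketched like this (the representation and the function name are ours):

```python
def compare(chain_x, chain_y):
    """Strong-ordering sketch. A chain lists (group, value) pairs for a version,
    from its own group up to the root, following parent associations.
    Returns a negative number if x is earlier, positive if x is later, 0 if equal."""
    groups_y = {g for g, _ in chain_y}
    # The shared ancestors form a common suffix of both chains; the first group
    # of chain_x that also appears in chain_y is the lowest common group clock.
    lca = next(g for g, _ in chain_x if g in groups_y)
    vx, vy = dict(chain_x)[lca], dict(chain_y)[lca]
    if vx != vy:
        return -1 if vx < vy else 1
    # Tie on the lowest common group clock: one version is transitively
    # associated with the other; the deeper version (more ancestors) is lower.
    if len(chain_x) != len(chain_y):
        return -1 if len(chain_x) > len(chain_y) else 1
    return 0

# x: version 21 on group A, associated with 3 on the root group.
# y: version 2 on group B, associated with 1 on the root group.
x = [("A", 21), ("root", 3)]
y = [("B", 2), ("root", 1)]
print(compare(x, y))   # 1 -- x is later, since 3 > 1 on the root clock
```

The tie-breaking branch covers the third rule: when the values at the lowest common group clock coincide, the chains necessarily differ in length, and the longer (deeper) chain belongs to the lower version.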
Atomicity of associations between clocks
Continuing the earlier association example: if we were to ask which values are at or before the parent group clock’s value of 3, we know that the answer is 21 and 22. If, however, we were to ask which ones fulfil the same requirement for 6, our answer cannot be complete, as that period is still open and can therefore still accept higher values.
Since there is no guarantee that the parent computer will receive the child group’s clock values in order, the answers coming from the parent computer can only be consistent if it does not only consider its current open period, but it also takes its earlier replies into account.
The association between clocks is therefore atomic, since it is defined by when the parent-computer first received a higher value than the highest one stored previously.
Continuing the previous example, if:
- The last value received on the parent from the child clock was 22,
- The next one was 27,
- The current open period’s value is 6,
Then the value 6 on the parent’s group clock will be associated with every natural number between 23 and 27.
Imagine a group which includes 3 computers, A, B and C. Each of them has an associated child group named after the parent computers, so group A, group B and group C, respectively.
The illustration here shows a possible result of the group clock’s independent execution and the value query messages which associate the versions of the 4 independent clocks.
In the upper half, in the first column, from top to bottom:
- The number in the top row is the value of the group clock. So, the first value is 0. The values received by A are shown with a larger number to show at which points the cycle started again.
- The A underneath 0 shows that the value is associated with a value of the group clock of group A.
- The number 5 under the A shows that the parent group’s value of 0 is associated with version 5 of group A.
The second column is computer B receiving its first message from computer A, the third one is computer C receiving its first message from computer B and so on.
The lower half shows the three independent group clocks. These are denoted by their parent computer’s names, so they are labelled as A, B and C in the bottom row.
The leftmost curly bracket above the segment 0 and 5 shows the values of group A associated with computer A’s 0 period. The arrows show an association-message that was sent to the parent computer from its child group. The leftmost arrow is the first one, where the parent was A, and the message contained group A’s clock value of 5. To help differentiate between them, each group has a different style for the line of the arrow. Group A has a solid line, B has a dotted line and C has a double-dotted-triple-dashed line.
The figures in this chapter are an elaboration of the example given above. The values in them can be correlated back to that.
Assume that a client’s purpose for using the clock is that it should always be able to decide which clock values happened at or before (or inversely, after) a specific time. First, say that the client uses the root group clock, so when it poses this question, the read operation’s reference value (against which all others are compared) belongs to the root group clock.
The next figure shows which values will be at or before if the reference value is 1.
The dotted area represents the clock values which are at or before the root group clock’s value of 1. The values included from group A are the ones between 0 and 5; for group B, between 0 and 2. Nothing is included from group C. The reason for this is that even though the child groups’ clocks advance independently from their parent clocks, their advancement determines which values will be assigned from the child groups to the parent group.
In the figure above, group B’s group clock’s value is 3. The values 0, 1 and 2 were created before computer B received its first token-message. For a client looking in from the outside, the values represented by group B will first become visible after computer B has processed its first token-message, even if computer B has processed the values sent by group B earlier. Thus, the client will receive the values represented by the dotted area as the ones which are at or before value 1 on the root group clock.
The figure above illustrates the state of all group clocks after the next round (of passing the token-messages) has finished. Since in the previous figure the root group clock assumed the value 1 and since there are 3 computers in the root group, the clock’s value is now 4.
Comparing the dotted area in this figure with the one above, one can tell that neither group A nor group C provided new values during the last round. Group B, however, has extended its set of values since the previous cycle and now includes the values between 3 and 11. This illustrates that group clocks do not assume that child groups associate new values in each round to their parents.
This last figure shows the state after the next value was assumed on the root group clock. Compared to the previous figure, the only change is that the next computer in line (computer C) has processed its token-message. Based on this, one can tell that while the root group clock assumed the values 2 and 5, the highest value computer C had received from child group C was 9.
An important goal for the hierarchical clock is that it should allow for the comparison of two versions (which are potentially remote to each other) without necessarily needing to exchange messages to distant computers with high network latency. It is assumed that the network distance between two computers at the top of the hierarchy is some orders of magnitude higher than between those lower down. The other aspect of this is that (assuming optimal geographical distribution) the closest member of the root group is orders of magnitude quicker to reach than the others. Therefore, if the information used to decide whether a certain version is “at or before” another version can be gathered without having to move very far, the operation becomes a lot quicker.
Since querying the clock hierarchy for open periods is allowed and since open periods are closed independently from each other in each group, it is possible that a parent’s closed period is associated with open periods in its child group. So, even if the parent has closed its period, and even though this version no longer accepts new directly associated versions from its children, there is no guarantee that its children will not accept transitively associated versions from their own children. This can happen when an open version in the child group has already been associated with a version on the parent.
This guarantee would only hold if, by the time the parent clock’s period was closed, the periods the child sent earlier were also guaranteed to have been closed. Ensuring this would mean that the parent clock has to wait before closing its currently open period until it receives a signal for at least the highest period from the child computer associated with it. This is not only complicated, but would also slow down the closing process, which would affect the throughput of the whole system.
Take two versions, x and y, on two computers, whose associated versions on their lowest common group are x’ and y’ respectively. We call x and y transitively closed if all the versions in the x → x’ association chain and all the versions in the y → y’ association chain are closed.
We can use this information to know that each version associated (directly or transitively) with the specified versions, in all the groups between the two versions, has already been closed and therefore cannot include more information than it already does. This does not guarantee that all versions transitively associated with these versions have also been closed.
Each computer holds an association list of all its ancestor computers and, for each, the highest version known to have been transitively closed. This association list is called the list of known transitive closures.
Transitive closure messages are communicated from top to bottom. The message contains a list of versions and their originating computers (a chain of transitively associated versions). The first element of the list is the version closed by the highest, triggering computer. Versions in the chain are appended to the end of the list once they, too, are found to be closed.
Each time a parent computer closes a period, it checks if that version was associated with another one on its child group. If it was, it lists the associated versions for each computer in the child group and sends a transitive closure message to each, containing the version it just closed as the first element of the list.
When a child computer processes the message:
1. It replaces the highest transitively closed version information for each ancestor listed in the message in its list of known transitive closures, if the newly received version is higher than the one previously stored for that ancestor.
2. If the local version associated with the parent’s version is still open, it waits for the next token-message to arrive before continuing with step 3.
3. It adds the local version associated with the parent’s version to the end of the transitive closure message.
4. It lists the associated child periods of this version for each computer in its child group and sends the transitive closure message to each, or terminates if it is not a parent computer.
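A simplified sketch of this propagation is shown below. All names are illustrative assumptions; the “wait for the next token-message” step is modelled by only forwarding the message once the local associated version is closed, and each node forwards directly to its children rather than per-period as the full description requires.

```python
class Node:
    """One computer in the hierarchy (illustrative sketch, not a full implementation)."""

    def __init__(self, name):
        self.name = name
        self.children = []        # computers in this computer's child group
        self.assoc = {}           # parent version -> local associated version
        self.closed = set()       # locally closed versions
        self.known_closures = {}  # ancestor name -> highest transitively closed version

    def receive_closure(self, message):
        # Step 1: update the list of known transitive closures.
        for ancestor, version in message:
            if version > self.known_closures.get(ancestor, -1):
                self.known_closures[ancestor] = version
        local = self.assoc.get(message[-1][1])
        # Steps 2-3: append the local associated version once it is closed
        # (a real node would wait for the next token-message if it were open).
        if local is not None and local in self.closed:
            # Step 4: forward to the child group, or terminate at a leaf.
            for child in self.children:
                child.receive_closure(message + [(self.name, local)])

# Hypothetical chain: root computer A -> child computer B -> grandchild C.
b, c = Node("B"), Node("C")
b.children = [c]
b.assoc, b.closed = {3: 12}, {12}
c.assoc, c.closed = {12: 40}, {40}

# A closes its period 3 and sends a transitive closure message to B.
b.receive_closure([("A", 3)])
print(c.known_closures)   # {'A': 3, 'B': 12}
```

After the message has descended the chain, C knows that A’s version 3 and B’s version 12 are stable, which is exactly the information described in the next paragraph.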
This way, each computer has a list of which versions are stable on its ancestor computers, i.e., the highest known version to which no new values will be associated (either directly or transitively) on each ancestor.
If you like, you can continue to chapter 2, exploring the question of how the filtering algorithms can guarantee consistent views using the global clock and how these guarantees can be used to implement the consistency guarantees required by the ANSI SQL standard.