Rossplex — Not quite HyperDex, not quite Replex

Ross Teixeira
Princeton Systems Course
18 min read · May 17, 2018

Ross Teixeira and Natalie Wilkinson

“Where there’s a Wilkinson, there’s a Way”

The goal of our final project in COS 518 this semester was to implement Replex, a multi-index NoSQL database system that addresses the large storage overhead traditional multi-index stores incur in order to tolerate failures. We followed the original Replex design by building on top of HyperDex, a multi-index database which keeps a full copy of the database for each index, replicated again for each fault that must be tolerated. Replex cuts down on storage overhead by allowing data to be recovered across different indices of the data. It then extends this idea with “hybrids,” which are replicas of the database that are partially indexed on multiple attributes. Here, we summarize the key ideas behind HyperDex and Replex, and explain our process and contributions.

Background

NoSQL

Distributed NoSQL databases are a popular way of storing key-value data. However, these databases have drawbacks when it comes to accessing data. In most key-value stores, data can only be accessed efficiently via the primary key; searching for a particular value in the database would require examining every key-value pair. Since many workloads need to search by some attribute of the value (a secondary attribute), this can lead to poor performance.

HyperDex

HyperDex is a NoSQL database that provides a rich API for key-value and document storage with multiple attributes. It was developed by Robert Escriva, then a PhD student at Cornell. HyperDex is a multi-index store, in which queries can be performed efficiently on multiple attributes. It does this by not only hashing on the key to determine which server the data maps to, but also on each attribute independently. This collection of hash values can be treated as a coordinate in n-dimensional space, where n is the number of attributes including the key.

Figure 1

This is visualized in Figure 1, with 3 dimensions, or attributes: first name, last name, and phone number. Servers are split among these dimensions. If you have the values for all three attributes, this maps to a single point, and thus a single server. Information about each attribute essentially flattens the space along that dimension, reducing the number of servers that need to be contacted to find the data.

Figure 2

Figure 2 makes this clearer for one and two dimensions. In the one-dimensional case we have one attribute: first name. We have chosen 4 partitions, and thus the hash values are divided among the four partitions. In two dimensions, we still have four partitions, but only two along a given dimension. Thus if you have information about a last name you can determine which column to search in, but cannot determine whether your data is in partition 2 or partition 4 (assuming the hash mapped to the second column).
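To make the hashing concrete, here is a small Python sketch of the idea. The toy hash, attribute names, and partition counts are our own illustration, not HyperDex’s actual code:

```python
import hashlib

def attr_hash(value: str) -> int:
    """Stable toy hash of a single attribute value."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def coordinate(obj: dict, dims: dict) -> tuple:
    """Map an object to a cell in the partition grid.

    dims maps each attribute to the number of partitions along
    that dimension (e.g. {"first": 2, "last": 2} for Figure 2).
    """
    return tuple(attr_hash(obj[a]) % n for a, n in dims.items())

dims = {"first": 2, "last": 2}

# Full information pins the object to exactly one cell (one server)...
print(coordinate({"first": "John", "last": "Smith"}, dims))

# ...while knowing only the last name narrows the search to one column
# of cells: every value of the unknown "first" dimension.
known_last = attr_hash("Smith") % dims["last"]
print([(f, known_last) for f in range(dims["first"])])
```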

As more attributes are added, the number of dimensions increases, which drastically increases the number of servers that need to be searched to find the data. To remedy this issue, HyperDex introduces subspaces, which produce copies of the data indexed by a subset of the attributes. Since keys are primarily used to access the data, a subspace with only the key is created automatically. To handle fault tolerance, each partition of a subspace is duplicated. Since each subspace has an entire copy of the data, indexed by the subspace’s attributes, this means that each subspace is duplicated separately for each fault that needs to be tolerated.
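For reference, this is roughly how a space with secondary subspaces and fault tolerance is declared through HyperDex’s Python bindings. The snippet is recalled from the HyperDex documentation, so the exact syntax may differ slightly; the host, port, and attribute names are just placeholders:

```python
import hyperdex.admin
import hyperdex.client

# Connect to the coordinator (1982 is the default coordinator port).
a = hyperdex.admin.Admin('127.0.0.1', 1982)

# The key subspace is created implicitly; each "subspace" line adds a
# secondary subspace, and "tolerate 2 failures" means every partition
# is stored as three copies.
a.add_space('''
space phonebook
key username
attributes first, last, phone
subspace first, last
subspace phone
create 4 partitions
tolerate 2 failures
''')

c = hyperdex.client.Client('127.0.0.1', 1982)
c.put('phonebook', 'jsmith1', {'first': 'John', 'last': 'Smith', 'phone': '555-1234'})
print(c.get('phonebook', 'jsmith1'))                     # lookup by key
print(list(c.search('phonebook', {'last': 'Smith'})))    # lookup by attribute
```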

Figure 3

For the rest of this report, we will use diagrams in the format of Figure 3 to demonstrate the concepts. We will talk primarily about two subspaces, although all concepts can be extended to more. The attributes which make up a subspace simply determine which objects are hashed to which regions, and are thus not relevant to the remainder of this report.

Replex Improvements

While the HyperDex partitioning scheme is useful for reducing the number of servers that need to be contacted to retrieve data, the combination of different subspaces and replication for fault tolerance leads to a large storage overhead. A diagram for HyperDex with 12 partitions, tolerating 2 failures, is shown in Figure 4. The number of partitions and the fault tolerance lead to a total of 72 nodes. This does not map one-to-one onto physical machines, since multiple nodes are placed on each server, but contacting each node still requires contacting a server, so the number of nodes roughly correlates with the overall delay in the system. Under this replication scheme, data has to be replicated 6 times in this case: 3 copies per subspace, across 2 subspaces, to tolerate 2 failures. On the other hand, reads remain fast in the case of failure, as only the backup node has to be contacted.

Figure 4

This increase in replication and large storage overhead motivates Replex, a paper by Amy Tai, which aims to make use of the fact that each subspace stores an entire copy of the data. Thus, instead of using replication to handle fault tolerance, if a subspace node goes down it can be recovered using an alternative subspace. Tolerating n failures requires n + 1 subspaces, but since each subspace allows indexing over a different attribute, adding subspaces comes with additional benefits. A case with 12 partitions, handling 1 failure, is shown in Figure 5. The diagram indicates the potential problem with a simple replex: if a node goes down, every other node in the system has to be contacted to find data. This makes reads during failure much slower. However, data is only replicated twice, making inserts much faster.

Figure 5

To handle the downsides of simple Replex, the paper introduces a novel concept called a hybrid replex. This is shown in Figure 6 with 12 partitions, 2 subspaces, tolerating 2 failures. In a regular replex, data is sharded differently on each subspace, so there is no relationship between two different subspaces. A hybrid replex is a combination of two subspaces, and makes use of both sharding functions to determine where data is placed on the hybrid. Thus in Figure 6, if a node goes down, only 3 or 4 nodes have to be contacted for a read, depending on which subspace the dead node was in. Because there are only 3 total subspaces, data is only replicated 3 times, yet during failures reads contact fewer nodes than in the normal replex case.

Figure 6

A comparison of all three schemes is shown in the table below.

Table 1
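To make the comparison concrete, the quantities in the table can be recomputed from the 12-partition example used in the figures. The sketch below is our own back-of-the-envelope summary, not output from either system:

```python
def hyperdex(partitions, subspaces, failures):
    copies = subspaces * (failures + 1)      # e.g. 2 subspaces * 3 copies = 6
    nodes = partitions * copies              # 12 * 6 = 72 nodes (Figure 4)
    reads_during_failure = 1                 # just the backup replica
    return copies, nodes, reads_during_failure

def simple_replex(partitions, failures):
    replexes = failures + 1                  # n failures -> n + 1 replexes
    nodes = partitions * replexes
    reads_during_failure = partitions        # scan every node of the other replex
    return replexes, nodes, reads_during_failure

def hybrid_replex(partitions, n1, n2):
    # Two replexes plus one hybrid; a failed node is rebuilt from one
    # row or one column of the n1 x n2 hybrid grid.
    copies = 3
    nodes = partitions * copies
    reads_during_failure = (n1, n2)          # 3 or 4 nodes in Figure 6
    return copies, nodes, reads_during_failure

print(hyperdex(12, 2, 2))        # (6, 72, 1)
print(simple_replex(12, 1))      # (2, 24, 12)
print(hybrid_replex(12, 3, 4))   # (3, 36, (3, 4))
```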

Implementation

Installing HyperDex

Before we could begin running and modifying the HyperDex code, we first needed to get HyperDex installed. This proved to be a greater challenge than we had hoped. HyperDex contains over 50,000 lines of C++ code spread across multiple libraries, and has not been actively maintained since 2016. Eventually, after much trial and error, and after consulting with Robert (the creator of HyperDex), we were able to get the HyperDex code compiled and installed. During this process, we created a document with compilation instructions for those looking to compile the latest official release of HyperDex.

HyperDex Compilation Instructions

Debugging HyperDex

Debugging HyperDex also proved to be a challenge. Because HyperDex is written in C++, GDB was our main tool for stepping through and understanding the HyperDex code, as well as for fixing bugs as they arose. The HyperDex code is separated into four main components: the coordinator, daemon, admin, and client. The coordinator handles the back-end management of the cluster topology, while the daemons are the physical machines that store data and process requests from the client. The admin and client are user-facing tools: the admin allows a user to configure the cluster’s spaces and partitioning as necessary, while the client is used to actually store and retrieve data.

The daemon, admin, and client were all debuggable using GDB, which was a great help to us throughout the project. Unfortunately, even after much effort, we were unable to get GDB working on the coordinator. The coordinator process is actually handled by a separate library, Replicant, that runs the coordinator itself as a replicated object, and we were unable to run Replicant correctly on its own. This means that when it came time to modify the rebalancing of the partitions across our server nodes when a server went offline, we had to trace through the code by hand in order to understand what each function was doing. In fact, at first we were not even able to get log messages from the coordinator! To solve this, we added a hacky logging system to the Replicant library that wrote our logs to a hardcoded file. Using these logs, we were eventually able to modify the coordinator code.

Implementing basic Replex

The initial HyperDex paper included Figure 7, which seemed to imply that all of the subspaces covered the entire range of attributes, and that no attribute was present in more than one subspace. Our initial testing of the add_space method indicated that neither of these assumptions was true, and that the creation of a HyperDex space is quite lax. This was fortunate for us, as it simplified the initial conversion from subspaces to replexes. All that was necessary was to ensure that every created subspace had only one attribute, thereby making it a replex. In addition, it was necessary to ensure that each replex indexed on a different attribute, i.e., that there were no duplicate subspaces.

Figure 7

This second restriction was due to a difference between replexes and subspaces mentioned in the Replex paper. In HyperDex, each object is hashed by each of its attributes. Since each attribute adds a dimension, this hashing produces a point in n-dimensional space, where n is the number of attributes including the key. With the addition of subspaces, this means that an object is hashed according to the attributes in that particular subspace. For example, to insert a value into the key subspace, only the key is hashed to determine which region the value is in. In Replex, however, each replex gets its own hashing function. By restricting subspaces to not share attributes, we ensured that each object was hashed differently via the attribute of that replex, and no additional hashing functions were needed.
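The check itself is simple. Here is a sketch of the logic in Python, our own illustration rather than the actual C++ behind add_space:

```python
def validate_replex_space(key_attr, subspaces):
    """Require each subspace to be a single, distinct, non-key attribute.

    `subspaces` is a list of attribute tuples, e.g. [("first",), ("phone",)].
    """
    seen = {key_attr}  # the key subspace is created automatically
    for attrs in subspaces:
        if len(attrs) != 1:
            raise ValueError("a replex indexes exactly one attribute: %r" % (attrs,))
        attr = attrs[0]
        if attr in seen:
            raise ValueError("duplicate replex on attribute %r" % attr)
        seen.add(attr)

validate_replex_space("username", [("first",), ("phone",)])    # accepted
# validate_replex_space("username", [("first", "last")])       # rejected
```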

While these two modifications were expected to be the only ones required to implement a basic replex, we found through testing that both the Get and the Search methods also required some tuning. A Get in HyperDex is a query over the key. HyperDex automatically creates a single subspace for just the key, called the key subspace, which is the first point of contact for Get. Unfortunately, we found it to also be the last point of contact. If a failure occurs which impacts the key subspace, Get will return an error instead of continuing to search among other subspaces for the data.

Given HyperDex’s design, this behavior is expected. Replex’s contribution, however, is that it can make use of the other subspaces for recovery, which implies Replex should be able to make use of other subspaces for reads as well, unlike HyperDex. Upon investigating the Search method, we found that a regular search already allows searching among subspaces and selects whichever subspace would result in contacting the smallest number of servers. Therefore we modified Get by redirecting any Get requests to the Search method, allowing Gets to search through other subspaces if the need arises. In the non-error case, we still only contact the key subspace.
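In pseudocode, the modified Get behaves roughly as follows. The helper names and the exception are hypothetical stand-ins for the corresponding HyperDex code paths, not its real API:

```python
class RegionUnavailableError(Exception):
    """Raised when every replica of a key-subspace region is offline."""

def get(client, space, key):
    """Look up by key, falling back to a Search over the other replexes.

    `client.get_from_key_subspace` and `client.search_all_subspaces`
    are hypothetical stand-ins for the real client machinery.
    """
    try:
        # Normal case: only the key subspace is contacted.
        return client.get_from_key_subspace(space, key)
    except RegionUnavailableError:
        # Key-subspace region is down: reuse the Search machinery, which
        # picks another subspace that still holds a full copy of the data.
        matches = client.search_all_subspaces(space, {'key': key})
        return next(iter(matches), None)
```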

During this investigation of the Search method, we also found an additional problem which occurred when more nodes failed than were tolerated. This problem has two ramifications. The first is that if all nodes in a given region are down, only partial data is returned. This is shown via Figure 8 and Table 2. If server 1 is down, the data stored in nodes 1, 2, 5, and 6 is offline. As each subspace stores a copy of the data, this means that we can return data either from nodes 3 and 4 or from nodes 7 and 8, depending on the selected subspace, which is approximately 50% of the data. The usual response in most databases is to return an error, or at least to flag to the client that only partial data is being returned. This is arguably not a problem for HyperDex, as it theoretically only occurs when more failures occur than are tolerated, and HyperDex makes no guarantees about the data in that case. However, this is still a problem for Replex: even if all nodes in a region are down, Replex is supposed to be able to make use of the other subspace to recover the data fully and continue running as normal.

Figure 8
Table 2

The second ramification is due to the way servers were selected. For each subspace, the number of servers needed to gather the data is calculated, and the subspace which contacts the fewest servers is selected. This is done to minimize the load on the servers and the network. When all nodes in a region are down, instead of marking the subspace as containing partial data, the dead servers are simply not counted. This can lead to choosing a subspace with dead regions and fewer servers to contact over a subspace with fully live regions but more servers to contact. HyperDex uses a partitioning scheme which is symmetric across subspaces, as shown in Figure 8, so this situation does not arise in HyperDex. Replex removes this symmetry, which allows this situation to occur. To fix this for the Replex case, we modified Search to prioritize complete data over incomplete data, regardless of the number of servers, and to warn the client if they are receiving incomplete data.
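Sketched in Python, the selection rule we ended up with looks roughly like this (our own restatement of the logic, not the C++ implementation):

```python
def choose_subspace(plans):
    """Pick which subspace a Search should contact.

    `plans` maps subspace name -> (servers_to_contact, has_dead_region),
    e.g. {"key": (4, True), "attr1": (6, False)}.
    Returns (chosen subspace, partial-data warning flag).
    """
    complete = {s: p for s, p in plans.items() if not p[1]}
    if complete:
        # Prefer full data; among complete subspaces, minimize servers.
        best = min(complete, key=lambda s: complete[s][0])
        return best, False
    # Every subspace has dead regions: return the cheapest one, but
    # flag the client that the results will be partial.
    best = min(plans, key=lambda s: plans[s][0])
    return best, True

print(choose_subspace({"key": (4, True), "attr1": (6, False)}))  # ('attr1', False)
```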

Implement Replex failure handling

As mentioned in the previous section, HyperDex’s partitioning scheme is symmetric across subspaces. An example space with 12 partitions and two subspaces, tolerating 2 failures, is shown in Figure 9. This symmetry is a major issue for Replex, as Replex assumes that other subspaces can be contacted if a failure occurs, and does not rely on duplicate nodes. If a server goes down under HyperDex’s partitioning scheme, every subspace will lose nodes. While some data could be found in other regions and used to reconstruct the lost nodes, any data that existed in the same regions on both subspaces will be permanently gone. Referring to Figure 8, any data in both nodes 3 and 7 would be lost if server 2 went down.

Figure 9

In addition to this symmetry, HyperDex’s partitioning scheme has another issue. In Figure 9, servers 1 and 11 handle 4 nodes, servers 2, 4, 5, 9, and 10 handle 6 nodes, servers 3, 6, 7, and 8 handle 8 nodes, and server 12 handles only 2. This is highly unbalanced, and we are not sure if it was intended. In Figure 8, server 4 doesn’t handle any nodes at all; that is, server 4 will never be contacted to store data. While this unbalanced partitioning is not necessarily problematic for Replex, since the partitioning scheme already needed to be modified to remove symmetry, we ensured that the new partitioning scheme was also as balanced as possible. Figure 10 demonstrates the Replex partitioning scheme on the same space.

Figure 10

To compute this scheme, we first divide the servers among the subspaces. This imposes one requirement: there must be at least as many servers as subspaces, one for each subspace. If the servers do not divide evenly among the subspaces, we share the k leftover servers among the first k subspaces, which always includes the key subspace. This is done so that the key subspace always has at least as many servers as the others, ensuring that any servers that drop have minimal impact on the key subspace. For each subspace, the allocated servers are then divided up among the regions by modding by the number of servers allocated to the subspace, and shifting by 1 for each replicated node. An example of a non-even subdivision is given in Figure 11, with 12 nodes and 9 servers. Servers 1 through 5 are assigned to the key subspace, and servers 6 through 9 are assigned to the Attribute 1 subspace. Since 4 evenly divides 12, servers 6 through 9 each handle 3 main nodes (9 total with replication). Since 5 does not evenly divide 12, servers 1 and 2 each handle 1 additional main node.

Figure 11
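A sketch of this assignment rule, as described above, is shown below. The parameter names are ours, and the real implementation lives in the coordinator’s C++ code:

```python
def assign_servers(servers, num_subspaces, partitions, replicas):
    """Replex-style partition assignment: disjoint server pools per subspace,
    regions spread by modular round-robin, replica slots shifted by one.

    Returns {subspace index: [list of servers per partition]}.
    """
    assert len(servers) >= num_subspaces, "need at least one server per subspace"
    base, extra = divmod(len(servers), num_subspaces)
    assignment, start = {}, 0
    for s in range(num_subspaces):
        # Leftover servers go to the first subspaces (key subspace first).
        count = base + (1 if s < extra else 0)
        pool = servers[start:start + count]
        start += count
        regions = []
        for p in range(partitions):
            # Main server plus `replicas` copies, each shifted by one server.
            regions.append([pool[(p + r) % len(pool)] for r in range(replicas + 1)])
        assignment[s] = regions
    return assignment

# Figure 11: 12 nodes, 9 servers, 2 subspaces, 3 copies per region.
# Servers 1-5 serve the key subspace; servers 6-9 serve Attribute 1.
layout = assign_servers(list(range(1, 10)), 2, 12, 2)
```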

Initial server assignments are made when the space is created, and thus before any data exists in the database. When servers enter and leave the space, however, rebalancing must occur. In HyperDex, if an entire region goes down, it is no longer included in the rebalancing, as it is in an unrecoverable state. For Replex, however, if a region goes down the data can still be recovered from another server, and thus Replex’s rebalancing takes dead regions into consideration as well. Because of this decision, we implement rebalancing in a different manner: we are able to calculate the reassignments up front, knowing which servers are down, since it does not matter whether an entire region is dead. HyperDex, on the other hand, rebalances region by region, which allows it to determine if a region is dead and rebalance accordingly.

Implement Replex Recovery Transfers

When a new server comes online, it connects to the coordinator and registers itself as available. The coordinator then rebalances the partitions across the nodes to accommodate the new server, and pushes this configuration out to the daemons. Once the daemons have received the new configuration, it is their job to communicate with each other in order to satisfy the rebalancing scheme by transferring data to the appropriate servers. In HyperDex, the new server only needs to contact one server per partition that it manages. In Replex, the new server needs to talk to every node in the other subspace in order to recover its data, and then selectively keep the data corresponding to its own partitions. This was the last step required to complete our implementation of Replex. Specifically, we would need to create a new Transfer protocol which would essentially send out a Search for the data. The daemon could act as a client, and use a similar procedure to the client Search function, searching by hash value instead of by key or attribute value. Then the contacted daemons would return the data as they would if they were contacted by the client. Unfortunately, we did not have time to complete these transfers.
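Had we finished it, the transfer might have looked roughly like the sketch below, where the recovering daemon filters a scan of the other replex down to its own partitions. The helper calls are hypothetical; none of this exists in our code:

```python
import hashlib

def stable_hash(value) -> int:
    """Deterministic stand-in for a replex's sharding function."""
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

def recover_partition(daemon, space, my_attr, my_partition, num_partitions, other_replex):
    """Rebuild one partition of a recovering replex from a surviving replex.

    `other_replex.regions`, `other_replex.scan_region`, and
    `daemon.store_locally` are hypothetical stand-ins for the transfer
    protocol we would have added.
    """
    for region in other_replex.regions(space):
        for obj in other_replex.scan_region(space, region):
            # Keep only the objects that hash into the partition we now own.
            if stable_hash(obj[my_attr]) % num_partitions == my_partition:
                daemon.store_locally(space, obj)
```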

Hybrid Replexes

While we did not manage to implement hybrid replexes, it is worth mentioning how we would have implemented them given the time. In the original paper, each replex is created with an arbitrary sharding function. Since we ensure that all replexes are distinct, we could use the same hash function throughout, because each subspace hashes only on its own attribute. To implement hybrids in an identical fashion to the paper, we would thus have had to add a sharding method to our replex subspaces, and then use the hybrid equation from the paper, shown in Equation 1. This would require adding more attributes to the current subspace object, including the values n_1 and n_2 and the sharding functions for each attribute that is part of the hybrid. In addition, this does not scale well to hybrids which are shared across r replexes.

Equation 1

If we deviate from the implementation suggested in the paper, there is in fact a simple mechanism to implement hybrid replexes for both the 2-replex case and the r-replex case. It is motivated by the realization that there is very little difference between a subspace with 2 attributes and a hybrid replex. In a subspace with 2 attributes, we hash an object on both attributes. This can be thought of as a different sharding function than hashing on only the first attribute or only the second. It will result in a slightly different distribution of objects among the subspace, but the functionality will be practically identical. The values n_1 and n_2 in the equation can be converted directly into the number of partitions along each of the two dimensions. This is demonstrated in Figure 12, with values of 3 and 4 for n_1 and n_2.

Figure 12
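Concretely, for the 3 × 4 layout in Figure 12, the hybrid placement and the recovery sets can be sketched as follows. This is our own illustration with a toy hash, not code from HyperDex or the Replex paper:

```python
import hashlib

def h(value) -> int:
    """Stable toy hash standing in for a replex's sharding function."""
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

N1, N2 = 3, 4              # partitions along each hybrid dimension (Figure 12)
PARTITIONS = N1 * N2       # the product must equal the partitions per replex

def hybrid_cell(obj, attr1, attr2):
    """Place an object on the hybrid using both sharding functions."""
    return (h(obj[attr1]) % N1, h(obj[attr2]) % N2)

def recovery_cells(failed_replex, failed_partition):
    """Hybrid cells to read when a partition of replex 1 or 2 dies."""
    if failed_replex == 1:
        row = failed_partition % N1            # same replex-1 hash class
        return [(row, c) for c in range(N2)]   # one row: 4 cells
    else:
        col = failed_partition % N2
        return [(r, col) for r in range(N1)]   # one column: 3 cells

print(recovery_cells(1, 7))  # 4 nodes to contact
print(recovery_cells(2, 7))  # 3 nodes to contact
```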

All that is needed, then, to implement an r-way hybrid replex in HyperDex is passing in n_1 through n_r and validating that their product is equal to the number of specified partitions. Then, in the component of HyperDex which creates the regions for each subspace, these values are passed directly to create the dimension division shown at the top of Figure 12. This is also an improvement on the current procedure for dividing the p partitions among the n attributes. One would expect that if p partitions are specified, p partitions are created per subspace. However, HyperDex divides partitions into regions according to the equation shown in Equation 2, with p partitions, n subspaces, s_i attributes per subspace, and f failures. HyperDex will create more than p partitions per subspace, specifically when p is not an even power of s_i. By having the user specify the per-dimension values n_i directly, exactly the number of partitions the user specified will be created.

Equation 2

Evaluation

Our evaluation goal for the project was to reproduce the Replex paper’s analysis of server performance as a node crashed and a new node was brought back online, in order to demonstrate Replex’s ability to recover from failure faster. Figure 13 demonstrates this evaluation. After 25 seconds, a node is taken down from the cluster, causing performance to decrease.

Figure 13

In HyperDex’s case, the recovering server only needs to contact a single node for each partition it manages. In Replex-2, without hybrid replexes, performance, measured in operations per second, is much lower than in HyperDex because the server needs to recover data from every node in the other subspace, causing all of them to experience increased load. Reads during that time are thus incredibly slow. However, because only a small portion of data is needed from each node and these queries can be run in parallel, the overall data transfer to bring the server back online is faster, which is why Replex-2 comes back online at around 75 seconds.

In Replex-3, with hybrid replexes, the recovering server needs to contact a smaller set of nodes to retrieve all of its data, so performance does not suffer as badly, and due to the increased number of nodes that can be contacted, the recovery can also be run in parallel, resulting in faster data transfers to bring all servers back online. It appears that having to read from 3 or 4 nodes instead of a single node does not decrease the number of operations that can be handled per second, compared to HyperDex.

To reproduce this scenario, it was crucial to implement support for recovery transfers in our system. Because we were unable to finish this in time, we were unable to reproduce this graph.

We ran a benchmark on our system using YCSB, the Yahoo! Cloud Serving Benchmark, a common framework for testing database systems. The test was run on my local machine with a 4-core processor, running in an Ubuntu VM. We tested a cluster with 1 coordinator and 4 daemons running. We loaded the database with 100k records, and then ran a 95% read / 5% update workload with 1M operations.

Our results showed that the system was highly stressed, reporting a maximum of 1k operations per second and decreasing to only 200 ops/s. This is likely due to contention among the daemons running on a single machine, as well as contention with other programs running on my computer. Ideally we would deploy our system on a school cluster or AWS for better performance and more accurate benchmarking.

Conclusion

Though we faced many challenges, we came very close to a complete implementation, and we are very satisfied with the progress we made. In addition to nearly completing Replex, we determined a more efficient way to implement hybrid replexes over r replexes, instead of just the 2 discussed in the paper. We also created our own partitioning scheme that is balanced across servers and separate across subspaces. Finally, we wrote a clear set of instructions on how to install and run HyperDex, useful for anyone who wants to give HyperDex another shot.

All our code can be found here:

GitHub

Acknowledgements

A big thanks to Robert Escriva, the author of HyperDex, and Amy Tai, the author of Replex, for being so responsive and willing to help us through email. We could not have completed the project without their assistance. Thanks to Mike Freedman and Andrew Or for a great semester!
