CRDT storage for the Edge
Despite advancements in connectivity for user devices by service providers, mobile devices will, by their very nature, remain subject to frequent periods of disconnection. For the user to be able to interact with an application during these periods, the application must store its data on the device. Users can then interact with their local copy. In addition, local updates need to be recorded and delivered to the server once the connection is re-established.
But what happens if this data has already been modified in the meantime? Such modifications can stem, for example, from another user on some shared data or even the same user, but on a different device. Typically, implementing conflict resolution for concurrent updates on application data is left to the programmer. The demand for offline support in native, mobile, and web applications has therefore led to many ad-hoc solutions that often do not provide well-defined consistency guarantees.
This system model introduces a number of additional challenges:
- Clients may disconnect frequently and at arbitrary points in time. If a client disconnects, it is not known when and if it will be connected again;
- Clients have only limited storage capacity;
- Clients and servers do not share a common clock; their physical time is likely to diverge.
Systems that integrate offline capabilities for clients must address these challenges in a scalable, safe, and secure way. In the context of the European projects SyncFree and Lightkone, we have been working on principled solutions, which we highlight in this article.
SwiftCloud was the first system to provide fast reads and writes via a causally consistent client-side local cache backed by geo-replicated cloud storage. The idea is to cache a subset of the objects from the data centers on the client to provide availability and local latency. As our experimental evaluation showed, SwiftCloud significantly improves latency and throughput compared to general geo-replication techniques. Further, it provides availability during faults by enabling clients to switch between data store servers when the current one does not respond. To ensure that concurrent updates converge, SwiftCloud relies on CRDTs as a correct and principled conflict resolution strategy. It also guarantees Transactional Causal Consistency (TCC), even across sessions.
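To illustrate how a CRDT makes concurrent updates converge, here is a minimal sketch of a state-based grow-only counter. It is a textbook example and not SwiftCloud's implementation; the class name `GCounter` and the replica names are chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise maximum."""
    replica_id: str
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        # Each replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two devices update the same counter while disconnected from each other.
phone = GCounter("phone")
laptop = GCounter("laptop")
phone.increment(2)
laptop.increment(3)

# After exchanging state in any order, both replicas agree.
phone.merge(laptop)
laptop.merge(phone)
laptop.merge(phone)  # merging the same state again changes nothing
```

Because the merge function never loses an increment and tolerates repeated delivery, no application-specific conflict handling is needed for this data type.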
To simplify the development of web apps with offline availability, we developed WebCure, a framework for partial replication of data for web apps (Figure 1). It consists of a client-side data store that maintains both the data received from the cloud storage server and the updates that have been executed by the client but not yet delivered to the cloud store. A service worker acts as a proxy on the client side. While the client is offline, it forwards requests to the client-side database. Finally, a cloud storage server maintains the data and shares it between different clients.
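The split between data received from the server and updates not yet delivered can be sketched as follows. This is a simplified model under stated assumptions, not WebCure's API: `LocalStore`, `FakeServer`, and the counter-style `add` operation are hypothetical names, and the server is an in-memory stand-in.

```python
from typing import Dict, List, Set, Tuple

Update = Tuple[str, str, int]  # (unique update id, key, amount to add)


class FakeServer:
    """Hypothetical stand-in for the cloud store; applies each update id at most once."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.seen: Set[str] = set()

    def apply(self, update: Update) -> None:
        uid, key, amount = update
        if uid in self.seen:  # retransmitted update, ignore it
            return
        self.seen.add(uid)
        self.data[key] = self.data.get(key, 0) + amount


class LocalStore:
    """Client-side store: cached server state plus locally executed, undelivered updates."""

    def __init__(self, client_id: str, server: FakeServer) -> None:
        self.client_id = client_id
        self.server = server
        self.online = False
        self.cached: Dict[str, int] = {}  # last state received from the server
        self.pending: List[Update] = []   # executed locally, not yet delivered
        self._next = 0

    def read(self, key: str) -> int:
        # Reads see the cached state with the client's own pending updates applied.
        return self.cached.get(key, 0) + sum(a for _, k, a in self.pending if k == key)

    def add(self, key: str, amount: int) -> None:
        self._next += 1
        self.pending.append((f"{self.client_id}:{self._next}", key, amount))
        if self.online:
            self.sync()

    def sync(self) -> None:
        # Deliver pending updates, then refresh the cache from the server.
        for update in self.pending:
            self.server.apply(update)
        self.pending.clear()
        self.cached = dict(self.server.data)


server = FakeServer()
phone = LocalStore("phone", server)
phone.add("attendees", 1)          # executed while offline
phone.add("attendees", 2)
offline_read = phone.read("attendees")
server_before = dict(server.data)  # the server has not seen anything yet
phone.online = True
phone.sync()                       # connection re-established
```

In a browser, the role played here by the `online` flag would fall to the service worker, which decides whether a request goes to the server or to the client-side database.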
To demonstrate the applicability of WebCure, we have implemented a collaborative calendar application (Figure 2). It lets users share calendars with other users, who can then concurrently modify their appointments. A preliminary evaluation of the app shows that the latency of user interactions drops by two orders of magnitude, from 321 ms to 2.3 ms, when operating on the local cache instead of accessing a server deployed on the same machine. Further, the additional libraries increased the app's size by a mere 400 KB.
Finally, Legion shows another exciting approach to addressing availability and scalability by using a cache on the client side. In contrast to SwiftCloud and WebCure, Legion avoids a centralized infrastructure for mediating user interactions, as such an infrastructure causes unnecessarily high latency and hinders fault tolerance and scalability. Instead, Legion enables client web applications to securely replicate data from servers to groups of clients and afterwards synchronize the data among the clients directly by peer-to-peer interaction (Figure 3).
To support these interactions, (subsets of) clients establish overlay networks to propagate objects and updates among them. This change makes the system less dependent on the server. Moreover, it reduces the latency of interactions among nearby clients.
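As a rough illustration of propagating updates over an overlay without involving the server, the following sketch floods each update to all neighbours and uses unique update identifiers to stop re-forwarding. It is not Legion's protocol; the `Peer` class and the chain topology are assumptions made for the example.

```python
from typing import Dict, List


class Peer:
    """A client in a hypothetical overlay; forwards each new update to its neighbours."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.neighbours: List["Peer"] = []
        self.updates: Dict[str, str] = {}  # update id -> payload

    def connect(self, other: "Peer") -> None:
        # Overlay links are bidirectional.
        self.neighbours.append(other)
        other.neighbours.append(self)

    def receive(self, update_id: str, payload: str) -> None:
        if update_id in self.updates:
            return  # already seen: do not apply or forward again
        self.updates[update_id] = payload
        for peer in self.neighbours:
            peer.receive(update_id, payload)


# Three nearby clients form a chain a - b - c; no server is involved.
a, b, c = Peer("a"), Peer("b"), Peer("c")
a.connect(b)
b.connect(c)
a.receive("a:1", "move meeting to 10:00")
```

The duplicate check is what keeps the flood from looping between neighbours; a real overlay would also need to handle peers joining and leaving.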
How do the presented systems differ, and what is their major technical innovation? To prevent data loss or corruption, edge storage systems need to ensure that updates are not duplicated due to connectivity problems during transmission and synchronization. Further, all replicas, whether on a client or on a server, must safely converge. When these guarantees need to be provided across sessions with potentially different data servers or peers, the data management requires additional metadata such as unique identifiers, tags, and (vector) timestamps. Causality tracking is particularly tricky, as it requires further properties of the metadata, such as monotonicity. For edge systems, this metadata becomes infeasible if it scales with the number of clients, because clients join and leave frequently. All presented systems therefore use techniques to subsume updates from (groups of) clients in some form of metadata that is shared among a smaller number of nodes. For instance, in SwiftCloud, client updates are tagged with vector timestamps whose size scales with the number of data centers; WebCure follows a similar approach.
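The role of vector timestamps can be sketched as follows: one entry per data center (not per client), a monotone merge, a causal-order check, and duplicate detection on delivery. This is a generic vector-clock sketch and not the actual SwiftCloud or WebCure protocol; the data center names and the assumption that updates from one data center arrive in sequence order are illustrative.

```python
from typing import Dict, List

VectorClock = Dict[str, int]  # data center id -> number of its updates seen so far


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum; the result never goes below either input (monotone)."""
    return {dc: max(a.get(dc, 0), b.get(dc, 0)) for dc in a.keys() | b.keys()}


def leq(a: VectorClock, b: VectorClock) -> bool:
    """True if everything reflected in `a` is also reflected in `b`."""
    return all(n <= b.get(dc, 0) for dc, n in a.items())


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """Neither clock causally precedes the other."""
    return not leq(a, b) and not leq(b, a)


class Replica:
    """Applies each update exactly once, assuming per-data-center in-order delivery."""

    def __init__(self) -> None:
        self.clock: VectorClock = {}
        self.log: List[str] = []

    def deliver(self, dc: str, seq: int, payload: str) -> bool:
        seen = self.clock.get(dc, 0)
        if seq <= seen:
            return False  # duplicate caused by a retransmission
        if seq != seen + 1:
            raise ValueError("gap in sequence: an earlier update is missing")
        self.clock[dc] = seq
        self.log.append(payload)
        return True


r = Replica()
first = r.deliver("dc-eu", 1, "create appointment")
retry = r.deliver("dc-eu", 1, "create appointment")  # resent after a timeout
other = r.deliver("dc-us", 1, "rename calendar")
```

The clock here has one entry per data center, so its size stays constant no matter how many clients connect or disappear.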
The direct peer-to-peer interaction model of Legion is particularly suited for edge networks. It hits a sweet spot between robustness and safety. In current work, we are investigating how to enrich this model with transactional constructs to simplify the programmer's quest for correct application semantics and to support the Just-Right Consistency approach.
By Annette Bieniusa, TU Kaiserslautern