SATIS 2018 — Day 2

Javier Bustos
Aug 16, 2018 · 8 min read

First day report:

0:20. I wanted a picture of the northern lights, but I’ve only got this one… and a cold.

Second day started… with breakfast, followed by Robbert van Renesse lecturing about Fail-Stop replication. As you can read in the school’s program, State Machine Replication protocols such as Paxos have received much theoretical attention lately and many implementations are available, but compared to traditional Primary-Backup (PB) protocols they are quite inefficient. With recent advances in failure detection, PB protocols are still of practical interest, but in their usual forms they do not support self-configuration or recovery from total failure. Robbert showed how Chain Replication and Primary-Backup can indeed support self-configuration and recover from total failures.

The talk started with some key points of replication history, going back to the early seventies with the work of Mullery “The distributed control of multiple copies of data” (1971), Johnson & Thomas “The maintenance of duplicate databases” (RFC 677, 1975), Alsberg & Day “A principle for resilient sharing of distributed resources” (1976), Lamport “Time, clocks, and the ordering of events in a distributed system” (1978), and Thomas “A solution to the concurrency control problem for multiple copy databases” (1978).

What is State Machine Replication? A generic way to tolerate failures whose objective is single-copy behavior: simply start multiple replicas (copies) of a deterministic state machine and keep them in sync by agreeing on the inputs and the order in which to apply them. The model runs under a set of assumptions (“the more assumptions you make the worse it is; each assumption is a vulnerability”) about timing (no bounds on timing! such as latency, speed, clock synchronization, etc.), node failure, and communication (loss, reordering, and duplication allowed; fair links; perfect checksums). But replicas are expensive, so we have to think carefully about what kind of fault tolerance we want.
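The core idea can be sketched in a few lines (an illustrative toy, not real protocol code): as long as every replica of a deterministic state machine applies the same inputs in the same agreed order, all replicas end in the same state.

```python
# Minimal sketch of state machine replication: deterministic replicas
# kept in sync by applying one agreed-upon input sequence in order.

class CounterMachine:
    """A trivial deterministic state machine: an integer counter."""
    def __init__(self):
        self.state = 0

    def apply(self, op):
        if op == "inc":
            self.state += 1
        elif op == "dec":
            self.state -= 1

# Three replicas, one agreed input sequence. The hard part in a real
# system is *agreeing* on this order despite failures; here it is given.
replicas = [CounterMachine() for _ in range(3)]
agreed_log = ["inc", "inc", "dec", "inc"]

for op in agreed_log:
    for r in replicas:
        r.apply(op)

# All replicas end in the same state: single-copy behavior.
assert all(r.state == 2 for r in replicas)
```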

Van Renesse focused on Fail-Stop, because replicas are expensive and Fail-Stop is a reasonable assumption in datacenters; on asynchrony, because latency bounds would have to be very conservative and would result in slow systems; and on FIFO communication, because it simplifies life. Also, he did not consider disks, because logically there is no significant difference between a processor with or without a disk.

Some existing replication protocols for the Fail-Stop Model are Primary-Backup (1976) and Chain Replication (Van Renesse & Schneider, 2004). Both assume an external configuration service that reconfigures surviving replicas after a failure. But, in practice, you don’t need such a service.

The system is based on two replicas: a head and a tail. Updates are directed from a proxy to the head, which propagates them to the tail. Queries are directed from the proxy to the tail. When there is a failure, the remaining processor becomes both the head and the tail: if the head fails you only have to look at the tail replica (it holds the state of the system; mostly nothing happens), and if the tail fails its history magically becomes stable instantly, so the remaining replica can handle the queries.

When a new process joins the system, no external configuration service is needed: state transfer happens entirely in the background, and after the state transfer a tail failure can be tolerated.

Generalizing to more than 2 replicas gives a chain of replicas, where each replica maintains a “speculative configuration” based on the configuration operations in its speculative history. A configuration command becomes stable when it reaches the tail.
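The chain structure described above can be sketched as follows (illustrative only; a real implementation handles failures, reconfiguration and acknowledgements): updates enter at the head, propagate down the chain, and become stable once they reach the tail, which answers queries.

```python
# Sketch of a replica chain: updates flow head -> tail; an update is
# stable when it appears in the tail's history.

class Replica:
    def __init__(self):
        self.history = []   # speculative history of updates seen so far
        self.next = None    # successor in the chain (None for the tail)

    def update(self, op):
        self.history.append(op)
        if self.next is not None:
            self.next.update(op)    # propagate toward the tail

def make_chain(n):
    replicas = [Replica() for _ in range(n)]
    for a, b in zip(replicas, replicas[1:]):
        a.next = b
    return replicas

chain = make_chain(3)
head, tail = chain[0], chain[-1]
head.update("x=1")
head.update("x=2")

# Both updates reached the tail, so they are stable and queryable there.
assert tail.history == ["x=1", "x=2"]
```

If the head fails, its successor simply becomes the new head; the tail's history is already stable, so queries continue unchanged.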

Chain of replicas

The following talk was by Christian Cachin:

Cachin explained that a blockchain is a distributed system for executing and recording transactions, which is maintained by many nodes without a central authority. All nodes collaboratively validate the information to be included in the blockchain through cryptography and distributed consensus. Blockchains offer resilience and security based on the collective trust placed in the nodes maintaining it.

He revisited protocols for Byzantine consensus and explored older and newer protocols that power blockchains. He started from the very beginning: the ledger. Ledgers record all business activity as transactions; every market and network defines a ledger, which records asset transfers between participants. Every market has its ledger and every organization has its own ledger.

Blockchain: Distributed Ledgers

So, the question is: do you need a blockchain? (editor’s note: no). The four necessary features of a distributed blockchain task are: it stores data, multiple nodes write, not all writing nodes are trusted, and operations are (somewhat) verifiable. If any feature is missing, don’t use a blockchain.

One of the best things about blockchains is that transactions can be arbitrary code (smart contracts), which embodies logic that responds to events (on the blockchain) and may transfer assets in response. How is consistency ensured? Given the probabilistic approach, forks occur regularly but do not last forever (with high probability): the probability of a fork k blocks long is exponentially small in k.

When constructing a block, the node has to validate all transactions and decide an ordering within the block. Then it must ensure that only valid transactions enter the blockchain, that is, re-run all the smart-contract code. Validation can be expensive: for instance, the Bitcoin blockchain contains the log of all transactions, ~180 GB as of August 2018.
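The order-then-validate step can be sketched like this (a hypothetical toy data model with simple balance transfers, not any real node's code): the node orders pending transactions, re-runs each one against the current state, and admits only the valid ones into the block.

```python
# Sketch of block construction: order pending transactions (here by fee),
# re-run each against the state, and keep only the valid ones.

def build_block(state, pending_txs):
    """Order and validate transactions; return (block, new_state)."""
    block = []
    for tx in sorted(pending_txs, key=lambda t: t["fee"], reverse=True):
        sender, amount = tx["from"], tx["amount"]
        if state.get(sender, 0) >= amount:      # validate by re-running
            state[sender] = state.get(sender, 0) - amount
            state[tx["to"]] = state.get(tx["to"], 0) + amount
            block.append(tx)
    return block, state

state = {"alice": 10}
txs = [
    {"from": "alice", "to": "bob",   "amount": 8, "fee": 1},
    {"from": "alice", "to": "carol", "amount": 8, "fee": 2},  # ordered first
]
block, state = build_block(state, txs)

# The two transactions conflict; after ordering, only one is valid.
assert len(block) == 1 and state["carol"] == 8
```

Note how the chosen ordering decides which of two conflicting transactions survives, which is exactly why every validator must re-run them in the same order.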

Further information in “Distributing Trust on the Internet”, Christian Cachin, in Proc. Intl. Conference on Dependable Systems and Networks (DSN-2001), Göteborg, Sweden, IEEE, 2001.

“In cryptography it is impossible to demonstrate that a cryptosystem works, because one can only demonstrate that it fails; we must treat consensus protocols like cryptosystems.”

The talk also introduced Hyperledger Fabric, a modular and extensible blockchain platform that is developed open-source under the Hyperledger Project and which was originally contributed by IBM. Fabric introduces a novel architecture for building resilient distributed systems that differs from the conventional paradigm, in order to accommodate flexible trust models, to cope with non-determinism, and to prevent resource exhaustion. There are currently several hundred prototypes, proofs-of-concept, and production systems of distributed ledger technology that use Fabric as a platform for distributing trust over the Internet.

Final lecture of the day was Michael Franklin with: “Distributed Data Analytics”.

“The relational model provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other” (E.F. Codd, CACM 1970, emphasis added by Michael)

But, how to get good performance?

Shortly after the relational model was introduced, several companies sold “relational database management systems”: machines “capable of running relational queries” with domain-specific hardware. But academia disagreed:

“We claim that ‘off-the-shelf’ processing components provide sufficient processing power and that there is no need to implement customized database processors either through microcode or VLSI…” (Boral & DeWitt, International Workshop on Database Machines, Munich 1983)

Also, it was interesting to revisit “The Case for Shared Nothing” by Stonebraker, comparing three architectures for DB machines: shared memory, shared disk, and shared nothing (High Performance Transaction Processing Workshop, Asilomar 1985). The paper argued that shared nothing was better. This kind of design scaled very well until around 2010, thanks to a declarative, set-oriented language (SQL) and Moore’s Law.

Shared-nothing query basics: parallelism can be inter-query or intra-query (inter-operator, such as tree and pipeline parallelism, and intra-operator, such as divide & conquer). Data partitioning: by range? By hash? Round robin? (Why is round robin good? Because you will always be bounded by your slowest processor.) Scans in parallel? Because each system is a full-blown DBMS, indexes can be built at each partition and partial query plans can be executed at each node. Sorting? It is more challenging: either each node sorts locally and the results are merged, or, with partitioned data, the sort is done after the data is received. In the latter case, one can sample to estimate the data distribution and choose ranges to get uniformity. Joins? The good case is without data movement (all tables well organized, and partitioned parallelism), but…
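The sampling trick for parallel sorting can be sketched in a few lines (a single-process illustration of the idea; real systems would ship each partition to a different node): sample the data to estimate its distribution, pick range boundaries from the sample's quantiles, route each row to its range, then sort each range locally.

```python
# Sketch of sample-based range partitioning for a parallel sort:
# concatenating the locally sorted ranges yields a globally sorted result.

import random

def range_partition_sort(rows, n_parts, sample_size=100):
    sample = sorted(random.sample(rows, min(sample_size, len(rows))))
    # Boundaries taken at sample quantiles, for roughly uniform load.
    bounds = [sample[i * len(sample) // n_parts] for i in range(1, n_parts)]
    parts = [[] for _ in range(n_parts)]
    for r in rows:
        i = sum(r >= b for b in bounds)   # index of the range r falls in
        parts[i].append(r)
    return [sorted(p) for p in parts]     # local sorts, done in parallel

rows = [random.randint(0, 1000) for _ in range(1000)]
parts = range_partition_sort(rows, 4)
merged = [x for p in parts for x in p]
assert merged == sorted(rows)             # concatenation is globally sorted
```

Without the sampling step, a skewed distribution could dump nearly all rows into one partition, and the sort would again be bounded by the slowest node.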

Grouping? Again, skew is an issue. Approaches: avoid it (choose the partition function carefully) or react (migrate groups to balance load).
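A sketch of the “react” approach (my own illustration, not any specific system's algorithm): instead of a fixed hash partitioning, assign each group to the currently least-loaded node, placing the largest groups first, so one huge group cannot swamp a single worker.

```python
# Greedy load balancing of groups across nodes: big groups first,
# each placed on the currently least-loaded node (a min-heap of loads).

import heapq
from collections import Counter

def balance_groups(group_sizes, n_nodes):
    loads = [(0, node) for node in range(n_nodes)]
    heapq.heapify(loads)
    assignment = {}
    for group, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(loads)   # least-loaded node
        assignment[group] = node
        heapq.heappush(loads, (load + size, node))
    return assignment

sizes = Counter({"a": 900, "b": 50, "c": 40, "d": 10})
assignment = balance_groups(sizes, 2)

# The huge group gets a node to itself; the small ones share the other.
assert assignment["a"] != assignment["b"]
```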


That was the last 30+ years: shared-nothing on commodity servers has been the “gold standard”. But a massive shift is coming: data analytics is moving to the cloud, closed- and open-source “big data” platforms have appeared (Hadoop, Spark, Impala, etc.), cloud architectures are becoming disaggregated, Moore’s Law is ending, and machine learning is on the rise.

Why are people moving to the cloud? Mostly because of its rock-bottom storage prices and high elasticity. Also, initial security concerns were addressed (and were largely unfounded).

That is, the rise of shared storage.

Then, Michael compared three examples of this “new” design, using the slides of “The End of Shared Nothing” by David J. DeWitt because, in his words, “it was better explained there” (PPTX slides available here, and in fact, the systems are veeery well explained in David’s slides :-))

  1. Amazon (AWS) Redshift: a single leader node and one or more compute nodes (EC2 instances), one slice per core, with memory, storage & data partitioned per slice; hash & round-robin data partitioning; a unique fault-tolerance approach (each 1 MB block gets replicated on another node and on S3).

  2. Snowflake Elastic DW: compute decoupled from storage, highly elastic; query-level control through the Virtual Warehouse mechanism.

  3. SQL DW: DB-level adjustment of DWU capacity.

Rise of Machine Learning: ML is increasingly part of the data-analytics pipeline, and applications increasingly require multiple modes of analytics: query, graph, streaming, machine learning, deep learning.

Michael Franklin finished his Lecture with the following message:

IoT and edge processing: for latency, reliability, privacy, legal, and actuation reasons, some analytics will need to be processed at the edge of the network, an extreme version of “shared nothing”. There’s a desperate need for innovation and lots of great research to be done.

Poster Session

There were not as many posters as students at this school, but the fewer the better :-) and most of them were very interesting. I post my two favorites:

Lukas Burkhalter presented his work using a very nice mathematical trick for creating a set of keys where key K(i+1) depends on K(i) for its calculation. The whole system is intended to protect users’ privacy in crowd data storage.
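One common way to build such a key chain (a sketch of the general technique; the poster's exact construction may differ) is to derive each key by hashing the previous one with a one-way function, so K(i+1) is computable from K(i) but not vice versa:

```python
# Sketch of a one-way key chain: K(i+1) = SHA-256(K(i)).
# Knowing K(i) reveals all later keys but none of the earlier ones.

import hashlib

def key_chain(seed: bytes, n: int):
    keys, k = [], seed
    for _ in range(n):
        k = hashlib.sha256(k).digest()  # next key depends only on previous
        keys.append(k)
    return keys

keys = key_chain(b"secret seed", 5)

# Forward derivation is trivial; going backwards would require
# inverting SHA-256.
assert hashlib.sha256(keys[0]).digest() == keys[1]
```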

Nikos Kouvelas presented his work on a new LoRaWAN MAC protocol that guarantees maximum channel throughput, one of the few network-related works at the school. It is interesting because LoRaWAN throughput has actually been reported as less than 50 kbps in optimal conditions, and its MAC protocol seems to be one of its biggest bottlenecks.

Final activity of the day, the dinner:
