VLDB 2015: Concurrency, DataFlow, E-Store

Rob Story
12 min read · Sep 8, 2015

--

Last week was the 41st edition of the Very Large DataBases Conference, a gathering of academia and industry to present and discuss recent work on database management, theory, and experimentation. I learned a tremendous amount during the conference and wanted to cover some thoughts on the event itself, as well as some of the more notable research I saw during the sessions. Complete notes for all of the sessions are on GitHub, and the proceedings can all be downloaded from the conference site. Note that this is just one perspective on the conference; there were five parallel tracks for most sessions, so there is an abundance of content not covered here that is available in the proceedings.

As a first time attendee, there were a few things that stood out to me:

  • The database community is very lucky to be backed by huge commercial interests. Compared with the traditional science/engineering conferences I attended in grad school, which were largely funded by the attendees (usually via government money in the form of a grant or lab working budget), it’s a luxury to have sponsors with $B revenues.
  • The conference does an excellent job balancing academic and industry interests. Each keynote included one industry speaker and one academic speaker, and although there was occasional tension between the messages coming from the two groups, the atmosphere was very friendly and respectful.
  • It was encouraging to see women keynote & session speakers, women in review/organizing/leadership VLDB positions, and a new 2016 VLDB award specifically aimed at women doing inspiring work in databases. That being said, right now the VLDB conference does not have a Code of Conduct, something I would like to see them implement for the 2016 conference.

Day 1: In-Memory Database Workshop

The keynote for this workshop was Towards Hardware-Software Co-Design for Data Processing: A Plea and a Proposal, by Jignesh Patel. For those who haven’t been following database research closely, industry and academia are both doing incredible work to optimize algorithms and data structures to utilize every bit of hardware bandwidth and capacity. Jignesh’s keynote argued that database software has been playing catch-up to rapidly changing hardware designs, and that database researchers should be planning ahead and looking at algorithms that optimize for bandwidth rather than latency.

This message was emphasized the next morning in the keynote by Juan Loaiza from Oracle, who detailed the SPARC M7 processor that Oracle is designing specifically for their Exadata clusters. This tight system-level integration of hardware and software is one area where Oracle’s size and revenue can buy a performance edge on open-source databases running on commodity hardware. I have a feeling that there will be no ultimate winner, but it’s interesting to see them going down this route.

If you use Memcached in production and have any interest in how it manages memory, it’s worth your time to read Cost-based Memory Partitioning and Management in Memcached, by Carra et al. Interestingly, they have exposed the ability for the application to set a cost value for a given cache item. For example, you can set a key-value pair with a relative cost that Memcached will use to manage eviction:

set(k,v,c)

This gives the application developer an opportunity to improve cache performance, but also to degrade it, with Memcached likely to take the blame either way. It’s an interesting choice in API design.
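
To make the idea concrete, here is a toy, in-process illustration of the idea behind set(k, v, c). This is my own Python sketch, not a real Memcached client API: the cost parameter is hypothetical here, and eviction is simplified down to “drop the lowest-cost item when the cache is full.”

# Toy illustration of cost-aware caching in the spirit of set(k, v, c).
# Hypothetical sketch, not a real Memcached client API.

class CostAwareCache:
    def __init__(self, max_items=2):
        self.max_items = max_items
        self.store = {}  # key -> (value, cost)

    def set(self, key, value, cost):
        # Evict the cheapest item when the cache is full.
        if key not in self.store and len(self.store) >= self.max_items:
            cheapest = min(self.store, key=lambda k: self.store[k][1])
            del self.store[cheapest]
        self.store[key] = (value, cost)

    def get(self, key):
        entry = self.store.get(key)
        return entry[0] if entry else None

cache = CostAwareCache()
cache.set("homepage:html", "<html>...</html>", cost=1)    # cheap to recompute
cache.set("report:q3", "expensive aggregation", cost=50)  # costly to recompute
cache.set("profile:42", "{...}", cost=5)                  # evicts homepage:html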

The highlight of the day was the second keynote, Near Memory Databases and Other Lessons, from Ryan Betts from VoltDB. He discussed the trials of taking a piece of Stonebraker research (H-Store) from “Research to Code, and Code to Product”. He covered a lot of the technical lessons learned from building a fast OLTP in-memory database, as well as some of the commercial lessons trying to sell a database product.

What surprised me is that VoltDB itself is actually a really compelling product for streaming systems; this is not your standard OLTP database. VoltDB has built-in streaming consumption and replication for most of the popular messaging systems, as well as easily configurable stored procedures to aggregate and trigger events with full transactional semantics. Notably, VoltDB has fully serializable isolation.

Day 2: Gobblins and Document Indexing

During Day 2 I primarily focused on the industry sessions, as I was most interested in the lessons other companies learned while applying research to production data systems.

The LinkedIn team talked about Gobblin, their data ingestion framework for Hadoop. They presented some details of their system architecture, as well as some case studies on how they are using it, including:

  • Kafka Consumers to output time-partitioned sets of data
  • Database/JDBC Consumers to back up database snapshots to Hadoop
  • Data Filtering jobs to filter for sensitive data

What’s notable to me is how well the LinkedIn Data team designs composable, reusable architecture pieces for their data tools. It’s very easy to build single-purpose data pipelines that work, but it’s difficult to extend those pipelines for other purposes or meaningfully refactor them down the road. Below is an example of a data pipeline from the Gobblin paper:

Gobblin Data Pipeline Example. Source.

LinkedIn has built small units of task functionality for various jobs such as converting, quality checking, or publishing records in the pipeline. Each of these tasks can be updated or removed independently, and pipelines can be composed of these units.
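
As a rough sketch of what that composability buys you (my own Python toy, not Gobblin’s actual Java APIs), imagine a pipeline assembled from independent converter, quality-check, and publish steps:

# Toy sketch of a pipeline composed from small, independently replaceable
# task units (converter, quality checker, publisher). Not Gobblin's API --
# just an illustration of the composition pattern.

def convert(record):
    """Converter: parse a raw "key=value" record into a dict."""
    key, value = record.split("=", 1)
    return {key: value}

def quality_check(record):
    """Quality checker: drop records missing the required field."""
    return record if "user_id" in record else None

def run_pipeline(records, tasks, publish):
    for record in records:
        for task in tasks:
            record = task(record)
            if record is None:
                break  # record rejected by a task; skip publishing
        else:
            publish(record)

published = []
run_pipeline(["user_id=42", "page=home"], [convert, quality_check], published.append)
print(published)  # [{'user_id': '42'}]

Swapping the publisher from, say, HDFS to Kafka, or adding a new quality check, only touches one unit; the rest of the pipeline is unchanged.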

I know, I know: if you squint, this looks like Every Other Data Pipeline Tool Thing. However, I know I have been guilty of building the previously mentioned one-off data pipelines, so it’s nice to see a team building, open-sourcing, and writing about their pipeline tooling.

The other notable paper of the session was Microsoft’s Schema-Agnostic Indexing with Azure DocumentDB. The session and paper go into a tremendous amount of detail on how Microsoft has built a schema-agnostic database system at scale. It was an excellent talk and a really great (if dense) paper, covering indexing JSON documents via representation as trees, building a query engine on SQL and JavaScript, efficient indexing to support heavy read and write volume, and resource governance. It was one of the best and deepest presentations of the week, and it’s impressive that MS is already using this system for pretty large production workloads.

JSON Inverted Index for DocumentDB. Source.
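
The core trick is treating every document as a tree of paths and indexing path/value terms, so no schema is needed up front. Below is a heavily simplified sketch of that flattening step; it is my own illustration, not DocumentDB’s actual index structure, and it only handles nested objects, not arrays.

# Simplified sketch of schema-agnostic indexing: flatten each JSON document
# into (path, value) terms and build an inverted index over them.

from collections import defaultdict

def terms(doc, prefix=()):
    for key, value in doc.items():
        if isinstance(value, dict):
            yield from terms(value, prefix + (key,))
        else:
            yield (prefix + (key,), value)

index = defaultdict(set)  # (path, value) -> set of document ids

def add_document(doc_id, doc):
    for term in terms(doc):
        index[term].add(doc_id)

add_document(1, {"country": "Belgium", "exports": {"city": "Moscow"}})
add_document(2, {"country": "Greece", "capital": "Athens"})

# Which documents have exports.city == "Moscow"?
print(index[(("exports", "city"), "Moscow")])  # {1}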

Day 3: Coordination, Schemas, Gorillas, DataFlow

The opening keynote by Todd Waters covered much of the history of “big data” systems at Teradata. In 1979, “200MB weighed 30 lbs and took up the space of a washing machine”, and by the end of this year, Teradata will ship their first 100PB system. The most notable comment was that he agreed with Michael Stonebraker’s comments from the night before, echoing that the “#1 thing [for a database product] to be successful is to talk to users”.

The academic keynote by Magdalena Balazinska at the University of Washington had an important message: database researchers in academia can solve Real Big Data Problems by looking inward at some of the data workload challenges posed by scientists and researchers on-campus. She recommended using existing tools, breaking them in interesting workload-specific ways, and then contributing the fixes back to open-source. The best one-liner from her talk: “[Universities] don’t compete with industry for engineers — we provide them!”

The opening paper of the day was, IMO, one of the conference’s most important: Coordination Avoidance in Database Systems, by Peter Bailis et al. The short-and-dirty summary of the paper is as follows:

  • Serializable transactions are expensive. Coordination for read/write conflicts does maintain correctness, but at a very high cost.
  • A framework such as invariant confluence can be used to determine whether coordination can be safely avoided for some transactions while still maintaining correctness.
  • Avoiding coordination entirely for some transactions can result in very nice performance gains.

The above is a very succinct summary of the work; this paper was one of the must-read papers of the conference for me, and I highly recommend you at least skim it. The enduring message was that perhaps we’re a little too focused on the details of how to achieve coordination, and not on whether we need to do it in the first place.
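
A tiny illustration of the intuition (my own sketch, not code from the paper): take an account with the invariant that the balance stays non-negative. Deposits merge safely no matter how replicas interleave them, so they don’t need coordination; withdrawals can each pass a local check and still violate the invariant once merged.

# Toy illustration of invariant confluence: two replicas apply operations
# independently, then merge. Invariant: balance >= 0.

def merge(initial, deltas):
    return initial + sum(deltas)

initial_balance = 100

# Deposits only: any merge preserves the invariant, so no coordination needed.
deposits = [+30, +20]
assert merge(initial_balance, deposits) >= 0

# Withdrawals: each replica's local check passes (100 - 80 >= 0), but the
# merged state violates the invariant -- this is where coordination is needed.
withdrawals = [-80, -80]
print(merge(initial_balance, withdrawals))  # -60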

The session featured a second must-read paper, Schema Management for Document Stores. Let me know if this sounds familiar:

  • Your API is receiving JSON data from multiple client sources, you don’t have schema control over the said sources, and you need to get the data into a relational warehouse.
  • The schemas are changing pretty regularly, and you’re constantly chasing the new field values to figure out how to align them with columns.

This is largely why NoSQL databases, document stores, and nested datatypes exist. Unstructured data isn’t getting less popular anytime soon, so it’s really nice to see research focused on analyzing document schemas.

The paper focuses on efficiently identifying unique document schemas, storing them, and allowing them to be efficiently queried. Their tool supports querying both at the schema level (“I have a complete document and want to know if I’ve seen the schema before”) and at the attribute level (“I want to know if this attribute exists in my schema collection”).
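
A brute-force way to picture those two query modes (my own sketch; the paper’s contribution is doing this efficiently with purpose-built data structures):

# Brute-force sketch of the two query modes: "have I seen this exact schema
# before?" and "does this attribute exist anywhere in my schema collection?"

def schema_of(doc, prefix=""):
    """Reduce a document to a hashable schema: sorted (path, type) pairs."""
    items = []
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            items.extend(schema_of(value, path + "."))
        else:
            items.append((path, type(value).__name__))
    return tuple(sorted(items))

seen_schemas = set()
seen_attributes = set()

def register(doc):
    schema = schema_of(doc)
    seen_schemas.add(schema)
    seen_attributes.update(path for path, _ in schema)

register({"user": {"id": 1, "name": "ada"}, "ts": 1441700000})

# Schema-level query: have we seen this exact schema before?
print(schema_of({"user": {"id": 2, "name": "bob"}, "ts": 1}) in seen_schemas)  # True
# Attribute-level query: does this attribute exist in the collection?
print("user.name" in seen_attributes)  # True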

This research could be developed into really powerful tools for schema discovery and debugging. I’ve written a brute-force tool called malort to perform a similar task, mapping unstructured documents to relational column types, so I am extremely heartened to see this work. I’ll certainly be revisiting the paper to see if some of the data structures and algorithms should be built into the tool.

Must-read paper number three: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.

DataFlow Streaming with Fixed Windows. Source.

To put it simply, if you are building or operating a streaming platform, the concepts in this paper are well worth learning. The paper and presentation are both great, and in case the slides never make it online, the O’Reilly post by the author covers much of the same ground. The paper and article do an excellent job breaking down the terminology and patterns of streamed data, as well as Google’s models and abstractions for addressing these unbounded data streams.
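
To give a flavor of just one concept from the model: fixed windows group events by when they happened (event time) rather than when they arrived (processing time). The sketch below is my own simplification and leaves out the hard parts the paper actually addresses, namely watermarks, triggers, and late data.

# Minimal sketch of fixed event-time windows: out-of-order events are grouped
# by when they occurred, not when they arrived.

from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(event_time):
    return event_time - (event_time % WINDOW_SECONDS)

# (event_time, key, value) tuples, arriving out of order.
events = [(125, "clicks", 1), (61, "clicks", 1), (10, "clicks", 1)]

sums = defaultdict(int)
for event_time, key, value in events:
    sums[(window_start(event_time), key)] += value

print(dict(sums))  # {(120, 'clicks'): 1, (60, 'clicks'): 1, (0, 'clicks'): 1}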

One of the last talks of the day was the Facebook team talking about Gorilla: A Fast, Scalable, In-Memory Time Series Database. They are no strangers to this conference, having presented an excellent paper two years ago on Scuba, their in-memory analytics database. I’m always interested in seeing what the infrastructure teams at Facebook are building, as they’re dealing with data scales that most of us (thankfully) don’t ever have to think about. Facebook is the exception to the “should I build it from scratch” rule, where off-the-shelf open-source solutions don’t quite scale to their use cases.

For example, Gorilla is serving 2B timeseries, with an ingestion rate of 700M data points per minute. The database has a rolling 26-hour retention period, which is 16TB of data that has to be stored, replicated, and served at ~40k queries/s. They have to ask questions like “How will insert-language-here behave with heap sizes of 20–30 GB?” when building their systems, which (again, thankfully) is not something most of us have to worry about on a daily basis.

The paper does a nice job laying out the open source alternatives that Facebook considered, and why they moved forward with a new implementation. It also does an excellent job describing their compression algorithm, which results in a very efficient average ratio of 1.37 bytes/data point:

Facebook Gorilla Compression Algorithm Diagram. Source.
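
The compression combines delta-of-delta encoding of timestamps with XOR-based compression of values. Here is a heavily simplified sketch of just the timestamp half (my own, without the variable-length bit packing the paper uses): points arriving at a steady interval produce runs of zeros, which cost almost nothing to store.

# Simplified sketch of the delta-of-delta idea for timestamps (the paper
# additionally bit-packs these and XOR-compresses the float values).

def delta_of_deltas(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# A metric scraped every 60 seconds, with one point arriving a second late.
ts = [1441700000, 1441700060, 1441700120, 1441700181, 1441700241]
print(delta_of_deltas(ts))  # [0, 1, -1] -- mostly near zero, cheap to encode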

One last note on Gorilla: I appreciated the candor with which the team acknowledged the tradeoffs they made with the system. They describe this in the paper:

In our example, if host α crashes before successfully flushing its buffers to disk, that data will be lost. In practice, this happens very rarely, and only a few seconds of data is actually lost. We make this trade-off to accept a higher throughput of writes and to allow accepting more recent writes sooner after an outage.

Sometimes the loss of data is acceptable and is a reasonable tradeoff to make in favor of throughput or availability.

Day 4: Kafka, Graphs, E-Store

The Building a Replicated Logging System with Apache Kafka talk and paper do not cover much ground that users of Kafka aren’t already familiar with. However, there were a couple interesting pieces of information that I noted from the talk.

  • LinkedIn is using Kafka as the replication layer in EspressoDB.
  • The entirety of LinkedIn’s messaging infrastructure (which recently passed 1T messages/day) is managed by 5 data engineers and 2 ops engineers.

The next Facebook infra talk, One Trillion Edges: Graph Processing at Facebook-Scale, was interesting to me both in its scale (1T edges is a lot of edges), and that it is built entirely on an open-source Apache project, Giraph. It’s comforting to know that if you need a Graph Database, you probably won’t ever push it harder than Facebook needs to.

The last notable papers of the day both came out of Michael Stonebraker’s group at MIT. The first, Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores, gets back to our earlier discussion on transaction isolation and concurrency: as we move towards massive on-chip parallelism, our ability to manage transaction conflict begins to break down.

Write Throughput vs. Number of Cores. Source.

The paper does a great job covering all of the major concurrency control schemes used today, from two-phase locks to timestamp ordering; if you find yourself needing to brush up on how databases guarantee isolation and atomicity, this paper wouldn’t be a bad place to start.

It goes on to make some recommendations about scalable timestamp allocation. A couple thoughts here:

  • It is amazing to me that we’re talking about hardware-based single-cycle timestamp allocation optimization at this point.
  • Maybe it’s time to step back and look at the big picture a la Peter Bailis’s concurrency work covered above; perhaps there’s too much focus on the how, and not on the why when it comes to coordination.

Finally, E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems. This appears to be the Next Big Thing From The Stonebraker group, and author Rebecca Taft did an exceptional job presenting some very compelling work.

The core idea behind E-Store is as follows: OLTP workloads will vary tremendously based on a number of factors, including the time of day, season, or significant news events. At some point the “Bieber effect” was mentioned, wherein a small group of tuples becomes very “hot” for a short period of time. The goal of E-Store is to elastically balance hot and cold ranges of data, with dedicated nodes for hot tuples and cold storage for data accessed less often. Highly skewed data can be a big problem, with ~10x latency increase and ~4x throughput decrease:

Throughput and Latency for Different Skew. Source.

Actually building this type of elastically balanced system requires some very interesting engineering choices. The paper breaks the problem into three questions:

  1. How do we identify load imbalance requiring data migration?
  2. How do we choose which data to move and where to place it?
  3. How do we physically migrate data between partitions?

For those interested in the answers, I highly recommend reading the paper! It’s a nice balance of classic database and distributed-systems work, with an interesting approach for optimally assigning tuples to partitions, as well as cluster monitoring to detect imbalances. The MIT database group is a brilliant bunch, and it’s great to see the evolution of their projects.
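
As a rough flavor of the first question only (my own toy sketch; E-Store’s actual monitoring and planning components are far more sophisticated), you can think of it as flagging the tuples whose access counts dominate a monitoring window so they can be given dedicated capacity:

# Toy sketch of hot-tuple detection: flag the most-accessed tuple ids in a
# monitoring window so the planner can give them dedicated capacity, leaving
# cold ranges where they are.

from collections import Counter

def find_hot_tuples(accesses, top_k=2):
    return [tuple_id for tuple_id, _ in Counter(accesses).most_common(top_k)]

# Access log for one window: tuple 42 is suddenly popular.
window = [42, 42, 42, 42, 7, 42, 42, 13, 42, 7]
print(find_hot_tuples(window))  # [42, 7]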

Must-Read Papers

Coordination Avoidance in Database Systems

E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Schema Management for Document Stores

Schema-Agnostic Indexing with Azure DocumentDB

Should You Attend VLDB?

I was the odd attendee who was not researching or building databases as part of my day job, but this was one of the richest learning experiences I’ve had as a Data Engineer.

I highly recommend VLDB for other data and distributed-systems engineers who want to stay current with the state-of-the-art in database research and technology; the industry and research sessions I attended were both excellent. Thanks to the organizers for curating and organizing such a great edition this year, and I hope to see everyone in 2016.

Thanks to Katherine Fellows and Trey Causey for reading drafts of this article.
