CSR Tale #8: Why Terrible Reviews Are Good For You!
CSR Tale #8 comes from Prof. Barzan Mozafari at Michigan. Barzan leads a research group designing the next generation of scalable databases using advanced statistical models. Recently, their VATS lock scheduling algorithm was adopted by MySQL and MariaDB, so I chatted with Barzan about how the VATS work came to be. I also chatted with one of their industrial collaborators, Sunny Bains from Oracle, to get the industry side of this story. Overall, it makes for a great CSR Tale! First, let's hear from Barzan:
It was the summer of 2013. I had just started as a tenure-track professor at the University of Michigan, and I was trying to figure out what to work on. During my postdoc at MIT, I had contributed to a number of different projects, from analytics (e.g., BlinkDB) to transactions and cloud computing (e.g., DBSeer) to crowdsourcing. Out of these, DBSeer was by far the hardest challenge I had ever taken on. The goal with DBSeer was to predict the performance of a database system (in this case, MySQL) given a workload. Over nearly two years I tried every machine learning algorithm in the textbook. While we had some success in predicting resource usage (e.g., disk IO or CPU), our results were mediocre when it came to predicting the actual latency of a transaction.
There were two fundamental reasons for this. First, we had decided that we were only interested in non-intrusive solutions. For example, we were going to look only at MySQL’s global status variables. That meant we didn’t have access to some critical statistics, simply because MySQL didn’t expose them. (Andy Pavlo’s Peloton project is a great way of addressing this first problem, since it has access to the database internals.) But at the time, our rationale was that if we could make this work, then adopting DBSeer would become a no-brainer. The second reason was much more straightforward: MySQL just wasn’t a predictable system to begin with! You could submit the exact same query under the exact same conditions multiple times, and you’d still get wildly different latencies each time! With this much noise in the signal, there just isn’t much machine learning can do. It’s as simple as that. And I should point out that this wasn’t just MySQL: every other DBMS that we looked at was just as unpredictable. “Our ancestors, when they were designing these legacy systems, were too focused on making them faster, and as a result didn’t have the luxury of worrying about their predictability.” Or at least that’s how it seemed.
So there I was, having just started at Michigan, and I realized there was a great opportunity to change this once and for all: let’s make database systems more predictable, not just faster. I had recently recruited Jiamin Huang, an impressive student with unbelievable systems skills. We set out to decipher what might be causing unpredictability. Our first suspect was L2 cache misses. We teamed up with one of my hardware colleagues (Tom Wenisch) to investigate this and a number of other potential causes, but it quickly became clear that solving this manually was impossible, given how complex the MySQL codebase was. We soon found ourselves creating our own profiler which, unlike DTrace and other profilers, was specifically designed to find the root causes of performance variance in a large and complex codebase. It turned out that our profiler wasn’t specific to databases; we ended up turning it into VProfiler and later published it at EuroSys 2017. It came down to looking at millions of lines of code and then informing the developer of the few critical functions to modify in order to make the application predictable.
The database part of our findings has a much more interesting story. VProfiler revealed a bunch of culprits behind unpredictability. For example, one of our findings was that in MySQL “queueing” behind shared resources was the main cause of latency variance. In hindsight, this is very intuitive: when N transactions are waiting in a queue for the same shared resource, the one at the head of the queue is going to experience a very different wait time than the one at the end of the queue, and both are going to be very different from the one in the middle of the queue. This is why they end up exhibiting very different latencies for the same amount of work.
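The intuition above can be sketched with a toy model (my own illustration, not code from the paper): if each transaction holds a lock for the same service time, a FIFO waiter's latency is determined almost entirely by its queue position, so identical work yields a wide latency spread.

```python
import statistics

# Toy model: n transactions queue FIFO behind one lock; each holds it
# for `service` ms. The k-th waiter's latency is its wait time plus
# its own service time -- identical work, very different latencies.
def fifo_latencies(n, service=1.0):
    return [k * service + service for k in range(n)]

lats = fifo_latencies(10)
print(lats)                     # latencies range from 1.0 to 10.0
print(statistics.pstdev(lats))  # large spread purely from queue position
```

The spread (and hence the variance) grows with queue length, which is why contention is such a dominant source of unpredictability.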
Long story short, we shifted our focus toward how queues are managed and, shockingly, we realized that every single database in the world was still using some variant of FIFO (first-in, first-out). We came up with a new algorithm, called VATS (Variance-Aware Transaction Scheduling), to reduce variance and published it in our SIGMOD 2017 paper (“A Top-Down Approach to Achieving Performance Predictability in Database Systems”). One of the great pieces of insight this offered was that predictability does not have to come at the price of average performance. In other words, there is a way to make systems more predictable while also making them faster: by reducing contention-based variance.
Later, we formulated the general problem of lock scheduling. We explored a different, novel algorithm that was optimal for reducing mean latency (and increasing throughput) and published it in VLDB 2018 (“Contention-aware lock scheduling for transactional databases”). We called this new algorithm VATS 3.0 (we never published a VATS 2.0), which we later renamed CATS (Contention-Aware Transaction Scheduling).
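The core idea of contention-aware scheduling can be sketched as follows. This is a hypothetical simplification of my own, not the InnoDB implementation: when a lock is released, instead of granting it to the oldest waiter (FIFO), grant it to the waiter that the most other transactions are blocked behind, since unblocking it helps the most transactions make progress.

```python
# Hypothetical sketch of contention-aware lock scheduling (in the
# spirit of CATS); names and data structures are illustrative only.

def pick_next_fifo(waiters):
    # FIFO: grant the lock to whoever has waited longest.
    return waiters[0]

def pick_next_contention_aware(waiters, blocked_behind):
    # blocked_behind[t] = number of transactions waiting on locks
    # currently held by transaction t. Grant the lock to the waiter
    # whose progress unblocks the most other transactions.
    return max(waiters, key=lambda t: blocked_behind[t])

waiters = ["t1", "t2", "t3"]          # t1 arrived first
blocked = {"t1": 0, "t2": 5, "t3": 1}  # t2 holds locks 5 others need
print(pick_next_fifo(waiters))                       # t1
print(pick_next_contention_aware(waiters, blocked))  # t2
```

The real algorithm must also track the dependency information efficiently and handle lock compatibility, but the scheduling decision itself boils down to this kind of prioritization.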
From an industry-adoption perspective, things went pretty smoothly: both MySQL and MariaDB ran their own benchmarks with our new algorithm and decided to adopt it in their standalone versions (MySQL even made it their default strategy). We’re grateful to all of our open-source collaborators in the MySQL community, including invaluable feedback and help from Jan Lindstrom (MariaDB), Sunny Bains and Dimitri Kravtchuk (Oracle), Laurynas Biveinis and Alexey Stroganov (Percona), Mark Callaghan (Facebook), and Daniel Black (IBM).
From an academic standpoint, however, things didn’t go quite as smoothly. Here is a timeline of what happened:
Disclaimer: Some of the conference names/years might have been different
- SIGMOD’16: We submitted our MySQL results.
- Rejection: “How do we know if it works for anything other than MySQL?”
Commentary: Interestingly enough, you only receive comments such as this one when you apply your idea to a real system. If you simply prototype your idea from scratch and create a mock-up database, typically nobody will ask if it applies to other real systems! Undeterred by these comments, my student decided to prove the reviewer wrong…
- VLDB’16: We applied VProfiler to Postgres and VoltDB too.
- Rejection: “If this problem was important enough, someone else would have done it by now!”
Commentary: To this day, this is still one of my favorite reviews! Receiving an unfair review is never fun, but this one was so ridiculous that it didn’t bother us — in fact, we found it quite amusing. Though we wish we’d had the opportunity to ask that reviewer two questions:
1) Does the reviewer measure his own research against the same bar? (We know it was a “he”; see Lesson 5.)
2) How could we ever achieve advancement in research by applying this principle? If something has already been done, then it is not new and not publishable; and if it has not been done, then apparently it is just not important enough, and there must be no point in publishing it!
- OSDI’16: Nevertheless, my student decided once again to prove the reviewer wrong and show that the problem is real. So we sent our algorithms and results to both MariaDB and MySQL. VATS was picked up by MySQL as its default scheduling algorithm, and we mentioned this in the paper.
- Rejection: “You have applied VProfiler to MySQL, Postgres, and VoltDB. How do we know if it works for anything other than a DBMS?”
Commentary: This was a fair concern. After all, OSDI is an OS conference. We were actually very pleased with the quality of the reviews we got. As a DB researcher, I always envy the OS community for their significantly better review system.
- SIGMOD’17: We submitted VATS and other database findings.
- Accept! Finally!
- EuroSys’17: We generalized our profiler, which became VProfiler, and applied it to the Apache Web Server in addition to database systems.
- Accept! Yay!
- VLDB’18: Finally, once we managed to formalize the lock scheduling problem, we were able to solve for performance as well (not just predictability). This became the CATS algorithm.
- Accept. We actually got great reviews from VLDB’18. We also received exceptional feedback from Peter Bailis after the paper was published.
So here are some lessons from this long post:
Lesson 1. Terrible reviews are actually good for you. They push you (and your student!) to do more work, which is not a bad outcome!
Lesson 2. Don’t trust what people say on panels. Even though everyone in academia will go on record about valuing real-world impact and real-world validation, some of them are sometimes subconsciously lying. When they wear their reviewer hat, they often penalize papers that use a real system for evaluation, e.g., by asking for a billion other things that would only be possible with a prototype or simulation. I don’t mean you should give up on using real systems in your experiments; just be prepared to do more work!
Lesson 3. Academia and industry have different wavelengths. For those of us in academia, three months feels like an eternity. Product teams, however, run on their own timeline — sometimes it may take up to a year before they have any cycles for you. You just need to be patient and appreciate the great work they do in maintaining a high-quality open-source ecosystem.
Lesson 4. Not every reviewer is a mathematician. In one of our earlier submissions, we were reporting the percentage of total variance that we had eliminated, which was around 65%. Obviously you can’t eliminate anything by more than 100%. Unfortunately, that’s just a limitation of the laws of mathematics. But one of our reviewers (I think SIGMOD’16 or VLDB’16) was of the opinion that a 65% reduction is not significant enough. So in our next submission, we switched to reporting the improvement ratio, defined as the variance of the original MySQL divided by the variance of our modified version. That same 65% reduction was now reported as a 3x reduction of variance, and the reviewers (though probably different people) were happier this time.
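The two framings in Lesson 4 are arithmetically equivalent; a one-line conversion (my own illustration) makes this concrete:

```python
# A p-fraction reduction in variance leaves (1 - p) of the original,
# so the improvement ratio is 1 / (1 - p). A 65% reduction is the
# same measurement as a ~2.9x (i.e. "almost 3x") reduction in variance.
def reduction_to_ratio(pct_reduction):
    return 1.0 / (1.0 - pct_reduction)

print(reduction_to_ratio(0.65))  # ~2.86
```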
Lesson 5. Be nice, or be anonymous. This applies to our favorite reviewer: If you’re going to write a nasty review, it’s probably not a good idea to ask the authors to cite three of your own papers. It doesn’t do much for your anonymity ☺
Lesson 6. Working on real systems is a pain, but it’s totally worth it. If you’re willing to put up with poor reviews and significantly longer time-to-paper, working on real systems is an extremely rewarding experience!
Let’s switch to the industrial perspective on this work. I chatted with Sunny Bains from Oracle, who was instrumental in getting VATS adopted into MySQL/InnoDB. I present my conversation with him as a Q&A.
CSRTales: How did you first learn about the CATS work?
Sunny: IIRC Barzan and Jiamin published a paper with benchmarks and that was forwarded to me by somebody in our organization. They then contributed a patch which is what got me interested. Locking always needs some optimization or the other and at that time we were looking into optimizing the InnoDB lock manager. Therefore it was very timely. CATS as it was called then seemed a promising candidate to look at. At that stage we were too narrowly focused on uniform distribution and didn’t see any gains. Also, the focus shifted to fixing other things inside InnoDB. Once we had some good solutions for other issues around the transaction management inside InnoDB the locking issues came to the forefront again. The InnoDB team started to look at CATS again and coincidentally Mark Callaghan from Facebook sent me an email the very next day introducing me to Barzan and Jiamin. Closer collaboration started after that and with direct communication and their help it was all downhill from there.
CSRTales: What did you like most about the CATS idea when you heard about it?
Sunny: It addressed a real problem and the content itself is very general and I think it has applications beyond the lock manager. Equally important from a practical perspective is that it came with a proof of concept. A patch that we could test right away. There are lots of good ideas floating around. To have something concrete to test is a HUGE plus. It saves us a lot of time and resources. I also think this is one of the advantages of open-source software. This is a very good example of that.
CSRTales: How long did it take from first hearing about the work to getting it merged?
Sunny: It took about a year I think. From when we decided to commit to it to the time it actually got pushed took six months. Our QA process is very rigorous and I can’t thank our QA enough. There were some bugs in the original patch. We decided to rewrite it too. We also wanted to reduce the number of configuration variables, as a matter of policy. There were some issues with gap locks that had to be fixed in the patch.
CSRTales: Typically, in open-source communities such as the Linux kernel, you need to have credibility already built up before your patches are accepted. I’m guessing something similar happened here?
Sunny: Of course. We get quite a few patches/contributions. But I would like to stress that it is not a prerequisite. We are open to accepting patches that work out of the box and have some documentation that demonstrates the problem and how the patches fix it. It is our genuine desire to encourage researchers, users, developers — anybody with an interest in this field — to leverage the advantage of open source and send us their patches. I would like to stress: talk is cheap. Show me the code :-)
CSRTales: Is it common practice to rewrite contributed patches?
Sunny: Yes, once we decide to commit to some work then we prefer that it is done entirely within the team. There are practical reasons for this. The QA process is quite intense and there is a whole infrastructure around code reviews and issue tracking that outsiders will not have access to. Also, the person that commits the code has to take ownership for later bug fixes etc. It is mostly due to practicalities that we don’t get original authors to do the work as it progresses. We (actually I did) rewrote the patch. There was a constant email exchange between Jiamin, Dimitri, Barzan, and myself discussing the issues. Dimitri is our performance architect. Their input was very helpful in ensuring that we all had the same mental model around their ideas.
I would like to point out several interesting things in this CSRTale. First, this is a great example of how to get industry adoption. Barzan and his group contributed a patch that could be directly tested. This is crucial. Second, Barzan and his group spent a large amount of time discussing their idea with Sunny to ensure they were on the same page. Tech transfer takes a lot of time, above and beyond the time spent publishing the paper. If you are looking for impact, this is a great path, but be aware of the time cost. Third, review quality is not very consistent in the broad systems community. I have seen reviews similar to the ones Barzan received: my take is that certain reviewers (not all, thankfully) first make up their mind, and then try to find any sort of excuse to reject the paper. So sometimes reviews are very noisy signals: you might think that if you fixed what the reviewer was asking for, you would get accepted, but that is often not the case. There might be other weaknesses in your paper that you might better spend your time fixing. For example, if the ideas in your paper didn’t come across clearly, a reviewer might not like your paper and complain about the evaluation. Fixing the evaluation (without fixing the writing) would not improve your chances of acceptance. Talk to your advisors about this :)