The Coordinated Omission Problem in Benchmark Tools

siddontang
May 26, 2018


I read the article Your Load Generator Is Probably Lying To You — Take The Red Pill And Find Out Why by Gil Tene a long time ago. In it, Gil pointed out that many benchmark tools suffer from a coordinated omission problem, but I didn’t pay much attention to it at the time. Recently, someone discussed this problem with me here, and I realized that our own benchmark tool, go-ycsb, has the same problem.

What is coordinated omission

Let’s first talk about what coordinated omission is. Most benchmark tools follow a similar flow:

  1. Start multiple threads.
  2. In each thread, send a request, wait for the response, then send the next request.
  3. Record the latency as response time - request time.

This flow seems sensible, but it actually has a problem.
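To make this concrete, here is a minimal sketch of that flow in Go (doRequest is just a stand-in for a real client call, not go-ycsb code); note that it only records the time each request itself takes:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// doRequest stands in for a real client call (HTTP, SQL, KV, ...).
func doRequest() {
	time.Sleep(50 * time.Microsecond)
}

func main() {
	const threads = 4
	const opsPerThread = 1000

	var (
		mu        sync.Mutex
		latencies []time.Duration
		wg        sync.WaitGroup
	)

	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < opsPerThread; j++ {
				start := time.Now()
				doRequest() // send a request and wait for the response
				elapsed := time.Since(start)

				mu.Lock()
				latencies = append(latencies, elapsed) // only the service time is recorded
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	fmt.Printf("recorded %d latency samples\n", len(latencies))
}
```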

For example, suppose I go to KFC to buy fried chicken (this is not an advertisement, I just like KFC) and end up at the back of a line with three people in front of me. The first two each take 30 seconds to buy their food, but the third one takes nearly 300 seconds. When it is finally my turn, I also take 30 seconds. So for me, the total time to buy the fried chicken is 390 seconds (2 x 30 + 300 + 30), not just 30 seconds: 30 seconds is my service time, and 360 seconds is my waiting time. Maybe you have already noticed the problem: most benchmark tools use only the service time to represent the latency and leave out the waiting time.

Here is another example. Assume we want our benchmark tool to send requests at a rate of 10 ops/sec, so we need to send a request every 100 milliseconds. The first 9 requests take only 50 microseconds each, but the 10th one takes 1 second, and the following ones take 50 microseconds again. Obviously, something went wrong on the server around the 10th request, but during that whole second we sent only one request, while on schedule we should have sent about ten. The stall is therefore badly underrepresented in the recorded samples.
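Just to quantify this example: with a 100 ms schedule, roughly nine more requests should have gone out while the server was stuck for that one second, and each of them would have waited behind the stalled one. A rough back-fill of those missing samples (this is my own illustration, similar in spirit to what HdrHistogram's corrected recording does, not code from any of the tools mentioned here) might look like this:

```go
package main

import (
	"fmt"
	"time"
)

// correctedSamples back-fills synthetic latency samples for the requests that
// should have been sent (one every expectedInterval) while a single request
// was stalled for `observed`.
func correctedSamples(observed, expectedInterval time.Duration) []time.Duration {
	samples := []time.Duration{observed}
	for missed := observed - expectedInterval; missed > 0; missed -= expectedInterval {
		samples = append(samples, missed)
	}
	return samples
}

func main() {
	// The 10th request takes 1 s while the target rate is one request per 100 ms.
	for _, d := range correctedSamples(time.Second, 100*time.Millisecond) {
		fmt.Println(d)
	}
	// Prints 1s, 900ms, 800ms, ..., 100ms instead of the single 1 s sample a
	// closed-loop tool would record.
}
```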

YCSB

For the first example, to get a more accurate latency, YCSB introduces the concept of an intended time that accounts for the waiting time. In the throttle function, it uses a thread-local variable to record the intended start time of each operation.

Then, for every operation, it uses that intended time to calculate the latency including the waiting time.
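YCSB's actual implementation is in Java; here is a minimal sketch of the same idea in Go (the names and the toy doRequest below are mine, not YCSB's). The deadline is the intended start time of each operation, and the latency measured from it includes any waiting time:

```go
package main

import (
	"fmt"
	"time"
)

// doRequest stands in for a real database operation; the 6th one is slow.
func doRequest(i int) {
	if i == 5 {
		time.Sleep(300 * time.Millisecond)
		return
	}
	time.Sleep(time.Millisecond)
}

func main() {
	const interval = 100 * time.Millisecond // target rate: 10 ops/sec
	deadline := time.Now()                  // intended start time of the next operation

	for i := 0; i < 10; i++ {
		if now := time.Now(); now.Before(deadline) {
			time.Sleep(deadline.Sub(now)) // throttle: we are ahead of schedule
		}
		// If the previous request was slow, we are behind schedule: we do not
		// sleep, and the delay shows up as waiting time below.

		realStart := time.Now()
		doRequest(i)
		end := time.Now()

		service := end.Sub(realStart) // what most tools report
		intended := end.Sub(deadline) // service time + waiting time
		fmt.Printf("op %d: service=%v intended=%v\n", i, service, intended)

		deadline = deadline.Add(interval)
	}
}
```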

Note that the intended time only takes effect if you set a target rate in YCSB. By the way, when I first browsed the YCSB source code, I was quite confused by the intended time and didn’t know what it was for. Now, after revisiting coordinated omission, I finally understand it.

But YCSB still doesn’t solve the second problem above. If the server hangs, the client must wait for the response to the previous request before it can send the next one, because requests are sent synchronously.

I found a paper, Coordinated Omission in NoSQL Database Benchmarking, which addresses this problem. In it, the authors use an asynchronous approach based on Futures: they create Futures at the target frequency, so a previous Future does not block the next one even if the server hangs.

Go YCSB

Now let’s talk about go-ycsb: how can we solve the coordinated omission problem? The only way I can think of is to take advantage of goroutines: create a goroutine at the target frequency, and in each goroutine send one request, wait for the response, and record the latency. Of course, we also need to account for the latency of the Go scheduler. Another problem is that if the server hangs, we may create lots of goroutines, which may eventually cause an OOM, but this is expected.
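A rough sketch of what that could look like (this is not implemented in go-ycsb yet; doRequest and the fixed request count are only placeholders):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// doRequest stands in for a real go-ycsb operation.
func doRequest() {
	time.Sleep(2 * time.Millisecond)
}

func main() {
	const target = 100 // ops/sec
	interval := time.Second / time.Duration(target)

	var (
		mu        sync.Mutex
		latencies []time.Duration
		wg        sync.WaitGroup
	)

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for i := 0; i < 1000; i++ {
		intended := <-ticker.C // fires on schedule, regardless of previous responses

		wg.Add(1)
		// Note: if the server hangs, these goroutines pile up without bound,
		// which may eventually cause an OOM, as mentioned above.
		go func(intended time.Time) {
			defer wg.Done()
			doRequest()                     // a slow response only blocks this goroutine
			latency := time.Since(intended) // includes waiting (scheduling) time

			mu.Lock()
			latencies = append(latencies, latency)
			mu.Unlock()
		}(intended)
	}
	wg.Wait()

	fmt.Printf("recorded %d latency samples\n", len(latencies))
}
```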

This is just my thought; I haven’t started implementing it yet. If you are interested, you can send me a PR or email me (siddontang@gmail.com) for a deeper discussion.

Written by siddontang

VP of Engineering / Chief Architect at PingCAP. Author of TiDB, TiKV, Chaos Mesh, etc. Contact me: https://www.linkedin.com/in/siddontang/