Why Is Ruby Multithreading Slow?

I decided to play around with multithreading in Ruby. So I wrote the following basic script:

I opened irb and ran load 't.rb' and got some expected output:

Thread 1: 0
Thread 1: 1
Thread 1: 2
Thread 1: 3
...
Thread 2: 23
Thread 2: 24
Thread 2: 25
Thread 1: 22
Thread 2: 26Thread 1: 27
Thread 2: 29
Thread 1: 28
Thread 1: 31
Thread 1: 32

Since this is running inside IRB, and I didn’t put any exit condition on the thread blocks, I had to use Ctrl+d to terminate the script, which also terminated IRB.

Having some things printed on the same lines didn’t surprise me — that’s a pretty common race condition. A perfect opportunity for a Monitor! I updated the script to use some simple synchronization around those puts calls:

And I decided to run it with ruby 't.rb' instead of from within irb. Of course, I didn’t get what I expected — I got no output at all, and my program exited immediately. It didn’t take me long to figure out the problem — the main thread exiting before the other threads ever got the processor — or to fix it by adding [producer, printer1, printer2].each { |t| t.join } to the end of the script. Now when I ran ruby t.rb the output made a lot more sense:

Thread 2: 0
Thread 2: 1
Thread 2: 2
Thread 2: 3
Thread 2: 4
Thread 1: 5
Thread 2: 6
Thread 2: 8
Thread 2: 9
Thread 2: 10
Thread 1: 7
Thread 1: 12
Thread 2: 11
Thread 2: 14

But what I didn’t understand initially is why on earth it was running so slowly. If you run the above script using time ruby t.rb and hit Ctrl+c when it gets to about 30, you’ll probably see something similar to what I did: ruby t.rb 3.03s user 0.32s system 99% cpu 3.369 total which just seemed ridiculous to me. Why is it so damn slow?

I initially thought the printer threads were exhausting the producer thread and spending a lot of time waiting, so I tried biasing toward the producer thread in a really naive way:

I should have realized that this wouldn’t do what I wanted — if anything, adding the extra loop will take more CPU and cause the producer to produce *fewer* integers per timeslice. But in my naïvete, I expected my app to get to about 5 times faster. But if anything, it got a little slower: ruby t.rb 4.63s user 0.23s system 99% cpu 4.886 total

So I decided to try to debug by printing the number from the producer thread — and just to make things pretty, use the same monitor that the printers used:

This time when I ran ruby t.rb, it went MUCH faster. Before I could react, we’d passed 30,000. So I hit ctrl+c and ran it again, but this time hitting ctrl+c much faster. When I looked at the output, it looked more or less similar to what I expected:

Thread 2: 2851
Thread 2: 2852
Thread 2: 2853
Thread 2: 2854
Thread 1: 2850
Thread 1: 2856
Thread 1: 2857
Thread 1: 2858
2876
2877
2878
2879
2880
2881
2882
2883
2884
2885
2886
2887
Thread 1: 2859
Thread 1: 2860
Thread 1: 2861
Thread 1: 2862
Thread 1: 2863
Thread 2: 2855
Thread 2: 2865
Thread 2: 2866
Thread 2: 2867
Thread 2: 2868
Thread 2: 2869
2888
2889
Thread 1: 2864
Thread 1: 2871

The producer outpaces the printers by just a little bit. That’s not surprising, since the printers are effectively serialized, so behave similarly to a single thread (but slightly slower because locking and context switching adds overhead).

At this point, I got tired of running and hitting ctrl+c all the time, so I decided to make my script terminate a little more quickly:

No surprise, it terminated quickly: ruby t.rb 0.11s user 0.17s system 72% cpu 0.385 total

I decided that running quickly was less interesting than running slowly, so I removed the synchronized print from the producer to make sure it still ran slowly. This time, though, it ran very fast:

ruby t7.rb 0.06s user 0.05s system 51% cpu 0.203 total

Other than the early termination of the producer, this script is the same as the earlier one that ran slowly. At this point I started to get a hunch: the performance has to do with the number of cycles the producer does. It’s not the case that it was underperforming — it actually outpaced the printers. Maybe it was overperforming, and the synchronization just brought it down to speed? So I made a little change to test:

It ran much faster than originally, but the size of the queue kept climbing! As an example:

Thread 2: 40244
Thread 2: 40245
Thread 2: 40246
16642
Thread 2: 40247
Thread 2: 40248
Thread 2: 40249
16640
Thread 2: 40250
Thread 2: 40251
16638

When the printers were spitting out the forty-thousandth element, the queue already had sixteen thousand MORE items waiting! And that’s with the addition of a puts in the producer, which would definitely cause periodic context switches while waiting on stdout to flush.

I was pretty convinced I knew the answer at this point, but wanted to test further. To show that the queue kept climbing in size, I reduced the IO overhead of the producer by a very simple expedient: change puts q.size to puts q.size if q.size % 1000 == 0 . And the results are astounding:

Thread 2: 29149
Thread 2: 29150
Thread 2: 29151
Thread 2: 29152
4370000
4370000
Thread 2: 29153
4371000
Thread 2: 29154
4371000

That’s right — when the printers had done 29K iterations, the producer had already enqueued 4.3 million additional items.

So we need a way to throttle the producer. Thinking about what I know about producer/consumer implementations, I realize this makes perfect sense: any sane producer/consumer implementation enforces a finite upper bound on the work pool size. So after a quick Google search for “ruby queue max size”, I find the SizedQueue class, and all is solved:

And the output looks sane :

100
Thread 1: 99801
100
Thread 2: 99802
100
Thread 1: 99803
100
Thread 2: 99804
100
Thread 1: 99805
100
Thread 2: 99806
100
Thread 2: 99808
99
Thread 1: 99807
100
Thread 2: 99809
99
Thread 1: 99810
100
Thread 2: 99811
100
...
Thread 1: 99989
10
Thread 1: 99990
9
Thread 1: 99991
8
Thread 1: 99992
7
Thread 1: 99993
6
Thread 2: 99988
5
Thread 2: 99995
4
Thread 2: 99996
3
Thread 2: 99997
2
Thread 2: 99998
1
Thread 1: 99994
0
Thread 2: 99999
0

And it took about 3.5 seconds.

So this definitely resolved the issue, but it’s not obvious WHY it happened in the first place. To understand that, we have to dive underneath the hood a little bit and talk about the Ruby Global VM Lock, or GVL (sometimes called, by people familiar with Python, the Global Interpreter Lock, or GIL). The GVL is a characteristic of the CRuby runtime that prevents any two threads from running simultaneously, even in separate cores. Instead, each thread is given a CPU quantum and gets it exclusively for that timeslice before getting preempted and the CPU getting handed to another thread. (In that regard it’s very similar to multithreading on a single core processor.) Any Ruby thread that blocks on IO or voluntarily goes to sleep is preempted, and the next waiting thread gets the CPU.

What that means for our original t.rb script is is that the producer, because it never put itself to sleep or blocked on IO, got to run at full throttle, while the printers voluntarily relinquished the vast majority of their timeslice by blocking on IO (or a mutex). When we added printing to the producer, it also began relinquishing its timeslice, so the printers ran much more quickly because they started receiving a fairer proportion of the CPU. By switching to the SizedQueue, we forced the same behavior by having the producer block whenever the queue filled up.

Just to validate that this was, in fact, the explanation, I installed JRuby and ran the initial t.rb within IRB. As expected, it appeared to run very fast. Why is that? Because JRuby lacks a GVL — it’s a Java JVM-based Ruby runtime, and has full support for actual concurrency. So on my multi-core machine, all three threads can have the CPU at the same time — so when one printer releases the mutex, the other thread can wake up almost instantly, instead of having to wait on the producer to finish its timeslice before getting the processor.

So in the end it turns out Ruby multithreading isn’t slow — I just had some learning to do.