Optimizing ruby-protobuf deserialization by 50%

(This is a repost on Medium of something I wrote at the end of 2015 for Square)

I’m on a team colloquially called “WebScale” at Square, which deals with extracting legacy payments data from our monorail. One of our recent tasks has been the rollout of a new internal service to search for payments on Square. We mostly use protobuf as our data transport; a single payment proto is about 1,500–2,000 bytes when serialized, and a request can return up to 500 records.

Once rollout of the new service started, requests from a Ruby MRI 2.2.3 client went from 250ms to 1,600ms on a few endpoints. We already have metrics around time spent in RPC requests, so we took a guess that protobuf deserialization was the culprit.

To verify our guess, we used ruby-prof to profile code in Ruby (in this case, the protobuf gem). RubyProf is a helpful tool for profiling MRI code, and here’s how we used it to find hotspots.

Our test case:

require 'protobuf'
require 'ruby-prof'

# Just so we don't get spammed when testing
Protobuf.print_deprecation_warnings = false

# Construct a proto that mimics the size of our production responses
encoded = RecordResponse.new(
  payment: 500.times.map do |i|
    Record.new( ...test data... )
  end
).encode

result = RubyProf.profile do
  RecordResponse.decode(encoded)
end

printer = RubyProf::FlatPrinter.new(result)
printer.print(
  STDOUT,
  # Only want to focus on hotspots right now
  min_percent: 1
)

RubyProf supports various ways of printing a report. I picked FlatPrinter by using the scientific method of it appearing “first on the list”. I’m only showing code that took more than 1% of the time for the sake of brevity.

Measure Mode: wall_time
Thread ID: 70093065552380
Fiber ID: 70093084071700
Total: 4.599523
Sort by: self_time
%self      total      self      wait     child     calls  name
16.25      0.748     0.748     0.000     0.000   2031000  Protobuf::Enum#to_i
15.35      1.454     0.706     0.000     0.748     30000  Array#select
 7.30      0.745     0.336     0.000     0.410    140000  <Class::Protobuf::Decoder>#read_varint
 3.44      0.176     0.158     0.000     0.018     94000  Kernel#respond_to?
 3.32      1.156     0.152     0.000     1.003     72000  <Class::Protobuf::Decoder>#read_field
 2.44      0.508     0.112     0.000     0.396     72000  <Class::Protobuf::Decoder>#read_key
 2.26      0.298     0.104     0.000     0.194     72000  Protobuf::Message::Fields::ClassMethods#get_field
 2.00      0.159     0.092     0.000     0.067    166500  IO::generic_readable#readbyte
 1.76      0.124     0.081     0.000     0.043    171500  Numeric#nonzero?
 1.66      0.076     0.076     0.000     0.000    410500  Fixnum#&
 1.58      0.073     0.073     0.000     0.000    144000  Protobuf::Field::BaseField#repeated?
 1.48      0.068     0.068     0.000     0.000     69000  Protobuf::Field::BaseField#setter
 1.46      0.067     0.067     0.000     0.000    166500  StringIO#getbyte
 1.19      0.055     0.055     0.000     0.000      1500  Kernel#caller
 1.17      0.054     0.054     0.000     0.000     72000  Protobuf::Message::Fields::ClassMethods#field_store
 1.05      0.087     0.048     0.000     0.039     72000  Protobuf::Field::BaseField#packed?
 1.00      0.137     0.046     0.000     0.091     36003  Class#new

Continuing with our scientific method of “first on the list”, we’ll dig into Protobuf::Enum. There are a few uses of to_i in that class, but the one that stands out is:

# Public: Get an array of Enum objects with the given tag.
#
# tag - An object that responds to to_i.
#
# Examples
#
#   class StateMachine < ::Protobuf::Enum
#     set_option :allow_alias
#     define :ON, 1
#     define :STARTED, 1
#     define :OFF, 2
#   end
#
#   StateMachine.enums_for_tag(1) #=> [ #<StateMachine::ON=1>, #<StateMachine::STARTED=1> ]
#   StateMachine.enums_for_tag(2) #=> [ #<StateMachine::OFF=2> ]
#
# Returns an array with zero or more Enum objects or nil.
#
def self.enums_for_tag(tag)
  enums.select do |enum|
    enum.to_i == tag.to_i
  end
end

The code has great documentation on how the method works, which makes things easy for us. We use a lot of enums in our protos, and looking around the Protobuf::Enum class it’s clear enums_for_tag is called frequently (in enum_for_tag and fetch). Looking at the all_tags and values methods, we know the enum data doesn’t change after the class is defined, so it’s safe to cache. That makes this a straightforward O(n) -> O(1) optimization. One wrinkle: because enums can be aliased, a single tag can map to multiple enums, so the cache has to map each int to a list of enums rather than to a single one. That’s fine for this case though.

The full changes are at the end, but the gist is:

def self.mapped_enums
  @mapped_enums ||= enums.each_with_object({}) do |enum, hash|
    list = hash[enum.to_i] ||= []
    list << enum
  end
end

def self.enums_for_tag(tag)
  mapped_enums[tag.to_i] || []
end
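To see the shape of the cache this builds, here’s a standalone toy version of the same memoization. The Color class and its values are invented for illustration; the real code operates on Protobuf::Enum instances.

```ruby
# Toy version of the tag -> enums cache. Names here are invented for
# illustration; the real code lives in Protobuf::Enum.
class Color
  VALUES = { ON: 1, STARTED: 1, OFF: 2 } # :STARTED aliases tag 1

  # Build the tag -> [enums] hash once and memoize it.
  def self.mapped_enums
    @mapped_enums ||= VALUES.each_with_object({}) do |(name, tag), hash|
      (hash[tag] ||= []) << name
    end
  end

  # O(1) hash lookup instead of scanning every enum with Array#select.
  def self.enums_for_tag(tag)
    mapped_enums[tag.to_i] || []
  end
end

Color.enums_for_tag(1) # => [:ON, :STARTED]
Color.enums_for_tag(3) # => []
```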

Then we rerun it under RubyProf:

Measure Mode: wall_time
Thread ID: 70271059223040
Fiber ID: 70271070182440
Total: 3.308983
Sort by: self_time
%self      total      self      wait     child     calls  name
10.16      0.736     0.336     0.000     0.400    140000  <Class::Protobuf::Decoder>#read_varint
 4.92      0.182     0.163     0.000     0.019     94000  Kernel#respond_to?
 4.62      1.148     0.153     0.000     0.995     72000  <Class::Protobuf::Decoder>#read_field
 3.33      0.501     0.110     0.000     0.391     72000  <Class::Protobuf::Decoder>#read_key
 3.25      0.308     0.108     0.000     0.201     72000  Protobuf::Message::Fields::ClassMethods#get_field
 2.70      0.156     0.089     0.000     0.066    166500  IO::generic_readable#readbyte
 2.37      0.078     0.078     0.000     0.000    410500  Fixnum#&
 2.34      0.119     0.077     0.000     0.042    171500  Numeric#nonzero?

Much better! The percentage saved (~25%) reported by RubyProf is accurate, but the absolute times are inflated because RubyProf hooks into Ruby to profile everything. Without RubyProf, we saw requests from the Ruby MRI 2.2.3 client drop from 1,000ms to around 750ms.

Since 750ms is still too long, we’ll take a look at the next slowest callsite: Protobuf::Decoder’s read_varint method.

# Read varint integer value from +stream+.
def self.read_varint(stream)
  value = index = 0
  begin
    byte = stream.readbyte
    value |= (byte & 0x7f) << (7 * index)
    index += 1
  end while (byte & 0x80).nonzero?
  value
end
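For context, this is standard protobuf varint decoding: each byte contributes its low 7 bits (least-significant group first), and the high bit marks continuation. A self-contained copy of the same loop shows how 300 comes out of its two-byte encoding:

```ruby
require 'stringio'

# Same algorithm as Protobuf::Decoder.read_varint above, copied out so it
# runs standalone: the low 7 bits of each byte are payload, and the high
# bit means "more bytes follow".
def read_varint(stream)
  value = index = 0
  loop do
    byte = stream.readbyte
    value |= (byte & 0x7f) << (7 * index)
    index += 1
    break if (byte & 0x80).zero?
  end
  value
end

# 300 = 0b1_0010_1100 encodes as 0xAC (low 7 bits + continuation bit)
# followed by 0x02 (the remaining bits).
read_varint(StringIO.new([0xAC, 0x02].pack('C*'))) # => 300
```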

The code is simple enough that any optimizations will be marginal at best. Ruby is just too slow for the number of calls we’re making, which makes this method a candidate for rewriting in C. It turns out the authors of the ruby-protocol-buffers gem (a different Ruby protobuf gem from the one we use) released a gem called varint which does exactly what we need.

Normally, I don’t like replacing Ruby with C, especially in a library. There’s no guarantee the C and Ruby code behave the same way, and both debugging and maintenance become harder. But since the varint gem is already used by another gem, and the code is self-contained, I gave it a shot.

We add a simple monkeypatch to our test case:

module Protobuf
  class FastVarint
    extend ::Varint
  end

  class Decoder
    def self.read_varint(stream)
      FastVarint.decode(stream)
    end
  end
end

Rerun RubyProf

Measure Mode: wall_time
Thread ID: 70126242488840
Fiber ID: 70126246561280
Total: 2.722766
Sort by: self_time
%self      total      self      wait     child     calls  name
 5.65      0.676     0.154     0.000     0.522     72000  <Class::Protobuf::Decoder>#read_field
 5.64      0.172     0.154     0.000     0.018     94000  Kernel#respond_to?
 4.55      0.283     0.124     0.000     0.159     72000  <Class::Protobuf::Decoder>#read_key
 4.04      0.302     0.110     0.000     0.192     72000  Protobuf::Message::Fields::ClassMethods#get_field
 3.48      0.250     0.095     0.000     0.156    140000  <Class::Protobuf::Decoder>#read_varint
 3.13      0.156     0.085     0.000     0.070    140000  Varint#decode
 2.86      0.078     0.078     0.000     0.000    144000  Protobuf::Field::BaseField#repeated?
 2.58      0.070     0.070     0.000     0.000    166500  StringIO#getbyte
 2.49      0.068     0.068     0.000     0.000     69000  Protobuf::Field::BaseField#setter
 2.10      0.058     0.057     0.000     0.000      1500  Kernel#caller

We got an 18% improvement by blatantly cheating and having C do the work instead. Even with all these changes, though, our decoding time is still around 600ms.

Moving on, we take another look at Protobuf::Decoder. Unfortunately, the read_field and read_key methods are too simple for us to get any real optimization out of them. Kernel#respond_to? is called from too many places to remove, and it’s there for safety — for example, returning nil when an argument doesn’t respond to to_i rather than raising.

That really just leaves us with Protobuf::Message::Fields::ClassMethods, unless we want to optimize 50 callsites for a few milliseconds each. Looking at get_field, we see:

def get_field(name_or_tag, allow_extension = false)
  name_or_tag = name_or_tag.to_sym if name_or_tag.respond_to?(:to_sym)
  field = field_store[name_or_tag]
  if field && (allow_extension || !field.extension?)
    field
  else
    nil
  end
end

Looking at the RubyProf trace above, we already know respond_to? is called 94,000 times in total. After adding some debugging to count calls, it turns out get_field accounts for 72,000 (about 75%) of the calls to respond_to?, yet it only ends up calling to_sym 2,000 times. Essentially, only ~2.7% of those respond_to? calls result in a true.
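The post doesn’t show the debugging code; one rough way to gather numbers like these (the Payment class and labels here are invented for illustration) is to prepend a tallying wrapper onto the class under test:

```ruby
# Rough sketch of call counting via Module#prepend. The Payment class is
# invented; in practice you'd prepend into whatever class's respond_to?
# traffic you want to measure.
module RespondToCounter
  COUNTS = Hash.new(0)

  def respond_to?(*args)
    locs = caller_locations(1, 1)
    label = locs && locs.first ? locs.first.base_label : '(top)'
    COUNTS[label] += 1 # tally by the calling method's name
    super
  end
end

class Payment
  prepend RespondToCounter
end

payment = Payment.new
payment.respond_to?(:to_sym) # each call is tallied before deferring to super
payment.respond_to?(:to_i)
```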

Instead of calling respond_to? and to_sym, we can cache a map of string name -> field and do two O(1) hash lookups. The get_field method then looks like:

def get_field(name_or_tag, allow_extension = false)
  field = field_store[name_or_tag] || str_field_store[name_or_tag]
  if field && (allow_extension || !field.extension?)
    field
  else
    nil
  end
end
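The str_field_store here is a cache built once per message class. A standalone sketch of the idea follows — the class, field names, and values are invented for illustration, not the gem’s actual code:

```ruby
# Standalone sketch of the two-hash lookup. FIELDS stands in for the gem's
# field_store, which is keyed by both Symbol name and Integer tag.
class FieldLookupExample
  FIELDS = {
    :id => 'id-field', 1 => 'id-field',
    :amount => 'amount-field', 2 => 'amount-field',
  }.freeze

  def self.field_store
    FIELDS
  end

  # String-keyed view, computed once, so String lookups skip
  # respond_to?/to_sym entirely.
  def self.str_field_store
    @str_field_store ||= field_store.each_with_object({}) do |(key, field), hash|
      hash[key.to_s] = field if key.is_a?(Symbol)
    end
  end

  def self.get_field(name_or_tag)
    field_store[name_or_tag] || str_field_store[name_or_tag]
  end
end

FieldLookupExample.get_field(:id)  # => "id-field"
FieldLookupExample.get_field('id') # => "id-field"
FieldLookupExample.get_field(2)    # => "amount-field"
```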

One last rerun of RubyProf

Measure Mode: wall_time
Thread ID: 70107917574660
Fiber ID: 70107924939720
Total: 2.428846
Sort by: self_time
%self      total      self      wait     child     calls  name
 6.12      0.666     0.155     0.000     0.511     72000  <Class::Protobuf::Decoder>#read_field
 4.70      0.277     0.119     0.000     0.158     72000  <Class::Protobuf::Decoder>#read_key
 3.70      0.248     0.094     0.000     0.154    140000  <Class::Protobuf::Decoder>#read_varint
 3.39      0.132     0.086     0.000     0.046     72000  Protobuf::Message::Fields::ClassMethods#get_field
 3.39      0.154     0.086     0.000     0.068    140000  Varint#decode
 2.96      0.075     0.075     0.000     0.000    144000  Protobuf::Field::BaseField#repeated?
 2.70      0.068     0.068     0.000     0.000    166500  StringIO#getbyte
 2.60      0.066     0.066     0.000     0.000     69000  Protobuf::Field::BaseField#setter

This gives us another 12% improvement, and Kernel#respond_to? dropped to 22,000 calls (1.29% of self time). Overall, we optimized deserialization by 50%, and it only took a few hours of investigation with RubyProf.

RubyProf is a great tool, and you can get even more detail using a GraphPrinter, or a CallTreePrinter whose output can be opened in KCachegrind. For our case, FlatPrinter was simple, and the code was easy enough to trace through to narrow down what to optimize. Examples of the various printers in RubyProf can be found on the repo.

We shipped all of the changes upstream into the protobuf gem. You can see the exact changes that were made in the varint, get_field/get_extension_field, and enum optimization PRs. If you’re using the protobuf gem, v3.5.5 and higher with the varint gem (MRI only) installed will give you the full set of optimizations.

In the future, we’re going to look at moving to protobuf3. The Ruby gem is in alpha, but it’s built on Google’s first-party C++ library. In benchmarks against the optimizations above, Google’s implementation is about 7x faster than ruby-protobuf.