A Faster Perl Runtime in Tiny Steps

Booking.com uses the Perl programming language heavily across its entire technical infrastructure. At the size of our infrastructure, even small performance improvements can translate into hefty savings, not to mention the exciting new features that we will implement with those spare CPU cycles. For this reason, we are very happy to announce that we are now funding Dave Mitchell, prominently known for his many years of high-quality contributions to the Perl language implementation, to improve the Perl runtime’s performance one small step at a time.

booking.development

Published in

Booking.com Engineering

6 min readMar 14, 2014

Written by Steffen Müller

The initial work started in February. Dave is working at the pace of his choosing and sharing some details on his progress on a monthly basis. The summary for February has already hit the development mailing list. Dave’s summaries are necessarily concise as they’re intended for experts. I’ll expand on some of the changes here.

Dereferencing Overloaded Objects

Perl’s support for overloading many common operations has ramifications throughout much of the runtime. Part of the strategy to keep the overhead of proliferating overload checks down is to encode the fact if there is any overloading into a flag on any given value which is inexpensive to check. In recent versions of Perl, bug fixes have exacerbated the performance impact in certain situations. In a nutshell, when a reference carries overloading for one operation (such as stringification) but not for another (such as dereferencing), the latter operation should not be slowed down significantly. Dave’s work has improved the dereferencing of objects (like $obj->{foo}) which have overloading. The other operations no longer have such a dramatic performance penalty.

This means that we can have the more faithful overload support of recent Perls and have it be efficient, too.

Faster Array Access (Sometimes)

Perl comes with an optimization that makes array accesses faster if done with constant indexes. That means this

$foo[3]

is faster than this:

$foo[$bar]

This likely comes as a surprise to few. Alas, the optimization strikes a run-time/space trade-off in that it only has 8 bits to store constant indexes in. If you think outside the box just a little bit, you’ll realize that this is not a huge limitation. Code that does $array[5913] is rare. Code like $array[0], that is small constant indexes, is the overwhelmingly most common case of constant-index accesses.

Dave’s change is as simple as it is effective. Previously, indexes in the range between 0 and 255 were optimized. Dave changed this to be between -128 and 127. This is a net win because again, $array[-1] is common to get the last element, whereas $array[200] is very rarely seen with a literal constant.

Avoid Doing Meaningless Work in the Runtime

The output of the Perl compiler is an in-memory structure called the OP-tree. It mostly resembles the syntax of the Perl code. In fact it does so much that we can have tools like B::Deparse that recreate something very close to the original source. During a late compilation phase, Perl builds a separate linked list of these OP structures that represents the order in which they will be executed. The optimizer can determine that some of the OPs in this so-called execution chain actually serve no purpose. Depending on their type and location, it may choose to either patch them out of the chain completely (great!) or to replace them with no-ops (OPs of type OP_NULL). The latter happens when the optimizer is not smart enough to figure out all entry points that could lead to executing the soon-to-be-ex-OP.

Dave’s work in this area means there are no more OPs of type OP_NULL that are ever executed. That is, any optimization now always manages to apply the more thorough elimination of needless OPs. This change still resides on a branch for testing.

Less Overhead on Common Hash Operations

In a nutshell, this work is going to make common initial operations on hashes more efficient. These are of the “enumerate all entries” variety: keys %hash, values %hash and @array = %hash. My colleague Yves Orton contributed a nice explanation of the work that Dave has put into this. Dave and Yves have been cooperating on this quite extensively:

Dave’s keys() optimization is much deeper than meets the eye. Understanding what it does requires a little insight into how Perl's hash data structures are organized internally and a bit of their history.

Historically a hash was a fairly complex structure, which included preallocated storage for iterator state. This meant they were large. At some point it was realized that many hashes are never iterated. This meant that we could lazily allocate the iterator structure. The way we chose to do this was to embed it at the end of the bucket array, forcing us to realloc() the bucket array on "first iteration". This has the advantage that hashes which are not traversed are smaller in memory. The trade off is that the realloc() can be slow.

Dave’s patch takes advantage of the observation that there is no need for per-hash iterator state when running keys() and similar operations. We traverse the whole hash in one go, so the iterator need not be stored against the hash, and instead a preallocated structure can be used by keys() for its purposes.

Until you consider hash order randomization, things are really that simple. But it so happens that keys() and each() must produce the same output order. The per-hash randomization data is stored in the iterator structure, so there is no way for the optimized keys() to guarantee the same order as a subsequent each() would (vice versa is not a problem as each() must create the iterator struct). The solution is to figure out how to store the randomization data in the base structure.

Unfortunately, this isn’t so easy. All members of the existing structure appear fully used, so we need to figure out how to squeeze more bits of data in without expanding space usage needlessly. Multiple options are still on the table. The structure currently looks like this:

struct xpvhv {
    HV*         xmg_stash;      /* class package */
    union _xmgu xmg_u;
    STRLEN      xhv_keys;       /* total keys, including placeholders */
    STRLEN      xhv_max;        /* subscript of last element of xhv_array */
};

At the most abstract what we want to do is add another STRLEN (U32) member called xhv_rand to the structure, but we want to do it without making the structure larger.

Now it turns out that xhv_max contains the size of the bucket array, and that is always a power of two. That in turn means it uses 32 bits to store 5 bits worth of information (for performance). So at the cost of some arithmetic we can squeeze out 27 valuable bits of space. xhv_keys contains the total number of keys, which in the current implementation will always be smaller than xhv_max, so hypothetically we could somehow store both in the same field. There is also the observation that the random data need only be as long (in terms of bits) as xhv_max is. This means that if we could ensure that any hash with more than K buckets must have a preallocated iterator structure, then for smaller hashes we can store the max (or keys) and random data in the same struct member.

An entirely different solution would be to use a hash of the address of the data structure as the initializer to xhv_rand(), this however has the potential to reduce security.

What was worked on was the logic to guarantee that hashes of a certain size have a preallocated iterator structure, plus, the logic to pack the keys and randomization data into the xhv_keys structure when the number of buckets is smaller than 2**16. This also included an implementation of hashing the data structure address as the initializer.

The changes discussed here will not land in the main development branch (blead in perl parlance) before the Perl 5.20 release.

More Compact Code for Leaving Scopes

When leaving a lexical scope, perl has to do a fair amount of work such as reference count related book keeping on things that have gone out of scope. This is a very hot code path (the leave_scope() function in scope.c). Dave tweaked part of this function and it now has smaller object code (97 bytes less). The new version skips a lot of tests for "simple" lexical variables going out of scope.

Micro-Optimizations for Perl Function Calls

Dave worked on micro-optimizing the implementation of the entersub OP, that is the OP that executes the majority of logic related to Perl function calls. The common code paths of this hot function have been tweaked, and branch prediction hints added where appropriate. The changes reduce the object size of the function by 11%.

Summary

Most of the above changes all small tweaks, but an excellent start into the project. I’m incredibly happy that Dave is willing to work on Perl for us and to everyone’s benefit.

Would you like to be a Developer at Booking.com? Work with us!