Hardening Perl’s Hash Function

In 2003 the Perl development community was made aware of an algorithmic complexity attack on Perl’s hash table implementation[1]. This attack was similar to those reported over the last few years against other languages and packages, such as the Java, Ruby, and Python hash implementations.

Booking.com Engineering
Nov 6, 2013

Written by Yves Orton

The basic idea of this attack is to precompute a set of keys which would hash to the same value, and thus into the same storage bucket. These keys would then be fed (as a batch) to a target, which would then have to compare each new key against every previously stored key before inserting it, effectively turning the hash into a linked list and changing the performance profile for inserting each item from O(1) (amortized) to O(N). This meant that the common practice of loading arguments such as GET/POST parameters into hashes provided a vector for denial-of-service attacks on many HTTP-based applications.
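To make the idea concrete, here is a rough sketch in Perl of the classic one-at-a-time hash together with a brute-force search for short keys that share a bucket when the seed is known. This is an illustrative model only; the hash Perl actually used is the C PERL_HASH macro, which this code approximates but does not reproduce byte for byte.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Simplified model of Bob Jenkins' one-at-a-time hash, close in spirit to
    # the pre-5.18 PERL_HASH macro (not byte-for-byte identical to hv.h).
    sub one_at_a_time {
        my ($key, $seed) = @_;
        my $hash = $seed;
        for my $byte (unpack "C*", $key) {
            $hash = ($hash + $byte)         & 0xFFFFFFFF;
            $hash = ($hash + ($hash << 10)) & 0xFFFFFFFF;
            $hash =  $hash ^ ($hash >> 6);
        }
        $hash = ($hash + ($hash << 3))  & 0xFFFFFFFF;
        $hash =  $hash ^ ($hash >> 11);
        $hash = ($hash + ($hash << 15)) & 0xFFFFFFFF;
        return $hash;
    }

    # With a known seed (here: 0) an attacker can precompute keys that all land
    # in the same bucket: for a table of 8 buckets the index is just "hash & 7".
    my %by_bucket;
    for my $key ('aaa' .. 'zzz') {
        push @{ $by_bucket{ one_at_a_time($key, 0) & 7 } }, $key;
    }
    printf "%d of %d keys land in bucket 0, e.g. %s\n",
        scalar @{ $by_bucket{0} }, 26**3, join ", ", @{ $by_bucket{0} }[ 0 .. 4 ];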

As a response to this, Perl implemented a mechanism by which it would detect long chains of entries within a bucket and trigger a “hash split”. This meant it would double the number of buckets and then redistribute the previously stored keys as required into the newly added buckets. If, after this hash split, the chain was still unacceptably long, Perl would put the hash into a special mode (REHASH mode) in which it used a per-process random hash seed for its hash function. Switching a normal hash into this special mode would cause Perl to allocate a new bucket array, recalculate the hash values of all previously stored keys using the random seed, and redistribute the keys from the old bucket array into the new one. This mitigated the attack by remapping the previously colliding keys into a well-distributed set of randomly chosen new buckets.
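Conceptually, a hash split doubles the bucket array and redistributes each entry according to one more bit of its full hash value. The sketch below is a toy model of that step, not the real logic, which lives in hv.c:

    # Toy model of a hash split: double the bucket array and redistribute each
    # entry according to one more bit of its full 32-bit hash value. Each entry
    # is a [$key, $hash] pair; this is an illustration, not the hv.c code.
    sub hash_split {
        my ($buckets) = @_;                  # arrayref of bucket chains
        my $new_size  = 2 * @$buckets;
        my @new       = map { [] } 1 .. $new_size;
        for my $chain (@$buckets) {
            for my $entry (@$chain) {
                push @{ $new[ $entry->[1] & ($new_size - 1) ] }, $entry;
            }
        }
        return \@new;
    }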

At this point the Perl community thought we had put the subject of hash collision attacks behind us, and for nearly 10 years we heard little more on the subject.

Memory Exhaustion Attack On REHASH Mechanism

Over the years the subject of changing our hash function came up occasionally. For instance, Jarkko made a number of comments that there were faster hash functions available, and in response I did a bit of research into the subject, but little came of this work.

In 2012 this changed. I was working on several projects that made heavy use of Perl’s hash function, and I decided to invest some effort in finding out whether other hash functions would provide performance improvements. At the same time other people in the Perl community were becoming interested, partly due to my work and partly due to the publicity from the multi-collision attacks on Python’s and Ruby’s hash functions (MurmurHash and CityHash), publicity I did not actually notice until after I had pushed patches to switch Perl to MurmurHash as its default hash, a change that got reverted very quickly.

In restructuring the Perl hash implementation to make it easier to test different hash functions, I became well acquainted with the finer details of the implementation of the REHASH mechanism. Frankly, it got in the way and I wanted to remove it outright. While arguing about whether it could be replaced with a conceptually simpler mechanism, I discovered that the defenses put in place in 2003 were not as strong as had previously been believed. In fact they provided a whole new and, arguably, more dangerous attack vector than the original attack they were meant to mitigate. This resulted in the perl5 security team announcing CVE-2013-1667 and the release of security patches for all major Perl versions since 5.8.x.

The problem was that the REHASH mechanism allowed an attacker to create a set of keys which would cause Perl to repeatedly double the size of the hash table, but never trigger the use of the randomized hash seed. With relatively few keys the attacker could make Perl allocate a bucket array with up to 2³² hash buckets, or as many as memory would allow. Even if the attack did not consume all the memory on the box, there would be serious performance consequences as Perl remapped the keys into ever-increasing bucket arrays. Even on fast 2013 hardware, counting from 0 to 2³² takes a while!
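To put the size of such a bucket array in perspective, here is a quick back-of-the-envelope calculation, assuming an 8-byte pointer per bucket on a 64-bit build and ignoring the cost of the entries themselves:

    # 2**32 buckets at 8 bytes of pointer each, before counting any entries:
    printf "%.0f buckets -> %.0f GiB for the bucket array alone\n",
        2**32, 2**32 * 8 / 2**30;    # 4294967296 buckets -> 32 GiB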

This issue affected all versions of Perl from 5.8.2 to 5.16.2. It does not affect Perl 5.18. For those interested the security patches for these versions are as follows:

At this time most Perl installations should be security-patched. Additionally, official Perl maintenance releases 5.16.3 and 5.14.4 were published. But if you would like to know whether you are vulnerable, you can try the following program:

The following are statistics generated by the time program for the full attack (not the one-liner above) against a Perl 5.16 with and without the fix applied (identical/zero lines omitted), on a laptop with 8 GB of RAM:

Without the fix patch (0ff9bbd11bcf0c048e5b3e4b893c52206692eed2):

With the fix patch (f1220d61455253b170e81427c9d0357831ca0fac) applied:

But this doesn’t explain all of the changes in Perl 5.18

The observant reader will have realized that if we could patch older Perls to be robust against CVE-2013-1667, we could also have patched 5.18 and avoided the problems caused by changing Perl’s hash function. The reason we went even further than those maintenance patches is that we found that Perl had further, if less readily exploitable, vulnerabilities, and we wanted to do our best to fix them all.

This part of the story starts with Ruslan Zakirov posting a report to the perl5-security mailing list. The report outlined the basis of a key discovery attack on Perl’s hash function. At first the Perl security team was not entirely convinced, but he then followed up with more code that demonstrated that his attack was sound. This development meant that the choice of a random seed would not make Perl’s hash function robust to attack. An attacker could relatively efficiently determine the seed, and then use that knowledge to construct a set of attack keys that could be used to attack the hash function.

Nicholas Clark then ramped things up a notch and did some in-depth analysis of the attack and the issues involved; at the same time, so did Ruslan and I. The conclusion of this analysis was that the attack exploited multiple vulnerabilities in how Perl’s hash data structure worked, and that the response would similarly require a multi-pronged approach.

Changes to the One-At-A-Time function

The first vulnerability was that Bob Jenkins’ One-At-A-Time hash, which Perl used, did not “mix” the seed together with the hashed data well enough for short keys. This allowed an attacker to mount a key-discovery attack by using small sets of short keys, and the order they were stored in, to probe the seed and eventually expose enough of its bits that a final collision attack could be mounted.

We addressed this issue by making Perl append a four-byte, randomly chosen suffix to every string it hashes. This means that we always “mix” the seed at least four times, and we mix it with something that the attacker cannot know. This effectively doubles the number of bits used for “secret” state and ensures that short keys do not “leak” information about the original seed. The reason we use a suffix is that adding a prefix is equivalent to starting from a different initial seed state, so it adds no extra security. A suffix modifies the final state after the user input has been processed and increases the search space an attacker must consider.

Related to this change, the original One-At-A-Time function was potentially vulnerable to multi-collision attacks. An attacker could precalculate one or more suffixes such that

    H( x ) == H( concat(x, suffix) )

which would then allow an attacker to trivially construct an infinite set of keys which would always collide into the same bucket. We hardened the hash by also mixing the length of the key into the seed. We believe that this more or less eliminates the possibility of a multi-collision attack, as it means that the seed used to calculate H( concat(x, suffix) ) would not be the same seed as the one used for H( concat(x, suffix, suffix) ). Cryptographers are invited to prove us wrong.
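As a rough sketch of the two measures just described, the model below appends a secret per-process suffix to every key and folds the key length into the seed. It reuses the one_at_a_time() sketch from earlier; the real implementation is C code in Perl’s hv_func.h and differs in detail.

    # Model of the hardened hashing described above (illustration only):
    # a per-process secret suffix (modeled here as four random bytes) is
    # appended to every key, and the key length is folded into the seed, so the
    # state behind H(x . suffix) differs from that behind H(x . suffix . suffix).
    my $SEED   = int rand 2**32;                          # per-process random seed
    my $SUFFIX = pack "C4", map { int rand 256 } 1 .. 4;  # per-process secret suffix

    sub hardened_hash {
        my ($key) = @_;
        my $seed = ($SEED ^ length $key) & 0xFFFFFFFF;    # mix the length into the seed
        return one_at_a_time($key . $SUFFIX, $seed);      # one_at_a_time() as sketched earlier
    }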

Reduce Information Leakage

The second vulnerability was that it is all too easy to leak information about the hash function to an attacker. For instance, a web page might accept a set of parameters and respond with information for each of those parameters in the natural key order of the hash. This might provide enough information to mount a key-discovery attack.

In order to prevent this information leakage we randomize the element order returned by the keys() and each() functions. We do this by adding a mask to each hash and modifying that mask in a pseudo-random way every time a key is inserted. During traversal we iterate the bucket index from 0 to the last bucket and XOR the iteration value with the mask to choose which bucket to visit. The result is that every time a new key is added to the hash, the order of the keys changes more or less completely. This means that the "natural" key order of the hash exposes almost no useful data to an attacker: seeing one key in front of another tells you nothing about which bucket either key was stored in. A nice side effect of this mask is that we can use it to detect code that inserts into a hash during an each() traversal, which generally indicates a problem in that code and can produce quite surprising results that are very difficult to debug. In Perl 5.18, when this happens we warn the developer about it.
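The following toy model shows the masked-traversal idea: the bucket visit order is scrambled by a per-hash mask, and because the real code perturbs that mask on every insert, the apparent key order changes as keys are added. The bucket layout and mask below are made up for illustration; this is not the hv.c code.

    # Toy model of masked bucket traversal: the iteration index is XORed with a
    # per-hash mask, so the visit order (and hence keys()/each() order) changes
    # whenever the mask changes. In the real code the mask changes on insert.
    my @buckets = ( ['apple'], [ 'pear', 'plum' ], [], ['fig'], [], ['kiwi'], [], ['lime'] );
    my $mask    = int rand scalar @buckets;   # per-hash mask (bucket count is a power of two)

    for my $i ( 0 .. $#buckets ) {
        my $chain = $buckets[ $i ^ $mask ];   # visit buckets in masked order
        print "$_\n" for @$chain;
    }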

A third vulnerability is related to the case where two keys are stored in the same bucket. In this case the order of the keys was predictable: the most recently added key would come out “first” during a keys() or each() traversal. This in itself allows a small amount of data to leak to a potential adversary. By identifying such a case one could find two (or more) strings whose hash values share the same least significant bits. By stuffing more keys into the hash and triggering a hash split, an attacker could then determine whether the newly used bit of the hash value was the same or different for the two keys. Without the key-order randomization logic mentioned previously, the attacker could also determine which of the two had a 1 or a 0 in the most significant bit of the used part of the hash value.

While we were not yet able to construct an actual attack based on this information, we decided to harden against it anyway. This is done by randomly choosing whether to insert a colliding key at the top of the bucket chain or in the second position from the top. Similarly, during a hash split we make the same decision when keys collide while being remapped into the new buckets. The end result is that the order of two keys colliding in a bucket is more or less random, although the order of more than two keys is not.
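A toy model of that insertion rule, treating a bucket chain as an array of entries (illustrative only, not the hv.c code):

    # A key that collides into a bucket goes either at the top of the chain or
    # just below the top, chosen at random.
    sub insert_into_chain {
        my ($chain, $entry) = @_;
        if ( @$chain && int rand 2 ) {
            splice @$chain, 1, 0, $entry;    # second from the top
        }
        else {
            unshift @$chain, $entry;         # top of the chain
        }
    }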

People complain. Randomization is good anyway!

We introduced randomization into Perl’s hash function in order to harden it against attack. But we have discovered that this had other positive consequences that we did not foresee.

The first of these initially appeared to be a downside. Perl’s hash function had behaved consistently for more than a decade, and over that time Perl developers inadvertently created dependencies on its key order. Most of the examples of this were found in the test files of CPAN modules: many of us got lazy and “froze” a key order into the test code, for example by embedding the output of a Data::Dumper call into one’s tests. Some of these, however, were real bugs in reputable and well-tested modules.
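For example, a test that compares output against a frozen Data::Dumper string breaks as soon as key order changes. The usual fix is to force a deterministic order, either with $Data::Dumper::Sortkeys or by sorting keys explicitly; the expected-string comparison below is only a hypothetical illustration of the fragile pattern.

    use Data::Dumper;

    my %param = ( beta => 2, alpha => 1, gamma => 3 );

    # Fragile: the dump depends on hash order, which is now random per process.
    # is( Dumper(\%param), $frozen_expected_string );   # hypothetical test assertion

    # Robust: force a deterministic key order before dumping or comparing.
    $Data::Dumper::Sortkeys = 1;
    print Dumper( \%param );

    # Or sort explicitly wherever key order matters:
    print "$_=$param{$_}\n" for sort keys %param;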

By making Perl’s key order random, these dependencies on key order became instantly visible, and a number of bugs that had probably manifested themselves as “heisenbugs” became regular occurrences and much easier to track down and identify. I estimate that for every two “non-bugs” (in things like test code) that we found, there was one “real bug” identified as well. Considering that one of these was in the Perl core and another was in DBI, I personally consider this a good result.

Many people object that randomization like this makes debugging harder, the premise being that it becomes difficult to recreate a bug and thus debug it. I believe that in practice it is the opposite: randomization like this means that a formerly rare bug becomes common, which in turn makes it much more obvious that the bug is related to subtle dependencies on key order, and therefore much easier to find.

A last benefit of randomizing the hash function is that we can now, at any time, change or replace the hash function that Perl is built with. In fact, 5.18 is bundled with multiple hash functions, including the presumably cryptographically strong SipHash. Since hash order will from now on be random, we don’t have to worry if we change the function: external code should already be robust to the hash order being unpredictable.

How tangible are attacks on Perl’s hash function?

There has been a lot of discussion on this subject. Obviously, erring on the side of caution in security matters is the right course of action. Nevertheless, there is a lot of debate about how practical an attack like this is in real production environments where Perl is commonly used, such as web servers. Here are some points to keep in mind:

  1. Perl’s hash algorithm uses an array of buckets whose size is always kept between the number of keys stored in it and a factor of two larger. This means that a hash with 20 keys in it will generally have 32 buckets, a hash with 32 keys will be split into 64 buckets, and so on (see the sketch after this list). It also means that the more keys are inserted into a hash, the less likely further keys are to be put into the same bucket. So attacking a hash cannot make access to non-attack keys slower[2]. An attacker basically only slows down their own fetches; except as a by-product of resource consumption, they will not affect other requests.
  2. For an attack to reach DoS proportions, the number of items inserted into the hash would have to be very, very large. On modern CPUs a linked list of thousands to hundreds of thousands of keys would be necessary before there was serious degradation of service. At that point, even if the attack failed to degrade Perl’s hash algorithm, it would still function effectively as a data-flooding denial-of-service attack. Therefore, focusing on the hash-complexity aspect of the attack seems unwarranted.
  3. Rudimentary out-of-band measures are sufficient to mitigate an attack. Hard restrictions on the number of keys that may be accepted by publicly facing processes are enough to prevent an attack from causing any damage. For instance, Apache defaults to restricting the number of parameters it accepts to 512, which effectively hardens it against this type of attack. (This is one reason the attack on the rehash mechanism was so important: a “successful” attack requires relatively few keys.) Similarly, a well-designed application would validate the parameters it receives and not put them in a hash unless they were recognized.
  4. So long as the hash function chosen is not vulnerable to multi-collision attacks, simple per-process hash-seed randomization makes the job of finding an attack key set prohibitively difficult. One must first perform a hash-seed discovery attack and then generate a large set of keys. If restrictions on the number of keys the process will accept are in place, then the keys must be very large before collisions would have a noticeable effect, which makes the job of finding colliding keys all the more expensive.
  5. In many circumstances, such as a web-service provider, the hosts will be behind load balancers. This means either that every web host uses a different hash seed, making hash-seed discovery attacks very difficult, or that an attacker has to hold a very long-running, persistent session with the server they wish to attack, which should be easily preventable via normal monitoring procedures.
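As a quick illustration of point 1, on perls contemporary with this article a hash evaluated in scalar context reports its used/total bucket counts, which makes the doubling easy to watch (newer perls report something different in scalar context):

    # Watch the bucket array grow as keys are added: the total roughly doubles
    # each time the key count catches up with it.
    my %h;
    for my $n ( 1 .. 64 ) {
        $h{$n} = 1;
        printf "%2d keys -> %s (used/total buckets)\n", $n, scalar %h
            if $n == 8 || $n == 20 || $n == 32 || $n == 64;
    }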

For all these reasons it appears that hash-complexity attacks in the context of Perl and web hosting environments are of limited interest so long as:

  1. The hash function does not allow multi-collision attacks.
  2. The hash function uses at least per-process hash seed randomization.
  3. The interface to untrusted potential attackers uses simple, hard limits on the number of keys it will accept.

These properties are relatively easy to achieve without resorting to cryptographically strong hash functions (which are generally slow) or other complicated measures to prevent attacks. As the case of Perl’s rehashing flaw has shown, the cure may be worse than the disease: the code that this CVE exploits was added to Perl as part of our attempt to defend against the hypothetical attack vector of excessive bucket collisions. We are, at this time, unaware of any real attack of this form.

Hash Functions For Dynamic Languages

It seems that the bulk of research on hash functions has focused on finding fast, cryptographically secure hash functions for long strings containing binary data. However, dynamic languages like Perl make heavy use of hash functions on small strings that are often restricted to simple alphanumeric characters: email addresses, identifiers such as variable and method names, single-character keys, lists of numeric IDs, and similar use cases. Not only are these relatively short strings, they also use restricted sets of bits in their bytes.

So far it appears that Bob Jenkins’ One-At-A-Time hash, with minor modifications, provides an acceptable solution. It seems to have good distribution properties, is reasonably fast for short strings, and, with the hardening measures added in Perl 5.18, appears to be robust against attack. Analysis done by the Perl5 Security Team suggests that the One-At-A-Time hash is intrinsically more secure than MurmurHash. However, to my knowledge there is no peer-reviewed cryptanalysis to prove it.

There seems to be very little research into fast, robust hash algorithms suitable for dynamic languages. SipHash is a notable exception and a step forward, but within the Perl community there seems to be consensus that it is currently too slow, at least in the recommended SipHash-2-4 incarnation. It is also problematic that its current implementation only supports 64-bit architectures. (No doubt this will improve over time, or perhaps already has.)

Universal hashing also seems promising in theory, but it is unproven in practice for cases like Perl’s, where the hash table can be very small and the input strings are of variable length and often drawn from restricted character sets. I attempted to implement a number of universal-hashing-style functions for Perl and was disappointed by extremely poor bucket distributions for simple tasks like hashing a randomly selected set of integers in text form. This may have been a flaw in my implementation, but it appeared that universal hashing does not perform particularly well when the input bytes are not evenly distributed. At the very least, further work is required to prove the utility of universal hashing in a dynamic-language context.

The dynamic/scripting language community needs the academic computing community to provide a better toolbox of peer-reviewed string hash functions that offer speed, good distribution over restricted character sets and short strings, and enough hardening to be robust against direct attack in practical deployments. Security is important, but theoretical attacks which require large volumes of key/response exchanges cannot trump requirements such as good distribution properties and acceptable performance characteristics. Perl now makes it relatively easy to add and test new hash functions (see hv_func.h), and it would make a nice test bed for those interested in this area of research.

Afterword

I would like to thank Nicholas Clark, Ruslan Zakirov, and Jarkko Hietaniemi for their contributions that led to the changes in Perl 5.18. Nicholas did a lot of deep analysis and provided the motivation for my attack proof, Ruslan provided analysis and working key-discovery attack scripts against Perl’s old hash function, and Jarkko motivated me to look into the general subject in the first place.

[1] See also: The bug filed against Perl and the original research.

[2] If anything, such an attack might make access to keys that aren’t part of the attack faster: The attack keys cluster in one bucket. That means the other keys are much more likely to spread out among the remaining buckets that now have fewer keys on average than without the attack.

Would you like to be a Developer at Booking.com? Work with us!
