Exploiting A/B Testing for Fun and Profit (Part 2)

Juan Berner

This is the second part of a post based on a talk I gave at the Ekoparty security conference in 2016 on exploiting A/B testing frameworks. You can find part one here.

A/B Testing for Penetration Testers

An important aspect of A/B testing that directly affects penetration testers is that when a website uses A/B testing for anything that affects its logic, a single web vulnerability scan won’t give a complete picture of the actual website. Several times I have heard from confused penetration testers who found a vulnerability and then could no longer trigger it from a fresh session, because an experiment had placed them in a different variant that was not vulnerable. Even though A/B testing is becoming the norm, this has remained largely ignored by security specialists.

If any page might run multiple experiments at a time, each with multiple variants that a normal vulnerability scanner would only partially see, what can be done about it? For example, if a website were running only 4 experiments at the same time, each presenting two different backend code paths, a single scan would only see half of the possible views of those experiments (4 of the 8 variants), ignoring possible vulnerabilities in the other half.

A naive thought might be that scanning such a website twice would be enough to see all its possible versions, but there are two things to take into account:

  1. Experiments are usually sticky, which means you would need to ensure that the same cookies are not reused (nor anything else that could identify the scanner as a visitor seen before).
  2. Since we are assigned to experiments probabilistically, we can’t guarantee with complete certainty that we will or won’t be part of a given experiment.
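To make both points concrete, here is a minimal sketch of how a sticky, probabilistic assignment could behave. It assumes a hypothetical framework that buckets visitors by hashing a session identifier together with the experiment name; `assign_variant`, `fresh_session_id` and `variants_seen` are illustrative names, not part of any real A/B testing library.

```python
import hashlib
import uuid


def assign_variant(session_id: str, experiment: str, variants=("A", "B")) -> str:
    """Hypothetical sticky bucketing: the same session always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode()).digest()
    # The first byte of the hash picks the bucket (an even 50/50 split for two variants).
    return variants[digest[0] % len(variants)]


def fresh_session_id() -> str:
    """Stand-in for starting a scan with no cookies, so the site sees a new visitor."""
    return uuid.uuid4().hex


def variants_seen(experiment: str, scans: int) -> set:
    """Variants observed across `scans` independent sessions (not guaranteed to be all)."""
    return {assign_variant(fresh_session_id(), experiment) for _ in range(scans)}
```

Calling `assign_variant` twice with the same session identifier returns the same variant, which is why a scanner that keeps its cookies never leaves its bucket. `variants_seen` only approaches full coverage as the number of fresh sessions grows.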

Now, if you knew that a website usually has 10000 concurrent experiments running, does that mean you would need to scan it 20000 times to ensure you have seen all of them? You can never be sure you have seen all possible experiments (unless there is data leaking from the experimentation library), but you can estimate how many scans you need to reach a given degree of certainty.

Taking the previous case as an example, if each experiment assigns visitors to one of two versions with a probability of 50%, there would be 2¹⁰⁰⁰⁰ possible combinations of the website to scan, a number with more than 3,000 digits. Of course, covering every combination is not practical, and we don’t need to: what matters is seeing each version of each experiment at least once. The more independent crawls we run (meaning the victim site believes they all come from different users), the higher the probability of coverage.

Let’s take a simple example: a website that only runs experiments with two versions, where a visitor has a 50% chance of falling into either one. A formula to calculate how many scans we need to satisfy a particular confidence condition would be:

scans = log2(E) + log2(1/(1 - P)) + 1

P: Desired probability of seeing every version of every experiment
E: Number of experiments expected on the site

log2(10000) + log2(1/(1 - 0.99)) + 1 ≈ 20.93, which rounds up to 21

This means we would need to perform 21 scans of a website that runs 10000 experiments (E) to have 99% confidence (P) of seeing all versions of every experiment. Given the logarithmic nature of the problem, we could configure safe defaults (for example, 50 to 100 scans for very large websites) that give a good probability of having checked all possible versions of a website.
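The estimate can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text: every experiment has exactly two versions with a 50% split, experiments are independent of each other, and each scan runs as a fresh visitor that touches every experiment. The function names are illustrative. `coverage_probability` computes the exact chance of full coverage for a given number of scans, which is a useful check because the formula is an approximation of it.

```python
import math


def scans_needed(experiments: int, confidence: float) -> int:
    """Estimate from the text: log2(E) + log2(1 / (1 - P)) + 1, rounded up."""
    raw = math.log2(experiments) + math.log2(1.0 / (1.0 - confidence)) + 1.0
    return math.ceil(raw)


def coverage_probability(experiments: int, scans: int) -> float:
    """Exact chance that `scans` fresh sessions see both versions of every experiment."""
    if scans < 1:
        return 0.0
    # One experiment is missed only if all scans land in the same version:
    # 2 versions * (1/2) ** scans = 2 ** (1 - scans).
    missed_one = 2.0 ** (1 - scans)
    return (1.0 - missed_one) ** experiments
```

For the case in the text, `scans_needed(10000, 0.99)` returns 21. `coverage_probability(10000, 21)` is just above 0.99, while 20 scans fall slightly short of it.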

Figure: number of scans needed, for a given number of experiments, to reach a given probability of seeing all possible versions of the website. While a small number of experiments might still take many scans to reach good certainty, the logarithmic growth means a site with thousands of experiments could be covered with only tens of iterations.
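The relationship shown in the figure can also be sanity-checked empirically. The following Monte Carlo sketch assumes, as before, independent two-version experiments with a 50% split and one fresh session per scan; `simulate_coverage` is an illustrative name, and the trial count and seed are arbitrary choices.

```python
import random


def simulate_coverage(experiments: int, scans: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated scan campaigns that saw both versions of every experiment."""
    rng = random.Random(seed)
    # Each experiment's outcome over `scans` sessions is a bit string, one bit per scan.
    # It is NOT fully covered when every bit is 0 or every bit is 1.
    all_same = (0, (1 << scans) - 1)
    covered = 0
    for _ in range(trials):
        if all(rng.getrandbits(scans) not in all_same for _ in range(experiments)):
            covered += 1
    return covered / trials
```

With 100 experiments and the 15 scans the formula suggests for 99% confidence, the simulated coverage rate should land close to 0.99. With a single scan it is always 0, since one session can only ever see one version of each experiment.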

These experiments can become a very important part of a security assessment, since they provide a glimpse of fresh new code. Because of its experimental nature, that code may have received less testing, which can make it less secure than its established counterpart.

This also means that current scanning tools give only partial results on sites that run experiments. Even when they do find a vulnerability in an experiment, the user of the tool might not be able to reproduce it, since their own session might not be assigned to that experiment.

In the third demo of my talk, I used a web application vulnerability scanner (w3af) to scan a website with a vulnerability behind an experiment, to show how the number of scans needed depends on the number of pages and experiments on that site.

Embracing experimentation

As a penetration tester, you should not see experimentation as a problem that reduces your results, but as an opportunity to find new code that might have gone through less testing. Web vulnerability scanning tools should adapt to spot experiments, improving their chances of finding vulnerabilities that might otherwise remain hidden and, in turn, improving the security of those applications. This could be done by detecting experiments on a site across multiple scans and saving the necessary information, so that a user who wants to replicate a vulnerability can do so under the profile that had the vulnerable experiment selected.
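A minimal sketch of what that could look like inside a scanner follows. Everything here is hypothetical: it assumes each scan can be summarised as a mapping from URL to response body, and that the session cookies are enough to pin a visitor to its variants. Comparing raw body hashes is a crude signal, since pages with timestamps or per-request tokens would also differ between sessions, so the output is only a list of candidates.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


def experiment_candidates(scans: list) -> list:
    """URLs whose response body differed between independent sessions.

    `scans` holds one {url: response_body} dict per session (an assumed format).
    """
    fingerprints = defaultdict(set)
    for pages in scans:
        for url, body in pages.items():
            fingerprints[url].add(hashlib.sha256(body.encode()).hexdigest())
    return sorted(url for url, seen in fingerprints.items() if len(seen) > 1)


@dataclass
class ScanProfile:
    cookies: dict  # session cookies that tied this scan to its variants
    findings: list = field(default_factory=list)  # findings reported under this session


@dataclass
class ProfileStore:
    profiles: list = field(default_factory=list)

    def record(self, cookies: dict, findings: list) -> None:
        """Save the session state of a finished scan next to what it found."""
        self.profiles.append(ScanProfile(dict(cookies), list(findings)))

    def profiles_for(self, finding: str) -> list:
        """Profiles under which a finding was seen, so it can be replayed."""
        return [p for p in self.profiles if finding in p.findings]
```

To reproduce a finding, the tester would replay the request with the cookies of one of the profiles returned by `profiles_for`, instead of starting from a fresh session that may land in a different variant.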

Additional information

Slides of the talk
Video of the talk
The code used in the talk

Written by Juan Berner, Security Researcher & Architect
