A/B Testing vs Caching

How we managed to have the best of both worlds in Expo

Mario Martins
Walmart Global Tech Blog
8 min read · Oct 10, 2020


Image credit: holdentrils

Let me start by telling you how I began working on this project.

I joined Walmart Global Tech to work on our in-house A/B Testing & Experimentation Platform called Expo. It had been a month or two since I joined the company, and of course, I was still in the process of onboarding. That day, a colleague turned to me and asked if I was going to be responsible for taking care of the Expo issue with ESI template caching.

Needless to say, I was confused. He briefly went through what ESI templates are, how our reverse proxy was set up to cache them, why we were having issues, and the team's proposal to fix it. It sort of made sense. They had a valid solution for the problem, but there were some gaps in it that I assumed were due to my lack of knowledge of the subject.

The next day we had a meeting with the reverse proxy team to discuss the issue and talk about the timeline to resume the work on it. To be honest, I still didn’t know enough about the problem to understand it entirely. Still, I could tell at the meeting that there had been unfortunate experiences in the past, and I’m glad my manager handled most of the conversation. It was only later that I learned that the team had enabled caching in production once, and it didn’t work as expected. Hence, people were not eager to try it again.

The problem there is that caching is an inevitable necessity. For high-traffic websites like Walmart.com, there is always a pressing need to increase user conversion by supporting more concurrent users or faster load times, and relying only on code optimizations or adding more computational power is not always the answer. Caching plays a crucial role in any website architecture, and it’s used in many different scenarios: browser caching, Content Delivery Network (CDN) layer caching, reverse proxy caching, in-memory caching, distributed caching, and so on.

However, because we have an A/B testing platform that intentionally changes the site for different users in order to compare their behavior, enabling caching in certain scenarios can lead to users participating in an A/B test seeing the wrong experience. This happens simply because of the nature of cache systems, which save previously generated data to serve to the next user accessing the site. For example, take two users making the same request. The first user gets assigned to treatment A. The second user gets assigned to treatment B but receives the cached response of treatment A. This makes the second user’s experience invalid for the test because they see the same site version the previous user saw.

For that reason, as the owners of the Walmart A/B testing platform, we had to push back on any initiative that involved caching of frontend or backend applications until we had a proper solution for it.

The first attempt

The first solution proposed by the Expo team was to have our Assignment Engine generate a general cache key extension that could be added to any cache system’s key.

Since Expo assignments are generated globally when users access the site, we know beforehand the combination of all experiments and variations that each user is assigned to. Using that information, we generate a cookie that is unique for each combination of experiment assignments. Cache systems can then append this value to their cache key to ensure assignments are accounted for when saving data to the cache storage.
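To make the idea concrete, here is a minimal sketch of how such a cookie value could be derived. The Assignment shape, the hashing choice, and the cookie name in the comments are illustrative assumptions, not Expo's actual implementation.

import { createHash } from "crypto";

// Illustrative shape of an assignment: an experiment plus the variation
// (treatment) the user was bucketed into.
interface Assignment {
  experimentId: string;
  variation: string;
}

// Build a deterministic cache key extension from the full set of assignments.
// Sorting makes the value independent of the order assignments are produced in,
// so two users with the same combination always get the same extension.
function buildCacheKeyExtension(assignments: Assignment[]): string {
  const canonical = assignments
    .map((a) => `${a.experimentId}:${a.variation}`)
    .sort()
    .join("|");
  return createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}

// The value is then persisted as a cookie (name is hypothetical) so any cache
// layer downstream can append it to its cache key, e.g.
// response.setHeader("Set-Cookie", `expoCacheKeyExt=${buildCacheKeyExtension(assignments)}; Path=/`);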

The problem with this solution is that the number of possible assignment combinations a user can have depends on the number of experiments running in the system; with just 20 concurrent experiments of two variations each, there are over a million possible combinations. By adding this cookie to a cache system as part of its cache key, the number of key variants increases considerably, causing the cache system to fragment its data store. This drastically reduces how effective the cache is, increasing the time it takes to fetch data from storage and forcing it to evict entries more frequently.

The problem with ESI templates

ESI stands for Edge Side Includes¹. It’s a templating solution that allows dynamic content assembly at the edge of the network. It could be implemented on a Content Delivery Network (CDN), in a Reverse Proxy, or even directly in the browser.

The specification defines many tags that can be used in the language, but for this article, let me show you a simple template. In the example below, you will see that the template has some custom content and includes the content from another HTML page as part of the final result.

ESI template implementation example
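Since the original snippet is embedded as an image, here is a minimal illustrative ESI template along the same lines; the markup structure follows the ESI specification, but the paths and content are assumptions.

<html>
  <body>
    <h1>Welcome to the store</h1>
    <!-- Static content owned by this template -->
    <p>Today's featured deals are below.</p>

    <!-- The ESI processor replaces this tag with the contents of the
         referenced fragment before the response reaches the browser. -->
    <esi:include src="/fragments/featured-deals.html" />
  </body>
</html>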

Since we had the ESI processor implemented in our reverse proxy layer, a rough sequence diagram of how the above template would be processed is shown below. The reverse proxy is in charge of identifying ESI templates as they are served and executing internal requests to compose the final response.

ESI implementation on Reverse Proxy
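Since the diagram is also embedded as an image, here is a rough TypeScript sketch of what the proxy-side assembly amounts to; this is not our actual proxy code, and the fetch-based composition and URLs are illustrative.

// Fetch the template, then resolve each <esi:include> by issuing an internal
// request and splicing the fragment into the final response.
async function assembleEsiResponse(templatePath: string, origin: string): Promise<string> {
  const template = await (await fetch(origin + templatePath)).text();

  const includePattern = /<esi:include\s+src="([^"]+)"\s*\/>/g;
  const includes = [...template.matchAll(includePattern)];

  // One incoming request can fan out into several internal requests,
  // which is why caching templates and hot fragments matters.
  const fragments = await Promise.all(
    includes.map((match) => fetch(origin + match[1]).then((res) => res.text()))
  );

  let result = template;
  includes.forEach((match, i) => {
    result = result.replace(match[0], fragments[i]);
  });
  return result;
}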

Now, as you can see, a single request can trigger the reverse proxy to execute multiple requests, which is not optimal. This can easily cause stress on the reverse proxy or on the application servers that serve templates or fragments shared across multiple templates. The way to minimize this problem is to enable caching on the reverse proxy for template pages and heavily reused fragments.

But what happens if there is an A/B test running that modifies one of the ESI base template pages or one of the fragments? The caching system needs to serve the correct version of each of these pages based on the users’ assignments. Of course, as explained before, they could achieve this by adding our Cache Key Extension Cookie, but that would rapidly fragment their data store and would simply not be effective. In fact, this precise issue is what happened when caching was turned on in production for the first time.

The solution there was to pre-process the Expo Cache Key Extension cookie and optimize its value depending on whether the request is for an ESI template.

A flag was added to experiments to mark the ones that modify those pages, and that information is then used to recalculate the Extension cookie value on every request. The logic is relatively simple. We check if the request is for an ESI template page. If it is, we check the user’s assignments to see whether they include any of the flagged experiments. If not, we rewrite the cookie with a constant value; otherwise, we calculate a new cache key considering only the flagged experiments. This happens only on the request flow, so these optimized cookies are never persisted in the response.
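In rough TypeScript pseudocode, reusing the Assignment type and buildCacheKeyExtension helper from the earlier sketch (names, cookie handling, and structure are illustrative, not the patented implementation):

// Constant used when the user is not in any experiment that touches ESI pages,
// so all such users share one cached copy of the template.
const CONSTANT_EXTENSION = "esi-default";

function optimizeCacheKeyExtension(
  requestPath: string,
  assignments: Assignment[],            // the user's current assignments
  flaggedExperimentIds: Set<string>,    // experiments marked as modifying ESI pages
  isEsiTemplate: (path: string) => boolean
): string | undefined {
  // Only ESI template requests get an optimized value; other requests keep
  // the original cache key extension cookie untouched.
  if (!isEsiTemplate(requestPath)) return undefined;

  const flagged = assignments.filter((a) => flaggedExperimentIds.has(a.experimentId));

  // Not assigned to any flagged experiment: collapse to a single constant value.
  if (flagged.length === 0) return CONSTANT_EXTENSION;

  // Otherwise, compute a key that accounts only for the flagged experiments.
  return buildCacheKeyExtension(flagged);
}

// The rewritten value is applied only to the request sent upstream; the
// optimized cookie is never written back to the response.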

The above is a good solution, and in fact, Walmart has a patent on it, but the implementation can be tricky.

First of all, it only works for caching systems set up on the reverse proxy or later in the request flow; it does not allow any caching to be used on the browser or the CDN layer. The second issue is that all requests have to go through this logic, making the site slower in general. But the tricky part is that it only works when the setup is correctly coordinated between the DevOps team, the application development team, and our team. Everything needs to be set up correctly for this solution to work, and that is often not the case, since teams are always making changes to their applications and modifying DevOps configurations.

When we finally had all systems properly set up, we executed an A/B test to check whether enabling caching would improve the site’s performance, and to our surprise, that was not the case. Caching at our reverse proxy layer did not perform as expected, and we had to turn it off.

Custom Cache Key Extensions

As time went by, some engineering teams had to push forward, and they started implementing other caching solutions even if that meant A/B testing with our platform was not supported. The pressure to find a more robust and general solution kept growing while the issue was still pending.

We needed a solution that would work with any caching technology, one that allows application development teams to choose the caching technology they want to use and lets them leverage CDN-layer caching or even users’ browser caching if needed. On top of all that, we needed something that would perform well enough not to impact the website.

It turns out that the solution wasn’t really hard to achieve: it was a combination of the two approaches presented above. We needed to allow our Assignment Engine to generate custom cache key extensions for specific applications. In the same way that we generate our existing cache key extension at assignment time, we could generate other extensions optimized for specific applications.

The optimization process is similar to what was described before. We added an option allowing users to mark experiments whose treatments affect applications that use these extensions, and we use that information to calculate the extensions right after our experiment assignment process. This way, all cache key extensions are generated at assignment time and persisted in cookies until it’s time to calculate new assignments for a user session, thus mitigating any major performance impact.
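As a rough illustration of the idea, reusing the Assignment type and buildCacheKeyExtension helper from the earlier sketch; the experiment configuration shape and application names are assumptions, not Expo's actual data model.

// Illustrative experiment metadata: which applications a flagged experiment affects.
interface ExperimentConfig {
  experimentId: string;
  affectedApplications: string[]; // e.g. ["item-page", "search"]
}

// At assignment time, generate one extension per application that opted into
// custom cache key extensions, using only the experiments that affect it.
function buildCustomExtensions(
  assignments: Assignment[],
  configs: ExperimentConfig[]
): Record<string, string> {
  const configById = new Map(
    configs.map((c): [string, ExperimentConfig] => [c.experimentId, c])
  );
  const byApp = new Map<string, Assignment[]>();

  // Group the user's assignments by the applications they affect.
  for (const assignment of assignments) {
    const apps = configById.get(assignment.experimentId)?.affectedApplications ?? [];
    for (const app of apps) {
      if (!byApp.has(app)) byApp.set(app, []);
      byApp.get(app)!.push(assignment);
    }
  }

  // One optimized extension per application, persisted as cookies until the
  // next assignment calculation for the session.
  const extensions: Record<string, string> = {};
  for (const [app, appAssignments] of byApp) {
    extensions[app] = buildCacheKeyExtension(appAssignments);
  }
  return extensions;
}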

One of the main advantages of this solution is that the application development team can choose pretty much any caching solution as long as they add our extension to their cache key. For example, we currently have one application using Reverse Proxy caching for backend requests and client-side browser caching for XHR requests.
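For instance, on the client side, a browser cache for XHR/fetch responses can fold the extension into its key. Below is a minimal sketch assuming a hypothetical cookie name and the standard browser Cache API; it is not the actual application code.

// Read a cookie value by name (simplified; no URL-decoding or escaping).
function readCookie(name: string): string {
  const match = document.cookie.match(new RegExp(`(?:^|; )${name}=([^;]*)`));
  return match ? match[1] : "default";
}

async function cachedFetch(url: string): Promise<Response> {
  const extension = readCookie("expoCacheKeyExt_itemPage"); // hypothetical cookie name
  const cache = await caches.open("app-cache");

  // Fold the extension into the cache key so users with different relevant
  // assignments never share a cached entry.
  const separator = url.includes("?") ? "&" : "?";
  const cacheKey = `${url}${separator}expoExt=${extension}`;

  const hit = await cache.match(cacheKey);
  if (hit) return hit;

  const response = await fetch(url);
  await cache.put(cacheKey, response.clone());
  return response;
}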

Of course, it is important to say that we must limit the number of cache keys being generated, but we have full control over this mechanism, and it’s all a matter of aligning the need with the application development engineering teams.

Lessons learned

One important lesson from this project is about the value of having an in-house platform for A/B testing. Being able to customize it as we see fit is crucial to delivering features like this. We had multiple iterations of projects tackling this problem with slightly different requirements, and it was a real team effort to reach the level of understanding of the problem we have now and deliver a robust solution for the company.

Another lesson was to have an honest sit-down with the application engineering team and hear their thoughts on the problem. We had a team struggling to keep their application stable, and we were holding them back from applying a much-needed caching solution. They are in the trenches fighting to keep the site up and running, and it was really important for us to hear their opinions on the matter and find common ground.

All in all, we finally managed to have a robust solution for applications that need to enable caching while still supporting A/B Testing using our platform.

References

[1] ESI Language Specification 1.0, W3C Note, 4 August 2001. https://www.w3.org/TR/esi-lang/
