Experimental Development

Michael DiCicco
Turo Engineering
Apr 17, 2018 · 6 min read

As engineers, we are always being pulled between competing forces. The needs of product pull in one direction, performance and scalability in another, and reliability in yet another. Sometimes these forces align; other times they compete. When we know them ahead of a design, they help inform a solution that best balances them. But how do you design when these forces are unknown prior to development?

This is exactly the case I’ve had to deal with several times recently at Turo. Highlighted here is an experimental approach I’ve developed for these situations, one that allows for developing quickly and at low risk while capturing information about the problem space.

A recent problem

A prime example of one of these scenarios, which I dealt with recently, was a home-page merchandising project. Our product team had been reviewing a lot of our analytics data and realized that we were offering far too static a merchandising experience on our home page. The home page is the first impression of any site, and by offering a generic experience we were leaving impressions on the floor. The product solution was to offer content specific to the person viewing the page. This was to encompass a number of factors: recent bookings or past searches in the case of logged-in users, or location and signature listings for anonymous users.

This data was to be provided by a recommender engine developed by our data science team. Given a number of inputs, it returns a list of vehicles that are optimal to display from an engagement/conversion perspective. This data was already being used for marketing, but had not yet been leveraged live on our site.

From the engineering/operations side, the home page needs to be performant. As the most frequently viewed page, and often the first one a visitor sees, it needs to load quickly and be high quality. Missing or shifting page elements are not acceptable. Additionally, links need to point to good data and display high-quality images (in both resolution and content).

With the constraints identified, I had to find a solution that would meet them all. A traditional approach would have been to write the feature, performance test it for expected loads, verify it against a live data set, and then follow this with a product A/B test to verify the impact. We would then iterate as needed to hit the desired metrics, refreshing the implementation and re-verifying inputs and performance at each phase. This approach can be time-consuming and has the effect of potentially moving the goalposts with every product revision.

To avoid this, we instead chose an experimental approach that allowed all of these iterations to happen at the same time. With a little bit of forethought and up-front preparation, we were able to collect and tune for performance while developing multiple candidates in parallel, then product-test and choose the optimal candidate.

The implementation

The first step in the solution was determining what the give and take would be with our product team. What would the candidate changes be, and what was required as a minimum viable product? In the end it was determined we would have 3 sections that would either display in full or not display at all at render time. This had the downside that a long delay in finding results would delay the page load. We were, however, able to agree that it was better to quickly load less targeted data in the event a better set of data was not available.

We came up with 2–3 candidates for each section, and a set of thresholds for acceptable performance. Now came the challenge of how to test these. We had no idea how much data we would need to feed to our recommendation engine, or how it would handle what we were going to begin to ask of it. We had been able to run basic load testing on it, but we didn’t know what the real-world requests would look like. Would there be many requests over a small number of variables? Would the request rate be fairly consistent or spiky? Would the results be consistent and overlapping, or vary?

To begin, I set about building a test infrastructure that would

  1. capture data about our usage model, and
  2. begin stress testing our recommendation engine.

To achieve this I leveraged scientist4j, a tool we had used previously for targeted high-risk changes.

Scientist allows for executing 2 code paths in parallel and doing a basic result comparison for consistency. I extended this to add additional instrumentation. In the extended setup I was able to instrument before and after the method execution, as well as place markers inside the test methods.

(Two parallel container runners get initiated in a background task, each executing independently of the other. These containers were wrappers that took a Callable and then were passed into the Scientist async executor)
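To give a concrete sense of the shape of this setup, here is a minimal sketch of an instrumented parallel runner. This is not Turo’s actual code (which was built on scientist4j’s async executor); the class and method names here are invented for illustration, and the publish hook stands in for whatever metrics backend is in use.

```java
import java.util.Objects;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical instrumented experiment runner; a sketch, not the real implementation. */
public class InstrumentedExperiment<T> {

    private final String name;
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    public InstrumentedExperiment(String name) {
        this.name = name;
    }

    /** Runs control and candidate in parallel; only the control result is returned. */
    public T run(Callable<T> control, Callable<T> candidate) {
        CompletableFuture<Timed<T>> controlRun =
                CompletableFuture.supplyAsync(() -> time(control), pool);
        CompletableFuture<Timed<T>> candidateRun =
                CompletableFuture.supplyAsync(() -> time(candidate), pool);

        Timed<T> controlResult = controlRun.join();
        // The candidate finishes (or fails) on its own; callers never wait on it.
        candidateRun.whenComplete((candidateResult, err) -> publish(controlResult, candidateResult));
        return controlResult.value;
    }

    private Timed<T> time(Callable<T> task) {
        long start = System.nanoTime();
        try {
            return new Timed<>(task.call(), System.nanoTime() - start);
        } catch (Exception e) {
            return new Timed<>(null, System.nanoTime() - start);
        }
    }

    /** Override to push durations, markers, and mismatches to a real metrics backend. */
    protected void publish(Timed<T> control, Timed<T> candidate) {
        boolean match = candidate != null && Objects.equals(control.value, candidate.value);
        System.out.printf("[%s] control=%dms candidate=%dms match=%b%n",
                name, control.nanos / 1_000_000,
                candidate == null ? -1 : candidate.nanos / 1_000_000, match);
    }

    /** Simple value-plus-elapsed-time holder. */
    public static final class Timed<V> {
        public final V value;
        public final long nanos;
        Timed(V value, long nanos) { this.value = value; this.nanos = nanos; }
    }
}
```

The important property is that the caller only ever waits on the control path; the candidate’s latency and failures are recorded but never surfaced to the user.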

After building out the test framework, the first step was to simply capture the usage profile with no-op methods (below). This both tested that we were in fact having a zero-time effect on the calling pages and collected a high-level profile of call traffic. Once this was verified I was able to create and test the first piece of logic, a static fallback operation. This operation would be used in the event our external service was unavailable or returned no results. It would be a fast direct query with general results. Once this was in place I enshrined it as the baseline “control”.

(graph over time showing the two cases for the different input types. Purple indicating cases that would support “precise” logic and blue being cases that only support “imprecise” logic)
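Continuing the hypothetical wrapper above, the first two phases might look roughly like this. `existingHomepageVehicles` and `staticFallbackQuery` are stand-ins for the real queries, which the post does not show.

```java
import java.util.Collections;
import java.util.List;

public class HomepageMerchandisingExperiment {

    private final InstrumentedExperiment<List<String>> experiment =
            new InstrumentedExperiment<>("homepage-merchandising");

    /** Phase 1: the candidate is a no-op, so we only measure call volume and overhead. */
    public List<String> vehiclesForHomepage() {
        return experiment.run(
                this::existingHomepageVehicles,   // current behavior remains the control
                Collections::emptyList);          // no-op candidate, timing/volume only
    }

    /** Phase 2 (later): the fast static query becomes the baseline "control". */
    public List<String> vehiclesWithStaticBaseline() {
        return experiment.run(
                this::staticFallbackQuery,
                Collections::emptyList);
    }

    // Stand-ins for the real queries; the actual implementations are not shown in the post.
    private List<String> existingHomepageVehicles() { return List.of("featured-1", "featured-2"); }
    private List<String> staticFallbackQuery() { return List.of("general-1", "general-2"); }
}
```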

The second step was to develop our “precise” and “imprecise” methods for calling our external service. I developed each of these methods locally with strong unit test coverage, then dropped them into the experimental container in production. This immediately let me begin collecting performance data on each candidate approach.

(Results for our initial “precise” candidate, showing run time of the 75th, 95th and Max percentiles. This demonstrated the significant performance variance based on input and time of call. It also highlighted SLA type implications to push up-stream to our data science team. The 75th percentile fell within acceptable requirements but the others did not.)
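For reference, recording these percentiles from the publish hook could look something like the sketch below. Dropwizard Metrics is an assumption on my part; the post does not say which metrics library was used, and the metric names are invented.

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

/** One way to record per-candidate latency percentiles (75th/95th/max). */
public class CandidateTimings {

    private final MetricRegistry registry = new MetricRegistry();

    /** Called from the experiment's publish hook with the measured duration. */
    public void record(String candidateName, long durationNanos) {
        registry.timer("homepage.recommendations." + candidateName)
                .update(durationNanos, TimeUnit.NANOSECONDS);
    }

    /** Dumps the same percentiles shown in the graphs above. */
    public void report(String candidateName) {
        Timer timer = registry.timer("homepage.recommendations." + candidateName);
        Snapshot snapshot = timer.getSnapshot();
        System.out.printf("%s p75=%.1fms p95=%.1fms max=%.1fms%n",
                candidateName,
                snapshot.get75thPercentile() / 1_000_000.0,
                snapshot.get95thPercentile() / 1_000_000.0,
                snapshot.getMax() / 1_000_000.0);
    }
}
```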

Now, armed with data about both the calling profile and the performance, we were able to sit down with product and design and evaluate what product we wanted to release. This initial data gave us confidence that we could use the final revisions of all 3 candidates together.

  1. The fallback would always have results and was performant. This meant the sections could always be displayed, even in the event of an outage of the recommender service.
  2. The “imprecise” method would be our first attempt at getting results and was performant enough to display with a loading indicator in place.
  3. The “precise” method would be our second fallback (a rough sketch of this cascade follows the list).
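The cascade could look roughly like the sketch below. The collaborator interfaces, the timeout budget, and the placement of the “precise” tier are all assumptions made for illustration; none of this is the shipped code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative cascade: "imprecise" first, static fallback when it misses or runs long. */
public class HomepageSections {

    private final RecommenderClient recommender;
    private final StaticFallback fallback;

    public HomepageSections(RecommenderClient recommender, StaticFallback fallback) {
        this.recommender = recommender;
        this.fallback = fallback;
    }

    public List<String> sectionVehicles(String visitorId) {
        try {
            List<String> imprecise = CompletableFuture
                    .supplyAsync(() -> recommender.impreciseRecommendations(visitorId))
                    .get(300, TimeUnit.MILLISECONDS);   // illustrative budget, not a real SLA
            if (!imprecise.isEmpty()) {
                return imprecise;
            }
            // A "precise" tier could slot in here the same way before giving up.
        } catch (Exception e) {
            // Recommender slow, empty, or unavailable: fall through to static results.
        }
        return fallback.generalResults();
    }

    /** Hypothetical collaborators; signatures invented for this sketch. */
    public interface RecommenderClient {
        List<String> impreciseRecommendations(String visitorId);
    }

    public interface StaticFallback {
        List<String> generalResults();
    }
}
```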

This solution did raise some new questions, though. In the event our first-pass attempt failed, we would need to run the static logic to backfill results. In practice, some percentage of results would take significantly longer to run, and we needed to know what that rate would be. The good news was that we now had an experimental framework in place that could easily answer these questions with just some small changes.

This final iteration of the testing saw us develop 2 fallback models that we wanted to benchmark, measuring the average and worst case performance. I inserted markers throughout the candidates to mark when we returned in various scenarios and was able to graph how often we missed and needed to run fallback logic. It became obvious very quickly that the behavior of our precise model would not be useful on the home page. When we fell back to it, we almost always had to backfill with some static results as well.

(This follow up metric shows the instances where an execution used each piece of logic. The closeness of the yellow “fallback” logic to the purple “precise” logic shows how often the “precise” logic failed to return viable results. Conversely the large number of times the blue “imprecise” logic was able to be called without needing fallback logic shows that it is performing well)
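The “markers” described above could be as simple as one counter per exit path, incremented wherever a candidate returns. Again, the metrics library and names here are assumptions for illustration.

```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

/** Illustrative exit-path markers; comparing them over time gives the miss/fallback rate. */
public class PathMarkers {

    private final Counter impreciseHit;
    private final Counter preciseHit;
    private final Counter fallbackUsed;

    public PathMarkers(MetricRegistry registry) {
        this.impreciseHit = registry.counter("homepage.path.imprecise");
        this.preciseHit = registry.counter("homepage.path.precise");
        this.fallbackUsed = registry.counter("homepage.path.fallback");
    }

    public void markImprecise() { impreciseHit.inc(); }
    public void markPrecise()   { preciseHit.inc(); }
    public void markFallback()  { fallbackUsed.inc(); }
}
```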

This allowed us to finalize our design for the home page, using just the imprecise and fallback candidates. It also gave us good information for iterating toward a precise product offering that was acceptable for below-the-fold usages off the home page — in this case our vehicle page.

Engineering is oftentimes about the freedom to experiment and try things that might not work. Having a means of doing this in production that is safe (for users and the business), with buy-in from product and design, can be a very powerful tool in the toolbox. In a retrospective after shipping, everyone agreed that the product would have been impossible to ship in its final form had we not had the freedom to experiment and collect data in a production setting. This approach has also aligned well with our general data-driven design philosophy and has been useful in several other applications since.

