(re) Building Trust by Doing Less

David Wilson
Published in strava-engineering
Dec 4, 2019
Photo by Milivoj Kuhar on Unsplash

As engineers, we’re often concerned with doing more. How can we ship more features, handle more scale with our services, or write more code? While there is certainly benefit in learning how to do more, there are also times where intentional restraint can pay off handsomely.

I was reminded of this by a recent service refactoring project here at Strava where we focused on doing less, and achieved our goals in a much more efficient fashion than we would have otherwise. To share this story, we’ll first look at what we were building and why. After a brief detour to talk about the concept of opportunity cost, we’ll look at several things we intentionally didn’t do while rebuilding the service, and how this helped massively reduce opportunity costs. Finally, we’ll take a step back and see how exercising restraint and doing less can be a useful concept in general.

Xel’naga: What and Why?

Athletes all over the world trust Strava with their data. In return, we strive to provide them with a safe, secure environment to share their athletic achievements and receive motivation from the community. One issue that detracts from this safe environment is the presence of bad actors such as spammers on the platform.

A bit more than a year ago, a simple service named Xel’naga (apparently named after the Xel’naga of Starcraft lore, though the meaning of the name in this context has been lost to personnel turnover) was created to solve this problem. This service presented a simple interface to Strava systems: given an athlete, return that athlete’s “trust score” to the client. The client could take action based on this value. For example, users with a trust score corresponding to “banned” would be unable to take most actions on Strava. Alternatively, users with more trustworthy scores would be able to participate on Strava unhindered.
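
As a rough sketch, that interface can be pictured like this (the score names and method shape here are illustrative, not the service’s actual definitions):

```scala
// Hypothetical sketch of the original interface; names are illustrative.
object TrustScoreSketch {
  sealed trait TrustScore
  case object Banned     extends TrustScore
  case object Suspicious extends TrustScore
  case object Trusted    extends TrustScore

  // Given an athlete, return that athlete's trust score.
  trait TrustService {
    def trustScore(athleteId: Long): TrustScore
  }

  // A client can gate behavior on the returned value.
  def canPostContent(service: TrustService, athleteId: Long): Boolean =
    service.trustScore(athleteId) match {
      case Banned => false
      case _      => true
    }
}
```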

These trust scores were computed as a weighted sum of individual trust types. For example, a manual ban was one such trust type. Certain types of suspicious actions would be another type. Each trust type had a weight associated with it, and the weighted sum of these trusts corresponded to different discrete trust scores.
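
A toy version of that scoring model might look like the following; the trust types, weights, and thresholds are invented for illustration, not the service’s real values:

```scala
// Illustrative only: all trust types, weights, and thresholds are invented.
object WeightedScoreSketch {
  val weights: Map[String, Double] = Map(
    "manual_ban"        -> -100.0,
    "suspicious_action" -> -5.0,
    "verified_profile"  -> 1.0
  )

  // Weighted sum over the athlete's trust-type values.
  def rawScore(trusts: Map[String, Double]): Double =
    trusts.map { case (trustType, value) =>
      weights.getOrElse(trustType, 0.0) * value
    }.sum

  // The continuous sum is bucketed into discrete trust scores.
  def discreteScore(trusts: Map[String, Double]): String = {
    val s = rawScore(trusts)
    if (s <= -100.0) "banned"
    else if (s < 0.0) "suspicious"
    else "trusted"
  }
}
```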

This service served reliably in production for several years. However, an increase in spammer activity on Strava prompted the Trust team to re-prioritize proactive anti-spam measures. Comparing the service as it existed against our needs, it was clear that there was more work to be done. It was time for some remodeling!

Detour: What is opportunity cost?

Photo by Fabian Blank on Unsplash

Before tackling the nuts and bolts of the service refactoring, I wanted to take a brief sideroad and review the concept of opportunity cost. In economics, opportunity cost is the cost incurred by choosing one option over the most valuable alternative option.

This is typically expressed in terms of money: by leaving your money inside a jar in your kitchen versus investing it in the stock market, the opportunity cost incurred is equivalent to the expected rate of return of the stock market. You lost this return by not choosing a more profitable use for your money.
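
To put rough numbers on it: if $1,000 sits in the jar for a year while a market index returns, say, 5%, the opportunity cost of the jar is about $1,000 × 0.05 = $50 in forgone gains.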

This concept is also useful for other types of resources besides money. In this case, the resource in question is the time and attention of a product development team. This resource can only be focused on a finite number of tasks at a given time, so a tradeoff is always being made between the utility gained by whatever the team is currently working on and the utility that could be gained by some alternative task. In this case, we can define utility as delivering valuable features to Strava athletes.

One of the general principles of Agile methodologies is to use short iterations to reduce the opportunity cost of getting bogged down in an unproductive quagmire. Each iteration should ideally deliver some value to relevant stakeholders. By doing this instead of very long, “big bang” release cycles, it becomes much easier to assess the current allocation of resources against alternatives and make quick adjustments if needed.

Now that we understand what opportunity cost is, we can look at the different ways we reduced our opportunity cost with the Xel’naga project by intentionally doing less. Here are a few things we chose not to do.

Not rewriting the old service

Moving forward, we wanted a more flexible system than the one described above. Rather than representing explicit “trust” quantities and weighing their relative importance, we wanted to store simple facts about an athlete (called “attributes”) in the system. These attributes would have no inherent “trust” value associated with them. Then, given these attributes, we wanted to have some way of mapping them to a trust value.

By decoupling attributes from any specific understanding of an attribute’s impact on trust, we could change the “mapping function” in the future and improve our results. We could start with rules, move to machine learning, or adopt any other approach that fits, without having to migrate or convert the data stored per athlete. This also allowed us to store broader types of data: timestamps, strings, and numeric and boolean values, rather than the old service’s catch-all numeric “trust” value. These richer attributes allow more intelligent decision making.

Changes made to the storage and scoring model of the service
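
To make the new model concrete, here is a hedged sketch; the attribute names, value types, and mapper interface are hypothetical examples, not the service’s real schema:

```scala
import java.time.Instant

// Hypothetical shapes for the new model; all names are examples.
object AttributeModelSketch {
  sealed trait AttributeValue
  case class BoolValue(v: Boolean)      extends AttributeValue
  case class NumericValue(v: Double)    extends AttributeValue
  case class StringValue(v: String)     extends AttributeValue
  case class TimestampValue(v: Instant) extends AttributeValue

  // Plain facts about an athlete, with no inherent trust semantics.
  type AthleteAttributes = Map[String, AttributeValue]

  // The pluggable "mapping function": a rules engine today, perhaps an
  // ML model later, with no change to the stored attributes.
  trait TrustMapper {
    def score(attrs: AthleteAttributes): String
  }
}
```

The key design point is that the mapper is the only place trust semantics live; everything below it is just data.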

Given some of the changes required to make this happen, there was some temptation to “start clean” and create a new service from the ground up with this in mind. However, considering that the existing service was already live and serving some limited trust information to other production services, we quickly focused on simply migrating the existing service to use the new paradigm one step at a time. By doing this, we could quickly start gathering data in production and prove out the new approach. Not rewriting got us here much faster.

Not doing machine learning

This was a hard one for me personally, since machine learning is one of the ultimate shiny objects in today’s tech landscape. Machine learning is a tool that has proven itself time and time again in these sorts of spam-fighting applications. However, for the initial remodeling of this service, we chose a heuristic, rules-engine-based approach instead. Why was this?

As with the other decisions we made with this project, this was all about reducing opportunity cost and seeing value demonstrated quickly. Consider the different efforts required.

Machine learning:

  • Create a large set of labeled features (many hours of analyst or developer work)
  • Train and evaluate multiple models against this labeled data (much human and machine time)
  • Port selected classifier into production service
  • Periodically retrain the classifier on new data if it doesn’t support “online learning”

Rules based:

  • Identify common features based on knowledge acquired from observing spam accounts
  • Build simple rules implementation based on these
  • Provide easy visibility into rules decisions to allow for quick iteration/adjustment as needed
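
To make the rules-based path concrete, here is a minimal sketch of such an engine; the rule names, attribute keys, and thresholds are all invented, since the real rules are internal:

```scala
// Minimal rules-engine sketch; every rule and threshold is invented.
object RulesSketch {
  case class Rule(name: String, fires: Map[String, String] => Boolean)

  val rules: Seq[Rule] = Seq(
    Rule("unverified_email",
      attrs => attrs.get("email_verified").contains("false")),
    Rule("burst_activity",
      attrs => attrs.get("actions_first_hour")
        .exists(s => s.toDoubleOption.exists(_ > 50)))
  )

  // Report every rule that fired, giving the quick visibility into
  // decisions called out in the list above.
  def firedRules(attrs: Map[String, String]): Seq[String] =
    rules.collect { case r if r.fires(attrs) => r.name }

  def score(attrs: Map[String, String]): String =
    if (firedRules(attrs).nonEmpty) "suspicious" else "trusted"
}
```

Because every decision reduces to a list of fired rule names, inspecting and adjusting the rules is far simpler than debugging a model’s weights.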

Now, I’m not going to argue that the predictions generated by the rules model are as good as they could have been with a high-quality ML model, but I am certain that we were able to ship it faster than an ML solution. By choosing to do less, we could prove out the role of the Xel’naga service, and decide if investing in an ML solution made sense down the road.

Not bringing in the kitchen sink

One temptation we faced when scoping out this redesign was to bring in as many attributes as we could. Since the design allows for easy addition of arbitrary data, the temptation to pile on as much as possible was fairly high.

However, we focused on *not* doing this, bringing in only the bare minimum needed to fill out our initial rules implementation. While doing more wouldn’t have added much complexity, it would have taken time that was better spent shipping the initial experience.

Another area where we didn’t bring in more was the choice of data store. After some debate, we chose a simple RDS Aurora instance. The alternative was Cassandra, which would have offered better performance for some bulk write/backfill operations on the more speculative part of the roadmap. However, holding true to the YAGNI principle (and making sure we had a plausible escape route if we had chosen poorly), we went ahead with the quick-to-provision, affordable, easily scalable Aurora solution.

Not spending time on boilerplate

At Strava, most backend logic lives inside our Rails monolith or within Scala/Thrift/Finagle microservices. Xel’naga is an example of the latter. Our awesome platform team (along with contributors from product teams) has created some excellent tooling for building out such services. Thanks to this tooling, we did not have to custom-build:

  • Basic metrics around the service endpoints
  • GDPR “right of erasure” compliance (handled by a simple Erasure integration)
  • Kafka message consumer logic
  • Standard format for event messages

The last two points were particularly useful. By implementing a basic Kafka consumer, we could convert existing system events associated with user activity and write them as attributes in Xel’naga with very minimal code and zero integration work with other services. This was a big win for being able to expand Xel’naga’s knowledge of athlete attributes without forcing other systems to be aware that Xel’naga even exists.
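
Strava’s internal tooling handles this plumbing, but a bare-bones equivalent using the stock Kafka Java client might look like the following; the topic name, payload handling, and storeAttribute stand-in are all hypothetical:

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object AttributeConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "xelnaga-attributes") // hypothetical group id
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("athlete-events")) // hypothetical topic

    // Convert each event into an attribute write; the payload format and
    // storeAttribute call are stand-ins for the internal tooling.
    while (true) {
      val records = consumer.poll(Duration.ofSeconds(1))
      for (record <- records.asScala) {
        storeAttribute(record.key(), "last_event", record.value())
      }
    }
  }

  def storeAttribute(athleteId: String, name: String, value: String): Unit =
    println(s"$athleteId: $name = $value") // placeholder for the real write
}
```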

Our team knows how to deploy, monitor, and instrument these services. If service ownership moves to a different team, the new team will find the ergonomics of the service familiar. The service will also benefit from bug fixes and performance improvements to its shared components. Strava is the second organization I’ve worked for that uses shared, standardized service tooling, and the benefits always shine through when it comes to shipping product. It may feel less exciting than trying to match each problem to the optimal new technology, but the reduced time spent on boilerplate always pays off in the end.

The Payoff

How did doing less pay off for this particular project? After getting everything planned and tightly scoped, we began the actual coding. Within a few days, we had migrated the live system to the new data store (which was quickly provisioned, being an RDS instance). By the next week, we were writing new attributes from the production system. By the end of the next sprint, we had the rules engine live and applying trust-based feedback to aspects of the product experience.

This system used a simple rules engine, based on a small set of data, and applied its results to a small section of the product. However, it worked in production end to end, serving a trust score on every logged-in request, proving the basic approach and allowing us to iterate based on real data. With the live system, we could answer questions about whether we needed to ingest more attributes, improve the rules system, or go all out and train a sophisticated machine learning model. Had we invested in any of these prior to shipping the minimal system, we might very well have gotten the wrong answer and incurred massive opportunity costs in doing so. By learning from a live system, we’re much less likely to make such a mistake.

In the end, this tight focus allowed us to drastically reduce our opportunity costs. If we had instead spent months building the “kitchen sink” system, that would have been months spent not implementing account security best practices, improving the login experience, and many other changes from the Trust team that benefit our athletes. The opportunity cost would have been large, particularly if the end result proved to be only marginally more effective than our lightweight approach.

It’s hard to say no. It’s hard to look past shiny new frameworks, tools, and processes. It’s hard to pick a simple approach that delivers the shortest path to providing your users with a better experience. It’s hard to do less. But if you can maintain this discipline, focusing clearly on the value you’re hoping to deliver, you will ship more features, build trust with your product organization, learn more about your users, and spend more time working on the right things.


David Wilson is a software engineer living in Denver, CO. In addition to doing Colorado-y things like skiing and climbing, he writes at http://davidwilson.me