Delivery@Uncountable

Surviving the Enterprise “Go-Live”

How Uncountable’s engineering team tackles unexpected scaling needs

Will Goldie
Uncountable Engineering

--

Uncountable’s platform and engineers adapt to exponentially larger data sets. // Illustration by Cathy Yang

#Uncountable is hiring! To help us accelerate the future of materials development, check out our Careers page.

At Uncountable, our engineering team strives to follow an aggressive daily strategy of continuous integration and continuous delivery. We believe that deploying new code every single day is the best way to attain our values of Execution, Efficiency, and Empiricism. It forces us to focus on shipping features and evaluating whether they help solve the problems our users encounter as they develop cutting-edge new materials and products. In this article, I’ll walk through one aspect of our engineering process: how we’ve learned to scale our software platform to handle more users, data, and activity.

To paraphrase Tolstoy: “Performant software is all alike; all slow software is sluggish in its own way.”

Scaling is different for every project and company. Consider a self-service personal finance app where individual users can sign up, select a monthly subscription, and receive access to the system — all of their own volition. This app might see relatively smooth user growth over time, as users are driven to the product by different marketing channels.

In contrast, in an enterprise product like Uncountable, people often sign on to the system for the first time in blocks of many users, for example during the onboarding of a new corporate client. Often, these users bring large volumes of historical data with them. The result is that we often see sudden, discontinuous jumps in usage and data volume. Repeatedly, a new customer deployment has surpassed our previous record for data volume or system load by an order of magnitude. For instance, we’ve seen a database table that had always stayed under a million rows balloon to 10 million in a single deployment. Another time, a user group with a high-resolution mechanical testing device loaded in curve data with tens of thousands of points per experiment, which slowed some of our frontend visualizations to a crawl until we downsampled.
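
To give a flavor of what that downsampling means in practice, here’s a minimal sketch (the function name and point budget are illustrative, not our production code): a curve with tens of thousands of points is cut to a few thousand before it ever reaches the chart.

```python
import numpy as np

def downsample_curve(points: np.ndarray, max_points: int = 2000) -> np.ndarray:
    """Reduce an (n, 2) array of (x, y) curve points to at most max_points
    by taking evenly spaced samples. A naive stride-based sketch; a real
    pipeline might use something smarter like largest-triangle-three-buckets."""
    n = len(points)
    if n <= max_points:
        return points
    # Evenly spaced indices, always keeping the first and last point
    idx = np.linspace(0, n - 1, max_points).round().astype(int)
    return points[idx]

# A 50,000-point curve shrinks to 2,000 points
curve = np.column_stack([np.linspace(0, 1, 50_000), np.random.rand(50_000)])
print(downsample_curve(curve).shape)  # (2000, 2)
```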

Getting caught out by a 10x jump in system load is never fun, especially when you’re trying to make a good first impression on new users. We do our best to anticipate these jumps in usage and prepare for them, but we can’t always do so perfectly. For instance, a new user base might bring in a familiar amount of experimental data, but then end up relying heavily on a particular analytical charting feature that was only used modestly in the past. That feature could be strained to its breaking point within days. Other users might bring in data at a higher sampling rate or resolution than we’ve seen previously, or build calculations to model the chemical systems they’re designing that are more complex than anything before. In reality, there are so many dimensions and nuances to the R&D processes our users work in that it’s always possible for someone to strain some part of the system.

In summary: beyond an increased quantity of usage, the type of usage itself can shift, which is difficult to plan around. So as well as anticipating problems, we need to be agile enough to respond to performance regressions quickly and deploy fixes within a few days.

Illustration by Will Goldie

Here are some of the strategies we’ve developed for tackling this discontinuous enterprise scaling problem:

1. Plan at least one order of magnitude ahead

It’s easy to fall into the trap of using your largest current dataset or usage level as a benchmark for the largest dataset you’re likely to encounter in the future. In the past, our development and test datasets would generate a system load similar to that benchmark, and we were largely satisfied. In reality, when a larger dataset inevitably came along, we were often unprepared for the “jump”, and the test datasets subsequently needed to be expanded.

We’ve fallen prey to this mistake too many times to trust that benchmark again, and have shifted to using a 10x multiplier on the largest past dataset to test new performance-sensitive features or optimizations to existing features. We arrived at this multiplier by looking at the sequence of “jumps” in system load we’ve seen in the past, and choosing 10x as a reasonable scaling factor to plan around. However, it’s just a heuristic, and could change in the future.
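
In practice, that can be as simple as inflating an existing fixture before a load test. Here’s a minimal sketch of the idea; the data shape and function name are illustrative, not our actual test harness:

```python
import random

def scale_test_experiments(experiments: list[dict], factor: int = 10) -> list[dict]:
    """Clone a baseline test dataset `factor` times with jittered values,
    so performance tests run against roughly 10x the largest load seen so far."""
    scaled = []
    for i in range(factor):
        for exp in experiments:
            clone = dict(exp)
            clone["id"] = f"{exp['id']}-copy{i}"
            clone["measurement"] = exp["measurement"] * random.uniform(0.9, 1.1)
            scaled.append(clone)
    return scaled

baseline = [{"id": "exp-1", "measurement": 42.0}, {"id": "exp-2", "measurement": 17.5}]
print(len(scale_test_experiments(baseline)))  # 20 rows generated from 2
```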

2. Everyone needs to know how to solve scaling problems

When performance regressions are unpredictable and sometimes critically important to the product experience, it’s not a good idea to have just one or two engineers on the team specialized in performance optimization. They could quickly become a bottleneck for all performance problems, and a cluster of performance problems in a week might force them to drop other feature work and focus all their attention on optimization. This makes it difficult to deliver both reliable performance and new features to our users.

At Uncountable we believe engineers should develop deep ownership over the features they work on. Given this belief, our strategy for avoiding that bottleneck is: everyone on our team should become an expert at optimizing performance of the features they own. This way, performance work can become part of the natural maintenance loop of the feature, and is less likely to cause cascading failures across the workflow of multiple engineers. If your feature involves a lot of SQL queries, and those queries are slow enough to degrade customer experience or create undue server load, then it’s your responsibility to learn to optimize those queries. However, you’re not on your own; if someone else on the team is already a great SQL performance tuner, it’s going to be their responsibility to help you get there.
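
For SQL-heavy features, that learning often starts with reading query plans. Here is a minimal sketch of the workflow, assuming a Postgres database accessed through psycopg2; the table, columns, and connection string are hypothetical:

```python
import psycopg2

# Hypothetical schema; the point is the workflow, not the table names.
SLOW_QUERY = """
    SELECT experiment_id, AVG(value)
    FROM measurements
    WHERE project_id = %s
    GROUP BY experiment_id
"""

def explain(conn, project_id: int) -> None:
    """Print the query plan; a missing index on project_id would show up
    here as a sequential scan over the whole table."""
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + SLOW_QUERY, (project_id,))
        for (line,) in cur.fetchall():
            print(line)

conn = psycopg2.connect("dbname=uncountable_dev")  # placeholder DSN
explain(conn, 123)
# A likely fix: CREATE INDEX ON measurements (project_id);
```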

3. Target metrics, not aesthetics

There are a few software systems out there that have been optimized to within an inch of their lives, and it’s genuinely difficult to find code that could be improved from a performance perspective. Most systems are not like that. If your team has been doing a good job shipping features fast and avoiding premature optimization, you probably have lots of low-hanging fruit around your codebase that could be optimized easily. In fact, if your codebase is large enough, there might be so much low-hanging fruit that you’ll never get to all of it.

As an engineer, I find this state of affairs deeply frustrating on an aesthetic and spiritual level. I do lie awake at night and think about how many cycles a CPU is wasting when it runs my unoptimized code. But after a while, I go to sleep.

As a pragmatist, I know that optimizing all the low-hanging fruit isn’t the best way to spend my time. In fact, it’s not even the best way to spend the portion of my time set aside for performance optimization. It’s more important to spend time on optimizing the features that actually impact customer experience, either by presenting poor performance directly to the user or by overloading the overall system. These prioritization decisions about which features or systems to optimize need to be driven by empirical metrics, like API call performance or qualitative feedback from users about how responsive a frontend is. Of course, if a feature or system is slow or can’t handle a new scale of input data, it can make sense to tackle the low-hanging fruit within that subset of the codebase before the really difficult performance fixes. The low-hanging fruit might be enough. But it won’t always be enough, and sometimes, the improvement it provides will be marginal. It might be the case that no matter how many database queries you optimize, to get to acceptable performance, you really need to do that re-architecture you’ve been avoiding!
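
What do those empirical metrics look like in code? One lightweight option is to time every API handler and rank endpoints by measured latency rather than gut feeling. This is a generic sketch, not our instrumentation stack; the decorator and handler names are made up:

```python
import functools
import logging
import time

logger = logging.getLogger("api.timing")
logging.basicConfig(level=logging.INFO)

def timed_endpoint(func):
    """Log wall-clock latency per handler so slow endpoints can be ranked
    by real impact instead of intuition."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.1f ms", func.__name__, elapsed_ms)
    return wrapper

@timed_endpoint
def list_experiments(project_id: int) -> list[str]:
    time.sleep(0.05)  # stand-in for real work
    return [f"exp-{project_id}-{i}" for i in range(3)]

list_experiments(7)
```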

So, prioritization decisions about how to optimize a given feature or system also need to be driven by empirical metrics. This usually translates to performance profiling the code in question.
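
If the code in question is Python, one concrete way to do this is the built-in cProfile module. A minimal sketch (build_report is a stand-in, not a real feature):

```python
import cProfile
import pstats

def build_report(n: int = 200_000) -> float:
    # Stand-in for a feature that got slow at a new data scale
    return sum(i ** 0.5 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
build_report()
profiler.disable()

# Show the ten most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```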

Will Goldie is a fullstack engineer and engineering manager at Uncountable. Previously, he studied Computer Science, Statistics, and Philosophy at the University of Toronto. He’s also worked on various software development projects as a consultant and built features across Uncountable’s platform. Today, he focuses on accelerating the team’s delivery of experimental data tracking and machine learning features. Outside of work, you’re likely to find him reading or working on DIY projects.

#Uncountable is hiring! To help us accelerate the future of materials development, check out our Careers page.
