How to Add New Features to Your App in Production and Not Ruin Anything
In this article, I’m going to show how to efficiently upgrade part of a monolithic system’s backend, with examples from my personal experience. First, we’re going to talk about different types of software development and the importance of upgrades in the life cycle of a competitive system. Then I’m going to describe ways to minimize risks: efficient architecture, togglers for gradual deployment of a feature into production, and the developer’s good rapport with co-workers.
Types of software development
I like to view time as an element of a universal differential equation describing the development of various systems. With time, all systems either improve or cease to exist — much like in the process of evolution. For the purposes of this article, by “development” I mean the upgrades done to the backend of server apps.
There are various approaches to software development (see Steve McConnell’s “Code Complete” for details). A system can evolve through gradual improvement and the addition of new features, or through a complete rebuild aimed at creating a “better” system. There’s also the hybrid approach, which combines the first two.
The complete rebuild approach has some serious downsides: not only does it take a large amount of time, but the new version also perpetually lags behind the master version, which keeps gaining new features in the meantime. The process can end up several times longer than initially estimated, which carries a real risk of outright failure.
The gradual improvement and refactoring approach is much more common, although it’s only applicable within certain limits. For instance, no amount of refactoring can adapt a system to high load if it was initially designed for a lower one.
With the hybrid approach, some parts of the system are rebuilt from scratch, while others are gradually improved. Think of it as a constant upgrade process in which parts of the system are swapped for new and improved components.
The plateau effect
It is common to think of development as a gradual process; in practice, however, it usually happens in relatively short bursts. The pause between two consecutive bursts is called a “plateau”. A plateau period is characterized by little or no growth; after it, a leap is made to a new quality level, followed by another plateau.
When talking about software life cycles, the plateau effect can be explained by several factors. One of them is that every system has a number of limitations, such as the load range it was designed for. It is never a good idea to rebuild a system without clear necessity, particularly if the company’s profits depend on it. Thus, substantial upgrades are usually performed only when the system’s existing limitations can no longer satisfy the current requirements: for instance, when the load approaches the maximum the system can handle. This often implies considerable time constraints for the upgrade process.
The bursts between the plateau periods are commonly called “upgrades”. I’m going to use this term to cover all the actions aimed at quickly adding features and improving performance. At IT conferences, speakers tend to use the word “upgrades” to mark the key points in their systems’ development timelines. Upgrading a system is almost always a complex task: it often means adopting new technologies or rebuilding a large part of the system’s code. No matter what type of development you prefer, one thing remains constant: you absolutely need to upgrade your system if you want it to stay competitive.
Upgrading a system in practice: how to minimize risks if something goes wrong
From the developer’s point of view, almost any task can be divided into three stages: lengthy development, testing, and deployment. In actuality, the developer’s initial plans are often scrapped: the main problem is that something can always go wrong, particularly in a complex process such as upgrading the whole system.
Problems can arise not only during development (such as falling behind the initial plan) but after release as well. Something might fail either within the system itself or in any of its connected external components. While external systems are beyond our control, we can certainly take action to minimize the risks within our own.
Building feedback mechanisms
When developing a new system it is crucial to constantly ask yourself a simple question:
“What am I going to do if something goes wrong?”
This will help you identify the critical points and possible ways of recovery, should anything not go as planned.
Commonly, any shortcomings that surface after deployment are identified almost instantly. Everyone knows how important metrics and logs are. When skilled developers work on a good product, they use metrics for practically everything, so that no failure can slip under the radar. However, each system is different. In some cases, there are so many parameter combinations that telling the symptoms of a problem apart from the signs of normal operation becomes very tricky.
Imagine you are a developer and your system has a problem. What are your actions?
How are you going to identify the problem?
What are you going to do if something fails two days after release?
Should you monitor your system after rollout, or should you immediately plunge into a different project?
If not you, who is going to monitor the system after release, and for how long? Is it possible to find at least one person with enough free time to do that?
Finally, if something goes wrong (or you at least suspect a problem), who is going to help you? Are you going to write database queries and search through the logs yourself to prove your suspicions?
If you have ever found yourself in this unfortunate situation, you know what it feels like. To avoid getting into it ever again, think through the negative scenarios beforehand and work out the solutions during the planning stage, while you still have plenty of time left. This way, you will know what metrics to analyze and what logs to look through when you actually find yourself pressed for time.
Before getting to the development process, you should create a dashboard: a plan highlighting the figures, metrics, and logs that will give you feedback on the changes you make. If possible, start collecting data for the new metrics on the old system; this way, you will see the difference when the new one arrives. This is somewhat similar to TDD, where you start by developing the tests for your future system. The practice of identifying the crucial metrics and logs before developing the system will save you a lot of trouble in the future.
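As a minimal sketch of what “metrics first” can look like, here is a Prometheus-style example in Python. The metric names, labels, and port are purely illustrative, not a description of any particular stack:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics, defined before the feature itself is written,
# so the dashboard can be built (and baselined on the old system) first.
REQUESTS = Counter(
    "new_flow_requests_total",
    "Requests handled by the reworked flow",
    ["outcome"],  # "ok" or "error"
)
LATENCY = Histogram(
    "new_flow_request_seconds",
    "Time spent handling one request in the reworked flow",
)

def handle_request() -> None:
    with LATENCY.time():
        try:
            ...  # actual business logic comes later
            REQUESTS.labels(outcome="ok").inc()
        except Exception:
            REQUESTS.labels(outcome="error").inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the dashboard to scrape
```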
You should know and love your product; without that, creating an efficient dashboard is impossible. I believe every developer needs to know the key figures and understand the core metrics of the supported product. The developer also has to have a clear image of the main clients using the product, as well as their patterns of behavior. For instance, the clients may have striking differences in traffic distribution during the day; in this case, the same figures might indicate a problem for one client, yet be perfectly normal for another. Very often, it takes a human specialist to tell such subtle nuances apart.
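To make this concrete, here is a toy example of encoding per-client “normal” ranges, so that the same figure triggers an alert for one client and passes for another. All names and numbers below are invented:

```python
# Invented per-client baselines: the expected requests-per-minute range
# by time of day. A quiet night is normal for client A, alarming for B.
BASELINES = {
    "client_a": {"night": (0, 50), "day": (200, 1000)},
    "client_b": {"night": (300, 900), "day": (300, 900)},
}

def looks_abnormal(client: str, period: str, rpm: float) -> bool:
    low, high = BASELINES[client][period]
    return not (low <= rpm <= high)

assert looks_abnormal("client_b", "night", rpm=20)      # alarming for B
assert not looks_abnormal("client_a", "night", rpm=20)  # normal for A
```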
Minimizing losses by using togglers and gradually replacing old features
One more way to minimize risks when releasing a big upgrade is using togglers for gradual deployment of a feature into production.
One of the most common mistakes to make when upgrading the system is to perform an abrupt switch from the old version to the new one. Snap your fingers, and the system immediately goes from state A to state B as soon as the upgrade is deployed. But more often than not, snapping your fingers just isn’t enough.
In some cases, canary deployment (releasing the new feature to only a fraction of the total traffic) can help with regression testing of the system, but it doesn’t always work efficiently. Changes are rarely kept in a split test for more than a day, while the full testing process can depend on external cycles that last for days or even weeks.
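For reference, canary routing is often implemented with deterministic bucketing, so that a given user consistently lands on the same side of the split. A sketch; the hashing scheme is just one common option, not a description of any particular setup:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    # Derive a stable bucket in 0-99 from the user id, so the same user
    # sees the same version on every request.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

# Route 5% of users to the new version; the rest stay on the old one.
version = "new" if in_canary("user-42", 5) else "old"
```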
Speaking from my own experience, the most efficient way to roll out complex features is to gradually switch traffic via togglers. Unlike split servers (to which traffic is redirected for short-term testing of a new feature under real-life conditions), togglers let you deploy a feature in such a way that the system can work long-term in two modes: with and without the new functionality. Togglers also allow switching between the two modes at any moment. For example, a client may take weeks to approve the changes and then expect them enabled instantly. Togglers can be implemented in various ways, depending on your infrastructure and the task at hand: via a database, environment variables, or a gradual migration of nginx routes to the new API version.
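As an illustration of the environment-variable flavor, here is a minimal sketch; the variable naming convention and function names are assumptions made for the example:

```python
import os

def feature_enabled(feature: str, project_id: str) -> bool:
    """Read a toggler from an environment variable.

    The variable holds "*" (on everywhere), "" (off everywhere),
    or a comma-separated allowlist of project ids.
    """
    value = os.environ.get(f"TOGGLE_{feature.upper()}", "")
    if value == "*":
        return True
    return project_id in {p.strip() for p in value.split(",") if p.strip()}

# The system keeps working in both modes; flipping the variable
# switches between them without touching the code.
if feature_enabled("precise_renewal", "project-7"):
    ...  # new code path
else:
    ...  # old code path
```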
One of the first tasks we used togglers for was reworking the subscription cancellation system. Initially, subscriptions were prolonged several times a day; this was acceptable while the number of subscriptions was fairly small. As that number grew, the cancellation system needed to be reworked so that subscriptions could be prolonged with down-to-the-second accuracy.
At the testing stage, it became apparent that we couldn’t simply roll out this feature to all our partners at once; we had to coordinate with each partner individually. So we used a project-based toggler, dividing all live projects into three “waves”. When the new feature went live, it was only active for the test project. Then we switched it on for the small projects in the first wave. By the time we had to switch the main projects, we had already built the infrastructure, worked out the analysis, and understood all the nuances of the new system’s behavior.
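A sketch of how such a project-based toggler with waves might look. The project names and wave assignments are invented; the real mapping could just as well live in a database or a config file:

```python
# Invented project names and wave assignments.
PROJECT_WAVE = {
    "test-project": "test",
    "small-a": "wave1",
    "small-b": "wave1",
    "mid-c": "wave2",
    "main-d": "wave3",
}

# Widened one step at a time as confidence grows:
# {"test"} -> {"test", "wave1"} -> ... -> all waves.
ACTIVE_WAVES = {"test", "wave1"}

def new_cancellation_enabled(project_id: str) -> bool:
    return PROJECT_WAVE.get(project_id) in ACTIVE_WAVES

assert new_cancellation_enabled("small-a")     # first wave: on
assert not new_cancellation_enabled("main-d")  # main projects: still off
```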
Every solution comes at a cost: along with the extra flexibility, you are going to face some complications. You shouldn’t let these downsides hinder your progress, though:
- Migrating from the old system to the new one might take a long time. In the meantime, you essentially have to support both systems while keeping development going. The logic, the tests, the logs, the configuration: pretty much everything is doubled. It’s a good idea to analyze how logs and metrics are collected in the old system and think of a convenient way to represent them in the new dashboard. You might also want to solve some of these tasks at the preparation stage.
- Resolving logic conflicts between the old and the new version is another thing to consider. In most cases, you won’t be able to roll out new features without affecting the existing ones. You should test the system for degradation and apply the necessary fixes to both versions (see the shadow-run sketch after this list).
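One way to test for degradation while both versions coexist is a shadow run: execute the old, authoritative path, run the new one as a side-effect-free dry run, and log any divergence. This is similar in spirit to GitHub’s Scientist library; everything below is a sketch, not our exact implementation:

```python
import logging

log = logging.getLogger("migration")

def run_with_shadow(old_fn, new_fn, *args):
    """Keep the old path authoritative; run the new path as a shadow.

    new_fn must be free of side effects here (a dry run), otherwise
    the two versions would interfere with each other.
    """
    old_result = old_fn(*args)
    try:
        new_result = new_fn(*args)
        if new_result != old_result:
            log.warning("divergence on %r: old=%r, new=%r",
                        args, old_result, new_result)
    except Exception:
        log.exception("new version failed on %r", args)
    return old_result

# Hypothetical usage: cancel_v1 is the live path, cancel_v2_dry_run
# computes the new result without writing anything.
# result = run_with_shadow(cancel_v1, cancel_v2_dry_run, subscription_id)
```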
Figuratively speaking, my idea of a system upgrade is that of moving house. Instead of taking down the old house and building a new one, we live in both while gradually migrating various home appliances to the new house. It is a long process, and it may seem irrational in real life; yet it is a good choice when you have much to lose.
Think of it this way: you aren’t just rebuilding the old system. Instead, you are creating a new system, replacing the old one step by step.
When planning the switch, allow some time for further analysis. Your ultimate task is not just to roll out the new features; it is to make sure that the new features work properly after deployment — and this may take time.
There is no magic pill, and using togglers might not be optimal in each and every situation. But when applicable, it can surely make your task easier.
Establishing a good rapport
A developer who really wants to deploy a new feature on time has to take many external factors into account: other developers, managers, admins, support staff, the need to coordinate subsystems in parallel development, the difficulty of testing, documentation maintenance, even forecasting the end users’ reactions. The developer depends on all of these. With the wrong approach, this can very well boil down to a “war that never changes”, to quote a video game.
“We are what makes the company.” This is something I’ve heard in every team I’ve ever been a part of. The thing is, every team in every department thinks so. Admins, anti-fraud specialists, support team members, managers, testers: all of these people will tell you they are the backbone of the company. This article was written by a developer, from a developer’s point of view, with the developer in the limelight. That doesn’t mean an admin, a manager, or a tester cannot be in the center of attention; on the contrary, it’s good when the system is surrounded by different kinds of specialists, each with his or her own opinion. But keep in mind that the tasks of other departments are inherently different from yours. It may look like you’re working on one and the same problem, yet your purposes will differ. That is why, having the deepest knowledge of the subject, the developer has to be the driving force behind the upgrade process.
It is my strong belief that the developer and their team should own the task from start to finish. Once you’ve invested yourself in something, it should become of utmost importance to you. And it is you who chooses the team: perhaps you should get your admin, your analyst, your technical writer, and/or your support staff member on board. If you haven’t done so yet, consider picking your own team members and communicating with them throughout the whole development cycle. Don’t hesitate to consult analysts or any other specialists who might aid your cause. In an ideal world, the developer communicates with all the other departments in the company, so that information about the task’s current state flows freely both ways. This is particularly crucial for complex tasks.
In brief, these are the key points I would like to make with this article:
- It is not a good idea to rebuild the system from scratch and then abruptly switch all traffic to the new version.
- Plan your dashboards (logs, metrics, and analytics) at the architecture stage; poorly organized feedback channels will hurt you in the long run.
- You should strive to provide overall transparency, accessibility of documentation, free flow of data, easy access to dashboards containing logs and metrics, clear and comprehensible descriptions of all processes, and backup plans for possible negative scenarios.
- You should establish good working relationships with admins, DBAs, analysts, technical writers, support staff, and other important co-workers. Make sure they all have a clear understanding of the task beforehand.
- Last but not least, don’t try to accomplish everything on your own. More often than not, it will lead to poor results.