On Operational Environments
If you work on a client-server application that users rely on, you will soon wonder: “How can I test a new change without negatively affecting my users?” Tests in artificial scenarios are not sufficient. They cannot capture the subtle interactions of the real environment and all of its dependencies. A better solution is to first test changes in another environment that mimics the real one. The real environment is typically referred to as ‘production’ and the mimicked one as ‘staging’. The main purpose of staging is to reduce risk. You might want even more environments for different purposes. At the same time, you will notice the costs and challenges associated with managing multiple environments. In this article we will explore these operational environments.
In the context of this article, we understand operations to include: provisioning, deployments, migrations, monitoring, alerting, logging, scaling, seeding data, syncing data, backups, rollbacks, and restorations. That is a lot of concerns to handle for just one environment, let alone several. Of course, not all environments require all of these activities, and the criticality of your service plays a role, too. The point still stands: making sure that all environments are correctly set up is hard.
In my experience, there are two main dimensions to consider when defining environments:
- Is it primarily used for development or for testing changes?
- Does the data need to mimic production data closely or not?
The Staging Environment
We have already introduced the staging environment (stg-env). The stg-env is typically the environment most similar to the production environment (prd-env). The idea is to test changes as closely to the prd-env as sensible, to reduce the risk of corrupting real data or otherwise negatively impacting real users. Often, stg-env access is given to a subset of your users, as they tend to be better at finding problems. Note that the stg-env is not about testing the “usefulness” of changes. That is the task of user tests or A/B tests.
You might think that, ideally, stg-data should be an exact copy of prd-data at all times. We want to be as close as possible to the prd-env, don’t we? In practice, this is usually neither feasible, nor necessary, nor even desired! The prd-env database might simply be too huge to sync. Caches might affect aspects of the prd-env, but is it really worth it to copy them over? You have beefy machines in your prd-env to handle all the traffic. Is it really worth the cost to run the same machines in the stg-env with almost no traffic? Wait a minute! Shouldn’t we shadow production traffic to staging? And so on. Often, you do want some differences (in data) to test features. Remember, the goal is to reduce risk, not to eliminate it, as that would be far too costly.
At one job we simply overwrote the stg-env database once daily with the content of the prd-env database. Afterwards, we ran scripts to populate the stg-env with artificial data (e.g. test users). This worked well in practice. At another job this kind of syncing would not have made sense. Instead, developers pushed changes to the stg-env and, if all looked good, pushed similar (but not identical) ones to the prd-env. This worked well in practice, too. It really depends on the application/service and your goals.
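As an illustration of the first setup, here is a minimal shell sketch of such a nightly refresh, assuming Postgres. The connection URLs and the seed script name are invented, and the script defaults to a dry run that only prints the commands it would execute:

```shell
#!/bin/sh
set -eu

# Hypothetical hosts, database names, and seed script; adapt to your setup.
PRD_URL="${PRD_URL:-postgres://prd-db.internal/app}"
STG_URL="${STG_URL:-postgres://stg-db.internal/app}"

run() {
  # In dry-run mode (the default) we only print the commands, so the
  # sketch is safe to execute without touching any database.
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

refresh_staging() {
  # 1. Overwrite the stg-env database with a fresh prd-env dump.
  run pg_dump --clean --no-owner --dbname "$PRD_URL" --file /tmp/prd.sql
  run psql "$STG_URL" -f /tmp/prd.sql
  # 2. Seed artificial data (e.g. test users) on top.
  run psql "$STG_URL" -f seed_test_users.sql
}

refresh_staging
```

A cron job (or scheduled CI pipeline) would invoke this with `DRY_RUN=0` once per day.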
No matter the setup of your stg-env, the basic workflow stays the same. Before deploying changes to the prd-env, you first deploy them to the stg-env. Again, the advantage is that you and others can check the effects of your changes in a system similar to the real one. As a side effect, you will feel more confident about actually making changes than you would without a stg-env. Once you are reasonably sure that your changes are correct, you deploy them to the prd-env (and hope for the best). Another advantage of a stg-env is that, for debugging and analysis purposes, you can run expensive queries/computations there without affecting the performance of the prd-env.
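This workflow can be sketched as a small script; `deploy` and `smoke_test` are hypothetical stand-ins for your real deployment tooling and health checks:

```shell
#!/bin/sh
set -eu

# Placeholder for the real deploy command (kubectl, capistrano, ...).
deploy() {
  echo "deploying $2 to $1"
}

# Placeholder health check; a real one might be:
#   curl -fsS "https://$1.example.com/healthz"
smoke_test() {
  echo "smoke test on $1 passed"
}

release() {
  deploy stg "$1"
  smoke_test stg      # with set -e, a failing check aborts before prd is touched
  deploy prd "$1"
}

release v1.2.3
```

The point is the ordering: the prd-env is only reached after the same change has survived the stg-env.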
Besides the stg-env, your project can benefit from even more environments with different purposes. The following is a list of additional environments that I have observed and used in practice:
- If you use continuous integration (CI), then you can benefit from a CI environment that can be reset to a fixed state to reliably run automated tests.
- People might rely on the staging system being (somewhat) stable. Developers might then benefit from an “unstable” environment that they can use as a playground for riskier changes. The data does not need to resemble the prd-env closely. Nonetheless, it is typically seeded from the prd-env (or stg-env) from time to time. The “instability” of such a development environment (dev-env) is a disadvantage, too. Others might readily change data or code in a way that conflicts with your own changes. It might also go down more often, making it unavailable for periods of time.
- This is where the local environment (loc-env) shines. As the name suggests, it is local or rather isolated to the developer. It can run directly on the developer’s machine or on a remote machine owned by the developer. It is even better suited for risky changes and creating artificial scenarios. Usually, you implicitly start out with the loc-env.
- Last but not least, you can have a mock environment (mck-env) where the server just responds with (fixed or random) mock data. You could even simulate the server responses in the client-side code. The mck-env is usually the fastest and easiest environment for creating artificial edge cases. On the other hand, it is also the least similar to the prd-env. I tend to use a mck-env only to kickstart the development of an application. Where a mck-env is applicable, it is invaluable, as you can iterate extremely fast. Over time, however, its maintenance becomes too troublesome, which is why I abandon it at some point.
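To make the mck-env idea concrete, a mock can be as simple as a function that maps request paths to fixed payloads; the paths and JSON below are invented examples:

```shell
#!/bin/sh
set -eu

# Minimal mck-env sketch: answer API paths with fixed JSON instead of
# calling the real server. Paths and payloads are invented.
mock_get() {
  case "$1" in
    /api/users)   echo '[{"id":1,"name":"Test User"}]' ;;
    /api/users/1) echo '{"id":1,"name":"Test User"}' ;;
    *)            echo '{"error":"not found"}' >&2; return 1 ;;
  esac
}

mock_get /api/users/1
```

Edge cases (empty lists, error responses, huge payloads) become one-line additions to the `case` statement, which is exactly why iteration is so fast.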
Each additional environment adds complexity and maintenance effort. You should make a deliberate decision on whether the ROI of introducing and maintaining a certain environment is positive. The more automated your operations are, the easier it is to manage multiple environments.
Your development setup and workflow is not really its own environment but, in the majority of cases, will interact with the loc-env. Situations might arise where you need, for example, to connect your development server to the staging database in order to fix a bug. You should make sure that it is easy to connect to other environments.
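One simple way to make switching easy is to derive connection settings from a single environment name; the URLs below are placeholders:

```shell
#!/bin/sh
set -eu

# Map an environment name to its database URL (values are placeholders).
db_url() {
  case "$1" in
    loc) echo "postgres://localhost:5432/app" ;;
    stg) echo "postgres://stg-db.internal:5432/app" ;;
    prd) echo "postgres://prd-db.internal:5432/app" ;;
    *)   echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# e.g. point the local dev server at the staging database:
#   DATABASE_URL="$(db_url stg)" npm run dev
db_url stg
```

With this in place, connecting the dev server to another environment is a one-line change instead of an editing session in config files.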
Differences between your development setup and the environments beyond just the data (e.g. Node versions) can lead to unforeseen problems when applying changes. In practice, you have to make compromises in order to keep the development workflow fast and usable.
For example, I don’t see any benefit in dockerizing web-based client-side code: its build product (HTML, CSS and JS) is portable and self-contained anyway, and setting everything up without Docker is rather trouble-free. Running the Webpack dev server inside a container during development, in contrast, is annoying. And since you are using Webpack’s dev server, you lose the production-parity advantage of Docker anyway. You might merely use Docker for the external server-side dependencies in the loc-env, such as the database, and run the server itself directly on your machine. If you restart the server often during development, and running it in a container increases startup time, and hot reloading does not work reliably with Docker, then it is simply irrational to use Docker here. Be pragmatic. You can still have something like a make target that builds your project in a container and also runs it inside one. You would use this target rather infrequently, mainly to check for unforeseen differences to your development setup.
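Such a pragmatic split could look like the following Makefile fragment; the target names, image tag, and commands are hypothetical:

```make
# Day-to-day development: external dependencies in containers,
# the server itself directly on the host.
dev:
	docker compose up -d db
	npm run dev

# Infrequent parity check: build and test inside a container to catch
# differences between your host setup and the (dockerized) environments.
container-check:
	docker build -t app:check .
	docker run --rm app:check npm test
```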
(I felt the need to write the above paragraph because I have encountered several open-source projects and articles where Docker was taken too far.)
In this article we have learned why you would want other environments in addition to production. We have also seen that you need to consider the challenges, trade-offs, and purpose when creating environments. Your development workflow should be pragmatic and easily allow connecting to other environments.
As an aside, it used to be error-prone to manage multiple environments. I really wished for a reliable, self-describing, declarative, repeatable, observable, idempotent, fast and easy operational flow. The advent of Docker and “configuration management” tools (e.g. Ansible or Chef) slowly improved the situation. Nowadays, with tools such as Kubernetes, we are almost there.