Avoid Production Incidents by Considering Production Conditions During Development
Things can go wrong when you ignore production during the development process
Production environments, configuration, and setup must be taken into account throughout the software development cycle. Otherwise, incidents can occur in production simply because the production setup was not considered while the software was being developed and tested.
The software development cycle starts with engineers implementing application features locally on their machines. These features are then tested on a sandbox before being merged into the source code repository. Later, the changes are tested on an environment that has the same setup as production to avoid bugs or crashes on the production servers.
However, even with the process above, incidents can still occur when the production setup is not considered. In this piece, I will highlight several incident types that can happen for this reason.
Database Schema and Zero Downtime Deployment
Zero-downtime deployment is one of the most important goals that most web applications try to achieve. Spawning new processes or rolling out new containers for a release can be managed by the operating system or by the cluster tool used for deployment, such as Docker Swarm or Kubernetes.
This goal can easily be broken by changes to the database schema if engineers do not pay attention to how the migrations they implement will behave when run against production.
Database migrations, or schema changes, are usually applied during the release deployment. First, the migrations are applied to the database server, resulting in a new database schema. Then the release is deployed, and a new process running the new code starts serving users' requests.
For a while during deployment, before the new release is rolled out, the old source code runs against the new database schema. How long this window lasts depends on the deployment script; a manual deployment can stretch it considerably. This means that both the old and new releases must be compatible with the new database schema. Otherwise, end users will start facing errors and crashes while using the web application.
This risk does not apply to every database schema change. For instance, adding a new column or a new table is safe and can be done without any issues. On the other hand, removing a column or a table is very harmful and cannot be done in a single migration, because the old release will not be compatible with the new schema. Let’s say you decide to remove the email address from the users table and move it to the contact details table. Applying this change in a single migration means that, as soon as the database change is applied, sending emails to users via the old release will stop working, simply because the email column is no longer available.
Instead of applying this change in a single release, it should be done in two. The first release duplicates the email column in the contact details table and switches the source code to use the contact details table as the source for the user's email. The second release removes the email column from the users table. This is much safer, since the email column is no longer used by any source code when it is removed.
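As a hedged sketch of this two-release approach, the migrations could look roughly like the following (Rails-style; all class, table, and column names are assumptions, and the backfill uses PostgreSQL-style UPDATE ... FROM syntax):

```ruby
# Release 1: add and backfill the email column on contact_details while
# users.email stays in place. The application code is switched to read
# the email from contact_details in this same release.
class AddEmailToContactDetails < ActiveRecord::Migration[7.0]
  def up
    add_column :contact_details, :email, :string
    execute <<~SQL
      UPDATE contact_details
      SET email = users.email
      FROM users
      WHERE contact_details.user_id = users.id
    SQL
  end
end

# Release 2, deployed only after no running code reads users.email:
class RemoveEmailFromUsers < ActiveRecord::Migration[7.0]
  def change
    remove_column :users, :email, :string
  end
end
```

Because the old column survives until release 2, the release-1 deployment window (old code, new schema) never hits a missing column.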
Another example of database changes that may cause crashes during deployment to production is changes performed on big tables. For instance, adding an index to a column is a common task during software development, usually needed to improve performance when searching for records by that column’s values. However, adding an index to a table that has 4 million records can take a long time and could block the application deployment. There is also a risk that the database query will time out and result in an error.
These changes can easily pass local and testing environments simply because the data size in those environments does not match production. This risk is not restricted to creating a new index: any change performed on a large table (e.g. adding, renaming, or removing columns) carries the same risk.
It is highly recommended to consider the production database schema and size for all changes on the database level as early as possible during the software development process to avoid these types of crashes in the production environments.
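As one concrete mitigation, on PostgreSQL an index can be built without blocking writes by using Rails' concurrent option (sketch only; the table and column names are hypothetical):

```ruby
# CREATE INDEX CONCURRENTLY cannot run inside a transaction, so the
# migration must opt out of the DDL transaction that Rails wraps
# migrations in by default.
class AddIndexToOrdersOnUserId < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!

  def change
    add_index :orders, :user_id, algorithm: :concurrently
  end
end
```

The index build still takes a long time on a 4-million-row table, but the application keeps serving writes while it runs.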
Production-Only Code
Restricting code execution to production environments is another risky practice that can easily crash production. For instance, in Rails applications, it is common for developers to write code similar to the snippet below:
send_notification if Rails.env.production?
The risk of the code above is that send_notification will never be executed in local or testing environments; it runs only in production. This means the function is never exercised before release, and if it contains a bug, the bug will be discovered too late, in production, after the source code is deployed.
To reduce such issues, I highly recommend removing any code that is restricted to the production environment. Instead, such behavior should be implemented with feature flags, so it can be enabled and tested in multiple environments.
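A minimal sketch of the flag-based alternative (the flag name NOTIFY_USERS and the helper names are assumptions; a real application would likely use a feature-flag library instead of an environment variable):

```ruby
# Unlike Rails.env.production?, a flag can be turned on in any
# environment, so the guarded code path is testable everywhere.
def notifications_enabled?
  ENV.fetch("NOTIFY_USERS", "false") == "true"
end

def maybe_send_notification(user)
  return :skipped unless notifications_enabled?
  # send_notification(user) would go here
  :sent
end
```

The same code path now runs in staging with the flag on, so a bug in the notification logic surfaces before production.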
Iterations and Batch Processing
It is very common to use iterations and loops to perform batch processing in software applications. Loops are very helpful for applying the same instructions more than once, and they also reduce the amount of code needed. However, loops used for batch processing (or any similar task) can easily break production environments if production conditions are not considered at deployment time. I am not talking about infinite loops here.
In this section, I will highlight two examples of cases where loops can crash production environments. I will use Rails examples, but this applies to all languages, not only Ruby.
1. Iterate over a large list
It is common in most applications to perform the same task for all users or all resources of the same kind. Below are two examples of functions that do so. The first, collect_fees, collects the subscription fees from all customers whose fees are due. The other, import_transactions, imports user transactions. Both functions use services to perform the actual job; let’s assume these services call RESTful APIs.
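A hedged reconstruction of the two functions (the original snippet is not shown here; User, PaymentService, and TransactionService are hypothetical stand-ins for the app's real models and services):

```ruby
# Stub model standing in for an ActiveRecord User class.
User = Struct.new(:id, :fee_due) do
  def self.all
    @all ||= [new(1, true), new(2, false), new(3, true)]
  end

  def self.fees_due
    all.select(&:fee_due)
  end
end

module PaymentService
  def self.collect_fee(user)
    true # imagine a RESTful API call here
  end
end

module TransactionService
  def self.import(user)
    true # imagine a RESTful API call here
  end
end

# Iterate over every user with a due fee and collect the payment.
def collect_fees
  User.fees_due.each { |user| PaymentService.collect_fee(user) }
end

# Iterate over every user and import their transactions.
def import_transactions
  User.all.each { |user| TransactionService.import(user) }
end
```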
The way that the functions above are written may lead to a crash for a few reasons:
- First, neither function handles exceptions. If an action fails for a given user, the loop crashes and processing does not continue for the rest of the users in the list. For instance, if User.fees_due returns 10 users and the API request fails during the third iteration, the loop crashes and payment is not collected for the remaining users. Failures can happen for several reasons, such as the external service being down, or responding with a new response code or data that the application cannot handle. It is very important to consider production conditions for such cases and handle all expected exceptions to make these functions more robust.
- Second, both functions select users without specifying a limit on the number of records. This means that as production onboards new users, the number of iterations grows, and so does the time needed to complete these tasks. If the user base is large enough, the functions will not finish in the expected time. For instance, if User.fees_due returns 30 million users and collecting one payment takes 2 seconds, a sequential run needs roughly 694 days (30,000,000 × 2 s ≈ 60,000,000 s ≈ 694 days). You can argue that the provided example is extreme, and, yes, it is for some businesses. However, even with a much smaller user base (e.g. 7 million), this code still needs more than 160 days to complete. Such a task should be done in one day at most. A better solution is to process these functions in parallel or asynchronously: for instance, create background jobs for the actions and use multiple workers to handle them.
- Third, and most importantly, it is highly likely that the functions above will consume a lot of RAM, and the process running them will be killed by the operating system. Yes, killed by the OS: the Linux kernel's out-of-memory (OOM) killer terminates the processes consuming the most RAM when the host runs out of memory.
But what does that have to do with the functions above? The each method in these functions tries to select and load all the affected users into RAM before operating on them. That is why the OS will likely kill the process before it even starts iterating over the users, depending on the resources of the host node or Docker container.
This issue can be avoided by loading and processing the users in batches instead of all at once. Rails provides a method for this out of the box called find_each, which loads records in batches (1,000 by default) rather than loading the whole table into memory.
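Both fixes — batching and per-user error handling — can be sketched in plain Ruby (each_slice stands in for Rails' find_each here, and the user list and the simulated failure are illustrative assumptions):

```ruby
users = (1..10).to_a
failures = []
processed = []

# Process a small batch at a time instead of loading everything at once.
users.each_slice(3) do |batch|
  batch.each do |user|
    begin
      raise "API timeout" if user == 5 # simulate one failing external call
      processed << user                # the real fee collection would go here
    rescue => e
      failures << [user, e.message]    # record the failure and keep going
    end
  end
end
```

One failing call is recorded in failures while the remaining nine users are still processed, instead of the whole loop aborting at user 5.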
2. Big transactions
Transactions are protective blocks in which SQL statements are committed only if they can all succeed as one atomic action. This means the transaction's SQL statements are kept in RAM until all of them have been created and successfully executed. Consequently, source code that generates large transactions can crash the software process once it starts consuming most of the RAM available on the host server. An example of such (pseudo)code is presented below:
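A hedged reconstruction of the snippet this paragraph refers to (the original code is not shown here; the names are assumptions based on the surrounding prose, and this Rails-style sketch is not runnable standalone):

```ruby
# One transaction wraps the whole loop, so every SQL statement generated
# for every batch is held until the final user is processed.
def notify_all
  ActiveRecord::Base.transaction do
    User.find_in_batches do |batch|
      batch.each { |user| notify(user) }
    end
  end
end
```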
The notify_all function above tries to perform the notify function on all the users. It is true that we loop over the users in batches. However, the transaction keeps all the SQL statements in RAM until the iteration over all users is done, in order to be able to roll the statements back. The result of a big transaction is the same as iterating over a huge list: the process gets killed by the OS.
In the example above, there is actually no need for the transaction in the first place (assuming we do not require that either all users are notified or none). However, a transaction may still be needed inside the notify function to guarantee that each individual user's notification is performed as expected.
The point I am trying to make is: always find the correct context for a transaction, fix its scope accordingly, and limit the resources needed to build it.
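To make the scoping point concrete, here is a pure-Ruby toy model (all names hypothetical, not a real database client): a fake database that buffers statements per transaction, showing that one transaction around the whole loop holds every statement in memory, while per-user transactions keep the buffer small.

```ruby
# Toy model: statements stay buffered in memory until their transaction
# commits, mimicking how a big transaction accumulates state.
class FakeDB
  attr_reader :peak_buffered

  def initialize
    @buffer = []
    @peak_buffered = 0
  end

  def transaction
    yield
    @buffer.clear # commit: buffered statements can be released
  end

  def execute(sql)
    @buffer << sql
    @peak_buffered = [@peak_buffered, @buffer.size].max
  end
end

users = (1..1000).to_a

# One transaction around the whole loop: the buffer grows to 1000.
big = FakeDB.new
big.transaction { users.each { |u| big.execute("UPDATE ... #{u}") } }

# One transaction per user: at most one statement buffered at a time.
small = FakeDB.new
users.each { |u| small.transaction { small.execute("UPDATE ... #{u}") } }
```

Same work, same number of statements; only the transaction scope changes the peak memory held.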
Production environment configuration, database schema, and application data must be considered by developers during development. Otherwise, the software may crash in production because these parameters were ignored while it was being built.