Impact of a digital transformation on software development
Evolution of Software Development and Operations — Part 1
In the past 10 years software development has changed significantly, and so have software operations.
Software development and operations form a continuous interplay.
Let’s take a closer look, outline these changes and see how they led to the emergence of modern application platforms. Along the way we will find the cutting edge and look at some of the current challenges.
The following chapters will walk through a history of software operations. The intention here is not to come up with a perfect re-narration but to roughly describe the development of software operations over time.
By getting into the spirit of a certain operational era you will gain a vivid understanding of its particular challenges. From there it is much easier to understand the subsequent evolutionary step as a logical response to the set of challenges typical for that epoch.
The same applies to the chapters describing the evolution of software development. This development is also divided into eras to illustrate their challenges, their impact and the technological innovations that resulted from them.
Production applications have long been operated on physical servers, often on a single one. Depending on the uptime requirements, this may have been anything from an off-the-shelf commodity server to high-end server hardware.
Classic web applications on such a physical server often consisted of an application server process and a database server process. Files and other assets the application received or produced were stored on the server’s filesystem.
A LAMP stack is a typical example of such a web stack. LAMP stands for Linux, Apache, MySQL and PHP. It’s not so important which application server, database implementation or language is used. The point is that all these components are located on a single physical machine.
This makes the server a single point of failure (SPOF). When (and not if) the server fails, the entire application goes down. With a single server you may be lucky and it keeps running for years; even a cheap server can turn out to be a winner. With hundreds of servers, statistics kick in and recovering from hardware failures becomes a regular task that consumes significant work time.
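To make the statistics concrete, here is a rough sketch. It assumes that servers fail independently of each other and that each server has a 5% chance of a hardware failure per year. That rate is an illustrative assumption, not a measured value.

```python
def expected_failures(servers: int, annual_failure_rate: float) -> float:
    """Expected number of server failures per year."""
    return servers * annual_failure_rate


def prob_at_least_one_failure(servers: int, annual_failure_rate: float) -> float:
    """Probability that at least one server fails within a year,
    assuming failures are independent of each other."""
    return 1.0 - (1.0 - annual_failure_rate) ** servers


# ASSUMPTION: 5% annual failure rate per server, chosen for illustration only.
AFR = 0.05

for n in (1, 10, 100, 500):
    print(f"{n:>4} servers: "
          f"P(at least one failure) = {prob_at_least_one_failure(n, AFR):.3f}, "
          f"expected failures/year = {expected_failures(n, AFR):.1f}")
```

Under these assumptions a single server most likely survives the year, while with 100 servers at least one failure per year is almost certain (above 99%).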
The quality of services heavily depends on the organization of the datacenter and hardware. Technicians must respond quickly and spare parts have to be at hand.
Ideally, these spare parts do not come from the same batch. Components such as hard drives become more likely to fail with age, so replacing one HDD with another from the same batch may lead to a sequence of failures.
In one reported case, five HDDs failed in a single server within a single week because the server provider supplied used replacement HDDs from the same batch as the failed part.
Power supplies are another frequently failing part. When the power supply shuts down, the server shuts down. There are servers with redundant power supplies, though: if one power supply fails, the other takes over.
Of course, you need to pay more for a second power supply and the corresponding failover electronics. More than that, a second power supply needs a second power line. Ideally, this power line is independent of the first to also protect against a failure of the first power line. Costs escalate quickly.
Each hardware failure also affects the software layer. A failed HDD or RAID system may cause a loss of data. A corrupted filesystem may cause a loss of data. A corrupted database may cause a loss of data. Losing data is everybody’s nightmare.
Therefore, a solid backup and restore strategy is absolutely essential. Just assume that every possible failure happens from time to time. Looking at the list above, you need protection from at least the most likely failure scenarios.
Even with a backup and recovery strategy, failures are not neutralized. They always imply harm to the business. A potential data loss resulting from the delta between production data and its most recent backup is often unavoidable in a single-server scenario. Still, this isn’t necessarily the most harmful aspect of an outage.
The fact that the application may be down for hours while being recovered also contributes heavily to the overall damage. Customers won’t be able to use the application. Data being sent automatically to the application may be lost if the sending systems do not come with robust retry logic.
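What retry logic on the sending side could look like is sketched below. This is a minimal illustration using exponential backoff, not a description of any particular system. The function name `send_with_retry` and its parameters are hypothetical, and `send` stands for whatever call delivers the data.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def send_with_retry(send: Callable[[], T],
                    max_attempts: int = 5,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call send() until it succeeds or max_attempts is reached.

    The waiting time doubles after each failed attempt (exponential backoff).
    HYPOTHETICAL helper: names and defaults are chosen for illustration.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts:
                # Give up; the caller should keep the payload for later.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A sender that keeps unsent data locally and retries like this can bridge short outages. For outages lasting hours, the data additionally has to be buffered durably on the sending side.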
That’s why the mean time to recover (MTR, often also abbreviated MTTR) is an important operational quality of any system. As the name implies, it gives a hint about the recovery time to be expected in case of a disaster. It is therefore wise to optimize the MTR during system design by providing appropriate redundancies and avoiding SPOFs wherever possible and affordable.
Applying this strategy, a level 1 incident causing a system-wide failure can ideally be downgraded to a level 2 or level 3 incident, with a loss of redundancy that leaves the system fully operational.
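The effect of the recovery time on availability can be estimated with the common steady-state formula: availability = MTBF / (MTBF + MTR), where MTBF is the mean time between failures. The sketch below uses assumed numbers (one failure per 2000 operating hours, 8 hours versus half an hour to recover) purely for illustration.

```python
def availability(mtbf_hours: float, mtr_hours: float) -> float:
    """Steady-state availability: share of time the system is up."""
    return mtbf_hours / (mtbf_hours + mtr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime per year (365 days) for a given availability."""
    return (1.0 - avail) * 365 * 24


# ASSUMPTION: one failure every 2000 operating hours on average (illustrative).
MTBF_HOURS = 2000.0

# ASSUMPTION: recovery times of 8 h and 0.5 h are made-up example values.
for label, mtr in (("manual rebuild", 8.0), ("automated rebuild", 0.5)):
    a = availability(MTBF_HOURS, mtr)
    print(f"{label:>17}: availability {a:.4%}, "
          f"about {downtime_hours_per_year(a):.1f} h downtime/year")
```

With the failure rate unchanged, shortening the recovery time directly shrinks the yearly downtime, which is why the MTR is worth optimizing.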
For the hardware of a single server, reducing MTR could mean having a technician with spare parts at hand. But think of the time a sysop needs to manually set up a server stack: installing the operating system (OS), the application server and the database server, configuring both services, deploying the application, configuring and starting it, and setting up backup, monitoring and logging.
The list is long and so will be the day of the sysop. It may take hours to recover the software side of the server failure alone. For this reason, even with a physical server it makes sense to automate software installation and configuration, as this reduces the MTR and thus the damage resulting from server outages.
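The idea behind such automation can be sketched as a list of repeatable steps, each of which first checks whether it still needs to run. The following is a toy model only: the `Step` type, `make_step` and the in-memory `state` dictionary are hypothetical stand-ins for real installation and configuration work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    is_done: Callable[[Dict[str, bool]], bool]  # check the current state
    apply: Callable[[Dict[str, bool]], None]    # bring the state in line


def make_step(name: str) -> Step:
    # HYPOTHETICAL: a real step would install packages or write config files.
    return Step(name,
                is_done=lambda state: state.get(name, False),
                apply=lambda state: state.__setitem__(name, True))


def provision(steps: List[Step], state: Dict[str, bool]) -> List[str]:
    """Run every step that is not done yet; return the names that were applied."""
    applied = []
    for step in steps:
        if not step.is_done(state):
            step.apply(state)
            applied.append(step.name)
    return applied


STEPS = [make_step(n) for n in (
    "install operating system",
    "install application server",
    "install database server",
    "configure services",
    "deploy application",
    "set up backup",
    "set up monitoring and logging",
)]

server_state: Dict[str, bool] = {}
print(provision(STEPS, server_state))  # first run applies every step
print(provision(STEPS, server_state))  # second run finds nothing left to do
```

Because every step checks before it acts, the same procedure can be re-run safely on a half-recovered server, which is the property that makes automated recovery faster than a manual checklist.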
Let’s step back for a second. Imagine the single server pattern is repeated many times. It leads to a data center full of unconnected, dedicated servers. Experience showed that the overall load of these servers is unevenly distributed.
Often the average load of such a data center is below 10%, leading to a gigantic waste. Not only do servers cost money, they also consume power and produce heat. Removing that heat requires cooling, which consumes even more power. Power needs to be redundant, so emergency generators need to be scaled accordingly.
Two lessons can be learned from this: repairing physical servers and recovering the software layer are both key factors in the MTR and thus in overall availability.
In order to overcome these issues, clusters of servers can be built, eliminating single points of failure (SPOFs). A cluster provides higher uptime as it decouples the availability of an application from the availability of a single server.
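How much a cluster decouples application availability from single-server availability can be estimated with a simple model. The sketch below assumes that nodes fail independently and that the application stays up as long as at least one node is up. Both are idealizations, and the 99% per-node availability is an example value.

```python
def cluster_availability(node_availability: float, nodes: int) -> float:
    """Availability of a cluster that stays up as long as at least one
    node is up, assuming nodes fail independently of each other."""
    return 1.0 - (1.0 - node_availability) ** nodes


# ASSUMPTION: each node is available 99% of the time (illustrative value).
NODE_AVAILABILITY = 0.99

for n in (1, 2, 3):
    print(f"{n} node(s): {cluster_availability(NODE_AVAILABILITY, n):.6f}")
```

In this idealized model, each added node multiplies the remaining unavailability by the unavailability of a single node. Real clusters gain less, because failover takes time and failures are often correlated.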
Another way to address the above-mentioned issues is applying virtualization and software automation, ultimately converging into infrastructure as code and application platforms.
As described earlier, a solo server is a single point of failure and comes with the risk of hours-long downtimes. Therefore, combining servers into clusters is a known strategy to increase the system’s uptime.
Let’s briefly walk through taking a single-server setup to the next level by transforming it into a cluster of servers.
Many books describe how clusters can be built. For the sake of simplicity, we assume a web stack as described earlier. Imagine a monolithic version of a Facebook-like social web app. It needs to store user data including assets such as uploaded images and videos. Structured data such as profile information, friendship information and posts is stored in a relational database management system (RDBMS). A database like MySQL or PostgreSQL will do. Assets such as pictures and videos are stored on the filesystem of the server.
In the following the single server setup is scaled out to several servers. Both load and redundancy aspects will be discussed along the way.
In order to eliminate the application server as a SPOF, an additional application server needs to be added. So we need another physical server. This scale out will also increase the load capacity as more user requests can be served by two rather than a single application server.
Now we have two application servers, each running our application. However, in a simple DNS setup the domain resolves to a single host. So we need to add a load balancer.
The load balancer’s job is to accept incoming requests on a public network interface and balance them across the application servers on a private network. This implies that the data center needs to be flexible enough to allow the creation of private networks, which not every provider is willing to do.
The load balancer adds another physical machine.
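The balancing itself can be as simple as handing requests to the application servers in turn. The sketch below shows a round-robin strategy that skips servers marked as down. It is one possible strategy rather than the one any particular load balancer uses, and the class and the host names `app1` and `app2` are hypothetical.

```python
from itertools import cycle
from typing import Iterable, Set


class RoundRobinBalancer:
    """Hands out backends in turn, skipping those marked as down."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends = list(backends)
        self._healthy: Set[str] = set(self._backends)
        self._ring = cycle(self._backends)

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


# HYPOTHETICAL host names on the private network.
balancer = RoundRobinBalancer(["app1", "app2"])
print([balancer.pick() for _ in range(4)])
balancer.mark_down("app1")
print([balancer.pick() for _ in range(2)])
```

In a real setup the health information would come from periodic checks against the application servers rather than from manual calls to `mark_down`.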
Now you have load balancing across the application servers. If one application server fails, the other can still serve your app. But wait, this doesn’t work at the moment because our database is still co-located on one of the application servers.
So let’s move the database to a separate server, for now. We will later take care of its redundancy as there’s still an issue with the application server setup.
Setting up the app on two application servers creates a new challenge. Files stored on the filesystem by our application end up on whichever of the two application servers happened to handle the request.
At this point, users would not see their pictures or videos whenever their requests are balanced to the wrong application server. To overcome this issue, we need to store assets in a common asset store accessible by both application servers.
We do not use NFS as it neither scales well nor provides adequate redundancy. This problem escalates quickly, as solutions for storing assets such as OpenStack Swift require 3 to 5 servers to reach their full availability potential.
With an object store in place, applications can now write assets to the shared asset store. Users can retrieve them either directly from the object store or proxied through the application servers.
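The difference between local and shared asset storage can be shown with a toy model. In the sketch below a plain dictionary stands in for the storage backend. The `AppServer` class and the file name are hypothetical, and a real object store would of course be accessed over the network.

```python
from typing import Dict, Optional


class AppServer:
    """Toy application server that stores assets in a key-value store."""

    def __init__(self, name: str,
                 shared_store: Optional[Dict[str, bytes]] = None) -> None:
        self.name = name
        # Without a shared store, every server keeps its own local files.
        self.store: Dict[str, bytes] = (
            shared_store if shared_store is not None else {}
        )

    def upload(self, key: str, data: bytes) -> None:
        self.store[key] = data

    def download(self, key: str) -> Optional[bytes]:
        return self.store.get(key)


# Local filesystems: the upload lands on app1, the next request hits app2.
app1, app2 = AppServer("app1"), AppServer("app2")
app1.upload("holiday.jpg", b"...")
print(app2.download("holiday.jpg"))  # asset is missing on app2

# Shared asset store: both servers see the same assets.
object_store: Dict[str, bytes] = {}
app1, app2 = AppServer("app1", object_store), AppServer("app2", object_store)
app1.upload("holiday.jpg", b"...")
print(app2.download("holiday.jpg"))  # asset is found
```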
Time to look at the database. Although it is on a dedicated server, it’s still both a SPOF and a potential bottleneck. Most RDBMSs don’t scale horizontally, so you are forced to scale them vertically by buying bigger machines.
As overcoming the limitations of an RDBMS may require fundamental changes in the architecture of your software, we skip this issue for now.
To keep it simple here, let’s assume the database won’t be a performance bottleneck for a while. So we focus on eliminating the database as a SPOF instead.
PostgreSQL, for example, supports asynchronous replication out of the box. Adding a cluster manager such as repmgr also adds failure detection and automatic failover capabilities. The DB cluster needs three (3) machines in total, so it adds another two (2) physical machines.
Let’s look at the architecture now. The app servers, the object store and the database are redundant. But the load balancer isn’t.
So let’s add another load balancer.
The load balancers also need a cluster manager. The cluster manager is responsible for sending and verifying heartbeats from cluster nodes and for triggering a failover if necessary.
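Heartbeat-based failure detection can be sketched in a few lines. The example below is a simplified model, not the behavior of a specific cluster manager: the class, the function and the node names `lb1` and `lb2` are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that went silent."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat received from node at time now."""
        self.last_seen[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        seen = self.last_seen.get(node)
        return seen is not None and now - seen <= self.timeout_seconds


def choose_active(nodes_by_priority: List[str],
                  monitor: HeartbeatMonitor,
                  now: float) -> Optional[str]:
    """Return the highest-priority node that is still alive, if any."""
    for node in nodes_by_priority:
        if monitor.is_alive(node, now):
            return node
    return None


monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.beat("lb1", now=0.0)
monitor.beat("lb2", now=0.0)
print(choose_active(["lb1", "lb2"], monitor, now=1.0))  # primary is alive

monitor.beat("lb2", now=4.0)  # lb1 has stopped sending heartbeats
print(choose_active(["lb1", "lb2"], monitor, now=5.0))  # failover to standby
```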
Part of the failover procedure is the reassignment of the load balancer’s public IP, as this is required to receive incoming traffic. The load balancer does not maintain a relevant state, so there’s no urgent need for a quorum-based algorithm here. Therefore, in contrast to the database cluster, we can leave it at two instead of three cluster nodes.
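Why a quorum-based cluster usually has three rather than two nodes follows from simple majority arithmetic, sketched below. The function names are chosen for illustration.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, reachable_nodes: int) -> bool:
    """True if the reachable nodes form a majority of the cluster."""
    return reachable_nodes >= quorum_size(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """Number of nodes that may fail while a majority remains."""
    return cluster_size - quorum_size(cluster_size)


for size in (2, 3, 5):
    print(f"{size} nodes: quorum = {quorum_size(size)}, "
          f"tolerated failures = {tolerated_failures(size)}")
```

With two nodes the majority is two, so losing either node loses the quorum. With three nodes the majority is still two, so one node may fail. A stateful database cluster benefits from this, while the stateless load balancer pair can do without it.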
We are now looking at nearly a dozen servers and many system components. The overall maintenance effort of such a cluster is extensive and not comparable to that of a single server.
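The server count can be tallied from the steps above: two application servers, two load balancers, three database nodes and three to five object store nodes. A small sketch of that arithmetic:

```python
def total_servers(object_store_nodes: int) -> int:
    """Sum up the physical machines of the cluster built in this walkthrough."""
    cluster = {
        "application servers": 2,
        "load balancers": 2,
        "database nodes": 3,
        "object store nodes": object_store_nodes,
    }
    return sum(cluster.values())


# The object store was described as needing 3 to 5 servers.
print(total_servers(3), "to", total_servers(5), "servers")
```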
This scenario inevitably requires either a group of people or, preferably, consistent automation.
Generally speaking, clusters like this are expensive in terms of both labor and hardware. Complexity and costs have long been obstacles that kept smaller applications from benefiting from these topologies.
You can put many of these components on, let’s say, three machines. However, without proper virtualization or containerization, the lack of isolation between the different processes such as load balancer, application and database may cause issues and undesired interactions. Hardware costs would be reduced but the level of complexity remains.