Zen and the Art of Application Maintenance

Nov 16, 2018

“Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”
- Robert Pirsig, Zen and the Art of Motorcycle Maintenance

Early on in my technical career, a manager recommended I read “Zen and the Art of Motorcycle Maintenance” by Robert Pirsig. I found it to be a thought-provoking narrative that deals with the struggle for Quality even though you may not know exactly how to define it. Since then the idea of Quality has been a subconscious part of my decision-making process. In this article I’ll talk about the role of Operations in the area of software application maintenance, and try to relate it back to the central theme of Quality. All quotes shown here are taken from Mr. Pirsig’s book.

At its core, the goal of operations is to plan, implement, and achieve productivity, quality, and cost targets. Our job is not to just keep the lights on, it’s to keep them running at peak efficiency. There should be no brown outs, burnt out bulbs, or dark shadowy corners. When done right, all our hard work makes it appear as if no work were required at all. We are not generally thought of as software architects, but we need to know how everything fits together. Nor are we black box testers who operate on the inputs and outputs of an application. We sit at the pragmatic intersection of design and implementation.

Consider the role of operations when looking at the two pictures below. The patent application on the left is much like an architecture diagram commonly provided by application vendors. It shows an idealized view of how the application should be constructed. Contrast that with the picture on the right, which represents the user’s view of the application. In operations, we fit somewhere in the middle. We need to understand how the machine works and how it is currently used in order to ensure the application functions correctly.

Operational understanding falls somewhere in the middle.

Release Operations Role

“It’s a problem of our time. The range of human knowledge today is so great that we’re all specialists and the distance between specializations has become so great that anyone who seeks to wander freely among them almost has to forego closeness with the people around him.”

Throughout my career in Release Engineering I’ve held various titles, but I consider them primarily to be “operations” roles. I frequently walk the line between developers and IT, using domain and application-specific knowledge to build an infrastructure that facilitates the software development process. For me, Operations has always been a blend of software development, system administration, and grief counselling. We manage the systems, but we serve the users.

The key to providing good operational support is to understand the application and how users interact with it. We must use domain and application-specific knowledge to ensure that the experience of interacting with the application is as pleasant as possible. That typically involves responsibilities which ensure the smooth operation of a deployed environment:

User and project administration
Application upgrades and roadmap planning
Backup/DR, monitoring, change control, configuration management
Metrics collection, capacity planning, and performance tuning
Infrastructure evaluation and deployment

The challenge in Operations is that we will never be an end-to-end expert in running the application. We must work closely with teams closer to the infrastructure, as well as teams who define the application usage patterns, in order to fill a knowledge gap between those two roles.

Support Levels

“The way to solve the conflict between human values and technological needs is not to run away from technology. That’s impossible. The way to resolve the conflict is to break down the barriers of dualistic thought that prevent a real understanding of what technology is — not an exploitation of nature, but a fusion of nature and the human spirit into a new kind of creation that transcends both.”

In order to develop sufficiently deep expertise in the breadth of topics required to manage enterprise applications, most organizations will establish some level of specialization. There is frequently an IT organization that understands the core technology infrastructure: networking, hardware, and operating systems. Sometimes there is a separate Operations team which specializes in the health and performance of the application. And hopefully there are subject matter experts (SMEs) who have a detailed understanding of the application. These roles each have specialized knowledge and skills which may not be relevant to the other roles, but the entire set of skills is required to ensure a healthy and successful application deployment.

For ease of reference, we’ll refer to each of these roles by “levels” to denote the fact that different organizations may participate in a role depending on the application or circumstance, and the order they get contacted may vary depending on organization preference.

The following diagram is an example of the overlapping skills between each role:

When questions or problems arise, it is important that the correct team get engaged while minimizing the amount of uncertainty or duplicate effort. If users complain that the application is down, it is not very efficient to have various members of IT, Operations, and application experts all stop their work to undertake an independent investigation of the problem. Nor does it make sense to have the entire collection of technical experts engaged on every problem. This is where the idea of an “escalation path” comes in handy.

Level 1: Application Infrastructure

Virtualization, container, orchestration, and storage infrastructure
Monitoring, alerting, and change management infrastructure
On-call 1st responders

Level 2: Application Operations

Policy enforcement (access control, data retention and cleanup)
Monitoring definition, metrics collection, and capacity planning
Application upgrades and configuration management
Licensing and operational cost analysis (per user cost)

Level 3: Application Subject Matter Experts (SME)

Application roadmap (upgrades, integration points)
Policy definition (caching, retention policy, access control)
Integration testing

Frequently, application users are not sufficiently aware of the application infrastructure to diagnose problems on their own. Without a defined escalation path, users are left confused and unsure about who to contact when they encounter an issue. With an escalation path in place, the Level 1 support contact can be engaged to perform a validation of the services in their domain before handing the issue off to the next level once they have confirmed that their responsibilities are met.
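As a rough illustration of the hand-off described above, here is a minimal Python sketch. It assumes each level can express its responsibilities as simple pass/fail checks. The function name `find_owner` and the three check functions are hypothetical placeholders for illustration, not part of any real tool.

```python
from typing import Callable, List, Optional, Tuple

# A check returns True when the service it validates is healthy.
Check = Callable[[], bool]
Level = Tuple[str, List[Check]]


def find_owner(levels: List[Level]) -> Optional[str]:
    """Walk the escalation path in order.

    Each level validates the services in its own domain. The first level
    with a failing check owns the issue. If every level passes, return
    None to signal that no single level could claim the problem.
    """
    for name, checks in levels:
        if not all(check() for check in checks):
            return name
    return None


# Hypothetical checks with hard-coded results, for illustration only.
def storage_is_mounted() -> bool:
    return True


def application_responds() -> bool:
    return False


def integration_tests_pass() -> bool:
    return True


escalation_path: List[Level] = [
    ("Level 1: Application Infrastructure", [storage_is_mounted]),
    ("Level 2: Application Operations", [application_responds]),
    ("Level 3: Application Subject Matter Experts", [integration_tests_pass]),
]

owner = find_owner(escalation_path)
print(owner)
```

In this sketch Level 1 confirms its responsibilities are met and hands off, and Level 2 picks up the issue because its own check fails. Real checks would of course query monitoring systems rather than return fixed values.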

Application Administration

“Each machine has its own, unique personality which probably could be defined as the intuitive sum total of everything you know and feel about it. This personality constantly changes, usually for the worse, but sometimes surprisingly for the better, and it is this personality that is the real object of motorcycle maintenance.”

A lot of work goes into ensuring that an application functions correctly and performs well. Operations may not have control over how the application is implemented, but we do control how the application is deployed and accessed. Our choices influence application behavior, and our ability to observe and collect data about that behavior determines how well we can make informed choices.

Some of the operational tasks that influence application behavior and performance are:

Integration with other applications (cross-application messaging)
Operating system and application configuration/upgrades
Coordinated upgrades across multiple applications (integration points)
Use of application-specific APIs to perform operational tasks
Project or account management and data retention
Policy enforcement, access control, and auditing
Impact analysis relating to performance or licensing costs

Each of these tasks is required to ensure that the application is managed effectively and within a predictable budget. It is never a simple matter of standing up an application in production and then letting it run unattended. Applications left to rot from neglect will almost certainly fail.

Budgeting

“Who really can face the future? All you can do is project from the past, even when the past shows that such projections are often wrong.”

Budgeting refers to the estimation of operational and capital expenditures for the upcoming 18 to 24 months. It is necessary to provide an accurate estimate of your budget requirements for the next fiscal year so that these costs can be included in the larger budgets being calculated within the organization. It is also important to structure the budget forecasting models so that they can be easily re-calculated based on variations in the number of concurrent projects, number of developers, or type of automated processing that the Engineering organization might wish to undertake. A Release Engineering budget cannot be established in isolation without understanding the software development process that will be used during that fiscal period. In particular, parallel development of multiple product releases can have a major impact on infrastructure requirements.

Architectural and process decisions need to be made with budgeting and cost information in mind. Operations should be able to model the per-user, per-project, or per-host cost of an environment. Storage and infrastructure costs should be factored into any decisions about how to scale out an environment. Even services hosted externally have associated costs, and those costs should be understood because they ultimately have a cost impact on the company. If the infrastructure is provided as a service, the cost of that service needs to be quantified, regardless of whether the service is provided by an internal organization or an external one.
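To make the idea of a per-user, per-project, or per-host cost concrete, here is a minimal Python sketch. The cost categories, the class name `EnvironmentCosts`, and all of the figures are assumptions chosen for illustration. A real model would draw on your own purchase orders and metrics.

```python
from dataclasses import dataclass


@dataclass
class EnvironmentCosts:
    """Annual costs for one environment (hypothetical categories)."""
    annual_license_cost: float
    annual_infrastructure_cost: float
    annual_support_staff_cost: float

    def total(self) -> float:
        return (
            self.annual_license_cost
            + self.annual_infrastructure_cost
            + self.annual_support_staff_cost
        )


def cost_per_unit(costs: EnvironmentCosts, units: int) -> float:
    """Spread the total annual cost evenly over users, projects, or hosts."""
    if units <= 0:
        raise ValueError("units must be a positive count")
    return costs.total() / units


# Made-up figures: re-run with different counts to see how the
# per-unit cost moves as the organization grows or shrinks.
costs = EnvironmentCosts(
    annual_license_cost=50_000.0,
    annual_infrastructure_cost=30_000.0,
    annual_support_staff_cost=20_000.0,
)
for label, count in [("user", 200), ("project", 10), ("host", 25)]:
    print(f"cost per {label}: {cost_per_unit(costs, count):.2f}")
```

Keeping the model as a small function of a few counts is what makes it easy to recalculate when the number of developers or concurrent projects changes.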

Capacity Planning

“You look at where you’re going and where you are and it never makes sense, but then you look back at where you’ve been and a pattern seems to emerge. And if you project forward from that pattern, then sometimes you can come up with something.”

Capacity planning is the process of collecting operational metrics, aggregating the data over a period of months or years, and then extrapolating a trend into the future to estimate the amount of resources required based on current or planned activity. The more data that can be collected, and the better understood the future operational plans, the more reliable the future estimates are likely to be. Inputs such as the number of servers, percent utilization, and growth estimates are all variables that will factor into any budget estimate. The data collected through monitoring and metrics collection becomes a critical component of the budgeting process.
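One simple way to extrapolate a trend is a least-squares straight line through the historical data. The Python sketch below assumes one sample per month and roughly linear growth, which will not hold for every metric. The function names and the storage figures are hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two matching data points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(history: List[float], months_ahead: int) -> float:
    """Project the fitted trend `months_ahead` past the last sample."""
    xs = [float(i) for i in range(len(history))]
    slope, intercept = fit_line(xs, history)
    future_x = len(history) - 1 + months_ahead
    return slope * future_x + intercept


# Made-up monthly storage consumption in TB, oldest sample first.
storage_tb = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
projected = forecast(storage_tb, months_ahead=6)
print(f"projected storage in 6 months: {projected:.1f} TB")
```

A straight line is only a starting point. Planned changes, such as a new product release developed in parallel, should adjust the projection rather than be left for the trend to discover.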

Data aggregation is the process of taking granular data, such as the load average collected at 1-minute intervals, and averaging it over a longer period, such as 1-hour intervals. This “down sampling” of data makes it more efficient to visualize long-term data trends that span many months or years. When estimating activity a given number of months into the future, it is critical to have at least that many months of historical data available in order to extrapolate a trend forward. Because many monitoring or metrics collection solutions may not automatically perform this data aggregation, it is important to ensure that any solution put in place has some mechanism to store aggregated data for long periods of time.
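The down sampling described above can be sketched in a few lines of Python. This example assumes the samples arrive as (Unix timestamp, value) pairs and uses a plain average per bucket. The function name `downsample` and the sample data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def downsample(
    samples: Iterable[Tuple[int, float]], bucket_seconds: int = 3600
) -> List[Tuple[int, float]]:
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start_timestamp, average_value) pairs,
    sorted by bucket start time.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its bucket.
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Two hours of made-up 1-minute load averages:
# a steady 1.0 during the first hour and 3.0 during the second.
one_minute = [(minute * 60, 1.0 if minute < 60 else 3.0) for minute in range(120)]
hourly = downsample(one_minute)
print(hourly)
```

Averaging hides short spikes, so it can be worth storing a per-bucket maximum alongside the mean if peak load matters for your capacity estimates.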

Software Licensing

“The test of the machine is the satisfaction it gives you. There isn’t any other test. If the machine produces tranquility it’s right. If it disturbs you it’s wrong until either the machine or your mind is changed.”

The software development process is typically composed of some combination of commercial and open source applications, and the application selection process is heavily influenced by user experience. When an open source application is sufficient and yields value, keep it. If usability or required functionality can only be found in a commercial application and the licensing cost is reasonable, then pay the license fee if your budget allows it. Always keep the per user licensing cost targets in mind to avoid purchasing a collection of “best of breed” commercial solutions that break your budget once they are all assembled.

Each commercial product will have unique licensing terms that determine licensing and support costs for the purchase and ongoing use of the software. These licensing terms may be based on the number of users, number of instances of the software, or the operational environment where it runs (e.g. number of CPUs or amount of RAM). It is critical that the operations team fully understands the current licensing to ensure that the most cost-effective architectural and operational decisions can be made. Operations should be involved in the early stages of any new application deployment, and should have full access to any quotes or purchase orders that relate to the application or environment.
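To show how licensing terms can interact with architectural choices, here is a small Python sketch comparing two common kinds of terms: a fee per user and a fee per CPU. The prices, counts, and function names are invented for illustration and do not reflect any real vendor's terms.

```python
def per_user_license(users: int, price_per_user: float) -> float:
    """Annual cost when the vendor charges for each named user."""
    return users * price_per_user


def per_cpu_license(hosts: int, cpus_per_host: int, price_per_cpu: float) -> float:
    """Annual cost when the vendor charges for each CPU the software runs on."""
    return hosts * cpus_per_host * price_per_cpu


# Hypothetical deployment: 150 users served by 4 hosts with 16 CPUs each.
by_user = per_user_license(users=150, price_per_user=400.0)
by_cpu = per_cpu_license(hosts=4, cpus_per_host=16, price_per_cpu=1000.0)
print(f"per-user terms: {by_user:.0f}")
print(f"per-CPU terms:  {by_cpu:.0f}")

# Under per-CPU terms, scaling out to 6 hosts raises the license cost
# even though the number of users has not changed.
scaled_out = per_cpu_license(hosts=6, cpus_per_host=16, price_per_cpu=1000.0)
print(f"per-CPU terms after scaling out: {scaled_out:.0f}")
```

The point of the comparison is that a decision that looks purely technical, such as adding hosts, can change the license bill, which is why Operations needs to see the actual terms.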

Although open source software may not have a licensing cost associated with it, there will likely be a higher operational cost due to the lack of paid support or available reference material. Depending on the software project, the operational staff may need to invest more time acquiring knowledge, developing customizations, or researching problems. This operational cost should be estimated and factored into any budgeting or cost analysis. The same is true for internally developed tools and applications.

In addition to the application licensing, there may be licensing costs associated with the operating systems or hardware as well. Selecting the appropriate operating system or product line is also an important aspect of budgeting and optimizing. Unless they are included in the infrastructure cost, these underlying licensing costs need to be considered from top to bottom through the entire stack in order to accurately establish a cost model.

Infrastructure Cost

“Mountains should be climbed with as little effort as possible and without desire. The reality of your own nature should determine the speed. If you become restless, speed up. If you become winded, slow down. You climb the mountain in an equilibrium between restlessness and exhaustion.”

Infrastructure can become an ever-expanding domain of cost, time, and complexity. No matter what you are trying to accomplish, there will always be more you could do to improve the infrastructure. At a certain point you will need to come to terms with the fact that infrastructure selection will ultimately be a pragmatic decision based on cost and available resources.

Infrastructure usually implies the storage, network, and servers that make up a software development environment. When hosted internally, that may include everything from hardware to cooling and power costs. When infrastructure is purchased as an external service, that cost may be metered based on application usage patterns or computing power. Whatever the source or combination of sources, it is important to have some way to quantify the current operational costs, capture representative metrics, and extrapolate trends forward to include in budgeting and capacity planning estimates.

Estimating future infrastructure requirements is especially important when hosting the infrastructure internally because there may be physical limitations to the amount of infrastructure that can be deployed. Data centers require proper power and cooling, and although server density is constantly increasing, it is quite common to outgrow a hosting facility quickly without proper planning and a firm grasp of the physical resources required by the infrastructure. If this infrastructure is managed by other groups, it is critical to establish close working relationships with those groups and convey the need for Operations to understand and participate in infrastructure planning.

Conclusion

“When one isn’t dominated by feelings of separateness from what he’s working on, then one can be said to “care” about what he’s doing. That is what caring really is, a feeling of identification with what one’s doing.”

Operations exists in the “empty spaces” and fills the void left between application deployment and application usage. An Operations engineer who “cares” will expand to fill as much of that empty space as possible. They will expand into the lowest levels of the infrastructure until they fully understand how the application runs. They will expand upward into the user domain in order to more fully understand how and why the user does the crazy things they do. And in caring about what they do, ultimately they become more in tune with the application and the users they support.