Release Ops Applications

Shawn Stafford
11 min read · Jan 7, 2019

In a previous post I covered the topic of Software Development Applications, an overview of the various applications that support the SDLC process. This is a companion post which covers the variety of applications that can be used to provide operational support and transparency to the applications and infrastructure.

While software development applications provide a foundation for the software development process, operations applications provide the foundation that supports that infrastructure. They offer traceability and transparency into the health of the software development applications, aggregating data from the continuously running applications into a unified platform for understanding the load and stability of each one.

Because each phase of the software development process frequently relies on applications from different vendors in order to achieve a “best of breed” solution that meets the specific needs of a team or product development organization, it is important that the operations infrastructure be capable of collecting data from various sources and surfacing the data for analysis. Each technology should be selected with the understanding that any application in the software development lifecycle, or even applications in the infrastructure itself, may need to be replaced as the system evolves over time.

Each major section below will provide a brief overview of the following operational areas:

  • Monitoring and Metrics
  • Logging
  • Configuration Management
  • Disaster Recovery
  • Policy Management
  • Event Stream Processing

These areas form the foundation for managing and reporting on the current and historical status of the software development applications you are responsible for supporting. Their role is to collect data and provide transparency to all interested parties. They make it possible to quickly and accurately answer questions about application health, and this transparency will help establish confidence and trust with the application users. Even when applications fail (as they inevitably will), having the data and transparency in place to explain when, how, and why they failed will reassure your users that you have the knowledge and competency to prevent a similar issue in the future.

Monitoring and Metrics

System monitoring involves the active collection of data from hosts and applications within the infrastructure in order to determine if they are operating within acceptable parameters. A monitoring application typically defines thresholds to determine when a monitored item falls outside the boundaries of what is acceptable. At this point the monitoring system may take action, such as correcting the condition or triggering a notification to an external system which can act on the event.
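
To make the threshold concept concrete, here is a minimal sketch, not tied to any particular monitoring product, that classifies a sampled value against warning and critical boundaries and hands anything abnormal to a notification hook. The metric name, threshold values, and notify function are all hypothetical.

```python
# Minimal threshold-check sketch; values and the notify hook are hypothetical.
from dataclasses import dataclass


@dataclass
class Threshold:
    warning: float   # value at which a warning is raised
    critical: float  # value at which a critical alert is raised


def evaluate(value: float, t: Threshold) -> str:
    """Classify a sampled value against its thresholds."""
    if value >= t.critical:
        return "CRITICAL"
    if value >= t.warning:
        return "WARNING"
    return "OK"


def notify(metric: str, state: str, value: float) -> None:
    # Placeholder: a real system would page, open a ticket, or call a webhook.
    print(f"{metric} is {state} (value={value})")


# Example: CPU utilization sampled at 92% against 80/90 thresholds.
state = evaluate(92.0, Threshold(warning=80, critical=90))
if state != "OK":
    notify("cpu.utilization", state, 92.0)
```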

In addition to alerting on events, the collection of these metrics is critical for capacity planning and performance investigations. As data is collected in real time, it must be available for visual representation in graphs and dashboards. This visualization allows relationships between various application and system metrics to be represented in a cohesive view of the operating environment. It must also be possible to view historical data in order to investigate events which occurred well before they were reported or an investigation was undertaken.

Grafana Dashboard

Due to the volume of data being collected, it is often impossible to retain metric data indefinitely. However, historical data is often not required at the level of granularity in which it is originally collected. This is why data aggregation is a critical aspect of any metric collection system. Numeric data is often aggregated by applying calculations such as averages to a range of data to produce an aggregated value. For example, a data value which is collected every 10 seconds may be aggregated over a 5 minute interval to produce a value which represents the average value during that interval. It is this data aggregation which makes it possible to use this data for capacity planning over longer periods of time such as months or years where a finer granularity is not required.
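
As a rough sketch of that downsampling step, the short Python snippet below rolls 10-second samples up into 5-minute averages. Real metric stores perform this aggregation internally, so the code is purely illustrative.

```python
# Roll fine-grained (timestamp_seconds, value) samples up into coarser averages.
from collections import defaultdict
from statistics import mean


def downsample(samples, interval_seconds=300):
    """Average (timestamp, value) pairs into fixed-width buckets (default 5 minutes)."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % interval_seconds)
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Fifteen minutes of 10-second samples produces three 5-minute averages.
raw = [(t, 50 + (t % 30)) for t in range(0, 900, 10)]
for bucket_start, average in downsample(raw).items():
    print(bucket_start, round(average, 2))
```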

By leveraging a time series database, data can be stored and queried efficiently. Databases such as InfluxDB are built specifically for time series data, while traditional databases such as PostgreSQL can use extensions such as TimescaleDB to optimize their storage and query behavior for time series workloads. These databases tend to differentiate themselves based on their query language, aggregation functionality, and clustering features.
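
As one example of what such a query might look like, the sketch below uses the influxdb-client Python package to average a hypothetical cpu measurement into 5-minute windows with Flux. The bucket name, measurement, URL, token, and organization are placeholders for whatever exists in your environment.

```python
# Query 5-minute CPU averages from InfluxDB 2.x; connection details are placeholders.
from influxdb_client import InfluxDBClient

flux = '''
from(bucket: "metrics")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> aggregateWindow(every: 5m, fn: mean)
'''

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    for table in client.query_api().query(flux):
        for record in table.records:
            print(record.get_time(), record.get_value())
```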

Logging

Most applications are capable of generating logs on the filesystem. Each application might use its own logging format, and logs may be scattered across multiple filesystem locations or even across multiple hosts. Log aggregation is the act of consuming those logs and publishing them to a central location where they can be searched or analyzed. This makes it possible to identify patterns across many different servers or many application logs.

During the process of consuming these logs, it is often desirable to parse the log messages into a more structured format such as JSON so that they can be more easily digested by the message consumers. For example, all web server access logs might be consumed and parsed to identify the HTTP client, response code, URL, client IP address, and other meaningful information. In addition to parsing the log message, the record might be augmented with additional context, such as mapping an IP address to a particular geographic location so that the data can be aggregated by geography if that is meaningful.
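
Here is a hedged sketch of that parsing step, using a simplified pattern for the common web server access-log format. In practice this parsing usually happens inside the log shipper itself, and the geographic enrichment is left as a stub.

```python
# Parse a common-format access log line into a structured, JSON-ready record.
import json
import re

LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_log(line):
    match = LOG_PATTERN.match(line)
    if not match:
        return None  # unparseable lines can be routed elsewhere for inspection
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Enrichment stub: a real pipeline might map client_ip to a geographic region here.
    record["geo"] = None
    return record


line = '203.0.113.7 - - [07/Jan/2019:03:14:15 +0000] "GET /index.html HTTP/1.1" 200 5321'
print(json.dumps(parse_access_log(line), indent=2))
```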

Applications such as Fluentd or Logstash are used to parse the incoming logs and send the data to a persistent storage area such as Elasticsearch. Once the structured data is available in Elasticsearch, Kibana can be used to perform ad-hoc searches or construct complex dashboards that present the data in an easy-to-understand view.

Kibana Discovery Screen
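
Continuing the sketch above, a parsed record could be pushed into Elasticsearch with the official Python client. In a real pipeline Fluentd or Logstash would normally handle this delivery; the connection URL and index name here are placeholders, and the call follows the elasticsearch-py 8.x style.

```python
# Index a structured log record into Elasticsearch (URL and index name are placeholders).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "client_ip": "203.0.113.7",
    "method": "GET",
    "url": "/index.html",
    "status": 200,
    "timestamp": "2019-01-07T03:14:15Z",
}

# Once indexed, the record becomes searchable and chartable from Kibana.
es.index(index="weblogs-2019.01.07", document=doc)
```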

Configuration Management

Applications require installation, deployment, and configuration. The role of configuration management is to ensure that these tasks are automated and repeatable. When dealing with physical or virtual machines, tools such as Ansible or Puppet can be used to update hundreds or thousands of hosts from a central server. When dealing with containers, an orchestration framework such as Kubernetes can be used to ensure that the containers are deployed and configured according to a defined set of rules. In each case, the definition of the environment can be managed in source control to provide repeatable, scalable, and auditable management of the infrastructure.

As the number of hosts in your infrastructure grows, having a configuration management solution in place is critical for deploying and updating the metric and log collection agents required to achieve the data collection described earlier in this article. It also makes it possible to perform system updates and ensure that all systems are using the correct configuration files and application versions.

Disaster Recovery

A documented disaster recovery plan is essential for ensuring that systems can be properly restored in the case of an unplanned system failure. The disaster recovery documentation should describe the business continuity needs, the detailed steps and ownership required to recover each component of the system, and the impact of various failure scenarios. When a failure occurs, a good recovery plan minimizes downtime and reduces stress and errors which can result in lost time or data.

Before covering the topic of disaster recovery, it is important to first understand the concept of an operational runbook. A runbook describes all key operational details about an application or environment. It should be informative enough that it can be used by any member of the operations, IT, or development team to quickly understand the architecture of a system. A runbook is often used as a quick reference and should present information in a format which can be quickly identified and digested without excessively verbose content. This is a document that will get referenced on a daily basis and during a 3am off-hours troubleshooting session by an on-call team member. The last thing anyone wants to be doing at 3am is trying to read a dense design document to determine how to access an application or restart a failed service to resolve a monitoring alert.

Key elements of the runbook will likely include:

  • Application support contacts
  • Architectural diagrams
  • Development, test, and production hosts
  • Services, ports, and log locations
  • Clean-up and troubleshooting
  • Disaster recovery plans

The disaster recovery plan is application and environment-specific, and should be accessible from any operational runbook. When an outage occurs, the operations team should be able to start the investigation by referencing the application runbook and then proceed to the disaster recovery plan if it is determined that a component of the system has failed and cannot be repaired. Ideally the plan should be exercised periodically in a test environment to ensure that the steps can be performed correctly and to determine how long each step of the recovery will actually take to perform. Having an estimated recovery time will make it easier to set user expectations when an outage does occur.

When creating a recovery plan, the first step is to define the business continuity needs. These help establish expectations for acceptable data loss, uptime, and contingency plans. For example, for an application which handles non-critical data such as log messages it might be acceptable to lose up to an hour of data as long as application downtime is minimized to 5 minutes during a data recovery situation. In a transactional source control system, on the other hand, it may not be acceptable to lose any data during an outage and the duration of the outage may not be as critical if developers can continue working off-line until the service is restored. Knowing these requirements can help shape architectural decisions about the environment and define the steps required to recover from a disaster in a way which meets the business needs.

The recovery plan should describe each step in the recovery process in detail, including who is responsible for executing the steps, who to contact if assistance is needed, and time estimates for any long-running tasks. Application owners should be responsible for reviewing and signing off on the plan to ensure accuracy and completeness. Executing the plan against a test environment on a regular basis is a good way to ensure that operations is comfortable performing the steps and that there are no errors or omissions in the documentation.

Policy Management

Each software development application has its own representation of users, groups, and projects. The operational challenge is to manage these consistently across all applications. This means that when a development team requests a new project, it must be created across all applications and the appropriate access permissions granted. Implementing an operational tool to manage these administrative operations across all of the applications makes it possible to satisfy many of the key operational requirements such as:

  • Allowing self-service creation of projects and groups
  • Enabling auditing and traceability
  • Enforcing expiration, deletion, and archival policies

A policy management tool provides an abstraction layer that defines the common operations that must be satisfied within each application. These operations include create, read, update, and delete (CRUD) operations in the following areas:

  • User and project management
  • Access controls (granting users access to projects)
  • Application configuration (authentication, plugins, system settings)
  • Application upgrades
  • Moving users or projects across application instances

These abstract operations can be used to enforce data retention policies, provide self-service or streamlined administrative operations, and ensure consistency across all applications. When done well, it also ensures that a single tool and language can be used to manage all applications within the SDLC portfolio. As applications are added, removed, or replaced by new vendors, the tool can be updated to support the new application.
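
One way to picture this abstraction layer is as a small interface that every application adapter must implement. The class and method names below are illustrative rather than drawn from any specific tool.

```python
# Illustrative policy-management abstraction: each SDLC application implements the
# same interface, so one tool can drive project and access operations everywhere.
from abc import ABC, abstractmethod


class ApplicationAdapter(ABC):
    """Operations that every managed application must support."""

    @abstractmethod
    def create_project(self, name: str) -> None: ...

    @abstractmethod
    def grant_access(self, project: str, group: str, role: str) -> None: ...

    @abstractmethod
    def archive_project(self, name: str) -> None: ...


class GitServerAdapter(ApplicationAdapter):
    # A concrete adapter would call the application's REST API; print() stands in here.
    def create_project(self, name):
        print(f"git: created repository group '{name}'")

    def grant_access(self, project, group, role):
        print(f"git: granted {role} on '{project}' to '{group}'")

    def archive_project(self, name):
        print(f"git: archived '{name}'")


def onboard_project(adapters, name, team):
    """Apply the same policy across every registered application."""
    for adapter in adapters:
        adapter.create_project(name)
        adapter.grant_access(name, team, "developer")


onboard_project([GitServerAdapter()], "payments-service", "payments-team")
```

Each additional application (issue tracker, CI server, artifact repository) would get its own adapter, while the policy logic itself stays unchanged.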

Creating this type of policy management interface also makes it possible to scale the applications horizontally. For example, if each team requires its own application instance for scalability or security reasons, the management tool can help manage each application instance and ensure that it is configured correctly to interact with the other applications.

Event Stream Processing

Once you start down the path of metrics collection and log aggregation, you’ll soon find yourself in a predicament. Now that the data is flowing fast and furious, how can you ensure that it gets to the right location? What if there are multiple locations where you want to send the data simultaneously, or what if you need to transform the data while it’s in transit? And how do you take these systems off-line for maintenance and upgrades without disrupting the flow of data?

As event messages flow in from various sources, it is often desirable to take action on the events as they arrive. An example of this might be to send a notification when a log message contains a specific error string or pattern. A more sophisticated use case might be to monitor the volume of messages (throughput) and take action when the volume exceeds a defined threshold.

Another form of event processing is augmenting or transforming the data in real time. For example, IP addresses might be used to perform GeoIP mapping to augment the message data with the country or region of origin. Messages might need to be aggregated over a fixed time interval to produce an average value, or grouped together to calculate elapsed time between messages.
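
As a toy illustration of in-flight transformation, the generator below annotates each message with the time elapsed since the previous one and flags messages that match an error pattern. A real deployment would perform this work in a stream-processing framework rather than in standalone Python.

```python
# Annotate a stream of (timestamp_seconds, text) messages as they pass through.
import re

ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def enrich(messages):
    """Yield each message with elapsed time since the previous one and an error flag."""
    previous_ts = None
    for ts, text in messages:
        yield {
            "timestamp": ts,
            "text": text,
            "elapsed_seconds": None if previous_ts is None else ts - previous_ts,
            "is_error": bool(ERROR_PATTERN.search(text)),
        }
        previous_ts = ts


stream = [(0, "service started"), (12, "ERROR connection refused"), (15, "retrying")]
for event in enrich(stream):
    print(event)
```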

This is where Apache projects like Kafka and Pulsar come to the rescue. I am a huge fan of Kafka. Its power and simplicity make it ideal for centralizing the flow of data and making it accessible to all who care to consume it. These streaming platforms provide a multi-producer/multi-consumer model that can buffer the high volume of log and metric data and open the door to a whole new world of event stream processing possibilities.

There are many benefits to sending log data through a streaming platform rather than directly to a single destination:

  • Data enrichment
    Calculating geographic location, elapsed time between messages, or joining multiple lines into a single message are all examples of data manipulation or enrichment that can occur in real time as log messages pass through the message queue.
  • Multi-consumer Replication
    When using a message queue that supports multiple consumers, it becomes possible to consume messages from a production message queue into both a production and a test environment. This makes it possible to test upgrades to message consumers and their corresponding infrastructure, reproduce production data volume, or simply replicate data in a test environment without changing or impacting production.
  • High Availability
    Applications like Kafka are designed to scale horizontally. They are clustered by nature and can be upgraded node-by-node without taking the cluster off-line. This ensures that messages keep flowing even during scheduled maintenance.
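
To give a feel for the multi-producer/multi-consumer model, here is a minimal sketch using the kafka-python package. The broker address, topic name, and consumer group are placeholders, and error handling is omitted for brevity.

```python
# Publish log events to a Kafka topic and read them back from a separate consumer group.
# Broker address, topic, and group id are placeholders for a real cluster.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("app-logs", {"level": "ERROR", "message": "connection refused"})
producer.flush()

# Each consumer group receives its own copy of the stream, so a test environment
# can consume the same topic without affecting the production consumers.
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    group_id="log-indexer",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for record in consumer:
    print(record.topic, record.value)
    break  # stop after one message for this example
```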

Next Steps

There are a lot of tools and technologies available to help improve the operational management and transparency of any application environment, large or small. The first step is to identify your key objectives. Try to select technologies that complement your skill set and existing infrastructure. If you manage an enterprise environment that consists primarily of Linux virtual machines, then start with something like Ansible to help automate the deployment and configuration of the other components. Focus on quick wins and simplified solutions. Get monitoring and metrics collection in place, along with a visualization tool like Grafana, to help deliver transparency that can be easily understood by others within the organization. More advanced topics like policy management and event stream processing can be tackled later once more immediate needs have been addressed.

Shawn Stafford

Release Engineer with an interest in pipeline traceability and observability.