Stories by Shawn Stafford on Medium

Jenkins Events, Logs, and Metrics

Shawn Stafford — Tue, 22 Jan 2019 02:41:06 GMT

How to collect information about Jenkins for analysis and data visualization

If you support a reasonably large Jenkins instance, or you support a large number of instances, you have probably been faced with a performance problem. Like many Continuous Integration (CI) applications, Jenkins works quite well at a small scale but can degrade significantly without proper care and feeding. This article will present several examples of how to export Jenkins events, logs, and metrics to help identify opportunities for improvement.

Jenkins Application Metrics

The Jenkins Prometheus plugin exposes a Prometheus endpoint in Jenkins that allows Prometheus to collect Jenkins application metrics. The plugin is really just a wrapper around the Metrics plugin to expose JVM metrics through a REST endpoint that returns data in a format which Prometheus can understand. In addition to JVM information, the plugin also exposes information about the job queue, executor counts, and other Jenkins-specific information. The Metrics plugin provides a list of the metrics exposed through the endpoint.
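To make the "format which Prometheus can understand" more concrete, here is a minimal Python sketch that parses the Prometheus text exposition format into tuples. The metric names in the sample payload are invented for illustration and are not the names the Jenkins plugin actually emits; the parser also skips details such as unescaping label values.

```python
import re

# One sample line looks like:  metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload; these metric names are assumptions for the example.
EXAMPLE = """\
# HELP example_queue_size Number of items waiting in the build queue
# TYPE example_queue_size gauge
example_queue_size 4.0
example_executors{state="busy"} 2.0
example_executors{state="idle"} 6.0
"""

samples = parse_metrics(EXAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

In practice Prometheus does this parsing itself when it scrapes the endpoint; a script like this is only useful for quick inspection of what an instance is exposing.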

Once the Prometheus plugin has been installed in Jenkins, the data it exposes can be configured through the “Manage Jenkins” page:

Manage Jenkins -> Configure System -> Prometheus plugin

The Prometheus plugin is a good default because it exposes more system information than most of the other metrics plugins. If you need to send the data to a different destination, you could use one of the Prometheus exporters. If Prometheus is absolutely not an option, however, there are several alternatives. One alternative is the Jenkins Metrics Graphite Reporting plugin. The data exported by this plugin is far more limited than what the Prometheus plugin provides, but it will allow you to get basic information about executor count and HTTP response statistics.

Unlike the Prometheus pull model, the Graphite plugin will push the data to any server capable of accepting Graphite messages. For example, you could configure the InfluxDB Graphite plugin and send metrics directly to InfluxDB. Or you could configure the Logstash Graphite input plugin and send metrics to any output location supported by Logstash.
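The push model is easy to picture once you see what a Graphite message looks like. The sketch below formats a metric in the Graphite plaintext protocol (metric path, value, and Unix timestamp on one line) and shows how it could be pushed over TCP. The metric path is a made-up example, and the send call is left commented out so the snippet runs without a Graphite listener.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext line: path, value, timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(host, port, line, timeout=5.0):
    """Push a single plaintext line to a Graphite-compatible listener over TCP."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii"))


# The metric path below is hypothetical; the plugin chooses its own names.
line = graphite_line("jenkins.example.executors.busy", 2, timestamp=1548124866)
print(line, end="")
# send_metric("graphite.yourcompany.com", 2003, line)  # uncomment to send
```

Any server that accepts this line format on its Graphite port can receive the data, which is why InfluxDB and Logstash both work as destinations.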

If you manage a large number of Jenkins instances, configuring these settings through the UI can be tedious. In such cases, the Jenkins REST API can be used to submit a Groovy script to each instance:

curl -v --data-urlencode "script=$(cat /tmp/script.groovy)" --user username:ApiToken http://jenkins01.yourcompany.com:8080/scriptText

The Groovy code shown below provides an example of how to configure the Jenkins Metrics Graphite plugin to send data to an external system.

import jenkins.metrics.impl.graphite.GraphiteServer;
import jenkins.metrics.impl.graphite.PluginImpl;
import jenkins.model.*;
import org.codehaus.groovy.runtime.InvokerHelper;

// Construct an object to represent the Graphite server
String prefix = "jenkins";
String hostname = "graphite.yourcompany.com";
int port = 2003;
GraphiteServer server = new GraphiteServer(hostname, port, prefix);
List servers = new ArrayList();
servers.add(server);
GraphiteServer.DescriptorImpl descriptor =
    Jenkins.getInstance().getDescriptorByType(GraphiteServer.DescriptorImpl.class);
descriptor.setServers(servers);
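When the same Groovy script has to be submitted to many instances, the curl call shown earlier could be wrapped in a small loop. The following Python sketch builds the equivalent POST request for each instance using only the standard library. The second hostname, the credentials, and the one-line Groovy script are placeholders, and the requests are built but not sent so that the snippet can be run safely.

```python
import base64
import urllib.parse
import urllib.request


def build_script_request(base_url, username, api_token, groovy_source):
    """Build (but do not send) a POST to the Jenkins /scriptText endpoint."""
    # URL-encode the script so characters such as '&' or '+' survive the POST.
    body = urllib.parse.urlencode({"script": groovy_source}).encode("utf-8")
    credentials = f"{username}:{api_token}".encode("utf-8")
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/scriptText",
        data=body,
        method="POST",
    )
    request.add_header(
        "Authorization", "Basic " + base64.b64encode(credentials).decode("ascii")
    )
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


# Hypothetical list of instances and a trivial placeholder script.
INSTANCES = [
    "http://jenkins01.yourcompany.com:8080",
    "http://jenkins02.yourcompany.com:8080",
]
prepared = [
    build_script_request(url, "username", "ApiToken", 'println "hello"')
    for url in INSTANCES
]
for request in prepared:
    print(request.get_method(), request.full_url)
    # print(urllib.request.urlopen(request).read().decode())  # uncomment to send
```

Depending on how an instance is secured, additional headers (for example a CSRF crumb) may be required; that is outside the scope of this sketch.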

Jenkins Events

The Jenkins Statistics Gatherer plugin can be used to send JSON messages for each event to an external REST endpoint. One application of this would be to send the messages to Elasticsearch for visualization within the Kibana web interface. Jenkins events correspond to actions that occur on the Jenkins master, such as:

Project creation (when a job is created, deleted, or updated)
Job execution (when a build starts and finishes)
Job step execution (when each step in the job starts and finishes)
Job queue (when a job enters or changes state in the job queue)
SCM Checkout (when the job checks out files from source control)

There are many ways to publish events to Elasticsearch. Some popular options include:

Logstash HTTP input plugin -> Logstash Elasticsearch output plugin
FluentD HTTP input plugin -> FluentD Elasticsearch output plugin
Confluent REST Proxy -> Kafka -> Logstash Kafka input plugin -> Logstash Elasticsearch output plugin

For the sake of simplicity, this article will stick with Elasticsearch products and assume the use of Logstash as a means to ingest events into Elasticsearch. Regardless of the solution you choose, the process will essentially be the same.

Once the Statistics Gatherer plugin is installed in Jenkins it can be configured to send messages through the Jenkins UI:

Manage Jenkins -> Configure System -> Statistics Gatherer

The screenshot above shows how the Statistics Gatherer plugin can be configured to send HTTP messages to a Logstash HTTP input plugin listening at http://logstash.yourcompany.com/.

Important Notes:

The “Enable HTTP publishing?” option must be selected in order for the messages to be sent. This option is only visible once the “Advanced…” button is clicked in this configuration section.
The /jenkins-*/ path at the end of each URL (for example, /jenkins-build/) is optional, but it helps identify which Jenkins event type is being submitted by allowing Logstash filters to be defined on the request_path information.
Builds which publish artifacts can produce unique JSON fields for each Artifact, which can exceed the number of fields allowed for an Elasticsearch index. To avoid this, use a Logstash filter to strip out any unwanted fields:

filter {
    mutate {
        remove_field => [ "[build][artifacts]" ]
    }
}
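To illustrate what the two notes above accomplish, here is a rough Python equivalent of the processing: derive the event type from the request path and drop the artifacts sub-document before indexing. The event payload and its field names are invented for the example, and this is a model of the idea rather than how Logstash implements it.

```python
import copy


def event_type_from_path(request_path):
    """Map a path such as '/jenkins-build/' to the event type 'build'."""
    segment = request_path.strip("/")
    prefix = "jenkins-"
    return segment[len(prefix):] if segment.startswith(prefix) else "unknown"


def strip_artifacts(event):
    """Return a copy of the event without the build.artifacts sub-document."""
    cleaned = copy.deepcopy(event)
    build = cleaned.get("build")
    if isinstance(build, dict):
        build.pop("artifacts", None)
    return cleaned


# Hypothetical event; real messages contain many more fields.
event = {"build": {"result": "SUCCESS", "artifacts": {"app.jar": {"size": 1024}}}}
cleaned = strip_artifacts(event)
cleaned["type"] = event_type_from_path("/jenkins-build/")
print(cleaned)
```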

As noted earlier, the Jenkins scripting console or REST endpoint can be used to automate the configuration of the plugin. The contents of the Groovy script would look something like this:

import org.jenkins.plugins.statistics.gatherer.StatisticsConfiguration;
import jenkins.model.*;

String baseUrl = "http://logstash.yourcompany.com";

StatisticsConfiguration descriptor = Jenkins.getInstance()
    .getDescriptorByType(StatisticsConfiguration.class);

descriptor.setQueueUrl("${baseUrl}/jenkins-queue/");
descriptor.setBuildUrl("${baseUrl}/jenkins-build/");
descriptor.setProjectUrl("${baseUrl}/jenkins-project/");
descriptor.setBuildStepUrl("${baseUrl}/jenkins-step/");
descriptor.setScmCheckoutUrl("${baseUrl}/jenkins-scm/");

descriptor.setQueueInfo(Boolean.TRUE);
descriptor.setBuildInfo(Boolean.TRUE);
descriptor.setProjectInfo(Boolean.TRUE);
descriptor.setBuildStepInfo(Boolean.TRUE);
descriptor.setScmCheckoutInfo(Boolean.TRUE);

descriptor.setShouldSendApiHttpRequests(Boolean.TRUE);

At the end of the process, what you should have is a collection of Jenkins event messages in Elasticsearch that can then be used in Kibana visualizations and dashboards to make informed decisions about build performance, failure rates, or a variety of other questions.

Kibana Search = type: build AND jobName: rest_open AND result: SUCCESS

Jenkins Application Logs

The Jenkins master and slave processes generate application logs on the filesystem. These logs contain information about the Jenkins process and can be useful to identify problems that may not be easily identified through the user interface. By sending these logs to Elasticsearch, the information can be indexed and searched for patterns using the Kibana web interface. Each line of the log becomes a JSON record in Elasticsearch.

The easiest way to ship the contents of the application logs to Elasticsearch is to use Filebeat, a log shipper provided by Elastic. Filebeat can be configured to consume any number of logs and ship them to Elasticsearch, Logstash, or several other output channels. I would recommend shipping the logs to Logstash so that the appropriate Logstash filters can be applied to parse the lines into JSON fields. This is particularly useful for HTTP access logs, which use a predictable logging format. For example, the COMBINEDAPACHELOG grok filter in Logstash can be used to parse an access log entry into structured JSON data.
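As a rough picture of what parsing an access log entry into structured data means, the Python sketch below applies a regular expression for the Apache combined log format to a sample line. The group names mirror the field names I would expect from the legacy grok pattern, but treat them as an assumption; the sample line itself is invented.

```python
import re

# host ident user [timestamp] "verb request HTTP/version" status bytes "referrer" "agent"
COMBINED_RE = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None


# Hypothetical access log entry.
SAMPLE = (
    '10.0.0.5 - alice [22/Jan/2019:02:41:06 +0000] '
    '"GET /job/example/lastBuild/api/json HTTP/1.1" 200 1532 '
    '"-" "curl/7.61.0"'
)
fields = parse_access_line(SAMPLE)
print(fields)
```

Once each entry is broken into fields like these, Kibana can filter and aggregate on response codes, request paths, or users instead of searching raw text.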

The figures below show the Kibana “Discover” interface, which is useful for searching for log entries. The example demonstrates how to filter for all Jenkins logs where the hostname matches cdptestjml01 and the log line contains the word docker:

filebeat.host.name: cdptestjml01 
AND filebeat.source: jenkins 
AND message: docker

Kibana Discover Screen

In the latest Kibana 6.5.x release, a “Logs” view has been added which provides a streaming view of the logs, similar to running tail -f on the command line:

Kibana Logs Screen

Jenkins Build Logs

Similar to application logs, each Jenkins job generates a console log. These can also be shipped to Elasticsearch. This can be accomplished using Filebeat on the Jenkins master, or by using the Logstash Jenkins Plugin depending on your needs. Using Filebeat would be similar to what is described above, so for illustrative purposes I’ll cover the Logstash Jenkins plugin here.

Once installed, the plugin must be configured to point to a central server. There are several “indexer types” available to choose from. This example shows the use of the SYSLOG indexer type. This will also require that the Logstash server has a syslog input plugin enabled so that it can receive SYSLOG messages as shown here.

Manage Jenkins -> Configure System -> Logstash

As mentioned earlier, if you are managing multiple Jenkins instances it may be easier to use a Groovy script to configure the plugin:

import jenkins.plugins.logstash.LogstashInstallation;
import jenkins.plugins.logstash.LogstashInstallation.Descriptor;
import jenkins.plugins.logstash.persistence.LogstashIndexerDao.IndexerType;
import jenkins.model.*;

Descriptor descriptor = (Descriptor) 
    Jenkins.getInstance().getDescriptor(LogstashInstallation.class);
descriptor.type = IndexerType.SYSLOG;
descriptor.host = "logstash.yourcompany.com";
descriptor.port = 5045;
descriptor.key  = "JenkinsTestEnv";
descriptor.save();

The following example shows how to enable the Logstash plugin on an individual job configuration page:

Job Configuration -> Job Notifications

Important Notes:

The Logstash plugin can be enabled globally for all jobs, or on each individual job. It may be preferable to collect logs only for specific jobs due to the volume of data that can be generated (especially if a developer happens to enable debug logging in their build script).

Summary

There are many plugins that allow you to send or retrieve data from Jenkins. The first step is to identify what data you need and how you will store it. If you wish to collect time series data from Jenkins at regular intervals, then consider using a time series database and visualizing the results with Grafana. The time series back ends mentioned in this article are Prometheus, Graphite, and InfluxDB.

For other data such as events or logs, the Elastic products are by far the most popular option. The Elasticsearch database is a great solution for storing event messages and Kibana is a user friendly interface for creating dashboards to present the data. Although Elastic has attempted to integrate time series functionality into their product, a dedicated time series database combined with Grafana might be easier to use and manage. Once you have historical data from any or all of these sources, you can begin to make informed and rational decisions about the state of your Jenkins environment.

Jenkins Events, Logs, and Metrics was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Scaling SDLC Applications

Shawn Stafford — Wed, 09 Jan 2019 02:45:57 GMT

If you work in a software development organization, you probably deal with various software development applications. These include core applications like source control, bug tracking, and continuous integration. In a small organization (fewer than 500 engineers), scalability is not typically a concern. Most applications in this area perform quite well when the number of users or the amount of concurrent activity remains small. This article is for the other end of the organizational spectrum, the ones who experience the challenges of using SDLC applications which were not designed to meet the needs of a large organization.

The term “commodity hardware” is nearly synonymous with horizontal scalability. When the hardware is inexpensive and seldom fault tolerant, you need to design your software to provide fault tolerance. It needs to be capable of distributing the workload across multiple systems and expanding to include more servers as usage demands. The idea of “commodity software” is similar: if the application is not designed to scale horizontally with increased load, you must distribute the projects and users across multiple instances.

Organizations commonly make the mistake of trying to deploy single monolithic application instances and scale them vertically to support the entire company. Unfortunately, most applications are not designed for large scale deployments, so performance begins to degrade as the amount of concurrent activity increases. This is usually a result of application design limitations, and no amount of optimization or tuning is going to change that.

Of course these problems rarely present themselves during initial deployment because the application data is relatively small and the activity on the system is low. If performance degradation is gradual, it may be possible to project the resource usage forward to anticipate future requirements, but sometimes the degradation is severe or exponential. The best way to mitigate the risk of performance problems in the future is to decompose the application into smaller instances. But how do you know when an application needs to be decomposed for horizontal scaling? The answer is easy: always. Assume that an application can’t scale and you won’t be disappointed.

Here are some common factors that lead to scalability problems:

Storing application data on the filesystem
Loading large amounts of data in memory to render in the user interface
Executing poorly tuned, highly relational database queries

In order to decompose the application, you must determine if it is possible to decompose its usage. This requires an understanding of how users, projects, and teams are structured within your organization. The diagram below compares a team-based organization within a single application instance to an alternative deployment which breaks the application into three instances.

There are several things to consider when breaking up an application:

Logical grouping of projects and teams
Reporting and communication across instances
Licensing and operational costs

Projects and Teams

The first challenge to decomposing any application is deciding how to distribute the projects, teams, and users. There is no magic formula. You must understand how users interact with the application and then model your environment accordingly. For software development organizations, it makes sense to decompose your application instances around software and organizational structures. Try to allocate teams along technology, organization, or product boundaries.

Project and Team Distribution

In order to ensure consistency across instances, you may need to develop internal tools which can be used to deploy new application instances and create new teams, projects, and users within each instance. By using abstractions to model the administrative activities across application instances, you can establish an infrastructure which can manage an arbitrary number of application instances or application types.

Examples of this internal tooling include:

Containerization (Kubernetes)
Configuration Management (Ansible, Puppet, Chef)
Administrative tools for invoking application REST APIs

With these tools in place, it should be possible to deploy a new application instance in minutes. Once the instance is running, the internally developed administration tools can be used to configure the application. As usage increases, it may be necessary to create a self-service user interface to allow users to create new teams and projects on their own.

Cross-application Reporting and Communication

When multiple instances of an application exist, it becomes more difficult to manage integration or communication across instances. It also becomes more difficult to provide transparency across multiple instances. In my previous article, “Transparent Software Development,” I describe the use of Kafka as a message bus to expose information and events about each application in the SDLC environment. This same principle applies when deploying multiple instances of an application. The use of event messages makes it easier to collect data and metrics, trigger events, or integrate with other applications.

Licensing and Operational Cost

The software development process is typically composed of some combination of commercial and open source software. Each commercial product will have unique licensing terms that determine licensing and support costs for the purchase and ongoing use of the software. Licensing terms may be based on the number of users, number of instances of the software, or the operational environment where it runs (e.g. number of CPUs, amount of RAM, etc). When scaling an application, it is critical to consider the licensing and operational costs of the software to ensure that the most cost effective architectural and operational decisions can be made.

Although open source software may not have a licensing cost associated with it, there could be a higher operational cost due to a lack of paid support or quality documentation. Depending on the software project, your operational staff may need to invest more time becoming experts in the software or investigating problems. The same is also true for custom tools and software developed internally.

In addition to the application licensing, there may be licensing costs associated with the operating systems or hardware as well. Management and administration of the hosting environment carries a cost too, so ensure that you choose an environment that you have the knowledge and ability to efficiently manage at scale.

Summary

Deploying multiple application instances can be an effective way to ensure that performance does not degrade as application usage increases, even for applications that were never designed to operate at an enterprise scale. In order to be successful, it is important to make thoughtful decisions about the infrastructure and internal tooling that is used to manage the application instances. By keeping your applications small and lean, you will be better able to predict application performance and ensure that the application can continue to scale well into the future.

InfluxDB Data Retention

Shawn Stafford — Tue, 08 Jan 2019 02:42:28 GMT

Time series data is often used by operations teams to investigate performance issues. Since we cannot anticipate the source of a performance issue ahead of time, we often err on the side of caution and collect as much data as possible as frequently as possible. This makes it easy to get a very granular picture of what was happening in an environment within the last hour, day, or week. However, keeping more than a week or two of data can become difficult as the number of hosts and volume of data increases.

There are many reasons why it may be necessary to retain data for longer periods of time. It is often useful to refer to data that is weeks, months, or even years old in the following cases:

Capacity planning — by reviewing resource utilization over the past 12 to 24 months, you can project usage forward in order to forecast your budget requirements for the next fiscal year.
Performance degradation — users often report that application performance “seems slower” than it did a few weeks or months ago. Having historical data at your fingertips makes it possible to quantify how much longer tasks are taking or whether there is a correlation between increased load and slower performance.

When dealing with time series data, you will inevitably reach a point where you cannot retain all of the data at full granularity for an indefinite amount of time. A trade-off must be made between how long you retain the data, how much data you retain, and how granular the data is. This article covers how to implement a data retention policy in InfluxDB to ensure that you can down-sample your data, retain it for a specified duration, and perform regular backups to avoid data loss.

For the purpose of this example, we will focus on CPU utilization metrics being collected by a collectd system daemon running on 3 hosts. In this scenario, collectd is submitting CPU data (“load_shortterm”) at 10 second intervals from every host being monitored. The data is being sent directly to InfluxDB using the collectd daemon and stored in a “metrics” database which is defined with a 7-day retention policy. The goal is to retain this data in a “longterm” database with a retention policy of 3 years.
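A quick back-of-the-envelope calculation shows why down-sampling makes a 3 year retention practical. The Python sketch below counts data points for a single metric across the 3 hosts, assuming the 15 minute down-sampling interval used later in this article; real storage requirements also depend on compression and on how many metrics are collected.

```python
SECONDS_PER_DAY = 86400


def points_per_series(interval_seconds, retention_days):
    """Number of points one series accumulates over the retention period."""
    return (SECONDS_PER_DAY // interval_seconds) * retention_days


HOSTS = 3
LONGTERM_DAYS = 156 * 7  # 156 weeks

# Raw data: one point every 10 seconds, kept for 7 days.
raw_points = points_per_series(10, 7) * HOSTS
# Down-sampled data: one point every 15 minutes, kept for 156 weeks.
longterm_points = points_per_series(15 * 60, LONGTERM_DAYS) * HOSTS
# What 156 weeks would cost without down-sampling.
unsampled_points = points_per_series(10, LONGTERM_DAYS) * HOSTS

print("raw (7 days):", raw_points)
print("long-term (156 weeks, 15 min):", longterm_points)
print("long-term without down-sampling:", unsampled_points)
print("reduction factor:", unsampled_points // longterm_points)
```

Under these assumptions, three years of down-sampled data costs less than twice as many points as one week of raw data.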

InfluxDB Configuration

InfluxDB does not listen for collectd input by default. In order to allow data to be submitted by a collectd agent, the InfluxDB server must be configured to listen for collectd connections. This section describes how to configure InfluxDB to accept collectd data on a RHEL/CentOS system.

The first step is to create a database on the InfluxDB server to store the incoming collectd data for 7 days. To do this, open a terminal window on the InfluxDB server and use the influx command to connect to the server. Run the following command to create a new database:

CREATE DATABASE metrics WITH DURATION 7d

The next step is to install collectd on the InfluxDB server so that the types.db specification file is available to InfluxDB:

# Install the collectd RPM (available from the EPEL repo)
yum install collectd

# Locate the types.db file installed by the RPM
rpm -ql collectd | grep types.db

Update the InfluxDB config file (/etc/influxdb/influxdb.conf) to listen for collectd data and then restart the influxd service:

[[collectd]]
  enabled = true
  bind-address = ":8096"
  database = "metrics"
  typesdb = "/usr/share/collectd/types.db"

Once InfluxDB is listening for collectd input, you will need to install the collectd agent on each of the 3 hosts and configure it to send data to your InfluxDB server.

Data Identification

If you’re not familiar with the InfluxDB data model, figuring out how to locate the data can be the first challenge. Each type of data being collected is referred to as a “measurement” and each measurement can have any number of “tags” associated with it. In the case of data reported by the collectd agent, a measurement will exist for each type of data you have configured the agent to report.

The following commands can be executed in the influx command window to query the database and identify the measurements that will be used in later steps to define a continuous query.

# Switch to the InfluxDB database containing the CollectD data
USE metrics

# Display a list of metrics in the database
SHOW MEASUREMENTS

# Display the tags (keys) used to uniquely identify CPU load
SHOW SERIES FROM load_shortterm

The SHOW SERIES command should produce output similar to the following:

load_shortterm,host=host1,type=load
load_shortterm,host=host2,type=load
load_shortterm,host=host3,type=load
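Each line of that output is a series key: the measurement name followed by comma-separated tag pairs. As a small illustration of the data model, this Python sketch splits the keys shown above into a measurement and a tag dictionary. It is a simplification that does not handle escaped commas or equals signs inside tag values.

```python
def parse_series_key(series_key):
    """Split 'measurement,tag=value,...' into (measurement, tags)."""
    measurement, *pairs = series_key.split(",")
    tags = dict(pair.split("=", 1) for pair in pairs)
    return measurement, tags


keys = [
    "load_shortterm,host=host1,type=load",
    "load_shortterm,host=host2,type=load",
    "load_shortterm,host=host3,type=load",
]
parsed = [parse_series_key(key) for key in keys]
hosts = sorted(tags["host"] for _, tags in parsed)
print(parsed[0])
print(hosts)
```

The set of distinct tag combinations is what determines how many separate series a continuous query will produce when it groups by all tags.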

Data Retention Policy

There are many options for defining a long-term retention policy in InfluxDB. One option is to create a new retention policy within the existing metrics database and store the long-term data alongside your short-term data:

CREATE RETENTION POLICY longterm_policy ON metrics DURATION 156w REPLICATION 1

However, I would avoid creating multiple retention policies within a single database unless you have a particular justification for doing so. If there are multiple retention policies within a single database, your queries will need to reference the retention policy explicitly (database.policy.measurement) when operating on the data. This will make your queries more fragile because of the explicit references to a retention policy.

In addition, the incoming data in the metrics database is high-volume transient data. It would be very expensive to back up the entire database and the impact of losing 1 week of very granular data would not be worth the effort.

A better approach would be to define a separate database specifically for any long-term data you wish to retain. Doing so will make it easy to change the retention policy later or back up the entire database without affecting any external reports or queries. The following example shows how to create a database with a retention policy of 3 years (156 weeks):

CREATE DATABASE longterm WITH DURATION 156w

Data Aggregation

Create a continuous query to down-sample the data from 10 second intervals down to 15 minute intervals:

CREATE CONTINUOUS QUERY aggregate_load ON longterm
BEGIN
  SELECT max(value) AS value 
  INTO longterm.autogen.load_shortterm 
  FROM metrics.autogen.load_shortterm 
  GROUP BY time(15m),* 
END

Important Notes:

The aggregation function you choose will depend on your use case. The max function is being used in this case because we typically care about the peak load average during an interval. Using a mean calculation would dilute these spikes and make it harder to identify short spikes in activity.
The INTO and FROM clauses require a fully qualified measurement (database.policy.measurement). If you let the database creation define the default retention policy, as shown in the examples above, that policy will be named “autogen”.
The GROUP BY clause contains a * wildcard, which means “all tags.” This ensures that the aggregation is performed across only the data points where all tags are identical. Without that wildcard (or an explicit list of tags), the aggregation would be performed across all data for all hosts, resulting in a single meaningless aggregation. In most cases a wildcard should be used in the GROUP BY to ensure that aggregation is performed on the same unique item.
The multi-line formatting shown above is for readability. You’ll need to structure the command on a single line when executing it within the influx command.
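The notes above about max versus mean and about grouping by all tags can be checked with a small simulation. The following Python sketch is not InfluxDB code; it is a simplified model of the down-sampling, with invented data points, that buckets values by tag set and 15 minute window and keeps the maximum.

```python
from collections import defaultdict


def downsample_max(points, interval_seconds):
    """Group (timestamp, tags, value) points by tag set and time bucket, keeping the max."""
    buckets = defaultdict(list)
    for timestamp, tags, value in points:
        bucket_start = timestamp - (timestamp % interval_seconds)
        series = tuple(sorted(tags.items()))  # one group per unique tag set
        buckets[(series, bucket_start)].append(value)
    return {key: max(values) for key, values in buckets.items()}


# Hypothetical load values; timestamps are seconds from an arbitrary origin.
points = [
    (0, {"host": "host1"}, 0.5),
    (10, {"host": "host1"}, 4.0),   # short spike
    (20, {"host": "host1"}, 0.5),
    (0, {"host": "host2"}, 1.0),
    (900, {"host": "host1"}, 0.7),  # falls into the next 15 minute bucket
]
result = downsample_max(points, 15 * 60)

host1_first = result[((("host", "host1"),), 0)]
host1_mean = (0.5 + 4.0 + 0.5) / 3
print("max keeps the spike:", host1_first)
print("mean would dilute it:", round(host1_mean, 2))
```

Dropping the tag set from the grouping key would merge host1 and host2 into one bucket, which is the "single meaningless aggregation" described above.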

Once you have successfully created a continuous query, you should see measurements appear at the end of the first time interval. This can be verified by listing the measurements in the same way you discovered them previously:

USE longterm
SHOW MEASUREMENTS
SHOW SERIES FROM load_shortterm

The Grafana charts below show the difference between raw data (10 second intervals) and aggregated data (15 minute intervals).

Grafana chart of metrics.load_shortterm

Grafana chart of longterm.load_shortterm

Data Backup

InfluxDB comes with a command-line utility for performing database backups. Execution is reasonably straightforward:

influxd backup -portable -database longterm /backup/longterm

Additional arguments can be specified to restrict the export by a specific retention policy, shard, or date range.
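For regular backups it can help to write each run into its own dated directory. The Python sketch below builds the same command shown above with a date-stamped target path. The directory layout is an assumption made for this example, and the call that would execute the command is commented out so the snippet runs on machines without influxd installed.

```python
import datetime
import subprocess


def backup_command(database, backup_root, day):
    """Build the argument list for a portable backup into a dated directory."""
    target = f"{backup_root}/{database}/{day.isoformat()}"
    return ["influxd", "backup", "-portable", "-database", database, target]


command = backup_command("longterm", "/backup", datetime.date(2019, 1, 8))
print(" ".join(command))
# subprocess.run(command, check=True)  # uncomment on the InfluxDB host
```

A scheduler such as cron could invoke a script like this once a day, with older directories pruned according to your own retention needs.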

Debugging Tips

By default, the InfluxDB service writes its logging output to /var/log/messages. Several lines of output are generated each time a continuous query is run. The following command can be used to look for continuous query execution messages:

tail -f /var/log/messages | grep "Finished continuous"

The matching lines will look something like this:

Jan  7 15:00:00 influxdbhost01 influxd: 
   ts=2019-01-07T20:00:00.127002Z 
   lvl=info 
   msg="Finished continuous query" 
   log_id=0BoYFF20000 
   service=continuous_querier 
   trace_id=0CrUFfTl000 
   op_name=continuous_querier_execute 
   name=aggregate_load 
   db_instance=longterm 
   written=129 
   start=2019-01-07T19:59:00.000000Z 
   end=2019-01-07T20:00:00.000000Z 
   duration=7ms

The log output contains several important pieces of information:

name — the name of the continuous query
written — the number of measurement records that were written
duration — how long the query took to execute
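Because the log line is a sequence of key=value pairs, these fields are easy to extract with a script. The Python sketch below pulls them out of a condensed copy of the log line shown above; it relies on shell-style quoting rules, so a line containing a stray apostrophe would need more careful handling.

```python
import shlex


def parse_logfmt(line):
    """Return the key=value pairs found in a log line as a dict."""
    fields = {}
    for token in shlex.split(line):
        key, separator, value = token.partition("=")
        if separator:  # ignore tokens such as the syslog prefix
            fields[key] = value
    return fields


LINE = (
    'Jan  7 15:00:00 influxdbhost01 influxd: ts=2019-01-07T20:00:00.127002Z '
    'lvl=info msg="Finished continuous query" service=continuous_querier '
    'name=aggregate_load db_instance=longterm written=129 duration=7ms'
)
fields = parse_logfmt(LINE)
print(fields["name"], fields["written"], fields["duration"])
```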

Obviously it would be a bad idea to define a continuous query that is scheduled to run more frequently than the time it takes to complete the query.

Wrap-Up

Continuous queries are a convenient way to selectively down-sample data and retain it for longer periods of time. By creating thoughtful retention policies and backup procedures, it is possible to retain historical time series data for months or even years. Making this historical data readily accessible within the same visualization interface as your real-time metrics will empower application owners and operations staff to make informed decisions based on historical context and trends, rather than relying on a “gut feel” approach to budgeting, capacity planning, or problem investigation.

InfluxDB Data Retention was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Release Ops Applications

Shawn Stafford — Mon, 07 Jan 2019 14:50:43 GMT

In a previous post I covered the topic of Software Development Applications, an overview of the various applications that support the SDLC process. This is a companion post which covers the variety of applications that can be used to provide operational support and transparency to the applications and infrastructure.

While software development applications provide a foundation for the software development process, the operations applications provide a foundation to support that infrastructure. They help provide traceability and transparency into the health of the software development applications. This infrastructure is used to aggregate data from the continuously running applications, providing a unified platform for understanding the load and stability of each application.

Because each phase of the software development process frequently relies on applications from different vendors in order to achieve a “best of breed” solution that meets the specific needs of a team or product development organization, it is important that the operations infrastructure be capable of collecting data from various sources and surfacing the data for analysis. Each technology should be selected with the understanding that any application in the software development lifecycle, or even applications in the infrastructure itself, may need to be replaced as the system evolves over time.

Software Development Applications

Each major section below will provide a brief overview of the following operational areas:

Monitoring and Metrics
Logging
Configuration Management
Disaster Recovery
Policy Management
Event Stream Processing

These areas form the foundation for managing and reporting on the current and historical status of the software development applications you are responsible for supporting. Their role is to collect data and provide transparency to all interested parties. They make it possible to quickly and accurately answer questions about application health, and this transparency will help establish confidence and trust with the application users. Even when applications fail (as they inevitably will), having the data and transparency in place to explain when, how, and why they failed will reassure your users that you have the knowledge and competency to prevent a similar issue in the future.

Monitoring and Metrics

System monitoring involves the active collection of data from hosts and applications within the infrastructure in order to determine if they are operating within acceptable parameters. A monitoring application typically defines thresholds to determine when a monitored item falls outside the boundaries of what is acceptable. At this point the monitoring system may take action, such as correcting the condition or triggering a notification to an external system which can act on the event.
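The threshold idea can be reduced to a few lines of code. This Python sketch classifies a reading against warning and critical thresholds and collects the hosts that would trigger a notification. The hosts, readings, and threshold values are all invented for illustration; real monitoring systems add features such as hysteresis and alert routing.

```python
def classify(value, warning, critical):
    """Return 'ok', 'warning', or 'critical' for a metric where higher is worse."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# Hypothetical load average readings and thresholds.
readings = {"host1": 0.8, "host2": 3.5, "host3": 6.2}
states = {
    host: classify(load, warning=3.0, critical=5.0)
    for host, load in readings.items()
}
alerts = sorted(host for host, state in states.items() if state != "ok")
print(states)
print("notify for:", alerts)
```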

In addition to monitoring of events, the collection of these metrics is also critical for use in capacity planning and performance investigations. As data is collected in real-time, it must be available for visual representation in graphs and dashboards. This visualization allows relationships between various application and system metrics to be represented in a cohesive view of the operating environment. It must also be possible to view historical data in order to investigate events which occurred well before they were reported or an investigation was undertaken.

Grafana Dashboard

Due to the volume of data being collected, it is often impossible to retain metric data indefinitely. However, historical data is often not required at the level of granularity at which it was originally collected. This is why data aggregation is a critical aspect of any metric collection system. Numeric data is often aggregated by applying calculations such as averages to a range of data to produce an aggregated value. For example, a data value which is collected every 10 seconds may be aggregated over a 5-minute interval to produce a value which represents the average value during that interval. It is this data aggregation which makes it possible to use this data for capacity planning over longer periods of time such as months or years where a finer granularity is not required.
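As a rough illustration of this kind of averaging, the following Python sketch groups samples into fixed 5-minute buckets. The `downsample` function and the (timestamp, value) tuple layout are assumptions made for this example; time series databases often provide an equivalent rollup feature of their own.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, interval_seconds=300):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the interval that contains it.
        bucket_start = timestamp - (timestamp % interval_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Sixty samples collected every 10 seconds span two 5-minute buckets.
samples = [(t, float(t // 10)) for t in range(0, 600, 10)]
aggregated = downsample(samples)
```

Sixty raw data points collapse into two averaged values, which is the trade of granularity for retention described above.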

By leveraging a time series database, data can be stored and queried efficiently. Databases such as InfluxDB are developed specifically to store time series data, while traditional databases such as PostgreSQL might use extensions such as TimescaleDB to optimize the database structure for time series data. These databases tend to differentiate themselves based on their query language, aggregation functionality, and clustering features.

Logging

Most applications are capable of generating logs on the filesystem. Each application might use its own logging format, and logs may be scattered across multiple filesystem locations or even across multiple hosts. Log aggregation is the act of consuming those logs and publishing them to a central location where they can be searched or analyzed. This makes it possible to identify patterns across many different servers or many application logs.

During the process of consuming these logs, it is often desirable to parse the log messages into a more structured format such as JSON to ensure that they can be more easily digested by the message consumers. For example, all web server access logs might be consumed and parsed to identify the HTTP client, response code, URL, client IP address, and other meaningful information. In addition to parsing the log message, the data might be augmented with additional data, such as mapping an IP address to a particular geographic location so that the data can be aggregated by geography if that is meaningful.
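To make the parsing step concrete, here is a minimal Python sketch that converts one access log line into a JSON document. It assumes the widely used "combined" access log layout; the field names and the sample line are invented for the example. In practice a tool such as Logstash or Fluentd would do this work using its own parsing configuration.

```python
import json
import re

# Assumes the common/combined access log layout; other formats need another pattern.
ACCESS_LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)


def parse_access_log(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = ACCESS_LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so they can be aggregated downstream.
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


line = (
    '203.0.113.7 - - [22/Jan/2019:02:41:06 +0000] '
    '"GET /job/build/api/json HTTP/1.1" 200 512 "-" "curl/7.61.0"'
)
record = parse_access_log(line)
print(json.dumps(record, indent=2))
```

Lines that do not match return `None`, so a consumer can route them to a separate location for inspection rather than silently dropping them.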

Applications such as Fluentd or Logstash are used to parse the incoming logs and send the data to a persistent storage area such as Elasticsearch. Once the structured data is available in Elasticsearch, Kibana can be used to perform ad-hoc searches or construct complex dashboards to represent the data in an easy-to-understand view.

Kibana Discovery Screen

Configuration Management

Applications require installation, deployment, and configuration. The role of configuration management is to ensure that these tasks are automated and repeatable. When dealing with physical or virtual machines, tools such as Ansible or Puppet can be used to update hundreds or thousands of hosts from a central server. When dealing with containers, an orchestration framework such as Kubernetes can be used to ensure that the containers are deployed and configured according to a defined set of rules. In each case, the definition of the environment can be managed in source control to provide repeatable, scalable, and auditable management of the infrastructure.

As the number of hosts in your infrastructure grows, having a configuration management solution in place is critical for deploying and updating the metric and log collection agents required to achieve the data collection described earlier in this article. It also makes it possible to perform system updates and ensure that all systems are using the correct configuration files and application versions.

Disaster Recovery

A documented disaster recovery plan is essential for ensuring that systems can be properly restored in the case of an unplanned system failure. The disaster recovery documentation should describe the business continuity needs, the detailed steps and ownership required to recover each component of the system, and the impact of various failure scenarios. When a failure occurs, a good recovery plan minimizes downtime and reduces stress and errors which can result in lost time or data.

Before covering the topic of disaster recovery, it is important to first understand the concept of an operational runbook. A runbook describes all key operational details about an application or environment. It should be informative enough that it can be used by any member of the operations, IT, or development team to quickly understand the architecture of a system. A runbook is often used as a quick reference and should present information in a format which can be quickly identified and digested without excessively verbose content. This is a document that will get referenced on a daily basis and during a 3am off-hours troubleshooting session by an on-call team member. The last thing anyone wants to be doing at 3am is trying to read a dense design document to determine how to access an application or restart a failed service to resolve a monitoring alert.

Key elements of the runbook will likely include:

Application support contacts
Architectural diagrams
Development, test, and production hosts
Services, ports, and log locations
Clean-up and troubleshooting
Disaster recovery plans

The disaster recovery plan is application and environment-specific, and should be accessible from any operational runbook. When an outage occurs, the operations team should be able to start the investigation by referencing the application runbook and then proceed to the disaster recovery plan if it is determined that a component of the system has failed and cannot be repaired. Ideally the plan should be exercised periodically in a test environment to ensure that the steps can be performed correctly and to determine how long each step of the recovery will actually take to perform. Having an estimated recovery time will make it easier to set user expectations when an outage does occur.

When creating a recovery plan, the first step is to define the business continuity needs. These help establish expectations for acceptable data loss, uptime, and contingency plans. For example, for an application which handles non-critical data such as log messages it might be acceptable to lose up to an hour of data as long as application downtime is minimized to 5 minutes during a data recovery situation. In a transactional source control system, on the other hand, it may not be acceptable to lose any data during an outage and the duration of the outage may not be as critical if developers can continue working off-line until the service is restored. Knowing these requirements can help shape architectural decisions about the environment and define the steps required to recover from a disaster in a way which meets the business needs.

The recovery plan should describe each step in the recovery process in detail, including who is responsible for executing the steps, who to contact if assistance is needed, and time estimates for any long-running tasks. Application owners should be responsible for reviewing and signing off on the plan to ensure accuracy and completeness. Executing the plan against a test environment on a regular basis is a good way to ensure that operations is comfortable performing the steps and that there are no errors or omissions in the documentation.

Policy Management

Each software development application has its own representation of users, groups, and projects. The operational challenge is to manage these consistently across all applications. This means that when a development team requests a new project, it must be created across all applications and the appropriate access permissions granted. Implementing an operational tool to manage these administrative operations across all of the applications makes it possible to satisfy many of the key operational requirements such as:

Allowing self-service creation of projects and groups
Enabling auditing and traceability
Enforcing expiration, deletion, and archival policies

A policy management tool provides an abstraction layer that defines the common operations that must be satisfied within each application. These operations include create, read, update, and delete (CRUD) operations in the following areas:

Users and project management
Access controls (granting users access to projects)
Application configuration (authentication, plugins, system settings)
Application upgrades
Moving users or projects across application instances

These abstract operations can be used to enforce data retention policies, provide self-service or streamlined administrative operations, and ensure consistency across all applications. When done well, it also ensures that a single tool and language can be used to manage all applications within the SDLC portfolio. As applications are added, removed, or replaced by new vendors, the tool can be updated to support the new application.
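One way to picture this abstraction layer is as a common interface that each application implements through its own adapter. The Python sketch below is purely illustrative: the `ApplicationAdapter`, `InMemoryAdapter`, and `PolicyManager` names are hypothetical, and the in-memory adapter stands in for code that would call a real application's API.

```python
from abc import ABC, abstractmethod


class ApplicationAdapter(ABC):
    """Operations each managed application must support (hypothetical interface)."""

    def __init__(self, label):
        self.label = label

    @abstractmethod
    def create_project(self, name):
        """Create the project if it does not already exist."""

    @abstractmethod
    def grant_access(self, user, project):
        """Give a user access to an existing project."""

    @abstractmethod
    def list_projects(self):
        """Return the names of all projects."""


class InMemoryAdapter(ApplicationAdapter):
    """Stand-in for a real application; a real adapter would call its API."""

    def __init__(self, label):
        super().__init__(label)
        self.projects = {}

    def create_project(self, name):
        self.projects.setdefault(name, set())

    def grant_access(self, user, project):
        self.projects[project].add(user)

    def list_projects(self):
        return sorted(self.projects)


class PolicyManager:
    """Applies the same administrative operation to every application."""

    def __init__(self, adapters):
        self.adapters = adapters
        self.audit_log = []

    def create_project(self, name, members):
        for adapter in self.adapters:
            adapter.create_project(name)
            for user in members:
                adapter.grant_access(user, name)
            # Record each change to support auditing and traceability.
            self.audit_log.append(("create_project", adapter.label, name))


source_control = InMemoryAdapter("source-control")
bug_tracker = InMemoryAdapter("bug-tracker")
manager = PolicyManager([source_control, bug_tracker])
manager.create_project("payments", members=["alice", "bob"])
```

Replacing a vendor then means writing one new adapter, while the self-service and auditing logic in the manager stays unchanged.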

Creating this type of policy management interface also makes it possible to scale the applications horizontally. For example, if each team requires its own application instance for scalability or security reasons, the management tool can help manage each application instance and ensure that it is configured correctly to interact with the other applications.

Event Stream Processing

Once you start down the path of metrics collection and log aggregation, you’ll soon find yourself in a predicament. Now that the data is flowing fast and furious, how can you ensure that it gets to the right location? What if there are multiple locations where you want to send the data simultaneously, or what if you need to transform the data while it’s in transit? And how do you take these systems off-line for maintenance and upgrades without disrupting the flow of data?

As event messages flow in from various sources, it is often desirable to take action on the events as they arrive. An example of this might be to send a notification when a log message contains a specific error string or pattern. A more sophisticated use case might be to monitor the volume of messages (throughput) and take action when the volume exceeds a defined threshold.
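Both use cases can be sketched with a small amount of state. The `EventMonitor` class below is a hypothetical example, not part of any streaming platform: it flags messages matching an error pattern and uses a sliding time window to detect when message volume exceeds a limit.

```python
import re
from collections import deque


class EventMonitor:
    """Flags error patterns and excessive message volume (illustrative only)."""

    def __init__(self, error_pattern, max_events, window_seconds):
        self.error_pattern = re.compile(error_pattern)
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.recent = deque()

    def observe(self, timestamp, message):
        """Return the list of alerts raised by this message."""
        alerts = []
        if self.error_pattern.search(message):
            alerts.append("pattern")
        self.recent.append(timestamp)
        # Discard timestamps that have fallen out of the sliding window.
        while self.recent and self.recent[0] <= timestamp - self.window_seconds:
            self.recent.popleft()
        if len(self.recent) > self.max_events:
            alerts.append("throughput")
        return alerts


monitor = EventMonitor(r"OutOfMemoryError", max_events=3, window_seconds=10)
results = [
    monitor.observe(0, "build started"),
    monitor.observe(1, "java.lang.OutOfMemoryError: Java heap space"),
    monitor.observe(2, "build finished"),
    monitor.observe(3, "build started"),
    monitor.observe(20, "build started"),
]
```

The fourth message is the fourth within ten seconds, so it triggers the throughput alert; by the fifth message the window has emptied again.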

Another form of event processing is augmenting or transforming the data in real time. For example, IP addresses might be used to perform GeoIP mapping to augment the message data with the country or region of origin. Messages might need to be aggregated over a fixed time interval to produce an average value, or grouped together to calculate elapsed time between messages.
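The following Python sketch shows what this kind of enrichment might look like. The region table is a made-up stand-in for a real GeoIP database, and the event field names are assumptions for the example.

```python
import ipaddress

# Hypothetical lookup table; a real deployment would query a GeoIP database.
REGION_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "region-a",
    ipaddress.ip_network("198.51.100.0/24"): "region-b",
}


def enrich(events):
    """Add a region and the elapsed time since the client's previous event."""
    last_seen = {}
    for event in events:
        address = ipaddress.ip_address(event["client_ip"])
        region = next(
            (name for network, name in REGION_BY_NETWORK.items() if address in network),
            "unknown",
        )
        previous = last_seen.get(event["client_ip"])
        elapsed = None if previous is None else event["timestamp"] - previous
        last_seen[event["client_ip"]] = event["timestamp"]
        # Emit a new dict so the original event is left untouched.
        yield {**event, "region": region, "elapsed_seconds": elapsed}


events = [
    {"client_ip": "192.0.2.10", "timestamp": 100},
    {"client_ip": "203.0.113.5", "timestamp": 105},
    {"client_ip": "192.0.2.10", "timestamp": 130},
]
enriched = list(enrich(events))
```

Because `enrich` is a generator, it processes one message at a time and can sit between a consumer and a producer without buffering the whole stream.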

This is where Apache projects like Kafka and Pulsar come to the rescue. I am a huge fan of Kafka. Its power and simplicity make it ideal for centralizing the flow of data and making it accessible to all who care to consume it. These streaming platforms provide a multi-producer/multi-consumer model that can buffer the high volume of log and metric data and open the door to a whole new world of event stream processing possibilities.

There are many benefits of sending log data through a streaming platform rather than directly to a single destination:

Data enrichment
Calculating geographic location, elapsed time between messages, or joining multiple lines into a single message are all examples of data manipulation or enrichment that can occur in real time as log messages pass through the message queue.
Multi-consumer Replication
When using a message queue that supports multiple consumers, it becomes possible to consume messages from a production message queue into both a production and a test environment. This makes it possible to test upgrades to message consumers and their corresponding infrastructure, reproduce production data volume, or simply replicate data in a test environment without changing or impacting production.
High Availability
Applications like Kafka are designed to scale horizontally. They are clustered by nature and can be upgraded node-by-node without taking the cluster off-line. This ensures that messages keep flowing even during scheduled maintenance.

Next Steps

There are a lot of tools and technologies available to help improve the operational management and transparency of any application environment, large or small. The first step is to identify your key objectives. Try to select technologies that complement your skill set and existing infrastructure. If you manage an enterprise environment that consists primarily of Linux virtual machines, then start with something like Ansible to help automate the deployment and configuration of the other components. Focus on quick wins and simplified solutions. Get monitoring and metrics collection in place, along with a visualization tool like Grafana, to help deliver transparency that can be easily understood by others within the organization. More advanced topics like policy management and event stream processing can be tackled later once more immediate needs have been addressed.

Transparent Software Development

Shawn Stafford — Mon, 19 Nov 2018 17:34:04 GMT

Mount Baker, North Cascades — Photo by Andy Porter

I’ve spent my entire Release Engineering career chasing transparency. It started with the idea that we needed more transparency in the build and unit test process, so we collected data about these events and created a UI to visualize them. It later expanded into the traceability of the software development process as we tried to link software requirements to commits, commits to defects, and so forth. Host level monitoring, metrics collection, event messaging, and log aggregation all followed that same theme: collect the data, surface the data, utilize the data.

Only recently have I started to realize that each attempt at transparency has suffered from a lack of vision and cohesion. It’s a classic “forest for the trees” problem of implementing a singular solution to solve an immediate problem without understanding how it fits into a larger ecosystem. When we collected build event data, we sent it straight to a database because that was the tactical need. When we linked commits with defects, we configured the source control system to talk directly to the bug tracking system because that was the tactical need. After years of making incremental improvements in transparency, I feel I can take a step back and reflect on the forest. I hope that by talking about the variety of tools that comprise the Software Development Lifecycle (SDLC), you as the reader can agree that visibility matters.

Software Development Lifecycle Applications

When I talk about SDLC, I’m referring to the process of producing software. I’m referring to the collection of applications that facilitate software production. It’s requirements tracking, source control, continuous integration, bug tracking, and test case management. Ideally these applications all function together to provide greater transparency and traceability. There are commercial solutions such as IBM Jazz (formerly Rational) or HP ALM, but these can be incredibly expensive, difficult to implement, and seldom provide the best user experience. The goal should be to create a loosely coupled system where applications can be added or removed quickly, or even run in parallel, as the needs of the organization change.

The challenge with transparency is that it takes effort, especially when the applications are not designed to integrate with each other. Some vendors attempt to provide a complete SDLC solution which meets all of the needs of a developer, but they are expensive and seldom provide a satisfying user experience. Other applications may provide hooks and plugins to integrate with other vendors. This often limits you to a very specific combination of applications that happen to integrate in very specific and pre-defined ways. What developers really need is the ability to select the best-of-breed applications that specifically suit their development process.

Although software development is not a uniform process across all teams, there needs to be a high degree of transparency and traceability throughout the process. Even within a single company, different projects or organizations may have different needs. Developer requirements, corporate acquisitions, and process evolution can all result in a multitude of overlapping applications and tools. Standardization is ideal, but seldom achievable. The solution to that problem is information sharing. The idea that applications should expose all their events in an open and transparent manner is what I’ll refer to as “transparent SDLC.”

In simple development environments, it is often possible to configure the applications to communicate directly with each other. The source control system might send commit information to the defect tracking system, establishing links between commits and defects. The continuous integration system might poll source control for changes which trigger actions. Although this is a step in the right direction, it creates a tightly coupled system that can be brittle and prone to failure. Taking a single application off-line for maintenance can disrupt the entire development process, or result in missed events which never get sent to their destination.

Direct Messaging = Tight Coupling

How can we get to that happy place where the development process is highly transparent, but the applications are loosely coupled? The solution is messaging. With the rise of big data, streaming platforms like Kafka or Pulsar have become the hub for centralizing and distributing data. By emitting event messages from each application, it is possible to create a loosely coupled environment with a high degree of transparency. The key is selecting a platform which allows multiple producers and multiple consumers to access the topics or message queues.

Message Publishing = Loose Coupling

Once applications begin emitting event messages, it creates an open transaction log that can be consumed by any application, user, or process which is interested in acting on those events. For example, a message consumer might look for source control commits, parse the message for defect or requirements IDs, and add a message to the defect or requirements tracking system with the commit information. This approach simplifies the maintenance of the producing applications because they only need to emit messages. The code for consuming the events and taking action can be deployed independently of any of the applications it integrates with. This has a multitude of benefits such as:

Individual applications can be taken off-line without impacting any other applications by allowing producers and consumers to take action at their own pace.
Integration complexity is managed outside of the applications, using public APIs such as REST calls to interact with each application. This makes it possible to configure, deploy, and update the integration without impacting upstream or downstream applications.
Multiple clients can consume a message and take action. This allows the same message to trigger events in different applications, or to be replicated to a test and production environment simultaneously.
Anyone with access to the streaming platform can implement their own consumer, performing actions or analysis based on their needs.
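As a sketch of the commit-linking consumer described earlier, the Python function below extracts defect identifiers from a commit event. The event fields and the `BUG-1234` style of identifier are assumptions for the example; a real consumer would read events from the streaming platform and post the resulting links to the defect tracker through its public API.

```python
import re

# Assumed identifier format such as "BUG-1234"; adjust for the tracker in use.
DEFECT_ID_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def links_from_commit_event(event):
    """Return one link record per distinct defect ID found in the commit message."""
    defect_ids = sorted(set(DEFECT_ID_PATTERN.findall(event["message"])))
    return [
        {"defect_id": defect_id, "commit": event["commit"], "author": event["author"]}
        for defect_id in defect_ids
    ]


event = {
    "commit": "a1b2c3d",
    "author": "dev@example.com",
    "message": "Fix null check (BUG-1234), follow-up to BUG-1200 and BUG-1234",
}
links = links_from_commit_event(event)
```

Because this logic lives in the consumer, neither the source control system nor the defect tracker needs to know the other exists.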

With the appropriate messaging platform and format in place, a world of possibilities is now available with minimal additional effort. In addition to application events triggering integration events with other SDLC applications, the events themselves can be analyzed for trends or errors. Events can be streamed to Elasticsearch and visualized in Kibana. Metrics can be collected in a time series database and presented with Grafana. With a little elbow grease, messages in the OpenTracing format can even be consumed by applications like Haystack to represent complex visualizations of an end-to-end process.

Message Consumers

Messaging

In a complex software development environment, it is often difficult to predict or manage integration across multiple applications. The applications from various vendors may not support direct integration with other applications in the environment, or technical and architectural challenges may prevent the use of integration points. In order to achieve a loosely coupled architecture, a streaming platform can be used to receive events from each of the SDLC applications.

When properly implemented, event messages from each SDLC application will be published to a central messaging platform and any interested applications can consume those messages and perform the appropriate actions. For example, a change management system (source control) might publish an event each time a new change is committed to the system. A continuous build environment may listen for these changes and initiate compilation of the code to produce a consumable artifact. Similarly, a defect management system may examine the changes for references to defect identifiers and establish the appropriate links between a source code change and the corresponding defect. While not as convenient as direct integration between applications, the loose coupling provides a much greater degree of flexibility.

In addition to application events, the messaging platform can also be used for a wide variety of application data such as log messages and metrics. By writing logs and metrics to the message queue, the data becomes available to multiple message consumers. This allows the data to be consumed into a time series database or indexed by a search engine and exposed to the appropriate users. Standard third-party tools can be used in the infrastructure to provide transparency and automation on a scale that might not otherwise be possible in a tightly coupled environment. Users can begin to analyze and interact with information in ways that meet their needs without the need to customize or impact the SDLC applications directly.

Message Producers and Consumers

Some key requirements of a centralized streaming platform are:

High throughput
As the number of message producers and the volume of messages both increase, the platform must be capable of handling high message volume. This is particularly true for log aggregation, which can have an extremely high message volume.
High availability
Applications will constantly be producing and consuming messages. Any outage or disruption may result in a loss or delay of data. The streaming platform must be designed with clustering and high availability in mind. Even upgrades of the platform itself must be possible without requiring an outage or downtime. When the message platform is highly available, it takes the burden off of the consumers to be highly available as well. Consumers can be taken offline for brief periods of time for upgrades or maintenance and can catch up on missed messages when they come back online.
Multiple producers and consumers
A cornerstone of this loosely coupled architecture is the ability for multiple producers to write to the same message stream, and multiple consumers to consume from the same stream. Supporting multiple producers allows distributed applications to merge their data into a single location. Supporting multiple consumers allows the data to be consumed from a single stream and used by a variety of applications. For example, the environment might currently publish performance metrics from each of the application hosts to a single message stream and then consume them into a time series database, but in the future a new consumer could be created to send the data to an alternate database in order to evaluate a new software project or use case.
Event stream processing
It may be necessary to analyze log messages in real-time to look for errors or patterns and then take action, such as triggering an alert or a self-healing action. Messages or logs may need to be parsed or enriched before being sent to their final destination. Event stream processing allows the log messages to be processed independently by different consumers. For example, HTTP access logs might be augmented with GeoIP data and persisted in Elasticsearch, and the metrics from those same logs might be aggregated and stored in a time series database.
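The multi-producer, multi-consumer requirement and the ability of a consumer to catch up after downtime can both be illustrated with a toy in-memory log. This Python sketch is a conceptual model only: the `EventLog` class is invented for the example and does not reflect the API of Kafka, Pulsar, or any other platform.

```python
class EventLog:
    """Toy append-only log in which each consumer tracks its own read position."""

    def __init__(self):
        self._messages = []
        self._offsets = {}

    def publish(self, message):
        """Any number of producers may append to the same stream."""
        self._messages.append(message)

    def poll(self, consumer):
        """Return messages the named consumer has not yet seen."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:]
        # Advance only this consumer's offset; others are unaffected.
        self._offsets[consumer] = len(self._messages)
        return batch


log = EventLog()
log.publish({"host": "app-1", "cpu_percent": 41.0})
log.publish({"host": "app-2", "cpu_percent": 77.5})

first_batch = log.poll("time-series-db")

# A third message arrives while the search indexer is offline for maintenance.
log.publish({"host": "app-1", "cpu_percent": 43.2})

catch_up_batch = log.poll("search-indexer")
second_batch = log.poll("time-series-db")
```

The search indexer receives all three messages when it comes back online, while the time series consumer receives only the one it has not yet seen.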

Log Aggregation and Metrics Collection

In order to provide transparency to the health and capacity of any environment, the Operations team must have real-time access to information. For example, host metrics such as memory or CPU utilization can provide insight into the load that a system is currently under. For web-based applications or systems which make heavy use of REST API calls, analysis of the HTTP request logs can provide valuable insight about the number of requests or response times, which could help correlate performance degradation to user activity.

By collecting this data and exposing it through application dashboards, the Operations team can understand the application load and performance characteristics over a period of time. Creating dashboards from key logs and metrics allows the information to be presented in a meaningful and readily understandable format that can be used to make informed decisions about how to improve application performance, reliability, and scalability.

Application Logs

Logs are a valuable resource for understanding application behavior and usage patterns. When logs are collected from the various hosts into a central location and parsed into structured documents, they can be indexed and later presented in a graphical dashboard to provide a meaningful view of the data. To use a specific example, an HTTP request log is typically composed of the following pieces of information:

Client information (IP Address, client software)
Requested URL
HTTP Response Code
Response time and/or response size

When indexed and presented as a dashboard, this information can be used to answer complex operational questions such as:

What IP address or geographic region generated the most requests?
How many requests took longer than 5 seconds?
How frequently did a particular error or response code get returned?
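Once the logs are structured, each of these questions reduces to a simple aggregation. The Python sketch below answers them over a handful of invented records; the field names are assumptions, and a search engine such as Elasticsearch would run equivalent aggregations over the full index.

```python
from collections import Counter


def summarize(records, slow_threshold_seconds=5.0):
    """Answer the three example questions for a list of parsed log records."""
    requests_by_client = Counter(record["client_ip"] for record in records)
    return {
        # Client responsible for the most requests, with its request count.
        "top_client": requests_by_client.most_common(1)[0],
        "slow_requests": sum(
            1 for record in records
            if record["response_seconds"] > slow_threshold_seconds
        ),
        "status_counts": dict(Counter(record["status"] for record in records)),
    }


# Invented sample data for illustration.
records = [
    {"client_ip": "192.0.2.10", "status": 200, "response_seconds": 0.4},
    {"client_ip": "192.0.2.10", "status": 500, "response_seconds": 6.2},
    {"client_ip": "192.0.2.10", "status": 200, "response_seconds": 1.1},
    {"client_ip": "198.51.100.3", "status": 404, "response_seconds": 7.9},
]
summary = summarize(records)
```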

A well-constructed dashboard can provide answers to common operational questions at a glance.

Kibana Dashboard

Getting Started

The task of establishing a streaming platform just to support your software development process may seem like a daunting and overly complex approach. But you don’t have to implement it all at once. Start by collecting logs and surfacing the data through Elasticsearch and Kibana. Next select a core application such as source control and begin emitting events, then tackle a use case such as establishing links between defects and commits. Once you are able to demonstrate the application of the technology, there will be a multitude of opportunities to expand on its use. And the great part of the streaming platform approach is that it allows subject matter experts to take control of their own applications. The ability to integrate applications and increase transparency is no longer dictated by plugin availability or vendor support; it is available to anyone who can imagine a better way to connect the dots.

Zen and the Art of Application Maintenance

Shawn Stafford — Fri, 16 Nov 2018 03:07:52 GMT

“Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”
- Robert Pirsig, Zen and the Art of Motorcycle Maintenance

Early on in my technical career, a manager recommended I read “Zen and the Art of Motorcycle Maintenance” by Robert Pirsig. I found it to be a thought provoking narrative that deals with the struggle for Quality even though you may not know exactly how to define it. Since then the idea of Quality has been a subconscious part of my decision making process. In this article I’ll talk about the role of Operations in the area of software application maintenance, and try to relate it back to the central theme of Quality. All quotes shown here are taken from Mr. Pirsig’s book.

At its core, the goal of operations is to plan, implement, and achieve productivity, quality, and cost targets. Our job is not to just keep the lights on, it’s to keep them running at peak efficiency. There should be no brownouts, burnt-out bulbs, or dark shadowy corners. When done right, all our hard work makes it appear as if no work were required at all. We are not generally thought of as software architects, but we need to know how everything fits together. Nor are we black box testers who operate on the inputs and outputs of an application. We sit at the pragmatic intersection of design and implementation.

Consider the role of operations when looking at the two pictures below. The patent application on the left is much like an architecture diagram commonly provided by application vendors. It shows an idealized view of how the application should be constructed. Contrast that with the picture on the right, which represents the user’s view of the application. In operations, we fit somewhere in the middle. We need to understand how the machine works and how it is currently used in order to ensure the application functions correctly.

Operational understanding falls somewhere in the middle.

Release Operations Role

“It’s a problem of our time. The range of human knowledge today is so great that we’re all specialists and the distance between specializations has become so great that anyone who seeks to wander freely among them almost has to forego closeness with the people around him.”

Throughout my career in Release Engineering I’ve held various titles, but I consider them primarily to be “operations” roles. I frequently walk the line between developers and IT, using domain and application-specific knowledge to build an infrastructure that facilitates the software development process. For me, Operations has always been a blend of software development, system administration, and grief counselling. We manage the systems, but we serve the users.

The key to providing good operational support is to understand the application and how users interact with it. We must utilize domain and application specific knowledge to ensure that the experience of interacting with the application is as pleasant as possible. That typically involves responsibilities which ensure the smooth operation of a deployed environment:

User and project administration
Application upgrades and roadmap planning
Backup/DR, monitoring, change control, configuration management
Metrics collection, capacity planning, and performance tuning
Infrastructure evaluation and deployment

The challenge in Operations is that we will never be an end-to-end expert in running the application. We must work closely with teams closer to the infrastructure, as well as teams who define the application usage patterns, in order to fill a knowledge gap between those two roles.

Support Levels

“The way to solve the conflict between human values and technological needs is not to run away from technology. That’s impossible. The way to resolve the conflict is to break down the barriers of dualistic thought that prevent a real understanding of what technology is — not an exploitation of nature, but a fusion of nature and the human spirit into a new kind of creation that transcends both.”

In order to develop a sufficiently deep expertise in the breadth of topics required to manage enterprise applications, most organizations will establish some level of specialization. There is frequently an IT organization that understands the core technology infrastructure: networking, hardware, and operating systems. Sometimes there is a separate Operations team which specializes in the health and performance of the application. And hopefully there are subject matter experts (SMEs) who have a detailed understanding of the application. These roles each have specialized knowledge and skills which may not be relevant to the other roles, but the entire set of skills is required to ensure a healthy and successful application deployment.

For ease of reference, we’ll refer to each of these roles by “levels” to denote the fact that different organizations may participate in a role depending on the application or circumstance, and the order they get contacted may vary depending on organization preference.

The following diagram is an example of the overlapping skills between each role:

Skill Overlap Between Roles

When questions or problems arise, it is important that the correct team get engaged while minimizing the amount of uncertainty or duplicate effort. If users complain that the application is down, it is not very efficient to have various members of IT, Operations, and application experts all stop their work to undertake an independent investigation of the problem. Nor does it make sense to have the entire collection of technical experts engaged on every problem. This is where the idea of an “escalation path” comes in handy.

Level 1: Application Infrastructure

Virtualization, container, orchestration, and storage infrastructure
Monitoring, alerting, and change management infrastructure
On-call 1st responders

Level 2: Application Operations

Policy enforcement (access control, data retention and cleanup)
Monitoring definition, metrics collection, and capacity planning
Application upgrades and configuration management
Licensing and operational cost analysis (per user cost)

Level 3: Application Subject Matter Experts (SME)

Application roadmap (upgrades, integration points)
Policy definition (caching, retention policy, access control)
Integration testing

Frequently, application users are not sufficiently aware of the application infrastructure to diagnose problems on their own. Without a defined escalation path, users are left confused and unsure about who to contact when they encounter an issue. With an escalation path in place, the Level 1 support contact can be engaged to perform a validation of the services in their domain before handing the issue off to the next level once they have confirmed that their responsibilities are met.

Application Administration

“Each machine has its own, unique personality which probably could be defined as the intuitive sum total of everything you know and feel about it. This personality constantly changes, usually for the worse, but sometimes surprisingly for the better, and it is this personality that is the real object of motorcycle maintenance.”

A lot of work goes into ensuring that an application functions correctly and performs well. Operations may not have control over how the application is implemented, but we do control how the application is deployed and accessed. Our choices influence application behavior, and our ability to observe and collect data about that behavior determines how well we can make informed choices.

Some of the operational tasks that influence application behavior and performance are:

Integration with other applications (cross-application messaging)
Operating system and application configuration/upgrades
Coordinated upgrades across multiple applications (integration points)
Use of application-specific APIs to perform operational tasks
Project or account management and data retention
Policy enforcement, access control, and auditing
Impact analysis relating to performance or licensing costs

Each of these tasks is required to ensure that the application is managed effectively and within a predictable budget. It is never a simple matter of standing up an application in production and then letting it run unattended. Applications left to rot from neglect will almost certainly fail.

Budgeting

“Who really can face the future? All you can do is project from the past, even when the past shows that such projections are often wrong.”

Budgeting refers to the estimation of operational and capital expenditures for the upcoming 18 to 24 months. It is necessary to provide an accurate estimate of your budget requirements for the next fiscal year so that these costs can be included in the larger budgets being calculated within the organization. It is also important to structure the budget forecasting models so that they can be easily re-calculated based on variations in the number of concurrent projects, number of developers, or type of automated processing that the Engineering organization might wish to undertake. A Release Engineering budget cannot be established in isolation without understanding the software development process that will be used during that fiscal period. In particular, parallel development of multiple product releases can have a major impact on infrastructure requirements.

Architectural and process decisions need to be made with budgeting and cost information in mind. Operations should be able to model the per-user, per-project, or per-host cost of an environment. Storage and infrastructure costs should be factored into any decisions about how to scale out an environment. Even services hosted externally have associated costs, and those costs should be understood because they ultimately have a cost impact on the company. If the infrastructure is provided as a service, the cost of that service needs to be quantified, regardless of whether the service is provided by an internal organization or an external one.
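As a rough illustration of such a cost model, the sketch below divides the monthly cost of an environment by a count of users, projects, or hosts. The cost categories and every figure in it are hypothetical placeholders, not real data; a real model would use whatever categories your finance group already tracks.

```python
from dataclasses import dataclass


@dataclass
class EnvironmentCosts:
    """Monthly costs for one environment (all figures are hypothetical)."""
    infrastructure: float     # servers, network, hosting fees
    storage: float            # disk, backups, archives
    licensing: float          # commercial license and support fees
    external_services: float  # hosted or metered services


def total_monthly_cost(costs: EnvironmentCosts) -> float:
    """Sum every cost category into a single monthly figure."""
    return (costs.infrastructure + costs.storage
            + costs.licensing + costs.external_services)


def cost_per_unit(costs: EnvironmentCosts, units: int) -> float:
    """Cost per user, per project, or per host, depending on what `units` counts."""
    if units <= 0:
        raise ValueError("units must be a positive count")
    return total_monthly_cost(costs) / units


# Hypothetical environment with 200 users working on 16 projects.
costs = EnvironmentCosts(infrastructure=4000.0, storage=1000.0,
                         licensing=2500.0, external_services=500.0)
print(f"per user:    {cost_per_unit(costs, 200):.2f}")
print(f"per project: {cost_per_unit(costs, 16):.2f}")
```

Keeping the inputs as plain numbers makes it easy to re-run the model when the number of users or concurrent projects changes.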

Capacity Planning

“You look at where you’re going and where you are and it never makes sense, but then you look back at where you’ve been and a pattern seems to emerge. And if you project forward from that pattern, then sometimes you can come up with something.”

Capacity planning is the process of collecting operational metrics, aggregating the data over a period of months or years, and then extrapolating a trend into the future to estimate the amount of resources required based on current or planned activity. The more data that can be collected, and the better understood the future operational plans, the more reliable the future estimates are likely to be. Information such as the number of servers, percent utilization, and growth estimates are all variables that will factor into any budget estimate. The data collected through monitoring and metrics collection becomes a critical component of the budgeting process.
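One simple way to extrapolate a trend is to fit a straight line through the historical samples and project it forward. The sketch below does this with an ordinary least-squares fit; the function names and the disk usage figures are hypothetical, and a straight line is only an assumption that will not suit seasonal or step-change growth.

```python
def fit_linear_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project(values, periods_ahead):
    """Extrapolate the fitted line `periods_ahead` periods past the last sample."""
    slope, intercept = fit_linear_trend(values)
    return slope * (len(values) - 1 + periods_ahead) + intercept


# Hypothetical monthly disk usage in GB for the past six months.
disk_usage_gb = [500, 520, 540, 560, 580, 600]
print(f"estimated usage in 12 months: {project(disk_usage_gb, 12):.0f} GB")
```

The projection is only as good as the history behind it, which is why the amount and quality of collected data matters so much.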

Data aggregation is the process of taking granular data such as the load average collected in 1-minute intervals and averaging it together over a longer period such as 1-hour intervals. This “down sampling” of data makes it more efficient to visualize long-term data trends that span many months or years. When estimating activity months into the future, it is critical to have at least as many months of historical data available in order to extrapolate a trend forward. Because many monitoring or metrics collection solutions may not automatically perform this data aggregation, it is important to ensure that any solution put in place has some mechanism to store aggregated data for long periods of time.
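To make the idea concrete, here is a minimal sketch of down sampling that averages 1-minute samples into 1-hour buckets. The function name and the sample data are made up for illustration; most time series databases provide an equivalent feature that should be preferred over hand-written code.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds=3600):
    """Average (unix_timestamp, value) samples into fixed-size time buckets.

    Returns a sorted list of (bucket_start_timestamp, mean_value).
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its bucket.
        bucket_start = int(timestamp) - int(timestamp) % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]


# Two hours of hypothetical 1-minute load averages:
# 1.0 during the first hour, 3.0 during the second.
samples = [(minute * 60, 1.0 if minute < 60 else 3.0) for minute in range(120)]
print(downsample(samples))
```

Averaging hides short spikes, so it can be worth storing the per-bucket maximum alongside the mean when peaks matter for capacity decisions.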

Software Licensing

“The test of the machine is the satisfaction it gives you. There isn’t any other test. If the machine produces tranquility it’s right. If it disturbs you it’s wrong until either the machine or your mind is changed.”

The software development process is typically composed of some combination of commercial and open source applications, and the application selection process is heavily influenced by user experience. When an open source application is sufficient and yields value, keep it. If usability or required functionality can only be found in a commercial application and the licensing cost is reasonable, then pay the license fee if your budget allows it. Always keep the per user licensing cost targets in mind to avoid purchasing a collection of “best of breed” commercial solutions that break your budget once they are all assembled.

Each commercial product will have unique licensing terms that determine licensing and support costs for the purchase and ongoing use of the software. These licensing terms may be based on the number of users, number of instances of the software, or the operational environment where it runs (e.g. number of CPUs, amount of RAM, etc). It is critical that the operations team fully understands the current licensing to ensure that the most cost effective architectural and operational decisions can be made. Operations should be involved in the early stages of any new application deployment, and should have full access to any quotes or purchase orders that relate to the application or environment.

Although open source software may not have a licensing cost associated with it, there will likely be a higher operational cost due to the lack of paid support or available reference material. Depending on the software project, the operational staff may need to invest more time acquiring knowledge, developing customizations, or researching problems. This operational cost should be estimated and factored into any budgeting or cost analysis. The same is true for internally developed tools and applications.

In addition to the application licensing, there may be licensing costs associated with the operating systems or hardware as well. Selecting the appropriate operating system or product line is also an important aspect of budgeting and optimizing. Unless they are included in the infrastructure cost, these underlying licensing costs need to be considered from top to bottom through the entire stack in order to accurately establish a cost model.

Infrastructure Cost

“Mountains should be climbed with as little effort as possible and without desire. The reality of your own nature should determine the speed. If you become restless, speed up. If you become winded, slow down. You climb the mountain in an equilibrium between restlessness and exhaustion.”

Infrastructure can become an ever expanding domain of cost, time, and complexity. No matter what you are trying to accomplish, there will always be more you could do to improve the infrastructure. At a certain point you will need to come to terms with the fact that infrastructure selection and cost will ultimately be a pragmatic decision based on cost and available resources.

Infrastructure usually implies the storage, network, and servers that make up a software development environment. When hosted internally, that may include everything from hardware to cooling and power costs. When infrastructure is purchased as an external service, that cost may be metered based on application usage patterns or computing power. Whatever the source or combination of sources, it is important to have some way to quantify the current operational costs, capture representative metrics, and extrapolate trends forward to include in budgeting and capacity planning estimates.

Estimating future infrastructure requirements is especially important when hosting the infrastructure internally because there may be physical limitations to the amount of infrastructure that can be deployed. Data centers require proper power and cooling, and although server density is constantly increasing, it is quite common to quickly outgrow a hosting facility without proper planning and a firm grasp of the physical resources required by the infrastructure. If this infrastructure is managed by other groups, it is critical to establish close working relationships with those groups and convey the need for Operations to understand and participate in infrastructure planning.

Conclusion

“When one isn’t dominated by feelings of separateness from what he’s working on, then one can be said to “care” about what he’s doing. That is what caring really is, a feeling of identification with what one’s doing.”

Operations exists in the “empty spaces” and fills the void left between application deployment and application usage. An Operations engineer who “cares” will expand to fill as much of the empty spaces as possible. They will expand into the lowest levels of the infrastructure until they fully understand how the application runs. They will expand upward into the user domain in order to more fully understand how and why the user does the crazy things they do. And in caring about what they do, ultimately they become more in tune with the application and the users they support.

Building a Better Ops Runbook

Shawn Stafford — Mon, 12 Nov 2018 23:10:00 GMT

What to do when it’s 3am and the servers are melting down

A runbook is an operational reference which is used to describe an application in a deployed environment. It should be easy to read, consistent across all applications, and accurate. This is the document an on-call responder would refer to at 3am when a SEV1 alert wakes them up, so it should be as straightforward and to-the-point as possible. Although this article assumes that there is a dedicated Operations team, it is equally useful for DevOps teams, system administrators, or just a plain old developer who needs to understand the deployment environment. The runbook is also useful when auditing an application environment to make sure that the appropriate monitoring, backup procedures, or security policies are in place.

It doesn’t matter where you store your runbooks, just make sure they are easy to find, read, and edit. Usually that means putting them in whatever wiki your team or organization currently uses. However, any tool that fits your regular workflow is usually the right tool for the job. I have provided a sample runbook in Markdown format and hosted on GitHub. This will give you a complete example that you can print out or reference later. The rest of this post will discuss each section of the runbook in detail.

Runbook Inventory

Knowing where to look is half the battle. A runbook inventory page provides a landing page with links to each runbook and a summary of whether each class of requirements is documented by the runbook. Having an overview makes it easy to locate all of the runbooks for the entire organization and also calls attention to any gaps in the infrastructure that might require extra attention.

Runbook Inventory

In addition to the landing page, leverage features of the operating system to direct users to the correct documentation. For example, if the applications are deployed on Unix hosts, the “message of the day” file can help ensure that admins know exactly where to look:

Message of The Day (/etc/motd)

Tip: Generate a MOTD file for each system. Figlet can be used to generate the ASCII word art. The giant “punch you in the face” font size helps ensure there’s no question about what system you’re logged in to.

And now, let’s get into the details of a runbook…

Anatomy of a Runbook

Each section listed below describes a section of the runbook. Refer to the sample runbook to get a better idea of how it looks when completed.

Support Contacts

A runbook should have contact information for at least one primary contact at each level of support. That “contact” might be a team of people or it might be a single individual, but the contact list should contain enough information to make initial contact or look up the full contact information in a company directory. If the application is supported by a team, contact information might be an e-mail alias, a ticket queue, or a support hotline. For individuals, it might be their cell phone. The table below is one example of a simple contact table.

Application Contact List

Support is often provided in tiers or levels. For example, Level 1 support might receive all initial reports. Their job would be to validate that the host is accessible from the network and basic services are available. This is most often the role of an IT organization’s on-call staff. Level 2 support would provide more application-specific operational support. They have some understanding of the IT infrastructure but they also have a deeper understanding of the application and can review logs, investigate performance concerns, and troubleshoot application issues. Level 3 would be the application experts, the people with the most authoritative understanding of the application but also the most costly to contact.

In addition to providing support, members of the contact list are also users of the runbook and should be responsible for reviewing it for correctness. Each member of the contact list should review the runbook on a regular basis (perhaps yearly) and sign-off to confirm that the information is correct and sufficient to allow other members of support to handle incidents.

Overview

The overview section provides a general description of the application. It provides enough information for someone unfamiliar with the application to understand what it is used for and how to find additional information if necessary. It should provide additional links such as:

Links to the application website
Vendor information and vendor support contacts (if applicable)
General license information and renewal dates
Links to any internal documentation or project pages

Architecture

The architecture diagram shows the hosts and services which compose the application environment. It should provide enough information to be useful for audiences such as system administrators, network administrators, or anyone who might need to troubleshoot an alert or outage.

Architecture Diagram

Hosts

The host list contains all hosts that make up each application environment. This will allow the reader to know exactly what role each host plays, which are required for the application to function, and any external aliases that might be used by clients. It also helps to group the entries by environment so it’s clear which hosts are used for production, test, or development.

List of Application Hosts

Network

The network table describes all of the network ports that are used by the application. At a minimum this should provide a list of services and the ports and protocols that they listen on. This can be useful when working with the network team to define firewall rules, or when establishing external monitoring to check application health.
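A documented port list also lends itself to a quick automated check. The following sketch tries to open a TCP connection to each documented service and reports whether it succeeded. The service names, addresses, and ports are hypothetical stand-ins for entries from a runbook's network table, and the check covers TCP only, so UDP services would need a different approach.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(services):
    """Map each service name to whether its documented TCP port accepts connections."""
    return {name: port_is_open(host, port)
            for name, (host, port) in services.items()}


if __name__ == "__main__":
    # Hypothetical entries; replace with the hosts and ports from your runbook.
    documented = {
        "web-ui": ("127.0.0.1", 8080),
        "database": ("127.0.0.1", 5432),
    }
    for name, is_open in check_services(documented).items():
        print(f"{name}: {'open' if is_open else 'unreachable'}")
```

A successful connection only shows that something is listening on the port, not that the application behind it is healthy, so this complements rather than replaces application-level monitoring.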

List of Network Ports and Protocols

Directory Locations

When troubleshooting an application issue, investigation usually involves reviewing the logs and checking the application configuration. For applications which store data on the filesystem, it is also useful to know where the data files are located. This can help the operations team identify where cleanup may need to be performed or storage increased when a monitoring alert is received.

List of Key Application Directories

Monitoring

The monitoring section should define all of the services and resources that need to be monitored and what actions to take if an alert is triggered. This can be used to ensure that monitoring is complete and that resolution steps have been documented.

Monitoring Information

Hosts should be grouped by function, with direct links into the monitoring system if possible. Monitoring which is specific to that service should be documented, including the monitoring severity (how urgently someone needs to respond) and the type of action that can be taken to resolve the alert. For simple cases, it may be enough to state, “Check logs, restart the service.” However, in more complex situations such as a disk space issue, the reader will need to know what actions can be taken to resolve the issue. The resolution should contain direct links to documentation which describes detailed steps for resolving the alert.

The severity classifications may have specific meaning or a service level agreement (SLA) within your organization, so it’s generally best to use the agreed upon terminology within the runbook and then provide links to the internally recognized definition for the novice reader.

Metrics

This section of the runbook describes how metrics are being collected, along with links to the appropriate dashboard(s). In particular it is important to document what metrics collection agents are in use and where or how they ship their data. It may be beneficial to document this as an additional service entry in the Network and Directory sections mentioned above, or to provide links to generic internal documentation if the collection is standard across all hosts in the organization.

Grafana Dashboard

It’s worth noting that metrics and monitoring may not be the same thing. Although you may use a system like Prometheus to provide both metrics and monitoring, it is also possible that the long-term storage of these metrics is handled by a separate time series database. For example, data may be collected by Prometheus, but then shipped off to TimescaleDB/Grafana for long-term (aggregated) storage to be used for capacity planning and budgeting.

Log Aggregation

Similar to metrics, log aggregation is often a common function that is implemented across all servers in an organization. Enough information about the collection agent, destination, and application log formats should be included in the runbook.

Kibana Discovery Page

Direct links to the log aggregation web interface should be provided whenever possible, including links to commonly used saved searches. Any commonly run queries should be documented here, along with a brief description of how and when they can be used. Anything that makes it easier for Operations to identify issues or narrow their investigation will save time during an outage.

Access Control

Most applications will implement some sort of authentication and access control to ensure that only valid users have access to information that is appropriate for their role. At a minimum, this section should describe how the application is configured to perform access control. For example, it might provide the LDAP connection information, location of the configuration, and any special roles or permissions required for administration of the application.

The objective of this section is to make it quick and easy for Operations to identify what could have gone wrong with the system if someone complains that they are not able to authenticate or do not have access to the necessary resources. It should also identify what group of administrative users can be contacted if special permissions are needed to investigate an issue.

Backup and Recovery

This runbook section describes the disaster recovery (DR) processes that are in place to ensure the system or data can be recovered in the event of an unexpected failure. At a minimum it should describe any automated backup procedures, the frequency and times they run, and the data retention policies for archived data. Be sure to provide links to any detailed DR plans which will be used to restore the system during a catastrophic outage or data loss.

How to establish a disaster recovery plan is beyond the scope of this article, but there are plenty of resources available which describe such documents. Refer to Top 10 Free Disaster Recovery Plans or type “Disaster Recovery Plan” into your favorite search engine to get more information.

Maintenance and Cleanup

Applications which receive or produce data often have automated cleanup processes that remove obsolete data to ensure that the system continues to perform well over time. For example, a time series database might have a process which deletes data older than 30 days, or a binary repository might purge artifacts that conform to a specific set of rules. This section should describe those automated processes and the rules that determine what they delete.
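As an illustration of an age-based cleanup rule, the sketch below finds files older than a retention period and either reports or deletes them. The function names, the 30-day default, and the use of file modification time as the age criterion are all assumptions for the example; a real application may track age in its own metadata instead.

```python
import time
from pathlib import Path


def find_expired_files(directory, max_age_days=30, now=None):
    """Return files under `directory` last modified before the retention cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(path for path in Path(directory).rglob("*")
                  if path.is_file() and path.stat().st_mtime < cutoff)


def purge_expired_files(directory, max_age_days=30, dry_run=True):
    """Delete expired files, or only report them when dry_run is True."""
    expired = find_expired_files(directory, max_age_days)
    for path in expired:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink()
    return expired
```

Defaulting to a dry run is a deliberate choice here: it lets an on-call responder see exactly what a cleanup would remove before committing to it.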

When a disk alert is received from your monitoring system, this section should provide instructions about what actions can be taken to provide immediate short-term relief. If the filesystem is 100% full it may be necessary to take immediate action to cleanly shut down the application, increase the storage, and bring the application back on-line. In other cases, it may be possible to clear caches or execute cleanup scripts to bring disk, memory, or CPU usage back under control. Documenting how and when these cleanup activities should be executed will save critical time when responding to system alerts.

Application Tuning

Application tuning can take many forms. In the Java world, it is typically a set of JVM arguments that define the memory limits or the garbage collection strategy. In the database world it may be a set of configuration parameters that define the number of concurrent network connections, long running query restrictions, or other characteristics. This section should provide enough information for the reader to understand where and how those parameters can be changed, as well as any rules of thumb for how they can be tuned for this application to resolve common issues.

For example, if the application owners have developed guidelines for how to optimize the memory allocation based on the number of users, concurrent requests, or other observable data, that calculation can be provided here to provide the Operations team with some guidelines for what is or is not appropriate.

An Operations Runbook can take many forms, but the most effective ones are the ones that are readily available and easily understood. Remember, these documents are used in periods of extreme stress when the application or the infrastructure is in a bad state. The last thing anyone has time for is reading manuals or hunting around the filesystem looking for clues. Runbooks should be clear and concise reference materials. Keep them short and consistent across all applications. The more predictable the format, the better.

Understanding the Software Development Pipeline

Shawn Stafford — Wed, 07 Nov 2018 02:48:12 GMT

Understanding the Software Pipeline

Software development is a highly collaborative process which requires customer input, planning, development, and product testing. The application infrastructure which underlies the Software Development Lifecycle (SDLC) should support this collaboration and provide traceability and transparency. These tools should provide a foundation for taking software from the initial idea all the way through to release. In order to establish traceability, each phase of the development process should capture structured data which can be linked to the other phases of development.

The diagram below illustrates the interconnected nature of the software development process. Once an idea is formalized into a requirement, it must then be implemented in software and packaged for use, before finally being tested and released to a customer. Each of these phases may require unique software capable of managing the information associated with that stage of the software lifecycle. It is this interconnected feedback loop which ultimately forms the relationships between each phase and which provides transparency to the development lifecycle.

SDLC Applications

Establishing links between the software development phases is a critical requirement for establishing traceability and transparency. As business analysts meet with customers and collect software requirements they must capture those requirements. Developers must then implement each requirement to ensure that the software product meets the needs of the customer. Testing organizations must test the software against these requirements to ensure that the product was implemented as requested and functions correctly. During each phase of development, information about the project must be represented as structured data, and each piece of data must have a relationship with information in other phases. The integration between these software development tools provides visibility into the scope and progress of a project within the software development lifecycle.

Knowledge Management

From the initial conception of a software product, there is some descriptive representation of how that project will be developed, where it will reside, and who will participate in its development. These ideas and organizational pages provide a collaborative platform for members of the project team to publish information and keep track of project information. This is often the least structured and least integrated portion of the software development process, but it is still important to have a central repository where project knowledge can be easily shared and updated.

Requirements Management

Software requirements are the definition of how a software product should behave. Requirements come from many different sources. They can come from customer input, product management, software developers, or domain experts. They should represent a description of a product feature or behavior that is meaningful to an application user. These requirements may have varying priorities or stakeholders, and software implementation should seek to satisfy these requirements as faithfully as possible. These are the requirements by which the software will be judged for correctness and completeness.

Change Management

Software changes are typically tracked in source control. The source control system provides a historical progression of the software as it evolves over time. Each of these software changes can represent a new feature which satisfies a product requirement, a bug fix which corrects undesirable behavior, or other improvements or alterations to the source code. Establishing links between each change and the corresponding requirement or defect that initiated the change allows for traceability forward or backward through the development lifecycle. Enforcing these links is a critical role of commit hooks within the change management system.
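The core of such a commit hook is usually a check that the commit message references a tracked item. The sketch below assumes a Jira-style key such as "PROJ-123"; that key format and the function names are illustrative assumptions, so adapt the pattern to whatever your tracking system uses.

```python
import re

# Assumed convention: issue keys look like "PROJ-123" (a Jira-style key).
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def referenced_issues(commit_message):
    """Return all issue keys mentioned in the commit message, in order."""
    return ISSUE_KEY.findall(commit_message)


def validate_commit_message(commit_message):
    """Return an error string if no issue key is referenced, otherwise None."""
    if not referenced_issues(commit_message):
        return "commit message must reference a requirement or defect, e.g. PROJ-123"
    return None


print(validate_commit_message("Fix null check in parser"))
print(referenced_issues("PROJ-42: fix null check (see also OPS-7)"))
```

In Git, a commit-msg hook receives the path to the message file as its first argument, so a hook script could read that file, call validate_commit_message, and exit with a non-zero status when an error is returned. A pattern match alone cannot confirm that the referenced item exists (text such as "UTF-8" would also match), so a stricter hook would look the key up in the tracking system.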

Build and Release Management

The build and release process is responsible for translating the software source code into an executable or installable piece of software. It often includes compiling the source code, identifying external software dependencies, and packaging the software into a format that can be distributed and installed by the software consumer. One of the key objectives of build and release management is to produce a complete record of what went into the production of the software product. This record helps to establish the link between what the developers produced and the software received by the customer. Ideally it should be possible to use this record to identify the precise contents of the software product and to reliably reproduce the software in order to perform updates or debug issues.
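One lightweight form of build record is a manifest that captures the source revision, the declared dependencies, and a checksum for every produced artifact. The sketch below is one possible shape for such a record; the field names and function names are hypothetical, and real build tools often generate richer metadata of their own.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(artifact_dir, source_revision, dependencies):
    """Describe a build: source revision, declared dependencies, artifact checksums."""
    root = Path(artifact_dir)
    artifacts = {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*")) if path.is_file()
    }
    return {
        "source_revision": source_revision,
        "dependencies": dict(dependencies),
        "artifacts": artifacts,
    }


def manifest_json(manifest):
    """Serialize with sorted keys so identical builds yield identical text."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Storing the manifest alongside the released artifacts makes it possible to verify later that a given file is the one the build actually produced.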

Defect Management

By its nature, software is inherently complex and therefore flawed. Developers may misinterpret requirements, make logical mistakes, or simply fail to account for all possible outcomes. Identifying and tracking software defects is a critical part of the software development process. The defect tracking system can often serve as a barometer for the product quality or the amount of effort required to resolve any known issues. It is also a link to gaps in the product requirements or errors in the software code. The defects help identify what errors have been found, which product versions they have been found in, and the product versions where they may or may not have been fixed.

Test Case Management

A test case is a definition of what tests need to be conducted against a specific product version. The test case management system is an invaluable tool for testing organizations to determine how much work is required to validate the quality of a software product. It provides a record of test execution and outcome for each release of a software product. Establishing links between the test cases, released artifacts, and product defects provides insight into the effectiveness of product testing and the amount of test coverage achieved for each product release.

Integrated Solutions

Although it is quite common for organizations to select a variety of different vendors or open source applications for their software development pipeline, there are many commercial vendors who provide a suite of integrated applications which cover many aspects of the development lifecycle. Few of these vendors provide full coverage of the application lifecycle, but some cover a significant portion of it. Popular vendors in this area include Atlassian, JetBrains, GitHub, and GitLab.

Atlassian

Most commonly known for their Jira defect tracking software, Atlassian has acquired several other software development applications and integrated them into their product portfolio. Jira’s plugin architecture allows it to fill multiple roles such as defect tracking, requirements tracking, and test case management. As with many applications, its strength continues to be the defect tracking functionality on which it was founded.

JetBrains

JetBrains gained popularity with its widely used IntelliJ development editor, which continues to be its strength. However, it also offers YouTrack, a defect tracking and project management application. Its TeamCity application is a very comprehensive alternative to Jenkins, and Upsource provides code review functionality on top of a variety of source control systems.

GitLab and GitHub

Although most people think of GitLab and GitHub as source control applications, they offer a comprehensive set of features that covers much of the software development lifecycle. Rather than creating or acquiring a collection of loosely coupled applications, both have created a fully integrated set of features that includes code review, wikis, continuous integration, and defect tracking.

Choosing Wisely

With so many options to choose from, how can an organization make the best choice for long-term success? First and foremost, approach the product selection process with a comprehensive strategy. Allowing developers to select the source control system, testers to select the test management system, and project managers to select the requirements tracking system just leads to a hodge-podge of poorly integrated applications that cost far more than necessary. In modern software development, cross-functional teams need to work closely together and the supporting applications should do the same. Establishing a small, cross-functional team of process champions to participate in the selection and evaluation process can often be the best approach for selecting a good fit for your organization.

Establish Priorities

Make sure that a clear set of priorities has been defined to help narrow the options. If strong cross-team collaboration is a high priority, then a highly integrated application may be the best choice. If your organization is highly developer-centric, then products with strong IDE or tool integration may be best. It is important to understand where users will spend a significant amount of their time in order to select a product that will feel the most natural.

User Preference

Have users evaluate the application in the context of the entire software development process. Is there a natural flow or progression from requirements gathering to implementation to release? Can users from various teams intuitively understand the interface and how to navigate through a project? If the users get lost trying to navigate through the application, it probably has usability issues that will lead to frustration and resentment during initial adoption.

Application Cost

The following diagram illustrates the approximate annual licensing cost per engineer, assuming a department of 500 engineers. Keep in mind this only represents application licensing costs. Even though the open source projects are “free” to use, it takes an operations staff to run an internally deployed environment. So if you save $100k on licensing costs but need to hire two full-time engineers to support the environment, that operational support needs to be factored into the budget.

There are a huge number of variables that need to be accounted for when estimating the cost of adopting a new application:

Can it be hosted externally (SaaS)? Hosted solutions are usually more cost-effective because they reduce the in-house operational costs and expertise required to run the application.
How many users will each application need to support? As the number of users increases, the operational support load increases because it takes more effort to scale and run the application.
What is the budget approval process like? If finance or executives do not recognize the value of purchasing commercial software, you may find yourself fighting for renewal budget every year.
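The trade-off between licensing and operational cost can be made concrete with some simple arithmetic. The sketch below is only illustrative: the per-engineer license price and the fully loaded cost of an operations engineer are assumed figures, chosen so that the licensing total matches the $100k savings mentioned above. It also assumes a hosted product needs no in-house operations staff, which is a simplification.

```python
def annual_cost(engineers, license_per_engineer, ops_staff, cost_per_ops_engineer):
    """Total annual cost: licensing plus the staff needed to run the application."""
    licensing = engineers * license_per_engineer
    operations = ops_staff * cost_per_ops_engineer
    return licensing + operations


# All figures below are assumptions for illustration, not vendor pricing.
ENGINEERS = 500
LICENSE_PER_ENGINEER = 200      # assumed: 500 * 200 = 100,000 per year
OPS_ENGINEER_COST = 150_000     # assumed fully loaded annual cost per ops engineer

# Hosted commercial product: pay for licenses, assume no in-house ops staff.
hosted_commercial = annual_cost(ENGINEERS, LICENSE_PER_ENGINEER, 0, OPS_ENGINEER_COST)

# Self-hosted open source: no license fee, but two full-time ops engineers.
self_hosted_open_source = annual_cost(ENGINEERS, 0, 2, OPS_ENGINEER_COST)

if __name__ == "__main__":
    print(f"Hosted commercial:       ${hosted_commercial:,}")
    print(f"Self-hosted open source: ${self_hosted_open_source:,}")
```

Under these assumed numbers the "free" option costs three times as much per year, which is why operational support belongs in the comparison.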

Size and Scale

SDLC applications tend to scale fairly well below 500 licensed users (~100 concurrent users), but the operational complexity increases steadily as the number of users increases. Developers can often deploy and manage the applications themselves in a small environment, but once there are thousands of users, it can be a full-time job keeping the applications running. The additional complexity of regular maintenance and upgrades, especially if the development organization is globally distributed and needs a highly available environment, can be costly. If these applications are hosted in-house and have a large number of users, plan for an operations staff that can support a larger environment.

Integration with Other Applications

A key element of any software development application is the ability to integrate with other applications. This is especially true for SDLC applications which rely on integration to establish links between requirements, implementation, and testing. Most modern web applications support web hooks (outbound) or REST APIs (inbound) as a means to integrate with other applications. If the application supports customizations or plugins, there will often be third-party plugins that provide the pre-defined integration logic. Some examples of this include:

Jenkins plugins: there is a huge variety of Jenkins plugins to perform all manner of tasks. This allows Jenkins to integrate with almost any application that provides an external API. An example of this would be the TestRail plugin, which submits test results produced by a Jenkins job to the TestRail application.
Jira plugins: Atlassian hosts a large marketplace of Jira plugins that extend the functionality of Jira. In addition to plugins, it is quite common for the source control system to implement hooks which invoke the Jira REST API to update stories when commit messages contain a reference to the story ID.
Artifactory plugins: Artifactory makes it quite easy to drop Groovy scripts into a deployment directory in order to hook into the application event model. It also provides an extensive REST API for interacting with the application.
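The commit hook described in the Jira example can be sketched in a few lines. This is a minimal illustration, not a production hook: the issue key pattern is a simplification that will also match strings such as "UTF-8", the server URL and sample commit are made up, and authentication is left out entirely. The request is only constructed here, not sent.

```python
import json
import re
import urllib.request

# Simplified pattern for keys such as PROJ-123 (an assumption; real key
# formats depend on how the Jira projects are configured).
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def find_issue_keys(commit_message):
    """Return the unique issue keys in a commit message, in order of appearance."""
    seen = []
    for key in ISSUE_KEY.findall(commit_message):
        if key not in seen:
            seen.append(key)
    return seen


def build_comment_request(base_url, issue_key, commit_id, commit_message):
    """Build (but do not send) a POST that adds a comment to a Jira issue."""
    url = f"{base_url.rstrip('/')}/rest/api/2/issue/{issue_key}/comment"
    payload = {"body": f"Commit {commit_id}: {commit_message}"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical commit and server, for illustration only.
    message = "PROJ-123: fix login redirect; see also OPS-7"
    for key in find_issue_keys(message):
        request = build_comment_request("https://jira.example.com", key, "abc123", message)
        print(request.get_method(), request.full_url)
```

A real hook would add credentials to the request and pass it to `urllib.request.urlopen`, and would need to decide what to do when the Jira server is unreachable so that commits are not blocked.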

Next Steps

This article should give the casual reader a foundation for understanding the various applications that can be used to facilitate the software development process. Within a few weeks, I’ll publish a follow-up article to give a similar overview of the operational applications that are used to support a suite of SDLC applications. I currently have an outline for 10 articles that center around the common theme of Release Engineering Operations. This is my first blog post so feel free to contact me or clap/comment to let me know whether the information was useful.