Tools supporting software growth
An overview of the tools we use in DevOps.
In software development, we rely on tools when building anything non-trivial. We can split our tools into two categories: ones that affect the code and ones that affect our process. Tools such as which IDE or database we use typically affect the code, not how we work. On the other hand, tools such as having a build pipeline very much change how we work but have minimal effect on the code. In this post, we focus exclusively on the latter.
This post describes a journey from nothing all the way to microservices, and which tools we adopt along the way. The journey has 12 steps, divided into four phases. This post does not detail the specific tools; entire books have been written about each step. Instead, I provide an overview of the whole life-cycle, discuss one or two important points or pitfalls, and give a tool recommendation at each step. This is not intended to show how it should be, but rather how it could be, and hopefully to spark questions. Where is your toolchain on this journey? Did you skip any steps?
The most basic thing for software development is collaboration. We work directly or indirectly with many other people: users, stakeholders, team members, and other development teams. Therefore, we use a flurry of tools to minimize the overhead of all this communication.
Popular choices: Slack, Microsoft Teams, Discord.
Instant messaging is the modern replacement for internal emails. We often see the recipient-list grow and grow with emails, and it is tough to unfollow a thread once we have been added. This means we get a lot of unnecessary emails, which incurs a context switch and builds frustration.
Using a tool with public channels, where people can choose which threads to follow and when to opt out, makes it much easier to reduce this overhead. However, if misused, these tools can easily add even more overhead and frustration.
Using public channels also gives more transparency, which can be hugely beneficial. One benefit is when I see a thread that is not of interest to me, but I know someone who might be interested, then I can notify them, and they can easily join in.
Popular choices: Confluence, Sharepoint, Stack Overflow, Google Drive.
While instant messaging is great for fast information, we also need more static information. This includes contracts, internal documents, and developer documentation. Some of this documentation is legal documents or presentations where form matters, while others are more akin to guides or tutorials.
Because of these documents' different styles, I don’t think one tool can cover all of it well. Something like Sharepoint is fine for non-developers. But for developers, I recommend having markdown files in the folder with the code where we can put all information relevant to the code. I also recommend putting more general information in a central place. This could be a private section of Stack Overflow. Putting the documentation close to the developers greatly increases its visibility.
Popular choices: Jira, Azure DevOps, Trello.
Once we have contracts with stakeholders, it is time to start planning. For this, we use ticket systems. A ticket here represents some task to be done. Tasks need to be prioritized, and our ticket system should give traceability and overview.
We know from Kanban that using a board correctly can help expose bottlenecks and increase teamwork. Another important function of our ticket system is to show our lead time: the time from when a ticket enters the system until it is running correctly in production. We should even be able to track how often our tickets need rework or bugfixes.
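To make the lead time definition concrete, here is a minimal sketch of how it could be computed from exported ticket data. The ticket records, their field order, and the function names are hypothetical; real ticket systems expose this data in their own formats.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical ticket export: (ticket id, entered the system, running in production).
TICKETS = [
    ("T-1", datetime(2024, 1, 1), datetime(2024, 1, 4)),
    ("T-2", datetime(2024, 1, 2), datetime(2024, 1, 9)),
    ("T-3", datetime(2024, 1, 3), datetime(2024, 1, 5)),
]


def lead_times(tickets):
    """Lead time per ticket: production time minus the time it entered the system."""
    return [in_production - entered for _, entered, in_production in tickets]


def median_lead_time(tickets):
    """The median is less sensitive than the mean to one unusually slow ticket."""
    return median(lead_times(tickets))
```

Tracking the median over time, rather than a single snapshot, is one way to see whether a process change actually shortened the path to production.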
Popular choices: Git, SVN, Mercurial, Pijul.
With the planning done, we can start writing code. For sharing code, we need version control. In the current ‘meta,’ this is synonymous with Git. I have encountered situations where we needed a centralized system to lock files, but these are rare.
More interesting, I think, is to look forward. I put both Mercurial and Pijul on the list, as they seem like they may be the next step for distributed version control systems.
After the collaboration phase comes time to automate. This is important because it enables continuous integration (CI), continuous delivery, and ultimately continuous deployment, which are known to cause fewer errors and increase development speed.
Popular choices: GitLab, Azure DevOps, Jenkins.
The first automation step is to start building our software centrally. This helps eliminate the all-too-familiar “it builds on my machine.” Building on a neutral server can also start catching common problems such as forgetting to add a file to version control or pushing before having built locally.
Once this is up and running, we can add more confidence to it by adding static code analysis or running automated tests. The only pitfall is that the longer the build pipeline to get onto the main branch is, the longer people tend to postpone pushing to it, which goes against continuous integration. My rule of thumb is that the CI build should run in less than five minutes.
Popular choices: Artifactory, Nexus.
When we try to make our build pipeline run in less than five minutes, we run into problems if our codebase is large. The best solution to this is to cut it up into separately buildable pieces. This is not an easy task, but by building smaller artifacts, we can use our programming language’s package manager and avoid rebuilding unchanged parts of our code.
This is one of the most significant ways to speed up build and test time, simply because smaller components mean less code to build and fewer tests to run. It can also give us more stable builds because the same artifacts are reused instead of rebuilt every time, possibly with different tool versions.
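The idea of reusing unchanged artifacts can be sketched as a cache keyed on a hash of each component's sources. This is a toy illustration only, not how Artifactory, Nexus, or any package manager works internally; `ArtifactCache`, `content_hash`, and the `build` callback are hypothetical names.

```python
import hashlib


def content_hash(sources: dict[str, str]) -> str:
    """Hash file names and contents in a stable order."""
    digest = hashlib.sha256()
    for name in sorted(sources):
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(sources[name].encode())
        digest.update(b"\0")
    return digest.hexdigest()


class ArtifactCache:
    """Keeps built artifacts so unchanged components are not rebuilt."""

    def __init__(self):
        self._artifacts = {}
        self.builds = 0  # number of real builds performed

    def get_or_build(self, component, sources, build):
        key = (component, content_hash(sources))
        if key not in self._artifacts:
            # Sources changed (or were never built): build and store the artifact.
            self._artifacts[key] = build(sources)
            self.builds += 1
        return self._artifacts[key]
```

With this shape, a pipeline that touches one component pays only for that component's build; every other component resolves to an artifact that was already built and tested.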
Popular choices: GitLab, Azure DevOps, Jenkins.
With the artifacts produced, we can build a script to deploy them to production. With such a script we can do continuous delivery. Like most automation, this reduces time-waste. More importantly, it increases quality because the script does the same thing every time; therefore, we are more likely to find errors quicker and fix them forever.
In my opinion, though, there is an even more important effect of completely automating deployments. Many developers dread having to make a manual deployment: what if I do something wrong? What if there are bugs in the deployment package? The person making the deployment is typically responsible if something is broken afterward, even if they didn't cause the problem. Automating the deployment moves this responsibility to the entire team, reducing the dread.
Popular choices: Launch Darkly, Optimizely, make your own.
The final step in the automation phase is to decouple release and deployment. We want to have a feature in production without being released, i.e., visible to users. We can achieve this through feature toggling. In a nutshell, feature toggling means putting an if around any change we make, keeping the existing behavior in the else. If we can dynamically flip which branch of the if is run, we can control precisely what is released.
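A minimal in-house toggle could look like the sketch below. The `FeatureToggles` class, the `bulk-discount` flag, and the pricing rule are all hypothetical; a real setup would load the flag state from configuration or a database so it can be flipped without a deployment.

```python
class FeatureToggles:
    """In-memory flag store; flags are off unless explicitly enabled."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def is_enabled(self, name):
        return name in self._enabled

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)


toggles = FeatureToggles()


def checkout_total(prices_in_cents):
    total = sum(prices_in_cents)
    if toggles.is_enabled("bulk-discount"):
        # New behavior: deployed, but only visible once the flag is flipped.
        if len(prices_in_cents) >= 3:
            return total - total // 10
        return total
    else:
        # Existing behavior stays untouched in the else branch.
        return total
```

Calling `toggles.enable("bulk-discount")` releases the feature, and `toggles.disable("bulk-discount")` rolls it back, in both cases without deploying anything.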
Deployment is a technical decision. Releasing features to users is a business decision. However, without feature toggling, the two are the same, which means neither group is happy. The development team cannot deploy at will, meaning they are forced to build bigger deployment packages, increasing the risk of something going wrong during deployment. And the business cannot release or launch a new feature without coordinating with the development team.
Having decoupled release and deployment allows us to do continuous deployment. That is, we can deploy on every commit that reaches the main branch, safe in the knowledge that no new code will run because it is behind an if.
This is the only step on this “tool journey,” where I recommend starting with an in-house version before going into the full-featured 3rd party tools. It is important to practice the fundamentals, getting used to the cycles of introducing the flags and getting safely rid of them. Once this cycle is second nature, feel free to hook up to one of the bigger systems and start doing advanced A/B testing and experiments.
With our code flowing seamlessly and continuously into production, the next most pressing thing is to be aware of the state of production. We need to measure. Continuously. We need to know both what is running and how it is running. The final stage of this phase is when we can discover faults before our users. Let’s discuss how tools can help us achieve this.
Popular choices: ELK Stack.
Logging is a simple and powerful way to find out what code is running. By simply putting a log statement at the start of important methods, we can easily overview how much each method is being called. This means we can identify if some methods have no, or very little, utility. Deleting these is an instant improvement of the codebase. I discuss this in great detail in my book Five Lines of Code, chapter 9, “Love deleting code.”
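One way to add such a log statement without editing every method body is a decorator. This is a sketch using only the standard library; the `log_call` decorator, the `call_counts` counter, and the two example functions are hypothetical, and in practice the counts would come from querying the indexed logs rather than from an in-process counter.

```python
import functools
import logging
from collections import Counter

logger = logging.getLogger("usage")
call_counts = Counter()  # in-process stand-in for querying the logs


def log_call(func):
    """Log every call to func so we can later see how often it is used."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("called %s", func.__qualname__)
        call_counts[func.__qualname__] += 1
        return func(*args, **kwargs)

    return wrapper


@log_call
def export_report():
    return "report"


@log_call
def legacy_export():
    return "legacy"
```

If `legacy_export` never shows up in the logs over a representative period, it becomes a candidate for deletion.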
While simply printing these logs in files can work, there is a lot of time saved by having them indexed and queryable. My usual preference for this is the ELK stack, a composition of a few tools, collaborating to index, query, and present logging data. It is easy to get started with and can scale to even the biggest applications.
Popular choices: Honeycomb, Prometheus.
Once we know what is running, we turn our attention to how it is running. What is the current load, what is our error rate, what is our throughput, what are we currently paying? Whereas logging concerns itself with what is going on inside the codebase, all these questions are about the code's environment. This is where monitoring comes in.
We can often wrap our code in tools such as Prometheus or Honeycomb, which then monitor all requests and responses and even observe the (virtual) machine running it. With little effort, we can get a lot of insight that we can use to diagnose issues when they occur.
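The wrapping idea can be illustrated with a small hand-rolled sketch. It does not use the Prometheus or Honeycomb client APIs; `RequestMetrics` and its attributes are hypothetical stand-ins for what such tools collect for us.

```python
import time


class RequestMetrics:
    """Counts requests, errors, and time spent in wrapped handlers."""

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.total_seconds = 0.0

    def observe(self, handler):
        """Wrap a request handler so every call is measured."""

        def wrapped(*args, **kwargs):
            start = time.perf_counter()
            self.requests += 1
            try:
                return handler(*args, **kwargs)
            except Exception:
                self.errors += 1
                raise  # monitoring must not swallow the failure
            finally:
                self.total_seconds += time.perf_counter() - start

        return wrapped

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0
```

The handler code itself stays unchanged; the measurements live entirely in the wrapper, which is what makes this kind of monitoring cheap to add.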
Popular choices: Prometheus.
When we have lots of data on how the production system is supposed to be running, we can start looking for outliers. Anyone involved with maintaining software strives for consistency. Whenever something is out of the ordinary, it usually means something is wrong. If we can identify this and react to it, we can spare our users from frustration.
There are a few ways we can give this information to the team. We can set up simple threshold alerts to go off if the error rate passes a specific value. This works well for some things but is not very flexible. We can set up adaptive alerting systems to learn what the data is supposed to look like and trigger if it is off by a certain percentage. This can work really well but is difficult to set up. If we get either of these methods wrong, we risk sending out too many alerts, causing frustration for the on-call people. We also induce alarm fatigue, where people stop paying attention to alarms because they are so common.
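The difference between the two alerting styles can be sketched in a few lines. Both functions are simplified assumptions: the default limit and tolerance are arbitrary, and a real adaptive system would learn seasonality and trends rather than compare against a plain average of recent values.

```python
from statistics import mean


def threshold_alert(error_rate, limit=0.05):
    """Fixed threshold: fire when the error rate passes a specific value."""
    return error_rate > limit


def adaptive_alert(history, current, tolerance=0.5):
    """Fire when the current value is off from the recent baseline by more
    than `tolerance`, expressed as a fraction of that baseline."""
    if not history:
        return False  # nothing learned yet, so nothing to compare against
    baseline = mean(history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > tolerance
```

Tuning `limit` and `tolerance` is exactly where the risk of alarm fatigue sits: too tight and the alerts fire constantly, too loose and they never fire.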
My preferred method for getting started with this is simply putting the important metrics on big screens wherever the team is. Humans are great at recognizing patterns and spotting outliers. By making the data visible, they automatically start learning the data and reacting when it looks off.
With collaboration, automation, and measurement in place, we have usually acquired a lot of inter-dependencies. These can be in the code, in the organization, in the processes, in the communication, everywhere. Therefore the final phase focuses on managing these. Increasing team independence reduces the cognitive load and enables faster experimentation and learning. From a tool perspective in this post, it is about enabling independent building, runtime, and deployment of services.
Popular choices: RabbitMQ, Kafka, AWS SNS & SQS.
The first step for a team to become independent is to build their component independently of other teams. The most effective method to do this in my experience is to use a central message queue for all inter-service communication. This redirects the tight dependencies on components towards the queue, which is a simpler dependency to manage. It is also more dynamic, meaning that we can change the communication flow gradually.
I go into detail with how to set up such an architecture in a previous blog post. I am a big fan of using RabbitMQ to get started with because of its exchange functionality, which saves a considerable amount of work. However, RabbitMQ does have a lower throughput than tools like Kafka, so when RabbitMQ becomes a bottleneck, I migrate.
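The decoupling effect of a central queue can be shown with an in-process stand-in. This sketch is synchronous and keeps nothing durable, unlike RabbitMQ or Kafka; `InMemoryBroker`, the `order.placed` topic, and the three services are hypothetical.

```python
class InMemoryBroker:
    """Toy stand-in for a message queue: topics map to subscriber callbacks."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, topic, handler):
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        for handler in self._handlers.get(topic, []):
            handler(message)


class BillingService:
    def __init__(self, broker):
        self.invoices = []
        broker.subscribe("order.placed", self.on_order_placed)

    def on_order_placed(self, order):
        self.invoices.append({"order_id": order["id"], "amount": order["amount"]})


class ShippingService:
    def __init__(self, broker):
        self.shipments = []
        broker.subscribe("order.placed", self.on_order_placed)

    def on_order_placed(self, order):
        self.shipments.append(order["id"])


class OrderService:
    """Knows only the broker and the topic, not who consumes the message."""

    def __init__(self, broker):
        self._broker = broker

    def place_order(self, order_id, amount):
        self._broker.publish("order.placed", {"id": order_id, "amount": amount})
```

`OrderService` has no reference to billing or shipping, so a new consumer can be added, or an old one removed, without touching the publisher. That is the gradual change in communication flow the queue enables.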
Popular choices: Docker, VMware.
Once we build our team’s component(s) independently, we also want to run them separately. Our communication through an external queue actually makes this straightforward. By adding a “main method” to our component that sets up communication with the queue, we now have an independently runnable service.
We can package this into a docker image or run it on a separate VM. The important thing is that its runtime is independent of other services. This way, if something fails in one service, it should have minimal impact on other services. Running services separately also gives us finer granularity in our monitoring, allowing us to isolate faulty services quickly.
Popular choices: Heroku, AWS, Azure, Kubernetes.
The final step on our journey is the ability to scale and deploy our service independently. These are two of the biggest advantages to “cloud,” whether it is managed or not. They usually are easy to gain when using a managed cloud such as Heroku, AWS, or Azure, but they can also be gained from running a Kubernetes cluster.
Notice that while this is very powerful, it is also one of the most expensive steps on this journey, either through cloud expense or the great cost of operating Kubernetes. You should examine carefully whether your scale justifies it, because otherwise it may well bust your budget.
We have gone through the entire life-cycle and growth of software development from the perspective of tooling. Starting from collaboration, discussing the tools necessary to communicate efficiently about software. Adding automation to reduce waste and increase quality. Using measurements to operate the code internally and externally. Finally, bringing down dependencies by decomposing.
I don’t believe that every software product needs everything on this list, but I think it is relevant to discuss, and I believe my model is useful when considering how to grow our software. In fact, I often use this model when I discuss DevOps transformations. This is because tools are concrete and, therefore, easy to understand. And my model is close to my preferred definition of DevOps, CALMS:
- Sharing — which I often explain as “scaling,” as that is a major part of it.
Because my model is so useful in my communication, I have turned it into a poster, which you can pick up for yourself or your team here:
Otherwise, if you would like to learn more, I discuss many of the points in this post and much much more from a developer’s viewpoint in my book: