The Essential Flavors of DevOps “Tools”
While it’s nowhere close to slowing down from being the be-all catchphrase (or buzzword if you’re the jaded type) for propelling software development further and farther, DevOps is much more than just a #hashtag. In fact, companies have been utilizing variations on today’s concept of DevOps for over a decade. I’ve always loved how the Agile Admin described DevOps as:
“DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support.”
As DevOps has been explained a billion times over at every conference and meetup across the world: it’s a culture first and a toolset second if at all.
Despite the constant bombardment that it’s ‘culture first’ and not a ‘tool’, people still yearn to know what ‘DevOps tools’ in fact are. Is it automation tooling? Are they collaboration tools? Is it the cloud?! How about monitoring? The answer is that any tool that helps further quality and velocity while supporting the entire software lifecycle is probably useful in a DevOps-cultured workspace.
For a moment let’s move past the notion that DevOps can be loosely described as:
…a cultural aspiration where software developers, security engineers, operations personnel, network architects, quality automation and performance test engineers, and/or database administrators are working together with minimal to zero artificial barriers by advocating improved communication and collaboration to achieve increased agility, automation, security, and customer experience…
Instead we’ll focus on some specific categories of tools with some examples that can help bridge the traditional gaps between varied groups of engineering specialists.
Continuous Integration & Continuous Delivery
As most people know, CI/CD is a highly automated pipeline of tools in which your software gets built, tested, security scanned, and deployed. These build and test systems help deliver quality code in an automated fashion, improving the overall experience. In fact, as teams work on codifying their infrastructure and server configuration, these same tools can be leveraged to build entire systems and application stacks and tear them down when not in use.
Jenkins is perhaps the most widely recognized open source build orchestration tool out there. Jenkins can help orchestrate the entire lifecycle by using webhooks from source code repositories such as GitHub, Bitbucket, or SVN to begin a build the second a software change is checked in to a repo. You can set up jobs and pipelines to automatically test, instrument infrastructure, scan for security bugs, create metrics, etc. The list of the tool’s uses is quite large, and there are thousands of open source plugins available in the ecosystem today. (Full disclosure: I am a member of the Customer Advisory Board at CloudBees, which provides enterprise Jenkins solutions.)
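To make the webhook idea a bit more concrete, here is a minimal sketch in Python of what any receiver has to do when a repository host calls it: check that the request is genuine, then decide whether the change deserves a build. It assumes a GitHub-style push payload (a `ref` and an `after` commit SHA) signed with HMAC-SHA256 in the `sha256=<hex>` form GitHub sends in its `X-Hub-Signature-256` header. The function names, the `watched_branch` policy, and the `start_build` callback are hypothetical; this is not how Jenkins or its plugins implement it.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True if the header matches the HMAC-SHA256 of the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest does a constant-time comparison to avoid timing leaks.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def should_trigger_build(body: bytes, watched_branch: str = "main") -> bool:
    """Hypothetical policy: only pushes to the watched branch start a build."""
    payload = json.loads(body)
    return payload.get("ref") == f"refs/heads/{watched_branch}"


def handle_push(secret: bytes, body: bytes, signature_header: str, start_build) -> str:
    """Validate a push webhook and hand the new commit SHA to start_build."""
    if not verify_signature(secret, body, signature_header):
        return "rejected"
    if not should_trigger_build(body):
        return "ignored"
    # start_build is an assumed callback, e.g. something that queues a CI job.
    start_build(json.loads(body).get("after"))
    return "triggered"
```

In a real setup the build server’s own webhook endpoint does this work for you; the sketch is only meant to show why a checked-in change can kick off a build within a second or so.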
Collaboration
One of the core tenets of a successful DevOps culture is ensuring that teams can collaborate with one another in a highly efficient, streamlined manner. That means you need a collaboration tool, and one that can support chat, video/audio, tool integration, etc.
Slack is perhaps the most hyped tool in this particular space at the current moment. It’s got a lot of features to go along with all the buzz too. Plus you can use it for free. I almost feel like they sell themselves short as merely a “messaging platform”. I do wish the SAML/SSO functionality came with the free platform but hey they’ve gotta make money too.
HipChat is Atlassian’s answer to Slack. Its advantage is that it natively supports the other Atlassian software stacks such as Jira, Confluence, and Bitbucket. One team decided to swap tools and recorded how that went down. It’s a lengthy read but a good one if you’re trying to decide between the two.
I personally tend to agree with that group’s experience: if you’re choosing between the two, Slack is probably going to come out as the winner. The integrations and plugins out of the box just continue to be the differentiator.
Configuration Management
Ansible is a configuration management tool that is growing in popularity alongside favorites such as Puppet and Chef. In fact, it’s starting to develop into a bit of a religious war among some crowds, but there are clear advantages and disadvantages to all of the tools in this space. The power of Ansible is that it doesn’t require agents, which lends itself to immutable infrastructure. The playbook structure is highly reusable, and since it’s YAML-based it’s fairly straightforward to pick up.
More and more enterprises as well as smaller shops are flocking towards Ansible, with many embracing Ansible Tower, which wraps playbooks in a more manageable interface via a dashboard. Tower can get prohibitively expensive as it’s $5,000/yr for 100 nodes (there is a free version available for limited-time trials if you want to play-before-you-pay). If you’re going this route, I’d recommend just sticking with Ansible and not using Tower. Both Chef and Puppet are going to require your team to know Ruby: more so for Chef and less for Puppet, though to use Puppet extensively and in depth you’ll need the CLI, which is Ruby-based.
Monitoring and Metrics
One of the core tenets of ensuring that Development + Operations becomes DevOps is codifying the operational tools necessary to keep production up 24/7 and feeding any production errors back into the software development cycle. Much of that is driven by monitoring and metrics tools. This space is huge, and lots of the major players continue to clobber one another for market share. The good news is that all that competition means there are a lot of solid tools to choose from, and the resulting price wars benefit the enterprise.
In fact, the space around monitoring is so cluttered that there are tools for each specific flavor of monitoring: infrastructure monitoring, application performance monitoring (APM), network monitoring, IaaS or cloud monitoring, synthetic monitoring, and uptime/performance monitoring. And the list seemingly goes on and on, becoming more specialized by the minute. Gartner has a nice dissection of some of that nuance.
Some may argue that your monitoring tool and your metrics tool shouldn’t, or even can’t, be the same one. Well, that doesn’t have to be the case at all. Nagios is an example of an open source tool that has been embraced by everyone from the newfangled DevOps engineer to the classic SysAdmin, and it fits nicely within both the metrics and monitoring spaces. However, it can be clunky to set up and maintain, and while it certainly can scale, it’s not the easiest tool in that regard.
Probably the best tool for real IaaS and cloud monitoring that is not a native Amazon Web Services offering is Datadog. It offers a nice API and a great dynamic dashboard, along with a ton of integrations with the other tools in your toolbag. Of course, if you’re already on AWS, Amazon CloudWatch is a fantastic tool for monitoring all of your AWS infrastructure, and it obviously integrates nicely with the other native Amazon cloud tools.
Uptime and performance monitors tend to work best when run from multiple locations around the world, so that you can actually gauge the performance of your application from across the globe. The goal is to ensure, in near real time, that your customers can reach your endpoints and frontend web apps in a timely and performant manner. For the longest time the go-to enterprise tool was Gomez (now available in the Dynatrace suite of products), which seemed to dominate the market, but the landscape has certainly become more competitive in the past few years. Neustar’s Web Performance Monitoring (aka Webmetrics) product is fairly decent, but Pingdom seems to be one of the more go-to tools in this space now thanks to its great price point, Real-User Monitoring, and abundance of integration points.
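For a feel of what one of these monitors does on each run, here is a rough Python sketch of a single probe: request an endpoint, time it, and flag the result as unhealthy if it errors out or comes back too slowly. Everything here (`ProbeResult`, `probe`, the `slow_ms` threshold, the injectable `fetch` function) is made up for illustration and is not the API of Pingdom or any other product. A hosted service runs checks like this on a schedule from many regions, which a single script cannot do.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    url: str
    ok: bool
    latency_ms: float
    detail: str


def http_status(url: str, timeout_s: float) -> int:
    """Fetch the URL and return its HTTP status code (raises on 4xx/5xx or timeout)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def probe(url: str, fetch=http_status, timeout_s: float = 5.0,
          slow_ms: float = 1000.0) -> ProbeResult:
    """Run one check: healthy means a 2xx/3xx status within the latency budget."""
    start = time.perf_counter()
    try:
        status = fetch(url, timeout_s)
    except Exception as exc:  # network errors, timeouts, HTTP error statuses
        latency_ms = (time.perf_counter() - start) * 1000
        return ProbeResult(url, False, latency_ms, f"error: {exc}")
    latency_ms = (time.perf_counter() - start) * 1000
    ok = 200 <= status < 400 and latency_ms <= slow_ms
    return ProbeResult(url, ok, latency_ms, f"status {status}")
```

The `fetch` parameter is only there so the check can be exercised without touching the network; calling `probe("https://your-app.example/health")` with the default would perform a real request against whatever URL you give it.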
Monit is a great open source tool for Unix server-based monitoring. It boasts a nice lightweight web interface that can be used to see your entire server ecosystem in a single pane of glass. It’s also highly configurable and quite easy to set up.
When it comes to Application Performance Monitoring (APM), there are a couple of clear leaders in this space: New Relic and AppDynamics. Both are robust, feature-rich tools with a strong pedigree in the market, and both sit at the higher end of the price range. So why are there so many parallels, you ask? Well, interestingly enough, both trace their roots to the same company, Wily Technology. Both tools are attempting to own the entire monitoring arena, and to be quite honest they’re both doing a fairly good job at that. Traditionally New Relic had catered more towards startups, but that has changed over the past year or so (ahem: when public, you need to sell and cater to everyone). That said, there are differences, but which ones matter will depend on your own needs, so definitely explore both.
On the more pure metrics side of the house, both Cacti and Grafana provide graphical interfaces to showcase your metrics. These metrics-as-a-service tools are more highly tuned for that unique space and should definitely be explored if you’re not willing to pay for an enterprise-level solution that has it built in.
And lastly, logging is an important facet of any great DevOps toolbox. Mature DevOps teams require that team members don’t go hopping onto production boxes (or really any box!) to tail a log to find out why their app went belly up. In comes some form of service that provides centralized logging. The clear leader for large enterprises is Splunk. Far and away it’s an incredible tool, allowing for simple or multi-faceted text search, deep customization, dashboards, visualizations, and intelligent data analysis, as well as many other rich features. But, and this is a big but, if you plan on using those features you’d better staff up. The customization capability, while robust, is enough of a challenge that an ecosystem of “Splunk architects” and “Splunk engineers” is emerging. That screams that the tool is hard to work with and can become burdensome at scale. There is also a SaaS version of Splunk called, not surprisingly, “Splunk Cloud”, but it does not have the same set of configurations as the on-premises version, resulting in a volley of customer support tickets.
If you’re more in need of a tool that lets your teams embrace centralized logging while giving them valuable insight into how the app is performing and where its issues are, then consider an ELK stack. ELK is Elasticsearch, Logstash, and Kibana: tools that, when tied together, offer a rich experience and the ability to maintain and operate production with ease. While it’s perfectly easy to set up your own ELK stack for “free”, you’re still going to be paying for storage, implementation, maintenance, and operations. At scale that can become unwieldy.
Another tool that has gained momentum in the open source community is Graylog2. It boasts some great visualizations and plenty of plugins and API support. It also subscribes to the notion of breaking down the proverbial barriers between Operations and Development, and while most of the logging tools do this, the learning curve for Graylog2 is not as challenging.
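Whichever of these tools you land on, centralized logging tends to go more smoothly when the app emits structured records instead of free-form lines somebody has to tail and grep. As a small illustration (my own sketch, not something any of these products require), here is a Python logger that writes one JSON object per line. The field names and the `build_logger` helper are assumptions chosen for the example; how the lines get shipped to a central store is left out entirely.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback with the event instead of on separate lines.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order %s placed", 42)
```

Because each line is self-describing, a collector can index fields like `level` and `logger` directly, and nobody needs to hop onto a production box to work out what happened.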
I’d love to hear from you about your experience with some of these tools or ones that I didn’t cover. Hit me up on Twitter: @jsin.