Slack is the backbone of DevOps

In the cloud, DevOps has become the hottest buzzword in the industry. As enterprises move their workloads to the cloud, the need for transparency and visibility across development and operations teams has created entirely new job roles, such as DevOps engineer and Site Reliability Engineer (a term coined by Google), that blur the lines between these well-established and historically distinct disciplines. In this article, we will talk about the tools that are key to a successful DevOps toolchain and how Slack is the glue that brings it all together.

Source Code Management

While there are many choices when it comes to source control tools, most engineers we talk to have standardized on Git for version control. There are a variety of flavors, ranging from public GitHub to on-premises solutions such as GitLab and GitHub Enterprise. Throughout the year, many reports were published about enterprises contributing to open source and hosting those projects on GitHub. On the public GitHub side, one great article discussed how an enterprise's GitHub usage may predict whether it disrupts or is disrupted from a technology perspective. IBM is an example of this disruption: it has embraced GitHub for enterprise usage both in Bluemix for its customers and in its own internal deployment, which is one of the largest GitHub Enterprise deployments to date. Suffice it to say, Git continues to be the version control tool of choice for developers.

Continuous Integration/Continuous Deployment (CI/CD)

When we start to talk about CI/CD, the conversation spans a variety of toolchains, ranging from Jenkins and Travis-CI to commercial offerings from a variety of vendors. As the discussion shifts to cloud native CI/CD toolchains, Travis-CI has taken the lead in hosted solutions; the chart below outlines the top cloud-hosted options for developer toolchains.

As this article is focused on DevOps with a strongly opinionated view on cloud native development and where we see the market headed, we are going to focus solely on Travis-CI. The power of Travis-CI can be tied directly to its configuration model (.travis.yml) and to the fact that this configuration lives in GitHub source control, giving the developer full control over build, test, code coverage and deployment. If you are developing your code in public GitHub, introducing Travis-CI to your workflow takes a matter of minutes. Developers love the ability to trigger builds off of pull requests in GitHub and then immediately switch over to Travis-CI to see the result of the build and deploy, all running in the cloud.

Monitoring and Alerting

Traditionally, developers would consider the CI/CD section of this article to be where their responsibilities end. The thinking goes: I have my code under source control and, working with the build team, I have a build with a set of tests that prove my code works, so I am ready to deliver this code to customers. The responsibility for operations and debugging issues shifts to another team. In the new cloud native world, the responsibility for managing the health of an application running in the cloud has shifted back to the development team. Teams have embraced tools such as New Relic to quickly understand the health of their applications and capture metrics around response time, workloads and failure rates. Most teams have integrated monitoring tools such as New Relic with services such as PagerDuty to proactively notify their DevOps/SRE squads when the system is not behaving as expected.

Data and Insights

In an earlier section, we talked about the power of Travis-CI, its configuration model, and the way it lets developers influence how builds and deployments work for their projects. The concept of social coding in GitHub enabled a model where all developers have access to the entire CI/CD workflow and can get insight into the most recent build, provide feedback on pull requests, validate the latest test results and check the current code coverage results. When the conversation shifts to tools like New Relic and PagerDuty, which provide performance metrics and show which developer is on call, the data is equally insightful and relevant but typically sits in a separate silo because it falls outside the scope of CI/CD.

Audit Trail

The tools mentioned so far each play an important role in the DevOps process. Most developers live in the now and want real-time data to make decisions quickly and definitively. There is also great value in knowing historical details, especially during a post mortem where the development team is backtracking to understand why an outage occurred. Historically, we have spent time correlating logs from tools such as Elasticsearch to understand the state of the system when New Relic captured the error that resulted in a PagerDuty callout. Once we have pinpointed the issue, it is time to go to GitHub and see which commit mapped to that Travis-CI deployment, and then check how our code coverage missed this code path in our testing. This process can be arduous, time-consuming and error-prone.

Social Collaboration

Social collaboration may seem tangential to our discussion around toolchains, and about two years ago we would likely have agreed. However, the adoption of tools such as Slack has introduced a huge paradigm shift in how our development teams interact today. With a globally distributed workforce, the ability to quickly start a Slack conversation and have real-time interactions with your teammates becomes at least as important as the development tools you use for source control and build.

Slack in DevOps

We believe it is easy to get agreement that tools such as Slack are great for collaboration; teams have flocked to Slack for group collaboration and, of course, to generate random GIFs using Giphy. However, the true power of Slack is its sense of community, and our development teams have gravitated to it en masse. When we polled our development team earlier this year, the most actively used tool was Slack. Now let's walk through why.

As our team embraced Slack, we started to see how we could avoid much of the context switching that occurs throughout the day and keep our developers focused on the things that matter most. We found that Slack is a great tool to tear down barriers and give our entire team full access to our DevOps flow. We started with New Relic posting events to our DevOps channel each time we saw an increase in response time (we were seeing one service with intermittent issues and our developers wanted to be notified when this happened). We then configured PagerDuty to publish to Slack when a developer was about to start a shift and also when there was a callout. This really helped our developers understand issues with our running systems.
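As a rough illustration of this kind of integration, the sketch below formats a response-time alert and posts it to a channel through a Slack incoming webhook. The webhook URL, service name, threshold and function names are placeholders of our own. New Relic and PagerDuty provide their own Slack integrations, so this only shows the general shape of an event landing in a channel.

```python
import json
import urllib.request


def build_alert_payload(service, response_ms, threshold_ms):
    """Build the JSON body for a Slack incoming webhook message."""
    text = (
        f":warning: {service} response time is {response_ms} ms "
        f"(threshold {threshold_ms} ms)"
    )
    return {"text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    payload = build_alert_payload("checkout-api", 1840, 1000)
    print(json.dumps(payload))
    # To send it, call post_to_slack() with your own incoming webhook URL.
```

Keeping the payload builder separate from the network call makes the message format easy to check without contacting Slack.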

Our team really appreciated the transparency of our development process and wanted to expand the scope to include CI/CD. We quickly started to add development lifecycle events such as git commits, pull requests, and build and deployment events to Slack, which allowed us to gauge the state of our end-to-end DevOps process from code commit to deployment environments. Now when our teams participate in a post mortem, we have a complete audit trail of our entire environment within Slack. It gives us a full history of what contributed to the issue, and we can trace back to the commits that were part of a given deployment without leaving Slack.
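To make the audit trail idea concrete, here is a minimal sketch of how a single, time-ordered stream of events can be filtered during a post mortem. The Event type, the source names and the one-hour window are assumptions for illustration; they are not how Slack stores or exposes channel history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    source: str   # for example "github", "travis", "newrelic", "pagerduty"
    message: str


def events_leading_to(events, incident_time, window=timedelta(hours=1)):
    """Return events inside the window before the incident, oldest first."""
    start = incident_time - window
    recent = [e for e in events if start <= e.timestamp <= incident_time]
    return sorted(recent, key=lambda e: e.timestamp)


if __name__ == "__main__":
    # Hypothetical channel history, for illustration only.
    day = datetime(2017, 1, 10)
    history = [
        Event(day.replace(hour=8, minute=10), "github", "pull request merged"),
        Event(day.replace(hour=9, minute=30), "travis", "build passed, deploy started"),
        Event(day.replace(hour=9, minute=20), "github", "commit pushed"),
        Event(day.replace(hour=10, minute=0), "newrelic", "response time above threshold"),
        Event(day.replace(hour=10, minute=5), "pagerduty", "on-call engineer paged"),
    ]
    incident = day.replace(hour=10, minute=5)
    for event in events_leading_to(history, incident):
        print(event.timestamp.time(), event.source, event.message)
```

Because every tool reports into the same stream, the commit, the deployment, the alert and the callout appear in order without any cross-tool log correlation.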

Our main driving force behind Slack adoption was transparency and having a great foundation for our development teams to collaborate in. What we learned is that many others in our organization could benefit from this transparency and would also like insights into our system. At this point, we moved to a chatbot model and introduced Cloudbot, an interactive bot that team members could chat with to get answers to their questions. To reduce the learning curve, we trained Cloudbot to be cognitive using IBM Watson. Because the commands were cognitive, users did not need to know exact command syntax to find out the state of our applications or which code was most recently deployed.
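The sketch below illustrates the underlying idea of mapping a free-form question to a command. It uses plain keyword overlap as a stand-in and does not call IBM Watson or reflect how Cloudbot was built; the intent names and keyword sets are hypothetical.

```python
import re

# Hypothetical intents and keywords, not Cloudbot's actual command set.
INTENTS = {
    "app_status": {"status", "state", "health", "healthy", "running"},
    "latest_deploy": {"latest", "deployed", "deploy", "deployment", "version"},
}


def match_intent(question):
    """Return the intent whose keywords overlap the question most, or None."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    best_intent, best_score = None, 0
    for intent, keywords in INTENTS.items():
        score = len(words & keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


if __name__ == "__main__":
    print(match_intent("What is the state of our application?"))
    print(match_intent("Which version was deployed most recently?"))
```

A trained natural language service handles phrasing far better than keyword matching, but the contract is the same: a question goes in, and a known command comes out.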

Conclusion

While there are many aspects of DevOps, we believe that transparency is the key to a thriving culture. Using tools such as Slack, developers can use the same collaboration tools they rely on to interact with team members to also interact with their toolchains. By making Slack cognitive, users can ask questions of fellow team members as well as bots and get real-time responses. With the recently announced partnership between Slack and IBM, we believe the future is bright for Slack and cognitive chatbots, with Slack continuing to be the backbone of DevOps (aka ChatOps).