By Nikita Lohia
Change is inevitable, change is constant.
Someone famous said that. The all-knowing WWW seems to be ambiguous about who said it. But finding out the origin of this quote isn’t the goal of this blog.
We, the reliability engineering team, had a far more pressing problem to solve:
TL;DR: How do you track code releases that happen multiple times a day, pushed by many different teams in no particular order, any of which may or may not affect your service, in an ever-growing microservices architecture?
We want to track the changes happening in all of our systems.
We have attempted to solve this before. Our legacy Change Request API, fondly called CRAPI (/ˈkræpɪ/), tracked releases, both automated and manual, and stored them in a third-party change management tool. There were a few reasons why CRAPI no longer worked for us:
- It was written 4 years ago in Java, which no one in the team quite understood anymore.
- It was synchronous and did not handle server side errors gracefully, resulting in a failure rate of over 50%. This discouraged developers from adding it to their production pipelines because it would constantly break or delay their builds.
- It was too complicated to adopt: one had to create an API key by calling another API, the request payload was bulky and unclear, and there was a lack of documentation.
Meanwhile, during production incidents, our operations and development teams were forever trying to figure out which change in our wide and varied microservice estate could have possibly broken a system.
The need for having a central log of changes was evident.
However, between various feature requests of building cool new things™ and supporting our existing services and products, Change Management often got deprioritized. We realised that we needed a smarter approach to solving this problem, and quickly.
So a group of us gathered around a whiteboard, handed a marker to whoever amongst us could draw and write the most legibly, and got to work. External auditors, the compliance team, dev teams, Project Managers, and our Operations support team: they were all potential users of Change Management. However, we decided to start with one user group to get something out the door quickly: our developer teams.
We used the following guiding principles when building a solution.
The main one is simple, but often forgotten: keep the user at the centre of it all. Whatever we build must be easy for the user to adopt and use. Always, always, always think from the user’s perspective when building new things.
Learn from the past
While it’s important to know what to build, it’s equally important to know what not to build. Use the lessons you have learnt from the past to guide you. In our case, we knew that one of the reasons CRAPI wasn’t hugely popular was that it was slow and synchronous. It took a user’s request, validated it, waited for a 3rd party to respond, logged it and only then sent a response back to the user. With so many dependencies in the request path, CRAPI was bound to be unreliable. We decided to make the replacement solution asynchronous. As long as the user sends a minimal valid payload, we respond with a 202 Accepted and let the user be on their merry way. All of the other processing happens asynchronously.
After all, a server side error should not really be a user’s problem.
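The accept-first flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual Change API code: the `environment` field, the `handle_change_request` function and the in-memory queue (standing in for the real stream) are all assumptions made so the sketch runs locally. Only `systemCode` is a field named in this post.

```python
import json
import queue
from typing import Any

# Hypothetical "minimal valid payload"; only systemCode is named in this post.
REQUIRED_FIELDS = ("systemCode", "environment")

# In-memory stand-in for the real stream, so the sketch runs locally.
pending_changes: queue.Queue = queue.Queue()


def handle_change_request(body: str) -> tuple[int, dict[str, Any]]:
    """Validate the payload, hand it off, and answer straight away."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}

    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}

    # Nothing slow happens here: notifications and other processing
    # are done later by whatever consumes the queue.
    pending_changes.put(payload)
    return 202, {"status": "accepted"}
```

The only way for the caller to see a failure in this sketch is to send an invalid payload; anything that goes wrong downstream stays on the server side.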
We decided to track microservice changes with a microservice application. We made our solution highly decoupled. In the first iteration of building a new Change API, we kept it simple. When a user sends us a change log, we validate it, send an “Accepted” response immediately upon success and send a notification on Slack asynchronously. To make this happen, all that was needed was a couple of AWS Lambda functions and an API Gateway with a real-time data streaming service, in our case AWS Kinesis.
By default, we sent all of our “production” change logs to a single Slack channel. This was already hugely helpful.
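As a rough idea of what a stream consumer for this step might look like, here is a Lambda-style handler that reads a batch of Kinesis records and posts one line per production change. The message wording, the `environment` field and the `post` callback (a stand-in for a real Slack call) are assumptions for illustration. The one detail taken from AWS itself is the event shape: each record's payload arrives base64-encoded under `kinesis.data`.

```python
import base64
import json
from typing import Any, Callable


def format_slack_message(change: dict[str, Any]) -> str:
    """Turn one change log into a one-line message (hypothetical wording)."""
    system = change.get("systemCode", "unknown-system")
    env = change.get("environment", "unknown-environment")
    return f"Release: {system} was deployed to {env}"


def handler(event: dict[str, Any], context: Any = None,
            post: Callable[[str], None] = print) -> int:
    """Process a batch of Kinesis records; returns how many were posted."""
    posted = 0
    for record in event.get("Records", []):
        # Kinesis delivers each record's payload base64-encoded.
        raw = base64.b64decode(record["kinesis"]["data"])
        change = json.loads(raw)
        # Only production changes go to the shared channel in this sketch.
        if change.get("environment") != "production":
            continue
        post(format_slack_message(change))
        posted += 1
    return posted
```

Passing `post` in as a parameter keeps the sketch testable without a Slack workspace; in a deployed function it would wrap whatever Slack client or webhook the team uses.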
Have a futuristic outlook
One of my favourite quotes is from Arthur C. Clarke:
“Any sufficiently advanced technology is indistinguishable from magic”
When Change API was in beta, it was pretty basic. We then did extensive user research and used that feedback to build on it. We added lots of “enrichments” to Change API.
- Change API takes the Git commit details (commit hash, repository name etc.) from the available CircleCI environment variables and helpfully adds a link to the GitHub Pull Request/Release into the Change notification on Slack without requiring any effort from the user.
- It uses the “systemCode” (a unique identifier for every system) supplied by the user and converts that into a clickable link to the relevant troubleshooting guide (we call them Runbooks), which is really handy if something goes wrong after the release.
- Users can also optionally add further key:value metadata to the Change log notification. For example, our Memberships team adds changeSummary and changeDescription fields to give more context.
- It adds a nice little “Last Release Timestamp” next to the system’s monitoring page to make it super easy for people to correlate a monitoring alert with a recent change to the system.
- It also triggers an update of the Runbooks if there was any change in the documentation. Read more on Runbooks in Rhys Evans’ blog.
Each of these “enrichments” runs as its own Lambda function, asynchronously and independently of the others, without costing our developer teams any additional time or effort. Moreover, thanks to principle 3, “decoupling”, any future enrichments can be built easily, by anyone, by adding them as a new Lambda consumer on the existing Kinesis stream that logs changes.
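To make the first two enrichments concrete, here is a small sketch of how the links could be derived. The CircleCI variables used (`CIRCLE_PULL_REQUEST`, `CIRCLE_TAG`, `CIRCLE_SHA1`, `CIRCLE_PROJECT_USERNAME`, `CIRCLE_PROJECT_REPONAME`) are CircleCI's built-in ones; everything else is assumed for illustration, including the runbook base URL, the `gitLink` and `runbook` field names, and the order in which links are preferred.

```python
import os
from typing import Mapping, Optional

# Hypothetical base URL; the real Runbooks address is not given in this post.
RUNBOOK_BASE_URL = "https://runbooks.example.com"


def github_link(env: Mapping[str, str]) -> Optional[str]:
    """Best available GitHub link for this build, or None."""
    # CircleCI sets CIRCLE_PULL_REQUEST to the PR URL on pull request builds.
    pull_request = env.get("CIRCLE_PULL_REQUEST")
    if pull_request:
        return pull_request
    org = env.get("CIRCLE_PROJECT_USERNAME")
    repo = env.get("CIRCLE_PROJECT_REPONAME")
    if not (org and repo):
        return None
    tag = env.get("CIRCLE_TAG")
    if tag:
        return f"https://github.com/{org}/{repo}/releases/tag/{tag}"
    sha = env.get("CIRCLE_SHA1")
    if sha:
        return f"https://github.com/{org}/{repo}/commit/{sha}"
    return None


def runbook_link(system_code: str) -> str:
    """Clickable runbook link for a systemCode."""
    return f"{RUNBOOK_BASE_URL}/{system_code}"


def enrich(change: dict, env: Mapping[str, str] = os.environ) -> dict:
    """Return a copy of the change log with links added where possible."""
    enriched = dict(change)
    link = github_link(env)
    if link:
        enriched["gitLink"] = link
    if change.get("systemCode"):
        enriched["runbook"] = runbook_link(change["systemCode"])
    return enriched
```

Because `enrich` returns a copy and silently skips anything it cannot work out, a missing environment variable never blocks the change log itself, which matches the spirit of keeping enrichments independent.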
Now that Change API was more mature, we decided to put it in high gear and make its adoption easier. We identified the most common user patterns within the FT for software deployment. The majority of us use GitHub for source control with CircleCI as our CI/CD tool, whilst other teams use Heroku or Jenkins.
To cater to the majority of the dev population, we decided to build a CircleCI orb for Change API with some sensible defaults.
CircleCI Orbs, for those who don’t know, are a nice way to share common CircleCI config. By creating one for Change API, we essentially abstracted away all the shared configuration for Change API, leaving the user to only supply the system-specific information in a clear, concise YAML format, instead of writing complicated bash commands in a circle config.
To support some of the other common deployment patterns, we also supplied a plug and play script for Heroku integration with Change API.
Finally, we integrated Change API with existing tooling. One such example is the `n-gage` npm module, which the developers of FT.com use extensively in their repos. We simply added Change API to this package once, and then all of the repos which use n-gage adopted it with a simple version upgrade.
By applying all of the principles above, we now have a mature Change API product which is widely used across our Technology department.
Change API now:
We are clocking around 150–200 daily code releases.
Even during this COVID-19 pandemic, it seems like we are pushing our code, business as usual:
We are tracking most production releases in a single place, which makes debugging really easy.
One such example of Change API in action is below
We are now in the mindset that whenever an alert fires for a system, we look in the change-log channel to see what has been released recently and whether that could have caused the alert to fire. We roll back, make sure user impact has been reduced or altogether removed, and then do root cause analysis.
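The "what changed just before this alert?" question is simple enough to express in a few lines. The sketch below assumes each change log is a dict with a `systemCode` and a timezone-aware `timestamp`; the one-hour default window and the function name are arbitrary choices for illustration, not something Change API provides.

```python
from datetime import datetime, timedelta, timezone


def recent_changes(changes, system_code, alert_time, window=timedelta(hours=1)):
    """Changes to system_code released within `window` before the alert, newest first."""
    earliest = alert_time - window
    matches = [
        change for change in changes
        if change["systemCode"] == system_code
        and earliest <= change["timestamp"] <= alert_time
    ]
    return sorted(matches, key=lambda change: change["timestamp"], reverse=True)


# Usage: three releases, one alert at 12:00 UTC.
alert = datetime(2020, 5, 1, 12, 0, tzinfo=timezone.utc)
log = [
    {"systemCode": "shop-api", "timestamp": alert - timedelta(minutes=50)},
    {"systemCode": "shop-api", "timestamp": alert - timedelta(minutes=5)},
    {"systemCode": "shop-api", "timestamp": alert - timedelta(hours=3)},
]
suspects = recent_changes(log, "shop-api", alert)
```

Here `suspects` holds the two releases from the last hour, with the one made five minutes before the alert first.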
Feedback from another colleague on the CircleCI Orb:
One completely unexpected but seemingly very useful effect of Change API was identified by another colleague:
Another area of tracking changes we have long struggled with has been DNS. With our old DNS solution, we didn’t have an easy way to track changes being made to DNS records. We recently moved to Route53 as our main DNS provider and with that, moved to Infrastructure as Code. You can probably see where I am going with this…
Thanks to Change API, we now immediately know exactly what changed in DNS and by whom.
We can proudly say that the Change API codebase has been touched by almost every secondee* we have had since its creation. To date, at least 16 different people have made code contributions to this project.
While Change API has been extremely successful, as with any other piece of tech, there are areas for improvement. While some of the decisions we made helped its success and encouraged a wide user-base, we also made some things difficult for ourselves.
We currently don’t have a very reliable monitoring check for Change API. This is a common problem for any microservice-based architecture. We have individual functional tests in place, but we still need to implement an end-to-end testing strategy which will alert us when any part of the pipeline is broken.
Secondly, because Change API is essentially a bunch of Lambdas, it can be a nightmare to debug if a specific part of it isn’t behaving as expected. We have tried to solve that problem in part with centralised, structured logging, but it’s still a far cry from ideal. What makes it worse is that, by design, we never return server side errors to the user, so catching a bug in real time can be very difficult, as we are learning.
We are also looking to add more features to Change API. For example, there have been feature requests to track package releases, as well as changes to AWS secrets and key management.
However, because it was built by a temporary feature team which is now disbanded, we need to find a way to make these happen for our users.
One thing is for certain, Change API is far from done.
* The Reliability Engineering team runs a secondment program through which people from teams across our Engineering department can work with us temporarily on something different from their usual projects. This encourages collaboration and knowledge sharing and reduces developer fatigue. Get in touch if you want to know more about this!