“We need to do better,” I thought to myself, surrounded by chaos. It was hard to believe that just the night before, we had been celebrating the launch of a new system. To say that what happened was unexpected would be an understatement.
Flashback — The Bruises
A couple of years ago, I was involved in a major turnkey project that would serve as a linchpin for other systems. It was awarded to a single vendor with multiple sub-contractors. We ran the project using the waterfall model, as was common then. We did all the necessary testing and when the time came, we launched our new product to the world. By then, it felt like we were building a rocket, as the system was constructed with a fair bit of complexity.
A few hours after launch, this rocket was starting to show signs of crashing; some users were not able to use the system at all. The engineers moved in swiftly to troubleshoot. We were optimistic as only a few users reported issues. After all, there was no issue when we re-tested the system, and we had years of preparation and testing to back up our confidence. Well, this confidence was misguided. Not long after, more users reported similar problems.
We called for an urgent meeting to get an update on the root cause. The lead engineer presented the likely causes and shared how they needed to quickly get the sub-contractors in for further investigation. Basically, he wasn’t sure.
The sub-contractors were called into our data center to get a better look at the logs. We narrowed down a few more probable causes and, by late afternoon, had zeroed in on one root cause: a bug in the firmware of one of the core appliances. The bug didn’t turn up in our tests because it was triggered only under specific conditions. Hence, some users didn’t encounter any issue, while others experienced intermittent access.
A fix was made ready, but we needed to bring our entire system offline to apply the patch. This meant we had to cut off those users who were unaffected by the bug. If we didn’t apply the patch, affected users wouldn’t be able to use the system. In the end, we waited until later at night when user traffic was low to bring the system offline. We applied the fix and did rounds of testing before bringing the system back up.
The next morning was largely uneventful. The system operated as expected. While we were happy, none of us celebrated or declared that the worst was over. We were either too tired or too afraid that we might jinx the system into crashing again. I told myself — “We need to do better”.
“Try Again. Fail again. Fail better.” — Samuel Beckett
Out With The Old, In With NDI
I was given the opportunity to work on the National Digital Identity (NDI) project, which will serve as a critical platform for other systems. I very much wanted to avoid repeating history.
One Goal, Shared Visions
We decided on a co-sourcing approach and worked alongside different vendors to develop the final product. Everyone had an equal opportunity to contribute. This gave us stronger ownership of our product and better clarity about it. We didn’t have to worry about responsibility being diluted across multiple sub-contractors, and we no longer had to depend on the interests of a single vendor. Our team worked cohesively, despite their varied backgrounds, and we delivered our first internal release on time.
No More Waterfall
The software development was done using Agile. This enabled us to quickly adapt and evolve our product. We complemented this with Test Driven Development (TDD) and Pair Programming. While we saw an increase in our development effort (compared to a typical waterfall project), our very first code coverage report showed 98% coverage. This meant far fewer defects and more confidence in our deployed product.
We Are Open
We limited the use of proprietary products in our development. As much as possible, we used comparable open source solutions. Besides the cost savings, we had more flexibility in customisation and more agility if we needed to switch between different solutions. It would also be far easier for any third party to integrate and use our services. For example, authentication and authorisation services were built around OpenID Connect (OIDC) to simplify integration efforts. Our developers, in turn, welcomed the freedom to suggest different open source products. It’s more about finding the right fit than over- or under-fitting.
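To give a feel for why a standard like OIDC simplifies integration, here is a minimal sketch of how a third-party application might build the first request of the authorization code flow. The parameter names (response_type, scope, client_id, redirect_uri, state, nonce) come from the OIDC specification. The endpoint, client ID and callback URL are hypothetical placeholders, not real NDI values, and the helper function is an illustration rather than part of any NDI library.

```python
import secrets
from urllib.parse import urlencode


def build_authorization_url(authorization_endpoint, client_id, redirect_uri):
    """Return an OIDC authorization-code request URL, plus the state and nonce to remember."""
    state = secrets.token_urlsafe(16)  # echoed back by the provider; guards against CSRF
    nonce = secrets.token_urlsafe(16)  # bound to the ID token; guards against replay
    params = {
        "response_type": "code",  # ask for an authorization code
        "scope": "openid",        # the scope that marks this as an OIDC request
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "nonce": nonce,
    }
    return f"{authorization_endpoint}?{urlencode(params)}", state, nonce


# Hypothetical values for illustration only.
url, state, nonce = build_authorization_url(
    "https://idp.example.com/authorize",
    "demo-client",
    "https://app.example.com/callback",
)
print(url)
```

Because every OIDC provider accepts the same request shape, an integrator who has done this once against one provider mostly needs to change the endpoint and client details to work with another.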
Go Cloud Or Go Home
One of the biggest decisions we made was to work with commercial cloud providers to host our new system. We sent our developers for the necessary training, and designed our system to incorporate cloud features (scalability, managed services, etc.). We didn’t have to worry about machine failure, or bringing in new hardware to support more load. Our developers could fully focus on writing and tuning application code. Deploying on the cloud was the reason we managed to deliver our first release on time within the given timeframe. The first few months saw our developers setting up and tearing down instances, re-deploying firewalls, reverse proxies, etc.; things that would have been very complex and time-consuming with an on-premises solution.
The Whole Is Greater Than The Sum Of Its Parts
Next, we designed our architecture based on microservices. It suited our Agile approach, where small chunks of code can be developed in isolation from the rest. Unlike with a typical monolithic application, our developers were no longer dependent on one another when writing code. A small codebase is far more manageable than code that is part of a huge deployment file. However, a growing number of microservices can and will get out of control. As we found out, we needed a method to keep the madness in check. Enter Kubernetes and Docker. Microservices were containerised with Docker, and Kubernetes would handle the deployment and scaling of the containers.
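As an illustration of what "handling the scaling" can mean, the sketch below reproduces the rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). This assumes the Horizontal Pod Autoscaler is the scaling mechanism in play, which this article does not state, and the pod counts and CPU figures are made up.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count suggested by the Horizontal Pod Autoscaler's documented rule."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: average CPU utilisation per pod against a 60% target.
print(desired_replicas(4, 90, 60))  # scale up
print(desired_replicas(4, 62, 60))  # within tolerance, no change
print(desired_replicas(8, 30, 60))  # scale down
```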
Development, Security, Operations
Our DevSecOps implementation pieced it all together, from code development to deployment. It consisted of GoCD to handle our continuous builds and Fortify for static code analysis, among others. Every deployment, to any environment, went through multiple rounds of automated testing first. We didn’t have to worry about inconsistent testing, or about missing a test case before code went into our production environment. Fluentd was used for centralised log collection. Prometheus, Jaeger, Elasticsearch, Kibana, and Grafana were implemented for full end-to-end monitoring. At a glance, we could see the health of our system, configure alerts for certain thresholds and, more importantly, trace exactly where issues originated. I no longer had to wait for a vendor to call a sub-contractor who would call a product engineer for specific data. We had instant access to the inner workings of every part of our system.
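The idea of gating every deployment behind automated checks can be sketched in a few lines. This is not how GoCD or Fortify are configured; it is a toy model, with hypothetical stage names, showing the principle that a deployment only happens when every earlier stage has passed, and that the pipeline stops at the first failure.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage], deploy: Callable[[], None]) -> List[str]:
    """Run each stage in order and deploy only if every stage passes."""
    log = []
    for name, check in stages:
        passed = check()
        log.append(f"{name}: {'pass' if passed else 'fail'}")
        if not passed:
            log.append("deploy: skipped")
            return log  # stop at the first failing stage
    deploy()
    log.append("deploy: done")
    return log


# Hypothetical stages; real ones would invoke test runners and scanners.
deployed = []
stages = [
    ("unit tests", lambda: True),
    ("static analysis", lambda: False),  # simulate a finding that blocks the release
    ("integration tests", lambda: True),
]
for line in run_pipeline(stages, lambda: deployed.append("release")):
    print(line)
```

Because the same stages run in the same order for every environment, nobody has to remember which tests were run before a release reached production.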
In our next release, we aim to develop full blue-green deployment capabilities. We want to not only detect issues quickly, but also deploy fixes just as fast, without any downtime. Our users shouldn’t even notice the difference. We want to avoid ending up in the same situation again, where we were forced to cut off unaffected users just to apply a simple fix.
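For readers unfamiliar with blue-green deployment, the following is a minimal sketch of the idea under simplifying assumptions: two identical environments, one live and one idle. A new version goes to the idle environment, is health-checked there, and traffic is switched only if the check passes. The class, its method names and the version strings are hypothetical; a real setup would switch a load balancer or service selector rather than a Python attribute.

```python
class BlueGreenRouter:
    """Toy model of two environments with traffic pointed at exactly one of them."""

    def __init__(self, live_version):
        self.environments = {"blue": live_version, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def release(self, version, health_check):
        """Deploy to the idle environment and switch traffic only if it is healthy."""
        target = self.idle
        self.environments[target] = version  # users are still served by the live side
        if not health_check(version):
            return False  # live environment untouched, so users notice nothing
        self.live = target  # the switch: a change of pointer, with no downtime
        return True

    def serving(self):
        return self.environments[self.live]


router = BlueGreenRouter("1.0")
router.release("1.1", health_check=lambda version: True)
print(router.live, router.serving())
router.release("1.2", health_check=lambda version: False)  # a bad build never goes live
print(router.live, router.serving())
```

The previous version stays in place on the now-idle side, so rolling back is the same pointer switch in the other direction.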
“Don’t worry about getting perfect, just keep getting better.” — Frank Peretti, Illusion
NDI will be a platform that offers a range of services, such as authentication, remote authorisation, identity verification, digital signing, facial biometrics, and push notifications, for both government agencies and private businesses. As with all good platforms, a stable foundation is necessary. This is why the main focus of our first release was on getting the foundation right. Most of the work was buried behind the scenes. We had to set up the infrastructure, pick the right tools and choose the right people to help us achieve our goals. It was not easy. There were many unknowns, many firsts, and many failures. However, from our first release, I’d say we achieved all the objectives we set out to meet.
Looking back and comparing this to the project I mentioned earlier in this article, if you were to ask me if we had done better, I’d say, without hesitation, Yes.
Think NDI can help you? Talk to us: https://go.gov.sg/engage-NDI