What is agility in tech? (Doing 10 deployments an hour @udaan)

Kaushik Mukherjee
Published in engineering-udaan
10 min read · May 16, 2023


Prologue

I am often stumped when confronted with the classic “what’s agility like in tech?” type of question. While I am fully aware of the level of agility and the rigorous control over metrics at udaan, as well as at some of my previous stints leading engineering and product orgs, the answer to such a question is almost never straightforward. Perhaps because the question itself is ambiguous.
Agility in what? In solving a complex problem, adapting to a dynamic environment, or producing a blueprint? Or is it about agility in coding, in design, in build times, or in release cycles? What about quality? Can it all be distilled into a single, lucid number?

Startups sometimes consciously choose to compromise on certain software engineering principles for the sake of speed. But, it is equally important to remember that stacking up debt can only result in bankruptcy over time.
Then there is the entire gamut of product debt — abandoned code from failed experiments that continues to live in the system. This mounting debris can cause problems with readability, maintenance and performance if left unaddressed.
The question, hence, is: how do you balance agility and still ensure sanity?
BTW, this blog isn’t about tech or product debt, although they do influence agility. It is mostly about what needs to be done to remain agile despite incurring debt all the time, while making ~10 deployments an hour.
But first, let us chat first principles and talk a little bit about SDLC.

What is SDLC?

Imagine how building a house requires a set series of stages: site preparation, floor slab, framing, etc. Similarly, in software engineering there are 7 stages to building out an end product. Now imagine building out tens or even hundreds of those end products that are all interconnected and talk to each other. This requires intricate orchestration of the stages, and seamless replication of that orchestration across products. A smart SDLC mechanism enables just that.

Stages of building a house

SDLC, or Software Development Life Cycle, consists of a set of tools and processes that assist in producing software of the highest quality at the lowest cost in the shortest time possible. It provides a well-structured flow of phases that helps an organisation quickly produce high-quality software that is well tested and ready for production use.

So what are those phases?

Plan: In this phase, thorough research is conducted on the product. Then, depending on the span of the product, deliberation is required across sub-orgs.

The pros and cons of the current processes, software and methods get identified. An outcome of this phase is a Software Requirement Specification (SRS).

Design / Code / Build: Once the SRS is completed, design considerations (as applicable) come next, including how the design will cater to the requirements.

From here on, it’s about coding and building: the implementation phase.

Test: Once the product is developed, the testing phase follows. Traditionally, this was done by a QA team, but more evolved organisations leverage tool sets that allow a developer to do things end to end. More on this in later sections.

Release: The tested product is rolled out into a different environment. Let’s call it staging or pre-prod. Now it’s time to see if the product functions as expected when exposed to this environment. A/B tests are also done to validate hypotheses.

Deploy: Once all the errors are removed, the product is rolled out to the market.

Monitor: After deployment, there is an observation phase wherein the market reacts to the product. Sometimes A/B tests are conducted here as well. Based on the feedback received, improvement analysis is conducted.

Operate: The software is now achieving what it was targeted for, and is responding to the feedback cycle. “Does the software version need an upgrade? Are new features needed? Should the interface be simpler and more intuitive?” And so on.
Every team then replicates these steps using some methodology. It could be waterfall, agile etc. Methodology is out of scope for this blog.

Why was doing SDLC the right way important for udaan?

E-commerce is in itself a complex domain. Hundreds of microservices across thousands of API endpoints need to orchestrate intricately across multiple subsystems for the platform to function 24x7, 365 days a year.
Now imagine building such systems to power the fastest company in India to become a unicorn!

Photo by Ronnel Ramos on Unsplash

It is very easy for chaos to reign if building systems the right way (SDLC) is not thought through at the inception stage.
At udaan, as systems continued to grow in complexity as well as volume, it became important to ensure there was a methodical approach where:

  1. Productivity could be measured conclusively
  2. Capabilities were built that allowed us to scale without compromising on quality
  3. The overall quality of the digital artefacts was understood more deeply

To put things a little more into perspective: while the number of services at udaan grew by 41% Y/Y, and so did our commits, our Speed to Stability (S2S) ratio improved by 1%.

Speed to Stability ratio (S2S) is quite simply (# production deployments - # system degradations and outages) / # production deployments

This metric, as the name suggests, provides a sense of true velocity. For instance, while the number of deployments could be looked at separately, an increase in deployments accompanied by a decrease in stability is suboptimal, and vice versa.
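
To make the ratio concrete, here is a minimal Python sketch of the S2S calculation as defined above. The function name and the sample numbers are hypothetical, chosen only for illustration; they are not udaan's actual figures.

```python
def speed_to_stability(deployments: int, incidents: int) -> float:
    """S2S = (production deployments - degradations and outages) / production deployments."""
    if deployments <= 0:
        raise ValueError("need at least one production deployment")
    if incidents < 0:
        raise ValueError("incident count cannot be negative")
    return (deployments - incidents) / deployments


# Hypothetical numbers, for illustration only.
before = speed_to_stability(deployments=1000, incidents=20)
after = speed_to_stability(deployments=1500, incidents=15)
print(f"S2S before: {before:.3f}, after: {after:.3f}")
```

More deployments with proportionally fewer incidents push the ratio towards 1, while more deployments with more incidents pull it down, which is why the ratio says more than the deployment count alone.
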
Now, S2S would in all probability improve if the following areas of software development were actually improving:

  1. A predictable CI/CD pipeline
  2. Clearer contracts between systems
  3. Intelligent tools that continuously reduced the likelihood of a broken build / deployment
  4. Observability, alerting and monitoring that enabled a faster MTTI (mean time to identify)
  5. And smart processes and tools that allowed a faster MTTR (mean time to resolve)
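
Items 4 and 5 can be tracked with very little code. The sketch below assumes a hypothetical `Incident` record with three timestamps, and assumes both MTTI and MTTR are measured from the moment the incident started (some teams measure MTTR from identification instead). None of these names come from udaan's tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    # Hypothetical record; field names are assumptions for this sketch.
    started: datetime
    identified: datetime
    resolved: datetime


def mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mtti_minutes(incidents: list[Incident]) -> float:
    """Mean time to identify: incident start until it was noticed."""
    return mean_minutes([i.identified - i.started for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve: incident start until it was fixed."""
    return mean_minutes([i.resolved - i.started for i in incidents])


# Two made-up incidents, for illustration only.
t0 = datetime(2023, 5, 1, 10, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=60)),
]
print(mtti_minutes(incidents), mttr_minutes(incidents))
```
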

SDLC at udaan

Before I go into the details of how we went about implementing SDLC at udaan, I want to spend some time discussing the landscape.

SaaS solutions are amongst the fastest growing segments in the software industry. While a majority of this pie is held by cloud infrastructure offerings (GCP, AWS, etc.), SDLC and DevOps tools are showing significant growth as well.

Image credit : Better Cloud https://www.bettercloud.com/wp-content/uploads/sites/3/2017/05/2017stateofthesaaspoweredworkplace-report-1.pdf

This is an indicator of how organisations are adopting some of the latest DevOps and SDLC tools to remain competitive. According to Gartner, the SaaS industry will continue to grow in 2022 and beyond. The Business Research Company predicts that SaaS will grow from USD 270B in 2022 to USD 435B in 2025.

Companies worldwide have started leveraging SaaS in a huge way. Of course, adoption has been accelerated significantly by the pandemic. DevOps and SDLC tools offered as SaaS have seen higher adoption, especially among small and medium-scale organisations.

So why is this upward trend in using tool sets for SDLC emerging?

The stages of SDLC explained at the start of this blog need to be carried out to ensure the right outcomes are being engineered. Traditionally, organisations have invested in people as gatekeepers for each of the stages. As systems and complexities grow, more and more people need to be deployed to maintain the sanity of the overall outcomes.

This has two major problems.

  1. The quality of each stage now becomes dependent on people and is therefore susceptible to human error.
  2. As complexities grow, they start impacting execution speed.

All of this has a compounding effect and leads to a significant negative impact on velocity.
What is required, hence, is to bring systemic interventions into all of these stages and build a platform for engineers that enables them to focus on what they do best, i.e. write awesome code and design systems that evolve with emerging requirements. This removes distractions and automates the mundane, repetitive tasks as far as possible, so engineers can focus on their craft and have a great time doing it.

Achieving upwards of 10 deployments / hour

At udaan, various frameworks and tools were woven into the entire SDLC.

Here is a peek into all the tools that were either built or customised internally (except for a few which are paid)

Each of the tools above is an important cog in the SDLC wheel.
Together, they continue to aid in increasing the S2S at udaan. For example, the SQL dashboard has a tool called snorql that monitors and diagnoses SQL-related problems. It helps write durable queries that scale with time. The engineer does not need to go back and fix things as requirements evolve because snorql, apart from implementing best practices, also provides recommendations on how to optimise a given query.
Or, for example, the Netra tool alerts on and explains possible system degradation, which business metric might be impacted as a result of that degradation, and which of the hundreds of services might have gone awry. This helps in reducing MTTR as well as stopping faulty deployments.
Or, for that matter, the binary compatibility detector ensures binaries are backward compatible. If they are not, it fails the build, thus saving time and the heartache of an inevitable production disaster. There are posts on the udaan engineering blog describing what these tools do.
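
The idea of failing a build on a backward-incompatible change can be sketched in a few lines. This is not udaan's binary compatibility detector, whose internals are not described here. It is a simplified stand-in that assumes the public API of each version has already been extracted into a hypothetical name-to-signature mapping, and it treats any removed or changed entry as breaking while allowing additions.

```python
def breaking_changes(old_api: dict[str, str], new_api: dict[str, str]) -> list[str]:
    """List entries of old_api that new_api removed or changed."""
    problems = []
    for name, signature in sorted(old_api.items()):
        if name not in new_api:
            problems.append(f"removed: {name}{signature}")
        elif new_api[name] != signature:
            problems.append(f"changed: {name}{signature} -> {name}{new_api[name]}")
    return problems


def build_exit_code(old_api: dict[str, str], new_api: dict[str, str]) -> int:
    """Return 1 (fail the build) if anything broke, else 0."""
    problems = breaking_changes(old_api, new_api)
    for problem in problems:
        print(problem)
    return 1 if problems else 0


# Hypothetical API surfaces, for illustration only.
released = {
    "getOrder": "(id: String): Order",
    "cancelOrder": "(id: String): Unit",
}
candidate = {
    "getOrder": "(id: String, region: String): Order",  # signature changed
    "listOrders": "(): List<Order>",                    # addition, allowed
}
exit_code = build_exit_code(released, candidate)
```

In a CI pipeline, the returned code would be passed to `sys.exit` so that a non-zero value stops the deployment before it reaches production.
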

Guiding principles

  1. Create a measurable plan for each step of the SDLC cycle
  2. Incorporate incidental human learnings into systems, making them smarter
  3. Have a genuine interest in enhancing engineer experience

Creating a measurable plan:
Each track in SDLC was deliberated, and we almost forced ourselves to put metrics on the vectors, like so:

Tracks in SDLC

This results in honest conversations on current states and measurable movements thereof.

Learnings from past mistakes:
This aspect is quite fundamental to udaan, and I highly recommend it regardless of what stage of the maturity curve an organisation is at. Being a pioneer in e-B2B commerce, udaan did not always have a playbook to follow and often had to create one. Thus, mistakes were bound to be made, and that was completely OK, but it was extremely important to learn from them and innovate.
Not only is udaan diligent about RCAs (root cause analyses), the RCAs at udaan are blameless. The focus is on the incident and not the individual. There is analysis around “how did this issue escape the current guardrails?” and “what can we do to avoid this in the future?”
This brings valuable insights that go into creating some of the most effective tools and processes in-house. Most action items from the RCAs at udaan make their way into one of the tracks of SDLC described above. This means continuous systemic improvements that avoid similar mistakes without the need to rely heavily on processes.

Caring about engineering experience:

udaan encourages curiosity, questioning everything and critical thinking

Fellow engineers regularly talk to each other about technology and its advancements and about our tech stack, and do not hesitate to provide feedback in case they feel something is not working, regardless of who they are talking to (or at least that is what I would like to believe). Any problem that looks like a recurring pattern is a fair candidate to be systemised. Also, the culture of surveys and feedback is passionately nurtured. Every tool that gets developed in-house goes through both qualitative and quantitative feedback to measure its effectiveness. Sometimes the feedback can be brutal, but it leads to honest conversations around what is working and what is not, which in turn leads to creating frameworks that promote efficiency, creativity and productivity.
Also, there are no artificial walls. Any engineer can contribute to the SDLC tech stack. If engineers see a problem, there is a good chance they will build a tool to solve it, and it will go right into the SDLC stack for others to use. There are several special interest groups inside the organisation, for example on SQL, Redis and Cosmos, to name a few. Many cool tools have emerged out of these groups.

SDLC stack maturity at udaan

What’s next?

Tech at udaan is FOSS-first when it comes to consumption.
Likewise, I feel that the tech team may have produced useful artefacts that can be leveraged by the larger community, and so we have started open sourcing some of what we produced. snorql is one such example, and it has already started seeing contributions from the community. Several other tools are in the works towards being open sourced.

That said, there are still unexplored areas ahead that could transform the landscape of building intuitive software applications. This is especially so when it comes to replicating real production behaviour, complete with its fair share of jitters, network partitions and varied handheld device responses based on their configurations. Last but not least, there is the largely uncharted, hallowed ground of self-healing mechanisms that we will soon explore.
