A Tech Team’s Voyage Towards Enhanced Stability: Part 1

Akhilesh Baldi
Capillary Technologies
10 min read · Mar 12, 2024

At Capillary Technologies, we have always taken immense pride in delivering excellent customer value with our innovative, highly scalable, reliable, and stable B2B SaaS platform.

In the past few years, we have successfully kept our production incidents at a historic low and uptime in the range of 99.9% on all products. All this while executing well-planned maintenance windows and highly agile sprint deliveries that helped us make our customers successful.

This razor-sharp focus on excellence has rightfully resulted in healthy Net Promoter Score (NPS) improvement, global recognition (recognized as a Leader by Forrester), the addition of marquee (and demanding) customers bringing unprecedented scale to the platform, and a highly confident executive leadership team that has placed big bets by acquiring and expanding business in multiple geographies and verticals.

Acknowledging our expansion across diverse geographic regions and the importance of catering to various verticals processing 600+ billion points transactions, we embarked on a journey to enhance our stability. We focused on critical change-management issues and worked to elevate our uptime beyond the 99.9% threshold while keeping our CI/CD pipelines running continuously.

This two-part blog series encapsulates our journey, illustrating our progression in product stability and production uptime from 99.9% to an impressive 99.999%.

The blog encapsulates our journey through the following stages:

  1. Outlining our critical objectives for achieving stability and efficiency.
  2. Detailing the genesis of the Tech-Guardian team and its pivotal role.
  3. Describing the responsibilities and norms upheld by the Tech-Guardian team.
  4. Addressing the challenges faced in reaching our objectives and the innovative solutions devised.
  5. Concluding with success metrics that demonstrate our progress and achievements.
  6. Teasing what awaits in Part II, where we delve deeper into automation to enhance our processes further.

What drove our pursuit of surpassing the 99.9% threshold?

Amidst the relentless pace of development and the pressures of meeting tight delivery deadlines, our focus was on tackling known and unknown factors causing downtime and stability issues, which, in turn, undermined client confidence. Furthermore, we underscored the importance of maintaining due diligence despite high productivity.

This sparked more focused discussions within the team about the actions we needed to take.

In tandem with our commitment to progress, the birth of the Tech-Guardian team marked a pivotal moment in our quest for excellence. Collectively, we recognized the hurdles impeding our path forward and forged this dedicated team.

Through collaborative effort and shared determination, the team unearthed and resolved the barriers obstructing our journey toward success.

This group leverages its combined expertise to identify and overcome obstacles, implementing new processes to protect product stability and assist tech teams.

Here are the key objectives we aimed to achieve with the Tech-Guardian team.

In the first part of this blog series, we will focus on addressing the first three objectives.

  1. Keeping the number of critical/high-priority production bugs to a low single digit throughout the quarter.
  2. Limiting change management-related issues to at most three across the teams.
  3. Achieving a production uptime of at least 99.99%.
  4. Increasing the stability of our production automation suites from 95% to 99%.
  5. Maintaining the stability of our pre-production environment automation suites at or above 70% with continuous development every week.

As we embark on this journey towards stability and efficiency, we commit to addressing each challenge head-on. In this first part of the series, we delve into practical strategies to tackle the initial three challenges. In the second installment, we’ll continue our exploration, offering insights into how we overcame the remaining obstacles.

What inspired the name ‘Tech-Guardian’?

Mirroring the protective connotation of “Guardian,” we established a dedicated team of 10–12 members, leveraging their combined technical expertise and awareness of our challenges. This core group, led by a few Engineering Managers, Team Leads, Senior Developers, and Technical Project Managers, aims to safeguard product stability and address team-related challenges.

The objectives of this group:

  1. Reduce change-related production bugs: Identify, investigate, and mitigate issues arising during the release process to minimize their impact on customers and improve product quality.
  2. Enhance automation effectiveness: Evaluate, optimize, and expand the use of automation tools and frameworks to increase efficiency, reduce manual effort, and improve the reliability of our products.
  3. Proactive bug prevention: Implement measures to prevent bugs from occurring in the first place through rigorous code reviews, test coverage improvements, and proactive error-handling strategies.
  4. Collaborate with cross-functional teams: Work closely with development teams to foster collaboration, streamline processes, and ensure seamless coordination during releases.
  5. Continuous improvement: Regularly evaluate our bug detection and resolution processes, automation frameworks, and tools, seeking opportunities for improvement and innovation.

Establishing norms and guidelines was essential for fostering a cohesive and productive team environment.

The team established the following norms and guidelines:

  1. Respect for diverse perspectives: Acknowledge and appreciate the variety of viewpoints within the team.
  2. Encouragement of constructive feedback: Foster an environment where we give and receive feedback constructively, promoting open and honest communication.
  3. Effective collaboration: Work together proactively to identify opportunities for collaboration across teams, enhancing effectiveness and synergy.

The Tech-Guardian team embarked on a journey to delve into the intricacies of our challenges, unraveling their complexities and seeking viable solutions. However, the magnitude of the issues and the diversity of perspectives necessitated a more focused approach.

Consequently, we established subgroups to address specific problems and foster efficient collaboration among team members.

Let’s look at the first two challenges together, given their similarities.

How did we identify the grey areas that were impacting our product stability?

The first subgroup focused on identifying all production bugs within their respective teams and analyzing patterns around change management, particularly the prevalence of critical/high-priority production bugs.

Here are the patterns we identified:

  1. Inadequate technical specifications: A lack of precise technical specifications caused confusion throughout the development process, forced clarifications both before and after development, and resulted in post-release fixes due to unforeseen gaps.
  2. Variations in code quality and PR review practices: Different teams had varying practices for code quality and pull request (PR) reviews, which impeded collaboration and affected overall code quality.

How did we standardize our technical specifications and drive adoption among 150+ team members within two months?

The team implemented standardized sections in tech detailing documents to ensure consistency and improve team collaboration with faster/smoother adoption.

These standardized sections included:

  1. Precise Requirements: Ensuring alignment with product requirements to establish a common understanding of what needs to be developed.
  2. Solution Details: Providing precise details on the proposed solution and the rationale behind architecture changes, covering both UI and backend aspects.
  3. Scope Definition: Clearly defining the scope of requirements to minimize gaps and ambiguity.
  4. Test Plan: Developing a comprehensive test plan and automation suite for release and stability testing.
  5. Tasking and Effort Estimation: Breaking the requirements into manageable tasks with estimated effort, encompassing development and testing efforts, including automation suites.
  6. Monitoring and Alerts: Implementing monitoring and alert mechanisms to ensure stability after production releases.
  7. Release/Rollout Plan: Creating a detailed plan for releasing changes across clusters, if necessary.
  8. Version Control: Maintaining version control to track changes and ensure consistency across environments.

By enforcing these standards and establishing a review process before development begins, we aimed to achieve the following:

  • Stable sprint-level delivery
  • Greater consistency across teams
  • Improved collaboration
  • Higher-quality products with fewer post-release issues

This tech detailing template is now followed across all teams at Capillary.
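
To make adoption checkable, one could lint documents for the standardized sections automatically. Below is a minimal sketch, assuming tech detailing documents are written in Markdown with one heading per section; the script and file layout are illustrative, not our actual tooling.

```python
# Hypothetical lint: verify a Markdown tech detailing document contains
# every standardized section listed above. Illustrative only.
import re
import sys

REQUIRED_SECTIONS = [
    "Precise Requirements",
    "Solution Details",
    "Scope Definition",
    "Test Plan",
    "Tasking and Effort Estimation",
    "Monitoring and Alerts",
    "Release/Rollout Plan",
    "Version Control",
]

def missing_sections(markdown: str) -> list[str]:
    """Return required sections that never appear as a Markdown heading."""
    headings = {
        m.group(1).strip()
        for m in re.finditer(r"^#{1,6}\s+(.+)$", markdown, re.MULTILINE)
    }
    return [s for s in REQUIRED_SECTIONS if s not in headings]

if __name__ == "__main__":
    doc = open(sys.argv[1]).read()
    gaps = missing_sections(doc)
    if gaps:
        print("Tech detailing document is missing:", ", ".join(gaps))
        sys.exit(1)
    print("All standardized sections present.")
```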

How did we ensure the adoption of this control across teams before starting the development?

We introduced a change in our development process to ensure that team members were aligned and consistently followed the process.

  • Mandating Tech Detailing Document Links in our Jira workflow:

Before development begins, team members must link the tech detailing document in Jira, signed off by a Team Lead or Architect. This ensures that all relevant documentation is readily available and accessible to the team throughout development (see the sketch after this list).

  • Regular and Random Reviews: We conduct regular and random reviews of test cases and tech detailing documents. This ongoing review process helps ensure that documentation is comprehensive, accurate, and aligned with project requirements. It also allows team members to collaborate, share feedback, and address any discrepancies or gaps in documentation.
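
As a loose illustration of the first control, here is a minimal sketch of how a scheduled job might flag issues that entered development without a linked tech detailing document. The Jira instance URL, credentials, status name, and custom field id below are assumptions, not our actual configuration.

```python
# Hypothetical sketch: flag Jira issues in development that lack a
# tech detailing document link. Field and status names are assumed.
import requests

JIRA_BASE = "https://example.atlassian.net"  # assumed instance URL
AUTH = ("bot@example.com", "<api-token>")    # assumed credentials
TECH_DOC_FIELD = "customfield_10042"         # assumed custom field id

def issues_missing_tech_doc(project_key: str) -> list[str]:
    """Return keys of in-development issues with no tech detailing link."""
    jql = f'project = {project_key} AND status = "In Development"'
    resp = requests.get(
        f"{JIRA_BASE}/rest/api/2/search",
        params={"jql": jql, "fields": TECH_DOC_FIELD},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    return [
        issue["key"]
        for issue in resp.json()["issues"]
        if not issue["fields"].get(TECH_DOC_FIELD)  # empty or missing link
    ]

if __name__ == "__main__":
    for key in issues_missing_tech_doc("LOYALTY"):  # assumed project key
        print(f"{key}: missing tech detailing document link")
```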

By integrating these measures into our Jira process, we cultivate accountability, foster collaboration, and nurture a culture of ongoing enhancement. This proactive stance helps us uphold high-quality standards, mitigate risks, and deliver superior products to our stakeholders.

How did our standardized code review practices and checkpoints help us ensure high maintainability of the codebase across multiple teams?

  • Variations in code quality and PR review practices existed across different teams, hindering collaboration and impacting overall code quality.
  • To address concerns regarding code quality and inconsistent peer review practices, the team took proactive steps by developing comprehensive documentation of best practices.
  • This knowledge base covers aspects such as code reviews, test-driven development (TDD), and guidelines for writing high-quality code.
  • It is a valuable resource for existing team members and newcomers, ensuring consistency and promoting continuous improvement in coding standards and review processes.

We also identified and documented best practices for the code review process itself.

Standardizing the commit guidelines across all groups proved beneficial in identifying issues, solutions, and their impacts; a sketch of such a guideline check appears below.
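
This post does not spell out the exact format we standardized on, so here is a minimal sketch assuming a Conventional-Commits-style guideline with an optional Jira key; the regex and hook wiring are illustrative.

```python
# Hypothetical commit-msg hook: enforce "type(KEY): summary" messages.
import re
import sys

COMMIT_RE = re.compile(
    r"^(feat|fix|refactor|test|docs|chore)"  # change type
    r"(\([A-Z]+-\d+\))?"                     # optional Jira key, e.g. (LOY-123)
    r": .{10,72}$"                           # concise, descriptive summary
)

def follows_guideline(message: str) -> bool:
    """Return True when the first line of the message matches the guideline."""
    lines = message.splitlines()
    return bool(lines) and COMMIT_RE.match(lines[0]) is not None

if __name__ == "__main__":
    # Git invokes a commit-msg hook with the path to the message file.
    with open(sys.argv[1]) as fh:
        if not follows_guideline(fh.read()):
            print("Commit message does not match the guideline: type(KEY): summary")
            sys.exit(1)
```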

The initial subgroup adeptly tackled the challenges, and the solutions were smoothly implemented across the entire team.

Simultaneously, another subgroup focused on addressing the issue of production uptime.

Achieving a production uptime of at least 99.99%


Addressing change management issues was crucial, but it’s typical for legacy systems to have hidden vulnerabilities that only manifest in the production environment. Despite thorough testing and preventive measures, these issues can still arise unexpectedly, causing disruptions.
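
For context, each additional nine shrinks the annual downtime budget by an order of magnitude; a quick back-of-the-envelope calculation shows why the jump from 99.9% to 99.999% is so demanding:

```python
# Downtime budget implied by each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for target in (99.9, 99.99, 99.999):
    budget = MINUTES_PER_YEAR * (1 - target / 100)
    print(f"{target}% uptime allows ~{budget:.1f} minutes of downtime per year")

# 99.9%   -> ~525.6 minutes (~8.8 hours)
# 99.99%  -> ~52.6 minutes
# 99.999% -> ~5.3 minutes
```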

Recording each downtime was a proactive approach to understanding and addressing issues in our production environment. Here’s a breakdown of the information we started to capture for each downtime (a sketch of such a record follows the list):

  • Issue Duration: The duration of the downtime, indicating when it began and when it was resolved.
  • Issue: A description of the specific issue or problem during the downtime.
  • Reason for the Downtime: The underlying cause or reasons that led to the downtime. It could include software bugs, infra failures, configuration errors, etc.
  • Temporary Fix: Any immediate or temporary measures taken to restore functionality during the downtime.
  • Permanent Fix: The long-term solution or permanent fix implemented to prevent the issue from recurring in the future.

We allocated bandwidth for permanent fixes and incorporated them into our roadmap, ensuring each was addressed within 30 days.

Implementing permanent fixes regularly and integrating them into our quarterly roadmap brought several benefits:

Reduced Downtime: By systematically addressing underlying issues, we experienced a decrease in the frequency and duration of downtime incidents. This improved our systems’ availability and reliability, enhancing user satisfaction.

Increased Stability: With each permanent fix implemented, our systems became more stable and resilient to failures. It helped prevent future incidents and reduced the need for reactive measures to address downtime.

Predictable Maintenance: By scheduling permanent fixes into our quarterly roadmap, we established a predictable cadence for maintenance activities. It allowed us to allocate infra resources effectively and plan for any potential disruptions to production activities.

Improved Efficiency: With fewer downtime incidents and a more stable environment, our teams could focus on strategic initiatives and value-added activities rather than firefighting and troubleshooting issues.

Enhanced Confidence: Stakeholders, including customers and internal teams, gained confidence in the reliability of our systems. It improved trust and perception of our organization’s ability to deliver stable and dependable services.

Conclusion and Success Metrics

  • Observed a significant reduction in change management-related production bugs, from 70% down to under 20%.
  • Product stability and uptime hit 99.999%
  • Dev process improvements reduced time spent on PR reviews and test coverage maintenance.
  • Positive feedback from stakeholders regarding improved product quality and reliability.

How did we measure the effectiveness of these controls from the developers’ perspective?

Understanding the value of team perspective, the Tech-Guardian team prioritized gathering feedback after implementing new workflows. Analyzing anonymous surveys revealed positive responses from 70% of participants, validating the impact of the changes on team dynamics and satisfaction.

Key Takeaways

  • Areas of improvement in engineering processes that cause product stability issues in a continuous delivery environment with Agile teams.
  • Seamless adoption of new controls in existing SDLC processes through proper standardization, spanning different product teams of 150+ members.
  • How to adopt the shift-left approach effectively for better control during development, instead of struggling with stability issues or significant gaps found post-release or near larger product launches.
  • How bottom-heavy teams can reduce the risk of last-minute surprises near sprint release dates or critical client deliverables.

What’s next?

In the second part of this series, we will delve deeper into automation around staging and production environments, exploring how they have contributed to further enhancing stability and predictability.

Stay tuned for more insights and happy learning!
