Reliability — Good Bye SDLC, Hello SDRLC — 3 Things to Future Proof Your SDLC Right Now

Shesh Chikkatur
Mar 31, 2020

The Software Development Life Cycle (SDLC) has been around for many decades, going through various permutations to include Software as a Service (SaaS) and more, with many leading consortiums trailblazing best practices. While SDLC is a popular methodology, most companies scope it to seven phases (Idea, Plan, Design, Build, Test, Deploy, and Maintain), run either discretely (waterfall project management) or with some amount of feedback loop (agile). However, one dimension is often ignored or treated as an afterthought: Reliability and Supportability.

Many companies struggle to define this dimension and fall into the trap of late discovery, resulting in churn across the People, Process, and Technology framework. Change management is often missing entirely or done very ineffectively, and you can say goodbye to staying within budget. In a remote-first, cloud-first world, reliability is the key to future-proofing our SDLC framework:

“Reliability should be regarded as a branch of science of its own and that many companies are yet to understand that its role doesn’t start at the end of the SDLC phases, but should be a part of every phase that will help in making production systems more resilient than what they can be without its influence.”

“SDLC is considered complete only when Reliability is incorporated as an embedded methodology to follow and hence needs to be viewed as a new paradigm called Software Development & Reliability Life Cycle (SDRLC).”

First, Shift Your Paradigm from SDLC to SDRLC

Companies need to audit their SDLC process immediately to identify where it falls short of making production systems reliable, resilient, and supportable. A quick litmus test is to interview the people working in each area, ask how Build-Measure-Learn works between teams, and note the gaps. Once the gaps are identified, plan collaborative engagements between the product, development, DevOps, infrastructure, and support groups. The outcome of these engagements will pave the path for the transformation from SDLC to SDRLC.

Second, Engage SREs Early and Often

The team that was previously brought in only at the end of the SDLC is the team best placed to perform the reliability tasks, working in close collaboration with all the other teams. This is the Site Reliability Engineering (SRE) team, which combines software engineering with systems and infrastructure operations. SREs need to take on a broad range of responsibilities during the initial phases of the SDLC, beyond just supporting production systems after deployment. They need to be involved at every stage, in a multitude of capacities, to provide insights from the production perspective and to provide sign-off as required at the time of production release (formal sign-off for waterfall projects, implicit sign-off for agile projects because of their continued association). SREs need to prioritize automation, resolve production issues by writing code, and eliminate toil by not repeating the same mistakes and by automating repetitive manual work.

SDLC to SDRLC Transformation: Reliability is a thought process that needs to be embedded in every stage of SDLC.
Figure 1: Reliability Influence and SRE Contribution in Software Development & Reliability Life Cycle (SDRLC)

Here are the ways the Site Reliability Engineering Team can collaborate with the rest of the teams:

  1. IDEA: Collaboration starts at the Idea phase, where the product and development teams need to treat the SRE team as one of the stakeholders. The project objectives need to be socialized to build awareness, and the high-level requirements walked through. At this point, the SRE team can absorb this information and internalize it so that it can contribute during the next phases of software development.
  2. PLANNING: Support and operations teams possess an abundance of knowledge about architecting systems so that they are effective to operate in production. A classic example is the architectural decision of whether to proliferate the credit card decryption algorithm to multiple micro-services, or to centralize it and provide decryption as a service with the necessary controls in place. The operations team that manages the keys for periodic encryption of credit card numbers would choose to centralize it so that the keys need not be added to every micro-service. Additionally, SRE involvement in this phase helps the team identify training needs ahead of the point when the system is ready to be supported in the production environment.
  3. DESIGN: Some of the main facets of maintaining production systems are the ability to monitor effectively, restore service reliably, change configurations, and provide data-dependent mitigations. It is important to collect input from SREs on logging, dynamic feature toggle flags/switches, and configurable properties/data during the design phase. This is also the best time to decide API signatures and responses for monitoring purposes. The SREs, in turn, can start thinking about building one-click mitigation through an admin interface that can turn off a piece of functionality or provide interim relief to the customer. SREs can run capacity models to cater to traffic needs during the peak season, and provide projections and the maximum response times allowed for API responses.
  4. BUILD: At the build phase, it is very important for the Software Development Engineers (SDEs) and SREs to collaborate and walk through the code at a high level, stepping through it in debug sessions to inspect the paths that lead to software failure. In due course, the teams can also decide on appropriate logging parameters for creating alerts and dashboards. The teams should invest time in end-to-end tracking codes and transaction IDs so that the data flow can be traced during triage and troubleshooting. SREs can also work with SDEs and DevOps Engineers to ensure the design decisions are properly implemented and the results are shared with the SREs.
  5. TEST: A major area where SREs can contribute before the software goes to production is automating data extraction and hand-picked test cases that can be run during or after the release. This lets the teams develop one-click test case automation to run every day as part of Daily Site Verification (DSV), or as needed to certify production health. These test cases can be written in Selenium, Postman, mabl, Lighthouse, or custom applications. SREs can also invest time in writing code that periodically extracts transactional or aggregated data from the database to reconstruct the customer data flow path or to track business and technology metrics in the form of reports.
  6. RELEASE: SREs play a major role in providing support during a change, when management needs adequate information about the progress of the change and its validations. These validations include, but are not limited to, the health of the application, batch job status, traffic routing decisions, application errors, HTTP response error codes, and logging inflation/deflation. The SREs have a very important responsibility to provide the sign-off for the release (again, implicit sign-off for agile projects), based on the tracked application KPIs and on the absence of customer-impacting errors caused by the release. The sign-off is a crucial step: without it, the code may have to be rolled back to the previous version.
  7. MAINTAIN: Finally, during the maintain phase, SREs have the responsibility to automate alerts and dashboards and to provide rapid response and resolution so that the service is restored and the customer and business impacts are minimized. The teams need to work on issues on a daily basis and automate the triage wherever possible. Depending on the customer impact, if an issue needs a code change, the team can fix it quickly by making an emergency code change following the hot-fix strategy. On the other hand, if the customer issue is temporarily mitigated by an alternate method and a code change is still needed to make the code fully functional, the SRE team will need to work with the product and development teams to prioritize the fix as a product backlog item in the next possible release.
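
To make the release sign-off in step 6 more concrete, here is a minimal Python sketch of how the validations listed there might be turned into an automated check. Everything in it is an assumption for illustration: the `ReleaseMetrics` fields, the `evaluate_sign_off` function, and the threshold values are hypothetical, and a real team would plug in its own KPIs and limits.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReleaseMetrics:
    """Post-release observations. All field names are hypothetical."""
    health_check_passed: bool       # application health endpoint result
    failed_batch_jobs: int          # batch jobs that failed after the release
    http_5xx_rate: float            # fraction of responses with a 5xx status
    baseline_http_5xx_rate: float   # same metric measured before the release
    log_volume_ratio: float         # post-release log volume / pre-release volume


def evaluate_sign_off(
    m: ReleaseMetrics,
    max_5xx_increase: float = 0.01,                       # assumed placeholder threshold
    log_ratio_bounds: Tuple[float, float] = (0.5, 2.0),   # assumed placeholder bounds
) -> Tuple[bool, List[str]]:
    """Return (sign_off, reasons); reasons lists every failed validation."""
    reasons: List[str] = []
    if not m.health_check_passed:
        reasons.append("application health check failed")
    if m.failed_batch_jobs > 0:
        reasons.append(f"{m.failed_batch_jobs} batch job(s) failed")
    if m.http_5xx_rate - m.baseline_http_5xx_rate > max_5xx_increase:
        reasons.append("HTTP 5xx rate rose beyond the allowed increase")
    low, high = log_ratio_bounds
    if not (low <= m.log_volume_ratio <= high):
        reasons.append("log volume inflated or deflated beyond bounds")
    # Sign off only when no validation failed.
    return (len(reasons) == 0, reasons)


if __name__ == "__main__":
    metrics = ReleaseMetrics(True, 2, 0.05, 0.001, 3.0)
    ok, reasons = evaluate_sign_off(metrics)
    print("sign-off" if ok else "consider rollback", reasons)
```

Returning the list of failed validations alongside the decision keeps the check useful for the status updates that management expects during a change, not just for the final go/no-go call.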

Third, and Most Importantly, Define the Benefits and Impact This Will Have on Your Organization

It may seem that engaging SREs early requires a larger investment up front; however, as the organization matures, that initial investment will pay for itself. The impact of applying a reliability mindset to software development is tremendous, and the contributions that SREs make to the organization are invaluable. For the transformation to succeed, make this initiative a top-down approach. Then find senior executive sponsors on both the business and technology sides and roll it out.

When done with focus and an open mindset, this will truly be the best investment an organization can make. As such, it would be apt for any organization to tread the path laid out above and transform from the Software Development Life Cycle to the Software Development & Reliability Life Cycle.

The author of this blog is Sheshaprasad Chikkatur, a results-oriented senior leader, expert advisor, consultant, and speaker with 20 years of highly successful software development, support, and people leadership experience. Mentor to many, collaborator to all, he enjoys finding simple, elegant solutions to complex problems across software, web, mobile, eCommerce, Security, and Sales & Marketing, from Strategic Planning and Roadmap Creation, through to Deployment and Support, with reliability built into the core.

Shesh Chikkatur

Shesh is a senior leader with 20+ years of Web and Mobile Software development and support experience, including 12+ years of people leadership experience.