Life after Release: Quality Approaches to Uphold Stability
When it comes to product development, the journey doesn’t end with the release of a new feature. In fact, maintaining product quality after deployment is a critical aspect that separates exceptional products from the rest.
I’m Valery Kashentsev, a backend engineer at Manychat. In my previous article, I shared the approaches we use to ensure high quality when developing our product under conditions of rapid decision-making. Today, we’ll explore how we support product quality after releasing new functionality.
Monitoring
Once our solution is available to users, we activate our monitoring tools. Rollbar, Sentry, and CloudWatch are our key tools for monitoring application health. They help us identify errors or confirm that everything is functioning as intended.
For the frontend, we use Sentry as a SaaS solution for application monitoring and bug tracking. We set up alerts to oversee key metrics related to application stability. We pay particular attention to the total number of bugs, the percentage of users affected by bugs within a selected period, and the number of sessions impacted.
At the application level, we categorize errors by type and label them with additional contextual attributes. For example, this approach lets us differentiate between an exception thrown by JS code that crashes the application and a backend response with an unsuccessful HTTP code that the UI didn’t handle.
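To make the idea of typed, labeled errors concrete, here is a minimal sketch. It is written in Python for readability rather than in frontend code, and every name in it (ErrorEvent, categorize, the tag keys) is hypothetical and not taken from our codebase or from the Sentry SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ErrorEvent:
    """Hypothetical error record; the fields are illustrative assumptions."""
    message: str
    http_status: Optional[int] = None  # set only when a backend response failed
    crashed_app: bool = False          # True when an exception took the app down
    tags: Dict[str, str] = field(default_factory=dict)


def categorize(event: ErrorEvent) -> ErrorEvent:
    """Attach an error type and contextual attributes as tags."""
    if event.http_status is not None and event.http_status >= 400:
        # A backend response with an unsuccessful HTTP code the UI did not handle.
        event.tags["error_type"] = "unhandled_http_response"
        event.tags["http_status"] = str(event.http_status)
    elif event.crashed_app:
        # An exception in client code that crashed the application.
        event.tags["error_type"] = "js_exception_crash"
    else:
        event.tags["error_type"] = "other"
    return event
```

With tags like these, alerts and dashboards can filter crashes separately from unhandled backend responses.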
Rollbar is used for backend monitoring. It enables us to trace all unhandled errors arising within the application, ranging from unexpected argument types in methods to errors originating from interactions with third-party systems or issues during database operations.
In addition, Rollbar allows us to send controlled errors. We generate events of various levels — information, warning, error, or critical — with the relevant context. This helps us identify problems and guides our decisions when fixing bugs.
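As a rough illustration of what a controlled event with a level and context might look like, the sketch below builds such an event and hands it to a sender. It deliberately does not call the Rollbar SDK: Level, build_event, and report are hypothetical helpers, and the sender is any callable you supply.

```python
from enum import Enum
from typing import Any, Callable, Dict


class Level(str, Enum):
    """The four levels mentioned above, under assumed string values."""
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


def build_event(level: Level, message: str, **context: Any) -> Dict[str, Any]:
    """Assemble a controlled event: level, message, and free-form context."""
    return {"level": level.value, "message": message, "context": dict(context)}


def report(send: Callable[[Dict[str, Any]], None],
           level: Level, message: str, **context: Any) -> Dict[str, Any]:
    """Build the event and pass it to `send` (a stand-in for the real tracker client)."""
    event = build_event(level, message, **context)
    send(event)
    return event


# Usage: collect events in a list instead of sending them anywhere.
sent = []
report(sent.append, Level.WARNING, "Retrying webhook delivery", attempt=3)
```

Keeping the sender injectable makes it easy to swap the list for a real error-tracking client in production code.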
We believe that an error without an owner can be overlooked, and subsequently, forgotten. To avoid this, we ensure every error is assigned ownership. Errors are marked either manually or automatically according to the principle of code ownership. This commitment drives us to address and rectify errors promptly.
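Automatic assignment by code ownership can be sketched as a lookup from the file where an error occurred to an owner. The paths, owner names, and the longest-prefix rule below are assumptions made for this example, not a description of our actual setup.

```python
# Hypothetical ownership map: path prefix -> owner.
OWNERS = {
    "src/live_chat/": "team-inbox",
    "src/live_chat/stats/": "team-analytics",
    "src/billing/": "team-payments",
}

# Fallback so that no error is ever left without an owner.
DEFAULT_OWNER = "manual-triage"


def owner_for(path: str) -> str:
    """Return the owner of the most specific (longest) matching prefix."""
    matches = [prefix for prefix in OWNERS if path.startswith(prefix)]
    if not matches:
        return DEFAULT_OWNER
    return OWNERS[max(matches, key=len)]
```

Errors that fall through to the default owner are the ones that would need manual assignment.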
Once a week, we hold a “bug parade” in our developers’ chat, where we share how many bugs are assigned to each person. This practice motivates every developer on the team to work down their current bug count promptly.
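A weekly summary like this boils down to counting open bugs per owner. The sketch below assumes bugs arrive as (bug id, owner) pairs; the function name and message format are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def bug_parade(bugs: Iterable[Tuple[str, str]]) -> str:
    """Format a chat message listing owners by open bug count, highest first."""
    counts = Counter(owner for _bug_id, owner in bugs)
    # Sort by count descending, then by name so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    lines = [f"{owner}: {count}" for owner, count in ranked]
    return "\n".join(["Bug parade", *lines])


message = bug_parade([("BUG-1", "alice"), ("BUG-2", "bob"), ("BUG-3", "alice")])
```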
CloudWatch enables us to track all events within our application, such as statuses, endpoint requests, processing time, queues, and custom events. Using this data, we build graphs that let us monitor how the application behaves over time, and we set alarms on them.
For example, if a specific endpoint typically receives around 100 requests per minute, a sudden absence of requests for 15 minutes triggers an alert signaling a potential issue. In this case, we send a message to a designated Slack channel, accompanied by the corresponding event details. Subsequently, our support team or duty engineers follow the protocol for handling alarms.
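The logic behind such a "silence" alarm can be sketched in a few lines. In practice this would typically be configured as an alarm in the monitoring service rather than written by hand; the function names, the endpoint path, and the message format below are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable

# The 15-minute window comes from the example above.
SILENCE_WINDOW = timedelta(minutes=15)


def is_silent(request_times: Iterable[datetime], now: datetime,
              window: timedelta = SILENCE_WINDOW) -> bool:
    """Return True if no request arrived within `window` before `now`."""
    cutoff = now - window
    return not any(cutoff < t <= now for t in request_times)


def alert_message(endpoint: str, now: datetime,
                  window: timedelta = SILENCE_WINDOW) -> str:
    """Build the text that would be posted to the alerts channel."""
    minutes = int(window.total_seconds() // 60)
    return (f"[ALARM] {endpoint}: no requests in the last {minutes} minutes "
            f"(checked at {now.isoformat()})")


now = datetime(2024, 1, 1, 12, 0)
last_seen = [now - timedelta(minutes=20)]
if is_silent(last_seen, now):
    text = alert_message("/api/messages", now)  # hypothetical endpoint
```

Alerting on the absence of traffic catches failures that never surface as errors, such as a broken upstream integration that simply stops calling the endpoint.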
Duty engineers
Occasionally, even after double-checking that all systems are running properly, something can go wrong. In such cases, our support team steps in as the first line of defense. Thanks to our comprehensive documentation, most issues can be resolved autonomously by any member of the support team who consults the available resources. Yet, even these superheroes occasionally encounter challenges that demand additional expertise. This is where our duty engineers come into play.
Each week, two teams allocate a frontend and a backend engineer to collaborate with the support team. Over seven days, these engineers monitor emergencies (alarms, monitoring systems) and resolve them. They also help the support team to investigate users’ problems.
When handling alarms, their initial focus is technical metrics and searching for the source of the issue. If necessary, they can engage with colleagues who may have expertise on the problem.
Beyond the scope of the monitoring tools, duty engineers also handle queries from the support team about atypical application behavior experienced by individual users. In most cases, these anomalies result from an unhandled combination of conditions. Acting like true detectives, duty engineers analyze the data and strive to solve the issues.
If their investigations uncover a bug, duty engineers create a task to fix it. It’s worth noting that only critical bugs are handled by duty engineers. Any non-critical issues are assigned to the product team to be addressed in order of priority during the sprint. This approach allows the product team to comprehensively examine why a particular decision was made, and come up with a new solution instead of a temporary fix.
Being on duty gives engineers the opportunity to work with functionality they may not have encountered before. This experience often helps reveal deeper problems: seemingly minor bugs can turn out to be indicators of more substantial issues.
We understand that deep diving into problems can be time-consuming, so duty engineers are not involved in sprint tasks during their duty period.
Component ownership
While Scrum empowers us with speed and flexibility, it can also blur responsibility and expertise. This becomes apparent when one team devotes two months to a specific product aspect. Afterward, another team working with this functionality struggles to understand its underlying logic and functioning principles, and doesn’t know who to ask about it.
To address this challenge, we introduced a “Component ownership” practice. In this context, a component doesn’t refer to a mere technical entity; rather, it encompasses a specific functionality.
Given that our platform helps businesses engage with their customers across multiple channels, including Instagram, Facebook Messenger, WhatsApp, and others, we can consider Live Chat as a component. It’s responsible for real-time communication between a Facebook business page and its followers. In this case, the component is not just an interface element; it represents the functionality of managing dialogues on our side and between our platform and Facebook. It also processes messages, collects statistics, and contains some other features.
Now the product architecture is partitioned into components, each overseen by a designated owner. This approach helps us to accumulate specialized expertise.
The main advantage of this practice is the consolidation of expertise, ensuring that everyone in the company knows which team member to reach out to and ask: “How is this supposed to function?” In this case, our approach is to understand the current functionality first before integrating new features, fostering seamless evolution over disruptive reconstruction.
Component owners are responsible for writing documentation, creating metrics, and working with monitoring systems to manage bug fixing. This comprehensive perspective empowers them to maintain system stability.
Over time, as owners accumulate knowledge not only of the codebase but also of emerging challenges, they start to see the bigger picture. Based on this vision, they address immediate issues promptly and strategically plan extensive refactoring if a component’s current state prevents future scalability.
Component owners don’t need deep expertise in every functional part of the component. Where they lack it, they can involve teammates with the necessary knowledge. However, relying solely on one person’s competence to maintain a component isn’t practical. This is where a dedicated team built around the component owner comes to the rescue. Teammates with other competencies also dive deep into the component’s peculiarities, which enhances their comprehension of the product’s functional and technical principles. This approach enables us to engage with our product from various perspectives, ultimately contributing to its stability.
Ongoing process
Quality is not just about ensuring user satisfaction. It also refers to development speed and excellence. The less attention we devote to monitoring and resolving bugs, the more they accumulate, making it increasingly challenging to develop new features and deliver value to users.
Our approaches bring order to the processes of addressing issues and bugs, contributing to the maintenance of high technical and product quality within our system.
Of course, reaching our current state has required more than a single iteration. Step by step, we identified problems within our processes and looked for solutions. The journey of improvement is ongoing. We strive to maintain flexibility and draw from our experiences.