Architecting for system reliability and uptime

How Bondora architected cloud operations with DevOps practices to ensure system reliability and uptime

Bhrikuty Aggarwal
Bondora Engineering and Data
Mar 24, 2023

Photo by Christopher Gower on Unsplash

In a fast-moving, constantly connected world, downtime can be extremely costly for organisations. Every second that a system is down can mean lost revenue, reduced productivity, and damage to brand reputation. That’s why it’s crucial to have reliable and efficient systems in place that minimise downtime and ensure business continuity. Two popular approaches to achieving this are cloud operations and DevOps practices. Cloud operations covers how workloads and infrastructure are run and managed on cloud platforms, while DevOps practices emphasise collaboration and communication between development and operations teams to automate and streamline software delivery and deployment processes.

The sections below explain how combining cloud operations and DevOps practices can result in improved system reliability, reduced downtime, and enhanced business agility.

Infrastructure as Code

Infrastructure as Code (IaC) is a DevOps practice that automates the provisioning and management of infrastructure. It allows teams to define infrastructure as code, which can be versioned, tested, and deployed in a consistent and repeatable manner. Popular IaC tools such as Terraform can help you provision and manage infrastructure across cloud platforms such as AWS, Azure, and Google Cloud Platform. By using IaC, teams can ensure that their infrastructure is consistent and predictable. This reduces the risk of errors and makes it easier to manage infrastructure at scale. It also allows teams to roll back changes quickly in the event of an issue.
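
To make the idea of "infrastructure defined as code" more concrete, here is a minimal Python sketch of the desired-state model that declarative IaC tools are built around: you describe what should exist, and the tool works out which changes are needed. This is a toy illustration, not Terraform's API or implementation, and the resource names and attributes are invented for the example.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, object]


def plan(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, resource_name) steps needed to make `actual` match `desired`."""
    steps: List[Tuple[str, str]] = []
    # Anything declared but missing must be created; anything that differs must be updated.
    for name in sorted(desired):
        if name not in actual:
            steps.append(("create", name))
        elif desired[name] != actual[name]:
            steps.append(("update", name))
    # Anything that exists but is no longer declared must be deleted.
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# Hypothetical resources, invented for this example.
desired = {
    "web-vm": {"size": "medium", "count": 3},
    "db": {"tier": "standard"},
}
actual = {
    "web-vm": {"size": "medium", "count": 2},
    "old-cache": {"tier": "basic"},
}

for action, name in plan(desired, actual):
    print(action, name)
```

Because the desired state is plain data kept in version control, rolling back amounts to re-applying an earlier version of the same definition.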

At Bondora, we have different applications and services running behind our product. We needed a standardised way of managing infrastructure and application deployments that could ensure consistency across environments and reduce the risk of configuration errors. So we chose Helm and Terraform, two popular tools in the DevOps ecosystem, which helped us simplify and automate application deployments. This combination made it easier to scale infrastructure and applications as needed. We reused Helm charts and Terraform configurations across different environments, which reduced the time and effort needed to deploy new applications or infrastructure. Both are version controlled, which made it easier to manage changes and track updates.
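
Reusing one configuration across environments usually comes down to keeping shared defaults in one place and layering a small set of per-environment overrides on top. The sketch below shows that layering in plain Python. It is an illustration of the general idea only, not how Helm or Terraform implement it, and the keys and values (`replicas`, `example-service`, and so on) are hypothetical.

```python
import copy
from typing import Any, Dict


def merge_values(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with `override` layered on top of `base`, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the base value.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical shared defaults and production overrides.
base_values = {
    "replicas": 1,
    "image": {"name": "example-service", "tag": "1.0.0"},
    "resources": {"memory": "256Mi"},
}
production_overrides = {
    "replicas": 3,
    "resources": {"memory": "1Gi"},
}

production = merge_values(base_values, production_overrides)
print(production)
```

Only the values that genuinely differ between environments live in the override, so the rest of the configuration stays identical everywhere.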

Monitoring and Alerting

Monitoring and logging platforms provide a wealth of data on system performance and usage, and DevOps practices emphasise the importance of both. By using services such as Azure Monitor, tools such as Prometheus and Grafana, and logging services such as Loki, teams can monitor system performance and identify issues before they become critical. These tools can also provide insights into usage patterns, allowing teams to optimise resource allocation and improve performance.
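
As a rough illustration of catching issues before they become critical, the following Python sketch fires an alert only when a metric stays above a threshold for several consecutive samples, so that a single spike does not page anyone. It is similar in spirit to the `for` clause of a Prometheus alerting rule, but it is a simplified stand-in rather than any tool's real implementation, and the CPU figures and threshold are made up.

```python
from typing import Sequence


def should_alert(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        # Not enough data yet to say the condition has been sustained.
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# Hypothetical CPU utilisation samples (fraction of capacity), oldest first.
cpu = [0.42, 0.95, 0.40, 0.91, 0.93, 0.97]

print(should_alert(cpu[:3], threshold=0.9, min_consecutive=3))  # one spike only
print(should_alert(cpu, threshold=0.9, min_consecutive=3))      # sustained load
```

Tuning `min_consecutive` is a trade-off: a larger value reduces noisy alerts but delays detection of a real problem.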

At Bondora, we chose a self-hosted solution for our observability stack. For more on how Bondora chooses the right tools for its observability stack, see: https://medium.com/bondora-engineering-and-data/avoiding-the-roadblocks-how-to-choose-the-right-tools-for-your-observability-stack-5aac1a39ecdb

Automated Deployment and Continuous Integration and Continuous Delivery (CI/CD)

DevOps practices focus on continuous integration and continuous delivery (CI/CD) to ensure that changes are tested and deployed quickly and reliably. Cloud platforms provide many tools to support CI/CD, such as Azure DevOps pipelines. By automating the build, test, and deployment process, teams can reduce the risk of errors and accelerate the time to market for new features.

At Bondora, our code lives in multiple repositories, which makes it challenging to keep all the components of the product in sync with each other. That’s why we implemented CI/CD: it automates the process of integrating code changes from different repositories, making it easier for developers to work on different parts of the product simultaneously. As we rely heavily on Azure, we chose Azure DevOps pipelines to build, test, and deploy our applications. It offers a range of pricing options, including a free tier for small projects and pay-as-you-go pricing for larger ones, which makes it a cost-effective solution for organisations of any size.

We tailored it to the specific needs of our organisation: we defined our own build and deployment processes and configured custom triggers and schedules. We also used the security features it offers, including role-based access control, encryption of sensitive data, and integration with an identity provider (Microsoft Azure Active Directory), which helped us manage permissions and ensure that users have the appropriate level of access to the tool and its features. We integrated it with SonarCloud (the cloud-based version of the popular SonarQube code analysis tool), which helped us identify vulnerabilities in our code. Azure DevOps pipelines provide continuous feedback throughout the development process, from code check-in to deployment. This helped our developers identify and fix issues more quickly, reducing the risk of errors and downtime.
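
The feedback loop described above can be sketched as a pipeline that runs its stages in order and stops at the first failure, so that a broken build or failing test never reaches deployment. The Python below is a conceptual sketch only. It is not Azure DevOps pipeline syntax, and the stage names and the simulated test failure are hypothetical.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order; after the first failure, mark the remaining stages as skipped."""
    results: List[Tuple[str, str]] = []
    failed = False
    for name, step in stages:
        if failed:
            results.append((name, "skipped"))
            continue
        try:
            ok = step()
        except Exception:
            # A stage that raises counts as a failure rather than crashing the pipeline.
            ok = False
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results


# Hypothetical stages: the test stage is made to fail on purpose.
stages: List[Stage] = [
    ("build", lambda: True),
    ("test", lambda: False),
    ("deploy", lambda: True),
]

for name, status in run_pipeline(stages):
    print(f"{name}: {status}")
```

Reporting a status for every stage, including skipped ones, gives developers a complete picture of how far a change got before it was stopped.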

Collaboration and Communication

Finally, DevOps practices emphasise collaboration and communication between development and operations teams. With this, teams can work more effectively together to identify and resolve issues, reducing downtime and improving system reliability.

Photo by Jason Goodman on Unsplash

At Bondora, we hold a weekly session to share important or useful lessons we have learnt, new technology ideas, new technical solutions we have used, and so on. It works like an InfoHour in which we exchange information with the engineering organisation, such as ongoing initiatives, progress, and future plans. These sessions encourage collaboration and cross-functional teamwork, which leads to new ideas and solutions that might not have been possible otherwise. By sharing knowledge and best practices, employees identified ways to streamline processes and improve productivity. This also helped build a culture of trust.

One of the most popular communication tools in our organisation is Slack. We have opened a few public Slack channels to help teams, engineering groups, and individuals within the company. For example, you could create an infrastructure support channel whose purpose is to ensure that the infrastructure is stable, secure, and available to support developers’ applications and services. There, developers can ask infrastructure-related questions about node pools (VMs), networks, and databases, report issues that impact application availability or problems in Grafana dashboards, or raise Jira tickets for provisioning new tech stacks or resources. In our experience, this led to higher job satisfaction and lower turnover among employees.

Our next milestones

One of the critical components of CloudOps and DevOps is backup and recovery, which plays a key role in helping organisations maintain the stability, reliability, and availability of their systems in the face of unexpected events.

Disaster Recovery

Cloud platforms offer a range of disaster recovery options, including backup and restore, replication, and failover. DevOps practices emphasise the importance of disaster recovery planning and testing to ensure that systems can recover quickly from failures. By combining cloud disaster recovery capabilities with DevOps practices, teams can reduce downtime and minimise the impact of system failures.
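
Testing a recovery plan means proving that a backup can actually be restored, not just that it exists. As a minimal sketch of that idea, the Python below copies a file to a backup location, restores it to a separate location, and compares checksums. It uses local temporary directories purely for illustration. A real setup would target cloud storage and databases, and the file name and contents here are invented.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    """Back up `source`, restore it elsewhere, and check the restored copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    restore_dir.mkdir(parents=True, exist_ok=True)
    backup_path = Path(shutil.copy2(source, backup_dir / source.name))
    restored_path = Path(shutil.copy2(backup_path, restore_dir / source.name))
    return sha256_of(source) == sha256_of(restored_path)


# Hypothetical data file created in a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_file = root / "data.db"
    data_file.write_bytes(b"example payload")
    ok = backup_and_verify(data_file, root / "backup", root / "restore")
    print("restore verified:", ok)
```

Running a check like this on a schedule turns recovery testing into a routine, automated step rather than something discovered during an outage.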

At Bondora, we have already implemented some disaster recovery measures for our cloud resources, but we recognise the need for a more streamlined and comprehensive process to ensure they are effective. Our goal is to strengthen these capabilities so that we can maintain business continuity and minimise the impact of a disruption on our customers and stakeholders.

Takeaways

In conclusion, the practices described above can be used together to improve system reliability and reduce downtime. By leveraging the strengths of each approach, teams can build resilient systems that can withstand major challenges. Whether you’re running a small start-up or a large enterprise, integrating cloud operations and DevOps practices can help you achieve your goals and stay ahead of the competition.

Interested in joining us?
https://www.bondora.group/careers/
