SRE at Memo Bank

Guillaume Arnaud · Published in Memo Bank · May 12, 2023 · 7 min read

After five years building and running Memo Bank’s infrastructure, we thought it was time to take a break, look back, and talk about the role of our SRE team. It is also an opportunity to explain the name of our team: Site Reliability Engineer, or SRE. While some people are already familiar with this role, we have often noticed that its definition remains quite obscure for many.

“SRE” was popularized by Google when it published a reference guide on how its teams are organized around this topic, along with a whole set of best practices for rationalizing the reliability and support of applications running in production. But not every company is the size of Google, and in addition to adopting a good part of the SRE concepts, teams with this name generally also take on other, indirectly related topics, such as deployment tooling, security, performance, or Infrastructure as Code to drive the infrastructure. Hence the frequent confusion around the terms.

By listing the different tasks we handle on a daily basis, we could have chosen another title for our team, such as DevOps or DevSecOps. There are several reasons why we preferred SRE over these, including ensuring reliability in all its forms, applying the same practices as developers, and working on a broader field than continuous deployment or infrastructure.

Reliability

If we were to survey our engineering team’s priorities, reliability would undoubtedly come near the top of the list, and this is even more true for a bank. We could take a narrow view of this reliability by confining it to availability: “we are reliable because we are always able to respond to our customers.” While this is definitely one of our goals, it is not the only one. For example, we also need to ensure data integrity, verify our procedures in case of a major disaster, and remain performant during large traffic spikes.

Training for failures

Among our numerous IT obligations, we have to be able to quickly switch regions at our cloud provider. To make this possible, we have implemented replication between regions. The switch procedure is fully automated, and we decided to actually perform it on our integration environment every three months or so to ensure that it still works. This regular repetition lets us detect bugs each time and find ways to improve the region change.

Every two months, we switch our staging environment to check that the process is still valid.
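To make the idea concrete, here is a minimal sketch of what an automated region switch could look like, assuming an AWS primary/standby pair, an RDS read replica to promote, and a Route 53 record to repoint. The names, identifiers, and steps are placeholders for illustration, not our actual procedure.

```python
# Hypothetical sketch of a region-switch drill, not Memo Bank's actual tooling.
# Assumes a primary/standby AWS region pair, a replicated database to promote,
# and a DNS record that can be repointed at the standby region's load balancer.

import boto3

PRIMARY_REGION = "eu-west-3"       # assumption: current active region
SECONDARY_REGION = "eu-west-1"     # assumption: replicated standby region
HOSTED_ZONE_ID = "ZXXXXXXXXXXXXX"  # placeholder hosted zone

def promote_secondary_database(db_instance_id: str) -> None:
    """Promote the read replica in the standby region (illustrative only)."""
    rds = boto3.client("rds", region_name=SECONDARY_REGION)
    rds.promote_read_replica(DBInstanceIdentifier=db_instance_id)

def repoint_dns(record_name: str, new_target: str) -> None:
    """Point the service record at the standby region's load balancer."""
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": new_target}],
                },
            }]
        },
    )

def run_drill() -> None:
    promote_secondary_database("core-banking-replica")
    repoint_dns("api.example.internal", "lb.eu-west-1.example.internal")
    # A real drill would then run smoke tests and alert on any failed check.

if __name__ == "__main__":
    run_drill()
```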

Better performance for better reliability

We are also very careful about the performance of our applications and their use of resources (CPU, memory, disk). In addition to better satisfying our customers and controlling our costs, it improves our resilience: even if a service runs on two servers, when one of them stops and the other does not have enough resources to absorb its load, the survivor will probably become unavailable too. For this, we have load testing scenarios on the one hand, and on the other a very comprehensive stack of monitoring (Prometheus), tracing (Tempo), and logs. Every detail counts to ensure the reliability of an application.
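As an illustration of the kind of check this enables, here is a small sketch that queries the Prometheus HTTP API to verify that a service keeps enough headroom for one server to absorb the load of its peer. The PromQL query, job name, and threshold are assumptions made for the example.

```python
# Illustrative capacity check against the Prometheus HTTP API: can the
# remaining server absorb the full load if its peer disappears? The query,
# job name, and threshold are assumptions for the example.

import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder address

def query(promql: str) -> float:
    """Run an instant query and return the first value, or 0.0 if empty."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def has_n_plus_1_headroom(job: str, threshold: float = 0.5) -> bool:
    """True if average CPU usage stays low enough that one instance
    could take over the whole load of the other."""
    cpu = query(
        f'avg(1 - rate(node_cpu_seconds_total{{mode="idle",job="{job}"}}[5m]))'
    )
    return cpu < threshold

if __name__ == "__main__":
    if not has_n_plus_1_headroom("payment-service"):
        print("Warning: losing one server would likely overload the other")
```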

Development at the core of our practices

Apart from coding more often in Python and Terraform than in Kotlin or Elixir, our methods are the same as those of the development teams. Nothing goes into production without going through merge requests, unit tests, and proper formatting. Each merge request is automatically applied to our integration environment, and jobs also run every night to check that our infrastructure complies with certain rules, including security rules. Moreover, at Memo Bank, each project goes through a framing phase and then a building phase, organized in quarterly cycles. We take part in these phases alongside developers, but we also carry our own subjects in the same way. Generally, we have one to three major subjects per quarter, in addition to support and the elimination of toil.
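To give an idea of what such a nightly job can look like, here is a minimal sketch of one possible rule, flagging any security group that exposes SSH to the whole internet. It is an illustration under assumed names, not our actual rule set.

```python
# Minimal illustration of a nightly compliance check: scan AWS security groups
# and flag any that allow SSH from anywhere. Region and rule are assumptions.

import boto3

def find_open_ssh_groups(region: str = "eu-west-3") -> list[str]:
    """Return the IDs of security groups allowing SSH from 0.0.0.0/0."""
    ec2 = boto3.client("ec2", region_name=region)
    offenders = []
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for rule in sg.get("IpPermissions", []):
            if rule.get("FromPort") == 22 and any(
                ip_range.get("CidrIp") == "0.0.0.0/0"
                for ip_range in rule.get("IpRanges", [])
            ):
                offenders.append(sg["GroupId"])
    return offenders

if __name__ == "__main__":
    for group_id in find_open_ssh_groups():
        print(f"Security group {group_id} allows SSH from anywhere")
```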

Automation everywhere

For the configuration of the various components running in production, our automation strategy is to embed everything needed directly on the servers. A few rare exceptions aside, we do not configure anything manually. For example, a cron job on the Kafka servers launches a script that fetches the list of topics from S3 and then creates or modifies them. For Vault, the configuration of the different roles or the rotation of database passwords is done on the same principle. This has the great advantage of giving us a single, uniform method for all our changes: applying our Terraform code and triggering deployment pipelines.
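Here is a simplified sketch of what such a cron job can look like: fetch the desired topics from S3 and create any that are missing. The bucket, key, file format, and choice of the kafka-python library are assumptions made for the example.

```python
# Simplified sketch of a topic-sync cron job: read the desired topic list from
# S3 and create missing topics. Bucket, key, and file format are assumptions.

import json

import boto3
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient, NewTopic

def desired_topics(bucket: str, key: str) -> list[dict]:
    """Load the desired topic definitions from S3,
    e.g. [{"name": "payments", "partitions": 6, "replication": 3}]."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)

def sync_topics(bootstrap: str, topics: list[dict]) -> None:
    """Create any topic from the desired list that does not exist yet."""
    existing = KafkaConsumer(bootstrap_servers=bootstrap).topics()
    missing = [
        NewTopic(
            name=t["name"],
            num_partitions=t["partitions"],
            replication_factor=t["replication"],
        )
        for t in topics
        if t["name"] not in existing
    ]
    if missing:
        admin = KafkaAdminClient(bootstrap_servers=bootstrap)
        admin.create_topics(missing)
        admin.close()

if __name__ == "__main__":
    sync_topics("localhost:9092", desired_topics("infra-config", "kafka/topics.json"))
```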

Even topics that fall into our scope more out of necessity, such as onboarding employees onto our office tools and managing their VPN access, become automated, resilient processes entirely managed by Terraform, Ansible, and our Python scripts.

Vault for credential configuration and Consul for distributed locks and service discovery help us a lot in coordinating all these processes without manual action. This also allows us to keep our servers immutable, from the integration environment to production.
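As a sketch of the distributed-lock part, here is roughly how a server can use a Consul session to make sure only one instance runs a given configuration job at a time. The lock key, TTL, and local agent address are illustrative choices, not our production values.

```python
# Minimal sketch of a distributed lock built on a Consul session, so that only
# one server runs a given configuration job at a time. Key and TTL are illustrative.

import requests

CONSUL = "http://127.0.0.1:8500"  # Consul agent running locally on the server

def acquire_lock(key: str) -> str | None:
    """Create a session and try to acquire the lock key.
    Returns the session ID on success, None if another holder has it."""
    session = requests.put(f"{CONSUL}/v1/session/create", json={"TTL": "60s"}).json()["ID"]
    acquired = requests.put(f"{CONSUL}/v1/kv/{key}", params={"acquire": session}).json()
    return session if acquired else None

def release_lock(key: str, session: str) -> None:
    """Release the lock key and destroy the session."""
    requests.put(f"{CONSUL}/v1/kv/{key}", params={"release": session})
    requests.put(f"{CONSUL}/v1/session/destroy/{session}")

if __name__ == "__main__":
    session = acquire_lock("locks/kafka-topic-sync")
    if session:
        try:
            pass  # run the configuration job here
        finally:
            release_lock("locks/kafka-topic-sync", session)
    else:
        print("Another server holds the lock, skipping this run")
```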

A wide range of topics

If the DevOps movement has gained more and more momentum over the past few years, it is because we have understood that two opposing forces are at play in the development and maintenance of applications. On the one hand, we want to deploy new features to satisfy or acquire customers, while keeping an eye on the unintended consequences those features can have on our codebase. On the other hand, we want to stabilize what we already have, to capitalize on what we have learned from past mistakes and improve it. Intuitively, we associate the first objective with development teams and the second with infrastructure teams, hence the drifts and misunderstandings that arise between them. To avoid this pitfall, we have no magic solution, but we are convinced that both objectives must be carried in the same way by everyone.

Sharing accountability

This is mainly reflected in everyone’s shared accountability for deployments and production support. Our best example is our deployment ritual. At Memo Bank, we work in a highly regulated environment: for instance, a committee led by our Risk team validates new products or products that have changed substantially. So we cannot do fully continuous deployment to production, but we want a good balance between control and reactivity.

Every two weeks, we deploy a new major version of Memo Bank, which includes backend, frontend, and infrastructure deliverables, each component having its own version. These releases are deployed in pairs, with one person from the development team and one from the SRE team.

Then, between two releases, it is entirely possible to ship patches at any time on a subset of components. We sometimes have several per day, and most of the time our team does not participate at all. This mix of support and autonomy with the development teams saves us a lot of time and makes us more responsive to the various bugs that may occur.

Release management: a release with all components every two weeks and patches between releases

Less CI/CD, more time with developers

Another pitfall we wanted to avoid was focusing too much on the CI/CD part. Our system is based on AWS images that we build with Ansible and Packer and deploy with Spinnaker on AWS EC2; we use GitLab for everything else. It is not very innovative or sophisticated; it could probably be a little faster here or a little more dynamic there, but it is a system we have hardly needed to touch in four or five years. In the end, the part we have invested in the most is the tool we wrote for developers and ourselves: it builds a release with all our components, triggers deployments in pre-production and production, whether that means applying Terraform or launching Spinnaker pipelines, and grants temporary access, for example to servers or databases.
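As a rough sketch of the pipeline-triggering part, an internal tool of this kind might call Spinnaker’s Gate API for each component of a release. The endpoint shape, pipeline name, parameter names, and versions below are assumptions made for the example, not our actual tool.

```python
# Rough sketch of an internal release tool triggering one Spinnaker pipeline
# per component through the Gate API. URL, pipeline name, and parameters are
# assumptions for the example.

import requests

GATE_URL = "https://spinnaker-gate.internal"  # placeholder Gate address

def trigger_pipeline(application: str, pipeline: str, version: str) -> None:
    """Start a deployment pipeline for one component at a given version."""
    resp = requests.post(
        f"{GATE_URL}/pipelines/{application}/{pipeline}",
        json={"type": "manual", "parameters": {"version": version}},
    )
    resp.raise_for_status()

def deploy_release(release: dict[str, str]) -> None:
    """Deploy every component of a release, one pipeline per component."""
    for component, version in release.items():
        trigger_pipeline(component, "deploy-production", version)

if __name__ == "__main__":
    deploy_release({"backend": "2023.19.0", "frontend": "2023.19.2"})
```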

This saved time allows us to work on tasks that are sometimes neglected. We have already mentioned, for example, reliability and data integrity, and more recently we have worked a lot on the performance of our various components and our ability to scale up. It also leaves us more availability to support developers or to be reactive when we are on call. With the growth of Memo Bank, we may well revisit some of these choices, but in the meantime we will have accumulated enough experience on these other topics not to neglect them again.

Conclusion

Of course, as SREs, we always need to stay humble in the face of potential failures in our applications, but we are proud of what we have accomplished over the last few years, without major outages and with a small team. We are currently a team of three managing our 150 servers, our 23 databases, our CI/CD tools, our monitoring stack, and our various distributed systems such as Kafka, Consul, Vault, and Cassandra. We made it possible because we keep time for building and automating, not just running.

So, in the end, it does not matter much if we do not fully fit the strict definition of SRE. We tick a lot of boxes, but we probably overflow the scope, which is normal at our scale. We feel good about it, and it gives structure to the many projects to come.

Example of a Grafana dashboard following the RED method (Rate, Errors, Duration)
