Microservices at Nubank, An Overview

This article has been migrated to our new blog. Find it at https://building.nubank.com.br/microservices-at-nubank-an-overview/

Microservices is the prevalent architectural style of the day, so much so that there isn’t much value in describing what it means or what its advantages are: there is plenty of great content out there. Instead, we are going to talk a bit about what we’ve learned growing a complex system over the past six years.

Six years is also the entire lifespan of the company, which means we started with microservices from day one, defying the standard advice of starting with a monolith. The rationale usually given is that it is best to optimize for quick pivoting in the beginning, while the startup is searching for market fit, to later refactor to a more stable structure. We found it to be true that starting with microservices, particularly in 2013, made us slower in the beginning. Putting together a complex provisioning infrastructure while building the product from scratch was a lot for a small team, and it took some time until we felt the work flowed smoothly. On the other hand, our business wasn’t very inclined to quick pivots, and those early months were probably the best time for investing in a solid foundation, rather than later when we had to face the increased pressures of scaling and building out the feature set.

That is not to say we got everything right from the start. On the contrary, we had to change many of our core abstractions as we understood the domain in more depth, which at times meant redrawing service boundaries. This learning process continues to this day: at any given time, one squad or another will be working on splitting up a service or merging two services together.

Our production infrastructure went through several iterations. We started on top of a managed Chef service, experimented with CoreOS Fleet and ECS, built a simple DIY one-container-per-VM infrastructure using core AWS EC2 abstractions and CloudFormation, and finally converged on rolling our own Kubernetes clusters. It was a long journey, but one made much less arduous by pervasive automation. All of our cloud resources have, since the very beginning, been provisioned through some kind of automated process. Even one-offs, like that livegrep instance one squad wanted to get running in no time, have to be automated. Like most things in the company, the obsession with automation is much more of a cultural norm than an imposed rule.

The latest incarnation of our automation infrastructure takes the form of a Clojure codebase that orchestrates the creation of AWS or Kubernetes resources. One important input to that process comes from a git repository where engineers collaborate to declare metadata for each microservice: what kind of database it requires, how heavy the workload is, what build and test tooling it should be put through, etc. Based on that data, our automation can do magic like provisioning databases, setting up service discovery, and even conjuring up entire build pipelines for all services.
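To make the idea of metadata-driven provisioning concrete, here is a minimal sketch. It is written in Python purely for illustration (the real codebase is Clojure), and every field name, the sizing table, and the resource strings are hypothetical rather than our actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ServiceMetadata:
    """Hypothetical per-service declaration, as it might live in a git repository."""
    name: str
    database: Optional[str]   # illustrative engine name, or None for a stateless service
    workload: str             # "light" or "heavy"; an assumed sizing hint
    test_suites: List[str]    # tooling the build should run


def plan_resources(meta: ServiceMetadata) -> List[str]:
    """Derive the resources to provision from a single declaration."""
    replicas = {"light": 2, "heavy": 6}[meta.workload]  # made-up sizing table
    plan = [
        f"deployment/{meta.name} replicas={replicas}",
        f"dns/{meta.name}",  # service discovery entry
    ]
    if meta.database is not None:
        plan.append(f"database/{meta.name} engine={meta.database}")
    # One pipeline stage per declared test suite, followed by a deploy stage.
    stages = [*meta.test_suites, "deploy"]
    plan.append(f"pipeline/{meta.name} stages={','.join(stages)}")
    return plan
```

The point of the sketch is that engineers only edit the declaration; everything else is derived from it, so adding a database or a test suite is a one-line change that goes through normal code review.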

Much of our engineering culture is connected to the functional programming community and ideas. One central idea is the concept of immutability: it’s always safer to build updated copies of data than to change it in place. When we carry the idea over to the infrastructure domain, it translates to building updated copies of resources rather than mutating them in place.

A simple application of this principle is deployment: we first spin up new containers and only then tear down the old instances. Nothing particularly interesting there; it is a pretty standard example of blue-green techniques.

A larger-scale application is to spin up new production stacks. Every now and then we have to make changes that are larger than a simple deployment. We might be tightening up security, improving service discovery, or rightsizing infrastructural elements. Regardless of the specifics, the general approach is the same: we spin up an entire production stack (including all services, Kubernetes clusters, Kafka clusters, etc.), test that it’s up and running well, then point traffic to the new stack by updating the DNS aliases of all entry points. Needless to say, this whole process is heavily automated, to the point that a small team can execute all steps every couple of months.
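The cut-over step can be sketched as a function that builds a new alias table instead of editing the old one. This is an illustration only: the hostnames, the `cut_over` function, and the `smoke_test` callback are hypothetical, and a real implementation would talk to a DNS provider rather than return a dictionary.

```python
from typing import Callable, Dict, Iterable, Mapping


def cut_over(
    aliases: Mapping[str, str],
    new_stack: str,
    entry_points: Iterable[str],
    smoke_test: Callable[[str], bool],
) -> Dict[str, str]:
    """Return a new alias table pointing every entry point at new_stack.

    The existing table is never mutated, so the previous stack stays
    addressable until the caller decides to tear it down.
    """
    # Hypothetical naming scheme for per-stack hostnames.
    targets = {ep: f"{ep}.{new_stack}.example.internal" for ep in entry_points}
    failing = [target for target in targets.values() if not smoke_test(target)]
    if failing:
        # Refuse to move any traffic if a single entry point is unhealthy.
        raise RuntimeError(f"new stack not ready: {failing}")
    return {**aliases, **targets}
```

Because the old table is left untouched, rolling back amounts to reapplying the previous value.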

Taking care of people’s finances is a charge we take quite seriously. Consequently, data integrity is of the utmost importance. Running a microservices mesh while maintaining those high standards brings new challenges: how do we ensure data is never lost in a world where partial failure and network partitions are the norm? Our answer is to rely heavily on asynchronous messaging.

Partial failure and network partitions must be embraced to guarantee reliability.

Most service-to-service interactions are mediated by a reliable, replicated messaging broker. Instead of having the client service wait for the server to finish processing and respond (subject to all sorts of failures due to load imbalances, network glitches, and the like), the initiator service reliably publishes a message to be consumed later by the next service in the flow.

Failures can still occur. Imagine that in order to process a message, the consumer service needs to make a call to a third-party service that is having stability problems. Even then we can recover and avoid data loss, by catching the error in a lower layer and automatically rerouting the message to a dead letter topic.
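The dead letter pattern described above can be sketched with a toy in-memory broker. This is a simplified illustration, not our production code: `InMemoryBroker`, `consume`, and the message shape are all hypothetical stand-ins for a replicated broker and its client library.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class InMemoryBroker:
    """Toy stand-in for a replicated broker; each topic is a FIFO queue."""

    def __init__(self) -> None:
        self.topics: Dict[str, Deque[dict]] = defaultdict(deque)

    def publish(self, topic: str, message: dict) -> None:
        self.topics[topic].append(message)


def consume(
    broker: InMemoryBroker,
    topic: str,
    handler: Callable[[dict], None],
    dead_letter_topic: str,
) -> int:
    """Drain a topic and return how many messages were handled successfully.

    A message whose handler raises is rerouted to the dead letter topic
    together with the error, so it can be inspected and replayed later.
    """
    handled = 0
    queue = broker.topics[topic]
    while queue:
        message = queue.popleft()
        try:
            handler(message)
            handled += 1
        except Exception as error:  # caught in a lower layer, below business logic
            broker.publish(dead_letter_topic, {"original": message, "error": repr(error)})
    return handled
```

The handler itself stays free of retry logic: it either succeeds or raises, and the layer underneath decides what happens to the message.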

In addition to the operational aspects that we’ve covered so far, we should consider the data needs of decision making in the company. Decision-makers, including business analysts and data scientists, depend on data once written by microservices for their models. Our goal as engineers is to offer them a stable interface to that data. This is challenging, given that data models and boundaries are constantly changing. Also, our data is sharded to handle scalability — all in all too raw and unpleasant for our analytical counterparts to work with.

A few years ago, we deployed a layer called ‘contracts’ to solve the problem. Contracts serve as the stable interface mentioned before. Using contracts, we’re able to safely expose microservice data to the analytical side of the company and reduce the risk of (silently) breaking models. Contracts are Scala objects automatically generated from the service data model. With the contracts, our Spark batch jobs transform data into tables destined to fulfill any analytical purpose downstream. To make sure that this layer consistently reflects reality (the shape of the data), we depend heavily on automated tests, both for the services and for the Spark jobs.
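As a rough illustration of the kind of automated test involved, the sketch below regenerates a contract from a service data model and reports any drift from the published version. It uses Python rather than Scala for brevity, and the attribute names, type mapping, and function names are all hypothetical.

```python
from typing import Dict, List

# Assumed mapping from service-side attribute types to analytical column types.
TYPE_MAP = {"uuid": "string", "keyword": "string", "bigdec": "decimal", "instant": "timestamp"}


def generate_contract(model: Dict[str, str]) -> Dict[str, str]:
    """Derive column names and types from a service data model."""
    return {attr.replace("/", "_"): TYPE_MAP[kind] for attr, kind in model.items()}


def contract_drift(published: Dict[str, str], model: Dict[str, str]) -> List[str]:
    """List the ways the current model breaks the published contract.

    Newly added columns are treated as compatible; removed or retyped
    columns are reported, since they would break downstream jobs.
    """
    current = generate_contract(model)
    problems = []
    for column, kind in published.items():
        if column not in current:
            problems.append(f"removed: {column}")
        elif current[column] != kind:
            problems.append(f"retyped: {column} {kind} -> {current[column]}")
    return sorted(problems)
```

A test suite would assert that `contract_drift` returns an empty list, which turns a silent downstream breakage into a failed build on the service side.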

Because of the tests, we rarely run into issues due to upstream data schema changes breaking analysis downstream. We do, however, still wrestle with reacting appropriately to changes of ‘data values’ upstream. This can happen when the definition of a piece of data changes (dangerously subtly at times) while the downstream analysis isn’t aware of it. We are currently attacking this problem from various angles: anomaly detection on table statistics (count, cardinality, etc.), improving communication between engineering and analytical stakeholders, and decoupling contracts from analytical use cases (implementing a data warehouse). It’s still far from a solved problem.
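One simple form of anomaly detection on a table statistic is a z-score check against recent history. The sketch below is a generic illustration of that technique, not a description of our detector; the function name and the threshold of three standard deviations are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's statistic if it sits far outside the recent distribution.

    history holds past values of one statistic for one table, such as the
    daily row count or the cardinality of a column.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # perfectly stable history: any change is suspicious
    return abs(today - mu) / sigma > threshold
```

A check like this catches abrupt shifts, such as a table suddenly losing half its rows, but it will not notice a slow change in what a value means, which is why it is only one of several angles.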

There is a whole lot more we can talk about. From other reliability patterns that we’ve applied over the years, to the details of our container orchestration journey. From the way that we plan for scalability with sharding to the culture of learning from outages through blameless postmortems. From our experiences introducing back-ends for our front-ends to… well, you get the picture. This is all too much for a single blog post, but stay tuned for future notes from our engineering team.
