Engineering at N26: a Tour of our Tech Stack and Architecture

Rafael Roman
Published in InsideN26 · 7 min read · Sep 29, 2022

The last time we posted on InsideN26 about our tech stack and architecture was back in 2018, and even though I personally feel like we’re still in March 2020, the reality is that we’re in 2022. It’s been over three years — and a lot has changed with our tech stack at N26 since then.

I also realised recently that the outside perception of our tech stack and architecture might not be accurate. I got a message on LinkedIn from someone who was curious about it:

Oh wow! I was hesitating to apply because I assumed that the codebase is mostly in Java and wasn’t very comfortable with moving back to a verbose language.

If it’s Kotlin and as you mentioned that we use microservices, I would really be interested in working at N26 🙂

I’ve been working at N26 for almost 3 years, so I’ve seen plenty of the changes first-hand. In this article, I’ll explain the steps we’ve taken in the past years to keep our stack modern, the challenges we faced, and our architecture evolution.

Me hosting a meetup we held in Berlin back in June 2022

Containers everywhere

Everything we do at N26 lives in the cloud. As we grow as a business, scaling our infrastructure and architecture becomes more and more challenging. Currently, our infrastructure manages over 9,000 containers!

Yeah, we like containers.

In the past, we used a combination of Nomad and a home-baked orchestration solution on top of SpotInst. We knew for a long time that we needed a better container orchestration solution. Kubernetes became the obvious choice — today, we manage Kubernetes clusters in all our environments. Of course, there have been some tough lessons along the way, but ultimately, it’s making our infrastructure more reliable and scalable every day.
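
To give a flavour of how an individual service plugs into this setup, here’s a minimal sketch of the kind of health endpoint a Kubernetes liveness or readiness probe can hit. It uses Ktor purely for illustration; the framework choice and the /health path are assumptions for the example, not a description of our actual services.

```kotlin
import io.ktor.server.application.*
import io.ktor.server.engine.embeddedServer
import io.ktor.server.netty.Netty
import io.ktor.server.response.respondText
import io.ktor.server.routing.get
import io.ktor.server.routing.routing

fun main() {
    // Tiny HTTP server exposing a health endpoint for Kubernetes probes.
    embeddedServer(Netty, port = 8080) {
        routing {
            // A liveness/readiness probe pointed at /health gets a plain 200 OK.
            get("/health") {
                call.respondText("OK")
            }
        }
    }.start(wait = true)
}
```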

Since 2017, we’ve employed continuous deployment to production. As we moved more and more towards automation and Infrastructure as Code (think: E2E tests, security checks, quality gates, and Monitoring as Code), the pipelines running on the homegrown solution became increasingly slow, eventually taking around 1.5 hours to complete. That was a major motivation to look for alternatives, taking us to Nomad and, later, to Kubernetes. Nowadays, over 80% of our roughly 230 microservices have been migrated to Kubernetes, which makes our deployments and jobs faster and much more stable. A typical deployment takes 30 minutes, and some services complete a full CI+CD pipeline run in under 15 minutes.

Kotlin our way in

A few years ago, we made a bet to invest in Kotlin and evolve the majority of our Java-based microservices. We’re excited to say it was a successful bet, and our backend team is happily coding in Kotlin! Of course, we’re aware there’s no silver bullet. Different problems require different solutions, so we also have some components written in Python, TypeScript, and some still in Java.

We’ve been thrilled to watch and contribute to the growth of the Kotlin community. It’s been an interesting journey over the past years, starting with a mix of Java and Kotlin services. Now, all our JVM services are bootstrapped with Kotlin from the start, and most of our Java codebase is in maintenance mode only.
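
The difference the person on LinkedIn was worried about is easiest to show with a tiny snippet. Here’s a purely illustrative example (the Transaction type is hypothetical, not from our codebase) of the kind of boilerplate a Kotlin data class removes compared to classic Java:

```kotlin
// A data class gives us equals/hashCode/toString/copy for free,
// replacing dozens of lines of hand-written boilerplate in pre-records Java.
data class Transaction(
    val id: String,
    val amountCents: Long,
    val currency: String = "EUR",
)

fun main() {
    val original = Transaction(id = "tx-123", amountCents = 499)
    // Immutability plus copy() makes "modified" values explicit and safe.
    val refund = original.copy(amountCents = -original.amountCents)
    println(refund) // Transaction(id=tx-123, amountCents=-499, currency=EUR)
}
```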

CI & CD

We’re making strides toward a more modern CI+CD setup to support our continuous deployment pipelines. We’re moving away from our Jenkins setup to a combination of GitHub Actions and Argo CD, which also creates a clearer separation between our CI and CD steps. It’s already had a positive impact on our infrastructure scalability and predictability, and it makes it easier for us to meet our strict regulatory requirements.

We’re still in the early stages of this integration, but it’s already proving to be the right decision. The migrated pipelines are faster and more stable as well, thanks to more native support for Kubernetes.

Observability

Continuous deployment to production is a great tool: it allows us to iterate quickly and be agile. At the same time, it opens us up to risk — we can make positive changes for customers in record time, but we can also break things. Inevitably, there are times when things don’t work out as planned: infrastructure issues, unnoticed bugs, or incorrect assumptions about how other parts of the system work. This is when our extensive and ever-improving observability capabilities come to the rescue. For example, we’ve set up an easy way to access logs through our ELK stack (soon moving to OpenSearch). We also rely on infrastructure and application metrics in Datadog, as well as custom business metrics. When something isn’t working as expected, we’re alerted instantly, leading to much faster reaction and incident management.
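
To make the “custom business metrics” part more concrete, here’s a minimal Kotlin sketch using Micrometer. The metric names, tags, and the choice of Micrometer itself are assumptions for the example; in a real service the registry would ship these to a backend such as Datadog rather than stay in memory.

```kotlin
import io.micrometer.core.instrument.Metrics
import io.micrometer.core.instrument.simple.SimpleMeterRegistry

fun main() {
    // SimpleMeterRegistry keeps this sketch self-contained; production code
    // would register a registry that forwards metrics to the monitoring backend.
    Metrics.addRegistry(SimpleMeterRegistry())

    // Hypothetical business metric: count successful card activations, tagged by market.
    val activations = Metrics.counter("cards.activated", "market", "DE")
    activations.increment()

    // Hypothetical latency metric for a call to an external partner.
    val partnerLatency = Metrics.timer("partner.api.latency")
    partnerLatency.record(Runnable { Thread.sleep(20) }) // stand-in for the actual call

    println("card activations recorded: ${activations.count()}")
}
```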

Another important step we took in the last couple of years was moving to a Monitoring as Code implementation, so our metrics and alerts live in the same codebase as the services they monitor. We can tweak the threshold of an alert, for instance, by opening a PR on the service that owns it.

Managing Data

Any modern company should be aware of how valuable its data is. As a highly regulated company, we also need to be extremely careful with our customer data. This creates an exciting challenge for our data team: How can we leverage our data knowledge to work in our customers’ best interests, while also being compliant with regulations and respectful of our customers’ privacy?

Our DBA team currently helps us manage over 100 PostgreSQL databases. In the last few years, we also migrated away from MySQL, making PostgreSQL our go-to RDBMS for most use cases, thanks to its robustness and versatility.

We automated the setup of new DB instances and we continue looking for further automation and process improvements. Creating a new database can now be done in a single PR — within a few minutes, it’s already available, auditable, and meeting our compliance requirements. We also operate with a small DBA team, thanks to the standardization of our database landscape, our decision to live in the cloud, and our usage of Infrastructure as Code.

Since we launched our chatbot, our usage of Machine Learning techniques has scaled as well. Although we still have many areas for improvement, I think it’s impressive to see how far we’ve come.

Scaling our communication patterns

As you might have already guessed, we’ve doubled down on using a microservices architecture. Our stack is completely distributed, with roughly 200 microservices communicating with each other every second. Anyone who has dealt with microservices (or any other distributed architecture) knows that it comes with its own challenges and trade-offs.

Our network currently handles over 3TB of data every day, with peaks reaching 10TB — all while maintaining an incredibly low TCP latency, usually below 5ms!

As amazingly fast and reliable as HTTP is, we’ve learned, along with the rest of the world, that synchronous request/response alone isn’t sustainable for a heavily distributed system. The end result could be a distributed monolith, where a simple change in architecture requires changes across several microservices, coordinated deployments, feature toggles, and so on.

We were already using more asynchronous forms of communication with AWS SQS and Kinesis streams, both of which work really well for what they do. But we were facing some limitations, including the scalability of our platform and challenges in meeting our regulatory demands. To address that, we introduced Apache Kafka to our stack. Its capabilities, availability, resilience, and speed have made it our solution of choice for most asynchronous communication.
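
To sketch what the producing side of such an asynchronous flow can look like, here’s a minimal Kotlin example using the plain Apache Kafka client. The topic name, key, and payload are made up for illustration, and real services obviously carry more configuration and proper serialization.

```kotlin
import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

fun main() {
    val props = Properties().apply {
        put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker address
        put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java.name)
        put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java.name)
        put(ProducerConfig.ACKS_CONFIG, "all")  // wait for full replication before acknowledging
    }

    KafkaProducer<String, String>(props).use { producer ->
        // Publish a domain event instead of calling the downstream service synchronously over HTTP.
        val record = ProducerRecord("transaction-events", "tx-123", """{"status":"BOOKED"}""")
        producer.send(record) { metadata, error ->
            if (error != null) println("send failed: ${error.message}")
            else println("sent to ${metadata.topic()}-${metadata.partition()}@${metadata.offset()}")
        }
        producer.flush()
    }
}
```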

In one sample case, migrating from a synchronous HTTP request to an asynchronous flow reduced the p95 latency of one of our critical endpoint calls by 50%, while doubling its throughput and reducing the load on another critical system by 66%.
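
For completeness, the consuming side of that kind of flow might look roughly like this; the group id and topic name are placeholders, not our actual configuration.

```kotlin
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

fun main() {
    val props = Properties().apply {
        put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker address
        put(ConsumerConfig.GROUP_ID_CONFIG, "card-status-consumer")     // hypothetical consumer group
        put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
        put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
    }

    KafkaConsumer<String, String>(props).use { consumer ->
        consumer.subscribe(listOf("card-status-updates"))  // hypothetical topic name
        while (true) {
            val records = consumer.poll(Duration.ofMillis(500))
            for (record in records) {
                // Each event is processed here instead of blocking an HTTP caller upstream.
                println("key=${record.key()} value=${record.value()}")
            }
        }
    }
}
```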

There’s still a long road ahead of us here, as we’re still decoupling our most tightly coupled systems, but we’re making continuous progress. Today, several of our critical processes run fully asynchronously and operate at scale, handling over 1,000 messages per second at peak times! We’re adopting a more event-driven approach, which brings a lot of interesting challenges and many conversations about how to model domains correctly.

As a bank, we introduce such new technologies gradually while we stabilize them internally and learn what they can do best for us.

There are clouds in the sky

Any experienced engineer is probably thinking by now: this is a marketing blog post to attract talent, so there must be a ton of stuff they’re not mentioning.

Indeed, there are still plenty of challenges we haven’t finished tackling. For instance, we’re still breaking down our initial monolith. It was deprecated two years ago (*cough, cough* unofficially, four years ago), and the vast majority of its features have already been migrated. There are other macroservices left over from the early days. For some specific tasks, we do use third-party software that’s built with 20-year-old technology (still in the cloud, though, because that’s where our tech belongs!).

But hey, no one wants to work in a place that has no challenges, right? Besides the technical topics we just mentioned, we have an exciting roadmap in front of us to introduce new ways of banking, investing, and managing your financial life. We’re also tackling new challenges: Have you ever thought about what happens when massive numbers of customer bank cards start to expire at the same time? How does a financial institution prevent financial crime and fraud? How can you scale a subscription-based financial business? There are tons of interesting problems to solve, and each day we’re getting better at solving them.

And if you made it this far without having to google your way through the whole article (it’s ok, I also googled a bunch of stuff while writing it), then you’re the kind of person we’re looking for! There are plenty of open positions in our engineering team and one might be your next big challenge. Come join us!
