The TransferWise stack, 2020 edition
Over three years ago Alvar published the 2016 take on our tech stack. A lot has happened since then. The TransferWise community has grown at the speed of light from a million users in 2016 to over 7 million users today. Our customers now transfer over 4 billion pounds a month, saving over a billion pounds a year versus the banks.
The engineering organisation has grown too, from 120 to over 400 engineers in dozens of teams, all working together to achieve our mission. The number of services powering TransferWise has grown from 50 in 2016 to over 250 today, and on a normal working day over 120 production deployments happen, compared to around 30 back in 2016. That growth has meant our stack and infrastructure have had to evolve, and we’ve changed a lot in three years.
This is the first in a series of posts about the inner workings of TransferWise. We’re going to deep dive into our engine and look at how far we’ve come in the last three years. We’ll look at the programming languages, frameworks and technologies we work with and contribute to.
Later posts will focus on more specific areas: platform and infrastructure, our approach to front end development, observability and more.
Platform and SRE — embracing the cloud
Back in 2017, it was clear we were outgrowing our on-premises datacenter racks, and it was time to grow the TransferWise platform. Our business is not building datacenters, and we’d rather focus on what runs in them. The elasticity, dynamism and varying degrees of managed services that the cloud offers accelerated this decision. As we broke our monolith into microservices, our existing infrastructure could not scale out quickly enough. And just like that, the decision was made and the migration began. Today, almost all of our workloads are running in AWS, and we’re on track to migrate or decommission the remaining parts in the datacenter this year.
We went with AWS for several reasons:
- It was (and still is) the most mature and feature-rich cloud provider, and both vendor and open source tooling make our lives a lot easier;
- AWS knowledge was already widespread within TransferWise, as well as in the broader industry;
- We already had a good working relationship with them.
That said, most of our platform is based on open source software, so the choice of cloud provider isn’t set in stone. The option to switch, or even to use multiple providers, is a useful feature.
All our AWS infrastructure is described as code in Terraform. We use resource tagging heavily for reporting the costs back to teams.
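To illustrate how tagging makes per-team cost reporting possible, here is a minimal sketch in Python. It is not our actual reporting tooling: the line items, the "team" tag key and the function name are all made up for the example, and real billing data would come from the cloud provider’s cost export rather than a hard-coded list.

```python
from collections import defaultdict

# Hypothetical billing line items: (resource id, resource tags, monthly cost).
# Both the ids and the tag keys are invented for this example.
LINE_ITEMS = [
    ("i-0a1", {"team": "payments", "service": "transfer-api"}, 120.0),
    ("db-7f2", {"team": "payments", "service": "ledger-db"}, 310.5),
    ("i-9c3", {"team": "onboarding", "service": "signup-web"}, 75.25),
    ("vol-44", {}, 12.0),  # a resource nobody tagged
]


def cost_by_tag(line_items, tag_key="team", untagged="UNTAGGED"):
    """Sum costs per value of one tag; untagged resources get their own bucket."""
    totals = defaultdict(float)
    for _resource_id, tags, cost in line_items:
        totals[tags.get(tag_key, untagged)] += cost
    return dict(totals)


if __name__ == "__main__":
    for team, total in sorted(cost_by_tag(LINE_ITEMS).items()):
        print(f"{team}: {total:.2f}")
```

Keeping an explicit bucket for untagged resources is a deliberate choice in this sketch: it makes gaps in tagging visible instead of silently dropping their cost from the report.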
CloudFlare sits on the edge, serving the website and the API to the end users and partners, as well as providing attack mitigation and CDN functionality.
The services run in our Kubernetes clusters, which we spin up using our home-baked AMIs. Jose talked about the setup and lessons we learned migrating to k8s in great detail at GOTO Amsterdam last year and his slides provide a lot of useful insight.
For the database layer, we use PostgreSQL and MariaDB, selecting the engine based on the task. Mongo is our choice for NoSQL solutions. Most of the databases run on RDS, which allows us to automate stuff like backups and multi-availability zone deployments, whilst maintaining compatibility. Highly available EC2-based clusters are set up when RDS is too limited for a particular use case.
Envoy handles the service mesh layer, providing a transparent way for services to reach other services.
We use Kafka for messaging. It processes several thousand messages per second, even if we exclude the logging pipeline, which uses Kafka for log shipping.
Backend: microservices galore
Most services are written in Java, although we have production code in Kotlin, Go, Python and NodeJS. We’ve also open sourced a lot of libraries written by us to make our lives easier, like tw-tasks-executor, a framework for executing asynchronous jobs in a distributed environment with consistency guarantees.
We’ve also recently introduced a unified service template. So, when starting a new service you get all the basics like logging, the aforementioned task executor, monitoring, metrics and error reporting out of the box. It’s no longer using Spring Initializr, but lives in a maintained git repository you can base your new service on, thanks to GitHub’s excellent template repository feature.
We’re ramping up adoption of our homegrown service-to-service communication framework. With it, all the requests flowing between our services will have priorities, deadlines and idempotency flags set. This will allow services to prioritise critical traffic and shed or postpone the less critical requests. With idempotency controls, requests will be automatically retried in all configured cases, not only when a network issue or another problem occurs before the request has been sent, which is what we already do by default.
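To make the idea of priorities, deadlines and idempotency flags more concrete, here is a rough Python sketch. It is not the framework’s API: the class, the function names and the load thresholds are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; a lower value means more important."""
    CRITICAL = 0
    NORMAL = 1
    BACKGROUND = 2


@dataclass(frozen=True)
class RequestMeta:
    """Metadata assumed to travel with every service-to-service request."""
    priority: Priority
    deadline: float   # absolute time (seconds) after which the caller gives up
    idempotent: bool  # True if repeating the request is safe


def should_accept(meta, load, now):
    """Server side: process the request or shed it.

    `load` is the fraction of capacity in use (0.0 to 1.0). The 0.7 and 0.9
    thresholds are arbitrary values chosen for this sketch.
    """
    if now >= meta.deadline:
        return False  # the caller has already given up
    if load >= 0.9:
        return meta.priority == Priority.CRITICAL
    if load >= 0.7:
        return meta.priority <= Priority.NORMAL
    return True


def should_retry(meta, request_was_sent, now):
    """Client side: is it safe and still useful to retry a failed request?"""
    if now >= meta.deadline:
        return False  # retrying past the deadline only adds load
    if not request_was_sent:
        return True   # the server never saw it, so a retry is always safe
    return meta.idempotent  # otherwise only idempotent requests are retried
```

In this sketch a non-idempotent request is retried only when it never left the client, which matches the default behaviour described above, while a request flagged as idempotent can also be retried after a failure mid-flight, as long as its deadline has not passed.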
Some of our earlier tech bets didn’t pay off: we got rid of Eureka and Zuul (replaced by Envoy), as well as Spring Config Server (replaced by Kubernetes manifests and sealed secrets).
Our Grails monolith app is still there, but has reduced in size. Getting rid of it isn’t a priority, so we’ll just retire it when it becomes naturally irrelevant in future. In the meantime, we’re making sure our standard toolset for development and deployment works consistently with any service, small or large.
Frontend: let the 🦀 do the work
Speaking of bets that didn’t pay off, we worked with AngularJS a lot back in 2016 but rarely use it now. Most of our frontend code, which is several dozen separate small apps, is now powered by React. We’ve also developed a tool called Crab (an acronym for Create React Apps Better), which helps create, develop and deploy a React app with a NextJS/NodeJS backend. We think it’s awesome, so we’re planning to open source it soon.
And all our frontend apps use our unified design system, thanks to regular collaboration with our designers.
Mobile apps: iterative evolution
Our iOS and Android apps are fully native as we believe it allows us to build the best-performing experience for our customers. Today, more customers use TransferWise from their mobile phones than from their browsers, so a first-class mobile experience is a must.
The iOS app went through several tech and UX iterations: from a stock Objective-C based MVP back in the early days to the current app, which is based on modularized Swift and custom UI components (lightweight wrappers around UIKit). We also have our own lightweight networking stack based on NSURLSession, and a similarly lightweight layer on top of Core Data.
All new code in the Android app is written in Kotlin, which runs rings around Android’s Java version. Using Kotlin throughout the stack allows us to develop consistent APIs and create clean integrations with other core libraries, such as Retrofit and Room. The RxJava code is being migrated to Kotlin Coroutines as well. The current reactive MVVM architecture has replaced the reactive MVP architecture from back in the day, and the overall codebase is now around 75% Kotlin. We also have a custom design and UI component library we use throughout.
Deployments, observability, analytics and security
Our code lives on GitHub and we use CircleCI as our general purpose CI/CD solution, as well as Bitrise for the mobile apps. The artifacts built, like JARs and Docker images, go to our highly available Artifactory instance. The deployments are powered by our in-house tooling.
All metrics — platform, service, business-level ones — are gathered with Prometheus, stored in a Thanos store and displayed in Grafana on a number of dashboards. AlertManager is used for alerts, which trigger pages in VictorOps for relevant teams. We believe in knowing what’s going on inside our running code at all times, so our observability stack is one of the most important parts of the engine.
In terms of product analytics, the relevant data in our databases is replicated to a Snowflake instance in near real time, with any personal information stripped along the way, using our own open source PipelineWise. We use Looker to query and visualise the data, and it isn’t just for the analysts or product people: out of 2200+ people working at TransferWise, almost a third of us use Looker every day to make data-driven decisions.
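As an illustration of what stripping personal information during replication can look like, here is a small Python sketch. It does not use PipelineWise’s configuration or API; the column names, the rule names and the function are hypothetical, and it only shows the general idea of applying a per-column rule to each row before it reaches the analytics store.

```python
import hashlib

# Hypothetical per-column rules: drop the column, null it out, or replace the
# value with a SHA-256 digest so rows can still be joined on it.
RULES = {"email": "hash", "full_name": "drop", "date_of_birth": "null"}


def strip_personal_data(row, rules=RULES):
    """Return a copy of `row` with the configured columns sanitised."""
    cleaned = {}
    for column, value in row.items():
        rule = rules.get(column)
        if rule == "drop":
            continue  # the column never reaches the analytics store
        if rule == "null":
            cleaned[column] = None
        elif rule == "hash" and value is not None:
            digest = hashlib.sha256(str(value).encode("utf-8"))
            cleaned[column] = digest.hexdigest()
        else:
            cleaned[column] = value  # no rule for this column: pass through
    return cleaned


if __name__ == "__main__":
    row = {"id": 42, "email": "a@example.com", "full_name": "Ada", "amount": 100}
    print(strip_personal_data(row))
```

Note that an unsalted hash is pseudonymisation rather than full anonymisation, so a real pipeline would need to pick its rules per column with that in mind.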
The future outlook
Our datacenter-to-cloud migration took us a while, because a lot of other improvements were happening at the same time. From day one, we decided not to do a simple lift-and-shift of the datacenter virtual machines to EC2 instances. Instead we’ve worked on the new TransferWise architecture: robust, scalable and fault tolerant, designed from the ground up to take advantage of cloud capabilities. Even once we’re finally done with the migration, the work won’t stop. There’s still a lot more to do to make TransferWise even faster, safer and more convenient for all of our customers, as well as partners and the developers using our API.
Another thing we’re working on these days is a cross-team disaster recovery exercise. While disasters (like a whole datacenter being gone) are, by definition, improbable, we want to have plans and playbooks for quickly failing over to a backup region and restoring service, no matter what has happened. As this is a large effort spanning multiple teams across the organisation, it’s also an excellent opportunity for everyone involved to learn more about our platform and to get involved with parts they wouldn’t normally work with.
P.S. Interested in joining us? We’re hiring. Check out our open Engineering roles.