Wise Tech Stack (2022 edition)
Two years have passed since our last tech stack post. TransferWise became Wise and Wise became a public company. We’ve opened more offices in more locations, and remote work has never been as common. We faced more regulatory scrutiny, went through multiple audits, and proved to the world and our customers that we are trustworthy and still laser-focused on helping money move without borders. We kept going, kept opening new routes and launching new products, all in the new paradigm. Most importantly, we did it without slowing down or introducing extra complexity or bureaucracy to our processes. In this blog post we’ll share how we achieved it.
Product: Backend Microservices, Frontend, mobile and API
Our product is at the core of what we do. It roughly consists of customer-facing frontend and mobile applications, and complex backend machinery that works under the hood to support our mission of money without borders: instant, convenient, transparent and eventually free. It is not a small task. Luckily, we have more than 700 engineers forming approximately 100 teams that are very good at solving the problems that arise. Our stack follows that structure. We use microservices in the backend, and our frontend and mobile apps are also modularised into small chunks owned by specific teams.
Most of our more than 600 backend services run on the JVM, use Spring Boot and are written in Java or, sometimes, Kotlin. Early on we decided to encourage anyone to contribute to any service we have. To achieve that we needed to lower the barrier to entry, so a common language as a foundation worked really well. Sticking to an industry standard allows us to tap into a wide and diverse pool of engineers who can join us on our mission.
A common, compatible runtime also enabled us to work on standardisation and to build tools and libraries that support each engineer: standardised observability, security, service-to-service communication and more. We abstracted away a lot of the underlying infrastructure, so our product engineers can focus more of their attention on building things for customers. We ask our developers to write automated tests, focusing on the quality of tests rather than perfect coverage. Developers can easily spin up a separate dev environment for experiments and testing.
For our frontend stack, we took it a step further. We have a single framework, built on top of Next.js, which is currently the backbone of all customer-facing applications. We call it Crab. We touched on it in the previous post, but it has evolved since. It’s now written entirely in TypeScript and has more features than ever. The framework was built to abstract away a lot of common functionality and infrastructure-related complexity, so that product engineers can focus on creating new products and features with as little extra code as possible. It has observability, tracing, analytics, A/B testing support and authentication baked into it, amongst other features. Any changes to the framework are incorporated into all our apps as soon as we bump the version of the framework they use, and with the help of Renovate bot it’s easy to keep our package dependencies up to date.
Frontend at Wise is also a good example of our engineering culture. Crab is maintained by the Web guild, which is made up of engineers from different product teams. The guild actively works on setting standards for our engineers, the things we don’t want to compromise on. Guild members identify opportunities where the framework can be improved and form working groups that focus on tackling the pinpointed issues. These initiatives represent our core value of “We get it done”: no matter what team you’re in, if you have the expertise and willingness, your help is always welcome. Engineers have a lot of freedom to try out new tools and cutting-edge technologies, and if those work out for us, engineers have the power to drive adoption across the organisation by starting with a simple proposal.
We have two native mobile apps: one for Android and one for iOS. Working on them comes with its own unique challenges.
Mobile releases are known for being relatively slow. With the backend, we do approximately 700 releases a week; with the mobile apps, we currently do weekly releases. Even if most of it is automated, we just can’t speed up the review and acceptance process of the app stores, and asking customers to update their apps too often would be ridiculous.
We also don’t want the experience to differ too much between the mobile apps and the web frontend.
So we built our own framework, called Dynamic Forms, that allows us to implement business logic once and then render it natively on devices and on the web. It is data-driven and declarative, and it can be changed from the backend without any modification to the mobile application code.
This keeps our mobile teams small and sharply focused on improving the core app experience, and lets our large pool of backend developers build cross-platform frontends, whether that is business logic such as the “recipient creation” flow or more specialised code responsible for the photo and video capture used for customer verification.
And if it still acts and feels like native code, that’s because it is native code and uses native widgets and dialogs.
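To illustrate the data-driven idea, here is a minimal sketch. It is not the real Dynamic Forms schema or code: the JSON keys, the field types and the widget mapping are all made up for the example. The point it shows is that the backend serves a declarative description and each client maps it to its own native widgets.

```python
import json

# Hypothetical form description, as a backend might serve it as JSON.
# The schema (keys, field types) is invented for this sketch.
FORM_JSON = """
{
  "title": "Add recipient",
  "fields": [
    {"key": "name", "type": "text", "label": "Full name", "required": true},
    {"key": "currency", "type": "select", "label": "Currency", "options": ["EUR", "GBP"]}
  ]
}
"""

# Each client maps the same abstract field types to its own native widgets.
# The widget names are illustrative strings only.
WIDGETS = {
    "android": {"text": "TextField", "select": "DropdownMenu"},
    "ios": {"text": "TextField", "select": "Picker"},
    "web": {"text": "<input>", "select": "<select>"},
}


def render(form: dict, platform: str) -> list[str]:
    """Return one line per field naming the widget the client would create."""
    widgets = WIDGETS[platform]
    lines = []
    for field in form["fields"]:
        widget = widgets[field["type"]]
        suffix = " *" if field.get("required") else ""
        lines.append(f'{widget}: {field["label"]}{suffix}')
    return lines


form = json.loads(FORM_JSON)
for platform in WIDGETS:
    print(platform, render(form, platform))
```

Adding a field to the JSON changes what every client renders without touching the `render` function, which is the property that lets a flow change from the backend without an app release.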
As for the apps themselves, the iOS app is now written entirely in Swift and fully modularised; we moved from UIKit to SwiftUI, and we use Combine. The Android app is similarly modularised and uses Kotlin as its main language (we moved from 75% to 97% since the last blog post). It still uses the MVVM architecture, RxJava has been replaced by coroutines and/or Flow, and we now use Jetpack Compose for UI. We do machine learning on mobile devices to drastically speed up some document pre-validation checks, save the round trip to the backend and thus give customers a better experience. More about it in the ML section below.
We also work hard on our internal design library that allows for some very cool things that I’m not allowed to talk about :)
You might notice a common trend here: we focus very hard on keeping a startup-like, fast and ruthless release velocity, and we give engineers a lot of freedom, impact and agency over their work.
We pair it with strong standardisation to make sure we can do those things in a scalable, reliable, observable and compliant way. The next section covers the ways we handle data at Wise, a huge topic by itself, so keep reading.
A lot of what we do at Wise is actually data processing. At the scale we operate at, lots of data gets produced or collected and needs to be processed, stored and retrieved, and the numbers are only growing. We have both RDS databases and database instances running on EC2. Kafka is used for asynchronous messaging between our services, for streaming and aggregations, and for log collection. We use data for decision-making: our data professionals are given access to a number of tools to analyse and visualise data at Wise. Last but not least, the data we collect is used for our ever-growing machine learning stack. So, let’s drill down here.
For the services we mostly use three database technologies: MariaDB, Postgres and MongoDB. Teams decide which one suits their needs best. RDS helps with backups and database provisioning, but we have also automated a fair deal of manual toil since the last post. We now have a self-service tool that allows product teams to provision an RDS database, so they don’t need to wait for DBAs. It also allows some automatic upgrades, and exposes explain plans, schemas and more. Whenever we need to handle heavier load, we prefer provisioning our own database instances on EC2. We achieve high availability by using Patroni (for Postgres) and Orchestrator (for MariaDB) to manage our clusters. Database discovery for EC2 is done via Envoy, same as service discovery.
Observability and performance insights are very important for us at Wise, so we researched and tried a range of technologies and solutions. For now, we use Percona Monitoring and Management to expose query performance data to product teams, along with CloudWatch alerts and AWS Performance Insights.
In addition to databases, we have a mature messaging and streaming infrastructure. We produce and process billions of messages a day. We recently moved our Kafka clusters to Kubernetes; you can read more about it in a two-part post (part 1, part 2). We keep the complete history of Wise data in compacted Kafka topics and use that data to calculate different aggregations. We have our own streaming engine and a custom DSL to work with it. It allows us to choose between Kafka Streams and Flink, depending on the needs. We have in-house tools to ingest data from databases into the streaming Kafka cluster, which gives us real-time aggregations. Services use them to make business decisions faster, without the need to perform heavy database queries.
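As a rough illustration of why compacted topics work well as a source for aggregations, here is a toy sketch in plain Python. It does not use Kafka or our streaming engine, and the event shape, keys and aggregation are invented for the example. It models the end state log compaction converges to (the most recent value per key) and computes an aggregate from it.

```python
from collections import defaultdict

# Hypothetical change events as (key, value) pairs, oldest first.
# Keys are transfer ids; values are the latest known state of that transfer.
events = [
    ("t1", {"currency": "EUR", "amount": 100, "status": "created"}),
    ("t2", {"currency": "GBP", "amount": 50, "status": "created"}),
    ("t1", {"currency": "EUR", "amount": 100, "status": "completed"}),
    ("t3", {"currency": "EUR", "amount": 30, "status": "completed"}),
]


def compact(records):
    """Keep only the most recent value per key, as a compacted topic eventually does."""
    latest = {}
    for key, value in records:
        latest[key] = value  # a later record for the same key replaces the earlier one
    return latest


def completed_volume(state):
    """Sum the amounts of completed transfers per currency."""
    totals = defaultdict(int)
    for value in state.values():
        if value["status"] == "completed":
            totals[value["currency"]] += value["amount"]
    return dict(totals)


state = compact(events)
print(completed_volume(state))
```

A service that keeps such an aggregate up to date as events arrive can answer "how much was completed in EUR?" from memory instead of running a heavy database query.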
We have very strong in-house knowledge of Kafka and streaming, which lets us use those tools to the maximum and also contribute back to those projects.
Our Kafka team is one of those teams that fully manage their own vertical slice of infrastructure, which gives them total control over everything, from where their software runs to how it is presented to users. They are also free to choose their own technologies and make large changes. For example, right now they are looking for a way to make data easily discoverable across Wise, using graph processing and a distributed query engine, Trino, to do so.
While they work on unifying that access, a lot of data is already available for those who need it. Data scientists and analysts at Wise use the data we keep to make data-driven decisions and to build machine learning models used by product teams. We use Snowflake as a data warehouse and it works wonderfully for us. It stores most of the company’s data and presents analysts with a number of views based on their access level. We replicate data from our production databases, Kafka, S3 and more using PipelineWise, an in-house tool built specifically for that purpose. We made it open source, so feel free to check it out.
Complex reports are built regularly using Airflow, and some data is queryable directly from Snowflake. For visualisation, we use Looker and Superset, depending on the needs and on the engineer’s familiarity with the tools.
The machine learning stack at Wise is relatively young, but it’s growing and maturing at breakneck speed. We do data exploration and model training using AWS SageMaker, as well as on EC2 instances. We also make heavy use of Spark and H2O on EMR. Re-training, data gathering and cleaning are orchestrated by Apache Airflow. One thing we used to do was train models using data from our analytical data warehouse before testing them in production. An ongoing project bridges that gap, allowing us to use production data directly to build machine learning models. This ensures better data quality and therefore better models. Moreover, a model’s time to production is now significantly shorter. Unsurprisingly, it’s built on top of our strong data streaming foundation, using Kafka and Apache Flink.
Models at Wise are hosted by a prediction service built in-house. We made lots of optimisations in it to achieve near real-time performance from our models, so that customers will barely notice the checks we do.
In some places we would like the ML outcome to be even more responsive, so we brought those machine learning capabilities to end-user devices, using TensorFlow Lite to do document checks. This lets us filter out images that would fail the full check anyway, at a rate of 10–20 frames per second, meaning customers will always send documents that are legible and of the correct type.
Analytics and ML are the areas at Wise where we actively use Python. Our data professionals use PyCharm, Jupyter and Zeppelin notebooks in their daily jobs.
Now that we’ve covered the data part of Wise, let’s talk about the DevOps side: how we build, deploy, run and observe things.
Everything we talked about has to be built and deployed somewhere. While some teams fully own their stack, most use a common foundation provided by our Platform teams. We are a financial organisation and thus we routinely pass different audits. All code changes have to pass automated tests and go through a number of automated checks, ensuring code and documentation quality, a proper review process and more. This might sound obvious, as well as boring and bureaucratic, but we wouldn’t be Wise if we let it slow us down. We work really hard to make it as frictionless as possible, enabling us to do numerous daily releases and to reduce lead time so changes are made quickly.
Our services are deployed to Kubernetes clusters, which are in turn provisioned in the AWS cloud. Each service container runs alongside a number of sidecars: we use Envoy as a service mesh, we have daemons for log collection, and so on. We make it simple for product teams to configure their deployments by introducing an abstraction over Kubernetes manifests and exposing only the fraction of configs they need. We use Helm in the background to fill in the gaps, provide sensible defaults and abstract away infrastructure knowledge (and access!) from the majority of product engineers, letting them focus on their code only.
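The "sensible defaults plus a small team-owned config" idea can be sketched in a few lines. This is an illustration only, not our actual tooling: the key names and default values are hypothetical, and the merge is a generic recursive override similar in spirit to how Helm layers user-supplied values over chart defaults.

```python
import copy

# Hypothetical platform defaults; key names are invented for this sketch.
PLATFORM_DEFAULTS = {
    "replicas": 2,
    "resources": {"cpu": "500m", "memory": "512Mi"},
    "sidecars": ["envoy", "log-collector"],
}


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults with overrides applied; nested dicts merge key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# What a product team would write: only the settings they care about.
team_config = {"replicas": 4, "resources": {"memory": "1Gi"}}

values = deep_merge(PLATFORM_DEFAULTS, team_config)
print(values)
```

The team overrides two settings and inherits everything else, including the sidecar list, without having to know it exists.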
We moved from CircleCI to GitHub Actions. That way we have full control over the infrastructure where our products are built. We use self-hosted runners in Kubernetes for running build commands and GitHub Actions as the control plane.
We also now provide a uniform experience where engineers can do more on GitHub, without using another CI tool.
Lastly, that move helped us reduce costs, leading to better prices for our customers.
We recently published a big two-part post about the state of our CI/CD; you can read more about it here (part 1, part 2).
For some Java services we auto-generate the CI workflows from Gradle, abstracting one more piece away and speeding up development.
Build systems themselves differ by platform: for Java we mostly use Gradle, and for Node.js we use yarn or npm. We are trialling pnpm and might be moving there soon. Our pnpm builds tend to be significantly faster, reducing the lead time of changes. Moreover, it gives us a better understanding of the dependencies we have (pnpm gives more context on missing peer dependencies). For Java and Gradle we’re excited to see a dependency locking feature implemented, and we are working on unifying many parts of the build setup with the help of custom in-house plugins.
CI is integrated with vulnerability scanning. We use Trivy for scanning and DefectDojo for aggregating that data, and then bring that information to the attention of the owning team. We make sure those vulnerabilities are fixed promptly, making our code as safe as possible. We recently published an in-depth article about the state of Application Security (part 1 and part 2) and it’s well worth the read.
We recently created our own developer portal, built on Spotify’s Backstage. We automatically collect documentation, using TechDocs, along with service-specific meta-information, to display service ownership, team structure and observability shortcuts in the developer portal, and more is definitely coming.
Speaking of observability: as we already covered, every part of our product (services, databases, mobile apps and more) is fully observable, so we can notice a problem if it occurs. That’s also data that needs to be collected, stored and made accessible. Lots of data, in fact: our apps produce ~11 billion log lines a day, amounting to 9 TB of log data a day. To collect that data, we use Logstash, Promtail and Fluentd. We use Kafka as an intermediary to deliver those logs where they need to be: to an Elasticsearch cluster and to Loki. As before, logs are collected and indexed in the Elasticsearch cluster and made accessible to engineers via Kibana. But recently we also started using Loki, as it allows us to have longer retention for logs accessible to engineers at a fraction of the cost we’d pay if the same logs were in Elasticsearch. Plus, engineers can now access logs via CLI, as well as execute long-running (up to days) async log extraction queries.
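To put those log volumes in perspective, here is a quick back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and an even spread over the day, which real traffic won’t have, so treat the results as rough averages rather than measured figures.

```python
# Figures quoted above; everything derived from them is an average.
LOG_LINES_PER_DAY = 11e9      # ~11 billion log lines a day
LOG_BYTES_PER_DAY = 9e12      # 9 TB a day, assuming 1 TB = 10**12 bytes
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_line = LOG_BYTES_PER_DAY / LOG_LINES_PER_DAY
lines_per_second = LOG_LINES_PER_DAY / SECONDS_PER_DAY
mb_per_second = LOG_BYTES_PER_DAY / SECONDS_PER_DAY / 1e6

print(f"average line size:  {bytes_per_line:.0f} bytes")
print(f"average line rate:  {lines_per_second:,.0f} lines/s")
print(f"average throughput: {mb_per_second:.1f} MB/s")
```

That works out to roughly 800 bytes per line, a bit over 127,000 lines per second and around 104 MB per second on average, which is why storage cost and retention matter so much for the choice of log backend.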
In addition to the logs, we collect traces and metrics. We are currently moving tracing from Jaeger to OpenTelemetry. We use Grafana Tempo for storing those traces. Metrics still go via Prometheus to Thanos. Traces, metrics and now logs are available via Grafana dashboards, giving us a centralised observability experience. Lastly, we use Alertmanager capabilities as well as Grafana alerts, depending on what kind of alerting we need. If something critical is happening, the owning team will get a page via Splunk On-Call (formerly VictorOps).
We use a mix of industry-standard tech and cutting-edge experimental stuff. We go wide and deep, and touch a lot of aspects of software engineering and heterogeneous technologies. We double down on things that prove to work for us, and we test out and integrate new tools when they are a better fit. We are rigorous in testing and strong on automation and standardisation where it makes sense. We focus on the product, always keeping the customer in mind. And finally, we are a big distributed team, and we are only getting stronger as we go.
Even this article is a team effort! Big thanks to the people who told me about all the awesome things their teams do:
Adriano Stricca for the help with the Frontend section; Forrest Pangborn, Javier Laguna Soriano and Eddie Woodley for mobile; Alicia Berry and Ben Mildren for the help with the database section; Urtzi Urdapilleta Roy for answering my questions on Streaming and Kafka; Amol Gupta and Mark Harley for the intro to our ML stack; Ervin Lumberg for the state of CI; Toomas Ormisson for the deep dive into our observability stack; every other person who read this and shared their feedback; and all the other people who make Wise happen and are not shy to share what we do!
Lastly, a big thank you for reading till the end!