Migrating a gigantic financial system to 20,000 pods in the cloud
Who are we?
It’s the year 2019. Two software developers from a Polish company called VirtusLab are unaware of what’s in store for them. That’s us.
We are consultants for one of the largest investment banks in the world; we are part of a team of around 30 developers that maintains and develops a core data-storage solution. We were assigned to a smaller sub-team that was given the task of migrating this system to the cloud: not only our applications, but also components that are responsible for running market calculations and storing the results; infrastructure like Zookeeper and Kafka, and integration with our non-cloud infrastructure. The initial requirements were quite straightforward: build docker images, write Helm Charts, write Terraform code, and deploy on Kubernetes. Well… it turned out there was much much more to it.
This article is written from the perspective of simple developers working in DevOps culture.
Precursors to technology, once
Since we were working for a bank, you can expect that there were many archaic or custom tools that were written either because the technology did not exist or external tools were not allowed. Over time, multiple solutions had been created, like a custom configuration-management system, numerous cache implementations, communication libraries, and even a private cloud implementation! It’s no different in the case of dependency-resolution mechanisms. If you’re familiar with the JVM ecosystem, you may have heard of Ivy, a package manager that started as a subproject of Ant, the father of Java build tools. Even more complete build tools have dropped their support for Ivy in more recent releases.
One image to rule them all
The first problem we encountered was to build necessary Docker images. You may think it’s easy, but nothing is easy in huge corporations, especially banks. Anyway, JVM applications weren’t ever a perfect fit for containers. Initially, JVM wasn’t even able to properly calculate a maximum heap size, but this issue was resolved in the end. Still, Java applications are usually quite heavy, unless you’re using GraalVM, for example. Our system is really heavy. We had to deal with a massive amount of dependencies being resolved by Ivy using a shared file system like NFS. The images needed to be self-contained, as they were supposed to run in the cloud. We couldn’t just mount our shared drive at run-time. Most of our dependencies were taken from JAR manifests, and some of the native libraries had to be resolved using strace. The issue is that most of the libraries were produced from our monolithic repository, which constantly changes, and the dependency list itself is huge — and we mean HUGE! Once we had our dependencies, we could build an image. Sadly, we ended up in a situation where there were no proper layers (with parts that don’t change too often) and an image size of around 30 gigabytes! It had to be rebuilt completely for each release and then pushed and pulled between build hosts, repositories, and containers in its entirety. We were frequently running into disk space issues on our build hosts, and it was taking as much as 20 minutes to pull the image while starting containers (imagine starting 10k containers with this image). What’s more, there was a requirement to build just one image for different applications from the mono-repo, among others, to support the current release process. We split the dependencies into more stable ones and ones that change rapidly, and we filtered out dependencies that we really need. This resulted in an image that was almost 50% smaller and a significant layer that didn’t have to be rebuilt every time. It still wasn’t great, but we had to prioritize.
Security in the first place
As mentioned before, it may be difficult to justify the use of external software in a company that prioritizes security. Therefore, we had to prepare definitions of images for MongoDB, Zookeeper, and Kafka. More or less customized binaries of these are placed on the shared disks, similarly to the libraries we use. When it came to deployment, we came up with the idea of utilizing the official Helm Charts, as it was just a matter of copy-pasting them. We just needed a few tweaks here and there, e.g. when TLS support wasn’t in place. Obviously, all connections have to be safe in an environment like this. Speaking of TLS, to secure connections in our own applications, we had to implement this based on a custom TCP library that is used firm-wide. This library supported async computations even before it was cool. Also, it came with Kerberos as one of the authentication mechanisms, which is still a standard across the company, but we had to disable it for connections to the cloud. In addition, it turned out that we aren’t allowed to set up connections from within the cloud to the firm. Unfortunately, our data-storing system tended to connect in the forbidden direction. Firstly, we tried to incorporate reverse connections; however, due to their initial instability and the fact that the changes turned out to touch too much of the core system, we decided to implement pluggable proxies that use our existing service discovery and open connections in the opposite direction. As a result, we achieved higher test coverage and stability. On the other hand, more code had to be written.
A proxy or a load balancer, this is the question
Another aspect related to connections was the actual network infrastructure that we could use. Normally, you’d say, “Hey, these are just outbound connections. Let’s just connect to the cloud”, don’t you think? Not here. There is a DMZ between the firm and the cloud, with dozens of firewalls and other security features. We started with the simplest available approach, which was to use a SOCKS proxy managed by an external team. It should have worked with our custom TCP library out of the box. All we had to do was to set a proper address in the code, set up rules with desired destination subnets, and get a couple of approvals. Unfortunately, the proxy was too slow and unreliable for production, but the load balancers saved the day. These were a bit more complicated to use as they required a rigid forwarding configuration and additional firewall rules (with a bunch of extra approvals). Each time we deploy a new cluster, we have to update the destination IP addresses for the exposed services. However, the load balancers enabled our replication-based system to keep up with the real-time data stream.
Running multi-terabyte databases in the cloud
We use a bunch of MongoDB replica sets as our main database. Some of them are almost 12TB large. To make snapshots that could be used to create disks that would later be attached to our containers, we had to rsync the data to VMs in the cloud. Even though we made various optimizations, it was quite slow: 48 hours to copy everything. Good enough for starters, though. With the data in place, we chose encrypted Premium SSD disks that we attached to our Mongo pods. They ran fine until it turned out that they performed pretty badly. We were getting read/write speeds at a level of tens of megabytes per second, whereas on premises it’s hundreds of MB/s. What we didn’t know about these SSD disks was that they were immediately available but were still syncing data in the background, which in our case could take days! This was a no-go, so we found out that the vendor offers VMs with local NVMe disks. The drawbacks of this choice were no out-of-the-box encryption, lack of support for disk snapshots, and we had to take care of any disk/VM failures. Our encryption implementation is a topic for a different story. The main problem was how to efficiently populate the disks. The choice fell to blob storage and a custom binary that was provided by the vendor to support large multi-terabyte files. Now, at startup the Mongo pods pull the data from blob storage, which can take up to a few hours; this is not perfect, but at least they are ready to use and perform similarly to the on-prem disks. We still use the premium SSDs for daily development to quickly deploy Kubernetes clusters with empty databases. Also, we’re working on using a blob storage copy tool instead of rsync, which should improve the speed by several times and simplify the process.
Running 20,000 pods in the cloud
As we mentioned in the first paragraph, we were also asked to migrate the calculation distribution engines that are supposed to scale to thousands of instances. For one of our cloud providers, it’s possible to deploy a cluster with 10 node pools of 100 nodes each. The final number of pods that seemed to work based on many tests and after introducing some improvements on the provider’s side was 1,000 per node pool. This means up to 9,000 pods per cluster because one of the node pools is occupied by kube-system components. Due to this, we needed at least 3 clusters to run 20,000 pods. Therefore, we came up with a term: “multi-cluster”. To reduce the number of stability issues, we agreed on running 5,000 pods per cluster; so, we usually run 4 clusters with the distribution engines. It’s worth mentioning that even at the time of writing the auto-scaler doesn’t behave as expected in these difficult conditions. It’s quite laggy and doesn’t always scale the nodes to the desired numbers. Because of that, we are working on our own scaling plugin. Also, to this day we experience issues with different kube-system components and we have to restart them for some deployments.
Another issue is the number of logs that are produced. You can imagine that 20,000 pods multiplied by a number of log lines, then multiplied by a number of clusters for a given period, results in an insane value. Our predicted monthly costs for logs were reaching millions of dollars if we were to use the vendor’s logging solution. Moreover, many log lines were simply missing! There were hours of gaps for certain containers. Despite the fact we came back to this problem many times, the cloud provider’s support wasn’t able to help us. First of all, no one would like to pay that much money for logs, especially when the storage is inconsistent. Of course, we could focus on limiting the number of entries by lowering log levels for certain apps or packages, shortening the messages etc. To some extent we did this, but it wasn’t very effective in terms of costs. Only one reasonable decision could be made: build an alternative logging infrastructure. The current estimated monthly cost amounts to a few hundred thousand dollars.
DNS resolution issues
Our components, which are deployed across clusters that belong to a multi-cluster, resolve themselves with an external DNS. It came as no surprise that this causes problems. Most of our applications weren’t prepared for missing DNS entries or obsolete addresses. The services are registered every minute, while the entries have a TTL of 5 minutes. This led to binding errors because certain processes were trying to bind to an outdated IP address, for example, after being restarted. Some apps were failing on missing hostnames and didn’t retry as it was just not implemented. To sort this out, we introduced init containers that wait for a pod to register itself and other necessary names to become resolvable. What’s worth mentioning here is that we mostly use headless services, so each pod using them is uniquely identifiable. All our headless services’ names are resolvable, and all the IP addresses are routable between clusters in a multi-cluster, as well as from within the firm. This makes things easier in an already obscure system.
Other large-scale-related issues
Normally, we download TLS certificates to our pods from a vault using an existing solution. This doesn’t scale very well. If 20,000 pods hit the vault simultaneously, it would crash or at least take a long time. We settled on re-working the mechanisms a bit to transform the certs to Kubernetes secrets, which works like a charm. Even if the certs change, the secrets are watched and updated inside the pods by default.
Another entity that was vulnerable to the load from our army of pods was Zookeeper. Before we upgraded to a more recent version and enabled throttling, Zookeeper nodes were getting randomly killed from time to time.
Endless deployments and a strategy for clean-ups
We spent many hours making the deployments take less and less time. We attempted to reduce the size of the image(s) so they could be pulled faster. We made minor improvements, such as optimizing the way we build our certificate trust stores, etc. The main advice I can give is to set the podManagementPolicy to parallel wherever you can. It’s sometimes worth setting up something beforehand and making it wait or even sacrificing it in order to restart a couple of times when a dependency is not ready. Kubernetes and other tools provide the means to do this easily. This practice has proved to be especially useful for our Mongo pods, where we need to download the data first. In general, it’s obviously recommended to do as many things in parallel as possible. What we’d like to do next in this matter is to start setting up parts of a multi-cluster simultaneously.
Rapid deployment is crucial when it comes to continuous integration and end-to-end deployment testing. No one wants to wait hours for builds. Clusters for testing purposes should be short-lived. Because of our long-running deployments, some of us were trying to preserve our clusters by force. A few times we ended up in a situation where someone had left a large testing cluster running for a weekend. As you can imagine, a discussion with a manager is often unpleasant after burning thousands of dollars. A solution to this might be to delete all testing clusters periodically, before weekends, for example. In the case of our manually created clusters, deletion is also manual. However, I imagine it could also be done automatically.
Most likely, we were and still are struggling to make our system work fully reliably and efficiently because it wasn’t cloud-native from the very beginning. It’s quite difficult to make a gigantic monolith and all the tooling around it “cloud-ready”. We’ve achieved a lot, but there’s still lots of room for improvement. Please let us know if you have any ideas in this regard.
Many thanks to my teammate Jakub Węgrzyn for writing this article with me.