Migrating from Marathon/Mesos to Kubernetes

Selasie Anani
mPharma Product & Tech Blog
Jan 23, 2023
Engineers working from mPharma Office in East Legon, Accra

A complete overhaul of infrastructure is one of the most challenging initiatives any tech team undertakes. Often, the engineers behind a particular stack have long since departed, leading to all sorts of unforeseen complications. That wasn’t the case at mPharma: over the years we had adequate knowledge transfer across critical parts of the system. In this article, I would like to share how we migrated our core infrastructure from Marathon/Mesos to Kubernetes.

Over the years, we’ve been actively building a number of products based on the ever-growing needs of patients in Africa. To list a few:

  1. We successfully scaled Bloom, our proprietary application for managing the modern pharmacy, into 9 countries (as of the time of writing, that number has increased to 10).
  2. We also launched myMutti, an e-commerce platform for buying over-the-counter (OTC) medications online and having them delivered to your home.
  3. We built and launched Marketplace, a warehouse and logistics application that tracks the movement of products from our warehouses to partner pharmacies and hospitals.
  4. We launched a last-mile delivery application to further digitize how we transport drugs.
  5. We also started collecting diagnostic data inside Bloom to augment our Mutti Doctor program.

A key event during this period was COVID-19, which led to lockdowns and travel bans. However, the team was motivated by our vision of ensuring every single African has access to quality healthcare.

The engineering team had been contemplating a migration of our core infrastructure for years. However, our explosive growth as a company led us to prioritize building products over reinventing our core infrastructure. In January 2021, we came to the realization that unless we reinvented our infrastructure, we would not be able to support the kind of growth we were experiencing as a company and a team.

The core infrastructure at mPharma had gone through different phases over the years; however, this was our first attempt at a complete overhaul, and at ensuring that our infrastructure matched the growth of the company and the team as a whole.

Our discussions began with a focus on the problem we were looking to solve. This helped reduce any element of bias toward a particular tool and kept us focused on the best solution. A key question we asked each other was “What stack will help us grow?” We started by focusing on five key areas:
1. Scalability
2. Developer Experience
3. Community Support
4. Security
5. Cost

Based on the above, and after a few heated debates, we opted to go with Kubernetes.

What is Kubernetes?

According to Kubernetes.io, Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery. Kubernetes builds upon 15 years of experience running production workloads at Google, combined with best-of-breed ideas and practices from the community.
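
To make this concrete, below is a minimal example of one such logical unit: a Deployment that keeps a fixed number of identical containers running. The service name and image are illustrative placeholders, not one of our actual workloads.

```yaml
# A minimal Kubernetes Deployment (illustrative names, not a real service).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 3                      # Kubernetes keeps three copies running
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: example-api
          image: registry.example.com/example-api:1.0.0
          ports:
            - containerPort: 8080  # port the container listens on
```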

Migration Process

Any migration of this scale usually leads to service downtime; our goal was to either completely eliminate this or minimise its occurrence. Our first step was forming a project team led by a project lead, who was responsible for the end-to-end management of the migration and for keeping all relevant stakeholders up to date on the progress of different initiatives. The key questions we needed to answer included:
1. What will be the additional cost for spinning up additional servers?
2. What is the best way to easily deploy applications to Kubernetes without causing interruptions to the development workflow?
3. How will we efficiently switch traffic from our old servers to the new ones after the migration?
4. How will we measure deployment success?

What will be the additional cost for spinning up additional servers?

As a team, one of our key focuses is reducing costs, as these savings translate into mPharma being able to offer drugs to patients at lower prices. We set out to speak with our cloud provider to see if they had a program for proof-of-concept (POC) projects. We managed to get our cloud provider to cover the cost of our migration until we were in a stable position; we essentially spent $0 in the first few weeks of the migration as a result.

What is the best way to easily deploy applications to Kubernetes without causing interruptions to the development workflow?

We wanted a way to do this migration without disrupting the software development lifecycle of the engineers. A key thing to note is that during the migration, we continued to build products for patients, and the business continued to grow at an alarming rate. Prior to this migration, all our services had a .yaml file that contained deployment instructions. We didn’t want to overburden engineers with writing a new set of deployment instructions, so we decided to centralise our deployment script in GitLab. Within this script, we had sections that deployed code to both Marathon/Mesos and Kubernetes. This ensured that the work of engineers wasn’t affected during the migration period.
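
A simplified sketch of what that central script looked like conceptually; the job names, file paths, and variables here are placeholders, and our actual pipeline had more stages:

```yaml
# .gitlab-ci.yml (sketch): one pipeline deploys to both schedulers, so an
# engineer's push-to-deploy workflow is unchanged during the migration.
stages:
  - deploy

deploy_marathon:
  stage: deploy
  script:
    # Update the existing app definition via the Marathon REST API
    - curl -X PUT -H "Content-Type: application/json" -d @marathon.json "$MARATHON_URL/v2/apps/$CI_PROJECT_NAME"
  only:
    - main

deploy_kubernetes:
  stage: deploy
  script:
    # Apply the Kubernetes manifests for the same service
    - kubectl apply -f k8s/
  only:
    - main
```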

How will we efficiently switch traffic from our old servers to the new ones after the migration?

This was the trickiest and most difficult stage of our migration strategy. Our focus was on finding the path that would result in no disruption in service quality. Switching traffic to the new servers was tricky because, as part of this migration, we had overhauled how we did authentication and authorization. This made a gradual migration impossible; we had to migrate all services at once. After a few discussions, we leaned towards using traffic data to make this decision: we looked at traffic to our applications to decide on the best time to do the switch. After looking at data from our API gateway, we settled on Saturday night, since we saw less traffic on our network on Sundays, so any disruption would have a very small blast radius. We also opted to work from the office to reduce communication friction. We ordered a lot of pizza, for we knew the task at hand was not going to be easy. At 11:00 pm GMT, we proceeded to do the traffic switch.

How will we measure deployment success?

Two of the teams you hear little about are our Quality Assurance and SRE teams. They ensure that the products we build meet the highest standards. However, as Isaac, the QA lead, will always say, “Quality of the products is not just the responsibility of the QA team”. We decided to all chip in and test different parts of the application post-migration to ensure that everything was running smoothly. Another key point I failed to highlight earlier: prior to deploying to production, we had migrated our test and staging environments to Kubernetes. This gave us the confidence that everything was properly tested and working as expected.

Lessons Learned

1. Not paying attention to usage metrics

Shortly after deploying to Kubernetes, we noticed a number of reports from our users about application downtime. In migrating our applications, we were focused on cost-cutting measures and limited the amount of memory and CPU a particular container could be allocated. This led to containers dying or restarting whenever they hit a memory or CPU limit. A key lesson for us: we should have paid attention to container usage metrics from our old infrastructure and used them to decide how to allocate resources to each service.
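
As an illustration, here is what right-sizing a container might look like; the numbers are placeholders that should come from observed usage on the old infrastructure, and the snippet sits inside a Deployment’s container definition like the one shown earlier:

```yaml
# Illustrative resource settings; derive the values from real usage metrics.
resources:
  requests:
    cpu: 250m        # typical observed usage, so the scheduler reserves enough
    memory: 512Mi
  limits:
    cpu: "1"         # headroom so short spikes aren't throttled
    memory: 1Gi      # exceeding this gets the container OOM-killed, so leave a margin
```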

2. Misconfiguration of services by service owners

Another mistake we noticed was misconfiguration by service owners. There was a lot of manual work involved in setting up environment variables for services, which led to engineers making mistakes. This is something we are actively working on as we figure out the best way to manage variables and limit the number of mistakes.
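
One common way to reduce this kind of manual error, sketched here as an assumption rather than the approach we ultimately settled on, is to keep a service’s configuration in a single reviewed ConfigMap and inject it wholesale instead of declaring each variable by hand:

```yaml
# Illustrative ConfigMap; the name and keys are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-api-config
data:
  DATABASE_HOST: db.internal.example.com
  LOG_LEVEL: info
---
# In the Deployment's container spec, every key becomes an environment
# variable in one block, so there is nothing to retype per variable:
#
#   envFrom:
#     - configMapRef:
#         name: example-api-config
```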

3. Lack of Unit and End-to-End Testing

Due to how we handled authentication in our old infrastructure, and the need to migrate as soon as possible, we omitted unit testing and end-to-end testing from our CI/CD pipeline. This led to several bugs that would otherwise have been caught. If we ever do another migration, we will ensure that this stage is included in the CI/CD process.
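
In pipeline terms, the fix is cheap: a test stage that gates the deploy jobs. A minimal sketch, with job names and commands as placeholders for each service’s own tooling:

```yaml
# Sketch: deploys run only after unit and end-to-end tests pass.
stages:
  - test
  - deploy

unit_tests:
  stage: test
  script:
    - make test                    # run the service's unit test suite

e2e_tests:
  stage: test
  script:
    - make e2e                     # run end-to-end tests against staging

deploy_kubernetes:
  stage: deploy
  needs: ["unit_tests", "e2e_tests"]   # explicit gate on both test jobs
  script:
    - kubectl apply -f k8s/
```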

4. Switching traffic

Switching traffic between our old and new infrastructure could have been handled differently. Instead of switching 100% of the traffic at once, we could have done it incrementally to reduce the blast radius.
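
As an illustration only (we did not run this), any gateway or service mesh that supports weighted routing makes an incremental cutover straightforward. Here is what it might look like in Istio’s VirtualService syntax, with placeholder hostnames, shifting the weights in steps as confidence grows:

```yaml
# Illustrative weighted routing: 90% of traffic stays on the legacy
# backend while 10% is sent to the new Kubernetes service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-cutover
spec:
  hosts:
    - api.example.com
  http:
    - route:
        - destination:
            host: legacy-marathon.example.com            # old infrastructure
          weight: 90
        - destination:
            host: example-api.default.svc.cluster.local  # new cluster
          weight: 10
```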

It has taken about six months to write this article, as the team has been focused on ensuring the migration didn’t lead to a degradation in service quality, and on building and improving the products that our customers interact with. From interacting with key stakeholders within and outside the business, I can say the migration was a success. We continue to improve our services to better serve our customers and to ensure that patients benefit from the huge infrastructure investment mPharma is making in healthcare.
