Designing transferable, in-country tech platforms
MomConnect was developed under the leadership of the South African National Department of Health, and we hoped to partner with other governments to bring the service to their countries and so expand the impact of our platform.
We came up with a platform which we called “Seed”. The idea was that we would “plant a seed” in a new country (build), “nurse the seedling” until it was self-sufficient (operate), and finally hand it over to a local partner (transfer). This platform would be entirely self-contained — running the platform in-country was important for data sovereignty.
It’s now 2018, and our model for the expansion of our maternal health platforms has changed substantially, especially with the integration and engagement of WhatsApp on MomConnect and our understanding of the importance (and unique effort) of each local ecosystem. Still, we brought Seed to two African countries: Nigeria with the HelloMama service and Uganda with the FamilyConnect service.
An important part of the Seed strategy was that the system was always intended to be handed over to a local partner. The aim was to expand our impact without hugely expanding our organisation. If we needed to set up offices in every country where Seed was deployed, we would quickly become a much larger organisation than we wanted to be.
Because Seed platforms are hosted in-country, the system involves more “layers of the stack” than is typical. Everything from the basic configuration of virtual machines, to the container orchestration system we use, up to the applications themselves must be handed over. We also couldn’t host our services using any of the major cloud providers, since they do not provide services in the countries we were deploying in.
In this blog post I will reflect on some of the technology decisions we made and the issues we encountered while building, running, and handing over this platform.
Microservices and monoliths
A long-running debate when building web services is whether to use a “monolith” architecture where a single large service performs a variety of tasks or to use a “microservices” architecture where tasks are divided up into several separate services that communicate with one another to provide the overall application.
Most arguments for a monolith architecture revolve around the simplicity of the design and the speed with which such a system can be set up. Common arguments in favour of a microservices architecture are that it is easier to scale and that it allows different teams to work on separate components, thus allowing teams to work more efficiently in parallel.
The original Seed application stack was built around 2014–2016, probably around the peak of the microservices hype in the industry. As such, the platform used a microservices architecture composed of several Django REST Framework-based components. This platform was used locally in South Africa for the MomConnect service as well as in Nigeria for the HelloMama service.
We encountered several difficulties stemming from the platform’s microservices architecture:
- It took some time to become familiar with a lot of failure scenarios since many only arose in production. These failures were difficult to test for as they occurred due to interactions between the services rather than within a single service.
- It was difficult to query data across multiple services using multiple databases. This made common analytical queries harder to run.
- The transaction-level isolation we would normally rely on when querying a database became fragmented when operations were split across multiple services, resulting in complex failure cases where some operations were “half done”.
- Scaling was not easy, despite what proponents of microservices architecture claim. Since a single call to an API may call several other APIs, determining which calls in particular were slow required significant monitoring instrumentation which, at the time, we lacked.
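The "half done" failure case from the list above can be illustrated with a minimal sketch. It is not Seed's actual code: the table names, the `mother-1` identifier, and the two-step registration are all hypothetical, and in-memory SQLite databases stand in for the databases behind each service. The first function writes both records inside one transaction. The second writes to two separate databases, as two services would, and a failure between the writes leaves one record behind.

```python
import sqlite3


def make_db(*tables):
    """Create an in-memory database with one single-column table per name."""
    conn = sqlite3.connect(":memory:")
    for table in tables:
        conn.execute(f"CREATE TABLE {table} (id TEXT)")
    conn.commit()
    return conn


def count(conn, table):
    """Return the number of rows in the given table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def register_single_db(conn, fail_midway):
    """Both writes share one transaction: either both happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO identities VALUES ('mother-1')")
            if fail_midway:
                raise RuntimeError("simulated crash between the two writes")
            conn.execute("INSERT INTO subscriptions VALUES ('mother-1')")
    except RuntimeError:
        pass


def register_split(identity_db, subscription_db, fail_midway):
    """Each write commits on its own, as it would in two separate services."""
    try:
        with identity_db:
            identity_db.execute("INSERT INTO identities VALUES ('mother-1')")
        if fail_midway:
            raise RuntimeError("simulated network failure between services")
        with subscription_db:
            subscription_db.execute("INSERT INTO subscriptions VALUES ('mother-1')")
    except RuntimeError:
        pass


single = make_db("identities", "subscriptions")
register_single_db(single, fail_midway=True)
single_state = (count(single, "identities"), count(single, "subscriptions"))

identity_db = make_db("identities")
subscription_db = make_db("subscriptions")
register_split(identity_db, subscription_db, fail_midway=True)
split_state = (count(identity_db, "identities"), count(subscription_db, "subscriptions"))

print("single database:", single_state)
print("split services: ", split_state)
```

In the single-database case the failed registration leaves no trace. In the split case the identity exists without its subscription, and some other process has to notice and repair that inconsistency.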
To make matters worse, Seed had been developed primarily by a partner that closed shop towards the end of 2016. At the beginning of 2017, a decision was made to halt feature development for the Seed platform.
Around the same time, UNICEF and Nyaruka had developed a software platform called RapidPro that performed most of what we needed the Seed application stack for. Our second iteration of the Seed platform (although by this point we weren’t using the name “Seed”) was the system we developed in Uganda. This used RapidPro, which featured a more monolith-like but greatly simplified architecture.
RapidPro has since been adopted for MomConnect as well. Many organisations will wrestle with the monolith vs microservices conundrum, and this isn’t an attempt to argue that one is overall better than the other. What we can say, at least, is that there are many considerations to take into account before designing a system with a microservices architecture.
In the case of a platform that needs to be managed by engineers who didn’t develop the original software, it can be argued that optimising for simplicity is a reasonable goal and that a monolith architecture is preferable.
Dependence on managed services
At Praekelt.org, we use a number of third-party managed services to build our products. For example, when we are developing code, it is hosted on GitHub and generally has continuous integration set up using Travis CI. We often push code to the Python Package Index (PyPI) or push Docker images to Docker Hub. And once we’ve built the software, we might use even more external services to keep it running.
Any modern technology company relies on many managed services as replicating these services internally is often complex, expensive, or just impractical. Managed services allow companies to focus on what they do best while leaving many difficult tasks to third parties who have the necessary expertise.
If Seed is a self-contained platform, then where do we draw the line as to where the platform begins and ends? It would be impossible for us to build all these services into the Seed platform and run them in-country. In general, we used two “rules of thumb” to determine whether to build something into the cluster or to run it externally:
- Development and associated processes (e.g. continuous integration) should be separated from the system the applications run on in order to ensure portability.
- Don’t put the things you need to monitor unreliable infrastructure on unreliable infrastructure.
All these services can also complicate handovers as they require partners to adopt the services themselves. This could require the partner to create and manage logins and billing for these services, and require us to manage the transferral of resources between accounts.
One of the aims with the Seed platform was to leverage modern container orchestration platforms that provide much higher levels of automation than existing systems. This was important for two main reasons:
- Fewer steps and fewer decisions involved in common tasks such as deployments can make some tasks easier to perform and easier to teach to partners.
- An automated system can (ideally) self-heal and repair itself in the case of a failure — important if the system is hosted in an unreliable environment.
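The self-healing idea can be sketched as a reconciliation step that compares the desired number of application instances with what is actually running and reschedules anything lost to a failed host. This is only an illustration of the general technique, not how any particular orchestrator implements it. The `reconcile` function and the `worker-N` host names are hypothetical.

```python
from collections import Counter


def reconcile(desired_count, placements, healthy_hosts):
    """Return a new list of host placements, one entry per running instance.

    Instances on unhealthy hosts are dropped, and replacements are started
    on the least-loaded healthy host until desired_count is reached.
    """
    surviving = [host for host in placements if host in healthy_hosts]
    load = Counter(surviving)
    while len(surviving) < desired_count and healthy_hosts:
        # Sort first so that ties are broken predictably by host name.
        target = min(sorted(healthy_hosts), key=lambda host: load[host])
        surviving.append(target)
        load[target] += 1
    return surviving


# Three instances spread over three workers, then worker-2 fails.
before = ["worker-1", "worker-2", "worker-3"]
after = reconcile(3, before, healthy_hosts={"worker-1", "worker-3"})
print(after)
```

A real orchestrator runs a step like this continuously and also has to account for resources, health checks, and its own control-plane failing, which is where much of the complexity comes from.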
The problem with these automated, highly-available container orchestration systems is that they are complex. This complexity can increase costs due to additional hosting requirements. When failures do eventually occur that aren’t automatically handled, debugging issues can be very challenging. These systems can also take significant time for engineers to learn, increasing the training requirements and time needed for a handover.
In one case our hosting provider was only able to provide us with a single public address for the entire cluster, so all incoming network traffic had to travel through a single host. To make matters worse, the reliability of the hosts in the cluster was very poor, with malfunctions sometimes occurring multiple times in a single week.
On the one hand, our infrastructure was sometimes very useful to have around in these circumstances. It could generally survive a single host failure within the control-plane and two to three failing worker hosts. But if the host handling incoming traffic went down, the service would be impossible to access. And in many cases the system could fail in unpredictable ways, such as when the networking went down completely.
While we focussed on ensuring that this container orchestration system worked well, there are other areas on which we could have focussed our efforts:
- Optimising for simplicity and fewer hosts could have simplified failure cases and reduced costs.
- Increasing the observability of the infrastructure could have made problems easier to diagnose and fix.
- Extensive testing of failure cases (e.g. by using Chaos Engineering techniques) could have helped us better understand where things can go wrong and how to recover from those failures.
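As a rough illustration of the failure-testing idea, the sketch below injects failures into a stand-in service at a configurable rate and checks how a caller with retries copes. It is a toy example under stated assumptions: `FlakyService` and `send_with_retry` are hypothetical names, and real Chaos Engineering experiments typically target hosts and networks rather than a single function.

```python
import random


class FlakyService:
    """A stand-in for a remote service that fails at a configurable rate."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so that runs are repeatable

    def send(self, message):
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return f"sent: {message}"


def send_with_retry(service, message, attempts=5):
    """Try to send a message, retrying on connection errors."""
    last_error = None
    for _ in range(attempts):
        try:
            return service.send(message)
        except ConnectionError as exc:
            last_error = exc
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


# Measure how many of 100 messages get through when 30% of calls fail.
service = FlakyService(failure_rate=0.3)
delivered = 0
for i in range(100):
    try:
        send_with_retry(service, f"message {i}")
        delivered += 1
    except ConnectionError:
        pass
print(f"delivered {delivered} of 100 messages")
```

Raising the failure rate or lowering the number of attempts shows where the caller stops coping, which is the kind of knowledge that is otherwise only gained in production.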
Ultimately, one can only be prepared for a certain number of unexpected failures. The question is where it is most important to focus one’s efforts to ensure service uptime while also balancing cost, operational capacity, the amount of training required, and many other concerns.
Industry standard tools and abstractions
Perhaps the biggest advantage of our container orchestration system is not the increased automation or high-availability capabilities, but rather that it is a very useful abstraction. If we can get a set of servers to a state that our container orchestration software is running, then the system is in a well-known state and we can deploy our software as containers much as we would on a cluster anywhere else in the world.
Having this commonality between the clusters we run means that whether the cluster is in South Africa, Ireland, or Nigeria, if we have a Docker image for our software, we know we can run it and we know how. In at least one case, this common platform also allowed us to move between hosting providers quickly and smoothly.
Without a common platform, it is easy for each datacenter configuration to diverge, resulting in “snowflake” platforms that are subtly different from each other and complicate deployments.
Recently, Kubernetes has seen a lot of adoption as a container orchestration platform, to the point that one could consider it the de-facto standard in the space. It could be argued that an advantage that Kubernetes has over other platforms is the fact that it is so widespread, which increases the chances that a partner will have experience with it or be able to hire expertise for it.
We use Mesosphere DC/OS instead of Kubernetes. Some disadvantages we see with Kubernetes are difficulty deploying outside of the major cloud providers (DC/OS is simple in comparison) and additional abstraction complexity (for example, Kubernetes’ networking model is notoriously complex). In most cases we’re unable to use the latest managed Kubernetes products from the major cloud providers, such as AWS EKS. Still, we continue to assess Kubernetes for the future of our platform, potentially running on top of DC/OS.
Another tool we use that was critical to deploying the Seed platform to multiple countries was Puppet. Puppet is a configuration management (CM) tool that allows us to ensure that our server configurations are in a consistent state. Being able to push configuration changes to multiple servers at once and manage all the configuration from a central place makes it possible for a small team of 4 people to manage several clusters of servers.
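The core idea behind configuration management, describing a desired state and converging each server towards it, can be sketched in a few lines. This is not Puppet code and does not use Puppet's API. The `ensure_file` helper, the file path, and the setting in it are hypothetical and only illustrate why such operations are safe to run repeatedly.

```python
import tempfile
from pathlib import Path


def ensure_file(path, content):
    """Make sure the file at path has exactly this content.

    Returns True if a change was made and False if the file was already
    in the desired state, so running it repeatedly is harmless.
    """
    path = Path(path)
    if path.exists() and path.read_text() == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "etc" / "example.conf"
    desired = "max_connections = 100\n"

    first_run = ensure_file(target, desired)   # file is missing, so it is created
    second_run = ensure_file(target, desired)  # already correct, nothing to do

    target.write_text("max_connections = 5\n")  # simulate manual drift on one server
    third_run = ensure_file(target, desired)    # drift is detected and corrected

print(first_run, second_run, third_run)
```

Applying the same description to every server is what keeps configurations consistent across clusters, and the returned flag is the kind of signal a tool can use to report which servers had drifted.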
While using a configuration management tool could be considered an industry standard or best practice at this point, some organisations still do not use these kinds of tools. In at least one case, a partner had not used a configuration management tool before. This brings up the question of how much knowledge one can expect a partner to have ahead of a handover, which we explored in a previous blog post.
While it’s important to take into account how suitable a platform is for handing over to a partner, there are many other concerns when designing a technology platform. Many decisions are shaped simply by what technologies the organisation is already familiar with — which is mostly determined by the particular history of the organisation.
Still, it’s important to ask the difficult questions about, reflect on, and learn from past decisions in order to build better platforms. The lessons we learnt operating Seed shaped the subsequent version of the platform and the experiences we’re having handing over these platforms will help inform future technology decisions.