A month full of Kubernetes

Pedro Díaz
Mercadona Tech
Oct 30, 2019

Mercadona Tech was created to rebuild Mercadona's online channel from scratch. In that journey, we created our own warehouses where we prepare our customers' orders ("boss", in our internal lingo). Those warehouses run software that our team creates and maintains in the cloud. Currently, we rely on a stable internet connection for those applications to work, so we have several layers of redundancy to mitigate a possible outage.

Recently, our industry has seen a trend toward hybrid cloud for mission-critical workloads.

In the pursuit of resilience, we have been toying for a few months with the idea of exploring a hybrid cloud. Currently, all our workloads run in a single public cloud and our services require a continuous connection. We would like our warehouses to work 99% of the time without depending on an internet connection.

Since Mercadona already has experience with bare metal (we do own a couple of data centers), exploring this option seems a natural step, and we do not foresee major issues with the learning curve of administering our own metal.

We feared, though, that the technology might not be mature enough for production workloads, so we tackled it as an experiment/spike rather than a full project. There were, and still are, many unknowns about the feasibility of the technology, especially around the implementation and maintenance of a hybrid cloud. Our current production ecosystem was born in the cloud, and now we would have metal to take care of.

I would like to stop here for a second and give you a bit more context on how the Mercadona Tech ops/SRE team is composed. At the time of writing this blog post, we are four SREs for the whole engineering team (around 30 engineers). Our daily tasks vary from tackling our own roadmap issues to supporting the engineering teams we are embedded in.

In addition, our current infrastructure resides on Google Cloud, where we have a few GKE Kubernetes clusters. With the addition of a hybrid cloud, we will not only administer a cluster but also take care of the metal underneath it. Thus, any hybrid solution we explore must fulfill some requirements for us:

- We should be able to introduce it seamlessly into the developer workflow without big changes to how they develop and deploy their applications.

- It should be fairly easy for a small SRE team to maintain the infrastructure. All of a sudden, we will have bare metal, and those failing disks are not going to fix themselves.

- The infrastructure should be highly reliable.

Five months ago, at the time of writing this article, we started to put the plans together. We knew we could not do this alone at Mercadona Tech, so we went to our larger organisation, Mercadona, asking for expertise to form a cross-functional group that could explore the idea without constraints. The Mercadona IT department has about 500 employees with a very diverse set of skills, some of which were very much needed for this experiment.

We put together a team of about 10 people gathered from different Mercadona teams. It included us from Mercadona Tech and a few others, such as network specialists, sysadmins and security experts. After an initial conversation, we set a start date and a list of goals to achieve.

We work in an agile fashion at Mercadona Tech. Therefore, we divided the available time into two sprints of two weeks each. As we had a lot of uncertainty about the project, we decided to relax the Scrum/Kanban rules. That allowed us to move quickly; the process was there to help us, not to get in our way.

What did we learn from the spike?

First, we needed to decide which flavour of Kubernetes to try. We wanted a certified Kubernetes distribution to ensure it was as vanilla as possible, so we compared the current offerings. We ended up with two candidates: Kontena Pharos and Rancher RKE. In addition, we needed to decide which OS to run. We have internal experience maintaining Ubuntu, so we chose it as our main operating system, although we tried CoreOS as well with RKE.

We set up an experimental laboratory with vSphere running on top of VMware ESXi hosts so we could create all the virtual machines needed to install Kubernetes on them. We benefited a lot from our Mercadona peers' experience running and maintaining large production bare-metal deployments.

During the first two-week sprint, we acquainted ourselves with all the components. We defined a simple architecture for the cluster: 3 master nodes for HA and 4 worker nodes with different labels so we could distribute workloads using affinity rules. Additionally, we invested a bit of time in some minimal automation to increase velocity. We used Terraform to keep our infrastructure as code so we could easily tear clusters up or down, which allowed us to perform several different tests, some of them so aggressive that we knew the result would be the death of the cluster.
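To make the labelling idea concrete, here is a minimal sketch of how a workload can be pinned to a subset of labelled workers and spread across them for HA. The application name (`picking-api`) and the `workload-type` label are hypothetical, not the labels we actually used:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: picking-api        # hypothetical application name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: picking-api
  template:
    metadata:
      labels:
        app: picking-api
    spec:
      # Only schedule onto workers carrying this (assumed) label
      nodeSelector:
        workload-type: backend
      affinity:
        # Keep replicas on different nodes so one node failure
        # does not take down every copy of the application
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: picking-api
            topologyKey: kubernetes.io/hostname
```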

Since the applications run in a controlled environment within our warehouses, our load-balancing needs are very modest. The only thing we need to ensure is HA across all the replicas of an application through a floating IP.
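From the application's point of view, that requirement is just a standard `LoadBalancer` Service; on bare metal, something external then has to provide the floating IP behind it. A sketch, with a hypothetical application name:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: picking-api        # hypothetical application name
spec:
  # On bare metal this type needs a load-balancer implementation
  # to assign the external (floating) IP
  type: LoadBalancer
  selector:
    app: picking-api
  ports:
  - port: 80
    targetPort: 8080
```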

Knowing our needs, we tried F5 BIG-IP and MetalLB. We loved how seamlessly F5 integrates with Kubernetes, but we had the feeling we had bought a fancy car just to use the radio (leaving aside the license costs). On the other hand, MetalLB with the ARP configuration did just what we needed, so we decided to go with it.
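For reference, this is roughly what MetalLB's ARP (layer 2) configuration looked like at the time of writing, when MetalLB was configured through a ConfigMap (newer releases use CRDs instead). The address range is a placeholder, not our real one:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2        # ARP-based announcement, no BGP required
      addresses:
      - 192.168.1.240-192.168.1.250   # placeholder floating IP range
```

With this in place, MetalLB hands out an IP from the pool to each `LoadBalancer` Service and answers ARP requests for it from one of the nodes.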

We were very surprised by the maturity of the solution. Neither Pharos nor RKE gave us any issues deploying new clusters, upgrading them, or adding/removing nodes.

Comparing Pharos and RKE, we liked the streamlined process both offer to build a Kubernetes cluster from scratch. From Rancher, we liked that it supports both operating systems, Ubuntu and CoreOS. However, we did not like that all the core Kubernetes components run as containers; we felt that would make the Docker daemon a single point of failure. Pharos, on the other hand, installs its components as system services instead of containers, which is a more familiar and comfortable design.

What are the next steps?

The knowledge gained from this experiment gave us the confidence that we can implement it for a production workload. Now it is time to sit down with our engineering teams, especially the backend teams, and plan carefully with them how to proceed. This will involve adapting our applications to run in the new hybrid-cloud environment.

If this sounds like a fun challenge to you, join us!
