How do you run a scalable Magento 2 instance in Kubernetes

Pepijn Blom
Jan 29, 2020 · 19 min read


In this post we explain how we migrated our Magento 2 instances to Kubernetes, including the hurdles, both obvious and less obvious, that we faced along the way.

At gracious we build and maintain e-commerce solutions and other applications for clients running high-traffic, high-revenue webshops. A lot of our customers run Magento 1 or 2, which we hosted on the Rackspace Cloud platform. As our clients grew and traffic increased we had to start scaling out, which presented quite a few challenges along the way.

Background Information

Traditional Magento 1 hosting consisted of using the largest possible server one could afford. It was very common to see shops hosted on a single 64GB dedicated server. The idea behind this setup was to keep Magento as close to the database as possible to minimise latency, because Magento 1 used an EAV data model for, amongst other things, the product attributes. Magento later added “flat tables”: single tables containing the EAV data, generated by indexers. The indexers basically queried the EAV tables, inserted the results into a single table, and Magento would then query that table for product information instead. This made queries much faster and removed quite a bit of latency in the process. With that latency removed, it became common practice again to use a separate database server.

Support for caching in Memcached or Redis was added later. By storing the sessions in MySQL or Redis it became possible to deploy multiple frontend servers and thus scale horizontally.

Our old setup on Rackspace Cloud

With horizontal scaling an option, we came up with a working solution for our various clients which involved running their site on multiple servers. Between three and eight smaller servers turned out to be the sweet spot, depending on the traffic and the quality of the code.

The Rackspace Cloud uses an OpenStack-based VPS solution and offers simple IaaS/PaaS products for databases, message queues and other software. We were able to combine hardware- and software-based solutions but quickly started running into trouble as our clients grew rapidly. We struggled with large traffic spikes and sometimes had to add one or more new servers “quickly”. Thankfully we were using Ansible, which sped up the process of setting up servers.

The biggest problem was syncing the code to all the servers. Since we only had a small number of servers we were advised to use Lsyncd. What Lsyncd basically does is watch for inode changes and then fire off one or more rsync commands to sync the watched directory to external servers.

This worked OK, but sometimes Lsyncd would crash because of the number of changed files and had to be restarted. Magento 1 consists of a lot of tiny files, currently around 14,500 in core alone; Magento 2 clocks in at around 68,400. Add to that the fact that Magento thinks it’s fun to update the copyright date in every file on updates and you have a recipe for disaster.

At that time we had a hardware load balancer and firewall which passed traffic on to various VPS servers running in OpenStack. For some clients we also used dedicated hardware database clusters.

At the time it seemed like an amazing setup but as we discovered later, all was not well.

Problems with our setup as recommended by Rackspace

1. Hardware load balancer
We were sold a hardware load balancer to be able to connect to a hardware MySQL cluster Rackspace had recommended. The big problem was that we were not allowed to use its API and had to add any rules through a web interface. This basically stopped us dead in our tracks when we wanted to automatically add new servers to a pool. It was later alleviated slightly by creating servers through the web interface with the name of the pool in a certain field, but this was not possible with tools, so no automation there either.

2. Hardware firewall
The firewall was also only configurable through a web interface. On top of that, they sold us a 10 Mbit firewall to start with, which was nowhere near fast enough. It severely limited MySQL and Redis performance, because the hardware MySQL cluster they had sold us was connected to the cloud servers using RackConnect, which ran on the firewall. The result was many more hops on the network.

3. Central server running cache & distributing code
As you can imagine, having one central server to distribute all code can turn into a problem. On quite a few occasions this server went down and took the whole cluster with it, because the cache and sessions were unavailable. Magento does not fail gracefully when it doesn’t get what it wants. Distributing code using rsync also takes a while, which caused problems when frontend servers were running different versions of the code.

Problems with Rackspace in general

1. Fanatical support
If there’s one thing I remember from our time at Rackspace it’s the term “Fanatical Support”. They were mostly fanatical about saying “what you want is not covered by our SLA and we can’t guarantee that your ticket will be picked up by anyone if they don’t feel like it”.

Basically, Rackspace only does what is safe and is afraid to try anything new. If you’re a dusty old company then by all means go this route, but if you’re exploring the boundaries of internet development and are on the cutting edge of technology this is a major hindrance.

More often than not, when we called the support line we already knew what the problem was but had to go through a lengthy discussion convincing the tech of it. I clearly remember a case where they had made an error in the redirect and caching settings of the load balancer and a lot of images on the site stopped loading. A quick curl revealed that these images were caught in a redirect loop from https to https. We spent almost four hours on the phone with various techs to get them to undo the redirect and flush the cache. At some point the tech and their manager gave in to our pleas to flush the cache, even though they didn’t believe it would help, and the problem was fixed instantly. This is just one of many examples of how fanatical support was more of a hindrance than a blessing.

2. Outdated firmware on hardware
We have always been at the forefront of technology, implementing the newest developments well before the competition here in the Netherlands. As such we wanted to implement HTTP/2 as soon as it became available, but because we were using the F5 load balancer Rackspace had pushed on us, we had to request a firmware update. The firmware on the device was at least two years old (!!) and we were told not to expect an update any time soon because it would have to go through quality control first. After a full year there was still no progress. We resorted to Cloudflare to offer our clients HTTP/2, and eventually even used Cloudflare’s load balancing, circumventing our own load balancer.

3. No containers
Rackspace missed the boat on container-based hosting and deployment and instead focused all their cloud efforts on OpenStack, which might be great for traditional IaaS providers but didn’t work so well for the DevOps teams trying to use it. So while Google Cloud, Azure, DigitalOcean and Amazon were innovating with new technologies like Docker and Kubernetes, Rackspace decided to pivot their business model and become a premium service provider for these other clouds, which had been their direct competitors.

This meant further development on current IaaS/PaaS solutions came to a halt. No new features were introduced, bugs took longer to get solved and partner programs were dissolved.

4. A lot of downtime
As a result of their new business model we saw problems with our VPSes and databases increase tenfold. VPSes and databases frequently became unresponsive, and starting a new server image took one to two hours. That meant we couldn’t uphold the uptime promise we had made to our clients.

5. Horrible or no scaling solutions
Running high-traffic e-commerce applications means certain unintended marketing campaigns can increase web traffic 100x in a matter of seconds (which happens often if you use instafamous social influencers). You can mitigate this either by always having a lot of overcapacity or by auto- or upscaling on traffic spikes. 1000% overcapacity is not really viable with Magento, so we resorted to auto/upscaling VPSes when possible. But as mentioned before, starting a new server image took two hours, so more often than not we missed the peak and experienced either downtime or decreased performance due to overloaded servers.

Databases also tend to grow very fast with Magento, requiring resizing of RAM and disk. This is possible in Rackspace Cloud but meant 30 minutes to 1.5 hours of downtime depending on the database and server size.

6.The bigger they are the cheaper they get
In the end it all comes down to cost. AWS, Google Cloud and Azure all have better pricing in regards to PaaS and cloud computing services which meant lower TCO (Total Cost of Ownership) for us and our clients. Furthermore, they all heavily invest in the development of new technologies that make the lives of developers much better and easier.

Our wishlist

At some point we had had enough of all the problems and decided a change was in order. Our solution at Rackspace was becoming exceedingly pricey and was basically bursting at the seams. We’d outgrown the setup and saw no easy way of expanding without pushing costs to extreme heights.

We looked at what our options were and which technologies we wanted to use to be able to grow further. Our new hosting environment would have to meet at least the following requirements:

  1. Automatic scaling
  2. Lower costs
  3. Advanced security options
  4. Support for containers
  5. Reliable services to lower our operation and hosting costs
  6. Fast global network
  7. Innovative company with regular improvements & new services
  8. Full control of all aspects of our environment
  9. Simple to use interface

Why we chose Google Cloud

We evaluated various clouds and eventually settled on Google Cloud. It was a breath of fresh air compared to dusty Rackspace. All of a sudden we had access to a humongous list of services with an easy to understand interface and the right amount of configuration options.

Even though AWS had more services at the time, it didn’t speak to us. The interface is outdated and confusing, not to mention the overly complex account security settings. The pricing is also a bit unclear, and the zone locations weren’t ideal for us: we really wanted to host as close to our customers’ clients as possible, in the Netherlands.

Right when we started searching for a new host Google announced they were building a datacenter in Eemshaven. It was the perfect location for us and it also spoke to us because it was the first datacenter to be powered by 100% renewable energy right from the start.

The biggest challenge was choosing the option for hosting our customers’ websites because all of a sudden we had multiple:

  • App Engine
  • Compute Engine
  • VM Instances
  • Instance groups using templates
  • Kubernetes Engine (GKE)

We actually use all of the above today: App Engine for smaller projects that don’t need a cluster of servers running 24/7, and a few single VM instances for legacy applications. But for our big projects there was an obvious winner: Kubernetes.

Why we chose Kubernetes

First off, what exactly is Kubernetes?

Kubernetes (commonly stylized as K8s) is an open-source container-orchestration system for automating deployment, scaling and management of containerized applications. It was originally designed by Google and is now maintained by the Cloud Native Computing Foundation. It aims to provide a “platform for automating deployment, scaling, and operations of application containers across clusters of hosts”. It works with a range of container tools, including Docker.

What this comes down to is that you containerise your application by building Docker images. These images are stored in a registry like the Google Container Registry and are used to spawn containers running in Kubernetes, which handles things like routing traffic and deploying your images to your nodes (the term used for servers running Kubernetes).

We had been running a Docker Swarm on a couple of VM instances, but it didn’t feel as reliable as we had hoped, and it lacked the ability to automatically scale the number of servers when traffic changed. We also had problems updating the swarm on a few occasions, and a few network issues resulted in restarts of the Docker daemon, which didn’t always work out too well either.

The beauty of GKE is that it can automatically scale the number of servers, and we have them running Google’s Container-Optimized OS (COS). It’s an OS built on top of the open-source Chromium OS, specifically engineered for running containers.

A quick list of benefits of using COS:

  • container support
  • automatic updates
  • secure by default
  • minimal design
  • open-source

Our new setup

We have migrated almost all our clients to Kubernetes clusters; we’re now autoscaling Magento 1 and 2 and have survived TV commercials and Black Friday without a single moment of downtime. I’ll break down how it’s all set up so you get a general idea of how Kubernetes works.

Running multiple environments

In our old setup we would have a VPS dedicated to testing and acceptance environments. With Kubernetes it’s possible to fence off your environments even though they are running on the same hardware. This is achieved using namespaces. Namespaces create a separation of resources, and quotas can be set per namespace. By default, applications running in one namespace cannot connect to applications running in another namespace. There are ways around this, which will be discussed later in the Redis/Elasticsearch section.
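As a rough sketch (names and quota values here are placeholders, not our actual configuration), a fenced-off environment with a quota looks like this:

```yaml
# Create a separate environment and cap the resources it may claim.
apiVersion: v1
kind: Namespace
metadata:
  name: acceptance
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: acceptance-quota
  namespace: acceptance
spec:
  hard:
    requests.cpu: "2"       # total CPU all pods in this namespace may request
    requests.memory: 4Gi    # total memory they may request
```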


Pods

Kubernetes has a handy feature not found in standard Docker: pods. A pod is a bundle of one or more containers which will always run together on the same server (referred to as a node). Kubernetes keeps track of all containers in your pod via liveness and readiness probes. If even one container in your pod dies, the pod is deleted and a new one is created. This might sound strange if you’re not using containers yet, but it makes sense when you remember that all containers should be identical and should not rely on any data written inside them. For that you should use a shared volume via something like NFS/SMB. We use Google Filestore for this, a managed fault-tolerant NFS solution.
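To give an idea of what those probes look like, here’s a sketch for an nginx container; the port and health-check path are placeholders, not Magento defaults:

```yaml
# Part of a container spec: restart the container if it stops answering,
# and only route traffic to it while it reports ready.
livenessProbe:
  tcpSocket:
    port: 80             # is nginx still accepting connections at all?
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthcheck   # hypothetical endpoint served by the application
    port: 80
  periodSeconds: 5
```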

If you’ve ever worked with PHP in Docker you might have opted for an init tool which lets you run multiple processes inside a single container. This is bad practice, because one process can die without Docker recognising it, resulting in a broken container which is not restarted and an application that is (partly) broken. We used to use Chaperone for this and it worked OK, but it made the containers quite a bit bigger because it required Python 3. Considering we didn’t need Python for anything else, it was a shame to have to install it. We now use Krallin’s Tini exclusively, for everything from static React sites to Symfony and Magento 2 applications.


Deployments

These are your bread and butter. Deployments are configurations which define the contents of pods and how they should be run. The beauty of the pod system is that you can run multiple containers in one pod which can communicate with each other as if they were running on the same server. We use it to run nginx and php-fpm as close together as possible. This removes the latency you would get from running them on separate servers (which could happen with vanilla Docker) and avoids having to mess with running multiple processes in a single container.

Common things you will find configured in the deployment files:

  • containers
  • ports
  • resource requirements and limits
  • volume mounts
  • environment variables
  • update strategy
  • replicas

Things NOT configured in your deployment files:

  • autoscaling
  • IPs
  • routing
  • crons
  • passwords
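A stripped-down sketch of such a deployment, with nginx and php-fpm sharing one pod (image names and resource values are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shop-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: shop-frontend
  strategy:
    type: RollingUpdate        # replace pods gradually on deploys
  template:
    metadata:
      labels:
        app: shop-frontend
    spec:
      containers:
        - name: nginx
          image: eu.gcr.io/example/shop-nginx:1.0.0
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
        - name: php-fpm
          # nginx reaches this container on 127.0.0.1:9000 because
          # containers in a pod share a network namespace
          image: eu.gcr.io/example/shop-php:1.0.0
          env:
            - name: MAGE_MODE
              value: production
          resources:
            requests:
              cpu: 500m
              memory: 768Mi
```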


Services

Here is where you give your deployments a name which can be used by other deployments to talk to them. Let’s say we have a Redis deployment and name its service redis-cluster. You will be able to connect from your code to this service using the DNS name redis-cluster. You won’t have to worry about its IP or which instance you are connecting to; the service will take care of that for you and handle all the routing.
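Sticking with the redis-cluster example, the service definition is tiny (the label is a placeholder):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-cluster     # pods resolve this as the DNS name "redis-cluster"
spec:
  selector:
    app: redis-cluster    # routes to all pods carrying this label
  ports:
    - port: 6379
```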


Ingress

This holds your external IP and connects your services to the internet. Incoming requests are handled by the ingress and routed to the correct services according to rules you have defined. It’s also possible to add certificates to your ingress to enable https support. On Google Cloud it’s trivial to set up a Managed Certificate, which handles issuing and renewing the certificate for you.

An example of an ingress:
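Something along these lines; the hostname and service names are placeholders, and GKE ingresses at the time still used the v1beta1 API:

```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: shop-ingress
  annotations:
    # attach a Google-managed certificate by name
    networking.gke.io/managed-certificates: shop-certificate
spec:
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /*
            backend:
              serviceName: shop-frontend
              servicePort: 80
```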

The accompanying Managed Certificate:
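A sketch, with a placeholder domain (the ManagedCertificate API was still in beta at the time):

```yaml
apiVersion: networking.gke.io/v1beta1
kind: ManagedCertificate
metadata:
  name: shop-certificate
spec:
  domains:
    - shop.example.com   # Google provisions and renews the certificate for this domain
```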


CronJobs

We all know what crons are and they shouldn’t need further explanation. The only thing different about Kubernetes crons is that they are also started as pods. Each time a cron runs, a fresh pod is created and your command is executed in it. You could compare this to doing the following in Docker (except that your pod could start multiple containers):

docker exec your-php-application php cron.php

You can configure crons with various options, like retries on failure or allowing only a single run at a time.

One important thing to keep in mind: if you have a container which doesn’t exit at the end of the cron (let’s say a proxy or webserver), then Kubernetes will think your cron is still running and will eventually kill it and mark it as failed. It could then restart the cron automatically.
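A sketch of a Magento cron as a Kubernetes CronJob; the image name, schedule and timeouts are placeholders:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: magento-cron
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid        # allow only a single run at a time
  jobTemplate:
    spec:
      backoffLimit: 2              # retry a failed run twice
      activeDeadlineSeconds: 600   # kill runs that hang or never exit
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cron
              image: eu.gcr.io/example/shop-php:1.0.0
              command: ["php", "bin/magento", "cron:run"]
```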


Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler takes care of scaling your pods horizontally: it increases or decreases the number of pods according to criteria like CPU usage or the number of incoming requests. On Google Cloud you can even configure custom metrics from Stackdriver, such as the number of messages in a Pub/Sub queue.
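A minimal sketch targeting a hypothetical shop-frontend deployment on average CPU usage:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: shop-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shop-frontend
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70  # add pods when average CPU exceeds 70%
```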

Persistent Volumes

If the application running in your containers needs to store files, you will want a persistent volume mounted to a directory in those containers. As you (should) know, you cannot persist files inside a container, especially in an environment which automatically scales: your containers will come and go all the time.

There are various types of volumes you can use, but generally you’ll want one which lets multiple containers read from it, and probably also write to it. In the case of Magento we use these to store all the images, and each container can write resized images to the volume. On Google Cloud we use Filestore as the managed NFS backend for this. You are free to set up your own NFS server, and even SMB is supported, so the same idea works on Amazon or Azure.
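A sketch of an NFS-backed volume and its claim; the server IP, path and size are placeholders for a Filestore instance:

```yaml
# ReadWriteMany lets every pod mount the share read-write at the same time.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: magento-media
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.2        # placeholder for the Filestore instance IP
    path: /shares/media
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: magento-media
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""      # bind to the pre-created volume above, not a dynamic one
  resources:
    requests:
      storage: 1Ti
```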


Secrets

The first thing about secrets is: you should keep them to yourself. If you share them with everyone they aren’t secrets anymore. What do we consider secrets when it comes to DevOps? Login details for databases, SSL certificates and access tokens for various systems are all good examples of information you’d want to store in a secret and not, god forbid, in git.

Secrets can be mounted as files in your containers or injected as environment variables. Magento 2 uses PHP files for its configuration, so it’s no problem to use environment variables to configure it.
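A sketch of the environment-variable route; the secret name and key are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: magento-secrets
type: Opaque
stringData:                  # stored base64-encoded once applied
  DB_PASSWORD: change-me
---
# Referenced from a container spec as an environment variable:
# env:
#   - name: DB_PASSWORD
#     valueFrom:
#       secretKeyRef:
#         name: magento-secrets
#         key: DB_PASSWORD
```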

In the case of Magento 1 there is no support for environment variables in the app/etc/local.xml file, but we have a module available which lets you do just that.

Building Magento 2 docker images

I’m not going to sugarcoat this. Magento 2 is possibly the worst system I have come across to build Docker images for. It’s a nightmare and took us quite a while to get right. For starters, it needs a database to build. Yes, really.

During the build it also makes changes to that database. So you might want to think twice about using your production database for this, because one small error in your build process and you’re stuck with an upgraded database which is out of sync with the code.

We overcame this problem by making a dump of a stripped-down version of the production database and using that as a GitLab service during our build stage. This basically spawns a fresh MySQL container and imports a pre-defined .sql file, ensuring there’s a database to work with during the build. After the stage completes, the container is thrown away again. It does mean the production database still needs to be upgraded after deploying a new version of the code, but we solved that in the deployment stage.

Here’s a slimmed-down version of the installation stage. You may notice it uses Redis as well; this sped the build up a bit and was an easy addition.
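A sketch of what that looks like in .gitlab-ci.yml; the image names, the dump file and the exact set of Magento commands are our own placeholders:

```yaml
install:
  stage: install
  image: eu.gcr.io/example/php-builder:7.3   # assumed to contain PHP, Composer and the mysql client
  services:
    - name: mysql:5.7
      alias: db
    - name: redis:5
      alias: redis
  variables:
    MYSQL_DATABASE: magento
    MYSQL_ROOT_PASSWORD: root
  script:
    # give the build a database to talk to
    - mysql -h db -u root -proot magento < build/stripped-production.sql
    - composer install --no-dev
    - php bin/magento setup:upgrade
    - php bin/magento setup:di:compile
    - php bin/magento setup:static-content:deploy en_US
  artifacts:
    paths:
      - vendor/
      - generated/
      - pub/static/
```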

Our env.php uses environment variables to connect to the database and Redis, so we can use the same file for all environments.
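Roughly along these lines (a trimmed sketch; a real env.php contains more keys):

```php
<?php
// env.php reading its connection details from the container's environment,
// so one file works for test, acceptance and production.
return [
    'MAGE_MODE' => getenv('MAGE_MODE') ?: 'production',
    'db' => [
        'connection' => [
            'default' => [
                'host' => getenv('DB_HOST'),
                'dbname' => getenv('DB_NAME'),
                'username' => getenv('DB_USER'),
                'password' => getenv('DB_PASSWORD'),
            ],
        ],
    ],
    'session' => [
        'save' => 'redis',
        'redis' => [
            'host' => getenv('REDIS_HOST'),
        ],
    ],
];
```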

After the installation stage we process the resulting code in a build stage:
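As a sketch, with placeholder image names, the build stage could look like this:

```yaml
build:
  stage: build
  image: docker:19.03
  services:
    - docker:19.03-dind
  dependencies:
    - install    # pull in the artifacts produced by the install stage
  script:
    - docker build --build-arg MAGE_MODE=production -t eu.gcr.io/example/shop-php:$CI_COMMIT_SHA .
    - docker push eu.gcr.io/example/shop-php:$CI_COMMIT_SHA
```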

This stage builds the Dockerfile and, as you can see, depends on the install stage. We also set the MAGE_MODE environment variable by passing a build argument, which is assigned to the env var of the same name in the Dockerfile:
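A minimal sketch of such a Dockerfile (base image and paths are placeholders), including Tini as the init process mentioned earlier:

```dockerfile
FROM php:7.3-fpm-alpine

# the build argument from the CI job, re-exported as a runtime env var
ARG MAGE_MODE=production
ENV MAGE_MODE=${MAGE_MODE}

# tini as PID 1 so signals are forwarded and zombie processes are reaped
RUN apk add --no-cache tini

COPY --chown=www-data:www-data . /var/www/html
WORKDIR /var/www/html

ENTRYPOINT ["/sbin/tini", "--"]
CMD ["php-fpm"]
```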


Deploying our images

Currently we are using quite an easy way of deploying our images. There’s an official Google Cloud SDK image, google/cloud-sdk:alpine, but it is missing the kubectl command, so we created our own image that already includes it, cutting some time off the deployment.

As you can see below, we extracted the commands that are shared between environments into a GitLab template, in this case .deploy. It lets us deploy to multiple environments by changing only a few variables, as you can see in the deploy:production section.
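A sketch of such a template; the image, variable names, cluster details and deployment name are placeholders:

```yaml
.deploy:
  stage: deploy
  image: eu.gcr.io/example/cloud-sdk-kubectl:alpine  # our SDK image with kubectl baked in
  script:
    - echo "$GCLOUD_SERVICE_KEY" > /tmp/key.json
    - gcloud auth activate-service-account --key-file=/tmp/key.json
    - gcloud container clusters get-credentials $CLUSTER --zone $ZONE --project $PROJECT
    - kubectl -n $NAMESPACE set image deployment/shop-frontend php-fpm=eu.gcr.io/example/shop-php:$CI_COMMIT_SHA

deploy:production:
  extends: .deploy
  variables:
    NAMESPACE: production
    CLUSTER: shop-production
    ZONE: europe-west4-a
  only:
    - master
```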

Deploying Magento requires quite a few separate image updates. We have two cronjobs and one deployment. You might notice three cronjobs: we’re actually abusing one of them to upgrade the database (remember what we discussed above: Magento needs a database upgrade!). It’s a cronjob with no schedule, and we spawn a job from it during deployment.
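A sketch of that schedule-less cronjob trick (names are placeholders): the CronJob is suspended so it never fires on its own, and the deployment pipeline spawns a one-off job from it.

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: magento-setup-upgrade
spec:
  schedule: "0 0 1 1 *"   # required field, but irrelevant because of:
  suspend: true            # never runs on a schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: upgrade
              image: eu.gcr.io/example/shop-php:1.0.0
              command: ["php", "bin/magento", "setup:upgrade", "--keep-generated"]
```

During deployment a one-off job is then spawned from it with `kubectl create job --from=cronjob/magento-setup-upgrade upgrade-$CI_COMMIT_SHA`.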

Using databases in Kubernetes

As the database is the most important asset in a webshop, we opted not to maintain it ourselves. After various experiences maintaining databases ourselves at Rackspace, it became clear this was not something we wanted to spend our time on. Databases are fine when they run well, but when the shit hits the fan you’re generally in a world of hurt. Then there’s the fact that databases can grow rapidly when shops are doing well; we’ve had to increase storage space multiple times in the past, and even if that is trivial it’s still not something we want to keep ourselves busy with. Google Cloud offers managed MySQL in the form of Cloud SQL. It automatically scales storage space, and increasing or decreasing the specs is done within a minute or so.

When we started with Google Cloud, Cloud SQL did not support direct connections over a private IP and we were forced to use the Cloud SQL Proxy. This little proxy allowed secure connections to the database, but to work optimally it had to be included in each pod. This way Magento could connect to it as if the database were running locally. This sounds fine and dandy until you start working with Kubernetes CronJobs: a CronJob is considered done when all processes in the pod have finished running, but the proxy keeps running. This caused cronjobs to never actually be considered finished.

Thankfully, Google enabled private IP connections on Cloud SQL and added an option for Kubernetes clusters to connect to this IP. It required turning on VPC-native networking, which was not possible on a pre-existing cluster, so we had to re-create all our clusters. This is actually not that difficult if you keep all the yaml files that define the cluster in something like git; it can easily be done within half a day.

So now we have private IP connections and our cronjobs work the way we want. Storage for the databases increases automatically and we can scale up and down as much as our hearts desire without considerable downtime.

Setting up Redis and Elasticsearch

Both of these services should be run centrally, since they can be used by each environment (test/acceptance/production) and can be set up as a cluster. We generally use Helm charts to set them up in the default namespace and use a service of type ExternalName. Here’s an example of a service running in the production namespace which lets all pods in production connect to the Elasticsearch cluster running in the default namespace:
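A sketch of such a service (the service and release names are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: production
spec:
  type: ExternalName
  # pods in "production" resolving "elasticsearch" end up at the
  # Elasticsearch service living in the "default" namespace
  externalName: elasticsearch.default.svc.cluster.local
```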


Conclusion

Using Kubernetes gave us the opportunity to scale all our services without the headaches associated with creating and setting up individual servers. Even though Magento 1 and 2 are not optimised for Docker/Kubernetes in terms of architecture and build process, we were still able to set up a platform which scales well. Figuring out the ideal setup took us quite a while, and you should not expect to have everything running perfectly within a week if you’re just starting with Docker & Kubernetes. There’s a lot to learn, and it sometimes requires you to change how you think about your setup.

Once you’ve set up full CI/CD you’ll be so happy. There’s no way to describe how much stress this has removed for us. Automatic deployments, automatic scaling, automatic testing. Did I mention automatic?

Next steps

There are a few tools we still need to look at, and I highly encourage you to check them out as well.

  1. Kustomize — traverses a Kubernetes manifest to add, remove or update configuration options without forking
  2. FluxCD — Instead of using kubectl from our deployment stage FluxCD watches your registry and deploys when tags or branches match a certain pattern.
  3. Helm — We need to start making our own charts so we can deploy a full project with a single command.
  4. Terraform — Would be handy to setup the Kubernetes cluster
  5. Sealed Secrets — Store encrypted secrets in git and decrypt them on deployment


Don’t waste your time and money configuring and maintaining your own bare-metal or VPS cluster. Go with a good host who provides Kubernetes as a service and get rid of the stress and headaches you are experiencing now. It’ll take some time setting it up but it’s totally worth it.

This post is written by Pepijn Blom (DevOps Engineer) & Suraj Sanchit (CXO) working at Gracious.

Did this article pique your interest, or are you curious about the possibilities? Contact us or send us a personal message via Medium.

We’re always looking for passionate and smart people to join our graceful agency. Check our vacancies here!


