The Expendables — Backends High Availability at BlaBlaCar

Introduction

Nearly two years ago, Simon Lallemand (System Architect) wrote an article about our big bang shift to containers. The most important thing we have learned during our way to containers in production is that you should consider everything as “expendable” — which is our implementation of the well known “Pets vs Cattle” meme. For more information, see the The History of Pets vs Cattle by Randy Bias.

For us, “expendable” means that every element of the infrastructure can be restarted at any given point in time without any impact on the apps, this restart is done the same way for any services and finally the restart can be done by a non specialist. In other words, your master database should be as cheap as your smallest Python worker.

While moving stateless services to containers is not an overnight project, doing it for stateful bricks is a huge effort in many areas. As Data Architect, when you hear for the first time that to be compliant with the “new infrastructure” you must be able to treat all your production databases as expendable, you suddenly feel lonely…

Did someone think about our 1TB MySQL monolith cluster ?

Then, when you step back on this, containers or not treating resources as cattle is always the right choice when you have the ambition to build scalable infrastructures.

In this article I will dig into several concepts we introduced in our infra to be able to allow high availability on all our components and especially our databases. I will give a quick overview of our infrastructure ecosystem, introduce our service discovery solution and reveal what I call “Backend High Availability Pillars” with MySQL as an example.

Infrastructure Ecosystem

Infrastructure Ecosystem — end of 2017

There are many benefits to build an infrastructure based on containers, the first one is the ability to mutualize resources. For production services we have only one type of hardware (24 cores, 128GB RAM), three storage profiles (local SSD in RAID 10) one Operating System, CoreOS — a minimal OS to run containers. By these choices you simplify a lot of things and save money (scale economy, maintenance specs…).

The most important element on the above illustration is the RKT Pods (represented by the green squares). A Pod is the basic unit of execution, it is grouping several Application Container Images (ACI) sharing contexts:

  • Execution Context: if one the app container of the Pod goes down, the whole Pod goes down
  • Network Context: there is one IP address by Pod

So, Pod is a combination of ACI, let’s take a look at the Pod manifest for MySQL:

name: aci.example.com/pod-mysql:10.1-29
pod:
apps:
- dependencies:
- aci.example.com/aci-mariadb:10.1-25
app:
mountPoints:
- {name: mysql-data, path: /var/lib/mysql}
- {name: mysql-log, path: /var/log/mysql}
- dependencies:
- aci.example.com/aci-go-nerve:21-23
- dependencies:
- aci.example.com/aci-prometheus-mysql-exporter:0.10.0-1

Basically, it is composed by an ACI MariaDB Server (with mount points to persist data on host) an ACI go-nerve, the first piece of our service discovery (we will describe how it works right after) and an ACI prometheus to export MySQL metrics (monitoring).

All our ACIs and Pods are created and built in-house — we have open sourced our builder DGR on Github. After built in the CI, our Pods are pushed into a central registry and are ready to use.

So, we have a bunch of « identical » servers & plenty of Pods ready to use, we can now think to run Pods on machines. Today we use a layer that we call « Distributed Units System » using fleet+etcd. The goal of this stack is to run classic systemd units on your entire datacenter as you would run those on your local linux. By adding metadata to a given unit file you can target a host that fit your prerequisite to run your Pod.

For a stateful service (e.g. MySQL database) you need to stick it to a given server to allow data persistence after restart, here is a example of how to do it with fleet metadata:

[X-Fleet]
MachineMetadata=name=r101-srv1

To generate and deploy these systemd units, we use GGN— also open sourced on our Github. I will not go deeper on this “Distributed Units System” as we are currently working to standardized the orchestration of our Pods with Kubernetes but the goal remains the same; to orchestrate services.

Service Discovery

Service discovery provides a dynamic way to connect services to each other. To provide high availability it is mandatory to have a good service discovery, you can have the best backend failover solution if you can’t re-route clients to your available node, you are down.

Our service discovery is a Golang rewrite of Airbnb’s Smart Stack. It is based on two core components Nerve and Synapse and used two well know products, Zookeeper and HAProxy.

Service Discovery — How does it work ?

Following the above illustration, the final objective is to connect the client (left) to the database node1. First, Nerve is responsible to do health checks on its dedicated service (here, the node1) and report the node’s info on a specified zookeeper key (/database). Nerve implements several kind of checks — TCP, HTTP, system exec, SQL… 
On the client side, there is Synapse. Synapse is configured to watch certain Zookeeper keys depending on the client needs — here the client must access the database. Any changes on these watched keys will trigger a reload of Synapse’s local HaProxy to route traffic to only available nodes.

If the node1 goes down, the health check fails, the service key /database is updated and all Synapse processes which are watching this key will reload their local HAProxy, finally the client does not access to the node1 anymore.

What happens when nerve goes down but node1 is still there?

As nerve is running on the same Pod than node1 they share their execution context so when one of the Pod’s element goes down the entire Pod is down.

Can node1 key stays in zookeeper as node1’s Pod is down?

Nerve and Synapse use many built in zookeeper features, the service key created by nerve is an “ephemeral key”. Ephemeral keys are directly linked to the session of their creator, it means that if Nerve session goes down, the key will be automatically cleaned. With this assumption we are sure that if we loose a Pod, its corresponding key will be removed instantly.

Backend High Availability Pillars

For the rest of the article we will take MySQL as an example. At BlaBlaCar we use MySQL and Cassandra as core databases, we also use Redis for caching, ElasticSearch for searching and PostgreSQL for geo features. I have chosen MySQL for the article because it’s where we built the most advanced features to validate our “High Availability Pillars”. Cassandra was built natively to be highly available and accepts “only” smart clients that manage most of the pain for you.

Pillar #1 Abolish Slavery

MySQL at BlaBlaCar

Traditionally a MySQL cluster takes the form of one « master » and several « slaves ». The writes goes to the master node and are replicated asynchronously to the slave(s).

Asynchronous replication comes with two problems.

  1. The master is a Single Point Of Failure, when it goes down, a cold sweat runs through your body, you must elect a candidate to be the new master and reorganize slaves. Many tools improve the procedure but, it is never an easy task to do when you experience prod issue.
  2. The asynchronous nature of this replication means you must manage replication lag in your application, even if there is a really small lag between master and slaves. It creates inconsistency in your apps if it is not well managed and I am sure (or I hope) that you choose to use MySQL for their strong consistency features…

These two reasons are why we shifted to synchronous replication powered by MariaDB Galera Cluster some years ago. On a Galera cluster, every nodes can have the same role, as they always have the same dataset. I will not dig into Galera internals but its Certification Based Replication using Group Communication and transaction ordering techniques have shown a lot of promise compare to the traditional “locky” 2-phases commit design. There are some prerequisites to run Galera efficiently, if you do it the right way you have a fully consistent database cluster with no single point of failure and no replication lag consideration… Today, all our MySQL production clusters are running on MariaDB Galera.

Synchronous Galera Cluster

With no specific role inside a given cluster, we’re on our way to “cattleization”!

Nerve — Track and report service status

We have our high available backend, let’s setup the service discovery! With Nerve we declare two service keys on our Galera clusters, one for writes and one for reads. Technically as Galera nodes can take reads & writes, we could use the same key for both usages and just split the connections on the app configuration (distinct users). For better performances, we dedicate a node for writes at a given point in time, the target node can change without any backend reconfiguration but having a node for writes and the others for reads is a good practice.

So, on our pod-mysql’s Nerve configuration we will declare two services:

# cat env/prod-dc1/services/mysql-main/attributes/nerve.yml
---
override:
nerve:
services:
- name: “main-read"
port: 3306
reporters:
- {type: zookeeper, path: /services/mysql/main_read}
checks:
- type: sql
driver: mysql
datasource: "local_mon:local_mon@tcp(127.0.0.1:3306)/"
- name: "main-write"
port: 3306
reporters:
- {type: zookeeper, path: /services/mysql/main_write}
checks:
- type: sql
driver: mysql
datasource: "local_mon:local_mon@tcp(127.0.0.1:3306)/"
haproxyServerOptions: "backup"

Literally, we declare a service named main-read that will serves read queries for the main MySQL cluster mysql-main on prod-dc1, Nerve will do the default SQL check (SELECT 1) on it on port 3306 with the DSN local_mon:local_mon@tcp(127.0.0.1:3306)/. This check will report on a Zookeeper key /services/mysql/main_read. Easy right?

The service main-write have the same configuration but with a specific attribute haproxyServerOptions: “backup” saying that the node we will be declare in backup mode in the final Synapse’s HaProxy configuration. This is how we fix writes on the first node — if you declare all the node in backup, by default HaProxy will use only the first available.

There are few more attributes but we will see them in the next sections.

Now let’s take a look at what it looks like in Zookeeper:

# zookeepercli -c lsr /services/mysql/main_read
mysql-main_read1_192.168.1.2_ba0f1f8b3
mysql-main_read2_192.168.1.3_734d63da
mysql-main_read3_192.168.1.4_dde45787
# zookeepercli -c get /services/mysql/mysql-main_read1_192.168.1.2_ba0f1f8b3
{
"available":true,
"host":"192.168.1.2",
"port":3306,
"name":"mysql-main1",
"weight":255,
"labels":{
"host":"r10-srv4"
}
}
# zookeepercli -c get /services/mysql/mysql-main_write1_192.168.1.2_ba0f1f8b3
{
"available":true,
"host":"192.168.1.2",
"port":3306,
"name":"mysql-main1",
"haproxy_server_options":"backup",
"weight":255,
"labels":{
"host":"r10-srv4"
}
}

Synapse — Service discovery router

Great, we have services dynamically declared on Zookeeper, let’s use them. Remember, on the client side we use Synapse to watch zookeeper key and reload the client’s local HaProxy.

Here is an example of Synapse configuration file:

# cat env/prod-dc1/services/tripsearch/attributes/synapse.yml
---
override:
synapse:
services:
- name: mysql-main_read
path: /services/mysql/main_read
port: 3307
serverCorrelation:
type: excludeServer
otherServiceName: mysql-main_write
scope: first
      - name: mysql-main_write
path: /services/mysql/main_write
port: 3308
serverSort: date

For the app tripsearch on prod-dc1 we need two MySQL services mysql-main_read and mysql-main_write. Each services have a distinct port, here 3307 for reads and 3308 for writes. Basically, it is just what you need to use the backends.

The serverSort: date tells Synapse to sort the server list by date (ASC), it allows to have the oldest node in first position in the write service — we don’t want a newbie server taking writes just after popping into production. The serverCorrelation — excludeServer allows us to exclude the first node of mysql-main_write from mysql-main_read service, meaning we don’t read on the node we use for the writes.

Here is what is looks like in HaProxy:

HaProxy console screenshot

Finally, let’s have a look at the application configuration file:

# cat env/prod-dc1/services/tripsearch/attributes/tripsearch.yml
—-
override:
tripsearch:
database:
read:
host: localhaproxy
database: tripsearch
user: tripsearch_rd
port: 3307
write:
host: localhaproxy
database: tripsearch
user: tripsearch_wr
port: 3308

Pillar #2 Be Quiet!

In a container powered infrastructure you must be able to restart every services without pain because it happens way more frequently than into “standard infrastructures”. You don’t converge changes, you redeploy from scratch each time. As containers is a world of dependencies (Pods depends on ACIs, ACIs depends on other ACIs), a simple change in a dependency requires full reboot of everything. It is a healthy constraint, like a natural Chaos Monkey. 
The objective behind the Be Quiet pillar is the ability for an element of the infrastructure to go to production without impacting the application’s global error rate. Putting a new MySQL node into production without warming cache will generate a blast of hanging connections (mounting tables into buffer pool) and impact response time and error rate.

You have different solutions to mitigate this issue, we have chosen to rely on a weight system to gently push or remove a node from production. We based this on the HaProxy ’s weight feature and we controlled it within the nerve API. When a new service comes into production it has a weight of 1, this weight increases until 255. In addition to this, we implemented a nerve attribute enableCheckStableCommand that allows to run a check command to control if the service is going well on its increase. For MySQL this check is a simple bash script running a SQL query that checks slow queries/sessions:

# cat /report_slow_queries.sh
#!/dgr/bin/busybox sh
. /dgr/bin/functions.sh
isLevelEnabled "debug" && set -x
slwq=$(/usr/bin/timeout 1 /usr/bin/mysql -h127.0.0.1 -ulocal_mon -plocal_mon information_schema -e "SELECT COUNT(1) FROM processlist WHERE user LIKE '%rd' AND LOWER(command) <> 'sleep' AND time > 1" -BN)
if [ $? -eq 0 ] && [ $slwq -eq 0 ]; then
return 0
else
return 1
fi

It is how the implementation looks like in nerve configuration (last line):

# cat env/prod-dc1/services/mysql-main/attributes/nerve.yml
---
override:
nerve:
services:
- name: “main-read"
port: 3306
reporters:
- {type: zookeeper, path: /services/mysql/main_read}
checks:
- type: sql
driver: mysql
datasource: "local_mon:local_mon@tcp(127.0.0.1:3306)/"
enableCheckStableCommand: ["/report_slow_queries.sh"]

Let’s start our node and see the weight evolution compare to active sessions in it. We use an internal tool called bbc mysql monitor to see the weight, processes and slow queries (processes with >1s query):

# bbc mysql prod-dc1 mysql-main mysql-main1 monitor
#1 Weight: 1/255 Processes: 0 Slow: 0
#2 Weight: 2/255 Processes: 0 Slow: 0
#3 Weight: 3/255 Processes: 3 Slow: 0
#4 Weight: 4/255 Processes: 7 Slow: 0
#5 Weight: 6/255 Processes: 10 Slow: 0
#6 Weight: 9/255 Processes: 12 Slow: 0
#7 Weight: 15/255 Processes: 20 Slow: 1 <- SLOW !
#8 Weight: 0/255 Processes: 20 Slow: 1
#9 Weight: 2/255 Processes: 12 Slow: 0
#10 Weight: 3/255 Processes: 4 Slow: 0
#11 Weight: 4/255 Processes: 7 Slow: 0
#12 Weight: 6/255 Processes: 10 Slow: 0
#13 Weight: 9/255 Processes: 12 Slow: 0
#14 Weight: 15/255 Processes: 20 Slow: 0
#15 Weight: 23/255 Processes: 35 Slow: 0
#16 Weight: 38/255 Processes: 40 Slow: 0
#17 Weight: 38/255 Processes: 35 Slow: 0
#18 Weight: 61/255 Processes: 36 Slow: 0
#19 Weight: 61/255 Processes: 47 Slow: 0
#20 Weight: 98/255 Processes: 44 Slow: 0
#21 Weight: 98/255 Processes: 41 Slow: 0
#22 Weight: 158/255 Processes: 38 Slow: 0
#23 Weight: 158/255 Processes: 50 Slow: 0
#24 Weight: 255/255 Processes: 46 Slow: 0 <- FULL POWER !
#25 Weight: 255/255 Processes: 46 Slow: 0

Explanation: The weight increases, the clients begin to work on the backend, then the script returns 1 as a slow query is detected (line #7). Right after the weight goes back to 0to remove load on the node (line #8) , the goes back increasing… Finally, as no slow query was detected again the weight is now at max and the node is fully in production (line #24).

Pillar #3 Die in Peace…

The Die in Peace pillar has the same objective as Be Quiet but, for shutdown. When a node should be stopped it should be smooth, if you just stop your backend with running sessions on it, you will increase the error rate instantly. 
Again the magic is done in Nerve with the weight system + an attribute disableGracefullyDoneCommand that runs a simple bash script and checking returns. For MySQL, we just check if active connections are still there.

# cat /report_remaining_processes.sh
#!/dgr/bin/busybox sh
. /dgr/bin/functions.sh
isLevelEnabled "debug" && set -x
procs=$(/usr/bin/timeout 1 /usr/bin/mysql -h127.0.0.1 -ulocal_mon -plocal_mon information_schema -e "SELECT COUNT(*) FROM processlist WHERE user LIKE '%rd' OR user LIKE '%wr'" -BN)
if [ $? -eq 0 ] && [ $procs -eq 0 ]; then
return 0
else
return 1
fi

When you shutdown a node, its weight is set to 0 to stop accepting new connections. Then, we « wait » for current processes to finish their work before shutting down the node — this is done by the report_remaining_processes.sh script trigged by nerve thanks to disableGracefullyDoneCommand (last line):

# cat env/prod-dc1/services/mysql-main/attributes/nerve.yml
---
override:
nerve:
services:
- name: “main-read"
port: 3306
reporters:
- {type: zookeeper, path: /services/mysql/main_read}
checks:
- type: sql
driver: mysql
datasource: "local_mon:local_mon@tcp(127.0.0.1:3306)/"
enableCheckStableCommand: ["/root/report_slow_queries.sh"]
disableGracefullyDoneCommand: ["/root/report_remaining_processes.sh"]

Conclusion

Ok, so to sum up everything, here is a simple illustration showing strategies & tools we have chosen, built and implemented to validate our three High Availability Pillars for MySQL.

I hope this article was clear and useful. We started the implementation of this pattern two year ago and we are constantly evolving BlaBlaCar’s infrastructure. Technologies may change in the future — Kubernetes instead of Fleet, Docker instead of RKT, cloud provider instead of bare metal… But the effort we made to shift our mindset and transform all our pets into cattle will remain across any platform change. Remember that making your resources expandable is always the right choice when you have the ambition to build scalable infrastructures.