+40PB per year — The challenge of data growth at Criteo
A summary of our journey
Our Hadoop journey started in 2012. At that time, our dataset had become too large for a SQL database and a handful of machines running our engine. We needed a way to scale beyond SQL's limits, and Hadoop with MapReduce was the alternative that worked for us.
Fast forward to 2015. Our first large Hadoop cluster, AM5, had almost filled its 40PB and could not scale further due to a lack of space in the datacenter.
This is when we started to build PA4, a Hadoop cluster designed to scale to several thousand nodes and hundreds of PB.
You can find more information about it in Stuart’s presentation.
Fast forward to 2018: PA4 has now reached 3,000 nodes and 160PB of used capacity (220PB in total).
That's about 40PB per year for the past 3 years. This post is about some of the main storage challenges we faced during those years.
The 240M objects limit
This is, so far, the biggest incident we have had on HDFS.
The previous week, there had been a few slowdowns and a high block count on the AM5 Namenode, but nothing to worry about too much.
The cluster went down Friday, late at night.
Five people took turns over 36 hours to bring it back on track by Monday, and it took a week to catch up on the backlog.
It took us more than 24 hours to identify the source of the outage. Why so long? Because we first needed to bring the Namenode back up in order to identify and fix the root cause in its filesystem. Fortunately, no data was lost.
So what happened?
As you can see in the graph above, the block count suddenly increased by 24M. The added pressure on the Namenode's 150GB JVM heap became too high and caused the outage.
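To give a rough sense of scale: provisioning guidance for the Namenode is often quoted as on the order of 1GB of heap per million namespace objects (files, directories, blocks). The constant below is an assumption for illustration, not our measured figure:

```python
# Rough Namenode heap sizing. A commonly cited provisioning rule of
# thumb is ~1GB of heap per million namespace objects; treat the
# constant below as an assumption, not a measurement.
GB_PER_MILLION_OBJECTS = 1.0

def heap_gb(num_objects: int) -> float:
    """Very rough heap budget for metadata alone, in GB."""
    return num_objects / 1_000_000 * GB_PER_MILLION_OBJECTS

# The incident's sudden +24M blocks translate to roughly:
print(f"~{heap_gb(24_000_000):.0f} GB of extra heap pressure")
```

Under that rule of thumb, a 24M-block surge eats tens of GB of heap, which is consistent with a 150GB heap that was already near its limit.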
We discovered that a single user's activity caused the incident.
Their MapReduce job's partitioner was not properly configured and created millions of blocks.
However, identifying and cleaning it up first required restarting the Namenode…
But how do you restart a Namenode when its JVM Heap is full and you don’t have more memory available?
Our FSImages were ~13GB gzipped at that time, and a Namenode startup took ~45 minutes.
Also, the Namenodes were constantly busy handling thousands of RPC requests.
Therefore, we first isolated the Namenode at the network level and fine-tuned the JVM to optimize its startup behaviour. It was a long process, waiting an hour between each attempt, but we eventually succeeded in restarting the Namenodes.
While restarting the Namenodes, we also discovered that at least one transaction was corrupted in the edit logs. Fortunately, with 5 JournalNodes and 2 Namenodes, we were able to find a valid FSImage and restart after the corrupted transaction.
The next step was to identify the root cause, fix it, restore the production JVM settings, and gradually lift the network isolation.
That whole process took us 36 hours of work in shifts.
Following that incident, we focused our efforts on ensuring that we would never have to cope with such a big incident again.
The impact of logical raid on Namenodes
Not long after that incident, we experienced another issue on our PA4 Namenodes.
At that time, we used logical RAID on spinning disks to store the FSImage.
At some point, the logical RAID started affecting the cluster when it ran sanity checks, even though these were supposed to use idle IOPS only.
That incident convinced us to improve our Namenode hardware.
The road to a 600M objects Namenode
We started to think about ways to secure our production to avoid such incidents in the future.
Our first decision was to upgrade the Namenode hardware: SSD disks, faster CPUs (a lot of single-threaded code had caused minor outages), and much more memory, in order to keep around twenty GB available at all times in case the JVM ever needed it.
Beyond that, we made two important changes in our daily work:
- looking more closely at the Hadoop source code and gradually backporting more and more patches to keep our production in good condition
- switching our attention to JVM performance, both for our infrastructure and to ensure our clients use our resources well
Moving to a specialized JVM
Focusing more on JVM performance led us to further optimize the Namenode JVM. Given the size of our FSImages, this was critical to secure failovers and restarts, but also to keep RPC response times low at all times.
Tuning Parallel and CMS GCs
Well tuned, the Parallel GC allowed us to achieve an 18-minute startup, 10-second GC pauses with a 20-second standard deviation, and a rate of one GC every 6 minutes.
The CMS GC allowed us to achieve a 10-minute startup, 2-second GC pauses with a 3-second standard deviation, and a rate of one GC every 3 minutes.
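For context, tuning those collectors mainly revolves around flags like the following; the values shown are illustrative, not our exact production settings:

```
# Parallel GC (illustrative values)
-XX:+UseParallelGC -XX:ParallelGCThreads=32

# CMS (illustrative values)
-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
```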
Tuning G1 GC
The last one was G1 GC. It allowed us to achieve an 11-minute startup, ~630ms GC pauses with a 2-second standard deviation, a rate of one GC every 30 seconds, plus one long mixed GC (30 seconds to 1 minute) once per day.
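G1 tuning of this kind typically revolves around pause-time and occupancy targets; a sketch of the usual flags, with illustrative values only:

```
-XX:+UseG1GC -XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=45 -XX:G1HeapRegionSize=32m
```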
Clearly that was the best deal for our low RPC response time target.
Later, we realized that G1 GC had a large native memory footprint with the Namenode, which prevented us from using more than 330GB of our 512GB machines.
We also observed bad behaviour from G1 GC when the block count increased, leading to new tuning work, and again when the block count decreased.
Once or twice a day, there was also a series of mixed GCs taking up to 2–3 minutes. Further tuning to reduce these mixed GCs led to an increase in young GCs.
We decided that this recurring tuning work, dependent on the cluster's activity, was not sustainable over time.
We needed a better solution.
Moving to Azul Zing GC
The Zing GC requires little configuration. Our tests showed increased CPU usage and slower response times compared to G1, but on average, performance is similar enough.
The good thing is a lower native memory footprint, allowing us to vertically scale our Namenode heap a bit more if needed, and a much more stable GC duration.
As you can see, Zing was a little slower than G1 in our case, taking around 5% more time to process requests, which means fewer RPC calls are handled and more requests are enqueued.
Given these characteristics, we have deployed Zing on our Namenodes.
Scaling beyond 600M objects
As the number of users and the block count increased, we needed to secure the scaling of the HDFS layer on:
- the number of stored namespace objects (folders, files and blocks)
- the number of RPC calls
We considered four alternatives:
- Vertical scale
- Increase block density
- Dedicated HDFS clusters
- Federating namenodes
Vertical scaling was quickly eliminated. We were already operating a Namenode on a 512GB machine, and our previous GC tuning work made it clear that we would soon hit a limit. On top of that, the upper limit of RPCs per Namenode is between 60k and 100k QPS (writes take an exclusive lock on the metadata, which lowers QPS).
Increase block density
Increasing block density required a lot of work with our users to change the way they generated files, across all pipelines.
While it could have reduced the number of blocks, it was not sustainable over time and did not solve the maximum number of RPC calls issue.
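To illustrate the leverage: for a fixed dataset size, the Namenode object count is driven almost entirely by file count, since every file costs at least one inode and one block. A sketch with an assumed 256MB block size:

```python
import math

BLOCK_SIZE = 256 * 1024**2  # assumed HDFS block size: 256MB

def namespace_objects(num_files: int, file_size: int) -> int:
    """Inodes + blocks tracked by the Namenode (directories ignored)."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * (1 + blocks_per_file)

# The same 10TB dataset, written two ways:
print(namespace_objects(1_000_000, 10 * 1024**2))  # 1M x 10MB files  -> 2,000,000
print(namespace_objects(40, 256 * 1024**3))        # 40 x 256GB files -> 41,000
```

Two orders of magnitude fewer objects for the same bytes, which is why small-file hygiene matters so much, and also why it alone cannot fix the RPC ceiling.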
Dedicated HDFS clusters
At first, creating dedicated HDFS clusters seemed like a good and easier solution.
However, it has the same limitations as federation, with some significant extra costs:
- it requires a migration effort from some users to the new clusters, and again every time we add a new cluster
- it requires being able to identify datasets to split or duplicate. When that cannot be done, users have to adapt their code to read/write the data on 2 clusters at once.
- it wastes resources and adds operational cost, as each cluster needs extra machines and configuration: ZooKeepers, JournalNodes, access gateways, Kerberos domains, monitoring, …
Eventually, we decided to use HDFS federation to scale the Namenode horizontally within the same cluster.
We created 3 dedicated namespaces, each supporting ~300M blocks, a count at which we know the Namenode operates well.
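In Hadoop terms, federation means declaring several nameservices and stitching them into a single client-side view with ViewFs. A minimal sketch of the relevant properties (the nameservice names and mount points here are made up, not our actual layout):

```xml
<!-- hdfs-site.xml: declare the federated nameservices (names are illustrative) -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2,ns3</value>
</property>

<!-- core-site.xml: a ViewFs mount table presents one namespace to clients -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://pa4</value>
</property>
<property>
  <name>fs.viewfs.mounttable.pa4.link./user</name>
  <value>hdfs://ns2/user</value>
</property>
```

Each nameservice gets its own Namenode pair and heap, while clients keep addressing a single logical filesystem.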
Large scale metrics collection is hard.
Sometime in 2014, we switched to a dedicated OpenTSDB cluster to gather all Hadoop metrics properly.
Fast forward to 2017. At that time, our OpenTSDB cluster was handling ~90k requests/second across 50 region servers. Its operational cost was not that high, but as our PA4 cluster was growing fast, we were looking for a way to focus fully on the Hadoop stack.
Fortunately, our internal Observability team had released BigGraphite. It enabled us to switch back to Graphite and remove OpenTSDB.
You can find more information about it in Nathaniel’s presentation.
Switching back to Graphite was a challenge, as it forced us to aggregate our metrics. That becomes an issue when diagnosing an incident's root cause or spotting a JVM GC spike: with aggregated metrics only, you cannot get the whole picture.
The introduction of Prometheus into the architecture relieved us considerably and allowed us to gather most of the metrics we need to operate HDFS properly.
More recently, we have introduced Garmadon, our homemade solution for near-realtime event collection. It gives us a deep understanding of how our clusters and their jobs behave at the OS, JVM, or framework level.
The scaling journey continues for us with very exciting challenges this year.
PA4's little brother, AM6, was rolled out in September with 1,000 nodes and 85PB of capacity to cope with projected growth and secure our most critical data. The challenge is to scale AM6 to handle TensorFlow's huge computation needs.
Migrating to Hadoop 3 on a live cluster is a big challenge.
Synchronizing petabytes of data between datacenters for production pipelines is a tricky subject. Our HDFS team is now focused on building Mumak, a reliable large-scale data synchronization service, to cope with the multi-DC Hadoop era we are entering.
Scaling HDFS to handle more than a billion objects on a single cluster is another big challenge.
If you are interested in scaling large-scale distributed systems, consider applying for a role on our team!