Yet another Hadoop migration — but from the human perspective


Background

First, some context: Outbrain is the world’s leading content discovery platform. We serve over 400 billion content recommendations every month to over 1 billion users across the world. To support such a large scale, we run a backend of thousands of microservices in Kubernetes containers, spread over more than 7,000 physical machines in 3 data centers and in the public clouds (GCP & AWS). To make the recommendations we supply valuable to our readers, we invest in personalization as much as possible, and many of the machine learning algorithms behind it run on top of the Hadoop ecosystem, which makes it a very critical system for our business. We have 2 flavors of Hadoop clusters:

  • Online — 2 clusters in full DR mode, running on bare metal machines and used for online serving activities.
  • Research — several clusters (per group and usage), running in GCP and used for research and offline activities.

The Trigger

A few years ago we migrated our online Hadoop clusters to the MapR commercial solution (you can read more about this in the Migrating Elephants blog). It brought many improvements, and we enjoyed the support we got. In the years that followed, our Hadoop usage increased dramatically, and supporting the system at that scale made us very professional: we improved our knowledge and could handle issues on our own instead of depending on external resources (which is one of the main benefits of a commercial solution). So, a few months before the license renewal, we wanted to decide on the best way to proceed with this system.

  • MapR also has a community version. We mapped the differences between it and the commercial version; there are only a few, but the main ones we had to solve were:
  1. No HA solution (for the CLDB, the main management service) — for that we implemented our own custom failover solution
  2. Support for only a single NFS endpoint — some of our use cases copy huge amounts of data from Hadoop into a separate DB, so we needed a scalable solution; we ended up with an internal FUSE implementation (see the sketch after this list)
  • Since we had 2 clusters in DR mode, we were able to test and run the community version in parallel to the commercial version
  • Just in case, we made sure that we could extend the license for several months instead of committing to 3 years
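
As an illustration of the FUSE approach, here is a minimal read-only passthrough filesystem in Python using the fusepy library. This is only a sketch of the idea, not our internal implementation, and all paths are hypothetical:

    # fuse_passthrough.py: illustrative only; assumes `pip install fusepy`
    # and that the cluster is already exposed locally at SOURCE.
    import errno
    import os

    from fuse import FUSE, FuseOSError, Operations

    SOURCE = "/mapr/cluster1"  # hypothetical local view of the cluster

    class Passthrough(Operations):
        """Read-only passthrough: every call is forwarded to SOURCE."""

        def _full(self, path):
            return os.path.join(SOURCE, path.lstrip("/"))

        def getattr(self, path, fh=None):
            try:
                st = os.lstat(self._full(path))
            except OSError:
                raise FuseOSError(errno.ENOENT)
            return {key: getattr(st, key) for key in (
                "st_mode", "st_size", "st_uid", "st_gid",
                "st_atime", "st_mtime", "st_ctime", "st_nlink")}

        def readdir(self, path, fh):
            return [".", ".."] + os.listdir(self._full(path))

        def open(self, path, flags):
            return os.open(self._full(path), os.O_RDONLY)

        def read(self, path, size, offset, fh):
            os.lseek(fh, offset, os.SEEK_SET)
            return os.read(fh, size)

        def release(self, path, fh):
            return os.close(fh)

    if __name__ == "__main__":
        # Each consumer can run its own mount, which is what lets this
        # approach scale out compared to a single NFS endpoint.
        FUSE(Passthrough(), "/mnt/hadoop-view", foreground=True, ro=True)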

The Alternatives

We came up with the following alternatives:

  1. Stay with the MapR commercial version — we didn’t want to stay on the community version (it doesn’t have broad user adoption and support).
  2. Migrate to the Cloudera commercial version
  3. Migrate to the Cloudera community version — this option was dropped immediately, since the number of nodes we had (>100) exceeded the community version’s limit
  4. Migrate to the Apache Hadoop community solution
  5. Migrate to Google Cloud — the same as we did with our research cluster

The Decision

After the successful migration of the research cluster to GCP (you can read about this in the Hadoop Research Journey blog series), many of us assumed that once we needed to migrate the online clusters, they would also go to GCP. But as in real life, you need to handle each case according to its own characteristics. I have 4 children and I wish there were a single way to rule them all, but the reality is that each one requires a different approach and a different attitude. The same goes for our Hadoop clusters: each has its own characteristics, and therefore different requirements and needs; they differ in their usage, so what is good for one is not necessarily good for the other. Due to the nature of the online clusters (lots of compute, less storage), we realized we would not be able to benefit from the cloud features (elasticity and compute-storage separation) the way we did with the research cluster.

The Migration (in short)

The migration had to be done while the system continued to operate in production, meaning we needed to keep providing the same service without users being affected. Since we have 2 clusters in separate DCs, we migrated one cluster at a time, and within each cluster the migration was done in a rolling manner: we started with a small Apache cluster, and over several cycles we moved the workload from the MapR cluster into it.

Data migration

  • New data — we started ingesting new data into the Apache cluster by integrating it with our data delivery pipeline
  • Historical data — most of the workloads need some historical data for their processing; as the migration was done in cycles, we copied only the data that was required for each cycle (see the copy sketch after this list)
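
To give a feel for the historical copy, here is a minimal Python wrapper around Hadoop’s standard distcp tool. It is a hedged sketch: the cluster URIs and the dataset list are hypothetical, and the real pipeline had more bookkeeping around each cycle:

    # copy_cycle.py: illustrative sketch of one migration cycle's data copy.
    # Assumes the `hadoop` CLI is on the PATH and both clusters are reachable.
    import subprocess

    SRC = "maprfs://mapr-cluster"       # hypothetical source cluster URI
    DST = "hdfs://apache-cluster:8020"  # hypothetical destination cluster URI

    # Hypothetical list of datasets needed by the jobs in this cycle.
    CYCLE_DATASETS = [
        "/data/facts/2019/10",
        "/data/dimensions/users",
    ]

    def copy_dataset(path: str) -> None:
        """Run distcp for one dataset; -update makes reruns incremental,
        -p preserves file attributes such as ownership and permissions."""
        cmd = [
            "hadoop", "distcp",
            "-update", "-p",
            f"{SRC}{path}",
            f"{DST}{path}",
        ]
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        for dataset in CYCLE_DATASETS:
            copy_dataset(dataset)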

Workload migration

We repeated the following steps until we managed to migrate the entire workload:

  • Moved some of the jobs from the MapR cluster to the Apache cluster
  • Stopped the migrated jobs in the MapR cluster
  • Moved machines from the MapR cluster to the Apache cluster (a sketch of the Apache-side hand-off follows below)
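
To make the last step more concrete, here is a hedged sketch of the Apache-side half of a node hand-off, using the standard HDFS/YARN refresh commands. Host names, file paths, and service names are hypothetical, and the MapR-side decommission is not shown:

    # add_node.py: illustrative sketch of registering a reclaimed machine
    # with the Apache cluster. All names below are hypothetical.
    import subprocess

    INCLUDE_FILE = "/etc/hadoop/conf/dfs.include"  # hypothetical dfs.hosts file

    def add_to_cluster(host: str) -> None:
        # 1. Allow the host to join the cluster (dfs.hosts include file).
        with open(INCLUDE_FILE, "a") as f:
            f.write(host + "\n")

        # 2. Tell the NameNode and ResourceManager to re-read the host lists.
        subprocess.run(["hdfs", "dfsadmin", "-refreshNodes"], check=True)
        subprocess.run(["yarn", "rmadmin", "-refreshNodes"], check=True)

        # 3. Start the worker daemons on the new node (service names vary
        # by setup; these are hypothetical).
        for daemon in ("hadoop-hdfs-datanode", "hadoop-yarn-nodemanager"):
            subprocess.run(["ssh", host, f"sudo systemctl start {daemon}"],
                           check=True)

    if __name__ == "__main__":
        add_to_cluster("worker-042.dc1.example.com")  # hypothetical host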

Achievements

The migration was a huge success, with full cooperation from all the relevant engineering teams. After preparations that took a few months, the migration itself was completed in no time (~1 month per cluster). Now we can summarize what we achieved:

  • The human factor — our people improved their technical skills and became more professional in supporting the Hadoop system
  • We are using Open Source Software — there are lots of articles describing the benefits of open source software: flexibility and agility, speed, cost effectiveness, attracting better talent, etc.
  • Improved cluster performance — with the same number of cores (actually slightly fewer), all jobs finish their processing at least 2 times faster. For example, the graphs below show the processing time of the hourlyFactFlowFaliCreationInHiveJob on Apache vs. MapR
  • Storage cleanup — during the preparation for the migration we were able to delete more than 700TB, which reduced cluster utilization from 73% to 49%; you can see the impact in the graph below
  • Workloads cleanup — during the preparation for the migration we disabled more than 800 jobs, which was almost half of the total (today we have ~950 jobs). In addition, we brought order to job ownership, so instead of 19 ownership groups we now have only 16
  • Technical debt — one of the advantages of moving to open source software is the ability to use the latest versions of all integrated packages, and to integrate even more of them; this ability is sometimes missing when using a commercial solution. In our case we were able to upgrade our Spark jobs (to version 2.3) and Spark Streaming jobs (to version 2.4), and to add Presto as another service: a high performance query engine that can combine several data sources in a single query (see the example after this list).
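
For illustration, here is what such a federated query can look like from Python, using the presto-python-client package. The coordinator host and all catalog, schema, and table names are hypothetical:

    # federated_query.py: illustrative only; assumes
    # `pip install presto-python-client` and a running Presto coordinator.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",  # hypothetical host
        port=8080,
        user="analytics",
        catalog="hive",
        schema="default",
    )

    cur = conn.cursor()
    # One query that joins a Hive table with a MySQL table; this kind of
    # cross-source join is what makes Presto attractive.
    cur.execute("""
        SELECT d.campaign_name, sum(f.clicks) AS clicks
        FROM hive.facts.hourly_clicks AS f
        JOIN mysql.crm.campaigns AS d
          ON f.campaign_id = d.id
        GROUP BY d.campaign_name
        ORDER BY clicks DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)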

Epilog

The decision to migrate one of our critical systems from a commercial to a community version was not an easy one to make, and it involved a certain amount of risk. But, true to this blog’s title, we did it from the human perspective: when you invest in your people, the investment pays you back big time.
