Our Hadoop Setup

Marcin Tustin
Published in Nuts & Bolts
Nov 16, 2015


We are currently building out our data processing and capture capability with Hadoop stack technologies. This post looks at what we use and why we chose it.

A lot of these choices were driven by what is known to play well together and what the Hadoop distribution vendors provide. There’s so much complexity in a Hadoop stack that we have had to build up a fair amount of knowledge just so that we can really understand the alternatives.

Visual Overview

A schematic of our batch processing setup

The Basics

  • HDFS for data storage. Without a hosted solution like Amazon EMR it’s pretty much the only choice.
  • Yarn for resource management. We didn’t really evaluate this one much; most Hadoop stacks use Yarn as the bottom layer for resource scheduling. You can read a comparison with Mesos, the other big competitor, but the bottom line is that we’re not doing anything that Yarn can’t help with. In particular, both Spark and Hive are designed to work well with Yarn.
  • Zookeeper provides coordination services for the components that depend on it. We don’t use it directly.

Airflow for Workflow Management

In this context, workflow management means scheduling (like cron), but also modeling the dependencies between tasks within a job. In a sense, it doesn’t do anything that cron and a Python script couldn’t do, but the logic to manage those dependencies is complex. By way of comparison, Luigi is a Python library that models those dependencies, and is intended to be used in a script run by cron.

We started out with Oozie, but we couldn’t get it to reliably execute anything. We tried for a while, and because we were just getting set up, it’s far from certain that Oozie was the problem. Nevertheless, we prefer Airflow because the source code is small and easy to read, and the job definitions are also Python.

We’ve found with lots of tools that the documentation is either vague or nonexistent, and reading the source code is necessary. Small, nicely-written codebases definitely shine in that case. It also helps that the project maintainer, Maxime Beauchemin, is responsive to small pull requests and well-documented issue reports. When we find bugs or need feature enhancements, we can get them merged easily.

Having the job definitions in Python is also a big win. It allows us to control, at a very fine-grained level, what is run, and to use ordinary control constructs to build our DAGs, rather than having to use a second tool to generate files in another format such as XML. Having those definitions in code, rather than in config files, lets us pinpoint exceptions and problems to a line of code, using all the usual debugging and testing tools, rather than relying on error messages that may be more or less informative.
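
To give a flavour of what that looks like, here is a minimal sketch of an Airflow DAG definition. The DAG name, table list, and commands are made up for the example, but the shape is representative: the task graph is built with ordinary Python.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "data-eng",               # hypothetical owner
        "start_date": datetime(2015, 11, 1),
    }

    dag = DAG("nightly_warehouse_load", default_args=default_args,
              schedule_interval="@daily")

    # Ordinary Python control flow builds the graph; no separate XML step.
    extracts = []
    for table in ["orders", "customers", "payments"]:   # hypothetical tables
        extracts.append(BashOperator(
            task_id="extract_{}".format(table),
            bash_command="spark-submit extract.py --table {}".format(table),
            dag=dag,
        ))

    # One load step that only runs once every extract has finished.
    load = BashOperator(
        task_id="load_warehouse",
        bash_command="hive -f load_warehouse.hql",
        dag=dag,
    )
    for task in extracts:
        load.set_upstream(task)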

Hive for Data Organization

Hive provides a relational database view over data in HDFS. It pretty much requires that data be in a format that looks like a table. It supports all the usual relational querying patterns. The main disadvantage is that queries are relatively high latency, because they run as batch jobs on MapReduce or Tez.

We use it because it does provide a relational database view without pretending to be a database. There are a bunch of tradeoffs in database design in the big data space. Hive optimises for relational querying and availability, rather than speed.
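
As a sketch of what that relational view looks like in practice, the snippet below defines an external Hive table over files already sitting in HDFS and runs a query against it. The table name, columns, and HDFS path are hypothetical, and we submit the HiveQL through PySpark’s HiveContext here, though the hive CLI or beeline would do just as well.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-example")
    hive = HiveContext(sc)

    # Point a table definition at files already sitting in HDFS; no data moves.
    hive.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
            user_id STRING,
            url STRING,
            viewed_at TIMESTAMP
        )
        STORED AS ORC
        LOCATION 'hdfs:///data/page_views'
    """)

    # The usual relational querying patterns, executed as a batch job underneath.
    daily_counts = hive.sql("""
        SELECT to_date(viewed_at) AS day, COUNT(*) AS views
        FROM page_views
        GROUP BY to_date(viewed_at)
    """)
    daily_counts.show()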

Spark for Distributed Data Processing

Spark is a system for processing data in Hadoop. We chose it because it provides a high-level, functional interface to programming the jobs, and because it provides both batch and stream processing. We also chose it because it offers a Python binding, which is convenient for our data science and reporting function to work with.

It’s also fast and offers a bunch of APIs for integrating with the Hadoop ecosystem. We’ve found ourselves creating tools to replace, for example, Sqoop with our own code written in Spark.
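
As a rough illustration of that kind of Sqoop replacement, the sketch below pulls a table out of a relational database over JDBC and lands it in HDFS as Parquet. The connection URL, table name, credentials, and output path are all placeholders.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="jdbc-import")
    sql_context = SQLContext(sc)

    # Read a table out of a relational database over JDBC, in place of Sqoop.
    orders = sql_context.read.format("jdbc").options(
        url="jdbc:postgresql://db-host:5432/shop",   # placeholder source DB
        dbtable="orders",
        user="etl",
        password="changeme",
    ).load()

    # Land it in HDFS as Parquet, where Hive can expose it as an external table.
    orders.write.mode("overwrite").parquet("hdfs:///data/orders")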

Ambari for Cluster Management

Ambari is a system for managing Hadoop clusters. In addition to providing a rich interface for seeing the current state of the systems in one’s cluster, Ambari manages installing components onto hosts and collecting metrics.

We started using it because it comes with Hortonworks’ Hadoop distribution, and we keep using it because it’s pretty good. Not great, mind you: as we use it, we run into plenty of complaints that many others share, but it’s fine, and it wraps up a lot of tasks, like distributing updated configuration and installing components, that would be a lot of effort to automate with Chef.
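
Ambari also exposes a REST API, which makes it straightforward to script simple checks against the cluster. Here is a small sketch that lists the state of every service; the host, cluster name, and credentials are placeholders.

    import requests

    AMBARI = "http://ambari-host:8080/api/v1"   # placeholder Ambari server
    CLUSTER = "our_cluster"                     # placeholder cluster name
    AUTH = ("admin", "admin")                   # placeholder credentials
    HEADERS = {"X-Requested-By": "ambari"}

    # Ask Ambari for the state of every service in the cluster.
    resp = requests.get(
        "{}/clusters/{}/services?fields=ServiceInfo/state".format(AMBARI, CLUSTER),
        auth=AUTH, headers=HEADERS,
    )
    resp.raise_for_status()

    for service in resp.json()["items"]:
        info = service["ServiceInfo"]
        print("{}: {}".format(info["service_name"], info["state"]))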

Let’s Talk

If you have questions or comments about what we’re doing with our data pipeline, feel free to add a response to this post. We’d love to hear from you. Stay tuned for updates on what we’re doing with streaming, and for a deeper dive into some of our components.

Do you work with Hadoop? You might be the type of person we’re looking for. Check out our careers page to see if we have an opening for you!


Data Engineering and Lean DevOps consultant — email marcin.tustin@gmail.com if you’re thinking about building data systems