How DevOps helped us solve our Big Data Development problem

The challenges of developing data-intensive applications on Hadoop, and how we solved them.

Developing big data applications on top of Hadoop is difficult. If the large number of services in the Hadoop ecosystem, and the way applications integrate with them, is not managed effectively, productivity can be dramatically hindered. At Panaseer we were starting to feel this pain, so we decided to do something about it. In this post I want to share:

  • The problem we faced at Panaseer.
  • How we broke the problem down.
  • How we solved it.
  • What this enabled for us.

The problem we faced at Panaseer

The complexities we faced developing data-intensive applications presented themselves in the following ways:

  • Building new environments was slow: Provisioning infrastructure, building a Hadoop cluster, deploying our application and populating it with data was a lengthy and error-prone process. Because of the effort involved, these environments were rarely kept up to date and quickly deviated from the standard configuration, and because the setup was manual we could not guarantee that any two environments were the same.
  • Maintaining consistency across environments was impossible: Because existing environments were constantly being modified, each in different ways, it was impossible to know which version of our platform was running on which version of Hadoop, or with which datasets.
  • Core datasets were not easily shareable: For internal development, our Data Scientists create representative datasets to build and test against. Without a centralised, simple mechanism to share these datasets, our environments quickly got out of sync and lacked the data needed for development.

This meant that our manual attempts to create virtual machines as local, isolated development environments were time-consuming and, given the pace of product development, quickly became out of date. Unfortunately, this encouraged us to follow the path of least resistance: despite our better judgement, we resorted to developing and testing directly against the shared environments.

Why you should never develop against a shared environment

  • Testing becomes increasingly difficult: If the area under test exhibited unexpected behaviour, working out whether this was caused by application changes or by changes to the shared environment was incredibly time-consuming.
  • Developers stepping on each other’s toes: Developers who needed the shared environment to be in different, and often incompatible, states ended up disrupting or even blocking each other’s work.
  • Increased downtime: Deploying in-progress features for testing could break downstream components, bringing to a halt any team actively working on them. The environments were easy to break but time-consuming to recover.

Ultimately, our productivity and ability to continue producing quality software was suffering.


How we solved the problem

We needed to make developing locally the easiest path to follow and migrate our developers away from the shared environments. We looked to automation to solve this, so that we could build environments quickly and consistently; the ultimate goal was to enable people with little Hadoop or OS experience to deploy a complete development environment at the click of a button.

Before we could get started we had a few questions to answer:

  • What could we do first that would provide the most value?
  • How were we going to manage our configuration?
  • How were we going to package and distribute datasets?

Our first challenge was to determine the scope of the work we were going to carry out. We did this by defining the architectural layers of our system and choosing the slice that would provide the most value. Mapping out the layers also made it easier to track dependencies between them and to identify a path through the system.

Simplified Architecture

We chose to first target the data platform, the Hadoop infrastructure that supported it, and the packaging and distribution of our development datasets. If we could automate the building and management of these, we would have tackled the most complicated and time-consuming part of our system. Once this was in place we could then look at deploying the rest of the platform.

Configuration Management

We chose to split out the work into two separate projects, the first covering all the steps required to build up our infrastructure and the second to deploy our platform. The primary reason for this split was that once the infrastructure build process was defined, it would change far less frequently than the deployment of our applications.

To configure and deploy our product we chose Ansible, as it was a tool we could get up and running quickly: it required little to no infrastructure setup and was relatively simple compared to its competitors. On top of this, the YAML files used to configure it are very readable, which would allow us to create ‘living documentation’. The infrastructure and deployment projects were broken down into layers, each represented in Ansible as a role encapsulating a high-level step, as shown below:

---
- name: Deploy Hadoop Infrastructure
  hosts: localhost
  roles:
    - base
    - java
    - mysql
    - hadoop
    - nifi
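
The second, deployment-focused project had an entry-point playbook of exactly the same shape. The sketch below is only illustrative (these role names are placeholders rather than our actual roles), but it shows how the split between the two projects looked in practice:

---
# Illustrative example: role names are placeholders
- name: Deploy the Panaseer platform
  hosts: localhost
  roles:
    - datasets    # fetch and load the shared development datasets
    - analytics   # deploy and configure the data platform applications
    - web         # deploy the user-facing application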

Each role is broken down into a series of steps that correspond to actions taken on the server. The example below shows what the Hadoop role might look like:

---
- name: Download Ambari Repo
  get_url:
    url: "{{ ambari.repo }}"
    dest: "/etc/yum.repos.d/ambari.repo"
  become: true

- name: Install Ambari
  yum:
    name: "{{ item }}"
  with_items:
    - ambari-server
    - ambari-agent
  become: true

- name: Setup Ambari Server
  shell: ambari-server setup -s -j $JAVA_HOME
  args:
    creates: /etc/ambari-server/conf/password.dat
  become: true

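Variables referenced in these tasks, such as ambari.repo above, are kept alongside the roles rather than hard-coded. A minimal sketch of what the role defaults might contain (the file path and URL below are placeholders, not our real configuration):

# roles/hadoop/defaults/main.yml (illustrative only)
ambari:
  repo: "https://example.com/ambari/ambari.repo"  # placeholder repo URL
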
Finally, we chose Vagrant as the mechanism for creating our development environments. For those unfamiliar, Vagrant is a tool by HashiCorp that simplifies the building and management of VMs. It lets us abstract away the underlying virtualisation implementation, works nicely with Ansible and produces artefacts, in the form of boxes, that can be shared. Users only need to run the command vagrant up, which instructs Vagrant to download the base box, create and start the VM, and then execute our Ansible playbooks against it, taking an environment from nothing to fully provisioned in a single step.

Dataset Management

Our final challenge was the packaging and distribution of our datasets. For local development we didn’t need to worry about truly big data: the datasets only needed to fit on our laptops and wouldn’t be used for research or performance work. We settled on a series of scripts that let us export tables from the Hadoop Distributed File System (HDFS), compress them, and transfer them to Amazon S3, where they could easily be shared and loaded back into other environments. For example:

# Export the table
tableloader -export TABLE_NAME.YYYY-MM-DD.tbz
# Upload the packaged table to the dataset repository
tabletransfer -up TABLE_NAME.YYYY-MM-DD.tbz

These could then be leveraged simply through Ansible:

---
- name: Download development dataset
  shell: /usr/local/bin/tabletransfer -down dev_table
- name: Install development dataset
  shell: /usr/local/bin/tableloader -f -i dev_table
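
As the number of shared datasets grows, the same pair of commands can be driven from a list rather than repeated per table; a sketch under that assumption (the table names here are illustrative):

---
# Illustrative only: table names are placeholders
- name: Download and install development datasets
  shell: >
    /usr/local/bin/tabletransfer -down {{ item }} &&
    /usr/local/bin/tableloader -f -i {{ item }}
  with_items:
    - dev_table
    - dev_assets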

The Results

Within a month, we were in a position to start beta testing, which, fortunately for us, meant talking to the colleagues sitting right next to us. We went through a phased roll-out to a few team members at a time, taking on feedback and quickly turning around improvements. A key piece of feedback was that what seemed intuitive to us, given our experience with the tooling and process, was not intuitive to others, so many of the improvements we made at this stage were about reducing friction as much as possible.

Getting this far was quite a milestone for us. In the short term we were able to benefit from:

  • Uniform, up-to-date development environments that make collaboration a lot easier.
  • Ansible scripts that double as documentation, so it is always up to date.
  • The ability to iterate on and upgrade our local infrastructure faster and with less risk.

But just as importantly, it laid the foundation from which we could start to:

  • Automate the building of production environments.
  • Ensure consistent configuration of security controls.
  • Have a fully automated QA process.

Key Takeaways

  • Don’t maintain infrastructure manually; it may work in the early days but it doesn’t scale.
  • Focus on delivering thin slices of value quickly.
  • Treat configuration as code: commit every change to version control.
  • Split your infrastructure into well-encapsulated layers; this makes it far easier to reason about.

If you’ve faced similar problems or have come up with interesting solutions please let us know in the comments or by tweeting us at @panaseer_team.