Basics of Data Engineering

From the perspective of a Data Scientist

Andriy Lazorenko
5 min read · Nov 22, 2017

Introduction:

Before we dive into the details of this “data engineering” business, we first need to understand the limitations of a standard (PyData) machine learning pipeline, shown in the picture below:

Typical PyData Machine Learning Pipeline. Connections are done via Python scripts (usually single-threaded)

Pros of the model:

  • Low entry barrier
  • Quick to deploy

Cons of the model:

  • Poor scalability

The typical PyData pipeline needs to be abandoned once it is clear that it will be heavily loaded in production.

Deciphering poor scalability in terms of reactive Machine Learning (ML) is best done elsewhere. We’ll take a simpler approach and describe three categories of problems that start to appear as we add a heavy load to the pipeline.

1. Too much computation

If we keep increasing the load on a single-threaded PyData pipeline, it will eventually fail. It cannot scale beyond a single node, so if your data is too big to fit in a server’s RAM, it cannot be batch processed as a whole: it has to be broken into smaller batches or it won’t get processed at all. If there aren’t enough CPU resources to train a model quickly enough, you can only buy a better processor, and once you have the best one, what’s next? Worse, if the model cannot produce predictions at the rate required by incoming requests, that’s the limit for the application. Multithreading in Python is limited, and if a thread fails there is no way to gracefully fall back to defaults; it just fails.

All of the above problems can be addressed with the split-apply-combine strategy: split a task into subtasks that are parallelized across threads or nodes, apply the operations, and combine the results, speeding up the pipeline. MapReduce can be implemented from scratch in different programming languages, but doing so is madness given the hours of pain required. A viable alternative is to use Hadoop’s MapReduce or Spark. Hadoop’s MapReduce is slower because of the disk reads and writes that happen after each operation (so hardly anyone uses it now), while Spark does all computations in memory, speeding up the iterative algorithms typical for ML even more than usual pipelines (though you’ll need a lot of RAM for larger datasets).
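
To make the split-apply-combine idea concrete, here is a minimal PySpark sketch; the file path and column names are hypothetical, and a running Spark cluster (or local mode) is assumed:

# Minimal split-apply-combine sketch in PySpark (hypothetical path and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("split-apply-combine").getOrCreate()

# "Split": Spark partitions the data across executors automatically on read.
events = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# "Apply" + "combine": each partition is aggregated locally, then the partial
# results are merged into the final per-user counts.
per_user = events.groupBy("user_id").agg(F.count("*").alias("n_events"))

per_user.show()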

Spark is a great choice for data pipelines incorporating ML because, on top of the batch processing and streaming it provides, the Spark ML library allows for distributed machine learning. It can fully replace the PyData pipeline and its Python-script connectors with Spark jobs that run on a distributed cluster (and fail gracefully).
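
As an illustration of what the modelling step looks like when it is replaced by Spark ML, here is a minimal sketch; the training-set path and the feature/label column names are assumptions for illustration only:

# Minimal Spark ML sketch: a distributed logistic regression fit (hypothetical columns).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-example").getOrCreate()

df = spark.read.parquet("hdfs:///data/training")  # assumed training set

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# The fit is distributed across the cluster; if an executor dies, Spark
# reschedules its tasks instead of failing the whole job.
model = Pipeline(stages=[assembler, lr]).fit(df)
predictions = model.transform(df)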

2. Too much data

Okay, let’s imagine that we’ve replaced all intermediate scripts in the typical PyData ML pipeline with Spark jobs and the data modelling step with Spark ML, as follows:

Improved pipeline

There is still one weak spot in this pipeline: the DB. Relational databases perform poorly when dealing with (approximately) more than 5 TB of data. Sharding and replication of relational DBs are a separate issue and an art of their own, and a costly venture in terms of support effort, so one might want to consider abandoning the relational DB before hitting that 5 TB glass ceiling.

So, what other storage options are viable for big data? First, there is a need for a distributed data storage system able to handle:

  • hardware failure and loss of data
  • need for high throughput
  • the need for portability across many platforms

All these points are satisfied by the Hadoop Distributed File System (HDFS), part of Apache Hadoop. Once again, it’s possible to set up HDFS on a local hardware cluster, if there are a couple of DevOps specialists at hand with spare capacity, or on a cloud computing cluster. For storage of cold data it is common to use AWS S3. It is not recommended to use S3 as part of a pipeline if speed is a concern: it’s better to keep such “hot” data closer to the computational resources (e.g. a local HDFS cluster, or cloud HDFS if your computational servers are in the cloud).
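
To illustrate the hot/cold split, here is a small sketch of how the two tiers might be addressed from Spark; the paths, the bucket name and the presence of the s3a (hadoop-aws) connector on the classpath are assumptions:

# Hot data: read from the cluster-local HDFS for fast, iterative processing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-tiers").getOrCreate()

hot = spark.read.parquet("hdfs:///warehouse/events/current/")

# Cold data: older partitions archived on S3; reads go over the network,
# so keep them out of latency-sensitive parts of the pipeline.
cold = spark.read.parquet("s3a://my-archive-bucket/events/2016/")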

But that’s just the infrastructure for distributed data. What about the method of storing the data itself?

As we’ve already figured, relational databases are a poor option. What about plain text? Surprisingly, .csv or .tsv files are commonly dumped onto an HDFS system as a quick-and-dirty way to store data received from an external service (e.g. when receiving data in batches); a sketch follows the pros and cons below.

Pros:

  • Quick dump

Cons:

  • Max size is limited
  • Need to read entire file
  • Not easily queryable
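
The quick-and-dirty dump mentioned above might look roughly like this in Spark; the paths and the raw input format are hypothetical:

# Quick-and-dirty batch dump: write an incoming batch as CSV onto HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-dump").getOrCreate()

batch = spark.read.json("hdfs:///landing/raw_batch.json")  # assumed raw input
batch.write.mode("overwrite").option("header", True).csv("hdfs:///dumps/batch_2017_11_22/")

# The downside shows up on read: the whole file set is scanned even if only a
# couple of columns are needed, and there is no index to query by.
df = spark.read.option("header", True).csv("hdfs:///dumps/batch_2017_11_22/")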

What about NoSQL databases? Note that not all of them are distributed.

Pros:

  • Queryable
  • High throughput

Cons:

  • Maintenance cost

Cassandra is one typical choice for a distributed NoSQL database, but it still requires a dedicated DevOps engineer to set it up, maintain, monitor and scale it, etc. So, a distributed NoSQL database is the default option for storing data that needs to be queried.
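
As a rough illustration, reading a queryable Cassandra table into Spark could look like this, assuming the DataStax spark-cassandra-connector is available on the cluster; the host, keyspace and table names are placeholders:

# Sketch: reading a Cassandra table into Spark via the spark-cassandra-connector.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-read")
    .config("spark.cassandra.connection.host", "10.0.0.5")  # placeholder host
    .getOrCreate()
)

users = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="user_events")  # placeholder names
    .load()
)

recent = users.filter(users.event_date >= "2017-11-01")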

For those looking to decrease the DevOps workload, there is a solution that covers some use cases: Parquet (see the sketch after the pros and cons below).

Pros:

  • Great integration with Spark
  • Columnar, which allows speed-ups of queries by analysts
  • Not a distributed database, just files -> less maintenance effort
  • Quick to read
  • Efficient to store

Cons:

  • Requires a schema
  • Difficult schema evolution
  • No data mutation
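
Here is the sketch referenced above: Parquet on HDFS, written and read with Spark, with hypothetical paths and columns. Because the format is columnar and the data is partitioned, a query only touches the columns and partitions it needs:

# Write once, partitioned by date, so analysts' queries can prune whole directories.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

events = spark.read.csv("hdfs:///dumps/batch_2017_11_22/", header=True, inferSchema=True)
events.write.mode("overwrite").partitionBy("event_date").parquet("hdfs:///warehouse/events/")

# An analyst's query reads only the needed columns and partitions.
daily = (
    spark.read.parquet("hdfs:///warehouse/events/")
    .where(F.col("event_date") == "2017-11-22")
    .groupBy("user_id")
    .count()
)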

To wrap it all up in a single picture:

3. Messy pipeline

As the number of Spark jobs grows, the pipeline becomes messy, as there are many things that need to be taken care of:

  • Job scheduling
  • Time-dependent jobs
  • Logic-dependent jobs (task chaining)
  • Monitoring & Alerting for job failures
  • Failover recovery (retries for failed jobs, not entire pipeline)
  • Job visualizations (dashboards)
  • other workflow management tasks

There are a couple of solutions for orchestrating pipelines in a manageable way. We’ll review two of them.

Luigi

A simple workflow management solution in Python, developed and open-sourced by Spotify (see the minimal example after the pros and cons below).

Pros:

  • Low entry barrier
  • Dependency management (chaining)
  • Web UI centralized job planner with statuses and error tracking
  • Failover recovery

Cons:

  • No scheduling
  • No streaming
  • Difficult scalability
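
Here is the minimal Luigi example referenced above; the task names, parameter and paths are hypothetical, but it shows dependency chaining via requires() and file-based failover recovery (a task whose output already exists is not re-run):

# Two chained Luigi tasks; run with: python pipeline.py AggregateBatch --date 2017-11-22
import luigi


class DumpBatch(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("/data/dumps/{}.csv".format(self.date))

    def run(self):
        with self.output().open("w") as f:
            f.write("user_id,n_events\n")  # stand-in for the real dump logic


class AggregateBatch(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):  # dependency chaining: DumpBatch must finish first
        return DumpBatch(date=self.date)

    def output(self):
        return luigi.LocalTarget("/data/aggregates/{}.csv".format(self.date))

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())  # stand-in for the real aggregation


if __name__ == "__main__":
    luigi.run()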

Airflow

A more sophisticated and capable workflow management tool, supporting most use cases, developed and open-sourced by Airbnb (a minimal DAG sketch follows the pros and cons below).

Pros:

  • All of the above + scheduling and scalability

Cons:

  • Higher entry barrier due to increased functionality
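
And here is the minimal Airflow sketch referenced above, written against the classic 2017-era API; the DAG id, schedule and callables are placeholders:

# A scheduled DAG with two dependent tasks (classic Airflow 1.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def dump_batch(**context):
    print("dumping batch for", context["ds"])  # stand-in for the real dump job


def aggregate_batch(**context):
    print("aggregating", context["ds"])  # stand-in for the real Spark job


dag = DAG(
    dag_id="daily_events",
    start_date=datetime(2017, 11, 1),
    schedule_interval="@daily",  # built-in scheduling, which Luigi lacks
)

dump = PythonOperator(task_id="dump_batch", python_callable=dump_batch,
                      provide_context=True, dag=dag)
aggregate = PythonOperator(task_id="aggregate_batch", python_callable=aggregate_batch,
                           provide_context=True, dag=dag)

dump >> aggregate  # task chaining: aggregate runs only after dump succeeds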

Wrapping it all up:
