Basics of Data Engineering
From the Perspective of a Data Scientist
Introduction:
Before we dive into the details of this “data engineering” business, we first need to understand the limitations of a standard (PyData) machine learning pipeline, shown in the picture below:
Pros of the model:
- Low entry barrier
- Quick to deploy
Cons of the model:
- Poor scalability
The typical PyData pipeline needs to be abandoned once there is a clear understanding that it is going to be heavily loaded in production.
Deciphering poor scalability in terms of reactive Machine Learning (ML) is best done here. We’ll take a simpler approach and describe three categories of problems that start to appear as we put a heavy load on the pipeline.
1. Too much computation
If we keep increasing the load on a single-threaded PyData pipeline, it will eventually fail. It cannot scale beyond a single node, which means that if your data is too big to fit in a server’s RAM, it cannot be batch-processed as a whole: it has to be broken into smaller batches, or it won’t get processed at all. If there aren’t enough CPU resources to train a model quickly enough, you can only buy a better processor; and once you have the best one, what’s next? Worse, if the model cannot produce predictions at the rate required by incoming requests, that is the limit for the application. Multithreading in Python is limited, and if a thread fails, there is no way to gracefully fall back to defaults: it just fails.
All of the above problems can be addressed with the split-apply-combine strategy: a task is split into subtasks that are parallelized across different threads or nodes, the operations are performed, and the results are combined, speeding up the pipeline. MapReduce can be implemented from scratch in any programming language, but doing so is rarely worth the hours of pain required. A viable alternative is Hadoop’s MapReduce or Spark. Hadoop’s MapReduce is slower because of the disk reads and writes that occur after each operation (so hardly anyone uses it directly now), while Spark does all computations in memory, speeding up the iterative algorithms typical of ML even more than ordinary pipelines (though you’ll need a lot of RAM for larger datasets).
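To make the idea concrete, here is a toy word count in plain Python; the chunking and function names are illustrative, not any library’s API. Each map_phase call is independent of the others, which is exactly what allows the chunks to be farmed out to threads, processes, or cluster nodes:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # "apply": count words within one chunk, independently of the others
    return Counter(chunk.split())

def reduce_phase(left, right):
    # "combine": merge two partial word counts
    return left + right

# "split": break the input into independent chunks
chunks = ["spark spark hadoop", "hadoop spark", "hdfs"]

partial_counts = [map_phase(c) for c in chunks]  # the parallelizable step
total = reduce(reduce_phase, partial_counts)
print(total)  # Counter({'spark': 3, 'hadoop': 2, 'hdfs': 1})
```

Spark’s `map` and `reduceByKey` operations follow this same shape, just distributed across a cluster and kept in memory between steps.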
Spark is a great choice for data pipelines incorporating ML because, on top of the batch processing and streaming that Spark provides, the Spark ML library allows for distributed machine learning. It can fully replace the PyData pipeline and its Python-script connectors with Spark jobs that run on a distributed cluster (and fail gracefully).
2. Too much data
Okay, let’s imagine that we’ve replaced all intermediate scripts in the typical PyData ML pipeline with Spark jobs, and the data modelling step with Spark ML, as follows:
There is still one weak spot in this pipeline: the DB. Relational databases perform poorly when dealing with more than roughly 5 TB of data. Sharding and replication of relational DBs are a separate issue and an art of their own, and a costly venture in terms of support effort, so one might want to consider abandoning the relational DB before hitting the 5 TB glass ceiling.
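To see why sharding requires care, here is a minimal sketch of hash-based shard routing; the function name and scheme are made up for illustration (production systems typically use consistent hashing and more):

```python
import hashlib

def shard_for_key(key, n_shards):
    # use a stable hash: Python's built-in hash() is randomized per process,
    # so it cannot be used to route keys consistently across machines
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

# every reader and writer in every service must agree on this routing
# logic, which is part of what makes sharding costly to support
print(shard_for_key("user:42", 8))
```

Note that with naive modulo routing, changing `n_shards` remaps most keys to different shards, which is one reason resharding a relational DB is such an expensive operation.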
So, what other storage options are viable for big data? First, there’s a need for a distributed data storage system able to handle:
- hardware failure and loss of data
- the need for high throughput
- the need for portability across many platforms
All of these points are satisfied by the Hadoop Distributed File System (HDFS), part of Apache Hadoop. Once again, it’s possible to set up HDFS on a local hardware cluster (if a couple of DevOps specialists with spare capacity are at hand) or on a cloud computing cluster. For storage of cold data, it is common to use AWS S3. It is not recommended to use S3 as part of a pipeline where speed is a concern: it’s better to keep such “hot” data closer to the computational resources (e.g. a local HDFS cluster, or cloud HDFS if your computational servers are in the cloud).
But that’s just the infrastructure for distributed data. What about the method of storing the data itself?
As we’ve already figured out, relational databases are a poor option. What about plain text? Surprisingly, .csv or .tsv files are commonly dumped onto an HDFS system as a quick-and-dirty way to store data received from an external service (e.g. when receiving data in batches).
Pros:
- Quick dump
Cons:
- Max size is limited
- Need to read entire file
- Not easily queryable
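The “need to read the entire file” con is easy to demonstrate with the stdlib csv module (the file content and column names below are made up): even the simplest aggregate forces a full scan, because there are no indexes and no column layout.

```python
import csv
import io

# a small CSV dump, as it might land on HDFS from an external service
raw = "user_id,country,amount\n1,DE,10\n2,US,25\n3,DE,7\n"

# answering even a simple query ("total amount for DE") requires
# scanning every row and parsing every field of every row
total = 0
for row in csv.DictReader(io.StringIO(raw)):
    if row["country"] == "DE":
        total += int(row["amount"])
print(total)  # 17
```

On a multi-terabyte dump, that full scan happens for every query, no matter how narrow.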
What about NoSQL databases? Note that not all of them are distributed.
Pros:
- Queryable
- High throughput
Cons:
- Maintenance cost
Cassandra is one typical choice for a distributed NoSQL database, but it still requires a dedicated DevOps specialist to set it up, maintain, monitor, and scale it. Still, a distributed NoSQL database is the default option for storing data that needs to be queried.
For those looking to decrease the workload of the DevOps person, there is a solution that covers some use cases: Parquet.
Pros:
- Great integration with Spark
- Columnar format, which speeds up analysts’ queries
- Not a distributed database, just files -> less maintenance effort
- Quick to read
- Efficient to store
Cons:
- A schema is required
- Difficult schema evolution
- No data mutation
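Why does a columnar layout speed up analysts’ queries? Here is a toy model in plain Python (the data is made up; real Parquet adds per-column encoding, compression, and statistics on top):

```python
# row-oriented storage: each record kept together (like CSV)
rows = [
    {"user_id": 1, "country": "DE", "amount": 10},
    {"user_id": 2, "country": "US", "amount": 25},
    {"user_id": 3, "country": "DE", "amount": 7},
]

# columnar storage: each column kept contiguously (like Parquet)
columns = {
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "DE"],
    "amount": [10, 25, 7],
}

# a typical analytical aggregate touches ONE column in the columnar
# layout...
total_columnar = sum(columns["amount"])

# ...but has to load every full record in the row layout
total_rows = sum(r["amount"] for r in rows)

assert total_columnar == total_rows == 42
```

With millions of rows and dozens of columns, reading one contiguous column instead of every record is the difference between scanning gigabytes and scanning terabytes.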
To wrap it all up on a single picture:
3. Messy pipeline
As the number of Spark jobs grows, the pipeline becomes messy, since there are many things that need to be taken care of:
- Job scheduling
- Time-dependent jobs
- Logic-dependent jobs (task chaining)
- Monitoring & Alerting for job failures
- Failover recovery (retries for failed jobs, not entire pipeline)
- Job visualizations (dashboards)
- other workflow management tasks
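To get a feel for what such a workflow manager does, here is a minimal sketch of task chaining with per-task retries; all names are illustrative, and Luigi and Airflow do this plus scheduling, monitoring, and much more:

```python
def run_pipeline(tasks, dependencies, max_retries=2):
    """Run callables in dependency order, retrying only the failed task."""
    done = set()

    def run(name):
        if name in done:
            return
        for upstream in dependencies.get(name, []):
            run(upstream)  # logic-dependent jobs: upstream runs first
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                done.add(name)
                return
            except Exception:
                if attempt == max_retries:
                    raise  # monitoring & alerting would hook in here

    for name in tasks:
        run(name)

# a flaky job that fails once, then succeeds on retry
log = []
attempts = {"n": 0}

def transform():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient failure")
    log.append("transform")

tasks = {"extract": lambda: log.append("extract"), "transform": transform}
dependencies = {"transform": ["extract"]}
run_pipeline(tasks, dependencies)
print(log)  # ['extract', 'transform']
```

The key property, retrying only the failed task rather than re-running the whole pipeline, is exactly the failover recovery listed above.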
There are a couple of solutions for orchestrating pipelines in a manageable way. We’ll review two of them.
Luigi
A simple workflow management solution in Python, developed and open-sourced by Spotify.
Pros:
- Low entry barrier
- Dependency management (chaining)
- Web UI centralized job planner with statuses and error tracking
- Failover recovery
Cons:
- No scheduling
- No streaming
- Difficult scalability
Airflow
A more sophisticated and capable workflow management tool that supports most use cases, developed and open-sourced by Airbnb.
Pros:
- All of the above + scheduling and scalability
Cons:
- Higher entry barrier due to increased functionality