
Big-Data Engines for building ML Data Pipelines — Introduction (Part 1)

Him Bhankar
Published in petaByts · 3 min read · May 1, 2020

Due to the diversity of data sources and the sheer volume of data that needs to be processed, traditional data processing tools fail to meet the performance and reliability requirements of modern machine learning and data analytics applications.

Part 1 of this series focuses on big-data processing engines — Hadoop/Hive, Spark, Presto & Airflow. The following parts will cover how to set up cost-efficient, highly scalable and reliable data pipelines on GCP and AWS.

Apache Hadoop/Hive — batch analytics

Hive is an open-source Apache project built on top of Hadoop for querying, summarising and analysing large data sets through a SQL-like interface (similar to AWS Athena, which we will look at more closely in Part 2 of this series).

Apache Hive is used mostly for batch processing: large ETL jobs and batch SQL queries over very large data sets, as well as exploration of large volumes of structured, semi-structured and unstructured data. Hive includes a Metastore, which provides schemas and statistics that are useful for query optimisation and data exploration. Hive is distributed, scalable, fault tolerant and flexible, and it can persist large data sets on cloud file systems.
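To make the "SQL-like interface for batch analytics" concrete, here is the shape of a typical Hive-style aggregation. This is a minimal single-machine sketch using Python's built-in sqlite3 as a stand-in engine — Hive would run essentially the same SQL, but over files in HDFS or cloud storage; the `page_views` table and its columns are invented for the example.

```python
import sqlite3

# sqlite3 stands in for the Hive engine in this toy sketch;
# the table and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("IN", 120), ("US", 300), ("IN", 80), ("DE", 50)],
)

# A typical Hive-style batch query: total views per country.
rows = conn.execute(
    "SELECT country, SUM(views) FROM page_views "
    "GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 50), ('IN', 200), ('US', 300)]
```

In real Hive the query text would be nearly identical; the difference is that the engine compiles it into distributed batch jobs and uses the Metastore's schemas and statistics to plan execution.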

Apache Spark — stream analytics

Apache Spark is an open-source, general-purpose computational engine for big data. Spark provides a simple, expressive programming model that supports a wide range of applications, including ETL, machine learning (Spark MLlib), stream processing and graph computation. Thanks to its in-memory processing capabilities it is quite fast. Spark is also distributed, scalable, fault tolerant and flexible, and it natively supports a wide array of programming languages, including Java, Python, Scala and R.
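The programming model mentioned above chains transformations such as flatMap, map and reduceByKey over a distributed dataset. Here is a toy, single-machine sketch of the classic word-count pipeline using plain Python — the sample lines are invented; in PySpark the same chain would execute in parallel across a cluster.

```python
# Toy single-machine sketch of Spark's flatMap -> map -> reduceByKey chain.
lines = ["big data engines", "data pipelines", "big pipelines"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(sorted(counts.items()))
# [('big', 2), ('data', 2), ('engines', 1), ('pipelines', 2)]
```

The point of the model is that each stage is a pure transformation, so Spark can partition the data, run the stages on many machines, and keep intermediate results in memory between them.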

Because Spark is memory-intensive, it may not be the most cost-efficient engine for every use case. A few important things to consider while operating Spark clusters in production: fast storage, distributed caching, advanced indexing, metadata caching, job isolation on multi-tenant clusters, and use of Sparklens — an open-source Spark profiler that provides insights into Spark workloads running across different environments.

Presto — open-source SQL query engine

Presto is an open-source SQL query engine developed at Facebook, used for running interactive analytic queries against data sources of all sizes, from gigabytes to petabytes. Presto is built to work with disparate data sources, and you can combine different data sources in a single query (federated queries). Presto is designed like an MPP (Massively Parallel Processing) SQL engine and is optimised for SQL execution.

With Presto connectors you can work directly on files that reside on file systems such as S3 or Azure Storage, join terabytes of data in a few seconds, or cache query results for rapid responses on later runs. Presto is also distributed, scalable and flexible.
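A federated query joins tables that live in entirely different systems — say, a Hive table and a MySQL table — in one SQL statement. As a loose single-machine analogy, the sketch below attaches two separate sqlite3 databases and joins across them; the `users`/`orders` tables and their columns are invented, and sqlite3 merely stands in for Presto's connectors.

```python
import sqlite3

# First "source": an in-memory database with a users table.
main = sqlite3.connect(":memory:")
main.execute("CREATE TABLE users (id INTEGER, name TEXT)")
main.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin")])

# Second "source": another in-memory database, attached under a catalog-like
# alias -- loosely analogous to a Presto connector exposing another system.
main.execute("ATTACH DATABASE ':memory:' AS warehouse")
main.execute("CREATE TABLE warehouse.orders (user_id INTEGER, total REAL)")
main.executemany("INSERT INTO warehouse.orders VALUES (?, ?)",
                 [(1, 9.5), (1, 4.0), (2, 7.25)])

# One query spanning both sources, as a federated Presto query would.
rows = main.execute(
    "SELECT u.name, SUM(o.total) FROM users u "
    "JOIN warehouse.orders o ON o.user_id = u.id "
    "GROUP BY u.name ORDER BY u.name"
).fetchall()
print(rows)  # [('ada', 13.5), ('lin', 7.25)]
```

In Presto the same idea is expressed with catalog-qualified names (e.g. `hive.web.users` joined to `mysql.crm.orders`), and the engine pushes work down to each connector.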

For cost-efficient usage, make sure to set up dynamic cluster sizing based on workload and termination of idle clusters — ensuring high reliability while reducing compute costs. Presto clusters support multi-tenancy, and you can set up logs and metrics to track query performance.

Apache Airflow — open-source tool to author, schedule & monitor data workflows

With Airflow, users author workflows as directed acyclic graphs (DAGs) of tasks. A DAG is the set of tasks needed to complete a pipeline, organised to reflect their relationships and interdependencies. Airflow's rich user interface makes it easy to visualise pipelines running in production, monitor progress and troubleshoot issues when needed.
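The essence of a DAG of tasks is that each task runs only after its upstream dependencies finish. The toy sketch below (task names are invented, and the stdlib `graphlib` stands in for Airflow's scheduler) computes a valid execution order for a small extract/transform/load pipeline — the same ordering an Airflow DAG with these dependencies would enforce.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (its upstream tasks).
# In Airflow this would be declared as: extract >> [clean, enrich] >> load
deps = {
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}

# A valid execution order: extract first, load last,
# clean/enrich in between once extract has finished.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Airflow adds scheduling, retries, alerting and a UI on top of exactly this dependency structure, and `TopologicalSorter` would also raise an error on a cycle — which is why the graphs must be acyclic.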

Airflow connects out of the box with multiple data sources and can alert via email or Slack when a task succeeds or fails. Because workflows are defined as code, they become more maintainable, versionable, testable and collaborative. Airflow can be configured to provide single-click deployment, automate cluster and configuration management, and include dashboards for visualising Airflow DAGs.

