An Overview of Databases — Part 7.1: Distributed DBMS (Apache Spark, Parquet + PySpark + Node.js)

Saeed Vayghani
7 min read · Aug 1, 2024

--

Part 1: DBMS Flow
Part 2: Non-Relational DB vs Relational

Part 3: CAP and BASE Theorem

Part 4: How to choose a Database?
Part 5: Different Solutions for Different Problems
Part 6: Concurrency Control
Part 7: Distributed DBMS
>> Part 7.1: Distributed DBMS (Apache Spark, Parquet + PySpark + Node.js)
Part 8: Clocks
>> Part 8.1: Clocks (Causal Consistency With MongoDB)
>> Part 8.2: Clocks (MongoDB Replica and Causal Consistency)
Part 9: DB Design Mastery
Part 10: Vector DB
Part 11: An interesting case, coming soon!

What we are going to discuss in this post:

  1. What is Apache Spark?
  2. Key Features
  3. Core Components
  4. Use Cases
  5. Example Workflow
  6. Parquet File Format and ParquetJs
  7. Using Spark with Python vs Node.js
  8. An educational project

What is Apache Spark?

Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

At the end of this post you can find an educational project on GitHub that shows how to set up a Spark cluster and run a few jobs.

Key Features

1. In-Memory Computing:

Spark keeps data in memory during processing, which makes iterative algorithms and interactive data analysis significantly faster than traditional disk-based processing.
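A minimal PySpark sketch of this idea (the Parquet path and the status column are placeholders, not part of any real dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Placeholder path and schema: any reasonably large dataset works here.
df = spark.read.parquet("/data/events.parquet")

df.cache()                                            # keep the data in executor memory
total = df.count()                                    # first action reads from disk and caches
errors = df.filter(df["status"] == "error").count()   # reuses the in-memory copy

print(total, errors)
spark.stop()
```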

2. Distributed Processing:

Spark distributes data across a cluster of computers, allowing parallel processing and leveraging the power of multiple nodes to handle large-scale data efficiently.
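A small sketch of how work is split into partitions that executors process in parallel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
sc = spark.sparkContext

# Split one million numbers into 8 partitions; each partition becomes a task
# that any executor in the cluster can pick up and process in parallel.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())           # 8
print(rdd.map(lambda x: x * x).sum())   # computed in parallel across the cluster

spark.stop()
```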

3. Fault Tolerance:

Spark's Resilient Distributed Datasets (RDDs) record the lineage of transformations used to build them, so partitions lost to a node failure can be recomputed automatically, which provides reliability and fault tolerance.
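The lineage that makes this recovery possible can be inspected on any RDD; a small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

rdd = (sc.parallelize(range(100))
         .map(lambda x: x * 2)
         .filter(lambda x: x % 3 == 0))

# toDebugString() shows the chain of transformations Spark would replay
# to rebuild any partition lost to a node failure.
print(rdd.toDebugString().decode())
spark.stop()
```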

4. Unified Data Processing:

Spark supports multiple data processing operations including batch processing, real-time stream processing, machine learning, and graph processing within a single framework.

5. Ease of Use:

Spark provides high-level APIs in Scala, Java, Python (PySpark), and R, making it accessible to a wide range of developers. It also includes a rich set of built-in libraries like Spark SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time data processing.
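As an illustration of the high-level API, a word count in PySpark stays close to plain Python (a minimal sketch with inline sample data):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ease-of-use-demo").getOrCreate()

lines = spark.createDataFrame(
    [("to be or not to be",), ("that is the question",)], ["text"]
)

# Split each line into words and count how often each word appears.
counts = (lines
          .select(F.explode(F.split("text", r"\s+")).alias("word"))
          .groupBy("word")
          .count()
          .orderBy(F.desc("count")))

counts.show()
spark.stop()
```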

Core Components

1. Spark Core:

The foundation of Spark, responsible for basic functionalities like task scheduling, memory management, fault recovery, and interaction with storage systems.

2. Spark SQL:

A module for structured data processing using SQL and DataFrame APIs, enabling seamless integration with a variety of data sources.
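A minimal Spark SQL sketch (the orders table and its columns are made up for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "books", 12.50), (2, "games", 59.99), (3, "books", 7.25)],
    ["order_id", "category", "amount"],
)

# Register the DataFrame as a temporary view and query it with plain SQL.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT category, SUM(amount) AS revenue
    FROM orders
    GROUP BY category
    ORDER BY revenue DESC
""").show()
spark.stop()
```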

3. Spark Streaming:

Enables near-real-time processing of live data streams; in current Spark versions this is usually done with Structured Streaming, which builds on the DataFrame API.
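A minimal Structured Streaming sketch using the built-in rate source, which needs no external systems:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows, handy for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (stream
         .withColumn("squared", stream["value"] * stream["value"])
         .writeStream
         .format("console")        # print each micro-batch to stdout
         .outputMode("append")
         .start())

query.awaitTermination(timeout=30)  # stream for roughly 30 seconds, then return
spark.stop()
```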

4. MLlib (Machine Learning Library):

Provides scalable machine learning algorithms for classification, regression, clustering, collaborative filtering, and more.
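A minimal sketch using the DataFrame-based pyspark.ml API with toy data (the feature and label values are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training data: two numeric features and a binary label.
data = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (1.0, 0.5, 0.0), (3.2, 4.1, 1.0), (4.5, 3.9, 1.0)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a single vector, then fit a classifier.
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(data)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)

model.transform(features).select("f1", "f2", "label", "prediction").show()
spark.stop()
```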

5. GraphX:

A library for graph parallel computation, enabling the processing and analysis of large-scale graph data.

Use Cases

  • Batch Processing: Processing large volumes of static data stored in databases or files.
  • Real-Time Processing: Analyzing streaming data from sources like IoT devices, logs, and social media feeds.
  • Data Warehousing: Running large-scale SQL queries for business intelligence and reporting.
  • Machine Learning: Applying machine learning algorithms on large datasets for predictive analytics.
  • Graph Analytics: Performing operations and analysis on graph datasets for social network analysis, fraud detection, etc.

Example Workflow

  1. Data Ingestion: Load data from various sources (HDFS, S3, etc.).
  2. Data Processing: Transform and process data using Spark’s APIs (DataFrame, RDD, Dataset).
  3. Machine Learning: Train machine learning models using MLlib.
  4. Querying: Query data using Spark SQL.
  5. Real-time Analysis: Stream and analyze data using Spark Streaming.
  6. Output: Store results or serve them directly to applications (see the end-to-end sketch after this list).
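A sketch of the batch-oriented steps above in PySpark (paths and column names are placeholders; the streaming and ML steps are shown in their own sections):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

# 1. Ingestion: read raw CSV from a placeholder path (HDFS, S3, or local disk).
raw = spark.read.csv("/data/raw/sales.csv", header=True)

# 2. Processing: clean and type the data with the DataFrame API.
sales = (raw
         .withColumn("amount", F.col("amount").cast("double"))
         .filter(F.col("amount") > 0))

# 4. Querying: aggregate with Spark SQL.
sales.createOrReplaceTempView("sales")
daily = spark.sql("SELECT sale_date, SUM(amount) AS revenue FROM sales GROUP BY sale_date")

# 6. Output: persist the result as Parquet for downstream consumers.
daily.write.mode("overwrite").parquet("/data/curated/daily_revenue")

spark.stop()
```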

Parquet File Format and parquetjs

Apache Spark is a powerful big data processing framework that is written in Scala and primarily supports languages like Scala, Java, R, and Python. Node.js is generally used in the context of server-side JavaScript and isn’t natively supported by Apache Spark.

You can still interact with Apache Spark from a Node.js project, but it requires a REST API or some other bridge to the Spark cluster. Here is how you might go about it:

  1. Using Livy for Spark Interaction: Apache Livy is a service that enables easy interaction with a Spark cluster over a REST API.
  2. Using a Node.js HTTP/REST client: Libraries like Axios can call the Livy server from your Node.js application (the underlying REST calls are sketched right after this list).
  3. Writing the actual Spark jobs (data ingestion, transformation, analysis) in a language Spark supports natively: Scala, Java, or Python.
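The Livy REST calls look the same from any client; the sketch below uses Python's requests purely for brevity (the Livy URL and job path are placeholders), and a Node.js client such as Axios would issue the same requests:

```python
import time
import requests

LIVY_URL = "http://localhost:8998"          # placeholder: your Livy endpoint
JOB_FILE = "/opt/jobs/process_parquet.py"   # placeholder: a PySpark script reachable by the cluster

# Submit a batch job: POST /batches with the script to run.
resp = requests.post(f"{LIVY_URL}/batches", json={"file": JOB_FILE})
batch_id = resp.json()["id"]

# Poll the batch state until Livy reports a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    print("batch state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)
```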

Why Not Node.js for Spark Jobs?

  • Runtime Environment: Spark operations are optimized to run on the JVM, and Node.js operates in a JavaScript runtime (V8). These environments are inherently different.
  • API Compatibility: Spark’s API is not available in Node.js and rewriting those capabilities in Node.js would be impractical and inefficient.
  • Integration and Support: Spark seamlessly integrates with data ecosystems like Hadoop, HDFS, and YARN, which are also JVM-based.

Apache Livy and parquetjs serve different purposes in the ecosystem of big data processing and storage. Their functionalities and use cases do not overlap; rather, they complement each other.

Livy is a service that enables interactive and batch-based access to Apache Spark via a REST API. It allows you to submit Spark jobs, manage running jobs, and track their status. Use Cases:

  1. Job Submission: Submit and manage Spark jobs from different clients (e.g., applications written in Node.js, Java, Python, etc.).
  2. Interactive Queries: Run interactive Scala, Python, R, and Spark SQL statements against a Spark cluster (an interactive-session sketch over REST follows this list).
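A sketch of the interactive path over REST (again using Python's requests for brevity; the endpoint is a placeholder):

```python
import time
import requests

LIVY_URL = "http://localhost:8998"   # placeholder: your Livy endpoint

# 1. Start an interactive PySpark session.
session_id = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()["id"]

# 2. Wait until the session is idle (ready to accept statements).
while True:
    state = requests.get(f"{LIVY_URL}/sessions/{session_id}/state").json()["state"]
    if state == "idle":
        break
    if state in ("error", "dead", "killed"):
        raise RuntimeError(f"Livy session failed: {state}")
    time.sleep(2)

# 3. Run a statement inside the remote Spark session.
stmt = requests.post(
    f"{LIVY_URL}/sessions/{session_id}/statements",
    json={"code": "spark.range(1000).count()"},
).json()

# 4. Poll for the statement result.
while True:
    result = requests.get(f"{LIVY_URL}/sessions/{session_id}/statements/{stmt['id']}").json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(2)

# 5. Clean up the session.
requests.delete(f"{LIVY_URL}/sessions/{session_id}")
```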

Parquet is a columnar storage file format optimized for complex data processing workloads. It is designed to be efficient in both I/O and storage space (a short PySpark read/write sketch follows this list). Use Cases:

  1. Storage Efficiency: Store large-scale datasets in a compressed and efficient manner.
  2. Query Performance: Optimize read operations for analytical queries by enabling selective scanning of columns.
  3. Interoperability: Facilitate data interchange across multiple big data tools like Spark, Hive, and Presto.
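A minimal PySpark sketch of writing and reading Parquet (paths and columns are placeholders); selecting two columns on read means only those column chunks are scanned:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "books", 12.50), ("bob", "games", 59.99)],
    ["user", "category", "amount"],
)

# Write compressed, columnar Parquet (snappy is the default codec).
df.write.mode("overwrite").parquet("/tmp/purchases.parquet")

# Reading back only two columns scans just those column chunks on disk.
spark.read.parquet("/tmp/purchases.parquet").select("category", "amount").show()

spark.stop()
```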

How They Complement Each Other:
In a large-scale data processing environment:

  • Apache Livy could be used to submit, manage, and monitor Spark jobs that perform various data processing tasks.
  • Parquet could serve as the format for the data being processed in those Spark jobs, where the jobs would read from or write to Parquet files to take advantage of the efficient columnar storage.

By integrating both, you can run efficient Spark jobs from a Node.js application via Livy and handle the actual data in an optimized format like Parquet.

Here is a sample project on GitHub. It shows how to set up and run a Spark and Livy environment using Docker, along with a Python script that processes Parquet files using PySpark. It also includes a Node.js application that interacts with Livy and writes data using parquetjs-lite.

A few key notes to consider about this educational project:

  • It is now up to you to extend the project with more complex tasks.
  • The repository contains a simple Node.js app that writes a Parquet file; you could grow it into a fuller microservice that listens to Kafka messages, processes the data, and persists it as Parquet files.
  • It also contains a simple Node.js API server for managing Spark jobs and a small Python script that runs one specific Spark job; from there you can build out more complex Spark jobs.

Using Spark with Python vs Node.js

1. Distributed Processing:

  • Spark: Designed for distributed computing, Spark can handle massive datasets distributed across many nodes in a cluster. This allows it to process large-scale data much more efficiently than a single-node Node.js application.
  • Node.js: The parquetjs library can read Parquet files in a single-threaded, single-node environment, which is not suitable for large-scale data processing.

2. Performance and Scalability:

  • Spark: Leverages data parallelism, task distribution, and optimized execution plans to process data quickly and efficiently. It’s tailored for performing complex transformations and aggregations on large datasets.
  • Node.js: While Node.js is great for handling asynchronous, I/O-bound operations, it isn’t optimized for CPU-bound tasks or large-scale data processing. Processing significant amounts of data would be slow and resource-intensive.

3. Fault Tolerance:

  • Spark: Provides built-in fault tolerance through lineage information and recovery mechanisms. If a node fails, lost partitions are recomputed from their lineage (or restored from a checkpoint when one exists).
  • Node.js: Lacks native error recovery mechanisms for distributed processing. If a failure occurs while processing large data, you would likely need to restart the entire process manually.

4. Advanced Processing Capabilities:

  • Spark: Supports complex processing operations including joins, aggregations, and machine learning algorithms through MLlib, all executed in a distributed manner.
  • Node.js: Limited to simpler data manipulation and lacks built-in support for advanced analytics or machine learning on large datasets.

5. Integration with Big Data Ecosystem:

  • Spark: Seamlessly integrates with Hadoop, HDFS, Hive, and other big data tools, allowing it to fit naturally into existing big data workflows.
  • Node.js: While it can interact with big data tools via REST APIs or libraries, it doesn’t natively fit into the big data ecosystem as smoothly as Spark does.

When Each is Appropriate:

  • Node.js (parquetjs): Suitable for lightweight, small-scale data operations, quick prototyping, or when you need to read and manipulate smaller Parquet files in a Node.js application.
  • Spark (PySpark): Ideal for large-scale data processing tasks, complex data transformations, and analytics, particularly when working within a distributed computing environment and big data ecosystem.

--

Saeed Vayghani

Software engineer and application architect. Interested in free and open source software.