Getting Started with Apache Spark

İrem Fırat
Oct 22, 2023


Apache Spark is a fast, open-source distributed computing system for large-scale data processing. It uses in-memory caching and optimized query execution to run fast queries against data of any size.
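
For example, a dataset can be cached in memory so that repeated queries avoid recomputation. A minimal PySpark sketch (the dataset here is just an arbitrary range of numbers):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingDemo").getOrCreate()

df = spark.range(0, 10_000_000)  # an arbitrary 10M-row DataFrame
df.cache()                       # keep it in memory after the first use

print(df.count())  # first action: computes the result and fills the cache
print(df.count())  # second action: answered from the in-memory cache
```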

>> Apache Spark has many ready-made libraries, and these libraries can be used from programming languages such as Python, Java, Scala, and R, as well as from SQL.

>> Apache Spark is written in Scala and has its own ecosystem.

>> Thanks to Spark Streaming, one of Apache Spark’s libraries, live and continuously produced data can be processed; another built-in library, MLlib, is used to perform machine learning operations.

>> Databricks is a contributor to the Apache Spark codebase; it provides an optimized version of Spark, offers interactive notebooks, and has the full enterprise security any large organization would need.

Apache Spark components

· The Apache Spark ecosystem consists of Spark SQL and DataFrames for structured data, MLlib for machine learning, Spark Streaming for real-time data, and GraphX for understanding graph-structured data at scale.

Spark Core API: includes the core functionality on which structured data processing, machine learning, graph processing, and streaming are built, as well as task scheduling, memory management, fault recovery, interaction with storage systems, and more.
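
As a small illustration of the core (RDD) API, here is a minimal word-count sketch run locally; the input strings are made up, and the SparkContext handles scheduling and memory behind the scenes:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "CoreWordCount")

# Distribute a small in-memory collection as an RDD and apply the
# low-level transformations that Spark Core provides.
lines = sc.parallelize(["spark is fast", "spark is distributed"])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())  # e.g. [('spark', 2), ('is', 2), ('fast', 1), ...]
sc.stop()
```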

Spark SQL allows interacting with data using SQL queries. It provides an environment for combining SQL queries with the programmatic data manipulations supported by Spark’s APIs (Python, Java, Scala, and so on) within a single application, mixing SQL with complex analytics.
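
A hedged sketch of what this looks like in PySpark, mixing a SQL query with DataFrame-API manipulation in one application (the table, names, and ages are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("SqlDemo").getOrCreate()

people = spark.createDataFrame(
    [("Ali", 34), ("Ayşe", 28), ("Mehmet", 45)],
    ["name", "age"])
people.createOrReplaceTempView("people")

# The same data can be queried with SQL...
adults = spark.sql("SELECT name, age FROM people WHERE age >= 30")

# ...and then manipulated programmatically with the DataFrame API.
adults.withColumn("age_next_year", F.col("age") + 1).show()
```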

Spark Streaming

Spark Streaming provides the ability to process and analyze new data streams in real time, as well as to aggregate data.

Data can be ingested from many sources and processed using complex algorithms expressed with high-level functions such as map, reduce, and join. Processed data can then be pushed out to file systems and databases.
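
A minimal sketch of the classic DStream word count, assuming a text source on localhost port 9999 (for example, one started with `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Word counts over a live text stream, in 5-second micro-batches.
sc = SparkContext("local[2]", "NetworkWordCount")  # 2 threads: receiver + work
ssc = StreamingContext(sc, batchDuration=5)

lines = ssc.socketTextStream("localhost", 9999)  # assumed text source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts to stdout

ssc.start()
ssc.awaitTermination()
```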

MLlib (Machine Learning)

It is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy.

Some algorithms supported by MLlib are listed below; a short training sketch follows the list.

· Classification Algorithms

- Logistic Regression

- Decision Tree

- Random Forest

- Gradient Boosted Tree

- Multilayer Perceptron

- Linear Support Vector Machine

- Naive Bayes

· Regression

- Linear Regression

- Decision Tree

- Random Forest

· Clustering

- K-Means Clustering

- Bisecting K-Means

- Gaussian Mixture Model
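
For instance, a logistic regression classifier can be trained in a few lines. A minimal sketch on a tiny hand-made dataset:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibDemo").getOrCreate()

# A tiny hand-made dataset of (label, features) rows.
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([2.2, 0.9]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
model.transform(train).select("label", "prediction").show()
```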

GraphX (Graph Computation)

GraphX is Spark’s graph computation engine; it allows users to build and interactively run graph-based computations. An example is the analysis of friend networks on social media.
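
GraphX itself exposes a Scala API; from Python, a comparable DataFrame-based graph API is provided by the separate GraphFrames package. A minimal sketch, assuming graphframes is installed alongside PySpark and using a made-up friend network:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # separate package, not bundled with Spark

spark = SparkSession.builder.appName("FriendGraph").getOrCreate()

# A toy friend network: vertices need an "id" column,
# edges need "src" and "dst" columns.
vertices = spark.createDataFrame(
    [("a", "Ali"), ("b", "Ayşe"), ("c", "Mehmet")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # how many friendship edges point at each person
```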

>> To summarize on a diagram: (figure: a summary of the Spark ecosystem components described above)

Apache Spark Architecture

(Figure: Spark’s cluster structure)

· The Driver runs the main program of an application and creates the SparkContext. It distributes and schedules the work among the Executors.

· The Spark Driver and the SparkContext together monitor job execution within the cluster. The Driver works with the Cluster Manager to manage various other jobs.

· Executors, located on the Worker nodes, are the processes where the data is stored and the work assigned by the Driver is performed; they report the status of the computation back to the Driver.

· The Executors allocated to an application remain alive for as long as the program runs, and they run tasks in multiple threads.

· The Cluster Manager acts as the task or resource manager. When a Spark application is submitted, the Cluster Manager grants resources to it so the job can be completed. The work is then split into multiple smaller tasks that are distributed across the worker nodes.

· Spark is a distributed system: for the workers to operate in parallel, Spark needs to split the data into chunks, called partitions.

· A partition is a chunk of data stored on a single machine in the cluster. A DataFrame’s partitioning is therefore how its data is physically distributed across the cluster of machines during execution; the sketch below shows how to inspect it.
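
A minimal sketch of inspecting and changing the number of partitions (the row count and partition counts here are arbitrary choices):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")          # 4 local worker threads
         .appName("PartitionsDemo")
         .getOrCreate())

df = spark.range(0, 1_000_000)        # an arbitrary 1M-row DataFrame
print(df.rdd.getNumPartitions())      # how many chunks the data is split into

repartitioned = df.repartition(8)     # redistribute into 8 partitions
print(repartitioned.rdd.getNumPartitions())  # 8
```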
