Getting started with Apache Spark — I

Udbhav Pangotra
Published in Geek Culture · 5 min read · Jan 4, 2022

A series of articles to get you started on your Apache Spark journey!

Photo by imgix on Unsplash

Defining Big Data

With the gradual growth of distributed computing, computational power, and the multitude of storage options that have emerged in the last decade, the term big data has come into wide use. But what exactly is big data, and what data can be called big data?

Broadly, we can describe it using the three Vs:

Volume
The amount of data that is generated (measured in bytes, megabytes, gigabytes, terabytes, and so on).

Velocity
The speed at which data is generated: in real time, as streams, or in batches.

Variety
The type of data: structured or unstructured.

Datatypes and Sources

Data can come from a variety of sources and contain a variety of datatypes, from floats to characters to whole text files. Data can take many shapes and be ingested from many places. Some of the most prominent data sources are:

Applications data — Can contain transactional data, CRM data, customer data, employer data, employee data, etc., from internal applications or from certain public applications.

Logs and Monitoring — Mainly consists of events occurring on IoT devices, metadata from various applications, and run/crash logs from various applications. It can also include monitoring-check data.

Streaming sources and IoT sensors — Real-time feeds such as IoT device events, video streams, audio streams, and event streams.
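
To make the streaming idea concrete, here is a minimal PySpark Structured Streaming sketch. The broker address and topic name are hypothetical, and it assumes the spark-sql-kafka connector is available on the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-stream-demo").getOrCreate()

# Subscribe to a hypothetical Kafka topic carrying IoT device events.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "iot-events")                  # hypothetical topic
    .load()
)

# Print each micro-batch of raw events to the console as it arrives.
query = events.writeStream.format("console").start()
query.awaitTermination()
```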

Datatypes:
Data can take many forms, but it is mainly divided into three categories:
1. Structured Data
2. Semi-Structured Data
3. Unstructured Data

  • Structured data
    This is generally tabular data represented by rows and columns in a database. Databases that hold tables in this form are called relational databases. The mathematical term “relation” refers to a set of data organized as a table. In structured data, every row in a table has the same set of columns. SQL (Structured Query Language) is the language used to query structured data (a quick Spark sketch follows this list).
  • Semi-structured data
    This is information that is not stored in a relational database but still has some structure to it. Semi-structured data often consists of documents held in JavaScript Object Notation (JSON) format; it also includes key-value stores and graph databases.
  • Unstructured data
    This is information that is not organized in a pre-defined manner or does not have a pre-defined data model. Unstructured information is typically text-heavy, but it may also contain data such as numbers, dates, and facts. Videos, audio, and binary data files often have no specific structure and are referred to as unstructured data.
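
To make the first two categories concrete, here is a minimal PySpark sketch (the file paths are hypothetical) that reads a structured CSV file and a semi-structured JSON file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datatypes-demo").getOrCreate()

# Structured data: a CSV file with the same fixed set of columns in every row.
customers = spark.read.csv("data/customers.csv", header=True, inferSchema=True)
customers.createOrReplaceTempView("customers")
spark.sql("SELECT COUNT(*) FROM customers").show()  # query it with plain SQL

# Semi-structured data: JSON documents whose fields may vary per record.
events = spark.read.json("data/events.json")
events.printSchema()  # Spark infers a schema from the documents it finds
```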

Distributed Systems

A distributed system is one whose components are located on multiple machines in a network and communicate with each other by passing messages to achieve a common goal. It can be thought of as a single ant hill, which houses thousands of workers and silos. The failure of one component does not have a major effect on the whole system.

A client-server model diagram. Source: Wikimedia Commons

There are several advantages to distributed systems:

  • Horizontal Scaling
  • Reliability
  • Parallel computing
  • Higher performance
  • Flexibility
  • Openness

Along with these advantages, we also face some challenges:

  • Monitoring and Maintenance
  • Coordination, dependency, scheduling
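
To illustrate the parallel-computing advantage listed above with Spark itself, here is a minimal PySpark sketch that spreads a computation across the workers of a cluster (the partition count is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-sum-demo").getOrCreate()
sc = spark.sparkContext

# Distribute one million numbers over 8 partitions; each worker sums its own
# partitions independently, and Spark combines the partial results.
numbers = sc.parallelize(range(1, 1_000_001), numSlices=8)
print(numbers.sum())  # 500000500000
```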

HDFS (Hadoop Distributed File System)

HDFS has a master/slave server architecture. The master node manages the file namespace and regulates clients' access to files. The data nodes manage the storage attached to the nodes they run on; data is stored as files, and each file is split into data blocks of 128 MB by default. Each block is replicated three times by default so that the system is fault tolerant. The DataNodes send heartbeat signals to the NameNode periodically.

The above paragraph is a simplified version of what HDFS is and the architecture it follows. Let us now cover the terms used in it:

  • Master-Slave Architecture: Apache Hadoop HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java.
  • The NameNode works as the master in a Hadoop cluster. Listed below are the main functions performed by the NameNode:
    1. Stores metadata about the actual data, e.g. file name, path, number of data blocks, block IDs, block locations, number of replicas, and slave-related configuration.
    2. Manages the file system namespace.
    3. Regulates client access requests for the actual file data.
    4. Assigns work to the slaves (DataNodes).
    5. Executes file system namespace operations such as opening/closing files and renaming files and directories.
    6. As the NameNode keeps metadata in memory for fast retrieval, a large amount of memory is required for its operation. It should be hosted on reliable hardware.
  • The DataNode works as a slave in a Hadoop cluster. Listed below are the main functions performed by a DataNode:
    1. Actually stores the business data.
    2. This is the actual worker node where read/write/data processing is handled.
    3. Upon instruction from the master, it performs creation/replication/deletion of data blocks.
    4. As all the business data is stored on DataNodes, a large amount of storage is required for their operation. Commodity hardware can be used for hosting DataNodes.
  • Heartbeat is a signal sent periodically from a DataNode to the NameNode to indicate that the DataNode is alive. In HDFS, the absence of a heartbeat indicates a problem with that DataNode, and the NameNode stops assigning it any work.
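
To connect this back to Spark: when a Spark job reads a file from HDFS, it asks the NameNode where the blocks live and then pulls the bytes from the DataNodes, typically creating about one partition per block. A minimal PySpark sketch, assuming a hypothetical NameNode address and file path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-demo").getOrCreate()

# Read a (hypothetical) file stored in HDFS. The NameNode resolves the block
# locations; the DataNodes holding those blocks serve the actual data.
logs = spark.read.text("hdfs://namenode:9000/data/app/run.log")

# With a 128 MB block size, a large file yields roughly one partition per block.
print(logs.rdd.getNumPartitions())
```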

Hope this serves well as an introductory article to the concepts of Big Data and Spark! I will be posting subsequent articles and linking them below. Cheers!

Follow for more content!

Other articles you might be interested in:
- Part of this series:

Getting started with Apache Spark — I | by Sam | Geek Culture | Jan, 2022 | Medium
Getting started with Apache Spark II | by Sam | Geek Culture | Jan, 2022 | Medium
Getting started with Apache Spark III | by Sam | Geek Culture | Jan, 2022 | Medium

Misc: Streamlit and Palmer Penguins. Binged Atypical last week on Netflix… | by Sam | Geek Culture | Medium
- Getting started with Streamlit. Use Streamlit to explain your EDA and… | by Sam | Geek Culture | Medium

Cheers and do follow for more such content! :)

You can now buy me a coffee too if you liked the content!
samunderscore12 is creating data science content! (buymeacoffee.com)
