Apache Spark's Underlying Mechanism
This article introduces Apache Spark and the underlying mechanisms it uses to improve computation time and handle big data efficiently. It is a kick-off article for those who are just starting to work with Big Data.
Introduction to Apache Spark
Spark is essentially a framework that uses multiple loosely coupled machines (nodes) to distribute the load of computation and the in-memory storage of big data. The phrase "loosely coupled" is used here because Spark is a distributed system, so there is minimal interaction between the machines (nodes). Every node has its own RAM, which is not shared with the other machines in a loosely coupled environment. Spark is an open-source framework and currently provides APIs for Python, Scala, Java, and R.
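To make this concrete, here is a minimal, hedged sketch of starting Spark from the Python API (PySpark). The application name and the local[*] master are just placeholder values for a single-machine setup; on a real cluster the master would point to the cluster manager instead.

```python
from pyspark.sql import SparkSession

# "local[*]" runs Spark on this machine using all available cores;
# on a real cluster this would point to the cluster manager instead.
spark = (
    SparkSession.builder
    .appName("intro-example")   # placeholder application name
    .master("local[*]")
    .getOrCreate()
)

print(spark.version)  # confirm the session is up
spark.stop()
```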
Distributed and Parallel Computing
A few questions naturally arise for professionals who are not very familiar with parallel and distributed computing. At a high level, in parallel processing the RAM of a single node is shared by multiple processors, whereas in distributed computing each node has its own RAM, and each node may have one processor or many. When the data is small enough to fit in the RAM of a single node, distributed computing is not required. The main purpose of a distributed system is scalability: storing and computing on big data (which cannot fit in a single node's RAM and therefore needs multiple nodes) efficiently, that is, with fast computation and sensible memory use.
To perform distributed computing we need a framework, and Spark is one of the most widely used because it provides fast computation along with built-in services for many big-data operations.
Mechanism: The Lazy Evaluation Technique and In-Memory Computation
Apart from scalability, another important purpose of Spark is to increase computation speed. Two mechanisms contribute to this faster computation: in-memory computation and lazy evaluation.
In-memory computation: Computation speed depends on how fast data is fed to the processor, and data in RAM is fed at a much faster rate than data on the hard disk. Spark therefore does in-memory computation: it keeps as much data as possible in memory rather than on the hard disk during computation. If the data is too large to fit in the memory of all the nodes, Spark is compelled to put the rest of the data on the hard disk.
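As a small sketch of this spill-to-disk behaviour (the dataset below is synthetic and the size is only illustrative), persisting with the MEMORY_AND_DISK storage level tells Spark to keep as many partitions as possible in RAM and put the rest on local disk:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("memory-and-disk").getOrCreate()

# A synthetic "big" DataFrame; in practice this would be your real data.
df = spark.range(0, 100_000_000)

# Keep as many partitions as possible in memory and spill the rest to disk.
df.persist(StorageLevel.MEMORY_AND_DISK)

print(df.count())  # the first action materializes (and persists) the data
```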
Lazy evaluation also plays an important role in faster computation.
A Spark operation is divided into two kinds of steps: 1) transformations and 2) actions.
The key role here is played by transformations. Unlike pandas or R data frames, where each processing step executes immediately and creates an object instance in memory, Spark does not execute transformation steps until an action is called; instead, it remembers the required transformation steps. A transformation takes an RDD as input and produces one or more RDDs as output. RDDs are immutable, which means they cannot be modified but can be transformed into new RDDs, so each transformation step creates a different RDD.
This helps with both the space and the time trade-off: Spark combines simple transformations, avoids intermediate ones, and does not create unnecessary objects (RDDs) in between. For example, if you need to apply a filter on RDD1 → RDD2 and then 'add a column' on RDD2 → RDD3, Spark will not create the intermediate RDD2 when the action is called; it will only create the final RDD3.
Actions are the actual execution steps where we can see results: when an action is called, Spark's internal mechanism executes all the previous transformation steps right away. Examples of action steps are show(), count(), fit(), etcetera.
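Here is a minimal PySpark sketch of the same idea, using the DataFrame API (the column names and sample rows are made up for illustration). The filter and the 'add a column' step are transformations, so nothing runs on the cluster until show() is called; Spark then executes the whole plan in one go without materializing the intermediate result.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])

# Transformations: Spark only records the plan, nothing is executed yet.
df2 = df1.filter(F.col("value") > 1)                 # the "filter" step
df3 = df2.withColumn("double", F.col("value") * 2)   # the "add a column" step

# Action: the filter and the new column are computed together,
# without creating the intermediate df2 as a separately materialized result.
df3.show()
```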
Caching of Data in Spark
Caching is an optimization technique for iterative and interactive computations. Caching saves interim, partial results so that they can be reused in subsequent stages of computation. These interim results are stored as RDDs (Resilient Distributed Datasets) and are kept either in memory (by default) or on disk. Data cached in memory is faster to process, of course, but cache space is always limited.
Say, for example, you have a Spark job running on 10 nodes and it is going to finish in one hour today. But your boss needs it to finish in under 15 minutes. How would you go about that?
If you cache all of your RDDs, you will run out of memory. You cannot always cache a dataset to improve performance; you need to consider the RAM of the cluster (all nodes) before caching, so that the RAM is not entirely filled with data, since a good chunk of it is also required for processing the data. So caching is how you do 'in-memory computation', but cache space is limited: computation time can be reduced, yet whether it drops from one hour to 15 minutes on the current cluster setup (say, 10 nodes) is not guaranteed. In favourable cases, caching can provide a 10x to 100x improvement in performance.
Caching is suggested when (a minimal sketch follows after this list):
· an RDD is reused in iterative machine learning applications (for example, inside a for-loop);
· an RDD needs to be used multiple times in one Spark application;
· the cost of recomputing the RDD is expensive.
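As a minimal sketch of the iterative-reuse case (the data, column name, and thresholds below are made up for illustration), caching a DataFrame that is read several times in a loop means each pass after the first reads from memory instead of recomputing it:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# A synthetic DataFrame standing in for an expensive-to-compute result.
features = spark.range(0, 1_000_000).withColumn("x", F.rand(seed=42))

features.cache()   # mark it for caching; partitions are kept in memory when they fit
features.count()   # run one action so the cache is actually populated

# Iterative reuse: each pass now reads the cached data instead of recomputing it.
for threshold in (0.25, 0.5, 0.75):
    print(threshold, features.filter(F.col("x") > threshold).count())

features.unpersist()  # free the cached data when it is no longer needed
```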
Conclusion
Spark provides plenty of tools to tame big data, but it needs to be used efficiently to get the best results out of it. Since dealing with big data is a time-consuming and very expensive process, it is very important to know the fundamentals of the underlying mechanism of Apache Spark.