Spark — Beyond basics: Required Spark memory to process 100GB of data
Wanting more is not always good! Don’t believe me? Ask that Data Engineer who takes on tasks because he can’t say ‘NO’. (Oh sorry, did I hurt your feelings🙈🤐)
You can also ask that executor who just got 5 more tasks added to its queue because YOU can't figure out the number of executor cores required for your Spark job! 🙃
But seriously, let's say you are given the task of setting up a Spark cluster that should be able to process 100GB of data "efficiently".
How would you start?
Let me show you how the boss does it! 😏
STEP 1: Number of executor cores required?
We start by deciding how many executor cores we need 🤔
- One partition is 128MB in size by default (controlled by spark.sql.files.maxPartitionBytes for file-based sources), which is important to remember
- To calculate the number of cores required, you first have to calculate the total number of partitions you will end up with
- 100GB = 100*1024 MB = 102400MB
- Number of partitions = 102400/128 = 800
- Therefore, with one core per partition so that every task can run at once, 800 executor cores are required in total (quick sketch below)
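If you like to double-check such math in code, here is a tiny plain-Python sketch of the same arithmetic (the variable names are mine, just for illustration):

```python
# Back-of-the-envelope: how many cores for 100GB with 128MB partitions?
data_size_mb = 100 * 1024        # 100GB expressed in MB
partition_size_mb = 128          # Spark's default partition size
num_partitions = data_size_mb // partition_size_mb
total_cores = num_partitions     # one core per partition = full parallelism
print(num_partitions, total_cores)   # 800 800
```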
STEP 2: Number of executors required?
Now that we know how many cores we need, the next step is to figure out how many executors to spread them across.
- On average, it's recommended to have 2–5 cores per executor
- If we take the number of cores per executor = 4, then the total number of executors = 800/4 = 200
- Therefore, 200 executors are required to perform this task
Obviously, the number will vary depending on how many cores you keep in one executor 😉
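Continuing the same back-of-the-envelope Python sketch:

```python
# How many executors if we pack 4 cores into each one?
total_cores = 800            # from STEP 1
cores_per_executor = 4       # a pick from the recommended 2-5 range
num_executors = total_cores // cores_per_executor
print(num_executors)         # 200
```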
STEP 3: Total executor memory required?
Important step! How much memory should be assigned to each executor? 🤨
- As a rule of thumb (not a Spark default), each executor core should get about 4 times the default partition size = 4*128 = 512MB
- Therefore, total executor memory = number of cores * 512MB = 4*512MB = 2GB
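The same calculation in Python, with the 4x figure treated as a heuristic rather than a hard rule:

```python
# Memory sizing per executor, using the 4x-partition-size rule of thumb
partition_size_mb = 128
memory_per_core_mb = 4 * partition_size_mb       # 512MB per core
cores_per_executor = 4
executor_memory_mb = cores_per_executor * memory_per_core_mb
print(executor_memory_mb)                        # 2048MB, i.e. 2GB per executor
```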
TO SUMMARIZE: Total memory required to process 100GB of data
We are here! 🥳 Let's finalize the total memory required to process 100GB of data
- Each executor has 2GB of memory
- We have total of 200 executors
Therefore, a minimum of 400GB of total memory is required to process 100GB of data completely in parallel.
Meaning, all the tasks will run in parallel 😲
Say it takes 5 mins to run one task. How much time will it take to process 100GB of data? The answer is 5 mins!! Since all 800 tasks run at the same time, the whole job takes only as long as one task.
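Before the bonus step, here is what wiring these numbers into a session could look like. This is a minimal sketch: the config keys are standard Spark properties, the app name is a placeholder, and spark.executor.instances is only honored when dynamic allocation is disabled (on YARN/Kubernetes you would normally pass these via spark-submit instead):

```python
from pyspark.sql import SparkSession

# Minimal sketch: the sizing we derived, expressed as Spark properties.
# Executor sizing must be fixed before the application starts; it cannot
# be changed on an already-running session.
spark = (
    SparkSession.builder
    .appName("process-100gb")                   # placeholder app name
    .config("spark.executor.instances", "200")  # STEP 2: 200 executors
    .config("spark.executor.cores", "4")        # STEP 1/2: 4 cores each
    .config("spark.executor.memory", "2g")      # STEP 3: 2GB each
    .getOrCreate()
)
```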
BONUS STEP: What should be the driver memory?
- This depends on your use case.
- If you do df.collect(), then ~100GB of driver memory would be required, since all of the executors will send their data to the driver
- If you just export the output somewhere to cloud/disk, then driver memory can ideally be 2 times the executor memory = 4GB
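A short PySpark sketch of the two scenarios (the bucket paths and the input dataset are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-memory-demo").getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")   # hypothetical 100GB dataset

# Anti-pattern at this scale: collect() ships every partition to the driver,
# so the driver would need roughly the full 100GB of memory.
# rows = df.collect()

# Usual pattern: executors write their partitions straight to storage;
# the driver only coordinates, so a modest ~4GB of driver memory is enough.
df.write.mode("overwrite").parquet("s3://my-bucket/output/")
```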
And that my friend, is how you efficiently process 100GBs of data 😉
One thing to keep in mind: this is an ideal setup, which can easily be toned down to fit the budget of a project 😇
If the project has a tight budget, you can cut the number of executors to a half or a quarter; each core will then process multiple partitions one after another, so the time taken will definitely increase.
If you liked the blog, please clap 👏 so it reaches all the Data Engineers out there.
Thanks for reading! 😁