Spark — Beyond basics: Required Spark memory to process 100GB of data
Wanting more is not always good! Don’t believe me? Ask that Data Engineer who takes on tasks because he can’t say ‘NO’. (Oh sorry, did I hurt your feelings🙈🤐)
You can also ask that executor who just got 5 more tasks added to its queue because YOU can't figure out the number of executor cores required for your Spark job! 🙃
But seriously, let's say you are given the task of setting up a Spark cluster that should be able to process 100GB of data "efficiently".
How would you start?
Let me show you how the boss does it! 😏
STEP 1: Number of executor cores required?
We start by deciding how many executor cores we need 🤔
- One partition is 128MB in size by default (controlled by spark.sql.files.maxPartitionBytes for file-based sources), which is important to remember
- To calculate the number of cores required, you first have to calculate the total number of partitions you will end up with
- 100GB = 100*1024 MB = 102400MB
- Number of partitions = 102400/128 = 800
- Therefore, with one core per partition so that every task can run at once, 800 executor cores are required in total (quick sketch below)
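If you like to double-check such math in code, here is a tiny plain-Python sketch of the same arithmetic (the variable names are mine, just for illustration):

```python
# Back-of-the-envelope: how many cores for 100GB with 128MB partitions?
data_size_mb = 100 * 1024        # 100GB expressed in MB
partition_size_mb = 128          # Spark's default partition size
num_partitions = data_size_mb // partition_size_mb
total_cores = num_partitions     # one core per partition = full parallelism
print(num_partitions, total_cores)   # 800 800
```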
STEP 2: Number of executors required?
Now that we know how many cores we need, the next step is to figure out how many executors to spread them across.
- On average, it's recommended to have 2–5 cores per executor
- If we take the number of cores per executor = 4, then the total number of executors = 800/4 = 200
- Therefore, 200 executors are required to perform this task
Obviously, the number will vary depending on how many cores you keep in one executor 😉
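Continuing the same back-of-the-envelope Python sketch:

```python
# How many executors if we pack 4 cores into each one?
total_cores = 800            # from STEP 1
cores_per_executor = 4       # a pick from the recommended 2-5 range
num_executors = total_cores // cores_per_executor
print(num_executors)         # 200
```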
STEP 3: Total executor memory required?
Important step! How much memory should be assigned to each executor? 🤨
- As a rule of thumb (not a Spark default), each executor core should get about 4 times the default partition size = 4*128 = 512MB
- Therefore, total executor memory = number of cores * 512MB = 4*512MB = 2GB
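The same calculation in Python, with the 4x figure treated as a heuristic rather than a hard rule:

```python
# Memory sizing per executor, using the 4x-partition-size rule of thumb
partition_size_mb = 128
memory_per_core_mb = 4 * partition_size_mb       # 512MB per core
cores_per_executor = 4
executor_memory_mb = cores_per_executor * memory_per_core_mb
print(executor_memory_mb)                        # 2048MB, i.e. 2GB per executor
```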
TO SUMMARIZE: Total memory required to process 100GB of data
We are here! 🥳 Let's finalize the total memory required to process 100GB of data
- Each executor has 2GB of memory
- We have total of 200 executors
Therefore, a minimum of 400GB of total memory is required to process 100GB of data completely in parallel.
Meaning, all the tasks will run in parallel 😲
Say it takes 5 mins to run one task. How much time will it take to process 100GB of data? The answer is 5 mins!! Since all 800 tasks run at the same time, the whole job takes only as long as one task.
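Before the bonus step, here is what wiring these numbers into a session could look like. This is a minimal sketch: the config keys are standard Spark properties, the app name is a placeholder, and spark.executor.instances is only honored when dynamic allocation is disabled (on YARN/Kubernetes you would normally pass these via spark-submit instead):

```python
from pyspark.sql import SparkSession

# Minimal sketch: the sizing we derived, expressed as Spark properties.
# Executor sizing must be fixed before the application starts; it cannot
# be changed on an already-running session.
spark = (
    SparkSession.builder
    .appName("process-100gb")                   # placeholder app name
    .config("spark.executor.instances", "200")  # STEP 2: 200 executors
    .config("spark.executor.cores", "4")        # STEP 1/2: 4 cores each
    .config("spark.executor.memory", "2g")      # STEP 3: 2GB each
    .getOrCreate()
)
```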
BONUS STEP: What should be the driver memory?
- This depends on your use case.
- If you do df.collect(), then ~100GB of driver memory would be required, since all of the executors will send their data to the driver
- If you just export the output somewhere to cloud/disk, then driver memory can ideally be 2 times the executor memory = 4GB
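A short PySpark sketch of the two scenarios (the bucket paths and the input dataset are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-memory-demo").getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")   # hypothetical 100GB dataset

# Anti-pattern at this scale: collect() ships every partition to the driver,
# so the driver would need roughly the full 100GB of memory.
# rows = df.collect()

# Usual pattern: executors write their partitions straight to storage;
# the driver only coordinates, so a modest ~4GB of driver memory is enough.
df.write.mode("overwrite").parquet("s3://my-bucket/output/")
```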
And that my friend, is how you efficiently process 100GBs of data 😉
One thing to keep in mind: this is an ideal setup, which can easily be toned down to fit the budget of a project 😇
If the project has a tight budget, you can cut the number of executors to a half or a quarter; each core will then process multiple partitions one after another, so the time taken will definitely increase.
If you liked the blog, please clap 👏 so it reaches all the Data Engineers out there.
Thanks for reading! 😁