Force caching Spark DataFrames
Caching a DataFrame in Spark (via df.cache() or df.persist(level)) is lazy: the DataFrame is not actually cached until you trigger an action on it. In addition, Spark automatically persists shuffle output and reuses it across stages, which can cause out-of-memory errors if you overlook this behavior.