Big salary? Take a look at PySpark basic questions!

Rohit Veryani
Published in ILLUMINATION
1 min read · Nov 21, 2022

With each passing day, Python gains traction and remains one of the most sought-after programming languages, thanks to its easy-to-understand, clean syntax and ever-growing community support. Riding this momentum, Spark integrated its API with Python, commonly known as PySpark.

Spark can process huge volumes of data in parallel, and PySpark makes that parallel processing accessible to the Python community. Given the surging demand for PySpark developers, below are a few questions that should come in handy in the first go:

  1. Spark architecture? Cluster types, modes, and spot instances? Mounting storage? Job vs Stage vs Task?
  2. Actions vs Transformations? Directed Acyclic Graphs (DAGs)? Lazy evaluation?
  3. RDD vs DataFrame vs Dataset? Parquet file vs Avro file?
  4. StructType vs StructField? Delta Lake? Time travel?
  5. Syntax errors vs exceptions?
  6. startsWith() vs endsWith()? withColumn vs select vs withColumnRenamed? map vs flatMap? Why do we use ‘literals’?
  7. .collect()? show vs display? How to display the full values of a column?
  8. Create an RDD from a list? Create an RDD from a text file? current_date vs current_timestamp?
  9. Reading and writing a file? Creating an empty DataFrame?
  10. Convert a DataFrame to an RDD and an RDD to a DataFrame?
  11. Broadcast variables, explode, coalesce, and repartition?
  12. Merge or union two DataFrames with a different number of columns?
  13. Iterate through each row of a DataFrame in PySpark?
  14. How to handle NULL values?

I hope the questions listed above help with your interview preparation! Happy Learning!!

Cheers!

rohitveryani@gmail.com



I am a tech enthusiast who loves to experiment and is fond of implementing what I have learned. I have 9+ years of experience in the Analytics and Data Engineering domains.