Guide and Tips for Apache Spark 3.0/2.4 Databricks Certification Preparation

Anoop Malayamkumarath
5 min read · Aug 6, 2020

--

Apache Spark Certification

Lightning-fast cluster computing

Databricks recently changed the pattern for its Spark certifications. One of the major changes is the question format: it is now multiple choice.

Since the new pattern was released only recently, there are not many Spark certification practice materials or guidelines available, and whatever is available online is outdated. I recently cleared both versions, so I thought I would share some insights.

Databricks Certified Associate Developer for Apache Spark 3.0/2.4

The Spark 3.0 certification was newly released by Databricks in June 2020. It consists of 60 questions framed mostly around the DataFrame API. One of the best books you can refer to for clearing the certification is Spark: The Definitive Guide. The following are the specific chapters you need to cover for this exam: Chapters I, II, and IV.

Exam Details:

The exam consists of 60 multiple-choice questions, and you have to score 70%, i.e. 42 marks out of 60, to pass. You will have 120 minutes to complete the exam. There are no free re-attempts for either version. The Apache Spark 3.0/2.4 API documentation is provided during the exam in the form of a PDF. It is important to know the functions and their corresponding packages so you can navigate it quickly; otherwise, you might have difficulty finding the relevant functions while writing the exam.

The exam covers theoretical knowledge, DataFrame API functions, and a couple of scenario-based questions. It helps greatly if you have some hands-on experience with the DataFrame API. The exam is available in both Scala and Python, and the format is the same for both.

Topic Details:

The following are the topics relevant for the exam. This should give you an idea of which topics to concentrate on more.

  1. Spark Architecture [Around 10 Questions] (A) You should have basic knowledge of the architecture. (B) Understand the roles of slots, threads, the driver, executors, stages, nodes, memory, jobs, etc.
  2. Garbage Collection [1 Question]
  3. Coalesce and repartitions [2 or 3 Questions]
  4. Cache and Persist [3 or 4 Questions]
  5. Read and Write parquet/text/JSON file [4 Questions]
  6. Data frame Joins [2 or 3 Questions]
  7. Scenario-based [2 Questions] (A) A cluster configuration [nodes, driver, memory, executors] is provided, and you choose the best option for the scenario. (B) Understand worker node failure behavior under different configurations. (C) Identify which option is likely to result in the greatest amount of shuffling, etc.
  8. Transformation and Action [1 or 2 Questions]
  9. Lazy evaluation [1 or 2 Questions]
  10. Deployment Mode: Cluster/Client [1 or 2 Questions]
  11. Register UDF [2 Questions]
  12. Broadcast [2 or 3 Questions]. This is not asked in the 2.4 version.
  13. Spark SQL [2 Questions] [createOrReplaceTempView or UDF on spark sql]
  14. Partition [3 to 4 Questions]
  15. Syntax related questions [15 to 20 Questions]

Spark Functions

There are many functions available in the DataFrame API now. Going through all of them is tough and takes a lot of time. For everyone’s convenience, I have put together all of the functions that are asked in the exam for you to understand and practice. Concentrate on the syntax and the specific arguments of these functions. I have categorized the functions below to make them easier to find.

You can expect one or more questions on each function from the list below.

Actions: collect, count, first, head, show, take, toLocalIterator

Typed Transformations: coalesce, distinct, dropDuplicates, filter, limit, orderBy, repartition, sample, select, sort, union, unionAll, where

Untyped Transformations: agg, apply, col, drop, groupBy, join, select, withColumn, withColumnRenamed, crossJoin, register, sql

Aggregate Function: approx_count_distinct, count, first, mean, variance, stddev

Collection: explode

Column: asc, desc, cast

Date and Time Function: months, unix_timestamp, from_unixtime

Non Aggregate Function: broadcast, coalesce, col, lit

Sorting Function: asc, desc

String Function: split, regexp_replace

UDF Function: udf

DataFrameReader: text, parquet, load, textFile, json, option, format

DataFrameNaFunctions: na.fill, na.drop

DataFrame Functions: printSchema, createOrReplaceTempView, cache, persist

Configuration: Understand the difference between these configurations: [spark.sql.shuffle.partitions, spark.default.parallelism, spark.sql.autoBroadcastJoinThreshold]

You should be well versed with the syntax of: select, filter, withColumn, withColumnRenamed

Sample Question:

Q: Which of the following code blocks returns a DataFrame with a new column aSquared and all previously existing columns from DataFrame df?

Options:

df.withColumnRenamed("aSquared", "aSquared")
df.withColumn(col("aSquared"), col("aSquared"))
df.withColumnRenamed("aSquared", col("aSquared"))
df.withColumn("aSquared", col("aSquared"))
df.withColumn(col("aSquared"), "aSquared")

You can exclude the following topics upfront:

  1. RDD
  2. Dataset
  3. Streaming

My Experience:

The one I attended was the online proctored exam. Due to the pandemic, they have removed the requirement of having an external camera. My overall experience with the online exam was good. There was one interruption when my Wi-Fi went offline partway through, so I had to call support and restart since the exam had stopped. Otherwise, everything went smoothly.

If you are not 100% sure of an answer, you can mark it as “Review later”. I had some time to review those questions at the end.

The options are very similar to one another, and identifying the correct one is the challenge, especially on the syntax-related and architecture questions. Be thorough with the syntax and the basic architecture.

Exam Tips:

1) Understand and practice all the DataFrame functions [aggregate, collection, date and time, non-aggregate, sorting, string, UDF functions, etc.] and the actions and transformations listed above. You should know the syntax well enough to recognize the differences between the multiple-choice options.

2) Do not miss any of the chapters specified above in Spark: The Definitive Guide.

Databricks may change this pattern frequently, as we have seen before. Please check their website for the latest updates before you appear for the exam.

About Me:

Data enthusiast with strong attention to detail, specializing in applying analytical techniques to build scalable and efficient big data pipelines and create data insights that help businesses achieve their goals.

Feel free to connect with me on LinkedIn for any further questions. Happy to help!

Good luck!
