How to pass GCP Professional Data Engineer exam in 2 months

Without any cloud experience

Ting Hsu
9 min read · Mar 7, 2020

I passed my GCP professional data engineer exam mid-December 2019. A few people have reached out to me to ask for advice so I’m going to share my experience here and hope that this can help you nail the exam! Good luck :)

Before diving into the exam prep…

First, I’m going to briefly touch on my background, profession, and motivation for taking the GCP data engineer exam. I currently work at Deloitte in the enterprise technology consulting space and have an academic background in data science. Prior to studying for this exam, I had no experience with any cloud platform. I had some idea of what data engineering is, and a distributed databases unit at uni introduced me to implementing data pipelines with Apache Kafka, MongoDB, and PySpark, and to streaming visualisation with Python.

My manager at work suggested that I look into the GCP data engineer exam. I thought it sounded interesting, and I love learning new things, so why not!

What I didn’t know at that time was… this journey helped me find my passion for data engineering! Since then, I’ve been going to various meetups in Melbourne and upskilling myself, but hey, that’s a story for another time :)

Let’s have a look at how to prepare for it

Goal setting

“Do I need the certification?”

“Why am I taking this exam?”

“What knowledge/experience do I have that can be transferred across?”

“When am I taking this exam? How much time am I willing to commit to study?”

“What does the GCP Professional Data Engineer exam cover?”

First things first: ask yourself these important questions, because they help you plan your study schedule and set your objectives. As you can see in the official exam guide, there is a lot to learn (batch processing, streaming, Dataproc, Apache Beam, BigQuery, ML/AI, and more), so goal setting allows you to plan and prioritise.

Setting goals!

For example, my goals were:

  1. Have a solid understanding of the concepts and design of each product
  2. Apply my learning to design data pipelines to meet business requirements (performance, cost, availability, latency, scalability, security, integration, etc)

As I had experience in data science, I only needed to spend time learning the products, and could allocate the time saved to other topics.

Recommended resources

I used a range of different resources — online courses, official documentation, blogs, and Medium articles. As I intend to work as a data engineer professionally, I also looked for real-world use cases to understand how implementing data pipelines solves business problems or creates value.

Out of all the resources, Linux Academy and the Google documentation helped me the most. I’d say the Google documentation was my favourite because it explains the design, concepts, and best practices of each product clearly and in detail (thanks to my friend Hamza from Servian who suggested this resource!).

  1. Udemy course (paid, under $30 depending on the discounts) — This was recommended by a friend (thanks Hamza!)
  2. Linux Academy (paid) — They have a 1-week free trial!
  3. Google Official Documentation

Highly recommend it! It was like a massive playground where I let my curiosity take me to whatever I didn’t understand or found interesting. This helped me gain in-depth knowledge about each product.

4. Medium

There are some awesome articles about how to pass the exam here on Medium. I recommend using them as a quick overview. As products and features have changed over time, the official documentation is your best friend for the latest information.

5. Search for products on Google — Learn from real-world use cases and practical tips.

6. A good blog post by Dmitri Lerko

7. Google Cloud Blog

Good for getting the latest information and understanding how different organisations implement GCP.

8. GitHub — Search for the product name (Dataflow, BigTable, etc.) to see implementations and code

9. Google ML Crash Course — A refresher on ML concepts such as overfitting, variance, bias, etc.

You might notice that I didn’t include Coursera’s free courses. I did go through all 4 of them but felt that the content didn’t suit my learning style, as it was more conceptual/theoretical. But hey, give it a go! It might work really well for you.

My approach

Apart from following the course syllabus on Linux Academy and Udemy, I created my own strategy for learning about data engineering on GCP.

  1. Understand the Google ecosystem
  2. Understand each product by reading through documentation
  • Purpose & Features — What is each product (not) designed for?
  • Architecture — How does it work?
  • Best Practices — How to achieve optimal outcomes?
  • Potential Issues — Why? How to identify and overcome?

3. Compare and contrast — Why/when would you use one over the other?

  • Cloud Storage vs. BigQuery Storage
  • Batch processing vs. Streaming processing
  • AutoML vs. ML API
  • Etc…

4. Hands-on practice

  • Cloud Shell commands
  • Build a real-time PubSub streaming pipeline for on-street parking in the City of Melbourne — I picked on-street parking because the data is open and easy to access; you can use any dataset, or even simulate your own streaming data (see the sketch after this list)!
  • Write and execute Python code for Dataproc
  • Qwiklabs — I only did labs when I needed to see tangible results or the flow of execution.
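
If you want to try something similar, below is a minimal sketch of the publishing side using the google-cloud-pubsub Python client. The project ID, topic name, and event fields are hypothetical placeholders, not the real City of Melbourne schema.

```python
import json
import time

from google.cloud import pubsub_v1

# Hypothetical project and topic names; replace with your own.
PROJECT_ID = "my-gcp-project"
TOPIC_ID = "parking-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

def publish_event(event: dict) -> None:
    """Publish one parking-bay event as a JSON-encoded Pub/Sub message."""
    data = json.dumps(event).encode("utf-8")
    future = publisher.publish(topic_path, data)
    future.result()  # block until Pub/Sub acknowledges the message

# Simulated events (made-up field names) standing in for the live parking feed.
for bay_id, status in [("1001", "Present"), ("1002", "Unoccupied")]:
    publish_event({"bay_id": bay_id, "status": status, "ts": time.time()})
```

On the consuming side, a Dataflow job (or a simple subscriber) can pull from the matching subscription and write to BigQuery or BigTable for analysis.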

Topics to study

Now we’re moving on to the specific topics — storage, processing, machine learning, security, monitoring, real-time messaging service, workflow management, and others.

The key takeaway here is: the exam is testing you on your ability to design a data pipeline based on business requirements — one size doesn’t fit all. For example, BigTable is a highly scalable storage solution. But if you’re asked to design a pipeline to support transactional data and latency isn’t a concern, is this the best option? What might be the best option and why? Hence, it’s important to know when to use what and why.

Storage

  1. BigQuery — more than 20% of the questions were about BigQuery (directly or indirectly)

2. Cloud Storage

  • When to use which storage option: standard, nearline, coldline?
  • Staging area for data analytics
  • Store all types of data
  • How to continuously sync data between local and CS? (What commands to use?)

3. Datastore

4. BigTable

5. Cloud SQL

  • Lift and shift on-prem relational databases
  • Read replica

6. Cloud Spanner

  • Scalability and strong ACID

Processing

7. Dataproc

8. Dataflow

  • Runs Apache Beam programs
  • How to handle and process invalid inputs? (see the sketch after this list)
  • Side inputs — lookup table from BigQuery
  • Understand PCollections, Transforms
  • Pipeline I/O available for BigQuery, Datastore, etc.
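
To make PCollections, Transforms, and invalid-input handling more concrete, here is a rough Beam sketch in Python. The bucket, project, and table names are hypothetical, runner options are omitted, and the BigQuery side-input pattern mentioned above is left out to keep it short.

```python
import json

import apache_beam as beam

# Hypothetical input/output names for illustration only.
RAW_FILE = "gs://my-bucket/raw/events.json"
OUTPUT_TABLE = "my-gcp-project:analytics.events"

class ParseEvent(beam.DoFn):
    """Parse a JSON line; route unparseable records to an 'invalid' output."""
    def process(self, line):
        try:
            yield json.loads(line)
        except ValueError:
            yield beam.pvalue.TaggedOutput("invalid", line)

with beam.Pipeline() as p:  # runner and pipeline options omitted for brevity
    parsed = (
        p
        | "Read" >> beam.io.ReadFromText(RAW_FILE)  # PCollection of raw lines
        | "Parse" >> beam.ParDo(ParseEvent()).with_outputs("invalid", main="valid")
    )

    # Valid records go to BigQuery; invalid ones are written aside for inspection.
    parsed.valid | "WriteBQ" >> beam.io.WriteToBigQuery(
        OUTPUT_TABLE,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )
    parsed.invalid | "WriteDeadLetter" >> beam.io.WriteToText(
        "gs://my-bucket/errors/invalid"
    )
```

The same shape runs on Dataflow once you pass the DataflowRunner and project options; a dead-letter branch like this is a common answer to the "invalid inputs" question.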

9. Dataprep

  • How to schedule and execute a recipe as a daily job? What options are there?
  • Benefits of Dataprep

Monitoring

10. Stackdriver

Workflow Management

11. Cloud Composer

Real-time Messaging Service

12. Cloud PubSub

Security

13. IAM

Machine Learning

14. ML

  • Concept, causes, and solutions for overfitting and underfitting
  • When to use AutoML or the APIs — what are the benefits of each?
  • Speech-to-Text API
  • Cloud Natural Language API
  • Cloud Data Loss Prevention API
  • You can train a model locally before production
  • ML pipeline — store the model in Cloud Storage (consider the permissions needed to access the bucket) and set the model path to this object (see the sketch after this list)
  • What to do when the business needs quick ML results with limited expertise?
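
As a quick illustration of the overfitting remedies and the "store the model in Cloud Storage" point, here is a toy Keras sketch. The data is random, the bucket path is hypothetical, and writing straight to a gs:// path assumes your environment has TensorFlow's GCS support and write access to that bucket.

```python
import numpy as np
import tensorflow as tf

# Random data stands in for a real training set in this sketch.
x_train, y_train = np.random.rand(800, 10), np.random.rand(800, 1)
x_val, y_val = np.random.rand(200, 10), np.random.rand(200, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dropout(0.3),  # dropout is one common remedy for overfitting
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping on the validation loss is another guard against overfitting.
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=20,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)],
    verbose=0,
)

# Export the model to a Cloud Storage path (hypothetical bucket) so it can be
# served later; whatever loads it needs read access to that bucket.
model.save("gs://my-models/demo/v1")
```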

Other topics

15. Data Migration

16. Data Pipelines — Potential Use Case

  • Near real-time inventory dashboard. For example, we want to ingest data from POS systems into storage, perform aggregations, store the aggregates in a main table, and archive transactions in a historical table; accuracy and latency are important. What are the options? (One common pattern is sketched after this list.)
  • Low-latency ML prediction. For example, we want to ingest data produced by customers, run a prediction, and feed the prediction back to the customer via a unique customer ID within 100 ms. What are the options?
  • Long run times and poor performance after migrating custom TensorFlow models to Dataproc. We need to improve performance while keeping costs as low as possible. What would you do?
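
For the first scenario, one common pattern (not the only valid answer) is Pub/Sub feeding a streaming Dataflow job that aggregates over short fixed windows and appends the results to BigQuery for the dashboard. A rough sketch, with hypothetical subscription and table names:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Hypothetical names for illustration only.
SUBSCRIPTION = "projects/my-gcp-project/subscriptions/pos-events"
OUTPUT_TABLE = "my-gcp-project:retail.inventory_by_minute"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadPOS" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "KeyBySku" >> beam.Map(lambda e: (e["sku"], e["quantity"]))
        | "Window1Min" >> beam.WindowInto(FixedWindows(60))  # 1-minute windows
        | "SumPerSku" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"sku": kv[0], "sold_last_minute": kv[1]})
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            OUTPUT_TABLE,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```

Archiving the raw transactions into a historical table (or Cloud Storage) can hang off the same parsed PCollection as a second branch of the pipeline.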

17. Other solutions apart from GCP

Exam, the exam

2 hours, 50 questions: it was tough but fun! The distribution of the questions was roughly

  • 75% complex — what’s the best design for this business use case?
  • 25% straightforward — how to achieve a task? Technical specifications (actual commands or steps)

About 10-15% of the questions involved commands: Cloud Shell commands, BigQuery, etc. If you have hands-on experience with GCP, you probably don’t need to spend time on commands. But if you don’t have experience (like me), this is something worth looking into after you’ve gone through the design and concepts.

In terms of the design-type questions (~75% of the exam), the goal is to choose the best solution given specific requirements surrounding:

  • Availability
  • Durability
  • Read/write latency
  • Data retrieval
  • Type of pipelines
  • Scalability
  • Performance
  • Level of access
  • 3Vs (volume, velocity, variety) of data
  • Cost
  • Fixes for issues
  • Timeframe
  • Limitations of resources, expertise, etc

The actual questions usually don’t contain these words, but you need to make the connection in your mind as you read the description.

Also, don’t get stuck! You can mark the questions for review and come back to them later. My suggestion is to get your confidence up by answering a few simple questions first even if that means you have to skip ahead. This way you won’t go into panic mode early on and will find it easier to think clearly. Remember, your aim is to pass, not to get every question right!

One thing: you do get a pass or a fail at the end of the exam. It’s SUPER nerve-wracking but at the same time liberating. There was a lag after I submitted my exam, and I assumed I didn’t pass. After what felt like the longest 5 seconds of my life, I finally saw a line of tiny text, “Pass”, on my screen!

It took about 3 days for me to receive the certification as Google had to verify the exam results first. So don’t worry if you don’t receive the certification right away!

Final notes

I hope my experience helps you prepare for the exam! If you have questions, feel free to reach out to me on LinkedIn. I’m happy to help in any way I can :)

Good luck!
