How to pass the GCP Professional Data Engineer exam in 2 months
I passed the GCP Professional Data Engineer exam in mid-December 2019. A few people have reached out to ask for advice, so I'm sharing my experience here in the hope that it helps you nail the exam! Good luck :)
Before diving into the exam prep…
First, I'm going to briefly touch on my background, profession, and motivation for taking the GCP data engineer exam. Currently working at Deloitte in the enterprise technology consulting space, I have an academic background in data science. Prior to studying for this exam, I had no experience with any cloud platform. I had some idea of what data engineering is, and a distributed databases unit at uni introduced me to implementing data pipelines with Apache Kafka, MongoDB, and PySpark, and to streaming visualisation in Python.
My manager at work suggested that I look into the GCP Data Engineer exam. It sounded interesting, and I love learning new things, so why not!
What I didn't know at that time was that this journey would help me find my passion in data engineering! Since then, I've been going to various meetups in Melbourne and upskilling myself, but hey, that's a story for another time :)
Let’s have a look at how to prepare for it
Goal setting
“Do I need the certification?”
“Why am I taking this exam?”
“What knowledge/experience do I have that can be transferred across?”
“When am I taking this exam? How much time am I willing to commit to study?”
“What does GCP Professional Data Engineer exam cover?”
First things first: these questions are important because they help you plan your study schedule and set your objectives. As you can see in the official exam guide, there is a lot to learn (batch processing, streaming, Dataproc, Apache Beam, BigQuery, ML/AI, and more), so goal setting allows you to plan and prioritise.
For example, my goals were
- Have a solid understanding of the concepts and design of each product
- Apply my learning to design data pipelines to meet business requirements (performance, cost, availability, latency, scalability, security, integration, etc)
As I had experience in data science, I only needed to spend time learning the products and could allocate the extra time to other topics.
Recommended resources
I used a range of different resources — online courses, official documentation, blogs, and Medium articles. As I intend to work as a data engineer professionally, I also looked for real-world use cases to understand how implementing data pipelines solves business problems or creates value.
Out of all the resources, Linux Academy and the Google documentation helped me the most. I'd say the Google documentation was my favourite because it explains the design, concepts, and best practices of each product clearly (thanks to my friend Hamza from Servian who suggested this resource!).
1. Udemy course (paid, < $30 depending on discounts) — recommended by a friend (thanks Hamza!)
2. Linux Academy (paid) — they offer a 1-week free trial!
3. Google Official Documentation
Highly recommend it! It was like a massive playground where I let my curiosity take me to anything I didn't understand or found interesting. This helped me gain in-depth knowledge of each product.
4. Medium
There are some awesome articles about how to pass the exam here on Medium. I recommend using them as a quick overview. As products and features have changed over time, the official documentation is your best friend for the latest information.
- https://medium.com/nooblearning/2019-google-cloud-professional-data-engineer-certification-exam-6a5d6581e507
- https://medium.com/@sathishvj/notes-from-my-google-cloud-professional-data-engineer-exam-530d11966aa0
- https://medium.com/weareservian/google-cloud-data-engineer-exam-study-guide-9afc80be2ee3
5. Search for products on Google — learn from real-world use cases and practical tips.
6. A good blog post by Dmitri Lerko
7. Google Cloud Blog
Good for getting the latest information and understanding how different organisations implement GCP.
- For example, why and how Spotify migrated its event delivery system from Kafka to Cloud PubSub: https://cloud.google.com/blog/products/gcp/spotifys-journey-to-cloud-why-spotify-migrated-its-event-delivery-system-from-kafka-to-google-cloud-pubsub
8. GitHub — search for the product name (Dataflow, Bigtable, etc.) to see implementations and code
9. Google ML Crash Course — A refresher on ML concepts such as overfitting, variance, bias, etc.
You might notice that I didn't include Coursera's free courses in this list. I did go through all 4 of them, but the content didn't suit my learning style as it was more conceptual/theoretical. But hey, give it a go! It might work really well for you.
My approach
Apart from following the course syllabus on Linux Academy and Udemy, I created my own strategy for learning about data engineering on GCP.
1. Understand the Google ecosystem
2. Understand each product by reading through documentation
- Purpose & Features — What is each product (not) designed for?
- Architecture — How does it work?
- Best Practices — How to achieve optimal outcomes?
- Potential Issues — Why? How to identify and overcome?
3. Compare and contrast — Why/when would you use one over the other?
- Cloud Storage vs. BigQuery Storage
- Batch processing vs. Streaming processing
- AutoML vs. ML API
- Etc…
4. Hands-on practice
- Cloud Shell commands
- Build a real-time PubSub streaming pipeline for on-street parking in the City of Melbourne — I picked on-street parking because it's an open dataset that's easy to access; you can use anything, or even simulate your own streaming data! (See the sketch after this list.)
- Write and execute Python code for Dataproc
- Qwiklabs — I only did labs when I needed to see tangible results or the flow of execution.
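To give you a feel for the hands-on part, here's a rough sketch of the publishing side of that streaming pipeline in Python. Everything specific in it (the project ID, the "parking-events" topic, and the payload fields) is made up for illustration; swap in whatever data source you're playing with.

```python
# A minimal sketch of the publishing half of a Pub/Sub streaming pipeline.
# Assumes a hypothetical project and a topic called "parking-events" that already exists.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "parking-events")  # hypothetical names


def publish_reading(reading: dict) -> None:
    # Pub/Sub messages are bytes, so encode the JSON payload.
    data = json.dumps(reading).encode("utf-8")
    future = publisher.publish(topic_path, data)
    future.result()  # block until the service accepts the message


if __name__ == "__main__":
    # Simulated readings; in my case the data came from the City of Melbourne open data feed.
    for i in range(5):
        publish_reading({"bay_id": i, "occupied": i % 2 == 0, "ts": time.time()})
        time.sleep(1)
```

The consuming side is where Dataflow comes in, which I touch on further down.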
Topics to study
Now we’re moving on to the specific topics — storage, processing, machine learning, security, monitoring, real-time messaging service, workflow management, and others.
The key takeaway here is: the exam tests your ability to design a data pipeline based on business requirements — one size doesn't fit all. For example, Bigtable is a highly scalable storage solution. But if you're asked to design a pipeline to support transactional data and latency isn't a concern, is it the best option? What might be, and why? Hence, it's important to know when to use what and why.
Storage
1. BigQuery — more than 20% of the questions were about BigQuery (directly or indirectly)
- Designed for appending data
- Backup methods
- BigQuery snapshot decorator
- Advantages of Avro file format
- Compatible data types — no PDF!
- Storage can be as cheap as Cloud Storage
- How to validate whether data in BigQuery is the same as the source after migration?
- Update and merge using DML (see the sketch after this list)
- What to do when you exceed the maximum concurrent slots per project under on-demand pricing?
- How to batch load data for it to be available for analysis within 1 minute of load time?
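To make the DML point above concrete, here's a minimal sketch of running a MERGE statement through the BigQuery Python client. The project, dataset, and table names are hypothetical.

```python
# A minimal sketch of an update/merge with DML via the BigQuery client.
# All project, dataset, and table names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

merge_sql = """
MERGE `my-gcp-project.sales.daily_totals` AS target
USING `my-gcp-project.sales.staging_totals` AS source
ON target.store_id = source.store_id AND target.day = source.day
WHEN MATCHED THEN
  UPDATE SET target.total = source.total
WHEN NOT MATCHED THEN
  INSERT (store_id, day, total) VALUES (source.store_id, source.day, source.total)
"""

job = client.query(merge_sql)  # standard SQL is the default
job.result()  # wait for the DML job to finish
print(f"Rows affected: {job.num_dml_affected_rows}")
```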
2. Cloud Storage
- When to use which storage class: Standard, Nearline, Coldline?
- Staging area for data analytics
- Store all types of data
- How to continuously sync data between a local filesystem and Cloud Storage? (What commands to use? See the sketch below.)
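The command to remember for that last question is `gsutil rsync -r <local_dir> gs://<bucket>/<prefix>`, run on a schedule (e.g. cron) to keep things continuous. If you prefer scripting it, here's a rough Python sketch of just the upload half using the Cloud Storage client; the bucket name and paths are made up.

```python
# A small illustration of pushing local files into Cloud Storage with the Python client.
# Bucket name and paths are hypothetical; gsutil rsync is the simpler real-world answer.
from pathlib import Path

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-staging-bucket")


def upload_dir(local_dir: str, prefix: str) -> None:
    # Walk the local directory and upload each file under the given prefix.
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            blob = bucket.blob(f"{prefix}/{path.relative_to(local_dir)}")
            blob.upload_from_filename(str(path))


upload_dir("./local_data", "staging")
```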
3. Datastore
- Highly scalable NoSQL database which is good for mobile and web applications
- What is Datastore good for and what is it not? It's for structured data but not ideal for OLAP.
- Terminology — Kind, entity, property, key
- The newest version is Firestore, where all queries are strongly consistent
- How to schedule periodical backup?
4. Bigtable
- Designed for sparse data
- Metrics to determine when to scale Bigtable, even with a perfect row key
- Mechanism for retrieving data quickly — each row is indexed by its row key
- Optimise row key design
- Potential reasons for suboptimal performance and their fixes — hotspotting, too much data in a single cell, etc. (see the sketch below)
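Row key design is worth practising hands-on. Here's a minimal sketch of writing to Bigtable with a key that leads with a field value instead of a raw timestamp, which is one common way to avoid hotspotting. The project, instance, table, and column family names are hypothetical.

```python
# A minimal sketch of Bigtable row key design with the Python client.
# Leading the key with sensor_id (rather than a timestamp) spreads writes across
# nodes and avoids hotspotting. All resource names here are hypothetical.
import time

from google.cloud import bigtable

client = bigtable.Client(project="my-gcp-project", admin=True)
table = client.instance("iot-instance").table("sensor_readings")


def write_reading(sensor_id: str, value: float) -> None:
    # Reverse the timestamp so the most recent reading for a sensor sorts first.
    reverse_ts = 10**13 - int(time.time() * 1000)
    row_key = f"{sensor_id}#{reverse_ts}".encode("utf-8")
    row = table.direct_row(row_key)
    row.set_cell("metrics", "value", str(value).encode("utf-8"))
    row.commit()


write_reading("sensor-42", 21.5)
```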
5. Cloud SQL
- Lift and shift on-prem relational databases
- Read replica
- Scalability and strong ACID
Processing
7. Dataproc
- How to prevent data loss from cluster termination when using preemptible workers?
- How to optimise performance of custom-built TensorFlow models while keeping cost low?
- Scaling clusters
- Use Cloud Storage instead of HDFS. Why? (See the PySpark sketch after this list.)
- How to schedule Dataproc jobs?
- Migrate on-prem Spark jobs to Dataproc
- IAM — dataproc.worker
- No support for TPU
- When to use high availability mode?
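As an example of the Cloud Storage vs. HDFS point above, here's a minimal PySpark job you could submit to Dataproc. Because it reads from and writes to gs:// paths, the data outlives the cluster, so you can delete the cluster (or lose preemptible workers) without losing anything. The bucket and file paths are made up.

```python
# A minimal PySpark job for Dataproc that uses Cloud Storage instead of HDFS,
# so data survives cluster deletion. Bucket paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-instead-of-hdfs").getOrCreate()

# The GCS connector is preinstalled on Dataproc, so gs:// paths just work.
df = spark.read.csv("gs://my-data-bucket/raw/transactions.csv", header=True, inferSchema=True)

daily = df.groupBy("transaction_date").count()

daily.write.mode("overwrite").parquet("gs://my-data-bucket/aggregated/daily_counts/")
```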
8. Dataflow
- Run Apache Beam program
- How to handle and process invalid inputs? (See the Beam sketch after this list.)
- Side inputs — lookup table from BigQuery
- Understand PCollections, Transforms
- Pipeline I/O available for BigQuery, Datastore, etc.
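Here's a rough Apache Beam sketch of the "invalid inputs" pattern referenced above: parse each PubSub message, send good records to BigQuery, and route unparseable ones to a dead-letter topic. The subscription, table, and topic names are hypothetical, and it assumes the destination table already exists with columns matching the JSON fields.

```python
# A minimal Beam streaming sketch: valid records go to BigQuery, unparseable
# ones go to a dead-letter Pub/Sub topic. All resource names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseJson(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            # Route bad records to a tagged output instead of failing the pipeline.
            yield beam.pvalue.TaggedOutput("invalid", element)


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        parsed = (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-gcp-project/subscriptions/parking-sub")
            | "Parse" >> beam.ParDo(ParseJson()).with_outputs("invalid", main="valid")
        )
        parsed.valid | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:parking.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        parsed.invalid | "DeadLetter" >> beam.io.WriteToPubSub(
            topic="projects/my-gcp-project/topics/parking-dead-letter")


if __name__ == "__main__":
    run()
```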
9. Dataprep
- How to schedule and execute a recipe as a daily job? What options are there?
- Benefits of Dataprep
Monitoring
10. Stackdriver
- Difference between Monitoring and Logging
- How to install the Stackdriver Monitoring and Logging agents for different products? — e.g., MariaDB on a VM is treated as MySQL, so install and configure the agents accordingly
Real-time Messaging Service
12. Cloud PubSub
- Basic building blocks — topic, subscription, publisher, subscriber, at-least-once delivery, message acknowledgement, no ordering guarantee
- Migrate from Kafka to PubSub
- What are pull and push subscriptions? When should you use which? (See the subscriber sketch after this list.)
- Kafka to PubSub — use PubSub as sink or source
- Metrics to monitor PubSub to Dataproc pipeline
- Ways to order messages, and why PubSub doesn't order them in the first place
- Potential reasons for PubSub ingestion to be suboptimal
- Stream processing
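To complement the publisher earlier, here's a minimal pull subscriber, to contrast with push (where PubSub instead POSTs messages to an HTTPS endpoint you host). The project and subscription names are hypothetical.

```python
# A minimal sketch of a pull subscriber, assuming a hypothetical subscription
# called "parking-sub" in a hypothetical project.
from concurrent import futures

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-gcp-project", "parking-sub")


def callback(message):
    print(f"Received: {message.data!r}")
    message.ack()  # unacknowledged messages are redelivered (at-least-once delivery)


# subscribe() opens a streaming pull in the background and returns a future.
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen for 30 seconds for this demo
except futures.TimeoutError:
    streaming_pull.cancel()
    streaming_pull.result()  # wait for the shutdown to finish
```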
Security
13. IAM
- Understand primitive roles, predefined roles, service accounts, and policies
- How to grant teams access to BigQuery datasets across different projects? (See the sketch after this list.)
- Best practice — reflect organisation hierarchy
- Billing accounts
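For the BigQuery dataset access question above, here's a rough sketch of granting a group read access at the dataset level (rather than a project-level IAM role) using the Python client. The project, dataset, and group email are made up.

```python
# A minimal sketch of dataset-level access control in BigQuery.
# Project, dataset, and group email are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
dataset = client.get_dataset("my-gcp-project.analytics")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="data-analysts@example.com",
    )
)
dataset.access_entries = entries

client.update_dataset(dataset, ["access_entries"])  # update only the access list
```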
Machine Learning
14. ML
- Concept, causes, and solutions for overfitting and underfitting
- When to use AutoML vs. the pre-trained APIs — what are the benefits of each?
- Speech-to-Text API
- Cloud Natural Language API (see the sketch after this list)
- Cloud Data Loss Prevention API
- Can train model locally before production
- ML pipeline — store the model in Cloud Storage (consider permissions to access the bucket) and set the model path to that object
- What to do when the business needs quick ML results with limited expertise?
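When the business needs quick results with limited ML expertise, the pre-trained APIs are usually the answer. Here's a tiny sketch of calling the Cloud Natural Language API for sentiment analysis; the input text is just an example.

```python
# A minimal sketch of calling a pre-trained model via the Cloud Natural Language API,
# the kind of option to reach for when you need quick ML results with limited expertise.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="The new dashboard is fantastic and saves the team hours every week.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.analyze_sentiment(request={"document": document})
print(f"Sentiment score: {response.document_sentiment.score:.2f}")
print(f"Sentiment magnitude: {response.document_sentiment.magnitude:.2f}")
```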
Other topics
15. Data Migration
- What to use when the on-prem network doesn't allow external IPs?
- Secure way to transfer data
- Understand Dedicated Interconnect and when one would benefit from it
16. Data Pipelines — Potential Use Case
- Near real-time inventory dashboard. For example, you want to ingest data from POS systems into storage, perform aggregation, store the aggregates in a main table, and archive transactions in a historical table, with accuracy and latency both important. What are the options?
- Low-latency ML prediction. For example, you want to ingest data produced by customers, perform a prediction, and feed the prediction back to the customer via a unique customer ID within 100 ms. What are the options?
- Long run times and poor performance after migrating custom TensorFlow models to Dataproc. You need to improve performance while keeping cost as low as possible. What do you do?
17. Other solutions apart from GCP
The exam itself
2 hours, 50 questions: it was tough but fun! The distribution of questions was roughly:
- 75% complex — what’s the best design for this business use case?
- 25% straightforward — how to achieve a task? Technical specifications (actual commands or steps)
About 10-15% of the questions involved actual commands, such as Cloud Shell and BigQuery commands. If you have hands-on experience with GCP, you probably don't need to spend time on commands. But if you don't have experience (like me), this is something worth looking into after you've gone through design and concepts.
In terms of the design type of questions in the exam (~75% of all), the goal is to choose the best solution given specific requirements surrounding:
- Availability
- Durability
- Read/write latency
- Data retrieval
- Type of pipelines
- Scalability
- Performance
- Level of access
- 3Vs (volume, velocity, variety) of data
- Cost
- Fixes for issues
- Timeframe
- Limitations of resources, expertise, etc
The actual questions usually don't contain these exact words, but you need to make the connection in your mind as you read the description.
Also, don’t get stuck! You can mark the questions for review and come back to them later. My suggestion is to get your confidence up by answering a few simple questions first even if that means you have to skip ahead. This way you won’t go into panic mode early on and will find it easier to think clearly. Remember, your aim is to pass, not to get every question right!
One more thing: you only get a pass or a fail at the end of the exam. It's SUPER nerve-wracking but at the same time liberating. There was a lag after I submitted my exam and I assumed I didn't pass. After what felt like the longest 5 seconds of my life, I finally saw a line of tiny text, "Pass", on my screen!
It took about 3 days for me to receive the certification as Google had to verify the exam results first. So don’t worry if you don’t receive the certification right away!
Final notes
I hope my experience helps you prepare for the exam! If you have questions, feel free to reach out to me on LinkedIn. I'm happy to help in any way I can :)
Good luck!