How to pass the GCP Professional Data Engineer exam in 2 months
I passed the GCP Professional Data Engineer exam in mid-December 2019. A few people have reached out to ask for advice, so I'm sharing my experience here in the hope that it helps you nail the exam! Good luck :)
Before diving into the exam prep…
First, I'm going to briefly touch on my background, profession, and motivation for taking the GCP data engineer exam. Currently working at Deloitte in the enterprise technology consulting space, I have an academic background in data science. Prior to studying for this exam, I had no experience with any cloud platform. I had some idea of what data engineering is, and a distributed databases unit at uni introduced me to implementing data pipelines with Apache Kafka, MongoDB, and PySpark, and to streaming visualisation in Python.
My manager at work suggested that I look into the GCP Data Engineer exam. It sounded interesting, and I love learning new things, so why not!
What I didn't know at that time was that this journey would help me find my passion in data engineering! Since then, I've been going to various meetups in Melbourne and upskilling myself, but hey, that's a story for another time :)
Let’s have a look at how to prepare for it
Goal setting
“Do I need the certification?”
“Why am I taking this exam?”
“What knowledge/experience do I have that can be transferred across?”
“When am I taking this exam? How much time am I willing to commit to study?”
“What does GCP Professional Data Engineer exam cover?”
First things first: these questions are important because they help you plan your study schedule and set your objectives. As you can see in the official exam guide, there is a lot to learn (batch processing, streaming, Dataproc, Apache Beam, BigQuery, ML/AI, and more), so goal setting allows you to plan and prioritise.
For example, my goals were
- Have a solid understanding of the concepts and design of each product
- Apply my learning to design data pipelines to meet business requirements (performance, cost, availability, latency, scalability, security, integration, etc)
As I had experience in data science, I only needed to spend time learning the products and could allocate the extra time to other topics.
Recommended resources
I used a range of different resources — online courses, official documentation, blogs, and Medium articles. As I intend to work as a data engineer professionally, I also looked for real-world use cases to understand how implementing data pipelines solves business problems or creates value.
Out of all the resources, Linux Academy and the Google documentation helped me the most. I'd say the Google documentation was my favourite because it explains the design, concepts, and best practices of each product clearly (thanks to my friend Hamza from Servian who suggested this resource!).
1. Udemy course (paid, < $30 depending on discounts) — recommended by a friend (thanks Hamza!)
2. Linux Academy (paid) — they offer a 1-week free trial!
3. Google Official Documentation
Highly recommend it! It was like a massive playground where I let my curiosity take me to anything I didn't understand or found interesting. This helped me gain in-depth knowledge of each product.
4. Medium
There are some awesome articles about how to pass the exam here on Medium. I recommend using them as a quick overview. As products and features have changed over time, the official documentation is your best friend for the latest information.
- https://medium.com/nooblearning/2019-google-cloud-professional-data-engineer-certification-exam-6a5d6581e507
- https://medium.com/@sathishvj/notes-from-my-google-cloud-professional-data-engineer-exam-530d11966aa0
- https://medium.com/weareservian/google-cloud-data-engineer-exam-study-guide-9afc80be2ee3
5. Search for products on Google — learn from real-world use cases and practical tips.
6. A good blog post by Dmitri Lerko
7. Google Cloud Blog
Good for getting the latest information and understanding how different organisations implement GCP.
- For example, why and how Spotify migrated its event delivery system from Kafka to Cloud PubSub: https://cloud.google.com/blog/products/gcp/spotifys-journey-to-cloud-why-spotify-migrated-its-event-delivery-system-from-kafka-to-google-cloud-pubsub
8. GitHub — search for the product name (Dataflow, Bigtable, etc.) to see implementations and code
9. Google ML Crash Course — A refresher on ML concepts such as overfitting, variance, bias, etc.
You might notice that I didn't include Coursera's free courses in this list. I did go through all 4 of them, but the content didn't suit my learning style as it was more conceptual/theoretical. But hey, give it a go! It might work really well for you.
My approach
Apart from following the course syllabus on Linux Academy and Udemy, I created my own strategy for learning about data engineering on GCP.
1. Understand the Google ecosystem
2. Understand each product by reading through documentation
- Purpose & Features — What is each product (not) designed for?
- Architecture — How does it work?
- Best Practices — How to achieve optimal outcomes?
- Potential Issues — Why? How to identify and overcome?
3. Compare and contrast — Why/when would you use one over the other?
- Cloud Storage vs. BigQuery Storage
- Batch processing vs. Streaming processing
- AutoML vs. ML API
- Etc…
4. Hands-on practice
- Cloud Shell commands
- Build a real-time PubSub streaming pipeline for on-street parking in the City of Melbourne — I picked on-street parking because it's an open dataset that's easy to access; you can use anything, or even simulate your own streaming data! (See the sketch after this list.)
- Write and execute Python code for Dataproc
- Qwiklabs — I only did labs when I needed to see tangible results or the flow of execution.
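To give you a feel for the hands-on part, here's a rough sketch of the publishing side of that streaming pipeline in Python. Everything specific in it (the project ID, the "parking-events" topic, and the payload fields) is made up for illustration; swap in whatever data source you're playing with.

```python
# A minimal sketch of the publishing half of a Pub/Sub streaming pipeline.
# Assumes a hypothetical project and a topic called "parking-events" that already exists.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "parking-events")  # hypothetical names


def publish_reading(reading: dict) -> None:
    # Pub/Sub messages are bytes, so encode the JSON payload.
    data = json.dumps(reading).encode("utf-8")
    future = publisher.publish(topic_path, data)
    future.result()  # block until the service accepts the message


if __name__ == "__main__":
    # Simulated readings; in my case the data came from the City of Melbourne open data feed.
    for i in range(5):
        publish_reading({"bay_id": i, "occupied": i % 2 == 0, "ts": time.time()})
        time.sleep(1)
```

The consuming side is where Dataflow comes in, which I touch on further down.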
Topics to study
Now we’re moving on to the specific topics — storage, processing, machine learning, security, monitoring, real-time messaging service, workflow management, and others.
The key takeaway here is: the exam tests your ability to design a data pipeline based on business requirements — one size doesn't fit all. For example, Bigtable is a highly scalable storage solution. But if you're asked to design a pipeline to support transactional data and latency isn't a concern, is it the best option? What might be, and why? Hence, it's important to know when to use what and why.
Storage
1. BigQuery — more than 20% of the questions were about BigQuery (directly or indirectly)
- Designed for appending data
- Backup methods
- BigQuery snapshot decorator
- Advantages of Avro file format
- Compatible data types — no PDF!
- Storage can be as cheap as Cloud Storage
- How to validate whether data in BigQuery is the same as the source after migration?
- Update and merge using DML (see the sketch after this list)
- What to do when you exceed the maximum concurrent slots per project under on-demand pricing?
- How to batch load data for it to be available for analysis within 1 minute of load time?
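To make the DML point above concrete, here's a minimal sketch of running a MERGE statement through the BigQuery Python client. The project, dataset, and table names are hypothetical.

```python
# A minimal sketch of an update/merge with DML via the BigQuery client.
# All project, dataset, and table names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

merge_sql = """
MERGE `my-gcp-project.sales.daily_totals` AS target
USING `my-gcp-project.sales.staging_totals` AS source
ON target.store_id = source.store_id AND target.day = source.day
WHEN MATCHED THEN
  UPDATE SET target.total = source.total
WHEN NOT MATCHED THEN
  INSERT (store_id, day, total) VALUES (source.store_id, source.day, source.total)
"""

job = client.query(merge_sql)  # standard SQL is the default
job.result()  # wait for the DML job to finish
print(f"Rows affected: {job.num_dml_affected_rows}")
```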
2. Cloud Storage
- When to use which storage class: Standard, Nearline, Coldline?
- Staging area for data analytics
- Store all types of data
- How to continuously sync data between a local filesystem and Cloud Storage? (What commands to use? See the sketch below.)
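The command to remember for that last question is `gsutil rsync -r <local_dir> gs://<bucket>/<prefix>`, run on a schedule (e.g. cron) to keep things continuous. If you prefer scripting it, here's a rough Python sketch of just the upload half using the Cloud Storage client; the bucket name and paths are made up.

```python
# A small illustration of pushing local files into Cloud Storage with the Python client.
# Bucket name and paths are hypothetical; gsutil rsync is the simpler real-world answer.
from pathlib import Path

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-staging-bucket")


def upload_dir(local_dir: str, prefix: str) -> None:
    # Walk the local directory and upload each file under the given prefix.
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            blob = bucket.blob(f"{prefix}/{path.relative_to(local_dir)}")
            blob.upload_from_filename(str(path))


upload_dir("./local_data", "staging")
```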
3. Datastore
- Highly scalable NoSQL database which is good for mobile and web applications
- What is Datastore good for and what is it not? It's for structured data but not ideal for OLAP.
- Terminology — Kind, entity, property, key
- The newest version is Firestore, where all queries are strongly consistent
- How to schedule periodical backup?
4. Bigtable
- Designed for sparse data
- Metrics to determine when to scale Bigtable, even with a perfect row key
- Mechanism for retrieving data quickly — each row is indexed by its row key
- Optimise row key design
- Potential reasons for suboptimal performance and their fixes — hotspotting, too much data in a single cell, etc. (see the sketch below)
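Row key design is worth practising hands-on. Here's a minimal sketch of writing to Bigtable with a key that leads with a field value instead of a raw timestamp, which is one common way to avoid hotspotting. The project, instance, table, and column family names are hypothetical.

```python
# A minimal sketch of Bigtable row key design with the Python client.
# Leading the key with sensor_id (rather than a timestamp) spreads writes across
# nodes and avoids hotspotting. All resource names here are hypothetical.
import time

from google.cloud import bigtable

client = bigtable.Client(project="my-gcp-project", admin=True)
table = client.instance("iot-instance").table("sensor_readings")


def write_reading(sensor_id: str, value: float) -> None:
    # Reverse the timestamp so the most recent reading for a sensor sorts first.
    reverse_ts = 10**13 - int(time.time() * 1000)
    row_key = f"{sensor_id}#{reverse_ts}".encode("utf-8")
    row = table.direct_row(row_key)
    row.set_cell("metrics", "value", str(value).encode("utf-8"))
    row.commit()


write_reading("sensor-42", 21.5)
```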
5. Cloud SQL
- Lift and shift on-prem relational databases
- Read replica
- Scalability and strong ACID
Processing
7. Dataproc
- How to prevent data loss from cluster termination when using preemptible workers?
- How to optimise performance of custom-built TensorFlow models while keeping cost low?
- Scaling clusters
- Use Cloud Storage instead of HDFS. Why? (See the PySpark sketch after this list.)
- How to schedule Dataproc jobs?
- Migrate on-prem Spark jobs to Dataproc
- IAM — dataproc.worker
- No support for TPU
- When to use high availability mode?
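As an example of the Cloud Storage vs. HDFS point above, here's a minimal PySpark job you could submit to Dataproc. Because it reads from and writes to gs:// paths, the data outlives the cluster, so you can delete the cluster (or lose preemptible workers) without losing anything. The bucket and file paths are made up.

```python
# A minimal PySpark job for Dataproc that uses Cloud Storage instead of HDFS,
# so data survives cluster deletion. Bucket paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-instead-of-hdfs").getOrCreate()

# The GCS connector is preinstalled on Dataproc, so gs:// paths just work.
df = spark.read.csv("gs://my-data-bucket/raw/transactions.csv", header=True, inferSchema=True)

daily = df.groupBy("transaction_date").count()

daily.write.mode("overwrite").parquet("gs://my-data-bucket/aggregated/daily_counts/")
```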
8. Dataflow
- Run Apache Beam program
- How to handle and process invalid inputs? (See the Beam sketch after this list.)
- Side inputs — lookup table from BigQuery
- Understand PCollections, Transforms
- Pipeline I/O available for BigQuery, Datastore, etc.
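Here's a rough Apache Beam sketch of the "invalid inputs" pattern referenced above: parse each PubSub message, send good records to BigQuery, and route unparseable ones to a dead-letter topic. The subscription, table, and topic names are hypothetical, and it assumes the destination table already exists with columns matching the JSON fields.

```python
# A minimal Beam streaming sketch: valid records go to BigQuery, unparseable
# ones go to a dead-letter Pub/Sub topic. All resource names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseJson(beam.DoFn):
    def process(self, element):
        try:
            yield json.loads(element.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            # Route bad records to a tagged output instead of failing the pipeline.
            yield beam.pvalue.TaggedOutput("invalid", element)


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        parsed = (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-gcp-project/subscriptions/parking-sub")
            | "Parse" >> beam.ParDo(ParseJson()).with_outputs("invalid", main="valid")
        )
        parsed.valid | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:parking.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        parsed.invalid | "DeadLetter" >> beam.io.WriteToPubSub(
            topic="projects/my-gcp-project/topics/parking-dead-letter")


if __name__ == "__main__":
    run()
```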
9. Dataprep
- How to schedule and execute a recipe as a daily job? What options are there?
- Benefits of Dataprep
Monitoring
10. Stackdriver
- Difference between Monitoring and Logging
- How to install the Stackdriver Monitoring and Logging agents for different products? — e.g., MariaDB on a VM is treated as MySQL, so install and configure the agents accordingly
Real-time Messaging Service
12. Cloud PubSub
- Basic building blocks — topic, subscription, publisher, subscriber, at-least-once delivery, message acknowledgement, no ordering guarantee
- Migrate from Kafka to PubSub
- What are pull and push subscriptions? When should you use which? (See the subscriber sketch after this list.)
- Kafka to PubSub — use PubSub as sink or source
- Metrics to monitor PubSub to Dataproc pipeline
- Ways to order messages, and why PubSub doesn't order them in the first place
- Potential reasons for PubSub ingestion to be suboptimal
- Stream processing
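To complement the publisher earlier, here's a minimal pull subscriber, to contrast with push (where PubSub instead POSTs messages to an HTTPS endpoint you host). The project and subscription names are hypothetical.

```python
# A minimal sketch of a pull subscriber, assuming a hypothetical subscription
# called "parking-sub" in a hypothetical project.
from concurrent import futures

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-gcp-project", "parking-sub")


def callback(message):
    print(f"Received: {message.data!r}")
    message.ack()  # unacknowledged messages are redelivered (at-least-once delivery)


# subscribe() opens a streaming pull in the background and returns a future.
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen for 30 seconds for this demo
except futures.TimeoutError:
    streaming_pull.cancel()
    streaming_pull.result()  # wait for the shutdown to finish
```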
Security
13. IAM
- Understand primitive roles, predefined roles, service accounts, and policies
- How to grant teams access to BigQuery datasets across different projects? (See the sketch after this list.)
- Best practice — reflect organisation hierarchy
- Billing accounts
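For the BigQuery dataset access question above, here's a rough sketch of granting a group read access at the dataset level (rather than a project-level IAM role) using the Python client. The project, dataset, and group email are made up.

```python
# A minimal sketch of dataset-level access control in BigQuery.
# Project, dataset, and group email are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
dataset = client.get_dataset("my-gcp-project.analytics")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="data-analysts@example.com",
    )
)
dataset.access_entries = entries

client.update_dataset(dataset, ["access_entries"])  # update only the access list
```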
Machine Learning
14. ML
- Concept, causes, and solutions for overfitting and underfitting
- When to use AutoML vs. the pre-trained APIs — what are the benefits of each?
- Speech-to-Text API
- Cloud Natural Language API (see the sketch after this list)
- Cloud Data Loss Prevention API
- Can train model locally before production
- ML pipeline — store the model in Cloud Storage (consider permissions to access the bucket) and set the model path to that object
- What to do when the business needs quick ML results with limited expertise?
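When the business needs quick results with limited ML expertise, the pre-trained APIs are usually the answer. Here's a tiny sketch of calling the Cloud Natural Language API for sentiment analysis; the input text is just an example.

```python
# A minimal sketch of calling a pre-trained model via the Cloud Natural Language API,
# the kind of option to reach for when you need quick ML results with limited expertise.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="The new dashboard is fantastic and saves the team hours every week.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.analyze_sentiment(request={"document": document})
print(f"Sentiment score: {response.document_sentiment.score:.2f}")
print(f"Sentiment magnitude: {response.document_sentiment.magnitude:.2f}")
```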
Other topics
15. Data Migration
- What to use when the on-prem network doesn't allow external IPs?
- Secure way to transfer data
- Understand Dedicated Interconnect and when one would benefit from it
16. Data Pipelines — Potential Use Case
- Near real-time inventory dashboard. For example, you want to ingest data from POS systems into storage, perform aggregation, store the aggregates in a main table, and archive transactions in a historical table, with accuracy and latency both important. What are the options?
- Low-latency ML prediction. For example, you want to ingest data produced by customers, perform a prediction, and feed the prediction back to the customer via a unique customer ID within 100 ms. What are the options?
- Long run times and poor performance after migrating custom TensorFlow models to Dataproc. You need to improve performance while keeping cost as low as possible. What do you do?
17. Other solutions apart from GCP
The exam itself
2 hours, 50 questions: it was tough but fun! The distribution of questions was roughly:
- 75% complex — what’s the best design for this business use case?
- 25% straightforward — how to achieve a task? Technical specifications (actual commands or steps)
About 10-15% of the questions involved actual commands, such as Cloud Shell and BigQuery commands. If you have hands-on experience with GCP, you probably don't need to spend time on commands. But if you don't have experience (like me), this is something worth looking into after you've gone through design and concepts.
In terms of the design type of questions in the exam (~75% of all), the goal is to choose the best solution given specific requirements surrounding:
- Availability
- Durability
- Read/write latency
- Data retrieval
- Type of pipelines
- Scalability
- Performance
- Level of access
- 3Vs (volume, velocity, variety) of data
- Cost
- Fixes for issues
- Timeframe
- Limitations of resources, expertise, etc
The actual questions usually don't contain these exact words, but you need to make the connection in your mind as you read the description.
Also, don’t get stuck! You can mark the questions for review and come back to them later. My suggestion is to get your confidence up by answering a few simple questions first even if that means you have to skip ahead. This way you won’t go into panic mode early on and will find it easier to think clearly. Remember, your aim is to pass, not to get every question right!
One more thing: you only get a pass or a fail at the end of the exam. It's SUPER nerve-wracking but at the same time liberating. There was a lag after I submitted my exam and I assumed I didn't pass. After what felt like the longest 5 seconds of my life, I finally saw a line of tiny text, "Pass", on my screen!
It took about 3 days for me to receive the certification as Google had to verify the exam results first. So don’t worry if you don’t receive the certification right away!
Final notes
I hope my experience helps you prepare for the exam! If you have questions, feel free to reach out to me on LinkedIn. I'm happy to help in any way I can :)
Good luck!