How I Prepared for the Google Cloud Professional Data Engineer Certification Exam, and Some Tips
Check it out here — Professional Data Engineer Playlist — YouTube
I wrote the Google Cloud Professional Data Engineer exam and passed. Yaay! Here are my immediate impressions and notes. I hope they are useful to future test takers.
Result
Soon after the exam, I saw the result: PASS. A few days later, I received the certificate from Google Cloud.
How I studied for the test
The exam was very tough. I had assumed this one would be easier because I spent more time preparing and had the experience of previous certifications. After the exam I went over the questions again to remind myself later which areas were covered — the answer is, everything. There were zero direct questions; every question was embedded in a situation or use case.
- BigQuery Data Transfer Service. I knew of the Storage Transfer Service and BigQuery connectors, but I went ‘huh?’ when I saw this.
https://cloud.google.com/bigquery/transfer/ (Edit: at the time I wrote the exam, this was new to me. Its capabilities have since expanded.)
- IAM + Dataflow. The Dataflow Developer role.
https://cloud.google.com/dataflow/docs/concepts/access-control
- IAM + BigQuery. Access levels. As Nikhil pointed out in the comments, you can now restrict access at the table level.
https://cloud.google.com/blog/products/data-analytics/introducing-table-level-access-controls-in-bigquery
- BigQuery: partitioned tables. What are they partitioned on — ingestion time, timestamp, date? How are they named? How are they then accessed in queries? Using _PARTITIONTIME.
https://cloud.google.com/bigquery/docs/partitioned-tables
- BigQuery. The syntax for wildcards in BigQuery table names. And in legacy SQL?
https://cloud.google.com/bigquery/docs/querying-wildcard-tables
- BigQuery: table date ranges in bq. Accessing tables with dates, and partitioned tables, with functions like TABLE_DATE_RANGE, _TABLE_SUFFIX, TABLE_QUERY.
https://stackoverflow.com/questions/22641894/bigquery-wildcard-using-table-date-range
- Cloud Spanner: secondary indexes. How indexes are created for you, and how you can create secondary indexes yourself.
https://cloud.google.com/spanner/docs/secondary-indexes
- Datastore: multiple indexes. Default indexes. The syntax for creating custom, composite indexes.
https://cloud.google.com/datastore/docs/concepts/indexes
- Bigtable: row key scheme. What are the recommended ways to design the row key? How do you avoid hotspotting? Should you use a timestamp, and where?
https://cloud.google.com/bigtable/docs/schema-design
- Bigtable: ways to optimize performance.
https://cloud.google.com/bigtable/docs/performance
- Pub/Sub, Dataflow, Dataproc — properties and uses of these products. The courses from Coursera, Linux Academy, and Cloud Academy cover these well.
- Dataproc: using GCS instead of the cluster-local file system. It is a best practice to use Google Cloud Storage instead of HDFS: you can then destroy the compute nodes after data crunching and save their cost.
- BigQuery + Data Studio — caching/pre-fetch cache. Learn how you connect Data Studio to storage solutions. Learn the difference between the default cache (which cannot be disabled) and the pre-fetch cache (which can be). What is the difference between doing that with viewer credentials and owner credentials?
https://support.google.com/datastudio/answer/7020039?hl=en
- Dataprep: jobs. How are Dataprep jobs created and run? What permissions do you need? A term I saw was that this is a more ‘casual’ way of data cleaning; Dataproc/Dataflow would be more programmatic and therefore ‘intense’, I suppose.
https://cloud.google.com/dataprep/docs/html/Jobs-Page_57344842
- Data Studio: visualisation. What are the causes of stale data, and how do you get the latest? What caching options do you need to set?
- Machine Learning: feature crosses. Learn what these are and what problems they solve.
https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture
- Machine Learning. Go through the Coursera course on machine learning.
https://www.coursera.org/learn/serverless-machine-learning-gcp/home/welcome
- Machine Learning: dealing with overfitting.
https://developers.google.com/machine-learning/crash-course/generalization/peril-of-overfitting
- Machine Learning: regularization. What does it mean to increase or decrease regularization?
https://www.coursera.org/lecture/deep-neural-network/why-regularization-reduces-overfitting-T6OJj
- Dataproc: how do you control scaling? How do you configure autoscaling?
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
- Avro file format. A compact binary format (with optional compression) that BigQuery and Dataflow can work with directly.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
- Know a bit about technologies outside of GCP too. Remember that as a Professional on GCP you are also expected to know technologies in the general ecosystem; you might have to decide between GCP solutions and alternatives in the market. Just by-hearting GCP won’t cut it.
- Key Management Service. Using KMS with non-GCP products. Note that there is default encryption, where Google manages all the keys; there are customer-managed encryption keys (CMEK); and there are customer-supplied encryption keys (CSEK).
https://cloud.google.com/kms/docs/
- BigQuery query plan. BigQuery lets you see the query plan and execution profile for queries you run. Know the phases, the difference between average and max time, why there can be skew in the plan, and how to optimize for it.
https://cloud.google.com/bigquery/query-plan-explanation
- BigQuery + GCS. Know how to link tables between GCS and BigQuery as permanent tables and temporary tables.
https://cloud.google.com/bigquery/external-data-cloud-storage
- You don’t have to by-heart the case studies, but study them well and work through solutions for them yourself. The Linux Academy course has a module that goes over the case studies. (I believe the updated exam no longer has case studies.)
- BigQuery. Know what a federated table is. While you are at it, also learn about clustered tables.
https://cloud.google.com/bigquery/external-data-sources
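To make the partitioning and wildcard bullets above concrete, here is a small sketch of the three query styles as plain SQL strings. The project, dataset, and table names are made up for illustration.

```python
# Hypothetical names (my_project, my_dataset, events) for illustration only.

# Ingestion-time partitioned table: the _PARTITIONTIME pseudo-column
# restricts the scan to specific partitions, cutting cost.
partitioned_query = """
SELECT user_id, event
FROM `my_project.my_dataset.events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2023-01-01') AND TIMESTAMP('2023-01-31')
"""

# Standard SQL wildcard query: the trailing * matches date-sharded tables
# like events_20230101, events_20230102, ...; _TABLE_SUFFIX filters them.
wildcard_query = """
SELECT user_id, event
FROM `my_project.my_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230131'
"""

# Legacy SQL covers the same use case with TABLE_DATE_RANGE instead.
legacy_query = """
SELECT user_id, event
FROM TABLE_DATE_RANGE([my_project:my_dataset.events_],
                      TIMESTAMP('2023-01-01'), TIMESTAMP('2023-01-31'))
"""

print(partitioned_query, wildcard_query, legacy_query)
```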
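The Bigtable row-key advice above can be sketched in a few lines. This is a generic illustration, not Google's prescribed scheme: leading with a high-cardinality field spreads writes across tablets, and a reversed timestamp makes the newest rows sort first, so "latest N events" scans are cheap.

```python
def row_key(device_id: str, event_ts_ms: int) -> str:
    """Compose a row key that avoids hotspotting.

    Putting a raw timestamp at the FRONT of the key is the classic
    hotspotting mistake: all writes land on the same tablet. Leading with
    a high-cardinality field (here a hypothetical device_id) distributes
    the write load, and a reversed timestamp orders rows newest-first.
    """
    max_ts = 10**13  # chosen larger than any expected epoch-millis value
    reversed_ts = max_ts - event_ts_ms
    return f"{device_id}#{reversed_ts:013d}"

# Bigtable stores rows in lexicographic key order; after reversal,
# the most recent event (ts=3000) sorts first for this device.
keys = sorted(row_key("sensor-42", ts) for ts in (1000, 2000, 3000))
print(keys)
```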
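For the feature-cross bullet, a minimal illustration of the idea (the feature names are invented): every combination of two categorical features becomes its own synthetic feature, so a linear model can learn a separate weight per combination and capture interactions the raw features cannot express alone.

```python
from itertools import product

def cross(feature_a_values, feature_b_values):
    """Enumerate the vocabulary of a feature cross: each (a, b) pair
    becomes one synthetic categorical feature, e.g. latitude x longitude
    buckets crossed into 'grid cell' features."""
    return [f"{a}_x_{b}" for a, b in product(feature_a_values, feature_b_values)]

vocab = cross(["low_lat", "high_lat"], ["low_lon", "high_lon"])
print(vocab)  # 2 x 2 raw categories -> 4 crossed features
```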
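For the regularization bullet, a small NumPy sketch of what "increasing regularization" actually does, using closed-form ridge regression as the example (the data here is synthetic): a larger penalty shrinks the learned weights toward zero, which is the standard lever against overfitting.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge regression: minimizing ||Xw - y||^2 + lam*||w||^2
    gives w = (X^T X + lam*I)^-1 X^T y. Larger lam => smaller weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_weak = ridge_weights(X, y, lam=0.01)    # little regularization
w_strong = ridge_weights(X, y, lam=100.0)  # strong regularization
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```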
The Data Engineer exam was refreshed on March 29. Below are some key points and links extracted from notes that other test takers have posted.
Notes
- Cloud Composer: added to the new topics.
- No case studies in the new exam.
- BigQuery: streaming data, quotas and limits, ETL data verification, BigQuery ML, user-defined functions.
- Datastore: backup and migrate.
- ML: I heard there is a little more ML. Scaling TensorFlow.
- PubSub: migrating from Kafka, debugging via Stackdriver.
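As a rough illustration of the user-defined functions mentioned in the notes above, this is what a JavaScript UDF looks like in BigQuery standard SQL, shown here as a query string (the table name is hypothetical):

```python
# Hypothetical table name; the UDF trims and lowercases a string column.
udf_query = """
CREATE TEMP FUNCTION cleanse(s STRING)
RETURNS STRING
LANGUAGE js AS '''
  return s ? s.trim().toLowerCase() : null;
''';

SELECT cleanse(name) AS name
FROM `my_project.my_dataset.users`;
"""
print(udf_query)
```

BigQuery also supports UDFs written in SQL itself; JavaScript is the option for logic that is awkward to express in SQL.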
Preparation Strategy
I had been preparing for all the data-related exams together for around 2 months, so I took a parallel approach to studying for the certifications.
- Visit the Google Cloud Professional Data Engineer page on Google. This will give you an overview of the exam and what is required to complete the certification.
- Understand what the exam covers. Visit the exam guide to see the contents in detail; it gives you a solid picture of what you need to prepare.
- Go through the official Google Cloud Platform documentation to clearly understand each of the GCP services. That builds your GCP knowledge, but you still have one more essential factor to cover: hands-on experience.
Exam Experience
Since I don’t have much working experience in data, I had only a high-level understanding of the services and how they work. Maybe for that reason, the exam was very tough for me. I thought I was going to fail, as I found it difficult to comprehend the questions, match them with the options, and find the correct answer.
I had to re-read the questions and options before answering, and I flagged the questions I didn’t know. After the first pass, I came back to the flagged questions and spent time on them. Once everything was answered, I reviewed the questions again and, out of lingering doubt, changed some of my earlier answers. My heart was pounding the whole time as I neared submission; only after I saw the PASS result was I able to breathe in relief.
Resources
Sathish VJ’s AwesomeGCP Certification Repo
Cloud Skills Boost Learning Path
Google Cloud Practice Questions
GCP Data Engineer Coursera Course
Google Cloud video on Machine Learning
Official Study Guide by Dan Sullivan
Practice, practice, and practice…..
Notes from each of my exams
For those appearing for the various certification exams, here is a list of sanitized notes (no direct questions, only general topics) about the exam.
Notes from the Associate Cloud Engineer exam
Official Links
Main Link — https://cloud.google.com/certification/data-engineer
Topics Outline — https://cloud.google.com/certification/guides/data-engineer/
Practice Exam — https://cloud.google.com/certification/practice-exam/data-engineer
Stay tuned for the next blog post.
If you want to connect with me:
Linkedin: https://www.linkedin.com/in/vanamali-matha-811035232/
Twitter: https://twitter.com/Vanamalimatha32
GitHub Repo: awesome-GCP-certifications
A collection of posts, videos, courses, qwiklabs, and other exam details for all exams: GitHub
I hope this helps you in your preparation and to pass your exam. Thank you for reading. Perform well and all the best!