Notes from my Google Cloud Professional Data Engineer Exam
Immediately after each exam I do a memory dump as notes, so they are quite unordered. This is a sanitized list of the general topics and question areas I encountered. The intention is not to give you the questions, but to point you at topics you can prepare for. Some questions stumped me; hopefully you can be better prepared based on my experience. Wish you the very best!
Tough exam. I assumed this one would be easier because I had spent more time preparing and had the experience of the previous certifications. After the exam I went over the questions again to remind myself later which areas were covered; the answer is, everything. Zero direct questions. Every question was embedded in a situation/use case.
- BigQuery Data Transfer Service. I knew of the Storage Transfer Service and BigQuery connectors, but I went ‘huh?’ when I saw this.
- IAM + Dataflow. There is an IAM role (Dataflow Developer) that lets developers work with pipelines without having access to the data.
- IAM + BigQuery. A handful of questions; the BigQuery-related ones were the most common. At least 2 or 3 questions had options on the access level via tables/datasets. Remember that you cannot restrict access at the table level; it is only at the dataset level. Also look up what Authorized Views are.
- BigQuery: partitioned tables. What are they partitioned on (ingestion time, timestamp, date)? How are the partitions named? How are they then accessed in queries? Using _PARTITIONTIME.
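As a refresher, this is roughly what querying an ingestion-time partitioned table looks like (the project/dataset/table names here are hypothetical):

```sql
-- Filter an ingestion-time partitioned table on the _PARTITIONTIME
-- pseudo-column so BigQuery only scans the relevant partitions.
SELECT event_name
FROM `my_project.my_dataset.events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2018-01-01')
                         AND TIMESTAMP('2018-01-31');
```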
- BigQuery. Syntax for wildcards in table names, e.g. ``FROM `bigquery-public-data.noaa_gsod.gsod*` ``. And in legacy SQL?
- BigQuery: table date ranges. Accessing date-sharded tables and partitioned tables with functions like TABLE_DATE_RANGE, _TABLE_SUFFIX, and TABLE_QUERY.
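A sketch of the two flavors side by side; the standard SQL query uses the public NOAA dataset mentioned above, while the legacy SQL dataset/table prefix is hypothetical:

```sql
-- Standard SQL: wildcard table plus a _TABLE_SUFFIX filter.
SELECT max
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '1940' AND '1944';

-- Legacy SQL: TABLE_DATE_RANGE over date-sharded tables
-- named with a prefixYYYYMMDD pattern.
SELECT name
FROM TABLE_DATE_RANGE([mydataset.people],
                      TIMESTAMP('2018-01-01'),
                      TIMESTAMP('2018-01-07'));
```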
- Cloud Spanner: secondary indexes. How indexes are created for you, and how you can create secondary indexes.
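Spanner automatically indexes the primary key; a secondary index is explicit DDL, along these lines (table/column names are hypothetical):

```sql
-- Cloud Spanner DDL: secondary index on non-key columns.
CREATE INDEX SingersByName ON Singers(LastName, FirstName);
```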
- Datastore: multiple indexes for datastore. Default indexes. Syntax for creating custom, composite indexes.
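Composite Datastore indexes are declared in an `index.yaml` file and deployed with `gcloud datastore indexes create index.yaml`; a minimal sketch (kind and property names are hypothetical):

```yaml
# index.yaml: one composite index combining an equality filter on `done`
# with a descending sort on `priority`.
indexes:
- kind: Task
  properties:
  - name: done
  - name: priority
    direction: desc
```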
- BigTable: row key scheme. What are the recommended ways for creating the row key? How do you avoid hotspotting? Should you use timestamp, and where?
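A minimal sketch of one recommended pattern, assuming a hypothetical IoT-style workload: promote a field (the device id) to the front of the key so reads for one device are contiguous, and never lead with a raw timestamp, which sends all writes to a single node (hotspotting). The reversed timestamp makes the newest readings sort first within a device.

```python
from datetime import datetime, timezone

# Large constant (in seconds) used to reverse the timestamp so that
# newer readings sort lexicographically before older ones.
MAX_TS = 10_000_000_000

def row_key(device_id: str, ts: datetime) -> str:
    # Field promotion: device id first groups a device's rows together
    # and spreads writes across devices instead of hotspotting on time.
    reversed_ts = MAX_TS - int(ts.timestamp())
    return f"{device_id}#{reversed_ts:011d}"

key = row_key("sensor-42", datetime(2018, 6, 1, tzinfo=timezone.utc))
print(key)  # → sensor-42#08472188800
```

The zero-padding matters: Bigtable sorts row keys as byte strings, so without a fixed width the numeric ordering would break.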
- BigTable: ways to optimize.
- PubSub, Dataflow, Dataproc — properties and uses of these products. No direct questions, but applied to a scenario. The courses from Coursera, Linux Academy, and Cloud Academy cover these well.
- Dataproc: usage of GCS instead of the cluster file system. It is a best practice to use Google Cloud Storage instead of HDFS; you can then destroy the compute nodes after data crunching and save their cost.
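The pattern looks roughly like this as gcloud commands (cluster, bucket, and jar names are hypothetical): the job reads and writes `gs://` paths directly, so nothing is lost when the cluster is deleted afterwards.

```
gcloud dataproc clusters create my-cluster --region=us-central1
gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
    --jars=gs://my-bucket/wordcount.jar --class=com.example.WordCount \
    -- gs://my-bucket/input/ gs://my-bucket/output/
gcloud dataproc clusters delete my-cluster --region=us-central1
```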
- BigQuery + Data Studio — caching/pre-fetch cache. Learn how you connect Data Studio to storage solutions. Learn the difference between default caching (which cannot be disabled) and pre-fetch caching (which can be disabled). What is the difference between doing that with Viewer credentials and Owner credentials?
- Dataprep: jobs. How are Dataprep jobs created and run? What permissions do you need? A term I saw was that this is a more ‘casual’ way of data cleaning. Dataproc/Dataflow would be more programmatic and therefore ‘intense’, I suppose.
- DataStudio: visualisation. What are the causes of stale data? And how do you get the latest? What caching options do you need to set?
- Machine Learning: feature crosses. I can’t give more information than that or I would be revealing the question. Learn what these are and what problems they solve.
- Machine Learning. There was one more question on feature crosses and computed features. It was a direct lift from the Coursera material on ML.
- Machine Learning: Dealing with overfitting.
- Machine Learning: Regularization. One option that confused me was ‘increase regularization’. What does it mean to increase or decrease regularization? Increase or decrease the values or increase or decrease the number of parameters to be regularized? You might want to find out what this means. I personally think this was confusingly worded and I randomly picked between the two options.
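For what it’s worth, in the usual L2 formulation the regularization strength is a single coefficient on the penalty term, and “increase regularization” conventionally means increasing that coefficient, not the number of regularized parameters:

```latex
% L2-regularized training objective: the loss plus a penalty on the
% weights, weighted by \lambda. Increasing \lambda increases regularization.
\min_{w} \; \mathcal{L}(w) + \lambda \sum_{i} w_i^2
```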
- Dataproc: how to control scaling? Configure autoscaling? I might not be paraphrasing this question correctly, but I was confused by the options on this. When we set autoscaling, should we set the number of workers, the maximum number of workers, or neither? (Or was it nodes?) I did not know the answer to this one exactly.
- Avro file format. This came up multiple times in options and questions. Look up what it is, and know that it is a compressed format. Also, BigQuery/Dataflow can work with it directly.
- I noticed at least 2 questions where the options were flawed by incompleteness. E.g. three requirements need to be met; one option meets only two of them but follows GCP recommendations, while another covers all three requirements but is blatantly wrong in its approach. I read the questions over and over again looking for clear clues on which option to choose, but I saw nothing. I honestly have no suggestion if you face such a situation.
- There was one question where I had to choose from a list of non-GCP products, e.g. Redis, Cassandra, HBase with Hive, MySQL, etc. So this required knowing a bit about other technologies and their storage/query formats as well. Looks like just by-hearting GCP won’t cut it.
- Key Management Service. This question was about using KMS with non-GCP products. Note the three tiers: default encryption where Google manages all the keys, customer-managed encryption keys (CMEK), and customer-supplied encryption keys (CSEK).
- BigQuery query plan. BigQuery allows you to see the query plan and execution profile for queries that you run. Know the phases, difference between average and max time, why there can be skew in the plan, and how to optimize for it.
- BigQuery + GCS. Know how to link tables between GCS and BigQuery as permanent tables and temporary tables.
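One current way to define a permanent external table over GCS files is DDL like the following (dataset, table, and bucket names are hypothetical; this also ties in with the Avro point above, since BigQuery reads Avro directly):

```sql
-- Permanent external table backed by Avro files in a GCS bucket.
CREATE EXTERNAL TABLE my_dataset.ext_sales
OPTIONS (
  format = 'AVRO',
  uris = ['gs://my-bucket/sales/*.avro']
);
```

Temporary external tables, by contrast, are defined per-query (e.g. via a table definition passed to `bq query`) and last only for that query.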
- About 8 questions were from the case studies, Flowlogistic and MJTelco. You don’t have to by-heart them, but study them well. Work through the solutioning for them by yourself. The Linux Academy course has a module that goes over the case studies.
- BigQuery. Know what a federated table is. While you are at it, also learn about clustered tables.
Notes from each of my exams
For those appearing for the various certification exams, here is a list of sanitized notes (no direct question, only general topics) about the exam.
Wish you the very best with your GCP certifications. You can reach out to me at LinkedIn and Twitter, especially for training for the certifications, short term consulting on GCP, and anything related to GoLang.