Google Cloud Professional Data Engineer — Roadmap for preparation
After being certified as a Google Cloud Professional Architect, I wanted to keep the momentum going and take on the “Google Cloud Certified - Professional Data Engineer” certification as well. Realistically, it took me about 1.5 months (alongside my full-time job) to prepare for the cert and feel confident enough to give it a shot.
For those of you familiar with Google’s Cloud Architect exam: in my opinion, the Data Engineer exam questions are slightly more difficult, but the scope is much smaller. The Data Engineer certification still covers a wide range of subjects, including Google Cloud Platform data storage, analytics, machine learning, and data processing products. Here is what is covered:
Cloud Storage and Cloud Datastore
Surprisingly, these products are not covered much in the exam, perhaps because they are covered more extensively in the Cloud Architect exam. Just know the basic concepts of each product and when it is appropriate (or not appropriate) to use each product, and you should be well covered.
Cloud SQL
There were surprisingly few questions on this product in the exam. If you have practical experience using the product, you should be able to answer any questions that come up. As with questions related to other data storage products, be sure to know in which scenarios it is appropriate to use Cloud SQL and when it would be more appropriate to use Datastore, BigQuery, Bigtable, etc.
Bigtable
This product is covered quite extensively in the exam. You should at least know the basic concepts of the product, such as
- how to design an appropriate schema and row key
- instances, clusters, and nodes
- whether Bigtable supports transactions and ACID operations
- the cbt command-line tool
- Schema design for time series data
- Access control in Bigtable
- what the size limits for Bigtable are (cell and row size, maximum number of tables, etc).
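To make the row key and time series points above more concrete, here is a minimal sketch of one commonly described pattern: lead the key with an identifier and append a reversed, fixed-width timestamp, so that rows for one device sit together and the newest reading sorts first. Bigtable sorts rows lexicographically by row key; the `make_row_key` helper, the `sensor-42` device ID, and the `MAX_TS_MS` bound are all hypothetical names chosen for illustration, not part of any Bigtable API.

```python
# Assumed upper bound on millisecond timestamps; it only serves to keep
# the reversed timestamp at a fixed width of 13 digits.
MAX_TS_MS = 10**13


def make_row_key(device_id: str, timestamp_ms: int) -> str:
    """Build a key of the form '<device_id>#<reversed timestamp>'.

    Leading with the device ID (rather than the timestamp) spreads writes
    from different devices across the key space. Reversing the timestamp
    makes newer readings sort before older ones for the same device.
    """
    if not 0 <= timestamp_ms < MAX_TS_MS:
        raise ValueError("timestamp out of range")
    reversed_ts = MAX_TS_MS - 1 - timestamp_ms
    # Zero-pad so that string order matches numeric order.
    return f"{device_id}#{reversed_ts:013d}"


timestamps = (1500000000000, 1500000060000, 1500000120000)
keys = [make_row_key("sensor-42", ts) for ts in timestamps]

# Plain string sorting stands in for Bigtable's lexicographic row order.
sorted_keys = sorted(keys)
```

Whether a reversed timestamp is the right choice depends on the read pattern; if you mostly scan in chronological order, a plain zero-padded timestamp after the identifier would fit better.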
BigQuery
BigQuery is covered in great detail in the exam. If you know BigQuery well, you will be able to answer approximately 40% of the questions in the exam. You should know about:
- the basic capabilities of BigQuery and what kind of problem domains it is suitable for.
- BigQuery security and the level at which security can be applied (project and dataset level, but not table or view level)
- Partitioned tables, wildcard queries (“backtick” syntax)
- Views and their usage scenarios
- Importing/Exporting data to/from BigQuery
- the methods available to connect external systems or tools to BigQuery for analytics purposes
- how the BigQuery billing model works, and who gets billed when queries cross project and billing account boundaries
- Access control in BigQuery
- BigQuery best practices
- Query plan explanation
- Legacy vs. Standard SQL
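As a small illustration of the wildcard and backtick syntax mentioned in the list above, the sketch below builds a Standard SQL query over date-sharded tables and narrows the scan with the `_TABLE_SUFFIX` pseudo column. The project, dataset, and table prefix (`my-project`, `logs`, `events_`) and the `wildcard_query` helper are hypothetical; the code only assembles a query string and does not call BigQuery.

```python
def wildcard_query(project, dataset, prefix, start_suffix, end_suffix):
    """Return a Standard SQL query over all tables matching `prefix*`.

    In Standard SQL the wildcard table name is wrapped in backticks, and
    _TABLE_SUFFIX holds the part of each table name matched by the '*'.
    """
    table = f"`{project}.{dataset}.{prefix}*`"
    return (
        "SELECT COUNT(*) AS row_count\n"
        f"FROM {table}\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start_suffix}' AND '{end_suffix}'"
    )


# Hypothetical daily shards named events_20170101 ... events_20170131.
sql = wildcard_query("my-project", "logs", "events_", "20170101", "20170131")
print(sql)
```

String formatting is used here purely to show the query shape; in real code, values that come from users should be passed as query parameters instead.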
Pub/Sub
The exam contains lots of questions on this product, but all reasonably high level so it’s just important to know the basic concepts (topics, subscriptions, push and pull delivery flows, etc). Most importantly you should know when it is appropriate to introduce Pub/Sub as a messaging layer in an architecture, for a given set of requirements.
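If the topic and subscription model feels abstract, the toy in-memory model below may help. It is not the Pub/Sub client library, and every class and method name is made up for illustration. It only shows two ideas: each subscription gets its own copy of every published message, and a pulled message stays outstanding until it is acknowledged.

```python
from collections import deque


class Subscription:
    """Toy subscription: a private backlog plus unacknowledged messages."""

    def __init__(self, name):
        self.name = name
        self.backlog = deque()
        self.unacked = {}
        self._next_ack_id = 0

    def pull(self):
        """Return (ack_id, message), or None if the backlog is empty."""
        if not self.backlog:
            return None
        message = self.backlog.popleft()
        ack_id = self._next_ack_id
        self._next_ack_id += 1
        self.unacked[ack_id] = message
        return ack_id, message

    def ack(self, ack_id):
        """Mark a pulled message as processed."""
        self.unacked.pop(ack_id, None)

    def nack(self, ack_id):
        """Put a pulled message back on the backlog for redelivery."""
        message = self.unacked.pop(ack_id, None)
        if message is not None:
            self.backlog.append(message)


class Topic:
    """Toy topic: fans each published message out to all subscriptions."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = []

    def subscribe(self, sub_name):
        sub = Subscription(sub_name)
        self.subscriptions.append(sub)
        return sub

    def publish(self, message):
        for sub in self.subscriptions:
            sub.backlog.append(message)


topic = Topic("orders")
billing = topic.subscribe("billing")
analytics = topic.subscribe("analytics")
topic.publish("order-1")

ack_id, msg = billing.pull()
billing.ack(ack_id)
```

Here `billing` and `analytics` each receive `order-1` independently, which is the decoupling that makes a messaging layer attractive when several consumers need the same event stream.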
Apache Hadoop
Technically not part of Google Cloud Platform, but there are a few questions around this technology in the exam, since it is the underlying technology for Dataproc. Expect some questions on what HDFS, Hive, Pig, Oozie, or Sqoop are, but basic knowledge of what each technology is and when to use it should be sufficient.
Cloud Dataflow
Lots of questions on this product, which is not surprising as it is a key area of focus for Google with regard to data processing on Google Cloud Platform. In addition to knowing the basic capabilities of the product, you will also need to understand concepts like:
Cloud Dataproc
Not many questions on this besides the Hadoop questions mentioned above. Just be sure to understand the differences between Dataproc and Dataflow and when to use one or the other. Dataflow is typically preferred for new development, whereas Dataproc would be required when migrating existing on-premises Hadoop or Spark infrastructure to Google Cloud Platform without redevelopment effort.
TensorFlow, Machine Learning, Cloud DataLab
The exam contains a significant number of questions on this. You should understand all the basic concepts of designing and developing a machine learning solution on TensorFlow, including concepts such as data correlation analysis in Datalab, and overfitting and how to correct it. Detailed TensorFlow or Cloud ML programming knowledge is not required, but a good understanding of machine learning design and implementation is important.
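For the overfitting concept, one way to picture it is a training loss that keeps falling while the validation loss starts rising. The sketch below applies a simple early-stopping rule to made-up loss curves; it uses no TensorFlow, and the `early_stopping_epoch` helper, the `patience` value, and the numbers are assumptions for illustration only. Early stopping is just one remedy; others include regularization, dropout, and more training data.

```python
def early_stopping_epoch(val_loss, patience=2):
    """Return the epoch with the best validation loss once the loss has
    failed to improve for `patience` consecutive epochs, else None."""
    best = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_loss):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch
    return None


# Made-up curves: training loss keeps improving, validation loss turns
# around after epoch 2, which is the typical overfitting signature.
train_loss = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_loss = [0.95, 0.70, 0.58, 0.61, 0.66, 0.72]

stop_epoch = early_stopping_epoch(val_loss, patience=2)
print(stop_epoch)  # 2
```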
Stackdriver
A surprising number of questions on this, given that Stackdriver is more of an “ops” product than a “data engineering” product. Be sure to know the sub-products of Stackdriver (Debugger, Error Reporting, Alerting, Trace, Logging), what they do, and when they should be used.
Data Studio
Not many questions on this beyond caching concepts and setting up metrics, dimensions, and filters in a report.
How Do I Prepare?
Here are the reference courses and materials that I went through:
- Coursera — Data Engineering on GCP Specialization Course
This course is divided into 5 modules of increasing complexity. The courses are led by Valliappa Lakshmanan from Google, who does a great job overall. Each module starts with slides and discussion, followed by labs run through Google Codelabs, which is a free-to-use training platform for hands-on labs on Google Cloud Platform. I would highly recommend these labs.
Although this is the official Google course for the certification, it will not be enough on its own to pass.
- Linux Academy — GCP Data Engineer Course
This course is led by Matthew Ulasien, and he did really well with it. It is a must-do course for understanding the variety of questions you may encounter related to Google's official case studies. The quizzes are also good enough to test your readiness for the exam. I highly recommend this course.
- Cloud Academy — Data Engineer course
I used this as a refresher in the last week. It covers ML concepts and the related questions you may encounter very well.
- Official Case Studies
Google has officially shared 2 case studies. Read them thoroughly and prepare yourself for all possible questions that may appear. Approx. 20% of exam questions will come from these case studies.
- Official Practice exam
Once you feel confident enough to take the exam, work through this practice exam shared by Google.
Additional Resources
- Google Cloud Next 2017 videos for Big Data and Machine Learning
- Official Data Engineer exam page from Google
- Google Cloud blogs
- Google Cloud blogs on Medium
- Sample solutions given by Google
Best of luck with your preparation and exam :-)