Google Cloud Professional Data Engineer Certification
Experience from August 2020
On August 24th I successfully passed the Google Cloud Professional Data Engineer exam. During preparation I came across many similar articles, which were quite helpful, so here is my contribution with recent experience. Cloud platforms and Data Engineering are fast-evolving fields, and I saw some changes in the exam versus what I had been preparing for.
There are many articles on WHY and WHAT, so I will just explain my motivation as perhaps it will resonate with someone.
Wikipedia states that Data Engineering is a sub-domain of Data Science focused on data infrastructure. I’d further describe it as a mix of three disciplines: Cloud Architecture, DevOps and AI / Machine Learning. It is a fast-growing field, which so far remains shadowed by the hype around the Machine Learning and AI sub-domains of Data Science. However, as more organizations build data-driven products, more people start to feel the pain of getting data to the right quality, into the right systems, and of building a resilient model pipeline. As we get more ML models, industrialization becomes more critical. So, for me, coming from an ML background and working on data-driven products, the primary motivation was to understand the whole pipeline, from ingesting data from IoT sensors to producing business insights for users.
There is an alternative angle as well, if you come from a Cloud Architect or DevOps background. A lot of ML/AI logic is now being abstracted away by cloud providers’ generic AI offerings (e.g. GCP AutoML and the Computer Vision and NLP APIs) and is rapidly becoming a commodity. Having a good data set is now the main problem, not creating the model (well, to be honest, it was always like this). For folks with strong DevOps or architectural experience, Data Engineering may be an opportunity to move into Data Science roles by adding some AI/ML knowledge to the mix.
Let me also address here the potential question of “why Google Cloud and not <insert your favorite provider>?”. There is not a lot of variety in data processing pipelines / products. In fact, most GCP Big Data products are managed versions of open-source technologies (Dataflow runs Apache Beam pipelines and Dataproc is managed Hadoop/Spark, for instance). As a result, your knowledge should transfer well to other platforms if you map GCP products to their counterparts from other providers.
Background and Experience
Let’s talk about the value of experience. The official exam guide recommends “3+ years of industry experience including 1+ years designing and managing solutions using GCP”.
If I use the formula stated above, where Data Engineering is a mix of Cloud Architecture, DevOps and ML/AI, then my only directly relevant experience was 6 years of ML practice and training, on top of a PhD involving a good deal of regressions and statistics 15 years ago. You can expect the exam to have 5–8 questions on ML/AI aspects, but generally they won’t be too deep. As a result, if you come from an ML background, you can expect up to 15–20% of the exam to be easy. That said, I still encourage ML practitioners not to take it for granted and to have a deep look at the GCP AI/ML products to understand their specifications and use cases.
On the Architecture side, I worked closely with several teams developing on GCP over the last few years. Through daily interactions I picked up a sense of patterns and anti-patterns. After some time you realize that data pipelines are not that diverse; in fact, there are only a few recommended patterns, so if you have seen a couple of working systems on GCP you should have no problem building the right mental model.
Finally, the DevOps aspect was probably the most challenging for me: IAM, deployment, logging and monitoring all required quite some time to study.
So, the key takeaway is to leverage your background. Keep the three areas above in mind, know your strengths and weaknesses, and create a study plan accordingly.
Courses
Courses may provide a foundation, but don’t expect them to be sufficient alone to pass the exam. I think at best you can become 60–70% ready for the exam after courses alone.
Coursera Data Engineering Specialization
The specialization consists (at the time of writing) of six two-week courses. I don’t think you can sign up for all of them at the same time, so even if you are aggressive it will probably take you at least 6 weeks to finish the specialization. Coursera’s monthly cost is $50.
Overall I found it to focus too much on general patterns; it won’t give you the level of depth you will need for the exam. It was also quite outdated when I took it, which, by the way, is likely a problem for all courses at the moment. Coursera partnered with Qwiklabs which, if you are not familiar, provides a controlled environment to play with cloud platforms and also has a built-in grader. These labs are a great way (in theory) to get practical knowledge, but I found most of them to be fairly basic: you don’t have the opportunity to experiment, you are expected to follow the script like a monkey (and the grader does not like it if you deviate), and the labs are full of bugs. On three occasions I had major issues with labs. Once I waited more than a week for Qwiklabs to fix a broken lab grader so I could pass the course and move on, and everyone had the same issue.
The one main (only?) highlight, in my opinion, is the first course in the specialization, which is on Machine Learning. It was a great overview, and even after a dozen ML courses on multiple MOOC platforms I found some of its explanations very intuitive, succinct and well fitted to the exam scope. So I definitely recommend it if you are relatively new to ML and don’t want to go through longer introductory courses like Andrew Ng’s, for instance.
Linux Academy (now merged with A Cloud Guru)
I saw someone recommending this platform and I wish I had seen it before spending time (and money) on the Coursera specialization. If you only take one course, I highly recommend this platform. Matthew Ulasien did a fantastic job as instructor, achieving very high information density. The course covers more products, goes into greater depth than Coursera, and is a bit more up to date. The labs are less fancy than Qwiklabs (still a monkey script), but they actually work and give you exposure to a wide variety of products.
Other benefits of this platform include: a full-duration 2-hour practice exam with 50 questions (I’ll talk about practice exams in the next section), a 100+ page Data Dossier PDF with all the notes, which you can download (the platform has an interactive version as long as you are an active subscriber), and a dozen user-created Anki/flashcard collections (an active subscription is needed for these as well).
Linux Academy is $50/month. The course has about 12 hours of videos, split by product (e.g. BigQuery, Bigtable) or concept (Machine Learning / AI). Overall, this was money well spent, and if I decide to continue with cloud certifications I’ll likely stick to this platform.
Preparation
I’ll list some resources which I used after courses to finish my preparation.
Cheat Sheets / Linux Academy Data Dossier
There are a dozen GCP Data Engineering cheat sheets available on GitHub, SlideShare and similar resources. In combination with the Linux Academy Data Dossier, they are great for polishing and memorizing the key concepts behind each product. However, keep in mind that their level of detail is lower than what you get in the exam, which is what you would expect from a cheat sheet.
Linux Academy Flashcards / Anki cards
This is one of the most useful features of Linux Academy. There are about 6–8 decks available, some with up to 300 cards. You can rank them by popularity; the top 2 or 3 are definitely worth studying. I had a very busy schedule during preparation, with several multi-day pauses, so these cards helped me refresh my memory. They have the right level of detail, but won’t cover all products. Also be aware that quite a few of them are outdated (e.g. BigQuery now allows table-level permissions, and Bigtable no longer requires three nodes for production).
Practice Exams
There are only a couple of practice exams available: an official practice exam from Google (not a full 50 questions), a short practice exam within the Coursera specialization, and a full 50-question exam on Linux Academy. There may be others on other MOOC platforms, but I found a lot of overlap between the three I mentioned, so don’t expect other practice exams to be very different. I suspect that both Coursera and Linux Academy used some questions from the official practice exam and/or from each other.
As such, I suggest using the practice exams strategically; in a way, they should become your training, validation and test data sets. I took the Coursera exam first. I deliberately did not take the Linux Academy exam immediately after finishing the course; I studied for several days and only then took it. Finally, I left the Google practice exam for the end and took it a few days before the real exam, to confirm that I could handle unseen questions with a good level of comfort.
Syllabus
Going through a detailed syllabus is also a great way to assess your readiness. There are several extended versions available, which I won’t recreate here. The idea is to go over every item and assess your knowledge of use cases, limitations and key specifications.
GCP Product How-To Guides
Well, this is trivial advice, but you will actually need to study the official product pages. Read the How-To guides. Pay attention to Beta and General Availability announcements. Pay attention to product name changes as well. Pay attention to quota and configuration changes.
Taking Exam
Due to the certification terms and conditions, I cannot say too much about the exam or provide a “brain dump”. Overall, I have seen people estimate the exam to be about 20% harder than the practice exams, and I agree.
Some tips:
- Scan for non-functional requirements, like “cheap, secure, no extra development, highly available”, to pick the best answer. Remember that in most questions almost all the answers will work; you have to pick the recommended one, the one that meets all the non-functional requirements.
- Keep in mind Google’s recommendations around IAM, such as using predefined roles and service accounts. Overall, invest some time in understanding IAM.
- Managed services vs. self-managed ones: in almost all cases the managed ones are preferred (at the end of the day, it is a certification from a cloud provider that promotes its own products).
- Definitely familiarize yourself with newer offerings like Cloud Composer and Dataprep.
- Stackdriver was rebranded as Google Cloud’s operations suite (Cloud Logging and Cloud Monitoring), and there are new recommendations for the logging and monitoring agents.
- Check the GCP DevOps offerings, namely Cloud Deployment Manager.
- Understand the differences between BigQuery and Bigtable; with streaming inserts and updates (DML) in BigQuery, the boundary between the two is getting blurry, especially when your streaming volume is not huge.
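On the Bigtable side, a recurring exam theme is row-key design: sequential keys (e.g. bare timestamps) funnel all writes to one node and cause hotspotting, so the usual recommendation is to promote an identifier to the front of the key and put the timestamp last. Here is a minimal sketch of that pattern; the field names (`device_id`, sensor values) are my own illustrative choices, not anything from the exam:

```python
# Sketch of the Bigtable "field promotion" row-key pattern for
# time-series data: identifier first (spreads writes across nodes),
# zero-padded timestamp last (rows for one device sort chronologically,
# so prefix range scans stay cheap). Names here are illustrative.

def row_key(device_id: str, ts_epoch_seconds: int) -> str:
    """Build a row key like 'sensor-a#001598227200'.

    Zero-padding the timestamp matters because Bigtable sorts row keys
    lexicographically, not numerically.
    """
    return f"{device_id}#{ts_epoch_seconds:012d}"

# Writes arriving at the same moment from different devices no longer
# share a key prefix, so they land on different parts of the keyspace:
keys = sorted(
    row_key(d, t)
    for d in ("sensor-a", "sensor-b")
    for t in (1598227200, 1598227260)
)
```

The same reasoning shows up in reverse for BigQuery, where there is no row key at all and you optimize with partitioning and clustering instead.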
Good luck!