From zero to GCP Professional Data Engineer
A step-by-step guide on how to study for the GCP Professional Data Engineer certification
“Every kid coming out of Harvard, every kid coming out of school now thinks he can be the next Mark Zuckerberg, and with these new technologies like cloud computing, he actually has a shot.” ~ Marc Andreessen, Co-founder of Netscape, Board Member of Facebook
Isn’t this quote already enough to persuade you to start studying how to leverage cloud computing? Of course, you will need extraordinary determination, out-of-the-ordinary intelligence and a little bit of luck, besides mere access to a cloud, to be the next Zuckerberg… but you have to start somewhere!
Indeed, cloud computing provides you with an unbelievable amount of high-quality resources, which you can leverage to build your system, application or space shuttle by taking advantage of tech colossuses such as Google.
Google Cloud Platform (GCP) is a suite of cloud resources offered by Google, which provides developers with a vast range of services for computing, storage, analytics and much more. A subset of these services is particularly relevant for an aspiring Data Scientist or Data Engineer (to name a few: Cloud Storage, BigQuery, Dataflow and many more), as they can easily be used to build flexible and complex pipelines for data collection, ingestion, processing and storage. GCP Professional Data Engineer is a certification attesting that you know these GCP products and are able to use them at a professional level, including plenty of subtle details. You can earn it by taking a not-so-trivial exam, containing questions about how the services work, how they interact and how they meet certain business requirements.
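To make this a little more concrete, here is a minimal Python sketch of two of those building blocks: staging a raw file in Cloud Storage and then querying a table from BigQuery with the official client libraries. All the bucket, project, dataset and table names are placeholders I made up for illustration, not real resources.

```python
# Minimal sketch of two common pipeline building blocks on GCP:
# staging a file in Cloud Storage and querying a table in BigQuery.
# All resource names below are hypothetical placeholders.
from google.cloud import bigquery, storage

# 1. Stage raw data in a Cloud Storage bucket (collection / ingestion).
storage_client = storage.Client()
bucket = storage_client.bucket("my-raw-data-bucket")            # hypothetical bucket
bucket.blob("sales/2024/orders.csv").upload_from_filename("orders.csv")

# 2. Analyze the data with a standard SQL query in BigQuery (processing / analytics).
bq_client = bigquery.Client()
query = """
    SELECT customer_id, SUM(amount) AS total_spent
    FROM `my-project.sales_dataset.orders`   -- hypothetical table
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
"""
for row in bq_client.query(query).result():
    print(row.customer_id, row.total_spent)
```

The exam cares much less about this client-library syntax than about knowing when each service is the right choice, but seeing the pieces in code helps make the architecture questions less abstract.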
In the rest of this blog post, I will talk about my personal experience with this exam, from where I started to when I passed it, guiding you through each step. Before starting, however, I would like to point out the following: this is not a guide on how to be a pro at GCP. Even if you follow it thoroughly, you will still be missing some hands-on experience; the exam itself requires very little practical knowledge to pass. That said, getting your hands dirty with the GCP services is something I strongly recommend in order to better understand how they work.
About the Exam
The Exam Guide provides a (broad) description of the skills and knowledge expected from a Data Engineer in order to pass the certification. Indeed, the goal of the exam is to certify that you are able to leverage Google products to build data pipelines. The products I am talking about are, among others, BigQuery, Cloud Storage, Cloud SQL, Cloud Spanner, Dataflow, Dataproc, AI Platform (…and more), in addition to some other services that support your operations on the cloud (such as Stackdriver, Cloud Composer…).
The exam is composed of 50 multiple-choice questions, to be completed in 2 hours, with no public passing score: it’s just Pass or Fail (though I suppose the passing score is somewhere around 80%, so pretty high). Rather than asking for silly definitions of what the Google products do, the questions are more likely to pose small use cases, asking you for the best way to tackle the problem using one (or more) Google products. That is why you should not study each product as if it were in an isolated compartment, but rather think about how you can combine them to solve a task. Usually, the use cases will ask you to optimize one among cost, performance and speed-to-production, and that is what makes the difference in choosing one product over another.
My background
I must be honest: I lied in the title of this blog post. I was not actually starting from zero, as I have a Master’s in Computer Science from Politecnico di Milano. However, I had no experience with Google Cloud and had never really had the chance to take data engineering classes, at least not at the depth required to pass the certification. Nonetheless, I did take many Machine Learning courses (and the like), whose notions were quite helpful for some questions. It must be said that the ML questions of the exam are typically pretty standard and trivial, so even a superficial knowledge of the subject is enough. Being already familiar with the main concepts of data storage and processing will help you a lot… anyway, I think that if you are about to get this certification, you already have some experience with these topics.
As far as I am concerned, it took me roughly 1 month to study for the exam, being fully committed to it. It may take a bit longer if you are busy working on something else at the same time.
Let’s start
My first step was to get a flavour of what the Google Cloud Platform is and what kind of tools are at a Data Engineer’s disposal. That is why I started with the following course on Coursera.
1. Data Engineering with Google Cloud Professional Certificate
Cost: $49 per month (but the 7-day trial is enough…)
This course provides a broad overview of each of the products (or most of them) you need to know for the exam. Furthermore, it guides you through some practice sessions, with the help of a platform called Qwiklabs.
It is a worthwhile course if your aim is to get a flavour of what each service does, but it is far from enough to earn the certification.
2. Get a taste of what the exam is like
After this course, I wanted to measure the level of knowledge I had acquired. You will notice that the small quizzes provided in the Coursera course are pretty easy. That is why you had better challenge yourself with something more serious, such as the Google Official Practice Exam.
Spoiler: it will be hard, but do not be discouraged. You will notice that most of the questions are not covered in the course you have just taken. However, it helps you understand where your weakest spots are.
3. Google Certified Professional Data Engineer
Cost: $31.59 per month (but the 7-day trial is enough…)
Once I realized what kind of knowledge I needed to acquire, I enrolled in this course, without actually taking all the lessons. At this point, I had a clear idea of what the hottest topics were and which ones I needed to know more about. Hence, I only took the classes on the topics I was unsure about (such as Dataflow or Dataproc).
Furthermore, at the end of the course you are given a final quiz which is pretty similar to the real exam in terms of content and complexity.
Again, this is still not detailed enough to give you deep knowledge of everything, but it takes you one step further.
4. Practice like hell
At this point I was really sick of spending my time in front of videos. And in fact, I had already gotten everything I could get from online courses.
If this is not the first blog post you have read about this certification, you must have already come across the following suggestion: read the documentation. Sure, that is definitely the ultimate resource for passing the certification, and I can guarantee that if you learn it by heart you will smash the exam. BUT… it is way too long to read it all!
So here is what I did. I took all the questions from the following sources:
In my opinion, these three sources contain questions which closely resemble the difficulty of the real exam. I went through all of those questions and answered each one of them. Whenever I got an answer wrong, I would note the specific topic down. Then, once I was done with those, I would go through the documentation on all the topics I had noted, and then go over all the questions again.
I suggest you iterate this procedure until you are confident about each answer you give and the reasoning behind it. This is the most important step. It is long, but it is definitely going to be worth it.
How do you know you are ready to take the exam? When you are able to answer a question without even reading the answer options.
Additional resources
Here are a couple of other resources I used to succeed at this exam:
- GCP Data Engineering study guide: this is a long (!) list of links to GCP documentation, containing only the pages relevant to the GCP exams. It is not strictly required to go through all of them, but if you have time…
- GCP Data Engineering cheat-sheet: this cheat-sheet summarizes the main concepts of each product in a few pages. It is really useful in the days before the exam to keep in mind the most important details and differences between the services.
Final Remarks
Even after going through everything 100 times, you will still miss something; this is a given. In fact, there are some tricky low-level questions whose answers are simply impossible to know without proper hands-on experience with GCP. Indeed, that is why Google recommends 3-4 years of data engineering experience to be fully prepared.
However, I claim this path is enough to pass the exam. I repeat: it is enough to pass the exam. That does NOT mean it is enough to master GCP. Earning this certification is like passing the theory part of your driving licence exam: after that, you need to get your hands dirty to become proficient enough with a car, or with GCP in this case.
My final tip is the following: by going through question after question, you will see “associative rules” emerge. As an example, when a requirement asks for a real-time platform, it is very likely that the answer will involve Dataflow, since it is a data streaming tool (a minimal sketch of such a pipeline follows below). My tip is to learn these rules so that, when you encounter them again, you will already know the standard answer and only need to check whether the other options are truly wrong.
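For the curious, here is what that "real-time" association looks like in practice: a minimal Apache Beam sketch of the classic Pub/Sub → Dataflow → BigQuery streaming pattern that many exam scenarios hint at. The project, subscription, table names and the JSON message format are assumptions made up for illustration, not a definitive implementation.

```python
# Minimal sketch of the "real-time" pattern behind many exam questions:
# Pub/Sub -> Dataflow (Apache Beam) -> BigQuery. All resource names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                       # hypothetical project
    region="europe-west1",
    temp_location="gs://my-temp-bucket/tmp",    # hypothetical bucket
)
options.view_as(StandardOptions).streaming = True  # unbounded, real-time input

with beam.Pipeline(options=options) as p:
    (
        p
        # Read an unbounded stream of JSON events from Pub/Sub.
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # Group the stream into fixed 60-second windows before aggregating.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByUser" >> beam.Map(lambda event: (event["user_id"], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"user_id": kv[0], "events": kv[1]})
        # Append the per-window counts to a BigQuery table.
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.user_event_counts",
            schema="user_id:STRING,events:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```

You will not be asked to write this code at the exam, but recognizing the pattern (streaming input, windowing, sink) is exactly what those use-case questions are probing.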
This is it. That is how I succeeded, and I am truly confident you can make it as well. Best of luck!
This blog post is published by the PoliMi Data Scientists association. We are a community of students of Politecnico di Milano that organizes events and writes resources on Data Science and Machine Learning topics.
If you have suggestions or want to get in touch with us, you can write to us on our Facebook page.