A COMPREHENSIVE GUIDE FOR DATA PROFESSIONALS
Data Engineering on GCP Specialisation
From the Coursera Specialisation to the Certification Exam
If you are a data professional looking to upskill, there is no shortage of learning options. But if you are looking for ways to move your data and analytics to the Cloud, you can choose between only a limited number of public Cloud providers.
This guide focuses on Google Cloud, more specifically the Data Engineering on Google Cloud specialisation (formerly known as the Data Engineering on Google Cloud Professional Certificate), and provides you with up-to-date information and practical advice.
It is based on my own experience completing the specialisation along with input gathered from other data engineers also working in the field.
If you are interested in the Google Cloud Professional Data Engineer Certification Exam, the specialisation will likely be your starting point anyway; the certification exam is also discussed below.
The guide covers the following:
- Overview of the relevant learning options
- What the material covers
- Why consider the Data Engineering specialisation
- Review: What’s good/what’s missing/what are the highlights
- Practical tips for the specialisation
- Next steps after completing the specialisation
- The certification exam
After reviewing this information, you should be in a better position to decide if the course is right for you.
About the specialisation
First, let’s make sure that you are in the right place.
The Data Engineering on Google Cloud specialisation is one of several on-demand specialisations that belong to Google Cloud’s data track.
It has a “twin sibling”, a specialisation called Data Engineering, Big Data and Machine Learning on Google Cloud. Both specialisations received a major overhaul in February 2020. This guide applies to the redesigned versions of both of them. I will explain the minor difference between the two (and how to kill two birds with one stone, if that is something you want to do).
1. Learning options for Google Cloud
This section will help you put the specialisation into context and to see how it relates to the various learning options. With multiple alternatives available, finding the best learning path isn’t always obvious.
Options for Google Cloud include:
- Labs
- Quests
- Courses
- Specialisations
- Professional Certifications
Several topics, such as BigQuery, Data Studio and AI notebooks, are covered in multiple labs, courses and specialisations, though at varying degrees of depth.
For simplicity, think of a course as the standard unit you will use to meaningfully “learn something”. Then associate all other up- or downstream options to this course concept.
- Courses are produced by Google but made available via external providers, such as Coursera and Pluralsight.
- A course, along with the videos and reading materials, will give you access to several Labs, which will be assigned as part of the homework. These labs are hosted by Qwiklabs, a training company that was acquired by Google.
- If you want more practice, you can usually find additional labs related to the specific topic of interest on the Qwiklabs site. There are 400+ labs available covering virtually all the Google Cloud products.
- Many labs combine more than one Google Cloud product. These related labs form groups, the so-called Quests, which can have various levels of difficulty.
- A Specialisation is a collection of typically four or five courses that are based around a broader topic, such as data engineering.
- Once you have successfully completed the specialisation, you can start thinking about the corresponding Google Certification Exam (read more about it below).
2. What the material covers
At a high level
The specialisation teaches you how to design and manage data pipelines on the cloud: from ingesting data from various sources, transforming and storing it, to performing analytics for business insights or machine learning for predictions. It then teaches you how to package all these steps and automate the data pipeline. The workflows must be adaptable to any combination of data volume, velocity and variety.
Is the data arriving in batches or streaming in? Is it structured or unstructured? Is it small or extremely large? These are aspects of the job that a data engineer must know how to handle well, and they are all covered thoroughly in the course.
The end goal is to add value by enabling data-informed decision making for the business.
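To make the batch-versus-streaming point more concrete, here is a minimal sketch of a streaming pipeline of the kind the Dataflow labs walk through: it reads messages from Pub/Sub, applies a windowed transformation and writes the results to BigQuery. The project, topic and table names are placeholders I made up for illustration, not values from the course.

```python
# A minimal streaming sketch: Pub/Sub -> parse -> window -> BigQuery.
# Project, topic and table names are placeholders, not values from the course.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)  # add runner/project/region flags to run on Dataflow

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Window" >> beam.WindowInto(FixedWindows(60))  # one-minute fixed windows
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```

The same Beam code can run as a batch job simply by swapping the Pub/Sub source for a bounded one (for example a file in Cloud Storage), which is one reason Dataflow features so prominently in the course.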
The Products
When the visible final product is, for instance, an interactive real-time dashboard or an adaptive ML-driven web app, the data engineer is often the unseen hero who makes it all possible.
However, to achieve all this, one must be familiar with an increasing number of Google Cloud tools, which are gradually covered in the six courses that make up the specialisation.
Here are some of the key products that are covered:
Cloud Storage, BigQuery (including BigQuery ML), Data Studio, Dataflow, Dataproc, Pub/Sub, Composer, Data Fusion, Kubeflow, ML APIs, AutoML and AI notebooks.
If you want to know more about these products, there is a handy guide that describes each of them, and many other Google Cloud products, in four words or less.
Learning to work with all these products is like putting together a jigsaw puzzle: you need to decide which pieces fit best together to achieve a reasonable balance between performance, practicality of implementation and cost-effectiveness.
3. Why take the specialisation?
If you are moving professionally in the direction of data engineering (there are good reasons to do this), the answer to this question is obvious.
If you are wondering why to do it on the Cloud, the automation promise of No Ops/Serverless can be quite appealing.
Moreover, if you are a data analyst, data scientist or someone who leads teams of this type, this specialisation will be very relevant to you if you want to understand the spectrum of possibilities created when Cloud + AI + Data come together. If this describes you, I recommend checking out the three most relevant courses from the specialisation.
4. Review
4.1 Essential info about the specialisation
Level of Difficulty: Medium compared to other Coursera programming courses, as you don’t actually have to debug code.
Time commitment: For the same reason as above, you can complete the six courses fairly quickly if you want to. But it is not recommended to squeeze all the courses into one or two weeks. It is better to give yourself time to absorb the concepts and to practice as you go.
Grades: 80% or more is required to pass the quizzes and labs, but you can attempt them multiple times. The final result doesn’t contain a number or letter grade (it is just a pass).
Prerequisite knowledge:
- SQL and understanding of database management/Extract-Transform-Load/big data concepts and terminology
- Basic command line
- Basic knowledge of a scripting language (Python is used in some labs)
Familiarity with general Cloud (VMs, storage, etc.) and machine learning concepts is also useful, though the intro course will help you to get up to speed with these concepts.
Basic knowledge suffices. There is no need for deep expertise in any of the areas above.
4.2 What is good about it?
- High-quality production and easy-to-follow material, with a natural flow between the courses and the different sections.
- Contents updated in 2020, including some cool features that are still in alpha.
- Good balance between the concepts and the practical aspects.
- A lot of hands-on labs and demos in all courses.
- Context and perspective on how the various tools and technologies evolved over time.
- Everything is either in SQL or Python (no Java programming required, making the course a bit more inclusive).
- Tips and tricks on how to bring down the bill, especially for BigQuery.
4.3 What is missing?
- More opportunities to write code, with labs becoming more autonomous as one progresses.
- More active forum conversations.
- More detailed quizzes to help with better understanding of the concepts.
- More TensorFlow, at least an overview or introduction. Despite it being a signature Google project, TensorFlow is practically missing.
- More content that goes beyond the few lines of the SQL code needed to build a machine learning model. It should also cover the basics of the machine learning process, the possible pitfalls and the ways to evaluate the results, so that data engineers can apply it more confidently.
- More discussion of service costs and comparisons to other available options. For example, “How much would it cost on average to run the labs if done outside of the course?”
4.4 Highlights
These are my personal highlights about data analytics and engineering on Google Cloud based on my experience with the specialisation:
1. The ubiquity of BigQuery
BigQuery provides much more functionality than what you would expect from a data warehouse. It is present in every single course and connects to almost every stage of the data engineering pipeline. Its position will likely remain central as the boundaries between data lakes and data warehouses become blurry and traditional ETL starts to look more like ELT.
2. Accessibility to machine learning at multiple levels
A new reality has emerged in terms of the accessibility and scalability of machine learning, through products like BigQuery and BigQuery ML (a framework integrated into BigQuery), in combination with AutoML and the pre-trained ML APIs. There is now an option for developers, data analysts and data scientists to approach machine learning from different angles.
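To give a flavour of how low the barrier is, the sketch below trains a BigQuery ML model with nothing more than a SQL statement submitted through the BigQuery Python client. The dataset, table and column names are hypothetical and only meant to illustrate the pattern.

```python
# A sketch of training and querying a BigQuery ML model from Python.
# Dataset, table and column names are hypothetical, for illustration only.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.trip_duration_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['duration_minutes']) AS
SELECT duration_minutes, start_hour, day_of_week, distance_km
FROM `my_dataset.trips`
"""
client.query(create_model_sql).result()  # training runs entirely inside BigQuery

# Getting predictions is just another query.
rows = client.query("""
SELECT *
FROM ML.PREDICT(MODEL `my_dataset.trip_duration_model`,
                TABLE `my_dataset.new_trips`)
""").result()
```

The same statements can of course be run directly in the BigQuery console, which is exactly what makes this approach accessible to analysts who live in SQL.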
3. Automation of machine learning pipelines
After points 1 and 2, the next level of abstraction is reached via the managed experience of Kubeflow pipelines on the AI Platform. Kubeflow pipelines orchestrate BigQuery and BigQuery ML, as well as multiple other steps within the complete machine learning workflow (data preprocessing, feature engineering, model training and deployment), making it possible to productionise a machine learning solution that is both scalable and reusable.
Note: You receive only a small flavour of this during the course, but the prospect of achieving this without having to manage Kubernetes clusters is certainly promising.
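For orientation, here is a rough sketch of what a pipeline definition looks like with the Kubeflow Pipelines v1-style Python SDK. The component functions, names and paths are hypothetical placeholders rather than code from the course labs.

```python
# A rough, illustrative Kubeflow Pipelines definition (kfp v1-style SDK).
# Component logic, names and paths are hypothetical placeholders.
import kfp
from kfp import dsl
from kfp.components import func_to_container_op


@func_to_container_op
def prepare_data(source_table: str) -> str:
    # In a real pipeline this step might run a BigQuery ELT job.
    return source_table


@func_to_container_op
def train_model(features_table: str) -> str:
    # This step could submit a BigQuery ML or AI Platform training job.
    return "gs://my-bucket/models/latest"  # hypothetical model location


@dsl.pipeline(name="bqml-demo", description="Illustrative pipeline only")
def training_pipeline(source_table: str = "my_dataset.sessions"):
    prepared = prepare_data(source_table)
    train_model(prepared.output)


if __name__ == "__main__":
    # Compile to a package that can be uploaded to a hosted Pipelines instance.
    kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```

The appeal, as the course hints, is that the whole workflow is expressed in a few dozen lines of Python while the underlying Kubernetes plumbing stays out of sight.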
5. Practical Tips
1. Gain offline access.
If you want to work through the courses offline and undistracted, while also building a library with all relevant content, you can use Coursera-dl, an open source tool that automatically downloads all the available videos and slides.
2. Access to code.
If you don’t have time to complete the entire specialisation, but you still wish to see the examples and the code used, you can find all the code used in the labs, laid out much like a library of recipes, in this GitHub repo.
3. Access to resources
You are given 1.5–2 hours for each lab, which is typically more than enough time, and the remaining time can be used to experiment with Google Cloud without incurring charges on your credit card.
Combine slides, videos, demos, labs and code from the GitHub repos and organise it in a way that makes sense to you. This way, you can build a library of recipes that will help you tackle a number of common use cases — very helpful, especially during the first steps after the course.
4. Two certifications in one
Do you prefer your final certificate to say, “Data Engineering, Big Data, and Machine Learning on GCP” instead of “Data Engineering”? These two specialisations are practically the same. The only difference is that the data engineering specialisation contains one extra course that is meant to prepare you for the Professional Exam. Once you are done with the data engineering specialisation, you can enrol in the other one, and you will instantly receive the other certification as well (no extra cost and no extra courses or labs to take).
5. The first month is free.
If you don’t have a sponsor, Google often provides the first month free or deeply discounted via CloudOnAir webinar offers or other promotions, so look out for them. Alternatively, you can always audit a course before buying it (look for this option when you enrol, as it might not be very visible). In any case, it is great value for money, considering that the same content offered in an offline setting can be priced at over £/$2,000.
6. Next Steps — Certification Exam
Tips for success in the Professional Exam
The tips in this section were kindly contributed by certified engineer Suraj Pabari.
Passing the six courses of the specialisation certainly doesn’t guarantee that you will be proficient in the topic.
After finishing the specialisation and accumulating substantial hands-on experience, a possible next step is to prepare for the Professional Data Engineer Exam. This is a formal qualification that demonstrates your experience to both your clients and employers.
How do you prepare for the exam after completing the specialisation?
Three great tips follow:
1. Understand why: When doing practice exam questions, rather than jumping to the answer that ‘feels right’, try to understand why the answer might be right. For example, should you use Bigtable instead of BigQuery? In what situations might you use BigQuery? Why not use Google Cloud Storage? Look through all the options given, and try to understand where each might be applicable. If you get one wrong, try to understand why you were wrong so that you don’t make the same mistake twice.
2. Use case studies: The specialisation includes a few useful case studies: review these and try to develop an infrastructure that will work given the constraints. This will test your understanding and also be very applicable to real-life challenges you might face. Think about the trade-offs at each stage, and come up with options. You can also think about the requirements of companies that you work with, and try to come up with the most appropriate infrastructure.
3. Practice applying your knowledge: To make sure you truly understand the concepts, set yourself challenges and test whether you can complete them using Google Cloud products. Kaggle has some great examples (with notebook answers as well). Think about useful examples with public data sources: for example, can you use BigQuery ML to model the relationship between the number of COVID-19 cases and a stock price, or build a propensity-to-purchase model using Google Analytics data?
For additional tips, I’d recommend checking out Panagiotis Tzamtzis’ blog post and Vinoaj Vijeyakumaar’s SlideShare.
Note also that as of May 2020 an official Google Cloud Certified Professional Data Engineer study guide is available in print format.
7. Next Steps — More suggestions
Alternatively, you can follow up with another specialisation from the data track, such as From Data to Insights or Machine Learning with TensorFlow. Some topics, such as BigQuery or the AI Platform, are covered in more depth in these other specialisations.
Another option is to go for depth rather than breadth and to focus on the aspects of data engineering that you find most relevant to your own work, while also being aware of the full picture.
You can follow communities and individual people who share content related to your specific areas of interest within data engineering.
For example, if you are interested in BigQuery and BigQuery ML, you’ll find great content shared by Googler Lak Lakshmanan, author of the BigQuery Definitive Guide and an instructor in the specialisation, as well as Google developer advocates Felipe Hoffa (along with his BigQuery Twitter list) and Polong Lin.
More suggestions:
- Check out the GCP Slack community (there is a channel dedicated to #data-engineering).
- Sign up for the private Coursera community of professional certificate holders (you’ll receive communication from Coursera after passing all courses).
- Look for local Google developer communities and explore the training options that are exclusively offered through these groups.
- With product updates coming out on a weekly basis, it’s a good idea to follow the Google Cloud blog or other related resources to stay up to date (some aspects of the training will become obsolete over time).
In Closing
The data engineering specialisation is a great option for anyone wanting to learn how to design and develop data pipelines on Google Cloud.
Whether you want to become a data engineer or you just want to gain a better understanding of this exciting field, my recommendation is to begin by sampling content from one of the courses to get a feel for what it looks like in practice. Check out the detailed syllabus for the remaining courses and decide if one or more, or all, of the courses in the specialisation are a good fit for you.
Big thanks to experienced GCP developers Suraj Pabari and Panagiotis Tzamtzis for their contributions to the guide and their feedback.
Special thanks to Mark Edmondson whose open source software and articles introduced me to Google Cloud for the digital marketing domain.
Do you have more tips regarding the specialisation or the path to the Certification Exam? Please leave them in the comments section below.
Alex Papageorgiou
I’m an experienced market research consultant and ex-Googler, helping venture builders, founders, and marketers uncover new market opportunities by unlocking the full potential of consumer search data. I share my perspectives about emerging consumer needs and market trends on my blog and via Twitter and LinkedIn.