Learning Path: Google Cloud Professional Data Engineer Certification

Yusuke Enami(Kishishita)
4 min read · Aug 20, 2023


My Certification: https://google.accredible.com/96bb25a3-c1eb-459c-9cdf-9c2a8ba0498e#gs.42q7mq

In this article, I will share my journey in passing the Google Cloud Professional Data Engineer Certification exam in August 2023.

Introduction

Google Cloud offers a series of exams to evaluate and certify one’s proficiency in their cloud technology. The Professional Data Engineer (PDE) certification is one of them.

The PDE certification evaluates a candidate’s ability to:

  • Design data processing systems
  • Operationalize machine learning models
  • Ensure solution quality
  • Build and operationalize data processing systems

Scope of the Certification and My Impressions After Taking the Exam

Design data processing systems

  • Select Data Processing Architecture: Determine the appropriate architecture for ETL (Extract/Transform/Load) processing. The architecture should align with the client’s specific needs, whether it’s “Low code/No code”, “Orchestration”, a focus on “SQL knowledge”, or “Using Python”, among other preferences.
  • Identify the Nature of Data: Understand the type of data you’re working with. Is it time-series? Transactional? Your choice of database systems, whether it’s Cloud SQL, Bigtable, Firestore, Spanner, or others, should align with the data’s nature.
  • Implement a Data Warehouse: If clients want to establish a data warehouse, BigQuery is the answer.
  • Optimize for High-speed SQL Operations: For those seeking rapid SQL operations, BigQuery is the go-to solution.
  • Analyze Big Data: When the goal is to analyze vast datasets, BigQuery proves to be the ideal tool.

BigQuery consistently emerges as a powerful tool for data analysis.
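As a concrete sketch of the BigQuery pattern the exam keeps returning to, the snippet below composes a query that filters on a partition column, so BigQuery prunes partitions and scans (and bills for) only the requested date range. The project, table, and column names here are hypothetical:

```python
# Hypothetical project/table/column names; the pattern is what matters:
# a filter on the partition column limits the bytes scanned.
table = "my_project.analytics.events"  # hypothetical partitioned table

query = f"""
SELECT user_id, COUNT(*) AS n_events
FROM `{table}`
WHERE event_date BETWEEN '2023-08-01' AND '2023-08-07'  -- partition filter
GROUP BY user_id
ORDER BY n_events DESC
"""

# With google-cloud-bigquery installed and credentials configured, this
# would run as: google.cloud.bigquery.Client().query(query).result()
```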

Operationalize machine learning models

  • Dealing with Overfitting: Overfitting (sometimes called “over-learning”) is a common challenge in machine learning. To mitigate it, consider increasing the dataset size, normalizing or regularizing the training data, or reducing the number of model parameters.
  • Handling Missing Values: If clients prefer a no-code solution for handling missing values in a dataset, Dataprep is a solid option. Alternatively, the cleanup can be built into an ETL pipeline, for example with Dataflow loading the cleaned data into BigQuery.
  • Speeding Up Training Processes: If your ML model’s training process is lagging, consider using a GPU instance, especially if your model is compatible with GPU frameworks.
  • Training Frequency: Depending on the model and its application, you might need daily or weekly training operations.
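The overfitting bullet can be made concrete. Below is a minimal NumPy sketch, a stand-in for whatever regularization your ML framework provides, showing that an L2 penalty shrinks the fitted weights on a small, noisy dataset, which is one way to rein in overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny, noisy dataset: 10 samples, 8 features, a setup prone to overfitting.
X = rng.normal(size=(10, 8))
true_w = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=10)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_plain = ridge_fit(X, y, lam=0.0)  # ordinary least squares
w_ridge = ridge_fit(X, y, lam=5.0)  # L2-regularized

# The penalty shrinks the weight vector, discouraging an over-complex fit.
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```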

Ensure solution quality

  • Alerting Design: Determine which metrics should trigger your system’s alerts.
  • System Availability: Based on client requirements, select the appropriate availability level: Zonal, Regional, or Multi-Regional.
  • Processing Delays in Dataflow: Address delays by switching to a high-spec instance type or enabling the auto-scaling option to add more nodes.
  • Handling Large Datasets with Legal Considerations: When dealing with substantial datasets that require compliance with legal regulations, consider aggregating the dataset in a project. Using a structure like a Data Lake is considered a best practice in such situations.
  • Log Aggregation and Preservation: Utilize tools like Log-Sink and Log-Bucket to manage and store logs efficiently.

Build and operationalize data processing systems

  • Service Selection: Determine the appropriate database service or Cloud Storage based on the client’s requirements.
  • Data Access Control: Utilize IAM for granular access management. Options include Project-level IAM, Dataset-level IAM, and Table-level IAM.
  • Protecting PII: Safeguard Personally Identifiable Information (PII) using Cloud DLP. Choose the encryption method that aligns with system requirements.
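To illustrate the PII bullet, here is a toy, pure-Python stand-in for the kind of de-identification Cloud DLP performs at scale: masking email addresses before data leaves a trusted boundary. The regex and placeholder tag are my own simplification, not a DLP API call:

```python
import re

# Simplified email matcher. Cloud DLP ships far more robust detectors
# ("infoTypes") for emails, phone numbers, credit cards, and so on.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text: str) -> str:
    """Replace email addresses with a placeholder tag."""
    return EMAIL_RE.sub("[EMAIL]", text)

print(mask_pii("Contact alice@example.com for details."))
# Contact [EMAIL] for details.
```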

My Learning Path

For my preparation, I relied on two primary learning materials:

Google Cloud Skills Boost for Partners

I enrolled in three courses on Skills Boost.

These courses offer hands-on experience through Qwiklabs. For instance, you can:

  • Learn about data cleansing with Dataprep or Cloud Data Fusion, which demonstrates how to remove missing values from CSV-formatted datasets.
  • Develop a data flow with BigQuery and BigQuery ML that combines ETL and predictive processing.
  • Create a machine learning pipeline using Cloud Dataflow.
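As a minimal, library-free sketch of what the “remove missing values” step in those labs does, the snippet below drops CSV rows with any empty field. The column names and data are made up:

```python
import csv
import io

# Tiny CSV with missing values, standing in for a lab dataset.
raw = """id,age,city
1,34,Tokyo
2,,Osaka
3,29,
4,41,Nagoya
"""

reader = csv.DictReader(io.StringIO(raw))
# Keep only rows where every field is non-empty, the same effect as a
# "remove rows with missing values" recipe step in Dataprep.
clean = [row for row in reader if all(v.strip() for v in row.values())]

print([row["id"] for row in clean])  # ['1', '4']
```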

Udemy

While the certification exam doesn’t test practical skills directly, Udemy’s mock examinations are great for brushing up on what you’ve learned.

https://www.udemy.com/course/google-cloud-professional-data-engineer-practice-tests-2023-g/?persist_locale=&locale=en_US

Note: This is a paid course.

Given the plethora of resources on Udemy, it might be challenging to decide which material is the most relevant. I’d suggest opting for content that has been recently released. This is because Google Cloud undergoes frequent updates.

Aim for a score of 90% or higher in each mock examination for the best preparation.

Reflections After Passing the Exam

Earning this certification provided me with a valuable opportunity to delve deeper into best practices in data processing and system architecture. Prior to this, I wasn’t completely familiar with the features of services like Cloud Composer, Dataflow, and Dataproc.

Securing this certification isn’t just about enhancing your understanding of data processing within Google Cloud. Upon successful completion, you’re also rewarded with Google Cloud Certified merchandise. I wholeheartedly recommend embarking on this journey and taking the exam!
