Data Engineering on Google Cloud Platform

Elias Papachristos
Feb 8 · 3 min read
Japanese gardening — Architecture

In December 2018 I attended DevFest OnAir and got one free month of Google Cloud Specializations on Coursera. I had already finished “ML with TensorFlow on GCP”, “Advanced ML with TensorFlow on GCP” and “IIoT”, so I decided to start “Data Engineering on GCP”.

There are five courses in this specialization, and the level is Intermediate. That means there are some prerequisites that a participant should have before enrolling. The main ones, for me, are knowledge of Python, experience using SQL, familiarity with ML and a little bit of Java.

This specialization covers structured, unstructured and streaming data, and it prepares you for the Google Cloud Certified Professional Data Engineer certification exam.

The first course is about Big Data and ML Fundamentals. It covers how to use Cloud SQL and Cloud Dataproc to migrate existing MySQL and Hadoop/Spark/Hive/Pig workloads to GCP, how to use BigQuery and Cloud Datalab for interactive analysis, how Cloud SQL, Bigtable and Datastore differ and how to choose between them, how to train and use a neural network with TensorFlow, and finally how to choose between the different data processing products on GCP.

The second course, Leveraging Unstructured Data with Dataproc on GCP, is a great combination of video lectures and hands-on qwiklabs, where I learned how to create and manage compute clusters to run Hadoop and Spark jobs on GCP. In the qwiklabs I worked with Dataproc, creating and managing clusters.

If you have questions about BigQuery, then the third course has your answers. Questions like what BigQuery is, when to use it, how to use it, how to query, structure, subqueries, tables… are all answered here. At the end comes the connection with Dataflow and the autoscaling of data processing pipelines.
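To give a flavour of the subquery topic, here is a minimal sketch. It does not call BigQuery: it uses Python's built-in sqlite3 module as a local stand-in, and the table name `trips`, its columns and the function `above_average_trips` are hypothetical names chosen only for illustration. The SQL itself (a scalar subquery inside a WHERE clause) is the kind of statement that BigQuery's standard SQL also accepts.

```python
import sqlite3


def above_average_trips(rows):
    """Return the trips whose distance is above the average distance.

    `rows` is a list of (city, distance_km) tuples. The data lives in an
    in-memory SQLite database, used here only as a stand-in for a real
    BigQuery table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (city TEXT, distance_km REAL)")
    conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)

    # The inner SELECT is the subquery: it computes one value (the average),
    # which the outer query then uses as its filter threshold.
    query = """
        SELECT city, distance_km
        FROM trips
        WHERE distance_km > (SELECT AVG(distance_km) FROM trips)
        ORDER BY distance_km DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Calling `above_average_trips` with a handful of rows returns only those above the overall average, sorted from longest to shortest.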

The fourth course was not so interesting to me, because I had already completed the two ML specializations that I mentioned above. But when you hear it for the first time, it’s amazing! It’s all you need to know when you are dealing with ML: from getting started with ML, to building your ML model with TensorFlow (of course), to scaling with Cloud ML Engine and doing feature engineering, with a lot of qwiklabs and quizzes.

The final course is about Building Resilient Stream Systems on GCP. In the four previous courses, you could choose between Python and Java, but in this course you’ll have to use Java. The final course helped me understand the use cases for real-time streaming analytics. I used Pub/Sub to manage data events with asynchronous messaging, wrote streaming pipelines and ran transformations where necessary, and got familiar with both the production and the consumption side of the streaming pipeline.
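The idea of asynchronous messaging with a producing side and a consuming side can be sketched in a few lines. This is only an in-memory illustration built on Python's standard `queue` and `threading` modules, not the Pub/Sub client library and not the Java code used in the course. The names `publisher`, `subscriber` and `run_pipeline`, and the upper-casing "transformation", are assumptions made for the example.

```python
import queue
import threading


def publisher(topic, events):
    """Production side: push events onto the topic without waiting for readers."""
    for event in events:
        topic.put(event)
    topic.put(None)  # sentinel telling the subscriber that the stream has ended


def subscriber(topic, results):
    """Consumption side: pull events as they arrive and transform each one."""
    while True:
        event = topic.get()
        if event is None:
            break
        results.append(event.upper())  # stand-in for a real transformation


def run_pipeline(events):
    """Run publisher and subscriber concurrently, decoupled by a queue."""
    topic = queue.Queue()  # plays the role of a topic/subscription pair
    results = []
    consumer = threading.Thread(target=subscriber, args=(topic, results))
    producer = threading.Thread(target=publisher, args=(topic, events))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return results
```

The point of the design is that the publisher never calls the subscriber directly. Each side only knows about the queue, which is what lets the two sides run and scale independently in a real messaging service.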

Conclusion

It is another great specialization from Google Cloud on Coursera. If you have the prerequisites that I mentioned above, or if you are planning to take the exam, then go for it.

Have fun!

Written by Elias Papachristos

Full-Time Family Man, Ex-Military Helicopter Pilot, Kendo Practitioner, Developer, Lead GDG Cloud Greece, Beta Tester Coursera(Volunteer), AI Enthusiast!
