Best cheatsheet to answer “What is Dataproc?”

All you need to know about Google Cloud Dataproc

Managed Hadoop & Spark #GCPSketchnote

If you are using the Hadoop ecosystem and want to make it easier to manage, then Dataproc is the tool to check out.

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them. With less time and money spent on administration, you can focus on what matters the most — your DATA!

In this video, I summarize what Dataproc offers in 2 minutes.

Erin and Sam are part of a growing data science team using the Apache Hadoop ecosystem, and they are dealing with operational inefficiencies! So they are looking at Dataproc, which spins up a Hadoop cluster in about 90 seconds, making it simple, fast, and cost-effective to gain insights compared to traditional cluster management. It supports:

  • Open source tools — Hadoop, Spark ecosystem
  • Customizable virtual machines that scale up and down as needed
  • On demand ephemeral clusters to save cost
  • Tight integration with other Google Cloud services
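As a sketch of what cluster creation looks like, here is a hypothetical `gcloud` invocation. The cluster name, region, machine types, and worker count are illustrative placeholders, not values from the sketchnote:

```shell
# Create a Dataproc cluster with the gcloud CLI.
# All names and sizes below are illustrative placeholders.
gcloud dataproc clusters create demo-cluster \
  --region=us-central1 \
  --master-machine-type=n1-standard-4 \
  --num-workers=2 \
  --worker-machine-type=n1-standard-4
```

The same machine-type and worker-count flags are what let you size the cluster up or down to match your workload.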

To move your Hadoop/Spark jobs, all you do is copy your data into Google Cloud Storage, update your file paths from HDFS to GCS, and you are ready!
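The path change is usually mechanical: anything that read from `hdfs://` reads from `gs://` once the data is copied. A minimal sketch, assuming a hypothetical bucket named `my-bucket`:

```shell
# Rewrite an HDFS path to its Cloud Storage equivalent.
# "my-bucket" is a placeholder for your own GCS bucket name.
echo "spark.read.text('hdfs:///logs/app.log')" \
  | sed "s|hdfs://|gs://my-bucket|g"
# prints: spark.read.text('gs://my-bucket/logs/app.log')
```

In practice you would make the same substitution in your job scripts and configs rather than piping through sed, but the transformation is exactly this one-liner.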

Dataproc cheatsheet #GCPSketchnote

A brief explanation of how Dataproc works:

It disaggregates storage and compute. Say an external application is sending logs that you want to analyze: you store them in a data source such as Cloud Storage (GCS). Dataproc reads the data from GCS, processes it, and stores the results back into GCS, BigQuery, or Bigtable. You could also analyze the data in a notebook, and send logs to Cloud Monitoring and Logging.
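The processing step in that flow is just a job submitted to the cluster. A hedged sketch with placeholder names (`demo-cluster`, `my-bucket`, and `wordcount.py` are assumptions for illustration):

```shell
# Submit a PySpark job that reads from and writes back to Cloud Storage.
# Cluster, bucket, and script names are illustrative placeholders.
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py \
  --cluster=demo-cluster \
  --region=us-central1 \
  -- gs://my-bucket/input/ gs://my-bucket/output/
```

Because the input and output live in GCS rather than on the cluster's disks, the cluster itself holds no state you need to preserve.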

Since storage is separate from compute, you don't need one long-lived cluster: you could run one cluster per job, and to save cost you could use ephemeral clusters that are grouped and selected by labels. And finally, you can size memory, CPU, and disk to fit the needs of your application.
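One way to sketch the ephemeral-cluster pattern with the gcloud CLI (the label values and the 30-minute idle timeout are assumptions for illustration):

```shell
# Create a cluster that deletes itself after 30 idle minutes,
# labeled so related clusters can be listed and managed together.
gcloud dataproc clusters create etl-ephemeral \
  --region=us-central1 \
  --labels=team=data-science,job=etl \
  --max-idle=30m

# Select clusters by label.
gcloud dataproc clusters list \
  --region=us-central1 \
  --filter='labels.job=etl'
```

The `--max-idle` flag is what makes the cluster ephemeral: it is deleted automatically once it has been idle that long, so you stop paying for compute between jobs.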

Next steps

If you like this #GCPSketchnote, then subscribe to my YouTube channel, where I post a sketchnote on one topic every week! And if you have thoughts or ideas on other topics that you would find helpful in this format, please drop them in the comments below!

Here is the website for downloads and prints👇




A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Priyanka Vergadia

Developer Advocate @Google, Artist & Traveler! Twitter @pvergadia
