Best cheatsheet to answer “What is Dataproc?”

All you need to know about Google Cloud Dataproc

Managed Hadoop & Spark #GCPSketchnote

Priyanka Vergadia
Google Cloud - Community


If you are using the Hadoop ecosystem and want to make it easier to manage, then Dataproc is the tool to check out.

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them. With less time and money spent on administration, you can focus on what matters the most — your DATA!

In this video I summarize what Dataproc offers in 2 minutes.


Erin and Sam are part of a growing data science team that uses the Apache Hadoop ecosystem and is dealing with operational inefficiencies. So they are looking at Dataproc, which installs a Hadoop cluster in 90 seconds, making it simple, fast, and cost-effective to gain insights compared to traditional cluster management. It supports:

  • Open source tools from the Hadoop and Spark ecosystem
  • Customizable virtual machines that scale up and down as needed
  • On-demand ephemeral clusters to save cost
  • Tight integration with other Google Cloud services

To move your Hadoop/Spark jobs, all you do is copy your data into Google Cloud Storage, update your file paths from HDFS (hdfs://) to GS (gs://), and you are ready!
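To make the path update concrete, here is a minimal Python sketch of rewriting an HDFS path into a Cloud Storage path. The helper name `hdfs_to_gs`, the bucket name `example-bucket`, and the idea of keeping the same directory layout inside the bucket are assumptions for illustration only; your own layout may differ.

```python
from urllib.parse import urlsplit


def hdfs_to_gs(path: str, bucket: str) -> str:
    """Rewrite an hdfs:// path to a gs:// path in the given bucket.

    Hypothetical helper: it assumes the directory layout under the
    bucket mirrors the layout that was used on HDFS.
    """
    parts = urlsplit(path)
    if parts.scheme != "hdfs":
        # Leave anything that is not an HDFS path untouched.
        return path
    # Drop the namenode host/port and keep only the file path.
    return f"gs://{bucket}/{parts.path.lstrip('/')}"


# "example-bucket" is a placeholder bucket name, not a real resource.
print(hdfs_to_gs("hdfs://namenode:8020/user/erin/logs", "example-bucket"))
# prints gs://example-bucket/user/erin/logs
```

In practice the same substitution would be applied wherever your jobs reference input and output locations.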

Dataproc cheatsheet #GCPSketchnote

A brief explanation of how Dataproc works:

It disaggregates storage and compute. Say an external application is sending logs that you want to analyze; you store them in a data source. From Cloud Storage (GCS), the data is used by Dataproc for processing, and the results are stored back into GCS, BigQuery, or Bigtable. You could also use the data for analysis in a notebook and send logs to Cloud Monitoring and Logging.

Because storage is separate from compute, you could run one long-lived cluster per job, but to save cost you could use ephemeral clusters that are grouped and selected by labels. Finally, you can choose the right amount of memory, CPU, and disk to fit the needs of your application.
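To illustrate what "grouped and selected by labels" means, here is a small Python sketch. It is a toy in-memory model, not the Dataproc API: the cluster names, label keys, and the `select_by_labels` helper are all made up for this example.

```python
# Hypothetical in-memory records; a real setup would get cluster
# metadata from Dataproc itself rather than a hard-coded list.
CLUSTERS = [
    {"name": "etl-nightly", "labels": {"env": "prod", "team": "data-eng"}},
    {"name": "adhoc-erin", "labels": {"env": "dev", "team": "data-science"}},
    {"name": "adhoc-sam", "labels": {"env": "dev", "team": "data-science"}},
]


def select_by_labels(clusters, wanted):
    """Return names of clusters whose labels include every wanted key/value pair."""
    return [
        c["name"]
        for c in clusters
        if all(c["labels"].get(k) == v for k, v in wanted.items())
    ]


print(select_by_labels(CLUSTERS, {"env": "dev", "team": "data-science"}))
# prints ['adhoc-erin', 'adhoc-sam']
```

The point of the sketch is that labels let you treat a set of short-lived clusters as one group, for example to submit work to any cluster tagged for a team or to clean up everything tagged as dev.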

Next steps

If you like this #GCPSketchnote, subscribe to my YouTube channel, where I post a sketchnote on one topic every week! And if you have thoughts or ideas for other topics that you would find helpful in this format, please drop them in the comments below!

Here is the website for downloads and prints👇


Priyanka Vergadia
Google Cloud - Community

Developer Advocate @Google, Artist & Traveler! Twitter @pvergadia