Creating a Dataproc cluster: considerations, gotchas & resources

Published in

Google Cloud - Community

6 min readJul 30, 2021

Google Cloud Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. This powerful and flexible service comes with various means by which to create a cluster. This article discusses focus areas users should consider in their efforts to successfully create a reliable, reproducible and consistent cluster.

Before stepping through considerations, I would first like to provide a few pointers. Keep in mind that the Cloud Dataproc service comes with tremendous flexibility and therefore much complexity can be encountered. Take advantage of iterative test cycles, plentiful documentation, quickstarts, and the GCP Free Trial offer. A couple great features I recommend trying are APIs Explorer and UI functionality. For example, in the GCP console -> Dataproc -> CREATE CLUSTER you can configure your cluster and, for your convenience, have the ability to auto-generate the equivalent command line or equivalent REST (without having to build the cluster):

This can assist you in automating test cycles. Let’s now step through our focus areas.

Considerations:

Networking
Identity and Access Management (IAM)
Versioning
Components
Logging and Monitoring
Configuration (Security, Cluster properties, Initialization actions, Auto Zone placement)
Dataproc Quotas
Dataproc Hadoop Data Storage

Networking-

The Compute Engine Virtual Machine instances in a Dataproc cluster, consisting of master and worker VMs, must be able to communicate with each other using ICMP, TCP (all ports), and UDP (all ports). Users often implement restrictive network policies in order to adhere to organizational requirements. Be sure to cross reference your network implementation against the implementation requirements outlined here.

Gotchas-

Missing firewall rules
Missing IAM permission or roles
Missing a route to the internet (for non-default VPC network)

Resources-

Network Intelligence Center:Connectivity Tests

Identity and Access Management (IAM)-

Dataproc permissions allow users, including service accounts, to perform specific actions on Dataproc clusters, jobs, operations, and workflow templates. This focus area gets a lot of attention as users sometimes remove roles and permissions in an effort to adhere to least privilege policy. It is imperative to cross reference IAM implementation strategies against documented requirements.

Gotchas-

Missing IAM permission or roles, EX. Network User role for Shared VPC networks.
Deleted Service Accounts (SAs), EX. deleted Service Agent account.
Misunderstanding User, Control Plane and Data Plane Identities.

Resources-

Policy Troubleshooter

Versioning-

Dataproc uses images to tie together useful Google Cloud Platform connectors and Apache Spark & Apache Hadoop components into one package that can be deployed on a Dataproc cluster. Although it is recommended to specify the major.minor image version for production environments or when compatibility with specific component versions is important, users sometimes forget this guidance. Examples of how to select versions:

Gotchas-

Not explicitly setting versions resulting in conflicts with initialization actions or missing dependencies.
Varying image versions from Infrastructure as Code (IAC) resulting in slow performance of jobs.
Ensuring supportability date.

Resources-

Dataproc Image version list

Components-

When you create a cluster, standard Apache Hadoop ecosystem components are automatically installed on the cluster (see Dataproc Version List). You can install additional components, called “optional components” on the cluster when you create the cluster. The list is significant as it includes many commonly used components such as JUPYTER.

Gotchas-

Avoid Security Vulnerabilities when enabling Cluster web interfaces.

Resources-

Dataproc Component Gateway

Logging and Monitoring-

Dataproc job and cluster logs can be viewed, searched, filtered, and archived in Cloud Logging. Cloud Monitoring provides visibility into the performance, uptime, and overall health of cloud-powered applications. Robust logging is often at the heart of troubleshooting a variety of errors and performance related issues.

Gotchas-

Enabling job driver logs in Logging must be implemented when creating the cluster.
You must understand the costs associated with enabling logging for Dataproc services.
Job history can be lost on deletion of Dataproc cluster.
VM memory usage and disk usage metrics are not enabled by default. Enable.

Resources-

Dataproc Persistent History Server

Configuration (Security, Cluster properties, Initialization actions, Auto Zone placement)-

Keep in mind that I’m highlighting focus areas to be aware of that have impeded successful cluster creation. Each of these subcategories deserve careful consideration and testing.

Gotchas-

Security:Cross-realm trust Used to establish a cross-realm trust with an external KDC or Active Directory server that holds user principals.
Security:High-Availability Mode Kerberos does not natively support real-time replication or automatic failover if the master KDC is down.
Security:Network Configuration Kerberos requires reverse DNS to be properly set up. Also, for host-based service principal canonicalization, make sure reverse DNS is properly set up for the cluster’s network.
Cluster properties:Cluster vs. Job Properties The Apache Hadoop YARN, HDFS, Spark, and other file-prefixed properties are applied at the cluster level when you create a cluster. Many of these properties can also be applied to specific jobs. When applying a property to a job, the file prefix is not used.
Cluster properties:Dataproc service properties These properties can be used to further configure the functionality of your Dataproc cluster. These cluster properties are specified at cluster creation. They cannot be specified or updated after cluster creation.
Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run. Initialization actions run as the root user. You should use absolute paths in initialization actions. You can use Dataproc custom images instead of initialization actions to set up job dependencies.
We encourage use of Auto Zone placement in an effort to balance resources as well as avoiding resource contention.

Resources-

MIT Kerberos Documentation
File-prefixed properties table
Dataproc service properties table
Sample initialization action scripts: GitHub repository, or Cloud Storage — in the regional gs://goog-dataproc-initialization-actions-<REGION> buckets

Dataproc Quotas-

Dataproc has API quota limits that are enforced at the project and region level. The quotas reset every sixty seconds (one-minute).

Gotchas-

If you exceed a Dataproc quota limit, a RESOURCE_EXHAUSTED (HTTP code 429) is generated, and the corresponding Dataproc API request will fail. However, since your project’s Dataproc quota is refreshed every sixty seconds, you can retry your request after one minute has elapsed following the failure.

Resources-

Increasing Resource Quota Limits: Open the Google Cloud IAM & admin→Quotas page, and select the resources you want to modify. Then click Edit Quotas at the top of the page to start the quota increase process. If the resources you are trying to increase aren’t displayed on the page and the current filtering is “Quotas with usage,” change the filtering to “All quotas” by toggling the “Quota type” dropdown.

Dataproc Hadoop Data Storage-

Dataproc integrates with Apache Hadoop and the Hadoop Distributed File System (HDFS). Dataproc automatically installs the HDFS-compatible Cloud Storage connector, which enables the use of Cloud Storage in parallel with HDFS. Data can be moved in and out of a cluster through upload/download to HDFS or Cloud Storage.

Gotchas-

PD size and type affect performance and VM size, whether using HDFS or Cloud Storage for data storage.
VM Boot disks are deleted when the cluster is deleted.

Resources-

For PD-Standard without local SSDs, we strongly recommend provisioning 1TB or larger to ensure consistently high I/O performance.
Be certain to review performance impact when configuring disk. See https://cloud.google.com/compute/docs/disks/performance for information on disk I/O performance.
Configure your persistent disks and instances

I am hopeful this summary of focus areas helps in your understanding of the variety of issues encountered when building reliable, reproducible and consistent clusters. Thank you to the folks that helped add content and review this article.

Creating a Dataproc cluster: considerations, gotchas & resources

Considerations:

Written by Michael Reed