No More Credential Chaos! Simplify Databricks & GCP Integration with Service Accounts

Learn how to securely integrate Databricks with GCP using Service Accounts for effortless authentication and automation

Jitendra Gupta
Google Cloud - Community
5 min read · Sep 23, 2024


Introduction: The Credential Management Dilemma

When tasked with integrating Google Cloud Platform (GCP) services like BigQuery with Databricks, you’re often faced with a critical challenge — managing credentials securely. Manually handling API keys and service credentials across multiple environments is not only cumbersome but also prone to security risks.

If you’re looking to avoid the headache of manual credential management and ensure seamless, automated access to GCP resources from Databricks, the solution lies in Google Cloud Service Accounts.

In this article, we’ll explore how to use Service Accounts to securely authenticate Databricks with GCP. By the end, you’ll be able to integrate Databricks and GCP effortlessly, avoiding credential chaos and ensuring a smooth, scalable data workflow.

Problem Definition: The Pitfalls of Manual Credential Management

Integrating cloud services usually involves:

  • API keys: Often hardcoded in scripts or environment variables, which can easily be exposed or mishandled.
  • Access tokens: Expire frequently, requiring constant renewal and monitoring.
  • Manual intervention: Manually managing keys and tokens becomes a maintenance nightmare, especially in a production environment.

This is where Service Accounts come into play. They eliminate the need for manual credential handling and allow seamless access to GCP services through automated, secure authentication.

Step-by-Step Solution: Secure Authentication Using Service Accounts

Let’s dive into how you can set up Databricks to securely authenticate with GCP using Service Accounts.

1. Create a Service Account in GCP

To get started, we need to create a service account that will allow Databricks to access GCP services like BigQuery.

  • Go to the GCP Console: Navigate to IAM & Admin > Service Accounts.
  • Create a new Service Account: Name it (e.g., databricks-bigquery-access), and give it a description.
  • Assign roles:
      ◦ BigQuery Data Viewer to read data from BigQuery.
      ◦ BigQuery Data Editor to write processed data back to BigQuery.
      ◦ Storage Admin for handling temporary data in Google Cloud Storage (GCS).
  • Save the Service Account Email: This will be needed in the Databricks cluster configuration (e.g., databricks-bigquery-access@project-id.iam.gserviceaccount.com).
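
If you prefer to script this step instead of clicking through the console, here is a minimal sketch using the IAM and Resource Manager APIs via google-api-python-client. The project ID and account ID are placeholders, and it assumes you are already authenticated (for example with application-default credentials); the console flow above is equivalent.

from googleapiclient import discovery

project_id = "your-gcp-project-id"          # placeholder
account_id = "databricks-bigquery-access"   # placeholder

# Create the service account
iam = discovery.build("iam", "v1")
sa = iam.projects().serviceAccounts().create(
    name=f"projects/{project_id}",
    body={
        "accountId": account_id,
        "serviceAccount": {"displayName": "Databricks BigQuery access"},
    },
).execute()
sa_email = sa["email"]

# Grant the three roles at the project level
crm = discovery.build("cloudresourcemanager", "v1")
policy = crm.projects().getIamPolicy(resource=project_id, body={}).execute()
for role in ["roles/bigquery.dataViewer", "roles/bigquery.dataEditor", "roles/storage.admin"]:
    policy["bindings"].append({"role": role, "members": [f"serviceAccount:{sa_email}"]})
crm.projects().setIamPolicy(resource=project_id, body={"policy": policy}).execute()

print(f"Created {sa_email} with BigQuery and Storage roles")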

2. Configure Databricks Cluster to Use the Service Account

Once the service account is created, we need to configure the Databricks cluster to use it for all interactions with GCP.

  • Open Databricks Console: Navigate to Compute in your Databricks workspace and choose the cluster you’re working with.
  • Edit the Cluster Settings:
      ◦ Scroll down to Advanced Options.
      ◦ Enter the Service Account Email: Paste the service account email (e.g., databricks-bigquery-access@project-id.iam.gserviceaccount.com) into the Google Service Account field.
  • Save and Restart the cluster to apply the configuration.
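
If you manage clusters as code, the same setting can be applied through the Databricks Clusters REST API instead of the UI. The sketch below assumes a Databricks on Google Cloud workspace; the host, token, runtime version, and node type are placeholders, and the important piece is the gcp_attributes.google_service_account field.

import requests

DATABRICKS_HOST = "https://<your-workspace>.gcp.databricks.com"   # placeholder
DATABRICKS_TOKEN = "<personal-access-token>"                      # placeholder

cluster_spec = {
    "cluster_name": "bigquery-integration",
    "spark_version": "<runtime-version>",   # pick a current runtime
    "node_type_id": "<node-type>",          # pick a machine type
    "num_workers": 2,
    "gcp_attributes": {
        # Service account created in step 1
        "google_service_account": "databricks-bigquery-access@project-id.iam.gserviceaccount.com"
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])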

3. Set Up Google Cloud Storage (GCS) for Temporary Data Storage

When writing large datasets back to BigQuery, Databricks will need to store intermediate data in GCS. Let’s configure that:

  • Create a GCS Bucket: Go to Google Cloud Storage > Create Bucket.
      ◦ Name the bucket (e.g., databricks-temp-bucket).
      ◦ Ensure the service account has Storage Admin permissions on this bucket.
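
Here is a minimal sketch of the same step with the google-cloud-storage client; the bucket name, location, and service account email are placeholders. Granting Storage Admin on just this bucket (rather than project-wide) keeps the permissions a little tighter.

from google.cloud import storage

bucket_name = "databricks-temp-bucket"   # placeholder
sa_email = "databricks-bigquery-access@project-id.iam.gserviceaccount.com"

client = storage.Client(project="your-gcp-project-id")

# Create the bucket that will hold temporary/staging data
bucket = client.create_bucket(bucket_name, location="US")

# Grant the service account Storage Admin on this bucket only
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({"role": "roles/storage.admin", "members": {f"serviceAccount:{sa_email}"}})
bucket.set_iam_policy(policy)

print(f"Bucket gs://{bucket_name} is ready for {sa_email}")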

4. Read Data from BigQuery in Databricks

With the service account configured and GCS set up, you can now pull data from BigQuery into Databricks.

Here’s a simple PySpark snippet to load the data:

from pyspark.sql import SparkSession

# Define the project ID and BigQuery table
project_id = "your-gcp-project-id"
table = "your-dataset.your-table"

# Create a Spark session
spark = SparkSession.builder.appName("BigQueryIntegration").getOrCreate()

# Read data from BigQuery (authenticated via the cluster's service account)
df_bq = spark.read.format("bigquery") \
    .option("table", table) \
    .option("project", project_id) \
    .load()

# Show the data
df_bq.show()

This code uses the service account to securely access data from BigQuery and load it into Databricks for further processing.
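
As a quick illustration of the "further processing" mentioned above, here is a hypothetical transformation; the column names (customer_id, amount) are placeholders and not from any real table. The write example in the next step uses df_bq for simplicity, but in a real pipeline you would write a processed DataFrame like this one.

from pyspark.sql import functions as F

# Hypothetical processing step: aggregate positive transactions per customer
# (column names are placeholders for illustration only)
df_processed = (
    df_bq.filter(F.col("amount") > 0)
         .groupBy("customer_id")
         .agg(F.sum("amount").alias("total_amount"),
              F.count("*").alias("txn_count"))
)
df_processed.show()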

5. Write Processed Data Back to BigQuery

After processing your data in Databricks, writing it back to BigQuery is just as easy:

# Define GCS bucket for temporary storage and BigQuery table
gcs_bucket = "your-gcs-bucket"
destination_table = "your-dataset.processed-table"

# Write the DataFrame to BigQuery
df_bq.write.format("bigquery") \
    .mode("append") \
    .option("temporaryGcsBucket", gcs_bucket) \
    .option("table", destination_table) \
    .save()

This code ensures that your processed data is securely written back to BigQuery without any manual credential handling, using the configured service account.

Key Benefits: Why Use Service Accounts?

By using Service Accounts, you’ll enjoy several key benefits that simplify your workflow:

  1. Improved Security: No more hardcoding credentials in scripts or environment variables. The service account handles authentication automatically, minimizing exposure to security risks.
  2. Automation: Once set up, service accounts handle all authentication, reducing the need for manual intervention. This is especially useful when scaling your infrastructure.
  3. Ease of Maintenance: You don’t need to worry about refreshing tokens or managing multiple API keys. Service accounts simplify access control across your GCP services.
  4. Scalability: As your workloads increase, service accounts ensure that you can securely manage access to GCP resources without added complexity.

Compared to traditional methods of managing API keys or access tokens, service accounts offer a secure, scalable, and automated solution that can adapt to both small and large environments.

Practical Example: Using Service Accounts for a Real-World Use Case

Let’s say you have a dataset in BigQuery that contains millions of rows of customer transactions. You want to analyze this data in Databricks using PySpark, process it, and then write the transformed data back into a different BigQuery table for further reporting.

Using Service Accounts, this entire workflow becomes streamlined:

  • Authentication: Managed automatically between Databricks and GCP, ensuring secure access to BigQuery.
  • Data Transfer: Large datasets are seamlessly moved between GCP and Databricks without the hassle of managing temporary credentials or tokens.
  • Scalability: The solution scales automatically with your environment, allowing you to handle larger workloads as they come in.

This method not only simplifies cloud integration but also makes your data pipeline more secure and easier to maintain in the long run.
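
Put together, a hedged end-to-end sketch for this scenario might look like the following; the table names, GCS bucket, and column names are placeholders, not values from a real project.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("CustomerTransactions").getOrCreate()

# Read raw transactions from BigQuery (authentication handled by the cluster's service account)
transactions = (
    spark.read.format("bigquery")
    .option("table", "analytics.customer_transactions")   # placeholder
    .option("project", "your-gcp-project-id")
    .load()
)

# Transform: daily revenue per customer (placeholder column names)
daily_revenue = (
    transactions
    .groupBy("customer_id", F.to_date("transaction_ts").alias("txn_date"))
    .agg(F.sum("amount").alias("revenue"))
)

# Write the result to a reporting table, staging through the GCS bucket from step 3
(
    daily_revenue.write.format("bigquery")
    .mode("overwrite")
    .option("temporaryGcsBucket", "databricks-temp-bucket")
    .option("table", "analytics.daily_customer_revenue")   # placeholder
    .save()
)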

Conclusion: Simplify, Secure, and Scale Your Databricks-GCP Integration

If, like me, you’ve been struggling with managing credentials between Databricks and GCP, Service Accounts provide a game-changing solution. By automating the authentication process, you reduce the risk of security vulnerabilities, improve your workflow efficiency, and ensure that your cloud infrastructure is easy to scale as your data needs grow.

Whether you’re pulling data from BigQuery, processing it in Databricks, or writing results back to GCP, service accounts offer a seamless, secure way to integrate these platforms.

Have you implemented this solution in your environment? Share your experiences in the comments, and let’s discuss how you’ve simplified your cloud workflows!

Don’t forget to clap and follow for more articles on cloud integrations and best practices!

About me — I am a Multi-Cloud Enterprise Architect with over 12 years of experience in the IT industry and a multi-cloud certified professional. Over the past few months, I have completed 20+ cloud certifications (10x GCP).

My current engagements involve helping customers migrate their workloads from on-prem datacenters and other cloud providers to Google Cloud.

If you have any questions, you can reach me on LinkedIn or Twitter (@jitu028) via DM; I’ll be happy to help!

You can also schedule a 1:1 discussion with me at https://www.topmate.io/jitu028 for any cloud integration-related support.
