Coding Apache Beam in your Web Browser and Running it in Cloud Dataflow

Your web browser is hiding a secret

Most developers have an IDE (Integrated Development Environment) of choice for developing Apache Beam code, which they install locally on their laptops or desktops (like PyCharm for Python or IntelliJ for Java). But what happens when you’re on a computer that doesn’t support your IDE of choice, or you’re using someone else’s computer? Google has you covered! Google’s Cloud Shell comes with a built-in Code Editor for developing and modifying code (it’s based on Eclipse Orion). It’s not as full-featured as an IDE, but it does beat using Vim or Emacs to edit code! Yes, yes, I know these are fighting words for some people, but I’m speaking to those who agree with me (aka the majority 😜).

✌️ A peace-offering tip for Vim and Emacs users: Code Editor supports your key bindings. Just change them under Settings > Editor Settings > Keys > Key Bindings.

Code Editor (shown in dark theme)

🔆 😎 🔆 If you don’t like bright light beaming at your face, switch Code Editor to the dark theme.

Just go to the Code Editor settings and change both the Editor Theme and the IDE Theme categories to Dark theme (shown in the GIF below).

Setting Code Editor to Dark Theme

So how do I use the Code Editor with Cloud Dataflow?

Glad you asked! To keep things simple, you’re not going to create an Apache Beam pipeline from scratch. Instead, you’ll use Code Editor to modify one of the open source Dataflow templates that Google provides for your convenience. Specifically, you’ll modify the Cloud Pub/Sub to BigQuery template (PubSubToBigQuery.java) to take a Pub/Sub subscription as an input instead of a Pub/Sub topic.

Summary of what you’ll do:

  • Clone the GitHub DataflowTemplates repo to your Cloud Shell $HOME directory *(see note below)
  • Modify the Cloud Pub/Sub to BigQuery template using Code Editor
  • Build the modified template using Maven and stage it in Cloud Storage
  • Run a job in Cloud Dataflow using your custom built template

*Note: Since Code Editor is a feature built into Cloud Shell, any code that you want to modify in Code Editor must first be present in Cloud Shell’s free 5 GB persistent disk storage (this is mounted as your $HOME directory). If you do not access Cloud Shell regularly, the $HOME directory persistent storage may be recycled. You will receive an email notification before this occurs. Starting a Cloud Shell session will prevent its removal.


Setting up Code Editor

1. Open the Code Editor

You can use the direct link to navigate to a page where the upper half is the Code Editor and the lower half is Cloud Shell. You’ll use the shell to execute all the commands in the following steps.

Cloud Shell / Code Editor

You can also open the Code Editor directly from any Cloud Shell instance.

Cloud Shell’s Editor Button

2. Enable the necessary APIs for this example and clone the GitHub DataflowTemplates repo into your Cloud Shell home directory

  • Click this helper link to bulk-enable all the necessary APIs: Dataflow — Compute Engine — Logging — Cloud Storage — Cloud Storage JSON — BigQuery — PubSub
  • Click this helper link to auto-clone the GitHub DataflowTemplates repo into your Cloud Shell home directory
# Set your default project
gcloud config set project [YOUR_PROJECT_ID]
# Clone DataflowTemplates repo
cd ~ && git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git

🔑 Key Point❗️ : Make sure you enable the APIs in the same project that you use in the “gcloud config set project” command.

3. Open the PubSubToBigQuery.java file in the Code Editor

You’ll edit this file in the next section.

cloudshell edit ~/DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java

Modifying the Cloud Pub/Sub to BigQuery template

1. First change the input value provider variable from topic to subscription


Replace lines 141–144:

@Description("Pub/Sub topic to read the input from")
ValueProvider<String> getInputTopic();
void setInputTopic(ValueProvider<String> value);

With:

// Modified template input parameter
@Description("Pub/Sub subscription to read the input from")
ValueProvider<String> getInputSubscription();
void setInputSubscription(ValueProvider<String> value);

2. And then change its usage in the pipeline’s first apply method


Replace lines 202–204:

.apply("ReadPubsubMessages",PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic()))

With:

.apply("ReadPubsubMessages",PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))

Building and Running your modified Cloud Pub/Sub to BigQuery template

1. Create a Cloud Storage Bucket for staging and running the Dataflow template

You’re creating a regional bucket for better performance in the data-intensive computations that are common in Dataflow pipelines.

BUCKET_NAME=gs://[YOUR_STORAGE_BUCKET]
gsutil mb -c regional -l us-central1 $BUCKET_NAME

Note: If not specified, us-central1 is the default region for Dataflow jobs. For best performance and to avoid network egress charges, keep your Dataflow jobs in the same region as the data being processed.

2. Use Cloud Shell’s built-in Maven tool to build the PubSubToBigQuery.java code, which also creates and stages a Dataflow template file in Google Cloud Storage

Note: The resulting Dataflow template file is stored in the location specified by the --templateLocation flag.

cd ~/DataflowTemplates && mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${GOOGLE_CLOUD_PROJECT} \
--stagingLocation=${BUCKET_NAME}/staging \
--tempLocation=${BUCKET_NAME}/temp \
--templateLocation=${BUCKET_NAME}/template \
--runner=DataflowRunner"
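
Once the build succeeds, it doesn’t hurt to confirm the template file actually landed in your bucket before moving on (gsutil comes preinstalled in Cloud Shell):

```shell
# List the staged template; you should see a single file at the
# --templateLocation path
gsutil ls ${BUCKET_NAME}/template
```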

3. Create a Pub/Sub subscription and BigQuery dataset to use with your newly modified Dataflow template

gcloud pubsub topics create test_topic && \
SUBSCRIPTION=$(gcloud pubsub subscriptions create test_subscription --topic=test_topic --format="value(name)") && \
bq mk pubsub_to_bigquery_dataset
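
Rows will only show up in BigQuery once messages arrive on the subscription, and the template parses each message body as a JSON object. So a quick way to exercise the pipeline after it’s running is to publish a well-formed JSON message to the topic. Here’s a sketch; the field names below are hypothetical and must match the columns of your output table:

```shell
# Hypothetical payload; the keys must match the columns of your
# BigQuery output table
MESSAGE='{"name": "test", "value": 42}'
# Validate the JSON locally before publishing
echo "$MESSAGE" | python3 -m json.tool > /dev/null && echo "valid JSON"
# Then publish it to the topic feeding your subscription:
# gcloud pubsub topics publish test_topic --message="$MESSAGE"
```

Messages that fail to parse aren’t silently dropped: the outputDeadletterTable parameter you’ll pass in the next step routes them to a deadletter table, which is handy when debugging.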

4. Lastly, run a Dataflow job using the template file that you created in step 2

You can view your streaming job on the Dataflow page of the Cloud Console.

# Job names must be unique for every run
JOB_NAME=pubsub-to-bigquery-$USER-$(date +"%Y%m%d-%H%M%S%z")
# Run the Dataflow job and store the job ID for easy cleanup later
JOB_ID=$(gcloud dataflow jobs run ${JOB_NAME} \
--gcs-location=${BUCKET_NAME}/template \
--parameters \
"inputSubscription=${SUBSCRIPTION},outputTableSpec=${GOOGLE_CLOUD_PROJECT}:pubsub_to_bigquery_dataset.pubsub_to_bigquery_output,outputDeadletterTable=${GOOGLE_CLOUD_PROJECT}:pubsub_to_bigquery_dataset.pubsub_to_bigquery_deadletter" \
--format="value(id)")
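
If you’d rather stay in the shell than open the Dataflow page, something like the following checks on the job and peeks at the output (a sketch; it assumes the job is running in the default us-central1 region and that some rows have already arrived):

```shell
# Show the job's current state and summary
gcloud dataflow jobs show $JOB_ID --region=us-central1
# Peek at the first few rows that have landed in BigQuery
bq head -n 5 pubsub_to_bigquery_dataset.pubsub_to_bigquery_output
```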

Cleaning up to avoid recurring billing!

gcloud dataflow jobs cancel $JOB_ID
gcloud pubsub subscriptions delete $SUBSCRIPTION
gcloud pubsub topics delete test_topic
bq rm -r -f -d $GOOGLE_CLOUD_PROJECT:pubsub_to_bigquery_dataset