Setting up a Java Development Environment for Apache Beam on Google Cloud Platform

Introduction

These instructions will show you how to get a development environment up and running to start developing Java Dataflow jobs. By the end you’ll be able to run a Dataflow job locally in debug mode, execute code in a REPL to speed your development cycles, and submit your job to Google Cloud Dataflow.

On the go and feeling the need for Beam? Check out Coding Apache Beam in your Web Browser and Running it in Cloud Dataflow.

More comfortable in Python? Check out Python Development Environments for Apache Beam on Google Cloud Platform.

What tooling will stand up your Java Development Environment?

You will use the following components:

  • IntelliJ: Java IDE from JetBrains.
  • Maven: project management and comprehension tool from the Apache Software Foundation.
  • Google Cloud SDK: command-line tools for managing your Google Cloud Platform project and submitting your Apache Beam pipelines as Dataflow jobs.

Looking for more? Check out the official documentation for IntelliJ, Maven, and Google Dataflow.

What you’ll need

  • A recent version of Chrome.
  • The Java 8 JDK
  • Basic knowledge of Java
  • Your favorite CLI (e.g., Terminal on Mac)
  • An existing Google Cloud Project

Installing the Appropriate Software

Development Tools for your environment.

Follow these installation guides (in this order):

  1. Java 8 JDK

Make sure to set the JAVA_HOME environment variable.

If you are on a Mac, add this to your bash profile. You can run these commands in Terminal.

echo "export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_xxx.jdk/Contents/Home/" >> ~/.bash_profile
source ~/.bash_profile
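Once JAVA_HOME is set, you can sanity-check which JDK your shell actually resolves with a short Java program (a minimal sketch; `java.version` and `java.home` are standard JDK system properties):

```java
// VersionCheck.java — print the JDK version and home directory to
// confirm the Java 8 installation is the one on your PATH.
public class VersionCheck {
    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
    }
}
```

Compile and run it with `javac VersionCheck.java && java VersionCheck`; on a Java 8 JDK the reported version starts with 1.8.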

  2. Google Cloud SDK

Visit the Cloud SDK Downloads Page and extract the google-cloud-sdk to your /Applications directory.

/Applications/google-cloud-sdk/install.sh
export PATH=/Applications/google-cloud-sdk/bin:$PATH

Open a new Terminal, run this command, and enter the necessary information in the prompts that follow.

gcloud init

Ensure your version of the SDK is up to date.

gcloud components update
  3. IntelliJ (Community Edition)

You’ll use IntelliJ in this setup guide for its increasing popularity and intuitive debugging features.

  4. Apache Maven

Maven is familiar to many Java developers as a helpful tool for testing, packaging, and compiling Java projects.

unzip apache-maven-3.5.3-bin.zip -d /opt/
echo "export PATH=/opt/apache-maven-3.5.3/bin:\$PATH" >> ~/.bash_profile
source ~/.bash_profile

Check that the Maven installation was successful.

mvn -v

You should see an output like:

Apache Maven 3.5.3 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T08:58:13+01:00)
Maven home: /opt/apache-maven-3.5.3
Java version: 1.8.0_45, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.8.5", arch: "x86_64", family: "mac"

Enable the BigQuery, Dataflow, and Google Cloud Storage APIs if they are not already enabled in the API manager. This will take a few minutes.

To do this manually in the console, open the left-side menu, go to APIs & Services -> Dashboard, and click ENABLE APIS AND SERVICES.

Google Cloud Project Setup

Open the Cloud Shell from your Google Cloud Platform Console page:

Set a variable equal to your project id:

export PROJECT=$(gcloud config get-value core/project)

Create Cloud Storage Bucket

Use the make bucket (mb) command to create a new regional bucket in us-central1 within your project:

gsutil mb -c regional -l us-central1 gs://$PROJECT
export BUCKET=$PROJECT

Create the BigQuery Dataset

Create a dataset in BigQuery. This is where all of your tables will be loaded within BigQuery.

bq mk DataflowJavaSetup

Configuring IntelliJ

Cloning the Dataflow Templates Repo

For this walkthrough, you’ll be using the source code from the Dataflow Templates repo:

  1. Open IntelliJ.
  2. Choose Check out from Version Control.
  3. Choose Git.
  4. In the URL paste this repo link: https://github.com/GoogleCloudPlatform/DataflowTemplates.git
  5. Click Clone.
  6. Click Yes.

Setting up the Maven Project

  1. Choose Maven.
  2. Click Next.
  3. Check “Search for projects recursively”.
  4. Check “Import Maven projects automatically”.
  5. Click Next through the next three screens (the defaults will serve your purposes), then click Finish.

Finding the Pipeline Source Code

  1. Traverse the file system tree to the pipeline source file:
  2. DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/TextIOToBigQuery.java

Running the Dataflow Pipeline

Run the Apache Beam Pipeline Locally

First you’ll do a bit of environment setup, then you’ll run a Dataflow template that reads from GCS, applies a JavaScript UDF, and writes to BigQuery.

Setup

  1. Open the terminal.

Create a BigQuery dataset for this example.

bq mk java_quickstart

Compile the Maven project.

mvn clean && mvn compile

Create a Run/Debug configuration for the class that defines this Apache Beam pipeline

This configuration defines for IntelliJ how to run your code locally. This will be essential for early stages of testing and debugging your code without waiting (and paying) for Dataflow to spin up workers.

  1. In the drop down in the upper right hand corner select Edit configurations.
  2. Click the + button and choose Application.
  3. Give your configuration a name. Typically this is the same name as the class the configuration runs but might be more descriptive if you have separate configurations for different arguments.
  4. Paste your program arguments in the designated text box. Let’s use the TestDataflowRunner for now. (See example Arguments below this section)

  5. In the Working directory drop-down, select %MAVEN_REPOSITORY%.
  6. Click Apply and then OK.

--project=sandbox
--stagingLocation=gs://sandbox/staging
--tempLocation=gs://sandbox/temp
--templateLocation=gs://sandbox/templates/GcsToBigQueryTemplate.json
--runner=TestDataflowRunner
--javascriptTextTransformGcsPath=gs://sandbox/resources/UDF/sample_UDF.js
--JSONPath=gs://sandbox/resources/schemas/sample_schema.json
--javascriptTextTransformFunctionName=transform
--inputFilePattern=gs://sandbox/data/data.txt
--outputTable=sandbox:java_quickstart.colorful_coffee_people
--bigQueryLoadingTemporaryDirectory=gs://sandbox/bq_load_temp/
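Each of these arguments follows the --name=value convention that Beam reads at startup via PipelineOptionsFactory. As a simplified illustration of that convention (not Beam's actual implementation, which also validates and types each option), the parsing can be sketched in plain Java:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of how --name=value pipeline arguments are parsed.
// Beam's PipelineOptionsFactory does this (plus validation and typing);
// this stand-alone version just splits each flag into a key/value pair.
public class ArgsSketch {
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq < 0) {
                throw new IllegalArgumentException("Expected --name=value, got: " + arg);
            }
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse(new String[] {
            "--project=sandbox",
            "--runner=TestDataflowRunner"
        });
        System.out.println(opts.get("runner")); // prints TestDataflowRunner
    }
}
```

Swapping TestDataflowRunner for DataflowRunner later is what moves execution from local testing to Google Cloud Dataflow.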

Set a break point and run the debugger

The breakpoint will pause the execution of the program allowing you to inspect variables or run code snippets in a REPL.

  1. Put a breakpoint at line 93 of TextIOToBigQuery.java, by clicking to the left of the code on the margin. This area is sometimes referred to as the left gutter.
  2. Right Click > “Debug TextIOToBigQuery”
  3. The pipeline will pause execution when it reaches this line.
  4. Inspect the variables at this point.
  5. Right Click and choose “Evaluate expression” to run Java code snippets (with all the syntax highlighting and type-ahead suggestions you’d expect from IntelliJ) that help you diagnose bugs and explore the execution state at this breakpoint.

For more details, check out the IntelliJ debugger introduction and debugger documentation!

Run the Apache Beam Pipeline on the Google Cloud Dataflow Runner

To run this pipeline on the Google Cloud Dataflow workers, you need to stage a few data files in GCS.

This template requires a JSON file with a BigQuery schema and a JavaScript UDF to transform the text. Use the examples from the documentation.

Create the following two files:

sample_schema.json:

{
  "BigQuery Schema": [
    {
      "name": "location",
      "type": "STRING"
    },
    {
      "name": "name",
      "type": "STRING"
    },
    {
      "name": "age",
      "type": "STRING"
    },
    {
      "name": "color",
      "type": "STRING"
    },
    {
      "name": "coffee",
      "type": "STRING"
    }
  ]
}

sample_UDF.js:

function transform(line) {
  var values = line.split(',');
  var obj = new Object();
  obj.location = values[0];
  obj.name = values[1];
  obj.age = values[2];
  obj.color = values[3];
  obj.coffee = values[4];
  var jsonString = JSON.stringify(obj);
  return jsonString;
}
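If it helps to reason about the UDF in Java terms, the same line-to-JSON logic can be sketched in plain Java (illustration only; the template itself executes the JavaScript version, and this sketch builds the JSON string by hand to stay dependency-free):

```java
// Plain-Java sketch of the sample_UDF.js logic: split a CSV line into
// five fields and emit a JSON string with the same keys the BigQuery
// schema above expects.
public class TransformSketch {
    static String transform(String line) {
        String[] values = line.split(",");
        return String.format(
            "{\"location\":\"%s\",\"name\":\"%s\",\"age\":\"%s\",\"color\":\"%s\",\"coffee\":\"%s\"}",
            values[0], values[1], values[2], values[3], values[4]);
    }

    public static void main(String[] args) {
        System.out.println(transform("CAN,jan,33,red,cortado"));
        // prints {"location":"CAN","name":"jan","age":"33","color":"red","coffee":"cortado"}
    }
}
```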

data.txt

US,joe,18,green,latte
CAN,jan,33,red,cortado
MEX,jonah,56,yellow,cappuccino

Stage these files in Google Cloud Storage:

gsutil cp ./sample_UDF.js gs://$BUCKET/resources/UDF/
gsutil cp ./sample_schema.json gs://$BUCKET/resources/schemas/
gsutil cp ./data.txt gs://$BUCKET/data/

This Maven command creates and stages a template at the Cloud Storage location specified with templateLocation. After you create and stage a template, the staging location contains additional files that are necessary to execute your template. If you delete the staging location, template execution will fail. For more information, check out the documentation for Dataflow Templates.

mvn compile exec:java -Dexec.mainClass=com.google.cloud.teleport.templates.TextIOToBigQuery -Dexec.cleanupDaemonThreads=false -Dexec.args=" \
--project=$PROJECT \
--stagingLocation=gs://$BUCKET/staging \
--tempLocation=gs://$BUCKET/temp \
--templateLocation=gs://$BUCKET/templates/GcsToBigQueryTemplate.json \
--runner=DataflowRunner"

Finally, you’ll use the gcloud command-line tool to submit a job that runs your staged Dataflow template.

gcloud dataflow jobs run colorful-coffee-people-gcs-test-to-big-query \
--gcs-location=gs://$BUCKET/templates/GcsToBigQueryTemplate.json \
--zone=us-central1-f \
--parameters=javascriptTextTransformGcsPath=gs://$BUCKET/resources/UDF/sample_UDF.js,JSONPath=gs://$BUCKET/resources/schemas/sample_schema.json,javascriptTextTransformFunctionName=transform,inputFilePattern=gs://$BUCKET/data/data.txt,outputTable=$PROJECT:java_quickstart.colorful_coffee_people,bigQueryLoadingTemporaryDirectory=gs://$BUCKET/bq_load_temp/

Off to the Races!

Now you have a development environment set up to start creating pipelines with the Apache Beam Java SDK and submit them to be run on Google Cloud Dataflow.