Python Development Environments for Apache Beam on Google Cloud Platform

These instructions show you how to set up a development environment for Python Dataflow jobs. By the end, you’ll be able to run a Dataflow job locally in debug mode and execute code in a REPL to speed up your development cycles.

Download and install PyCharm

PyCharm is an Integrated Development Environment (IDE) for Python. It has a free, open-source Community edition, and its features include Git integration and a capable debugger.

  1. Download the Community edition of PyCharm and follow the installation instructions.

Check out example code

  1. Open PyCharm.
  2. Click “Create project from version control”.
  3. If you’ve already opened PyCharm before, choose VCS -> Checkout from Version Control from the menu bar instead.
  4. In the Git Repository URL field, paste a link to the Google Cloud Professional Services GitHub repo.
  5. Click Clone. This creates a new project by checking out the Google Cloud Professional Services repo.
  6. PyCharm will ask whether you’d like to open the professional-services directory. Choose OK.

Note: PyCharm may prompt you to add the “vcs.xml” file to Git. Choose “Don’t ask me again” and No. This file contains your PyCharm preferences. IDE-specific files are generally not checked into version control, because other developers may use different IDEs.

Set up dependencies

The example code requires the Dataflow Python library in order to run. These steps install the required libraries in PyCharm. This lets PyCharm catch errors in calls to the libraries, and lets you run the scripts locally in the IDE in debug mode.

  1. Choose PyCharm -> Preferences
  2. Choose Project: professional-services -> Project Interpreter
  3. Click the Gear Icon -> Add
  4. Choose “New Environment” -> Click OK
  5. Click the + sign
  6. Type google-cloud-dataflow
  7. Click Install Package
  8. Click OK

Set up run and debug configurations

Setting up the run and debug configuration lets you test and debug your pipeline locally. This is typically quicker than running in the Dataflow service, because you don’t need to wait for a virtual machine to spin up before your code runs.

  1. Navigate to data-analytics/dataflow_python_examples/data_ingestion.py
  2. Right click on data_ingestion.py and choose Create data_ingestion…
  3. Choose parameters

Paste the following into the Parameters field, replacing $PROJECT with your project ID (or set it as an environment variable):

--project=$PROJECT
--runner=DirectRunner
--staging_location=gs://$PROJECT/test
--temp_location=gs://$PROJECT/test
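If you would rather generate the parameter string than edit it by hand, the substitution can be sketched in Python. This is only an illustration: `build_parameters` and the fallback ID `my-project-id` are hypothetical names, and it assumes your project ID is stored in an environment variable called PROJECT.

```python
import os
from string import Template

# The four parameters from the run configuration, kept as templates.
PARAMETER_TEMPLATES = [
    "--project=$PROJECT",
    "--runner=DirectRunner",
    "--staging_location=gs://$PROJECT/test",
    "--temp_location=gs://$PROJECT/test",
]


def build_parameters(project_id):
    """Return the parameter list with $PROJECT replaced by project_id."""
    return [Template(t).substitute(PROJECT=project_id) for t in PARAMETER_TEMPLATES]


if __name__ == "__main__":
    # Falls back to a placeholder ID if the PROJECT variable is not set.
    project = os.environ.get("PROJECT", "my-project-id")
    print(" ".join(build_parameters(project)))
```

The printed line can be pasted directly into the Parameters field.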

Set up your project, bucket and other GCP dependencies

  1. Make sure $PROJECT in the parameters has been replaced with your project ID.
  2. Click OK to save the run configuration.
  3. Use gsutil on the command line to create a bucket. You can use your project ID as the name of the bucket.

Replace $PROJECT with your project ID (or set it as an environment variable):

gsutil mb -c regional -l us-central1 gs://$PROJECT

4. Use the gsutil command to copy files to the GCS bucket you just created

gsutil cp gs://python-dataflow-example/data_files/usa_names.csv gs://$PROJECT/data_files/
gsutil cp gs://python-dataflow-example/data_files/head_usa_names.csv gs://$PROJECT/data_files/

5. Create the BigQuery Dataset by running this at the command line

bq mk lake
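If you expect to repeat this setup for several projects, the same commands could be collected in a small Python helper. This is a rough sketch, not part of the example repo: `setup_commands` and `run_setup` are hypothetical names, and the commands only execute if gsutil and bq are installed and you pass `dry_run=False`.

```python
import shlex
import subprocess


def setup_commands(project_id):
    """Return the gsutil and bq commands from the steps above as argument lists."""
    source = "gs://python-dataflow-example/data_files"
    bucket = f"gs://{project_id}"
    return [
        ["gsutil", "mb", "-c", "regional", "-l", "us-central1", bucket],
        ["gsutil", "cp", f"{source}/usa_names.csv", f"{bucket}/data_files/"],
        ["gsutil", "cp", f"{source}/head_usa_names.csv", f"{bucket}/data_files/"],
        ["bq", "mk", "lake"],
    ]


def run_setup(project_id, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for command in setup_commands(project_id):
        print(shlex.join(command))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(command, check=True)
```

Calling `run_setup("your-project-id")` with the default `dry_run=True` only prints the commands, which is a safe way to review them first.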

Set a breakpoint and run debugger

The breakpoint will pause the execution of the program allowing you to inspect variables or run code snippets in a REPL.

  1. Set a breakpoint at line 59 of data_ingestion.py by clicking in the margin to the left of the code.
  2. Start the debugger by clicking the debug button next to the data_ingestion run configuration.

Launch evaluate expression window

Evaluate Expression is a tool within PyCharm that lets you execute code snippets while debugging your Python script. You can inspect variables, set variables, and run almost any Python code in this window.

  1. Choose Run -> Evaluate Expression
  2. Evaluate Expression will launch a REPL, allowing you to run any Python code in the middle of your running program.

Play with the debugger — print strings

Any print statements executed in the Evaluate Expression window will be printed to the console in PyCharm. Try printing some variables and see them show up in the output.

  1. Type print(string_input) in the code fragment window
  2. Click Evaluate
  3. Navigate to the console to see your string printed

Play with the debugger — show variables automatically

PyCharm displays the value of the last expression entered in the Evaluate Expression window as the result. This can be even more convenient than printing the variable. You can also explore structures such as nested dictionaries as trees in this window.

  1. Type string_input
  2. Click Evaluate (or press Ctrl + Enter)
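To get a feel for what you might type into the window while paused, here is a small stand-alone sketch. Only the variable name `string_input` comes from the steps above. The `parse_line` function, the column names, and the sample line are made up for illustration and may not match the real script or data.

```python
import csv

# Hypothetical column names for this sketch; check them against the real file.
FIELD_NAMES = ("state", "gender", "year", "name", "number", "created_date")


def parse_line(string_input):
    """Turn one CSV line into a dictionary keyed by column name."""
    values = next(csv.reader([string_input]))
    return dict(zip(FIELD_NAMES, values))


string_input = "KS,F,1923,Dorothy,654,11/28/2016"  # made-up sample line
row = parse_line(string_input)

# A nested structure like this shows up as an expandable tree in the result pane.
nested = {"raw": string_input, "parsed": row}
```

Entering `nested` on its own as the final expression would display the dictionary as a tree, while `print(nested)` would send it to the console instead.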