Configure Python to run Dataflow Jobs in Cloud Shell

Arun Kumar
Cloud Techies
2 min read · May 3, 2021

Steps

  1. Update requirements.txt to specify the Python modules required to deploy the Dataflow jobs from a virtualenv in Cloud Shell.
  2. Paste the following list of modules into requirements.txt.
  • This list ensures that the correct Python modules are installed so that you can deploy the Python Dataflow jobs.
  • The list also includes the Faker modules and some dependencies that are required when you deploy and test a streaming Dataflow job (see the sketch after the module list below).
vi requirements.txt

apache-beam==2.14.0
google-api-core==1.14.2
google-apitools==0.5.28
google-auth==1.6.3
google-cloud==0.34.0
google-cloud-bigquery==1.17.0
google-cloud-bigtable==0.32.2
google-cloud-core==1.0.0
google-cloud-datastore==1.9.0
google-cloud-pubsub==0.42.1
google-cloud-storage==1.17.0
google-cloud-vision==0.38.0
httplib2==0.12.0
mock==2.0.0
numpy==1.17.0
six==1.12.0
Faker==2.0.0
faker-schema==0.1.4
Cython==0.29.13
fastavro==0.21.24
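
Once the modules are installed (steps 3–5 below), Faker and faker-schema can be used to generate fake records for a streaming test. The sketch below is only an illustration, not part of the lab: the schema fields and the record count are made-up values.

from faker_schema.faker_schema import FakerSchema

# Hypothetical schema: each key maps to a Faker provider name.
schema = {'name': 'name', 'email': 'email', 'city': 'city'}

faker = FakerSchema()

# Generate three fake records matching the schema (the count is arbitrary here).
for record in faker.generate_fake(schema, iterations=3):
    print(record)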

3. Enter the following command in Cloud Shell to create a virtualenv environment.

virtualenv -p `which python3.7` dataflow-env

4. Enter the following command in Cloud Shell to activate the virtualenv environment.

source dataflow-env/bin/activate

5. Enter the following command in Cloud Shell to install the Python modules in your virtualenv environment using the requirements.txt file.

pip install -r /home/$USER/professional-services/examples/dataflow-python-examples/requirements.txt

6. Verify the installed version of Apache Beam.

pip list

The output should show apache-beam 2.14.0, the version pinned in requirements.txt.
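
As an optional sanity check before deploying anything to Dataflow, you can run a minimal pipeline locally inside the virtualenv; by default it uses the DirectRunner. This is just a sketch to confirm the install, and the element values are arbitrary.

import apache_beam as beam

# Confirm the pinned version is importable; expect 2.14.0.
print(beam.__version__)

# A tiny local pipeline (DirectRunner by default) that upper-cases two strings.
with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create(['hello', 'dataflow'])
     | 'Upper' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))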
