Configure Python to run Dataflow Jobs in Cloud Shell
May 3, 2021
Steps
1. Update requirements.txt to specify the Python modules that are required to deploy the Dataflow jobs from a virtualenv in Cloud Shell.
2. Paste the following list of modules into requirements.txt. This list ensures that the correct Python modules are installed so you can deploy the Python Dataflow jobs. It also includes the Faker modules and some dependencies that are required when you deploy and test a streaming Dataflow job (see the illustrative sketch after the module list).
vi requirements.txt
apache-beam==2.14.0
google-api-core==1.14.2
google-apitools==0.5.28
google-auth==1.6.3
google-cloud==0.34.0
google-cloud-bigquery==1.17.0
google-cloud-bigtable==0.32.2
google-cloud-core==1.0.0
google-cloud-datastore==1.9.0
google-cloud-pubsub==0.42.1
google-cloud-storage==1.17.0
google-cloud-vision==0.38.0
httplib2==0.12.0
mock==2.0.0
numpy==1.17.0
six==1.12.0
Faker==2.0.0
faker-schema==0.1.4
Cython==0.29.13
fastavro==0.21.24
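For context, Faker and faker-schema are used to generate fake records when testing the streaming Dataflow job. The snippet below is a minimal, illustrative sketch of how fake JSON records can be produced with these modules; the schema fields here are made up for this example and are not the lab's actual schema.
# Minimal sketch: generate fake JSON records with Faker via faker-schema.
# The schema fields below are illustrative only.
import json
from faker_schema.faker_schema import FakerSchema

schema = {
    'user_id': 'uuid4',
    'name': 'name',
    'email': 'email',
    'signup_date': 'iso8601',
}

faker = FakerSchema()
for _ in range(5):
    record = faker.generate_fake(schema)
    print(json.dumps(record))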
3. Enter the following command in the Cloud Shell to create a virtualenv environment.
virtualenv -p `which python3.7` dataflow-env
4. Enter the following command in the Cloud Shell to activate the virtualenv environment.
source dataflow-env/bin/activate
5. Enter the following command in Cloud Shell to install the Python modules in your virtualenv environment using the requirements.txt file.
pip install -r /home/$USER/professional-services/examples/dataflow-python-examples/requirements.txt
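Optionally, you can confirm from Python that the key modules import cleanly inside the activated virtualenv. This is a quick sanity check rather than part of the lab steps:
# Optional sanity check: run inside the activated virtualenv to confirm
# that the key modules installed by requirements.txt import cleanly.
import apache_beam
from faker import Faker
from google.cloud import pubsub_v1

print(apache_beam.__version__)  # expected to print 2.14.0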
6. Verify the installed version of Apache Beam.
pip list
The output should show that the installed Apache Beam version is 2.14.0, matching the version pinned in requirements.txt.
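As a final check, you can run a tiny Apache Beam pipeline locally on the DirectRunner to confirm the environment works. This is an illustrative sketch, not one of the lab's Dataflow jobs:
# Minimal sketch: run a small pipeline on the local DirectRunner to confirm
# the Apache Beam installation inside the virtualenv is usable.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Create' >> beam.Create(['hello', 'dataflow'])
     | 'Uppercase' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))
To submit a pipeline to the Dataflow service instead, the pipeline options would also include --runner=DataflowRunner along with a project, region, and Cloud Storage staging/temp locations.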