How to quickly experiment with Dataflow (Apache Beam Python)

One of my colleagues showed me this trick to quickly experiment with Cloud Dataflow/Apache Beam, and it’s already saved me a couple of hours. [Dataflow is Google’s autoscaling, serverless way of processing both batch and streaming data. It runs Apache Beam pipelines. If you haven’t used it, you should try it out.]

To try out some bit of Python Dataflow code, this is what I would do: I would create a Pipeline, read some data from a CSV file, transform it with the code I was trying out, write out the result to a text file and then look at it. Very, very sssslow process.

The cool new way takes advantage of the Python REPL (the command-line interpreter) and the fact that Python lists can function as a Dataflow source.

If necessary, install the Apache Beam package on your machine:

$pip install 'apache-beam[gcp]'

Start the Python interpreter on the command-line:

$ python

Import the Apache Beam package:

>>> import apache_beam as beam

Now, you are ready to roll. You can create a example list and pass it in to a transform:

>>> [3, 8, 12] | beam.Map(lambda x : 3*x)[9, 24, 36]

How cool is that? No pipelines, no input/output files. Just a simple list piped to the Transform code you want to try out.

Here’s an example of trying something on a key-value pair (represented as a 2-tuple in Python Dataflow):

>>> [(‘Jan’,3), (‘Jan’,8), (‘Feb’,12)] | beam.GroupByKey()[(‘Jan’, [3, 8]), (‘Feb’, [12])]

You can keep appending transforms:

>>> [(‘Jan’,3), (‘Jan’,8), (‘Feb’,12)] | beam.GroupByKey() | beam.Map(lambda (mon,days) : (mon,len(days)))[(‘Jan’, 2), (‘Feb’, 1)]

Hope this trick saves you as much time as it saved me.

Happy coding!




A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Lak Lakshmanan

Lak Lakshmanan

Operating Executive at a technology investment firm; articles are personal observations and not investment advice.

More from Medium

Working with JSON data in BigQuery

Deploying Airflow on Google Kubernetes Engine with Helm

Timeseries databases demystified

How to Automate Dataset Comparison Using Terraform And BigQuery