How to quickly experiment with Dataflow

One of my colleagues showed me this trick to quickly experiment with Cloud Dataflow, and it’s already saved me a couple of hours. [Dataflow is Google’s autoscaling, serverless way of processing both batch and streaming data. If you haven’t used it, you should try it out.]

To try out some bit of Python Dataflow code, this is what I would do: create a Pipeline, read some data from a CSV file, transform it with the code I was trying out, write the result to a text file, and then go look at it. A very, very slow process.
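For reference, that round-trip looked roughly like this (a minimal sketch; the file names and the stand-in transform are made up):

import apache_beam as beam

# The old workflow: a full pipeline run just to test one transform.
with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('input.csv')      # read some data from a CSV file
     | beam.Map(lambda line: line.upper())    # the code being tried out
     | beam.io.WriteToText('output'))         # write it out, then inspect the file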

The cool new way takes advantage of the Python REPL (the command-line interpreter) and the fact that Python lists can function as a Dataflow source.

If necessary, install the google-cloud-dataflow package on your machine:

$ pip install google-cloud-dataflow

Start the Python interpreter on the command-line:

$ python

Import the Apache Beam package:

>>> import apache_beam as beam

Now, you are ready to roll. You can create an example list and pass it in to a transform:

>>> [3, 8, 12] | beam.Map(lambda x: 3*x)
[9, 24, 36]

How cool is that? No pipelines, no input/output files. Just a simple list piped to the Transform code you want to try out.
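The same trick works with other transforms too. For instance, here is a quick sketch with beam.FlatMap, which emits several output elements per input (the input string is made up, and output order may vary with the runner):

>>> ['the quick brown fox'] | beam.FlatMap(lambda line: line.split())
['the', 'quick', 'brown', 'fox']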

Here’s an example of trying something on a key-value pair (represented as a 2-tuple in Python Dataflow):

>>> [('Jan', 3), ('Jan', 8), ('Feb', 12)] | beam.GroupByKey()
[('Jan', [3, 8]), ('Feb', [12])]

You can keep appending transforms:

>>> [('Jan', 3), ('Jan', 8), ('Feb', 12)] | beam.GroupByKey() | beam.Map(lambda kv: (kv[0], len(kv[1])))
[('Jan', 2), ('Feb', 1)]
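Once a chain behaves the way you want, it drops straight into a real pipeline by swapping the test list for beam.Create. A minimal sketch, assuming the default direct runner and a made-up output path:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([('Jan', 3), ('Jan', 8), ('Feb', 12)])  # the test list becomes a real source
     | beam.GroupByKey()
     | beam.Map(lambda kv: (kv[0], len(kv[1])))
     | beam.io.WriteToText('monthcounts'))                 # hypothetical output prefix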

Hope this trick saves you as much time as it saved me.

Happy coding!
