How to quickly experiment with Dataflow

One of my colleagues showed me this trick to quickly experiment with Cloud Dataflow, and it’s already saved me a couple of hours. [Dataflow is Google’s autoscaling, serverless way of processing both batch and streaming data. If you haven’t used it, you should try it out.]

Until now, to try out a bit of Python Dataflow code, I would create a Pipeline, read some data from a CSV file, transform it with the code I was trying out, write the result to a text file, and then look at it. It's a very, very sssslow process.

The cool new way takes advantage of the Python REPL (the command-line interpreter) and the fact that Python lists can function as a Dataflow source.

If necessary, install the google-cloud-dataflow package on your machine (newer releases of the same SDK are published as apache-beam[gcp]):

$ pip install google-cloud-dataflow

Start the Python interpreter on the command line:

$ python

Import the Apache Beam package:

>>> import apache_beam as beam

Now you are ready to roll. You can create an example list and pass it to a transform:

>>> [3, 8, 12] | beam.Map(lambda x : 3*x)
[9, 24, 36]

How cool is that? No pipelines, no input/output files. Just a simple list piped to the Transform code you want to try out.
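
If you are wondering why a plain list is allowed on the left of the pipe: a Python list has no `|` operator of its own, so Python falls back to the `__ror__` method of the object on the right. Beam transforms define that hook, and the real one builds and runs a temporary pipeline behind the scenes. Here is a toy sketch of just the operator mechanics; ToyMap is a hypothetical class I made up for illustration, not part of Beam:

```python
class ToyMap:
    """Hypothetical stand-in for beam.Map; NOT part of Apache Beam."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, values):
        # Python calls this for `values | ToyMap(...)` because
        # list does not implement `|` itself.
        return [self.fn(v) for v in values]


tripled = [3, 8, 12] | ToyMap(lambda x: 3 * x)
print(tripled)  # [9, 24, 36]
```

Because the result is again a plain list, another transform can be piped onto it in the same way.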

Here’s an example of trying something on a key-value pair (represented as a 2-tuple in Python Dataflow):

>>> [('Jan',3), ('Jan',8), ('Feb',12)] | beam.GroupByKey()
[('Jan', [3, 8]), ('Feb', [12])]
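
When I want to double-check what a grouping should produce, I sometimes work it out in plain Python first. This is only a rough approximation of the grouping semantics, assuming small in-memory data; group_by_key is a hypothetical helper, not a Beam function, and Beam itself makes no promise about the order of keys or values:

```python
from collections import defaultdict


def group_by_key(pairs):
    """Hypothetical helper: collect the values of each key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Keys come back in first-seen order (dicts preserve insertion order).
    return list(grouped.items())


result = group_by_key([('Jan', 3), ('Jan', 8), ('Feb', 12)])
print(result)  # [('Jan', [3, 8]), ('Feb', [12])]
```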

You can keep appending transforms:

>>> [('Jan',3), ('Jan',8), ('Feb',12)] | beam.GroupByKey() | beam.Map(lambda kv : (kv[0], len(kv[1])))
[('Jan', 2), ('Feb', 1)]

Hope this trick saves you as much time as it saved me.

Happy coding!