How to quickly experiment with Dataflow (Apache Beam Python)
One of my colleagues showed me this trick to quickly experiment with Cloud Dataflow/Apache Beam, and it’s already saved me a couple of hours. [Dataflow is Google’s autoscaling, serverless way of processing both batch and streaming data. It runs Apache Beam pipelines. If you haven’t used it, you should try it out.]
To try out a bit of Python Dataflow code, here's what I used to do: create a Pipeline, read some data from a CSV file, transform it with the code I was trying out, write the result to a text file, and then look at it (roughly like the sketch below). Very, very sssslow process.
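For reference, the old way looked something like this. This is just a sketch; the file names and the uppercasing transform are hypothetical placeholders standing in for whatever code you're actually testing:

import apache_beam as beam

# The slow way: spin up a whole pipeline with file I/O just to test one transform.
with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('input.csv')      # read the CSV line by line
     | beam.Map(lambda line: line.upper())    # the code actually being tried out
     | beam.io.WriteToText('output'))         # write out, then go inspect the files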
The cool new way takes advantage of the Python REPL (the command-line interpreter) and the fact that Python lists can function as a Dataflow source.
If necessary, install the Apache Beam package on your machine:
$ pip install 'apache-beam[gcp]'
Start the Python interpreter on the command-line:
$ python
Import the Apache Beam package:
>>> import apache_beam as beam
Now you are ready to roll. You can create an example list and pass it in to a transform:
>>> [3, 8, 12] | beam.Map(lambda x: 3*x)
[9, 24, 36]
How cool is that? No pipelines, no input/output files. Just a simple list piped to the Transform code you want to try out.
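The same trick works with other element-wise transforms too. For instance (a quick sketch, with outputs shown as the REPL would print them):

>>> [3, 8, 12] | beam.FlatMap(lambda x: [x, x+1])
[3, 4, 8, 9, 12, 13]
>>> [3, 8, 12] | beam.Filter(lambda x: x > 5)
[8, 12]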
Here’s an example of trying something on a key-value pair (represented as a 2-tuple in Python Dataflow):
>>> [('Jan', 3), ('Jan', 8), ('Feb', 12)] | beam.GroupByKey()
[('Jan', [3, 8]), ('Feb', [12])]
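Other key-value transforms can be tried the same way; for example, beam.CombinePerKey (a sketch, and the order of the output pairs may vary):

>>> [('Jan', 3), ('Jan', 8), ('Feb', 12)] | beam.CombinePerKey(sum)
[('Jan', 11), ('Feb', 12)]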
You can keep appending transforms:
>>> [('Jan', 3), ('Jan', 8), ('Feb', 12)] | beam.GroupByKey() | beam.Map(lambda kv: (kv[0], len(kv[1])))
[('Jan', 2), ('Feb', 1)]
(Python 3 lambdas can't unpack tuples the way Python 2's could, hence the kv[0]/kv[1] indexing.)
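Once the transform behaves the way you want, you can drop it into a real pipeline; beam.Create is the usual way to turn a small in-memory list into a PCollection. A minimal sketch (the final print step is just for inspection):

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([('Jan', 3), ('Jan', 8), ('Feb', 12)])  # in-memory test data
     | beam.GroupByKey()
     | beam.Map(lambda kv: (kv[0], len(kv[1])))            # (month, number of entries)
     | beam.Map(print))                                    # inspect the results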
Hope this trick saves you as much time as it saved me.
Happy coding!