While there are many advantages in moving to a cloud platform, the promise that captivates me is the idea of serverless infrastructure that automatically allocates compute power per the data being processed by a pipeline. In particular, I see this as essential for cloud-based machine learning, and even more specifically, the analysis of raster data like satellite images.
The Hadoop ecosystem sat — and still sits — at the heart of manipulating massive amounts of data. Installing and maintaining Hadoop-based clusters makes for a lot of work just from an infrastructure perspective — never mind the data analytics. Just setting up proper environments can take months, and ensuring continued success in an enterprise environment takes a managed services team — even for a modest-sized cluster.
Serverless infrastructure, which is relatively new, can effectively address many of the issues in those time- and resource-consuming processes. In short, we are starting to see something that does a bit of everything very well, something that might be called a “unicorn” in some circles. However, let’s be real: even imaginary unicorns fart…
In this article, I write about how I used Dataflow to develop a pipeline that ingests and transforms Sentinel 2 satellite images into EVI rasters.
Use Case: ML Applied to Remote Sensing
I’ve previously talked about my interest in remote sensing at PyConAU 2017. That domain can greatly benefit from serverless infrastructure, particularly if you are trying to leverage high-resolution images and/or capture features over a large geographical area.
The project I discussed at PyConAU leveraged Landsat 8 satellite images at 30m resolution.
However, for my latest side project, I chose Sentinel 2, whose 10m-resolution images can make for better remote sensing insights.
I built a custom Python library to aggregate and extract the URIs of the Sentinel 2 raster images hosted in Google Cloud Storage (GCS). This library would then feed the URIs into Dataflow, where the images are pulled from GCS and then geospatially corrected, merged, and transformed into an Enhanced Vegetation Index (EVI) raster using Rasterio, a Python library used to perform geospatial transformations on remote sensing data.
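For reference, EVI is computed per pixel from the blue, red, and near-infrared reflectance bands. Here is a minimal numpy sketch of that band math (the function name is mine; the coefficients are the standard MODIS/Landsat EVI values, not code from my library):

```python
import numpy as np

def compute_evi(nir, red, blue):
    """Compute the Enhanced Vegetation Index from reflectance bands.

    Uses the standard coefficients: gain G=2.5, aerosol resistance
    terms C1=6 and C2=7.5, and canopy background adjustment L=1.
    """
    nir, red, blue = (np.asarray(b, dtype="float64") for b in (nir, red, blue))
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

# Example: a single healthy-vegetation pixel (high NIR, low red reflectance)
evi = compute_evi(nir=[0.5], red=[0.1], blue=[0.05])
```

The same function applies unchanged to full 2-D band arrays read with Rasterio, since numpy broadcasts the arithmetic element-wise.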
The aim of this Sentinel 2 based project is to reproduce that Landsat-based project at a higher resolution, which requires a scalable application like Dataflow.
Dataproc is generally really good, and my previous criticism of it was the lack of autoscaling. However, autoscaling on Dataproc is now in alpha. That said, I wouldn’t have been able to use Rasterio in conjunction with Dataproc easily. With image analysis in mind, Dataflow was a good option.
Also, if you’re going to do unbounded data processing (see more on what that is below), using the `streaming` and `autoscaling_algorithm` arguments for Dataflow can help reduce cost. Why pay for 200 nodes all the time if you only need 20 for a short period of time?
As an alternative to Dataflow, I could have used GCP Cloud Functions or written an interesting Terraform script to achieve my goal. However, Cloud Functions has substantial limitations that make it better suited to smaller tasks, and Terraform requires a more hands-on approach.
The Python SDK for Dataflow lacks feature parity with, and is less stable than, the Java SDK. This means it’s a good idea to check that the Python SDK suits your requirements and works as expected before going to production with Dataflow for anything mission critical.
That said, I had been looking for a good excuse to truly play around with Dataflow. In particular, I wanted to understand the limitations of using Python to develop Dataflow pipelines.
Imagine you just hired a bunch of highly competent junior data scientists — because of course you did. More than likely they can code in Python, which keeps growing in popularity because it’s easy to learn and versatile, unlike that gem MATLAB and, dare I say, R. Goodness, it’s even useful for data engineers and software developers, which, to be honest, are probably the shoes your young data scientists will start to fill anyway. Also, more specific to Dataflow (and Spark, for that matter), the Python-based implementations of these frameworks require fewer lines of code.
Getting to the point, my preference comes down to ease of use and getting more teams across a powerful cloud technology.
Handily, the Python SDK for Dataflow can leverage Python libraries along with existing custom code already built in Python. It is possible to package your Dataflow deployment in such a way that it leverages custom code as user-defined functions (UDFs). Also, you can quickly and easily prototype Dataflow pipelines using Python lists, which have rather recently been added as an acceptable input type.
Ultimately, you’ll have to take a hard look at your use case and requirements. If what you’re building is mission critical, requires connectors to third-party applications such as Kafka, or necessitates Python 3 (experimental in Dataflow), you may want to reconsider your Python choice. Hopefully, these limitations will change in the near future, but for now, the Python SDK is a useful tool for rapid prototyping and experiments, particularly for ML applications.
Unbounded Data Approach
There are really two options in Dataflow — batch processing and unbounded data processing (AKA stream processing). The semantics of “unbounded data” processing along with why one would take that approach even when they could easily accomplish the task via batch processing is well laid out in Tyler Akidau’s The world beyond batch: Streaming 101 — you should read it.
I chose the unbounded data processing route because doing so was most valuable to expanding my own thought process.
With that in mind, I did a few key things. I ensured I was using the `streaming` parameter.
```python
# Arguments used by Dataflow
# (the snippet was truncated in the original; values below are
# illustrative placeholders, not my actual project and bucket names)
argv = [
    '--project=my-gcp-project',
    '--job_name=sentinel2-evi',
    '--staging_location=gs://my-bucket/staging',
    '--temp_location=gs://my-bucket/temp',
    '--runner=DataflowRunner',
    '--streaming',
    '--autoscaling_algorithm=THROUGHPUT_BASED',
    '--max_num_workers=20',
]
```
I built my Dataflow package so it reads from one PubSub Topic and writes to another.
```python
# Reading from and writing to a PubSub topic in Dataflow
import apache_beam as beam

p = beam.Pipeline(argv=argv)
(p
 | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
        topic=None,
        subscription=full_subscription_name)  # subscription name truncated in the original; placeholder shown
 | 'Transformations' >> beam.Map(lambda line: Transform(line))
 | 'WriteStringsToPubSub-EVIReceiver' >> beam.io.WriteToPubSub(topic=full_receiver_topic_name)
)
p.run()
```
Finally, I chose a window type. Naturally, I experimented with several different options, but a session window made the most sense, as there was a definitive end to the data I was loading. Session-based windowing also helped ensure I was able to efficiently transform my data in a way that suited my use case, and it ensured that if the size of the input data grows in the future, I won’t need to rewrite my code.
```python
# Windowing in Dataflow
from apache_beam import window

p = beam.Pipeline(argv=argv)
K = beam.typehints.TypeVariable('K')
V = beam.typehints.TypeVariable('V')
(p
 | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
        topic=None,
        subscription=full_subscription_name)  # subscription name truncated in the original; placeholder shown
 | 'Encode' >> beam.Map(lambda line: line.encode("utf-8"))
 # | 'Window' >> beam.WindowInto(window.FixedWindows(5.0))
 | 'Window' >> beam.WindowInto(window.Sessions(60.0))
 | 'GroupBy' >> beam.GroupByKey().with_output_types(beam.typehints.Tuple[K, V])
)
```
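To make the session semantics concrete, here is a plain-Python sketch (independent of Beam; the function is mine) of how session windowing groups timestamped elements: a new session starts whenever the gap since the previous element exceeds the configured timeout, 60 seconds in the pipeline above:

```python
def sessionize(timestamps, gap=60.0):
    """Group event timestamps into sessions separated by more than `gap` seconds."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)   # within the gap: extend the current session
        else:
            sessions.append([t])     # gap exceeded: start a new session
    return sessions

# Three events arrive close together, then a long pause, then one more:
# two sessions result.
print(sessionize([0, 30, 55, 300]))  # [[0, 30, 55], [300]]
```

This is why sessions fit a bounded load with a definitive end: once the images stop arriving, the gap elapses and the window closes on its own.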
Limitations, aka Those “Unicorn Farts”
As with all software, or many things in life, limitations do crop up.
The Python implementation of Dataflow, specifically the streaming components, is largely in beta. You may read material suggesting, for example, that the autoscaling parameter doesn’t work. That’s not true; it does. But it’s unclear how that may change, if at all, and whether what you write today will remain backward compatible in the near or distant future.
Also, as mentioned above, several third-party connectors are not supported and Python 3 is experimental, which was not an issue for my use case but may be for yours.
Connectivity to BigQuery
While reading from PubSub was incredibly easy, I found writing to BigQuery not as straightforward. Once I figured it out, it dawned on me how easy it actually is; I wish I’d found a better guide at the time. Additionally, reading from BigQuery is its own little adventure — each row consists of column-header and content JSON key-pairs. If your schema is simple, this doesn’t pose a substantial challenge, but if your schema is lengthy or constantly fluctuating — which can happen in ML applications — you’ll need to develop UDFs to handle your dynamic and fat dataset.
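A sketch of the kind of UDF that helps here (the function name and the example columns are mine, not from my pipeline): each BigQuery row arrives as a dict of column-name/value pairs, so a normalizing function can pad missing columns and drop unexpected ones before downstream steps see the row:

```python
def normalize_row(row, schema, default=None):
    """Coerce a BigQuery-style row dict onto a fixed set of columns.

    Missing columns are filled with `default`; columns not in the
    schema are dropped, so downstream transforms see a stable shape.
    """
    return {col: row.get(col, default) for col in schema}

schema = ['scene_id', 'evi_mean', 'cloud_cover']   # hypothetical columns
row = {'scene_id': 'S2A_123', 'evi_mean': 0.41, 'extra_debug_field': True}
clean = normalize_row(row, schema)
# clean == {'scene_id': 'S2A_123', 'evi_mean': 0.41, 'cloud_cover': None}
```

Applied via `beam.Map(lambda r: normalize_row(r, schema))`, this keeps a fluctuating schema from breaking the rest of the pipeline.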
Limited Compute Engine vCPUs
You’re likely limited by the quota of Compute Engine instances that can be created in your GCP environment. During my project, I could only get 18 Compute Engine instances with 8 cores each allocated to my Dataflow job. I could have increased the quota, but going through that process wasn’t worth the effort at the time.
It’s not You, it’s Me?
The only true gotcha I ran into had nothing to do with Dataflow. It had everything to do with one of the dependencies underpinning Rasterio: the GDAL library, which supposedly allows you to read from GCS.
However, I found that GDAL uses OAuth 2.0 for API authentication instead of a service account. This, unfortunately, forced me to write a file to disk, and every time you write something to disk, Dataflow generates a new container. Generate enough containers and you quickly run out of disk space, which brings Dataflow to a grinding halt.
Rasterio doesn’t access GCP in the preferred manner through a service account, which really threw a spanner into this project. It’s best to acknowledge that “serverless” doesn’t mean “pain-free” just yet. You still have to deal with the reality of your use case along with resources like time and money.
To take this experiment further, I have two options. One, build some mechanism that allows me to authenticate via a service account, or two, not use Dataflow and gravitate to a Terraform implementation. However, both will eat up time that I’ve quickly run out of.
The ease and convenience of using Dataflow with Python impressed me. I’m totally a fan. Moreover, the idea of reaching for an unbounded data processing approach first and foremost is very appealing, despite some minor limitations. The biggest hurdle had more to do with my specific use case.
I hope Rasterio and/or GDAL move to support GCS access through service account authentication. It would be better practice, and remote sensing analysis leveraging Rasterio in Dataflow would be spectacular.
More on Dataflow
More on Serverless
You might also like CI/CD in a serverless Google Cloud world and Serverless Data Processing with AWS Step Functions — An Example, written by fellow Servianites.
More from me
Like what I wrote? Check out my last article, No Scrum, about the value of using Scrum.