Google Cloud Dataflow with Python for Satellite Image Analysis

Byron Allen
Mar 14
Landsat 8 mosaic of Australia’s southeast coast and the tip of Tasmania created using Rasterio

While there are many cloud services and features to get excited about, the promise that captivates me is the idea of serverless infrastructure that automatically allocates compute power to match the data being processed by a pipeline. In particular, I see this as essential for cloud-based machine learning, and even more specifically, the analysis of raster data like satellite images.

The Hadoop ecosystem sat — and still sits — at the heart of manipulating massive amounts of data. Installing and maintaining Hadoop-based clusters makes for a lot of work just from an infrastructure perspective — never mind the data analytics. Just setting up proper environments can take months, and ensuring continued success in an enterprise environment takes a managed services team — even for a modest-sized cluster.

Serverless infrastructure, which is relatively new, can effectively address many of those time- and resource-consuming issues. In short, we are starting to see something that does a bit of everything very well, something that might be called a “unicorn” in some circles. However, let’s be real — even imaginary unicorns fart…

In this article, I write about how I used Dataflow to develop a pipeline that ingests and transforms Sentinel 2 satellite images into EVI (Enhanced Vegetation Index) rasters.


Use Case: ML Applied to Remote Sensing

I’ve previously talked about my interest in remote sensing at PyConAU 2017. That domain can greatly benefit from serverless infrastructure, particularly if you are trying to leverage high-resolution images and/or capture features over a large geographical area.

The project I discussed at PyConAU leveraged satellite images at 30m resolution.

However, for my latest side project, I chose to use Sentinel 2, whose 10m-resolution images can make for better remote sensing insights.

I built a custom Python library to aggregate and extract the URIs of the Sentinel 2 raster images hosted in Google Cloud Storage (GCS). This library then feeds the URIs into the Dataflow pipeline, where the images are pulled from GCS, geospatially corrected, merged, and transformed into an EVI raster using Rasterio — a Python library used to perform geospatial transformations on remote sensing data.
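
To make the EVI step concrete, here is a minimal sketch of how an EVI raster can be computed from Sentinel 2 bands with Rasterio. This is not the project’s actual transform code: the band paths, the 10,000 reflectance scaling factor, and the output handling are illustrative assumptions.

# Minimal sketch: computing EVI from Sentinel 2 bands with Rasterio (illustrative only)
import rasterio

def compute_evi(red_path, nir_path, blue_path, out_path):
    # Assumes 10m bands B04 (red), B08 (NIR) and B02 (blue) on the same grid,
    # with Level-1C digital numbers scaled by 10,000
    with rasterio.open(red_path) as red_src, \
         rasterio.open(nir_path) as nir_src, \
         rasterio.open(blue_path) as blue_src:
        red = red_src.read(1).astype('float32') / 10000.0
        nir = nir_src.read(1).astype('float32') / 10000.0
        blue = blue_src.read(1).astype('float32') / 10000.0
        profile = red_src.profile

    # Standard EVI formulation: 2.5 * (NIR - Red) / (NIR + 6*Red - 7.5*Blue + 1)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

    profile.update(dtype='float32', count=1, driver='GTiff')
    with rasterio.open(out_path, 'w', **profile) as dst:
        dst.write(evi.astype('float32'), 1)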

The aim of this Sentinel 2 based project is to reproduce that Landsat-based project at a higher resolution, which requires a scalable application like Dataflow.


Why Dataflow?

If you want to perform transformations in GCP across a large unbounded dataset, you have two key options — Dataflow, or Spark via Dataproc.

Structure of Transformations in Dataflow

Dataproc is generally really good, and my previous criticism of it was the lack of autoscaling. However, autoscaling on Dataproc has since been addressed. That said, I wouldn’t have been able to use Rasterio in conjunction with Dataproc easily. With image analysis in mind, Dataflow was a good option.

Also, if you’re going to do unbounded data processing (see more on what that is below), using the `streaming` and `autoscaling_algorithm` arguments for Dataflow can help reduce cost. Why pay for 200 nodes all the time if you only need 20 for a short period of time?
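
For reference, here is a minimal sketch of how those arguments can be passed as pipeline options; the 20-worker cap and throughput-based algorithm are illustrative values, not the settings used in this project.

# Illustrative options for a streaming, autoscaling Dataflow job
argv = [
    '--runner=DataflowRunner',
    '--streaming',                                # unbounded (streaming) mode
    '--autoscaling_algorithm=THROUGHPUT_BASED',   # scale workers with throughput
    '--max_num_workers=20',                       # cap for the autoscaler
]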

As an alternative to Dataflow, I could use GCP Cloud Functions or create an interesting Terraform script to achieve my goal. However, Cloud Functions has substantial limitations that make it better suited to smaller tasks, and Terraform requires a hands-on approach.

Why Python?

The Python SDK for Dataflow is newer and less stable than the Java SDK. This means it’s a good idea to check that the Python SDK suits your requirements and works as expected before going to production with Dataflow for anything that is mission critical.

That said, I had been looking for a good excuse to truly play around with Dataflow. In particular, I wanted to understand the limitations of using Python to develop Dataflow pipelines.

Imagine you just hired a bunch of highly competent junior data scientists — because of course you did. More than likely they can code in Python, which is increasing in popularity because it’s easy to learn and versatile, unlike that gem MATLAB and dare I say R. Goodness, it’s even useful for data engineers and software developers, which to be honest will probably be the shoes your young data scientists start to fill anyway. Also, more specific to Dataflow (and Spark, for that matter), the Python-based implementations of those frameworks require fewer lines of code.

Getting to the point, my preference comes down to ease of use and getting more teams across a powerful cloud technology.

Source: StackOverflow — Expected Traffic of Programming Languages

Handily, the Python SDK for Dataflow can leverage Python libraries along with existing custom code already written in Python. It is possible to structure your Dataflow deployment in such a way that it leverages custom code as user-defined functions (UDFs). Also, you can quickly and easily prototype Dataflow pipelines with input types that have only rather recently become supported.
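
As a rough sketch of what that looks like, a plain Python function can be dropped straight into a pipeline as a transform; `parse_uri` below is a hypothetical stand-in for custom library code, and the pipeline runs locally on the DirectRunner for quick prototyping.

# Sketch: using custom Python code as a UDF and prototyping locally
import apache_beam as beam

def parse_uri(message):
    # Hypothetical user-defined function: turn a PubSub-style message into a GCS URI
    return message.decode('utf-8').strip()

with beam.Pipeline() as p:   # DirectRunner by default, handy for prototyping
    (p
     | 'Create' >> beam.Create([b'gs://my-bucket/sentinel2/B04.jp2'])   # placeholder URI
     | 'ParseUri' >> beam.Map(parse_uri)                                # custom code as a UDF
     | 'Print' >> beam.Map(print))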

Ultimately, you’ll have to take a hard look at your use case and requirements. If what you’re building is mission critical, requires connectors to third-party applications such as Kafka, or necessitates Python 3 (support for which is still experimental in the Python SDK), you may want to reconsider your Python choice. Hopefully, these limitations will change in the near future, but for now, the Python SDK is a useful tool for rapid prototyping and experiments, particularly for ML applications.

Unbounded Data Approach

There are really two options in Dataflow — batch processing and unbounded data processing (AKA stream processing). The semantics of “unbounded data” processing, along with why one would take that approach even when the task could easily be accomplished via batch processing, are laid out well elsewhere — it’s worth reading up on.

I chose the unbounded data processing route because doing so was most valuable to expanding my own thought process.

With that in mind, I did a few key things. First, I ensured I was using the `streaming` parameter.

# Arguments used by Dataflow
argv = [
    '--project={}'.format(PROJECT_ID),
    '--job_name={}'.format(TEST_NAME),
    '--save_main_session',
    '--staging_location=gs://{}/staging/'.format(BUCKET),
    '--temp_location=gs://{}/tmp/'.format(BUCKET),
    '--runner=DataflowRunner',
    '--streaming',
]

Next, I built my Dataflow package so it reads from one PubSub topic and writes to another.

# Reading from and writing to a PubSub topic in Dataflow
import apache_beam as beam

p = beam.Pipeline(argv=argv)

(p
 | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic=None,
                                              subscription=full_subscription_name,
                                              id_label=None,
                                              timestamp_attribute=None).with_output_types(bytes)
 | 'Transformations' >> beam.Map(lambda line: Transform(line))   # custom transformation logic
 | 'WriteStringsToPubSub-EVIReceiver' >> beam.io.WriteToPubSub(topic=full_receiver_topic_name)
)

p.run()  # launch the streaming pipeline

Finally, I chose a window type. Naturally, I experimented with several different options, but a session window made the most sense, as there was a definitive end to the data I was loading. Session-based windowing also helped ensure I could transform my data efficiently in a way that suited my use case, and it means I won’t need to rewrite my code if the size of the input data grows in the future.

# Windowing in Dataflow
import apache_beam as beam
from apache_beam import window

p = beam.Pipeline(argv=argv)

K = beam.typehints.TypeVariable('K')
V = beam.typehints.TypeVariable('V')

(p
 | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic=None,
                                              subscription=full_subscription_name,
                                              id_label=None,
                                              timestamp_attribute=None).with_output_types(bytes)
 | 'Encode' >> beam.Map(lambda line: line.encode("utf-8"))
 # | 'Window' >> beam.WindowInto(window.FixedWindows(5.0))   # fixed windows were one alternative
 | 'Window' >> beam.WindowInto(window.Sessions(60.0))        # session windows with a 60-second gap
 | 'GroupBy' >> beam.GroupByKey().with_output_types(beam.typehints.Tuple[K, V])   # expects (key, value) elements upstream
)

p.run()

Limitations aka Those “Unicorn Farts”

Like with all software, or many things in life, limitations do crop up.

Beta

The Python implementation of Dataflow, specifically the streaming components, is still largely in beta. You may read some material that suggests, for example, that the autoscaling parameter doesn’t work. That’s not true, it does, but it’s unclear how that may change, if at all, and whether what you write today will remain backward compatible in the near or distant future.

Also, as mentioned above, several third-party connectors are not supported and Python 3 is experimental, which was not an issue for my use case but may be for yours.

Connectivity to BigQuery

While reading from PubSub was incredibly easy, I found writing to BigQuery not to be as straightforward. When I figured it out, it dawned on me how easy it actually is; I just wish I’d found a better guide at the time. Additionally, reading from BigQuery is its own little adventure — each row comes back as column-header/content key-value pairs. If your schema is simple it doesn’t pose a substantial challenge, but if your schema is constantly fluctuating or lengthy — which can happen in ML applications — you’ll need to develop UDFs to handle your dynamic and fat dataset.
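
For reference, here is a minimal sketch of what writing rows to BigQuery from a Beam pipeline can look like; the table, dataset, and schema names are hypothetical placeholders rather than the ones used in this project.

# Sketch: writing dict rows to BigQuery from a Beam pipeline (placeholder names)
import apache_beam as beam

p = beam.Pipeline(argv=argv)

(p
 | 'Create' >> beam.Create([{'tile_id': 'T55HDV', 'mean_evi': 0.42}])   # rows as dicts keyed by column name
 | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
       table='evi_results',                        # hypothetical table
       dataset='remote_sensing',                   # hypothetical dataset
       project=PROJECT_ID,
       schema='tile_id:STRING,mean_evi:FLOAT',     # simple comma-separated schema string
       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)

p.run()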

Limited Compute Engine vCPUs

You’re likely limited by the quota of Compute Engine instances (and vCPUs) that can be created in your GCP environment. During my project, I could only get 18 Compute Engine instances with 8 cores each allocated to my Dataflow job. I could increase the quota, but going through that process is really not worth the effort at this very moment.
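
If you do hit that ceiling, the worker count and machine type can be pinned through pipeline options so the job stays inside quota; the values below simply mirror the 18-instance, 8-core limit mentioned above and are illustrative.

# Illustrative options for keeping a Dataflow job within a vCPU quota
argv = [
    '--runner=DataflowRunner',
    '--worker_machine_type=n1-standard-8',   # 8 vCPUs per worker
    '--max_num_workers=18',                  # 18 workers x 8 vCPUs = 144 vCPUs
]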

It’s Not You, It’s Me?

The only true gotcha that I ran into had nothing to do with Dataflow. It had everything to do with one of the dependencies underpinning Rasterio. That library has a GDAL dependency, which supposedly allows you to read from GCS.

However, I found that it relies on credentials written to a configuration file for API authentication instead of a service account. This, unfortunately, forces me to write a file to disk, and every time you write something to disk, Dataflow generates a new container. Generate enough containers and you quickly run out of disk space, which will bring Dataflow to a grinding halt.
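
For illustration, GCS access through Rasterio goes via GDAL’s /vsigs/ virtual filesystem, which is driven by GDAL configuration options (for example HMAC-style access keys) rather than the service account already attached to the Dataflow worker; the bucket path and key values below are placeholders.

# Sketch: reading a GCS-hosted raster via GDAL's /vsigs/ handler (placeholder credentials)
import rasterio

with rasterio.Env(GS_ACCESS_KEY_ID='HMAC_ACCESS_KEY',        # placeholder HMAC key
                  GS_SECRET_ACCESS_KEY='HMAC_SECRET_KEY'):   # placeholder HMAC secret
    with rasterio.open('/vsigs/my-bucket/sentinel2/B04.jp2') as src:
        band = src.read(1)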

Source: Arthur Osipyan via Unsplash

Conclusion

Rasterio doesn’t access GCS in the preferred manner, through a service account, which really threw a spanner in the works for this project. It’s best to acknowledge that “serverless” doesn’t mean “pain-free” just yet. You still have to deal with the reality of your use case along with resources like time and money.

To take this experiment further, I have two options: one, build some mechanism that allows me to authenticate via a service account, or two, drop Dataflow and gravitate to a Terraform implementation. However, both would eat up time that I’ve quickly run out of.

The ease and convenience of using Dataflow with Python impressed me. I’m totally a fan. Moreover, the idea of reaching for an unbounded data processing approach first and foremost is very appealing. The limitations I hit were minor, and the biggest hurdle had more to do with my specific use case than with Dataflow itself.

I hope Rasterio and/or GDAL move to support GCS access through service account authentication. It would be better practice, and remote sensing analysis leveraging Rasterio in Dataflow would be spectacular.

Source: NASA via Unsplash

Want More?



More from me

Like what I wrote? Check out my last article about the value of using Scrum.


Thanks to Graham Polley.
