Juno for Jupyter: Taking Your Analysis to the Data

As we’ve worked with Jupyter notebooks, one of our frustrations has been moving large data sets into a local or hosted notebook environment. One of the most challenging datasets to move and analyze is satellite imagery: each individual image can be gigabytes in size, which makes downloading and analyzing imagery slow and painful.

To make life a bit easier, the Timbr.io team has created a new project called Juno. Using Juno you can connect to, execute commands on, and get output from remote Jupyter kernels from within a local or hosted Jupyter notebook, while also sending the outputs to the app for sharing with others. The remote kernels you spin up in Juno can be accessed from your notebook via our new, open source Jupyter notebook extension called juno-magic.

The Challenges of Satellite Imagery

Using satellite imagery as an example, Juno lets us dispense with moving the imagery altogether and execute our analysis remotely from our notebook. The philosophy: if the imagery will not come to the algorithm, then the algorithm must go to the imagery. The bonus with Juno is that we can compute against remote data in the same familiar, interactive way that we work with local data within a Jupyter notebook, and the results are immediately shareable. No code orchestration, no moving gigabytes of data.

A reasonable question about this approach is: why not just host your notebook where the data is? That works well if all your data lives in one place and you have the ability to deploy and secure a notebook there. But if you are working against multiple large datasets, they are often not all in the same place. Juno lets you orchestrate computation across multiple kernels regardless of where the data resides, from wherever is most convenient to drive the execution. This can be local, Microsoft’s Cortana Intelligence Gallery, Google’s Cloud Data Lab, Continuum.io’s Wakari, or another hosted notebook. The pattern can even be inverted, allowing you to expose kernels running in any of those platforms over Juno for access from somewhere else.

Dynamically Updating Jupyter Output

Kicking off remote computations that run persistently allows you to update analyses based on new data coming in. With Juno, these updates would be reflected in shareable, dashboard-like, output views available in the application. This could be a “once a day” database update or a live streaming pipeline — like we provide in Timbr.io. We think this aspect of Juno is particularly exciting since it enables Jupyter notebooks to become truly operational assets.

Landsat on AWS as a Use Case

Being long-time geogeeks, we wanted to break Juno in with a tough and popular big open dataset. AWS, in its infinite awesomeness, has been hosting a menagerie of open data sources for public computation. Among these is one of our favorite public sources: LANDSAT. As open data projects go, Landsat is a trailblazer. It is the longest-running acquisition of satellite imagery of Earth, with 44 years of history. LANDSAT 8 has 30 m resolution and eleven spectral bands, including thermal infrared. The downside is that each image is 1 GB compressed and 2 GB uncompressed.

That is a lot of data to move around, especially when you are doing interactive analysis. So we thought LANDSAT would be a great example dataset for Juno. We also wanted to let you play along instead of just reading some static code snippets. To that end we set up http://juno.timbr.io, which runs preconfigured kernels set up to process LANDSAT 8 data on AWS.

Getting Started with Juno

Juno-Magic is an open source library for exposing and accessing Jupyter kernels over the Juno routing hub. We have directions for installing the library, generating access tokens, and spinning up a test LANDSAT kernel, all here. A pip install plus a few commands and you’ll be on your way. Also be sure to install gist-magic, which will help everything run smoothly for the examples below. To provide a few ideas on how Juno can be used, we’ve run a few sample analytics with the LANDSAT data.
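Once installed, connecting from a notebook is just a couple of cell magics. The commands below are illustrative only, not the authoritative syntax (the exact magic names and arguments are in the install directions linked above), but the shape of the workflow looks like this:

```
%load_ext juno_magic         # load the extension into the notebook
%juno connect <YOUR_TOKEN>   # attach to a remote kernel by its access token
```

From there, code you send through the extension runs on the remote kernel, next to the data, while outputs come back to your local notebook.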

LANDSAT imagery doesn’t come out of the box in true color like we are used to seeing in Google Maps et al. To get started we’ll set up an algorithm, based on this excellent ShadedRelief tutorial, to generate true color images from the raw LANDSAT 8 imagery on AWS. If we just wanted to do this for one image, downloading and processing would not be a big deal. On the other hand, if we wanted to grab a long time series and create a true color movie of change over time, that would be better done close to the data.
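A minimal sketch of the color-correction step, assuming the red, green, and blue bands (bands 4, 3, 2 for LANDSAT 8) have already been read into numpy arrays; the percentile stretch and gamma value are illustrative choices, not the tutorial’s exact parameters:

```python
import numpy as np

def true_color(red, green, blue, gamma=1.5):
    """Stack LANDSAT 8 bands 4/3/2 into an 8-bit true-color image.

    Each band is contrast-stretched to its 2nd-98th percentile range,
    then gamma-corrected to brighten the typically dark raw imagery.
    """
    def stretch(band):
        lo, hi = np.percentile(band, (2, 98))
        scaled = np.clip((band - lo) / (hi - lo), 0.0, 1.0)
        return scaled ** (1.0 / gamma)  # gamma-correct toward brighter tones

    rgb = np.dstack([stretch(b.astype(np.float64)) for b in (red, green, blue)])
    return (rgb * 255).astype(np.uint8)
```

In practice the bands would be read straight from the AWS-hosted GeoTIFFs (for example with rasterio) inside the remote kernel, so only the small rendered image travels back to the notebook.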

Using Juno we can make this a lot easier by writing our algorithm locally in the notebook, running it in AWS, and getting back a movie of the results, just as if we had run it locally. With satellites now being launched in amazing quantity, the opportunity to visualize and measure change is particularly exciting. Dealing with this volume of imagery also opens up a whole host of challenges, and we hope Juno is a start in helping tackle them. While animated movies aren’t exactly rigorous scientific analyses, they are a great way to engage the public in how dynamic our planet is.

As an example we selected an area where the Colorado River empties into Lake Powell, causing a fascinating pattern of sediment flow. Juno allows us to quickly process a series of raw LANDSAT 8 images into a true-color composite time series. For this example we used 24 LANDSAT 8 images from January 2015 to December 2015, and dispatched a basic color-correction and compositing algorithm inside a Juno kernel. The entire time series was then combined into a time-lapse movie showing the sediment flow over the whole year. The notebook for all this can be found here.

Animation of a stack of LANDSAT data

One of the downsides of Jupyter is that it can be a little challenging to share output with a non-technical audience. For Juno we decided to add a dashboard for notebook output to populate, which can then be easily shared. If your data source is dynamically updating, Juno will also allow the notebook-powered dashboard to automatically update with new data as well. As an example, here is the dashboard for a few animations from the LANDSAT data.

Quantifying Burn Impacts from Wildfires

While seeing the change over time is cool, ideally we’d like to quantify change more rigorously. To do so we will switch context and jump into a more applied example of change analysis. Wildfires are a persistent threat, and calculating their impact is critical for fire and risk management, as well as for insurance companies helping landowners recover from fires.

A true-color composite of a wildfire burning in Northern California in August 2015

To start we’ll want to find specific images in the LANDSAT archive that would be good candidates for wildfire change detection. To perform the analysis we’ll need the locations of some wildfires and imagery before and after each fire. To help accomplish this we can leverage the MODIS Active Fire Detections for CONUS (2015) dataset. The MODIS dataset has several hundred thousand fires detected in 2015, so we want to begin by whittling that down to the specific fires we are interested in. By exploring our fire location data visually in a variety of graphs, maps, and charts, we can get a sense for how the data are structured and distributed.

We will use the “confidence” and “temperature” variables from the MODIS data to filter the fire events down to a manageable number. Selecting a confidence of 90 and a temperature of 400 brings the dataset down to 598 fires. Next, we need to find which of these fires have clear (low cloud cover) images before and after the date the fire started. Here we can leverage an open-source project called landsat-util inside our notebook to search for LANDSAT imagery on AWS. We search for images with less than 5% cloud cover that intersect the longitude/latitude of the fire.
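The filtering step itself is a simple pandas boolean selection. The column names and toy rows below are assumptions about the MODIS table layout, purely for illustration; the real file has hundreds of thousands of rows:

```python
import pandas as pd

# Hypothetical subset of the MODIS active-fire table.
fires = pd.DataFrame({
    "lat":        [40.1, 38.9, 41.2],
    "lon":        [-122.3, -120.5, -123.0],
    "confidence": [95, 60, 92],
    "temp":       [420.0, 310.0, 455.0],
})

# Keep only high-confidence, high-temperature detections.
candidates = fires[(fires["confidence"] >= 90) & (fires["temp"] >= 400)]
print(len(candidates))  # 2 of the 3 toy rows survive the filter
```

Tightening or loosening the two thresholds is how you trade breadth of coverage for a manageable number of fires to image-search.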

In order for a location to be classified as “suitable” we must have at least one image before, and at least one image after, the fire’s start date. Inside the notebook we employ some logic to sift through our search results, find suitable areas, and create a list of images for each.
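The “suitable” test reduces to a date comparison per fire. A sketch, assuming each search result carries an acquisition date (the function name and data shapes are ours, not landsat-util’s):

```python
from datetime import date

def is_suitable(fire_start, scene_dates):
    """A fire location is suitable when the low-cloud scenes found for it
    include at least one image before and one after the fire's start date."""
    before = any(d < fire_start for d in scene_dates)
    after = any(d > fire_start for d in scene_dates)
    return before and after

scenes = [date(2015, 7, 2), date(2015, 9, 14)]
print(is_suitable(date(2015, 8, 1), scenes))   # True: one scene each side
print(is_suitable(date(2015, 10, 1), scenes))  # False: no post-fire scene
```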

The analysis identifies 22 areas of interest that we can explore in greater detail, knowing we have adequate imagery to assess the area impacted by each fire. Traditionally we might grab a few of these images and experiment with some change detection to create a baseline. With Juno we can use a kernel deployed to AWS, closer to the data, and run our change detection algorithm on LANDSAT data accessed remotely. As with the movies, we can generate a dashboard with all our output nicely laid out for an easy, non-technical display.

Before and after Normalized Burn Ratio Indices (NBRI) with the difference of the two showing the burn scars (right).

This helps solve our initial problem of computing on large datasets: we now have the ability to compute across the data to find imagery of interest and create a one-off analysis. The next step is running our image processing tasks across all the identified images to generate results without having to move 140 GB of imagery around. To illustrate this capability we’ll calculate a burn index to estimate the number of square kilometers burned by the fires we’ve detected. To find the total area burnt, we threshold the difference from the change detection and count the number of pixels in the resulting mask.
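A sketch of the burn-area calculation, assuming the NIR and SWIR bands are available as reflectance arrays. The 30 m pixel size is LANDSAT 8’s, while the dNBR threshold value is illustrative (suitable cutoffs vary by landscape):

```python
import numpy as np

def nbr(nir, swir):
    """Normalized Burn Ratio: high over healthy vegetation, low over burn scars."""
    return (nir - swir) / (nir + swir + 1e-9)  # epsilon avoids division by zero

def burned_area_km2(nir_pre, swir_pre, nir_post, swir_post,
                    threshold=0.27, pixel_m=30.0):
    """Threshold the pre/post NBR difference and convert the burned-pixel
    count to square kilometers (LANDSAT 8 pixels are 30 m x 30 m)."""
    dnbr = nbr(nir_pre, swir_pre) - nbr(nir_post, swir_post)
    burned = dnbr > threshold  # boolean burn mask
    return burned.sum() * (pixel_m ** 2) / 1e6
```

Running this per identified area inside the remote kernel yields one number per fire, which is exactly the kind of small, shareable result that should travel back to the notebook instead of the imagery itself.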

We can apply the algorithm to each of the suitable areas and determine the spatial impact of each fire. As a result we now have both a map and a chart to visualize where the most impactful fires of 2015 occurred. Using Juno we were able to do all of this interactively, running all our analysis at the source of the data and saving many hours of data-sherpa work.

Next Steps

One big downside to LANDSAT is that the data does not update frequently, so showing off the dynamic compute capabilities of Juno is tough to do with LANDSAT alone. As cubesats open the possibility of imagery updating multiple times per day, Juno can allow us to perpetually update our analysis whenever new imagery is made available. That was a big motivation for the dashboard output approach. As new data rolls in we can update these analyses in real time to provide a true operational capability. This opens the door for alerting and automated assessment and analysis. For instance, we could monitor MODIS for fire detections. Every time a new fire is detected, we could automatically collect the most recent image before the fire and watch for the first image after it. This would then roll into automated change detection, a burn index, and a collated damage assessment. Further, we aren’t limited to any one dataset for our analysis. When we get our first notice of a fire from MODIS, we could use imagery to detect impervious surfaces, hit OSM for nearby infrastructure, and the Census for population data. That gives us all the ingredients we need for a threat index for emerging fire risk.

Play in the Juno Sandbox

The best part is that we’ve set up the infrastructure so you can start playing with Juno now. There are directions to install juno-magic in your notebook and spin up a LANDSAT kernel to run computations against. Just select “external kernel” and give it a name.

Then click “add new kernel” and you’ll get the token for your kernel, which you’ll reference in the notebook. Wait until the kernel status says “active” and you are good to go.

Kernel Tokens and Status

Next, click the usage instructions and you’ll get a link to pre-configured containers you can provision, including a kernel pre-configured with everything you need for LANDSAT analysis.

Container instructions for Docker

We’ll be adding more kernel types soon, as well as the ability to provision them automatically in a hosted environment. Feel free to use the code from the 1) true color animations and 2) burn area detection notebooks to test it out, or dive in and write your own methods. Lastly, we’d be remiss not to include a shout-out to GDAL and Rasterio for the key role they play in enabling all these analytics. Please hit us up with questions or feedback @timbr_io on Twitter.