Predicting Malaria Vector Populations Using Google Earth Engine and Cloud AutoML

A workflow to generate machine learning models using structured data from satellite imagery

Stephen McNeal
Geek Culture
7 min read · Apr 28, 2021


Photo by Annie Spratt on Unsplash

Malaria is a parasitic disease caused by Plasmodium parasites, which are transmitted through the bite of the female Anopheles mosquito. In 2019 there were over 229 million cases of the disease worldwide, and Africa accounted for 94% of them. The illness can lead to death if it is not treated within 24 hours. By pairing precise mosquito trap data from the United States with remotely sensed data, we can generate a mosquito population model that can be applied in any part of the world.

To begin, you’ll need a Google Earth Engine account, a Google Cloud Platform account, and a standard Google email account to access Google Drive. Fortunately, you will not need to download any software for this workflow.

Photo by Mitchell Luo on Unsplash

The Mosquitos

First we’ll download mosquito data compiled by a team of researchers working to create a national vector database. Here is the link. The data comes from local, state, and federal government mosquito traps around the country and contains over 200,000 rows of trap data. For this example we will filter the data to Anopheles mosquitoes for the year 2015, which leaves about 4,000 rows. Each row has latitude and longitude coordinates showing where the trap was located.
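If you prefer to do the filtering yourself, the step above can be sketched in pandas. The column names here (`genus`, `collection_date`, `latitude`, `longitude`) are hypothetical — adjust them to match the actual export:

```python
import pandas as pd

# Toy stand-in for the raw trap export; column names are assumptions.
traps = pd.DataFrame({
    "genus": ["Anopheles", "Culex", "Anopheles", "Aedes"],
    "collection_date": ["2015-06-01", "2015-06-02", "2016-07-04", "2015-05-30"],
    "latitude": [32.3, 33.5, 30.4, 41.8],
    "longitude": [-86.3, -86.8, -84.3, -87.6],
})
traps["collection_date"] = pd.to_datetime(traps["collection_date"])

# Keep only Anopheles traps collected during 2015.
anopheles_2015 = traps[
    (traps["genus"] == "Anopheles")
    & (traps["collection_date"].dt.year == 2015)
].reset_index(drop=True)

print(len(anopheles_2015))  # rows remaining after the filter
```

On the full dataset, the same two conditions reduce the 200,000+ rows to roughly 4,000.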

The mosquito data used in this workflow. Image by author.

The data is distributed across eight states.

Data distribution by state. Image by author.

The Satellite Data

From the Earth Engine data catalog. Image by author.

For this workflow we will be using Google Earth Engine to extract satellite data for each mosquito trap point. Because of this, the dataset must be available in the Google Earth Engine Data Catalog. Any global dataset can be used for this, but MODIS Terra Surface Reflectance 8-Day Global 500m was chosen for this example. You may be asking “Why MODIS instead of Sentinel or Landsat? Aren’t they higher resolution?” Yes, they are. The reason MODIS 8-day was chosen is that the 8-day composite, generated by LP DAAC, is generally very close to cloud-free. Landsat has a temporal frequency (revisit time) of 16 days, and even then the image may have clouds. Sentinel-2 data availability begins in 2017, and our data is from 2015. Finally, MODIS 8-day 500m works well for this project because we will be buffering our points and extracting the mean value anyway, so spatial resolution matters less. Rather than using an index such as NDVI, we will be using all 7 bands. More data is usually better for machine learning.
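For context, an index such as NDVI would collapse two of those bands into a single value. A minimal sketch, using the MODIS band roles (`sur_refl_b01` is red, `sur_refl_b02` is near-infrared) with made-up reflectance values:

```python
def ndvi(red, nir):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

# Illustrative reflectance values only (MODIS stores them scaled by 10000).
red, nir = 0.08, 0.40
print(round(ndvi(red, nir), 3))  # dense vegetation tends toward +1
```

Feeding all 7 raw bands to the model instead lets it learn its own combinations rather than being limited to one hand-crafted index.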

Google Earth Engine

Google Earth Engine is a powerful tool for geospatial analysis and visualization. Without Earth Engine in this workflow, every image would need to be downloaded to extract its values.

Copy and paste the following code into Earth Engine:

This first block of code defines functions for buffering the points and extracting band values.

function bufferPoints(radius, bounds) {
  return function(pt) {
    pt = ee.Feature(pt);
    return bounds ? pt.buffer(radius).bounds() : pt.buffer(radius);
  };
}

function zonalStats(ic, fc, params) {
  // Initialize internal params dictionary.
  var _params = {
    reducer: ee.Reducer.mean(),
    scale: null,
    crs: null,
    bands: null,
    bandsRename: null,
    imgProps: null,
    imgPropsRename: null,
    datetimeName: 'datetime',
    datetimeFormat: 'YYYY-MM-dd HH:mm:ss'
  };

  // Replace initialized params with provided params.
  if (params) {
    for (var param in params) {
      _params[param] = params[param] || _params[param];
    }
  }

  // Set default parameters based on a representative image.
  var imgRep = ic.first();
  var nonSystemImgProps = ee.Feature(null)
    .copyProperties(imgRep).propertyNames();
  if (!_params.bands) _params.bands = imgRep.bandNames();
  if (!_params.bandsRename) _params.bandsRename = _params.bands;
  if (!_params.imgProps) _params.imgProps = nonSystemImgProps;
  if (!_params.imgPropsRename) _params.imgPropsRename = _params.imgProps;

  // Map the reduceRegions function over the image collection.
  var results = ic.map(function(img) {
    // Select bands (optionally rename), set a datetime & timestamp property.
    img = ee.Image(img.select(_params.bands, _params.bandsRename))
      .set(_params.datetimeName, img.date().format(_params.datetimeFormat))
      .set('timestamp', img.get('system:time_start'));

    // Define final image property dictionary to set in output features.
    var propsFrom = ee.List(_params.imgProps)
      .cat(ee.List([_params.datetimeName, 'timestamp']));
    var propsTo = ee.List(_params.imgPropsRename)
      .cat(ee.List([_params.datetimeName, 'timestamp']));
    var imgProps = img.toDictionary(propsFrom).rename(propsFrom, propsTo);

    // Subset points that intersect the given image.
    var fcSub = fc.filterBounds(img.geometry());

    // Reduce the image by regions.
    return img.reduceRegions({
      collection: fcSub,
      reducer: _params.reducer,
      scale: _params.scale,
      crs: _params.crs
    })
    // Add metadata to each feature.
    .map(function(f) {
      return f.set(imgProps);
    });
  }).flatten().filter(ee.Filter.notNull(_params.bandsRename));

  return results;
}

Next we will buffer every point by 50 meters and, for every point and every image in 2015 (we will match each point to the correct image later), calculate the median value within the buffer for all 7 MODIS bands and record it on the buffer polygons.

var pts = ee.FeatureCollection('users/spm0004/Mosquitos_Anopheles_2015');
var ptsModis = pts.map(bufferPoints(50, true));
var modisref = ee.ImageCollection('MODIS/006/MOD09A1')
  .filterDate('2015-01-01', '2016-01-01')
  .filter(ee.Filter.calendarRange(0, 365, 'DAY_OF_YEAR'));

// Define parameters for the zonalStats function.
var paramsref = {
  reducer: ee.Reducer.median(),
  scale: 500,
  crs: 'EPSG:5070',
  bands: ['sur_refl_b01', 'sur_refl_b02', 'sur_refl_b03', 'sur_refl_b04',
          'sur_refl_b05', 'sur_refl_b06', 'sur_refl_b07'],
  datetimeName: 'date_MODIS',
  datetimeFormat: 'MM/dd/YYYY'
};

// Extract zonal statistics per point per image.
var ptsModisStats = zonalStats(modisref, ptsModis, paramsref);
print(ptsModisStats.limit(50));

Finally, we will export the buffered mosquito points, now joined with MODIS band values, to Google Drive.

Export.table.toDrive({
  collection: ptsModisStats,
  folder: 'Malaria',
  description: 'Mosquitos_NorthAmerica',
  fileFormat: 'CSV'
});

The result of this workflow will be a model that can predict mosquito populations from 7 MODIS bands anywhere in the world. Let’s generate 10,000 random points over Africa to see the model in action. For simplicity, a polygon of Africa was drawn within Earth Engine by tracing the continent.

var geometry1 = ee.FeatureCollection(Africa);
// Randomly generate 10000 points within the outline, then buffer them.
var pts2 = ee.FeatureCollection.randomPoints(geometry1, 10000);
var ptsModis2 = pts2.map(bufferPoints(50, true));
var modisref2 = ee.ImageCollection('MODIS/006/MOD09A1')
  .filterDate('2021-04-10', '2021-04-27') // change this to the past 8 days
  .filter(ee.Filter.calendarRange(0, 365, 'DAY_OF_YEAR'));

var paramsref2 = {
  reducer: ee.Reducer.median(),
  scale: 500,
  crs: 'EPSG:5070',
  bands: ['sur_refl_b01', 'sur_refl_b02', 'sur_refl_b03', 'sur_refl_b04',
          'sur_refl_b05', 'sur_refl_b06', 'sur_refl_b07'],
  datetimeName: 'date_MODIS',
  datetimeFormat: 'MM/dd/YYYY'
};

// Extract zonal statistics per point per image.
var ptsModisStats2 = zonalStats(modisref2, ptsModis2, paramsref2);
print(ptsModisStats2.limit(50));

Export.table.toDrive({
  collection: ptsModisStats2,
  folder: 'Malaria',
  description: 'Africa',
  fileFormat: 'CSV'
});

Map.addLayer(pts2, {color: 'white'}, 'geodesic polygon');

Cleaning the Data

The next step involves sorting the images and points so that each point uses imagery collected less than 8 days after the trap was collected (currently every trap has a row for every image in 2015). To do this we will use Google Colab. Google Colab is a Jupyter notebook environment that runs entirely online, no download necessary. As an added benefit, it connects to your Google Drive account, which is where our data is. The following Python code will extract the correct image for each point.

from google.colab import drive
drive.mount("/content/drive")

import pandas as pd

date_format = "%m/%d/%Y"
modisdf = pd.read_csv(r'/content/drive/MyDrive/Mosquitos_Anopheles_2015.csv')

# Parse the trap collection date and the MODIS image date.
modisdf['date'] = pd.to_datetime(modisdf['date'], format=date_format, errors='ignore')
modisdf['date_MODIS'] = pd.to_datetime(modisdf['date_MODIS'], format=date_format, errors='ignore')

# Days between the MODIS image and the trap collection date.
modisdf['Difference'] = (modisdf['date_MODIS'] - modisdf['date']).dt.days
modisdf['Difference'] = pd.to_numeric(modisdf['Difference'])

# Keep only images collected 0-7 days after the trap date.
modisdf = modisdf[(modisdf['Difference'] > -1) & (modisdf['Difference'] < 8)]

modisdf = modisdf.drop(['system:index', '.geo'], axis=1).reset_index()
print(modisdf.columns)
modisdf.to_csv(r'/content/drive/MyDrive/Mosquitos_Anopheles_2015_cleaned.csv')
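To see what the 0-7 day window actually keeps, here it is on a few synthetic rows (column names follow the export above; the dates are made up):

```python
import pandas as pd

rows = pd.DataFrame({
    "date": pd.to_datetime(
        ["06/01/2015", "06/01/2015", "06/01/2015"], format="%m/%d/%Y"),
    "date_MODIS": pd.to_datetime(
        ["05/25/2015", "06/02/2015", "06/18/2015"], format="%m/%d/%Y"),
})
rows["Difference"] = (rows["date_MODIS"] - rows["date"]).dt.days

# Images from before the trap date (-7) or too long after (17) are dropped;
# only the image collected 1 day after the trap survives.
kept = rows[(rows["Difference"] > -1) & (rows["Difference"] < 8)]
print(kept["Difference"].tolist())  # [1]
```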

Google Cloud AutoML

Google Cloud AutoML Tables workflow. Image by author.

Now it’s time for model creation. Thanks to Cloud AutoML, this task is simple. Cloud AutoML is a service on Google Cloud Platform that makes it easy for anybody to use machine learning in their work. Given a dataset, it trains multiple candidate models simultaneously to determine the best fit for your specific data. First you’ll need to enable the Cloud AutoML API in your Google Cloud Platform account. Next, create a new dataset and upload the “cleaned” .csv that was generated in Colab. Once it loads, you’ll be able to pick the prediction column. You can also click on any column to see a chart of the distribution of its values.

Data distribution for band 7. Image by author.

We will use the “abundance” column. Once the prediction column is selected, AutoML populates the Cramér’s V correlation statistic between each column and the target column to help you decide which columns to include in the model. We will be using the 7 MODIS bands from Earth Engine. By default, 80% of your data rows are used for training, 10% for validation, and 10% for testing. Click “Train Model”, select the 7 bands as input features, and enter a training budget in hours. Training takes a minimum of 2 hours, but you’ll get an email when it’s complete.
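AutoML handles the split for you, but the same 80/10/10 partition can be sketched manually in pandas (purely illustrative; AutoML’s internal split strategy may differ):

```python
import pandas as pd

df = pd.DataFrame({"abundance": range(100)})  # stand-in for the cleaned CSV

# Shuffle, then slice into 80% train / 10% validation / 10% test.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: int(0.8 * n)]
val = shuffled.iloc[int(0.8 * n): int(0.9 * n)]
test = shuffled.iloc[int(0.9 * n):]

print(len(train), len(val), len(test))  # 80 10 10
```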

Google Cloud AutoML model. Image by author.

Once you’re satisfied with a model, click “Test and Use”, then “Online Prediction”, and then “Deploy Model”. Once the model is deployed, we will use our 10,000 random points over Africa to make predictions.
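A deployed AutoML Tables model can be queried over REST. Here is a hedged sketch of building the request body for the online prediction endpoint; the project, region, and model IDs are placeholders, the band values are made up, and you should check the current AutoML Tables documentation for the exact request schema:

```python
import json

def build_predict_request(band_values):
    """Build the JSON body for an AutoML Tables online prediction call.

    band_values: the 7 MODIS surface reflectance values, in the same
    column order used during training.
    """
    return {"payload": {"row": {"values": [str(v) for v in band_values]}}}

# Placeholder identifiers; substitute your own project, region, and model.
endpoint = ("https://automl.googleapis.com/v1beta1/projects/PROJECT_ID/"
            "locations/us-central1/models/MODEL_ID:predict")

body = build_predict_request([528, 2671, 312, 590, 2983, 1938, 1003])
print(json.dumps(body))
```

An authenticated POST of `body` to `endpoint` (for example with the `requests` library and an OAuth bearer token) returns the predicted abundance for that row.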

The Final Product

The Google Earth Engine App. Image by author.

Check out the Earth Engine App:
