Random Samples With Buffering
Earth Engine by Example
A common concern when doing land cover classifications is the risk of spatial autocorrelation in sampled data skewing the prediction results or accuracy assessment. One method that can help with this issue is ensuring the training and validation samples are sufficiently spaced using some form of buffering. This example will demonstrate one way to do that.
A brute-force method for generating a random sample with a buffer might be to take a large set of samples and filter those down to a smaller number by discarding close neighbors. However, picking which points to start from can be challenging, and computing the cross-product of distances between points can be expensive, so that approach is unlikely to scale well. The general rule for Earth Engine is “the more work that can be done in image space (using tiles and pixels), the better a solution is likely to scale.” To that end, this example will demonstrate buffering points by generating grid cells of a specified size and sampling one point from each grid cell.
Generating raster-based grid cells is straightforward: reproject any image into the desired projection and scale using reproject(). This example will eventually rely on neighboring cells having unique integer values, so a good starting point is to reproject a random image generated by ee.Image.random().
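As a rough mental model of what the reprojected random image provides, the sketch below maps a coordinate to its grid cell and derives a pseudo-random integer label for that cell. It is plain JavaScript, not Earth Engine API code. The helper names cellOf and cellLabel and the hash constants are assumptions made for illustration; in Earth Engine the labels come from ee.Image.random() and the cell boundaries from reproject().

```javascript
// Plain-JavaScript model of a labeled grid; hypothetical helpers, not the Earth Engine API.

// Map a projected coordinate (in meters) to the column/row of its grid cell.
function cellOf(x, y, cellSize) {
  return { col: Math.floor(x / cellSize), row: Math.floor(y / cellSize) };
}

// Derive a pseudo-random unsigned 32-bit label from a cell's column/row and a seed.
// Every coordinate inside one cell maps to the same (col, row), hence the same label.
function cellLabel(col, row, seed) {
  let h = (Math.imul(col, 73856093) ^ Math.imul(row, 19349663) ^ Math.imul(seed, 83492791)) >>> 0;
  h = Math.imul(h ^ (h >>> 15), 2246822519) >>> 0;
  return (h ^ (h >>> 13)) >>> 0;
}

const cellSize = 50000; // 50 km cells
const a = cellOf(12000, 49000, cellSize); // inside the first cell
const b = cellOf(49999, 100, cellSize);   // still inside the first cell
const c = cellOf(50001, 100, cellSize);   // just across the boundary, next column
console.log(a, b, c);
console.log(cellLabel(a.col, a.row, 1), cellLabel(c.col, c.row, 1));
```

Because the label depends only on the cell indices and the seed, changing the seed yields a different set of labels over the same grid.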
The next step is to pick a random point within each of the grid cells. That can be accomplished (still in image space) using reduceConnectedComponents() on the cells result plus a second random image, selecting the maximum random value in each grid cell. The reduceConnectedComponents() function applies a reducer to the values (the random image) covered by each patch of homogeneous values in a label band (the cells). This example generates the second random image at 1/16th of the grid's cell size, so each grid cell contains 16 × 16 = 256 candidate points. The location where random == maximum within each grid cell gets marked with a 1 value and the rest of the values are masked.
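The per-cell maximum selection can be modeled outside Earth Engine. The following sketch is a simplified stand-in for reduceConnectedComponents() with a max reducer, not the script's actual code. It fills a small raster with pseudo-random values, groups pixels into cells of 16 × 16 pixels (256 candidates per cell), and keeps the pixel holding each cell's maximum. The function names, the raster size, and the seeded generator are assumptions chosen for illustration.

```javascript
// Small seeded PRNG (mulberry32 style) so the sketch is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// For a width x height raster, return one point per cell:
// the pixel with the largest random value inside that cell.
function samplePoints(width, height, pixelsPerCell, seed) {
  const rand = mulberry32(seed);
  const best = new Map(); // cell key -> {x, y, value}
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const value = rand();
      const key = Math.floor(x / pixelsPerCell) + ',' + Math.floor(y / pixelsPerCell);
      const current = best.get(key);
      if (current === undefined || value > current.value) {
        best.set(key, { x, y, value });
      }
    }
  }
  return Array.from(best.values());
}

// 64 x 64 pixels with 16-pixel cells gives a 4 x 4 grid, 256 pixels per cell.
const points = samplePoints(64, 64, 16, 42);
console.log(points.length, points[0]);
```

Note that two points in adjacent cells can still land next to each other along the shared edge, which is the "loose" spacing behavior described next.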
This configuration doesn't quite guarantee that each point is at least distance meters away from its nearest neighbor, only that points are that far apart on average. This is a "loose" idea about point spacing. If "strict" adherence to the buffering distance is needed, then every other row and column of cells in the grid can be masked out using ee.Image.pixelCoordinates() and some math. Removing these cells guarantees each point will be a minimum of distance from its nearest neighbor and 2*distance on average.
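To see why dropping every other row and column enforces the minimum spacing, consider this plain-JavaScript sketch. It keeps only cells whose row and column indices are both even, then computes the closest that two points in different kept cells could be. The helper names are hypothetical, and the even/even rule is just one possible choice of "every other" cell; it is not taken from the original script's ee.Image.pixelCoordinates() expression.

```javascript
// Keep a cell only when both its column and row index are even.
function keepCell(col, row) {
  return col % 2 === 0 && row % 2 === 0;
}

// Smallest possible gap (in meters) between points in two different cells.
// A cell spans [index * cellSize, (index + 1) * cellSize) along each axis,
// so cells that are k indices apart are separated by (k - 1) full cells.
function minGap(cellA, cellB, cellSize) {
  const gapAlong = (i, j) => Math.max(0, Math.abs(i - j) - 1) * cellSize;
  return Math.hypot(gapAlong(cellA.col, cellB.col), gapAlong(cellA.row, cellB.row));
}

const cellSize = 50000; // plays the role of `distance`
const kept = [];
for (let row = 0; row < 6; row++) {
  for (let col = 0; col < 6; col++) {
    if (keepCell(col, row)) kept.push({ col, row });
  }
}

// Find the tightest gap over all pairs of kept cells.
let tightest = Infinity;
for (let i = 0; i < kept.length; i++) {
  for (let j = i + 1; j < kept.length; j++) {
    tightest = Math.min(tightest, minGap(kept[i], kept[j], cellSize));
  }
}
console.log(kept.length, tightest);
```

Every pair of kept cells has at least one masked cell between them, so no two points can be closer than one cell width. The cost is that only a quarter of the cells produce a point.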
The final result of points can be extracted to a FeatureCollection using reduceToVectors(). The images below display the extracted points with a buffer of radius distance/2 for visualization purposes. Note that in the 50 km version (left, orange) there are points that nearly touch in the lower left and upper right of the image, but none that overlap.
With everything built into a callable function, the complete script can be found at https://goo.gle/3tsFpa7 along with a utility for displaying the pixel grid of a projection.
Caveats
- Use clip() before reproject() so individual cells on the coastline don't get split into separate parts (and become multiple points).
- Using reproject() can often be problematic when displaying results on a map, since it overrides Earth Engine's normal scaling behaviors. It will only be an issue in this example if you use a small cellSize and then zoom out really far. There should be no issues displaying (or using) the final FeatureCollection, since all the reprojects are map-independent by then.
- I was able to get this to scale to >300,000 points in the Code Editor. To use more points than that, you might need to run it as a table export, or use multiple passes. But split things spatially; otherwise, points might not maintain the desired spacing.
- You can add bands to the inputs going into reduceToVectors to sample covariates at the same time (use a first reducer in that case). If you run out of memory, try exporting the points (without covariates) to a table first.
- To do stratified sampling, you can simply replace reduceToVectors with stratifiedSample; however, you will need to mask the class band with the points image.
- I elected to use an Albers projection because both Mercator and plate carrée have distance distortion as you move away from the origin, so it's harder to ensure the minimum distance guarantee with fixed-size grid cells in those projections. Note: the projection you use to generate your points doesn't have to match the projection you use to sample your covariates.
- Suppose you already have points and just want to select a subset that meets the buffering criteria. In that case, you can use reduceRegions with a max reducer on the random image, grouping by the cells image. The max reducer will allow you to specify additional inputs (e.g., covariates or pixel coordinates) to carry along with whatever maximum it finds.
- If you're going to take multiple samples, e.g., for k-fold cross-validation, you should offset the grid each time so you're not using the exact same sampling grid for each fold, for instance by shifting the grid's origin by a different amount for each fold.
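One way to picture the per-fold grid offset is the plain-JavaScript sketch below: each fold shifts the grid origin by a different fraction of a cell, so the same coordinate sits at a different position relative to the cell boundaries in each fold. This is not the snippet from the original post. The helper names and the evenly spaced offsets are assumptions for illustration, and in Earth Engine the shift would be applied to the projection used to build the grid.

```javascript
// Offset of the grid origin (in meters) for a given fold:
// evenly spaced fractions of one cell (an illustrative choice).
function foldOffset(fold, numFolds, cellSize) {
  return (fold * cellSize) / numFolds;
}

// Column/row of the cell containing (x, y) when the grid origin
// is shifted by `offset` meters along both axes.
function cellOf(x, y, cellSize, offset) {
  return {
    col: Math.floor((x - offset) / cellSize),
    row: Math.floor((y - offset) / cellSize),
  };
}

const cellSize = 50000;
const numFolds = 5;
const offsets = [];
for (let fold = 0; fold < numFolds; fold++) {
  offsets.push(foldOffset(fold, numFolds, cellSize));
}
console.log(offsets);

// The same location falls into differently aligned cells depending on the fold.
console.log(cellOf(5000, 5000, cellSize, offsets[0]), cellOf(5000, 5000, cellSize, offsets[1]));
```

Random offsets drawn per fold would work as well; the point is only that no two folds share identical cell boundaries.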
See Also
I did a video on object-based image analysis at the 2018 Geo For Good. There's a lot more about reduceConnectedComponents in there.