Interpolation using KNN and IDW

11 min readApr 3, 2023

Spatial Data Analysis and Visualization with Geopandas and Python

In this article I will explain a method to create a density map from a limited number of coordinates, by estimating missing values using interpolation.

In this approach we first make a grid and then assign interpolated weights to each coordinate.

For the interpolation, three methods are explored:

Creating a KNN algorithm using BallTree
Using sklearn KNeighborsRegressor
Nearest neighbours with griddata from scipy

Then we briefly summarize more advanced interpolation methods such as linear and cubic interpolation, and krigin.

Obtaining the data

I will use as example data that is only available in GeoTiff format. After first extracting the weights for a limited number of coordinates, I will then try to reconstruct the original map.

Adding GeoTIFF data as a feature

One of the projects you might have come across, is building a machine learning model to predict real estate prices.

While this can involve a few specific methods, my project basically came down to scraping an Immo website and extracting all the features of the property.

Using concurrency and requests I could scrape 156,000 properties in a matter of hours. With this data, which included both categorical, binary and numerical features, I then implemented an XGBoost model.

Fairly happy with the results my model achieved an r2_score of about 85%.

I was able to tweak this score by including the Mobiscore. This is a measure that tells a property owners how easy it is to reach certain infrastructure such as bus and train stations, schools, hospitals, and so on.

Map access

The data is only available for Flanders and Brussels. We will work with the map that is offered on the website of the Flemish government.

You can download this map from here.

Reading the map

Geographic information systems use GeoTIFF and other formats to organize and store gridded raster datasets such as satellite imagery and terrain models.

The Rasterio python library reads and writes these formats and provides a Python API based on Numpy N-dimensional arrays and GeoJSON.

With this function we can read all the data from the first layer.

import rasterio

dat = rasterio.open(r"data/ni_mobiscore_ha.tif")

# read all the data from the first layer
z = dat.read()[0]

def getval(lon, lat):
    idx = dat.index(lon, lat)    
    return dat.xy(*idx), z[idx]

Now we need the geographical coordinates in the correct CRS. Belgium uses its proprietary coordination system for its maps, so we will have to convert each coordinate to the Lambert72 CRS, which is epsg:31370

This function transforms coordinates of our real estate properties in WGS 84 projection (EPSG 4326) to the Belgian Lambert 72 CRS (EPSG 31370).

from pyproj import Transformer

def conv_to_lambert(df):
    transformer = Transformer.from_crs("epsg:4326", "epsg:31370")

    df['lambert_latitude']=0
    df['lambert_longitude']=0

    for id in df['id']: 
            coords_1 = (df['latitude'][df['id']==id].values[0],\
                        df['longitude'][df['id']==id].values[0] )

            Lambert_x= transformer.transform(coords_1[0], coords_1[1])

            df.loc[df['id']==id, 'lambert_longitude'] = Lambert_x[0]
            df.loc[df['id']==id, 'lambert_latitude'] = Lambert_x[1]

Now we can lookup the mobiscore for all properties in our database. We extract this value from the map on any coordinate and add it as a feature!

Reversing the process

While participating in a hackaton for a Belgian company, I worked on a solution to reverse this process. I wanted to produce a density map from a limited set of coordinates.

Before I knew it, I was writing the code that would achieve this using a ‘K-Nearest Neighbors’ method.

This pragmatic approach avoids the need for extrapolation, so we don’t need to mask or use extra grid points to quickly obtain data for the full region on the map.

The data

While training the model 42,559 properties were actively for sale. These are the coordinates and their weights that we will use to recreate the map.

1) Interpolation using BallTree

To turn this data into a density map we will first need a grid. Then we can assign the mobiscore value of the closest nearby property to each centroid.

Preparing the grid

To do this we start with creating a ‘bounding box’ .

The ‘total_bounds’ attribute in Geopandas gives us the minimum and maximum x and y coordinates of the MULTIPOLYGON that describes the Flemish region.

bounds=region_VL.total_bounds  
minx=bounds[0]
miny=bounds[1]
maxx=bounds[2]
maxy=bounds[3]

# Creating our bounding box
L1 = (miny,minx)  # Left Bottom
L2 = (maxy,minx)  # Left Top
R1 = (miny,maxx)  # Right Bottom
R2 = (maxy,maxx)  # Right Top

How many points will our grid have?

For that we measure the maximum distances in latitude and longitude (in metric units) between the four points forming the boundaries of the Flanders MULTIPOLYGON.

max_longitude_distance = geopy.distance.distance(L2,R2).km 
max_latitude_distance= geopy.distance.distance(R1,R2).km

We then set a desired precision for our grid, we’ll set it to 2 km.

longitude_interval = geopy.distance.distance(kilometers=2)

Then we calculate the number of points to draw. The maximum longitude distance is 273.73 km and latitude 90.86 km. With a 2 km precision this means a (135 x 45) grid will cover the whole Flemish region.

grid=longitude_interval.km

n_longitude_grid_points = int(float(max_longitude_distance)/grid)
n_latitude_grid_points  = int(float(max_latitude_distance)/grid)

Creating the grid

Now we can write a function to build the grid. For this we can use the geopy.sphere attribute ‘destination’.

def destination(point, distance, bearing)

Given a start point, initial bearing, and distance, this will calculate the destination point and final bearing travelling along a great circle arc.

With the following code we can build our grid and transform it to a GeoPandas dataframe in the correct coordinate reference system.

# Create a list of points
points = []

# Build the grid
for j in range(n_latitude_grid_points): # Vertical 
    # Start from bottom left (L1)
    start = (miny+(j*latitude_interval_degrees),minx)
    for i in range(n_longitude_grid_points): # Horizontal
        # Calculate the next point in the longitude direction 
        lon_offset = longitude_interval.destination(start, bearing=90). \
                      longitude - start[1]
        next_point = (start[0], start[1] + lon_offset)        
        points.append(next_point)        
        start = next_point

# Make a GeoDataFrame which will hold the grid
points = [(point[1], point[0]) for point in points]
points= [Point(point) for point in points] 
grid = gpd.GeoDataFrame(geometry=points, crs='epsg:4326')

Plotting the grid

Now that we have our grid as a GeoDataFrame, we can plot it on top of our regional map.

A cartesian grid plotted over a geographical projection (WGS84)

Note: the projection method causes the area on the right of the grid to adapt to the curvature of the earth. Each point in this grid is exactly 2km separated from its neighbour.

Spatial join

Now we need to shape our grid so it stays inside our region. With a spatial join on both of our maps we create a new GeoDataFrame which will only keep the points inside the Flanders area.

region_VL=gpd.read_file("data/Flanders_region.geojson")
grid = gpd.GeoDataFrame(geometry=points, crs='epsg:4326')

# Perform a spatial join between the two GeoDataFrames
joined = gpd.sjoin(grid, region_VL, how='inner')

# Keep only the points inside the multipolygon
points_within_regions = joined['geometry']

# Create a new GeoDataFrame with only the points inside the multipolygon
grid_within_regions = gpd.GeoDataFrame(geometry=points_within_regions,\
 crs='epsg:4326')

Great! Our grid shows a centroid for each 4 km² inside the Flanders region.

Assigning weighted values to the grid

BallTree

Now that we have our grid, we can assign a value to each centroid. As mentioned, I will use an algorithm that searches for the mobiscore value of the most nearby real estate.

Because this would be a time-consuming method if we did it linearly, we can use a method known as BallTree. This basically divides regions around a coordinate in n-dimensional balls to quickly find its closest neighbor.

Note: In our case the balls are actually circles, because we will use this algo on a 2 dimensional map. But the great thing about BallTree is that it can handle many dimensions.

The following code implements this by extracting the (x,y) coordinates with a lambda function and querying the BallTree to find the id of the nearest real estate coordinate.

We set the k value to 1 so only the closest neighbor is taken into account.

# Create a BallTree
tree = BallTree(df_mobis['geometry'].apply(lambda p: [p.x, p.y]).tolist(),\
 leaf_size=2)

# Query the BallTree on each centroid in 'grid_within_regions'
# to find the distance and id of 1 nearest neighbour in 'df_mobis'
 distances, indices = tree.query(
    grid_within_regions['geometry'].apply(lambda p: [p.x, p.y]).tolist(),
    k=1, # The number of nearest neighbors
)

# Add these values to our grid 
grid_within_regions['distance_nearest'] = distances
grid_within_regions['id_nearest'] = indices

# Assign the closest neighbour's mobiscore to the corresponding point
grid_within_regions['mobi_value']= \
    df_mobis.iloc[grid_within_regions['id_nearest']]['mobiscore'].values

Our ‘grid_within_regions’ dataframe now holds the values of the mobiscore of the closest neighboring property.

Changing the colormap

It would be nice to adapt our color scheme to match the official map.

A colormap can be created with LinearSegmentedColormap. It creates colormaps from linear mapping segments. We simply make a normalized array of three colors in (r,g,b) format that we obtained from the official map.

With the handy PixSpy webtool we can easily obtain the colors we’ll need.

from matplotlib.colors import LinearSegmentedColormap

color_palette=np.array([(215,24,29), (238,248,185), (41,130,185)])
color_palette=color_palette/255

cmap = LinearSegmentedColormap.from_list('mobiscore', color_palette) 
norm = plt.Normalize(vmin=0, vmax=10)

colors = [cmap(norm(mobiscore)) for mobiscore in mobiscores]

Result

Finally, we make our grid a little more dense by increasing the resolution. In the next plot our precision is 1 km, and the new color scheme is applied.

Changing the code for KNN

This output is nice, but lacks detail in certain areas, especially near the borders. So what if we tried to interpolate using more than 1 neighbor?

We will turn our code into a KNN algorithm. In the Balltree query we simply have to change the k-value, let’s use 5 nearest neighbors.

# Query the BallTree on each centroid in 'grid_within_regions'
# to find the distance and id of 1 nearest neighbour in 'df_mobis'
 distances, indices = tree.query(
    grid_within_regions['geometry'].apply(lambda p: [p.x, p.y]).tolist(),
    k=5, # The number of nearest neighbors
)

Inverse Distance Weighting (IDW)

We will obtain mobiscore values for each of these points, so then we have to assign a weight to measure its unique value. A popular method to do so is called Inverse Distance Weighting (IDW).

This basically uses the inverse of the distance as a measure of how much a neighbours values will weigh on the final value. Closer neighbors will have more impact, while farther away ones will have less impact.

First we will make a (n,5) array to hold all the mobiscore values.

mobiscore_array = np.array([[df_mobis['mobiscore'].iloc[idx] \
                              for idx in row] for row in indices])

Then we create a (n,5) array with the inverse of our calculated distances.

# Calculate the inverse of the distances
inverse_distances = 1 / distances

On these we apply weighting. Keep in mind that we want the calculated values to be in the same range as the original mobiscore values, therefore we will need the sum of the weights to be one.

# Calculate the sum of the weights 
row_sum = np.sum(inverse_distances, axis=1).reshape(-1, 1)

# Normalize each row 
normalized_weights = inverse_distances/ row_sum

Now that we have our normalized weights, we can simply multiply them with their corresponding values to obtain the weighted mobiscore values.

# Multiply the mobiscores by the corresponding weights
weighted_mobiscores = normalized_weights * mobiscore_array

# Our mobiscore will be the sum of their weighted values
idw_mobiscore = np.sum(weighted_mobiscores, axis=1)

When we plot our new grid we now use the ‘idw_mobiscore’ as weights.

Interpolation using 5 nearest neighbors and IDW

We see that more detail is appearing in previously uniform areas.

2) Using KNeighborsRegressor

This KNN method can be implemented using Python libraries.

In this example we use KNeighborsRegressor class from sklearn library.

from sklearn.neighbors import KNeighborsRegressor

# Create a KNeighborsRegressor instance with the desired number of neighbors
knn_regressor = KNeighborsRegressor(n_neighbors=5)

# Fit the KNeighborsRegressor to the input points and weights
knn_regressor.fit(points, weights)

# Perform KNN interpolation
interpolated_weights = knn_regressor.predict(grid_coordinates)

To calculate the weight from the five neighbors, KNeighborsRegressor uses uniform weights by default. This means that all points in each neighborhood are weighted equally.

We can select the method from ‘uniform’ to ‘distance’, or we can create our custom weights. This method has the advantage that we can set a power to the inverse distances weights.

This is a handy way of factoring a distance effect. Larger power values will result in a more localized interpolation, meaning less smoothing across regions.

#Custom callable for IDW weights

def idw_weights(distances):
    # Set the power parameter for IDW (e.g., 2)
    power = 2
    # Calculate the IDW weights based on distances
    with np.errstate(divide='ignore', invalid='ignore'):
        weights = 1.0 / distances ** power
    return weights

Now we simply change the parameters in our regressor:

knn_regressor = KNeighborsRegressor(n_neighbors=5, weights=idw_weights)

And we get the exact same output we calculated with our own code:

Inverse distance weighting using KNN with 5 neighbors

3) Using Scipy

Interpolation can be implemented withgriddatafrom the Scipy library.

The two things we got to take notice of, is that the arguments must be provided as a Numpy array, and that there are three methods to do the interpolation: nearest, linear and cubic.

Nearest method

Let’s try and reproduce the result with the ‘nearest’ method.

With this code we convert all our coordinates and weights to numpy arrays and interpolate these points with the grid using the mentioned method.

from scipy.interpolate import griddata

# Extract x and y coordinates from the POINT geometries in our grid
x_coords = [pt.x for pt in grid_within_regions.geometry]
y_coords = [pt.y for pt in grid_within_regions.geometry]

# Combine the x and y coordinates into a single NumPy array
grid_coordinates = np.column_stack((x_coords, y_coords))

# Create a (n,2) array from our real estate coordinates
points = np.vstack((re_x, re_y)).T 

# Extract the weights from the mobiscore data in our real estate dataframe
weights=mobis_re_vl['mobiscore'].to_numpy()  
 
interpolated_weights = griddata(points, weights, grid_coordinates,\
                                 method='nearest')

# Create a new GeoDataFrame with the interpolated weights
grid_interpolated = gpd.GeoDataFrame(geometry= \
      [Point(x, y) for x, y in grid_coordinates], crs='epsg:4326')
grid_interpolated['interpolated_weights'] = interpolated_weights

The output looks exactly the same as the balltree method.

Interpolation using scipy griddata (nearest method)

Advanced interpolation methods

As mentioned, ‘griddata ’offers two more advanced interpolation methods.

Their calculations result in a more smoothened map as the weights are calculated using geometry instead of with inverse distance weighting.

Linear method

Linear interpolation works by forming a triangle between the nearest data points and linearly interpolates the values within.

The ‘linear ’method, however, does not extrapolate values that our outside the coordinates of the housing properties.

Therefore our map is now incomplete around the borders!

Interpolation using scipy griddata (linear method)

Cubic Method

The third method uses cubic polynomials to interpolate within the convex hull formed by the nearest data points.

Again we see that areas around the borders of our region are now missing.

When using this method we will need to perform extrapolation on an extra number of grid points to obtain a complete map of our region!

Limitations

With an KNN interpolation method using inverse distance weighting, we are making predictions without saying how certain we are.

More advanced methods such as kriging will also factor in the probability of these predictions. These are however outside the scope of this article.

Conclusion

From a set of coordinates with their weights, we can create an interpolated map so that we can extract the weighted data for any other coordinate.

We can set the desired grid precision in metric units, for example 1 km².

We did not need to extrapolate, because we used the ‘nearest neighbor’ method and applied ‘Inverse Distance Weighting’.

When using more advanced methods, extrapolation will be necessary.

Repository

The full code is available on Github.

References

Ball Tree : https://en.wikipedia.org/wiki/Ball_tree

Advanced Interpolation: https://pygis.io/docs/e_interpolation.html

Krigin interpolation: https://gisgeography.com/kriging-interpolation-prediction/

Different approaches: https://towardsdatascience.com/visualizing-spatial-data-with-geojson-heatmaps-1fbe2063ab86