Advanced Machine Learning Techniques for Predicting Maize Crop Yield using Sentinel-2 Satellite Imagery

Ayomide Oraegbu
Mar 6 · 15 min read

Introduction

Prediction of within-field crop variability can support farmers to make the right decisions in different situations. The current advances in remote sensing and the availability of high resolution, high frequency, and free Sentinel-2 imageries improve the implementation of Precision Agriculture for a broader range of farmers.

The aim of this story is to create a model capable of estimating the crop-cut maize yield for fields in East Africa. Given a time series of Sentinel 2 imagery and climate variables. The model would be able to assess corn (Zea mays) grain yield spatial variability in tons per acre.
The Data used for this story was gotten from Zindi.

Why Sentinel?

Sentinel-2 data creates new opportunities for regional, as well as global agricultural monitoring, by making it possible to view Earth in 12 spectral bands in 10–20 m spatial resolution, with global coverage and a 5-day revisit frequency and is compatible with the current and historical Landsat missions. Monitoring of soil properties and crop conditions, along with tillage activity mapping, helps researchers and farmers to assess land use, predict harvests, monitor seasonal changes, and assist in implementing policies for sustainable development. With a growing number of available satellite data sources, many of them free to use, the potential is immense (Sentinel-hub, 2021).

Important Sentinel-2 Bands for Maize Crop Yield Prediction

After each model was trained and used for prediction, the feature importance of all features used for prediction was calculated. The significant sentinel-2 bands which contributed more to the accuracy of the model were discovered to be majorly Red-edge bands, especially the Red-edge 3 (Band 7). Also, the Short wave Infrared(Band 10) also known as Cirrus and the Near infrared narrow (Band 8A) were of benefit to the prediction model.

Significant Vegetation Indices for Maize Crop Yield Prediction

I used various spectral Vegetation Indices (VI) which I discovered in the Sentinel hub repository of custom scripts. However, out of 40 different VI, I discovered 25 were only significant with regards to predicting maize crop yield.

Table 1 below presents candidate VI used in estimating crop yield. The significance of these vegetation indices was identified in recent studies focusing on satellite-based estimation of crop production and yield. However, my review indicated that vegetation indices with a strong relationship to crop yield often included red-edge wavelengths or were designed to be sensitive to canopy chlorophyll content. Noticing this, I decided to use a function in a script written by the Winners of the Zindi CGIAR Crop Yield Prediction Challenge. The function derives significant features from the three red-edge bands of Sentinel-2 imagery. (Note: The function is present within the code below.)

Table 1: Significant Vegetation Indices in Predicting Maize Crop Yield

Importance of Climate Variables

In the dataset provided, climate variables were gotten from Terraclimate. TerraClimate is a dataset of monthly climate and climatic water balance for global terrestrial surfaces from 1958–2019. The TerraClimate data have monthly temporal resolution and a ~4-km (1/24th degree) spatial resolution.

The TerraClimate Variables are Maximum temperature, minimum temperature, vapor pressure, precipitation accumulation, downward surface shortwave radiation, wind-speed, Reference evapotranspiration (ASCE Penman-Montieth), Runoff, Actual Evapotranspiration, Climate Water Deficit, Soil Moisture, Snow Water Equivalent, Palmer Drought Severity Index, Vapor pressure deficit.

The TerraClimate Variables used for prediction consist of the listed variables above, except the SWE(Snow Water Equivalent).

Note: Of all Features used for prediction, climate variables proved to be of more importance. How? …The first seven(7) important features contributing to the accuracy of the model were climate variables.

Table 2: First Seven (7) important Climate Variables

Important Soil-related Features

Soil data for the competition was gotten from the ISRIC World Soil Information. Each asset of a soil grid data is an image with 6 bands, a band for each depth (0–5cm, 5–15cm,15–30cm,30–60cm,60–100cm,100–200cm). However, the soil data for the competition had only bands of 5 to 15cm depth.

The first four important soil-related features contributing to the accuracy of the crop yield model were Cation exchange capacity at pH7 (cec_mean or cec), Silt(silt_mean), Total Nitrogen(nitrogen_mean), and the Organic carbon density (ocd_mean).

Modeling

import os
import random
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from lightgbm import LGBMRegressor
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
pd.set_option(‘display.max_columns’, None)
import math as Math
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
RANDOM_STATE = 42

Note: The function reduces memory usage by converting the data type of each column to the barest minimum — for example, float 64 to float 16. This is because a higher data type consumes high memory compared to lower data types such as float16.

def reduce_mem_usage(df, verbose=True):
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
start_mem = df.memory_usage().sum() / 1024**2
for col in df.columns:
col_type = df[col].dtypes
if col_type in numerics:
c_min = df[col].min()
c_max = df[col].max()
if str(col_type)[:3] == 'int':
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
df[col] = df[col].astype(np.int64)
else:
if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
df[col] = df[col].astype(np.float16)
elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
df[col] = df[col].astype(np.float32)
else:
df[col] = df[col].astype(np.float64)
end_mem = df.memory_usage().sum() / 1024**2
if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
return df
#Path to Data Folder
link = 'C:/Crop Yield Prediction/'
#Training dataframe
train_df = pd.read_csv(link + 'Train.csv')
#years for the test fields
test_field_ids_years_df = pd.read_csv(link + 'test_field_ids_with_year.csv')
#additional soil and climate information
add_info_df = pd.read_csv(link + 'fields_w_additional_info.csv')
#Sample Submission File
ss = pd.read_csv(link + 'SampleSubmission.csv')
train_folder =  'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/'fid = train_df['Field_ID'].sample().values[0]fn =  train_folder + fid + '.npy' 

print(f'Loading {fn} as an array')
#Loading the data with numpy
arr = np.load(fn)
#360 bands, images 40 or 41px a side
print('Array shape:', arr.shape)
#Combine three bands for viewing
rgb_jan = np.stack([arr[4], arr[3], arr[2]], axis=-1)
#Scale band values to (0, 1) for easy image display
rgb_jan = rgb_jan / np.max(rgb_jan)
#View with matplotlib
plt.imshow(rgb_jan)
fig, axs = plt.subplots(3, 4, figsize=(12, 8), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.001)
axs = axs.ravel()
for i in range(12):
#False colour (band 8, 4 and 3)
rgb = np.stack([arr[i*30 +8], arr[i*30 + 4], arr[i*30 + 3]], axis=-1)
#Scaling consistently
rgb = rgb / 4000
axs[i].imshow(rgb.clip(0, 1))
axs[i].set_title(str(i+1))
band_names = [l.strip() for l in open(link + 'bandnames.txt', 'r').readlines()]
print(band_names)

Feature Engineering Tips

1. Specify Bands of Interest and Average Center Points
2. Average Climate Values Over 4 Years By Each Month in Maize Season in kenya for some Climate Variables
3. Generate Statistics for vegetation indices covering a whole sentinel-2 imagery
4. Use of Significant Vegetation Indices
5. Generate Statistics from Vegetation Indices (Max, Min, Median)
6. Generate Median Features From Red Bands

Note: The Python Functions below are Feature-Engineering functions.

# Center Point 20
def process_im_20(fid, folder='C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/'):

fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B12', 'S2_B11', 'S2_B10', 'S2_B9', 'S2_B8', 'S2_B8A', 'S2_B7', 'S2_B6', 'S2_B5',
'S2_B4', 'S2_B3', 'S2_B2', 'S2_B1', 'CLIM_pr', 'CLIM_soil', 'CLIM_aet', 'CLIM_def',
'CLIM_pdsi', 'CLIM_pet', 'CLIM_ro', 'CLIM_srad', 'CLIM_swe', 'CLIM_tmmn', 'CLIM_tmmx',
'CLIM_vap', 'CLIM_vpd', 'CLIM_vs']
values = {}
for month in range(2, 9):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns)) # Index of these bands
vs = arr[idxs, 20, 20] # Sample the im at the center point
for bn, v in zip(bns, vs[0]):
values[bn] = v
return values
# Center point_30_10
def process_im_30_10(fid, folder='C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/'):

fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B12', 'S2_B11', 'S2_B10', 'S2_B9', 'S2_B8', 'S2_B8A', 'S2_B7', 'S2_B6', 'S2_B5',
'S2_B4', 'S2_B3', 'S2_B2', 'S2_B1', 'CLIM_pr', 'CLIM_soil', 'CLIM_aet', 'CLIM_def',
'CLIM_pdsi', 'CLIM_pet', 'CLIM_ro', 'CLIM_srad', 'CLIM_swe', 'CLIM_tmmn', 'CLIM_tmmx',
'CLIM_vap', 'CLIM_vpd', 'CLIM_vs']
values = {}
for month in range(2, 9):
bns = [str(month) + '_' + b for b in bands_of_interest] # Bands of interest for this month
idxs = np.where(np.isin(band_names, bns)) # Index of these bands
vs = arr[idxs, 30, 10] # Sample the im at the center point
for bn, v in zip(bns, vs[0]):
values[bn] = v
return values
# Center point_10_30
def process_im_10_30(fid, folder='C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/'):

fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B12', 'S2_B11', 'S2_B10', 'S2_B9', 'S2_B8', 'S2_B8A', 'S2_B7', 'S2_B6', 'S2_B5',
'S2_B4', 'S2_B3', 'S2_B2', 'S2_B1', 'CLIM_pr', 'CLIM_soil', 'CLIM_aet', 'CLIM_def',
'CLIM_pdsi', 'CLIM_pet', 'CLIM_ro', 'CLIM_srad', 'CLIM_swe', 'CLIM_tmmn', 'CLIM_tmmx',
'CLIM_vap', 'CLIM_vpd', 'CLIM_vs']
values = {}
for month in range(2, 9):
bns = [str(month) + '_' + b for b in bands_of_interest] # Bands of interest for this month
idxs = np.where(np.isin(band_names, bns)) # Index of these bands
vs = arr[idxs, 10, 30] # Sample the im at the center point
for bn, v in zip(bns, vs[0]):
values[bn] = v
return values
def average_variables(add_info_df, special_cols) : 

additional_ = pd.DataFrame()
climat = [‘pr’,’tmmn’,’tmmx’]
for month in range(2, 9):
for climat_ in climat :
features_ = []
for year in [2016,2017,2018,2019] :
features_.append(‘climate_’+str(year) + ‘_’ + str(month) + f’_{climat_}’ )
additional_[f’Average_for_4_Years_Variables{month-1}_{climat_}’] = add_info_df[features_].mean(axis=1)

# Soil Columns
for col in special_cols :
additional_[col] = add_info_df[col]
return additional_
# SAVI
def get_SAVI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B4', 'S2_B8']
L = 0.725
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))
B08 = arr[idxs][1].astype(int)
B04 = arr[idxs][0].astype(int)

SAVI_arr = (B08 - B04) / (B08 + B04 + L) * (1.0 + L)
values['SAVI_Max_'+str(month)] = SAVI_arr.max()

return values
# EVI
def get_EVI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B4', 'S2_B8','S2_B2']
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))
B08 = arr[idxs][1].astype(int)
B04 = arr[idxs][0].astype(int)
B02 = arr[idxs][2].astype(int)

EVI_arr = 2.5 * (B08 - B04) / ((B08 + 6.0 * B04 - 7.5 * B02) + 1.0)
values['EVI_Median_'+str(month)] = np.median(EVI_arr)

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))
B08 = arr[idxs][1].astype(int)
B04 = arr[idxs][0].astype(int)
B02 = arr[idxs][2].astype(int)

EVI_arr = 2.5 * (B08 - B04) / ((B08 + 6.0 * B04 - 7.5 * B02) + 1.0)
values['EVI_Max_'+str(month)] = EVI_arr.max()

return values
# NDVI
def get_NDVI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B4', 'S2_B8']
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDVI_Median_'+str(month)] = np.median(NDVI_arr)

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDVI_Min_'+str(month)] = NDVI_arr.min()
values['NDVI_Max_'+str(month)] = NDVI_arr.max()

return values
# NDWI
def get_NDWI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B8', 'S2_B3']
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDWI_Median_'+str(month)] = np.median(NDVI_arr)

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDWI_Min_'+str(month)] = NDVI_arr.min()
values['NDWI_Max_'+str(month)] = NDVI_arr.max()

return values
# NDMI
def get_NDMI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B11', 'S2_B8']
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDMI_Median_'+str(month)] = np.median(NDVI_arr)

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDMI_Min_'+str(month)] = NDVI_arr.min()
values['NDMI_Max_'+str(month)] = NDVI_arr.max()

return values
def vegetation_indices(train_sampled):

# NDVI = NIR-Red/NIR+Red = Band8-Band4/Band8+Band4
for i in range(2, 10):
train_sampled['NDVI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8'] - train_sampled[str(i) + '_S2_B4'])/(train_sampled[str(i) + '_S2_B8'] + train_sampled[str(i) + '_S2_B4'])
# SAVI = ((NIR - RED)/ (NIR + RED + L))*(1 + L)
for i in range(2, 10):
train_sampled['SAVI_Month_' + str(i)] = ((train_sampled[str(i) + '_S2_B8']
- train_sampled[str(i) + '_S2_B4']) / (train_sampled[str(i) + '_S2_B8']
+ train_sampled[str(i) + '_S2_B4'] + 1)) * (1 + 1)
# ARVI = (NIR – (2 * Red) + Blue) / (NIR + (2 * Red) + Blue)
for i in range(2, 10):
train_sampled['ARVI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8']
-(2*train_sampled[str(i) + '_S2_B4'])
+ train_sampled[str(i) + '_S2_B2']) / (train_sampled[str(i) + '_S2_B8']
+(2*train_sampled[str(i) + '_S2_B4'])
+ train_sampled[str(i) + '_S2_B2'])
# GCI = (NIR) / (Green) – 1
for i in range(2, 10):
train_sampled['GCI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8'])/(train_sampled[str(i) + '_S2_B3'])-1
# NDWI = (NIR-SWIR)/(NIR+SWIR)
for i in range(2, 10):
train_sampled['NDWI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B3']
- train_sampled[str(i) + '_S2_B8'])/(train_sampled[str(i) + '_S2_B3']
+ train_sampled[str(i) + '_S2_B8'])
# RVI = (NIR/RED)
for i in range(2, 10):
train_sampled['RVI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8']) / (train_sampled[str(i) + '_S2_B4'])
# OSAVI ((1.0 + 0.16) * (B08 - B04) / (B08 + B04 + 0.16))
for i in range(2, 10):
train_sampled['OSAVI_Month_' + str(i)] = ((1 + 0.16) * (train_sampled[str(i) + '_S2_B8'] - train_sampled[str(i) + '_S2_B4'])/
(train_sampled[str(i) + '_S2_B8'] + train_sampled[str(i) + '_S2_B4'] + 0.16))
# SLAVI = B08 / (B04 + B12)
for i in range(2, 10):
train_sampled['SLAVI_Month_' + str(i)] = ((train_sampled[str(i) + '_S2_B8'])/ (train_sampled[str(i) + '_S2_B4'] + train_sampled[str(i) + '_S2_B12']))
# NPCRI (b4-b2)/(b4 + b2)
for i in range(2, 10):
train_sampled['NPCRI_Month_' + str(i)] = ((train_sampled[str(i) + '_S2_B4'] - train_sampled[str(i) + '_S2_B2'])/(train_sampled[str(i) + '_S2_B4'] + train_sampled[str(i) + '_S2_B2']))
# Soil Composition Index = (B11 - B08) / (B11 + B08)
for i in range(2, 10):
train_sampled['SCI_Month_' + str(i)] = ((train_sampled[str(i) + '_S2_B11'] - train_sampled[str(i) + '_S2_B8']) / (train_sampled[str(i) + '_S2_B11'] + train_sampled[str(i) + '_S2_B8']))
# SARVI (1.0 + L) * (B08 - (Rr - y * (RB - Rr))) / (B08 + -(Rr - y * (RB - Rr)) + L) # y = 0.735; Rr = 0.740; L = 0.487; RB = 0.560;
for i in range(2, 10):
train_sampled['SARVI_Month_' + str(i)] = (1.0 + 0.487) * (train_sampled[str(i) + '_S2_B8']- (0.740 - 0.735 * (0.560 - 0.740)))/ (train_sampled[str(i) + '_S2_B8'] + - (0.740 - 0.735 * (0.560 - 0.740)) + 0.487)
# ndmi = (b8 - b11) / (b8 + b11)
for i in range(2, 10):
train_sampled['NDMI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8'] - train_sampled[str(i) + '_S2_B11'])/ (train_sampled[str(i) + '_S2_B8'] + train_sampled[str(i) + '_S2_B11'])

# EVI_2 = 2.5 (NIR - RED)/(NIR + C2 * Blue + l)
for i in range(2, 9):
train_sampled['EVI_2_Month_' + str(i)] = 2.5* ((train_sampled[str(i) + '_S2_B8']
- train_sampled[str(i) + '_S2_B4'])/((train_sampled[str(i) + '_S2_B8']) + (2.4 * train_sampled[str(i) + '_S2_B4'])+1))

return train_sampled
def vegetation_index_statistics(train_sampled, spectral_indices = None):
for i in spectral_indices:
train_sampled[f'{i}_min'] = train_sampled.filter(regex = f'^{i}').min(axis = 1)
train_sampled[f'{i}_max'] = train_sampled.filter(regex = f'^{i}').max(axis = 1)
train_sampled[f'{i}_avg'] = train_sampled.filter(regex = f'^{i}').mean(axis = 1)
train_sampled[f'{i}_median'] = train_sampled.filter(regex = f'^{i}').median(axis = 1)
return train_sampled
def get_features_from_Bands(fid, folder='C:/Crop Yield Prediction/image_arrays_train/image_arrays_train', bands=None, Band_Names=None) :

fn = f'{folder}/{fid}.npy'
arr = np.load(fn)

final_values = {}
for bands_idx in range(len(bands)) :
bands_of_interest = bands[bands_idx]
Name = Band_Names[bands_idx]
values = {}
for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))
Relation_arr = arr[idxs][1].astype(int) / arr[idxs][0].astype(int)

values[f'BANDS_FEATURES_Median_{Name}_'+str(month)] = np.median(Relation_arr)

final_values.update(values)
return final_values

Processing & Preparation — Data & Feature

1 Apply the Specify Bands of Interest and Average Center Points functions
2 Apply the Reduce Memory Usage function
3 Apply the Average Climate Values Over 4 Years function
4 Apply the Statistics for indices covering a whole sentinel-2 imagery function
5 Apply the Vegetation indices function
6 Apply the Statistic from Vegetation Indices function
7 Apply the Median Features from Red Band function
# Specify Bands of Interest and Center Point
train_sampled_10 = pd.DataFrame([process_im_10_30(fid) for fid in train_df['Field_ID'].values])
train_sampled_20 = pd.DataFrame([process_im_20(fid) for fid in train_df['Field_ID'].values])
train_sampled_30 = pd.DataFrame([process_im_30_10(fid) for fid in train_df['Field_ID'].values])
# divide the three dataframes by each other
train_sampled = 0
train_sampled = (train_sampled_10 + train_sampled_20 + train_sampled_30)/ 3
train_sampled = reduce_mem_usage(train_sampled)
# Average some specific Climate Variables for Additional Data
special_cols = add_info_df.columns[:12] # define columns needed for averaging
averaging_df = average_variables(add_info_df, special_cols)
train_sampled['Yield'] = train_df['Yield'].values # Add in the field ID and yield
train_sampled['Field_ID'] = train_df['Field_ID'].values
train_sampled = pd.merge(train_sampled,averaging_df,on='Field_ID',how='left') # merge together with train_sampled using "field_id"
train_EVI = pd.DataFrame([get_EVI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_EVI['Field_ID'] = train_df['Field_ID'].values
train_SAVI = pd.DataFrame([get_SAVI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_SAVI['Field_ID'] = train_df['Field_ID'].values
train_NDVI = pd.DataFrame([get_NDVI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_NDVI['Field_ID'] = train_df['Field_ID'].values
train_NDWI = pd.DataFrame([get_NDWI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_NDWI['Field_ID'] = train_df['Field_ID'].values
train_NDMI = pd.DataFrame([get_NDMI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_NDMI['Field_ID'] = train_df['Field_ID'].values
train_sampled = pd.merge(train_sampled,train_EVI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = pd.merge(train_sampled,train_SAVI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = pd.merge(train_sampled,train_NDVI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = pd.merge(train_sampled,train_NDWI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = pd.merge(train_sampled,train_NDMI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = vegetation_indices(train_sampled)
indices = ['NDVI', 'SARVI', 'SAVI', 'GCI', 'SCI', 'ARVI', 'SLAVI', 'RVI','OSAVI', 'NPCRI', 'NDMI', 'NDWI', 'EVI_2']train_sampled = vegetation_index_statistics(train_sampled, spectral_indices= indices)
bands = [ [‘S2_B7’,’S2_B5'] ,[‘S2_B7’,’S2_B6']] 
names = [‘FEATURES_B7_B5’,’FEATURES_B7_B6']
train_f = pd.DataFrame([get_features_from_Bands(train_df[‘Field_ID’].values[fid_idx],bands=bands,Band_Names=names)
for fid_idx in tqdm(range(len(train_df[‘Field_ID’].values))) ])
train_f[‘Field_ID’] = train_df[‘Field_ID’].values # add the field_id to the tain_f dataframe
train_sampled = pd.merge(train_sampled,train_f,on=’Field_ID’,how=’left’) # merge train_f together with train_sampled using “field_id”

Note: Due to mathematical operations carried out on the data, there might be an occurrence of missing values. To discover if missing values are present within the data — missing values were checked for.

Discover Missing Values — if missing values exists fill them with the most occurrent value in each column with missing value.

def missing_zero_values_table(df):
zero_val = (df == 0.00).astype(int).sum(axis=0)
mis_val = df.isnull().sum()
mis_val_percent = 100 * df.isnull().sum() / len(df)
mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
mz_table = mz_table.rename(
columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
mz_table['Data Type'] = df.dtypes
mz_table = mz_table[
mz_table.iloc[:,1] != 0].sort_values(
'% of Total Values', ascending=False).round(1)
print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
"There are " + str(mz_table.shape[0]) +
" columns that have missing values.")
# mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
return mz_table
missing_zero_values_table(train_sampled)# Fill with the most occurent value in each column
train_sampled = train_sampled.fillna(train_sampled.mode().iloc[0])

Training and Testing — with Cross-Validation

Why XGBoost?

  1. Execution Speed.
  2. Model Performance

The XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

  • When discussing execution speed, the XGBoost is fast. Really fast when compared to other implementations of gradient boosting.
  • XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.
from sklearn.model_selection import KFold
import xgboost as xgb
from sklearn.metrics import mean_squared_error
X, y = train_sampled.drop(['Field_ID'],axis=1), train_df['Yield']kf = KFold(n_splits =5,shuffle=True,random_state=160)
feats = pd.DataFrame({'features': X.columns})
gbm_predictions = []
cv_score_ = 0
oof_preds = np.zeros((train_df.shape[0],))
for i,(tr_index,test_index) in enumerate(kf.split(X,y)):
print()
print(f'######### FOLD {i+1} / {kf.n_splits} ')
X_train,y_train = X.iloc[tr_index,:],y[tr_index]
X_test,y_test = X.iloc[test_index,:],y[test_index]

gbm = xgb.XGBRegressor(eval_metric = 'rmse',n_estimators = 2000,learning_rate = 0.001,seed=162,random_state = 162,colsample_bytree=0.65)
gbm.fit(X_train,y_train,eval_set = [(X_test, y_test)],early_stopping_rounds = 200,verbose=100)cv_score_ += mean_squared_error(y_test, gbm.predict(X_test), squared=False) / kf.n_splits
oof_preds[test_index] = gbm.predict(X_test)
preds = gbm.predict(test_sampled[X_train.columns])
gbm_predictions.append(preds)
feats[f'Fold {i}'] = gbm.feature_importances_feats['Importances'] = feats.mean(axis=1)
print( ' CV RMSE : ',cv_score_)
preds_xgb = np.average(gbm_predictions, axis=0)
print(preds_xgb.shape)

Note: Other gradient boosted decision trees such as LightGBM can be implemented. However, to further improve the accuracy of the crop yield prediction model, more tricks are needed to be done to generate more significant features, and model hyperparameter tuning would also add to the predictive capability of the model.

A big thank you to Zindi and also to the Winners of the Zindi CGIAR Crop Yield Prediction Challenge.

Do well to follow me for more updated stories on the use of Deep learning in GIS.

REFERENCES

4th Place solution for the Zindi CGIAR Crop Yield Prediction Challenge.

CodeX

Everything connected with Tech & Code