Advanced Machine Learning Techniques for Predicting Maize Crop Yield using Sentinel-2 Satellite Imagery

Ayomide Oraegbu
Mar 6 · 15 min read

Introduction

Prediction of within-field crop variability can support farmers to make the right decisions in different situations. The current advances in and the improve the implementation of Precision Agriculture for a broader range of farmers.

The aim of this story is to create a model capable of estimating the crop-cut maize yield for fields in . Given a time series and . The model would be able to assess The Data used for this story was gotten from

This story investigates the use of vegetation indices (VIs) and advanced deep learning techniques in predicting maize crop yield.

Why Sentinel?

Sentinel-2 data creates new opportunities for regional, as well as global agricultural monitoring, by making it possible to view Earth in 12 spectral bands in 10–20 m spatial resolution, with global coverage and a 5-day revisit frequency and is compatible with the current and historical Landsat missions. Monitoring of soil properties and crop conditions, along with tillage activity mapping, helps researchers and farmers to assess land use, predict harvests, monitor seasonal changes, and assist in implementing policies for sustainable development. With a growing number of available satellite data sources, many of them free to use, the potential is immense

Important Sentinel-2 Bands for Maize Crop Yield Prediction

After each model was trained and used for prediction, the feature importance of all features used for prediction was calculated. The significant sentinel-2 bands which contributed more to the accuracy of the model were discovered to be majorly Red-edge bands, especially the . Also, the also known as and the were of benefit to the prediction model.

Significant Vegetation Indices for Maize Crop Yield Prediction

I used various spectral Vegetation Indices (VI) which I discovered in the . However, out of 40 different VI, I discovered 25 were only significant with regards to predicting maize crop yield.

below presents candidate VI used in estimating crop yield. The significance of these vegetation indices was identified in recent studies focusing on satellite-based estimation of crop production and yield. However, my review indicated that vegetation indices with a strong relationship to crop yield often included red-edge wavelengths or were designed to be sensitive to canopy chlorophyll content. Noticing this, I decided to use a written by theWinners of the Zindi CGIAR Crop Yield Prediction Challenge The function derives significant features from the three red-edge bands of Sentinel-2 imagery.

Table 1: Significant Vegetation Indices in Predicting Maize Crop Yield

Importance of Climate Variables

In the dataset provided, climate variables were gotten from . is a dataset of monthly climate and climatic water balance for global terrestrial surfaces from 1958–2019. The data have monthly temporal resolution and a ~4-km (1/24th degree) spatial resolution.

The TerraClimate Variables are

The TerraClimate Variables used for prediction consist of the listed variables above, except the

Of all Features used for prediction, climate variables proved to be of more importance.

This was achieved using the Feature Importance built-in the5 Xgboost algorithm.

Table 2: First Seven (7) important Climate Variables

Important Soil-related Features

Soil data for the competition was gotten from the . Each asset of a soil grid data is an image with 6 bands, a band for each depth (0–5cm, 5–15cm,15–30cm,30–60cm,60–100cm,100–200cm). However, the soil data for the competition had only bands of 5 to 15cm depth.

The first four important soil-related features contributing to the accuracy of the crop yield model were

Modeling

Import Libraries

import os
import random
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
from lightgbm import LGBMRegressor
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
pd.set_option(‘display.max_columns’, None)
import math as Math
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
RANDOM_STATE = 42

Function to Reduce memory usage

Note: The function reduces memory usage by converting the data type of each column to the barest minimum — for example, float 64 to float 16. This is because a higher data type consumes high memory compared to lower data types such as float16.

def reduce_mem_usage(df, verbose=True):
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
start_mem = df.memory_usage().sum() / 1024**2
for col in df.columns:
col_type = df[col].dtypes
if col_type in numerics:
c_min = df[col].min()
c_max = df[col].max()
if str(col_type)[:3] == 'int':
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
df[col] = df[col].astype(np.int64)
else:
if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
df[col] = df[col].astype(np.float16)
elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
df[col] = df[col].astype(np.float32)
else:
df[col] = df[col].astype(np.float64)
end_mem = df.memory_usage().sum() / 1024**2
if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
return df

Load Datasets

#Path to Data Folder
link = 'C:/Crop Yield Prediction/'
#Training dataframe
train_df = pd.read_csv(link + 'Train.csv')
#years for the test fields
test_field_ids_years_df = pd.read_csv(link + 'test_field_ids_with_year.csv')
#additional soil and climate information
add_info_df = pd.read_csv(link + 'fields_w_additional_info.csv')
#Sample Submission File
ss = pd.read_csv(link + 'SampleSubmission.csv')

View A Satellite Imagery

train_folder =  'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/'fid = train_df['Field_ID'].sample().values[0]fn =  train_folder + fid + '.npy' 

print(f'Loading {fn} as an array')
#Loading the data with numpy
arr = np.load(fn)
#360 bands, images 40 or 41px a side
print('Array shape:', arr.shape)
#Combine three bands for viewing
rgb_jan = np.stack([arr[4], arr[3], arr[2]], axis=-1)
#Scale band values to (0, 1) for easy image display
rgb_jan = rgb_jan / np.max(rgb_jan)
#View with matplotlib
plt.imshow(rgb_jan)

View false-color images for 12 months

fig, axs = plt.subplots(3, 4, figsize=(12, 8), facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace = .5, wspace=.001)
axs = axs.ravel()
for i in range(12):
#False colour (band 8, 4 and 3)
rgb = np.stack([arr[i*30 +8], arr[i*30 + 4], arr[i*30 + 3]], axis=-1)
#Scaling consistently
rgb = rgb / 4000
axs[i].imshow(rgb.clip(0, 1))
axs[i].set_title(str(i+1))

View the image band names

band_names = [l.strip() for l in open(link + 'bandnames.txt', 'r').readlines()]
print(band_names)

Feature Engineering Tips

: The Python Functions below are Feature-Engineering functions.

Specify Bands of Interest and Average Center Points

# Center Point 20
def process_im_20(fid, folder='C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/'):

fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B12', 'S2_B11', 'S2_B10', 'S2_B9', 'S2_B8', 'S2_B8A', 'S2_B7', 'S2_B6', 'S2_B5',
'S2_B4', 'S2_B3', 'S2_B2', 'S2_B1', 'CLIM_pr', 'CLIM_soil', 'CLIM_aet', 'CLIM_def',
'CLIM_pdsi', 'CLIM_pet', 'CLIM_ro', 'CLIM_srad', 'CLIM_swe', 'CLIM_tmmn', 'CLIM_tmmx',
'CLIM_vap', 'CLIM_vpd', 'CLIM_vs']
values = {}
for month in range(2, 9):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns)) # Index of these bands
vs = arr[idxs, 20, 20] # Sample the im at the center point
for bn, v in zip(bns, vs[0]):
values[bn] = v
return values
# Center point_30_10
def process_im_30_10(fid, folder='C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/'):

fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B12', 'S2_B11', 'S2_B10', 'S2_B9', 'S2_B8', 'S2_B8A', 'S2_B7', 'S2_B6', 'S2_B5',
'S2_B4', 'S2_B3', 'S2_B2', 'S2_B1', 'CLIM_pr', 'CLIM_soil', 'CLIM_aet', 'CLIM_def',
'CLIM_pdsi', 'CLIM_pet', 'CLIM_ro', 'CLIM_srad', 'CLIM_swe', 'CLIM_tmmn', 'CLIM_tmmx',
'CLIM_vap', 'CLIM_vpd', 'CLIM_vs']
values = {}
for month in range(2, 9):
bns = [str(month) + '_' + b for b in bands_of_interest] # Bands of interest for this month
idxs = np.where(np.isin(band_names, bns)) # Index of these bands
vs = arr[idxs, 30, 10] # Sample the im at the center point
for bn, v in zip(bns, vs[0]):
values[bn] = v
return values
# Center point_10_30
def process_im_10_30(fid, folder='C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/'):

fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B12', 'S2_B11', 'S2_B10', 'S2_B9', 'S2_B8', 'S2_B8A', 'S2_B7', 'S2_B6', 'S2_B5',
'S2_B4', 'S2_B3', 'S2_B2', 'S2_B1', 'CLIM_pr', 'CLIM_soil', 'CLIM_aet', 'CLIM_def',
'CLIM_pdsi', 'CLIM_pet', 'CLIM_ro', 'CLIM_srad', 'CLIM_swe', 'CLIM_tmmn', 'CLIM_tmmx',
'CLIM_vap', 'CLIM_vpd', 'CLIM_vs']
values = {}
for month in range(2, 9):
bns = [str(month) + '_' + b for b in bands_of_interest] # Bands of interest for this month
idxs = np.where(np.isin(band_names, bns)) # Index of these bands
vs = arr[idxs, 10, 30] # Sample the im at the center point
for bn, v in zip(bns, vs[0]):
values[bn] = v
return values

Average Climate Varaibles Over 4 Years for each Maize Season Month in Kenya

def average_variables(add_info_df, special_cols) : 

additional_ = pd.DataFrame()
climat = [‘pr’,’tmmn’,’tmmx’]
for month in range(2, 9):
for climat_ in climat :
features_ = []
for year in [2016,2017,2018,2019] :
features_.append(‘climate_’+str(year) + ‘_’ + str(month) + f’_{climat_}’ )
additional_[f’Average_for_4_Years_Variables{month-1}_{climat_}’] = add_info_df[features_].mean(axis=1)

# Soil Columns
for col in special_cols :
additional_[col] = add_info_df[col]
return additional_

Statistics for vegetation indices covering a whole sentinel-2 imagery

# SAVI
def get_SAVI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B4', 'S2_B8']
L = 0.725
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))
B08 = arr[idxs][1].astype(int)
B04 = arr[idxs][0].astype(int)

SAVI_arr = (B08 - B04) / (B08 + B04 + L) * (1.0 + L)
values['SAVI_Max_'+str(month)] = SAVI_arr.max()

return values
# EVI
def get_EVI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B4', 'S2_B8','S2_B2']
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))
B08 = arr[idxs][1].astype(int)
B04 = arr[idxs][0].astype(int)
B02 = arr[idxs][2].astype(int)

EVI_arr = 2.5 * (B08 - B04) / ((B08 + 6.0 * B04 - 7.5 * B02) + 1.0)
values['EVI_Median_'+str(month)] = np.median(EVI_arr)

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))
B08 = arr[idxs][1].astype(int)
B04 = arr[idxs][0].astype(int)
B02 = arr[idxs][2].astype(int)

EVI_arr = 2.5 * (B08 - B04) / ((B08 + 6.0 * B04 - 7.5 * B02) + 1.0)
values['EVI_Max_'+str(month)] = EVI_arr.max()

return values
# NDVI
def get_NDVI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B4', 'S2_B8']
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDVI_Median_'+str(month)] = np.median(NDVI_arr)

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDVI_Min_'+str(month)] = NDVI_arr.min()
values['NDVI_Max_'+str(month)] = NDVI_arr.max()

return values
# NDWI
def get_NDWI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B8', 'S2_B3']
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDWI_Median_'+str(month)] = np.median(NDVI_arr)

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDWI_Min_'+str(month)] = NDVI_arr.min()
values['NDWI_Max_'+str(month)] = NDVI_arr.max()

return values
# NDMI
def get_NDMI_data(fid, folder=None) :
fn = f'{folder}/{fid}.npy'
arr = np.load(fn)
bands_of_interest = ['S2_B11', 'S2_B8']
values = {}

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDMI_Median_'+str(month)] = np.median(NDVI_arr)

for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))

NDVI_arr = (arr[idxs][1].astype(int)-arr[idxs][0].astype(int)) / (arr[idxs][1].astype(int)+arr[idxs][0].astype(int))
values['NDMI_Min_'+str(month)] = NDVI_arr.min()
values['NDMI_Max_'+str(month)] = NDVI_arr.max()

return values

Vegetation Indices

def vegetation_indices(train_sampled):

# NDVI = NIR-Red/NIR+Red = Band8-Band4/Band8+Band4
for i in range(2, 10):
train_sampled['NDVI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8'] - train_sampled[str(i) + '_S2_B4'])/(train_sampled[str(i) + '_S2_B8'] + train_sampled[str(i) + '_S2_B4'])
# SAVI = ((NIR - RED)/ (NIR + RED + L))*(1 + L)
for i in range(2, 10):
train_sampled['SAVI_Month_' + str(i)] = ((train_sampled[str(i) + '_S2_B8']
- train_sampled[str(i) + '_S2_B4']) / (train_sampled[str(i) + '_S2_B8']
+ train_sampled[str(i) + '_S2_B4'] + 1)) * (1 + 1)
# ARVI = (NIR – (2 * Red) + Blue) / (NIR + (2 * Red) + Blue)
for i in range(2, 10):
train_sampled['ARVI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8']
-(2*train_sampled[str(i) + '_S2_B4'])
+ train_sampled[str(i) + '_S2_B2']) / (train_sampled[str(i) + '_S2_B8']
+(2*train_sampled[str(i) + '_S2_B4'])
+ train_sampled[str(i) + '_S2_B2'])
# GCI = (NIR) / (Green) – 1
for i in range(2, 10):
train_sampled['GCI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8'])/(train_sampled[str(i) + '_S2_B3'])-1
# NDWI = (NIR-SWIR)/(NIR+SWIR)
for i in range(2, 10):
train_sampled['NDWI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B3']
- train_sampled[str(i) + '_S2_B8'])/(train_sampled[str(i) + '_S2_B3']
+ train_sampled[str(i) + '_S2_B8'])
# RVI = (NIR/RED)
for i in range(2, 10):
train_sampled['RVI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8']) / (train_sampled[str(i) + '_S2_B4'])
# OSAVI ((1.0 + 0.16) * (B08 - B04) / (B08 + B04 + 0.16))
for i in range(2, 10):
train_sampled['OSAVI_Month_' + str(i)] = ((1 + 0.16) * (train_sampled[str(i) + '_S2_B8'] - train_sampled[str(i) + '_S2_B4'])/
(train_sampled[str(i) + '_S2_B8'] + train_sampled[str(i) + '_S2_B4'] + 0.16))
# SLAVI = B08 / (B04 + B12)
for i in range(2, 10):
train_sampled['SLAVI_Month_' + str(i)] = ((train_sampled[str(i) + '_S2_B8'])/ (train_sampled[str(i) + '_S2_B4'] + train_sampled[str(i) + '_S2_B12']))
# NPCRI (b4-b2)/(b4 + b2)
for i in range(2, 10):
train_sampled['NPCRI_Month_' + str(i)] = ((train_sampled[str(i) + '_S2_B4'] - train_sampled[str(i) + '_S2_B2'])/(train_sampled[str(i) + '_S2_B4'] + train_sampled[str(i) + '_S2_B2']))
# Soil Composition Index = (B11 - B08) / (B11 + B08)
for i in range(2, 10):
train_sampled['SCI_Month_' + str(i)] = ((train_sampled[str(i) + '_S2_B11'] - train_sampled[str(i) + '_S2_B8']) / (train_sampled[str(i) + '_S2_B11'] + train_sampled[str(i) + '_S2_B8']))
# SARVI (1.0 + L) * (B08 - (Rr - y * (RB - Rr))) / (B08 + -(Rr - y * (RB - Rr)) + L) # y = 0.735; Rr = 0.740; L = 0.487; RB = 0.560;
for i in range(2, 10):
train_sampled['SARVI_Month_' + str(i)] = (1.0 + 0.487) * (train_sampled[str(i) + '_S2_B8']- (0.740 - 0.735 * (0.560 - 0.740)))/ (train_sampled[str(i) + '_S2_B8'] + - (0.740 - 0.735 * (0.560 - 0.740)) + 0.487)
# ndmi = (b8 - b11) / (b8 + b11)
for i in range(2, 10):
train_sampled['NDMI_Month_' + str(i)] = (train_sampled[str(i) + '_S2_B8'] - train_sampled[str(i) + '_S2_B11'])/ (train_sampled[str(i) + '_S2_B8'] + train_sampled[str(i) + '_S2_B11'])

# EVI_2 = 2.5 (NIR - RED)/(NIR + C2 * Blue + l)
for i in range(2, 9):
train_sampled['EVI_2_Month_' + str(i)] = 2.5* ((train_sampled[str(i) + '_S2_B8']
- train_sampled[str(i) + '_S2_B4'])/((train_sampled[str(i) + '_S2_B8']) + (2.4 * train_sampled[str(i) + '_S2_B4'])+1))

return train_sampled

Statistics from Vegetation Indices

def vegetation_index_statistics(train_sampled, spectral_indices = None):
for i in spectral_indices:
train_sampled[f'{i}_min'] = train_sampled.filter(regex = f'^{i}').min(axis = 1)
train_sampled[f'{i}_max'] = train_sampled.filter(regex = f'^{i}').max(axis = 1)
train_sampled[f'{i}_avg'] = train_sampled.filter(regex = f'^{i}').mean(axis = 1)
train_sampled[f'{i}_median'] = train_sampled.filter(regex = f'^{i}').median(axis = 1)
return train_sampled

Median Features from Red Bands

def get_features_from_Bands(fid, folder='C:/Crop Yield Prediction/image_arrays_train/image_arrays_train', bands=None, Band_Names=None) :

fn = f'{folder}/{fid}.npy'
arr = np.load(fn)

final_values = {}
for bands_idx in range(len(bands)) :
bands_of_interest = bands[bands_idx]
Name = Band_Names[bands_idx]
values = {}
for month in range(12):
bns = [str(month) + '_' + b for b in bands_of_interest]
idxs = np.where(np.isin(band_names, bns))
Relation_arr = arr[idxs][1].astype(int) / arr[idxs][0].astype(int)

values[f'BANDS_FEATURES_Median_{Name}_'+str(month)] = np.median(Relation_arr)

final_values.update(values)
return final_values

Processing & Preparation — Data & Feature

1 Apply the functions
2 Apply the function
3 Apply the function
4 Apply thefunction
5 Apply the function
6 Apply the function
7 Apply the function

Apply Specify Bands of Interest and Average Center Points functions

# Specify Bands of Interest and Center Point
train_sampled_10 = pd.DataFrame([process_im_10_30(fid) for fid in train_df['Field_ID'].values])
train_sampled_20 = pd.DataFrame([process_im_20(fid) for fid in train_df['Field_ID'].values])
train_sampled_30 = pd.DataFrame([process_im_30_10(fid) for fid in train_df['Field_ID'].values])
# divide the three dataframes by each other
train_sampled = 0
train_sampled = (train_sampled_10 + train_sampled_20 + train_sampled_30)/ 3

Apply the Reduce Memory Usage function

train_sampled = reduce_mem_usage(train_sampled)

Apply the Average Climate Values Over 4 Years function

# Average some specific Climate Variables for Additional Data
special_cols = add_info_df.columns[:12] # define columns needed for averaging
averaging_df = average_variables(add_info_df, special_cols)
train_sampled['Yield'] = train_df['Yield'].values # Add in the field ID and yield
train_sampled['Field_ID'] = train_df['Field_ID'].values
train_sampled = pd.merge(train_sampled,averaging_df,on='Field_ID',how='left') # merge together with train_sampled using "field_id"

Apply the Statistics for indices covering a whole sentinel-2 imagery function

train_EVI = pd.DataFrame([get_EVI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_EVI['Field_ID'] = train_df['Field_ID'].values
train_SAVI = pd.DataFrame([get_SAVI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_SAVI['Field_ID'] = train_df['Field_ID'].values
train_NDVI = pd.DataFrame([get_NDVI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_NDVI['Field_ID'] = train_df['Field_ID'].values
train_NDWI = pd.DataFrame([get_NDWI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_NDWI['Field_ID'] = train_df['Field_ID'].values
train_NDMI = pd.DataFrame([get_NDMI_data(train_df['Field_ID'].values[fid_idx], folder = 'C:/Crop Yield Prediction/image_arrays_train/image_arrays_train/') for fid_idx in tqdm(range(len(train_df['Field_ID'].values))) ])
train_NDMI['Field_ID'] = train_df['Field_ID'].values
train_sampled = pd.merge(train_sampled,train_EVI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = pd.merge(train_sampled,train_SAVI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = pd.merge(train_sampled,train_NDVI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = pd.merge(train_sampled,train_NDWI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"
train_sampled = pd.merge(train_sampled,train_NDMI,on='Field_ID',how='left') # merge train_f together with train_sampled using "field_id"

Apply the Vegetation Indices function

train_sampled = vegetation_indices(train_sampled)

Apply the Statistics from Vegetation Indices function

indices = ['NDVI', 'SARVI', 'SAVI', 'GCI', 'SCI', 'ARVI', 'SLAVI', 'RVI','OSAVI', 'NPCRI', 'NDMI', 'NDWI', 'EVI_2']train_sampled = vegetation_index_statistics(train_sampled, spectral_indices= indices)

Apply the Median Features from Red Bands function

bands = [ [‘S2_B7’,’S2_B5'] ,[‘S2_B7’,’S2_B6']] 
names = [‘FEATURES_B7_B5’,’FEATURES_B7_B6']
train_f = pd.DataFrame([get_features_from_Bands(train_df[‘Field_ID’].values[fid_idx],bands=bands,Band_Names=names)
for fid_idx in tqdm(range(len(train_df[‘Field_ID’].values))) ])
train_f[‘Field_ID’] = train_df[‘Field_ID’].values # add the field_id to the tain_f dataframe
train_sampled = pd.merge(train_sampled,train_f,on=’Field_ID’,how=’left’) # merge train_f together with train_sampled using “field_id”

Due to mathematical operations carried out on the data, there might be an occurrence of missing values. To discover if missing values are present within the data — missing values were checked for.

if missing values exists fill them with the most occurrent value in each column with missing value.

def missing_zero_values_table(df):
zero_val = (df == 0.00).astype(int).sum(axis=0)
mis_val = df.isnull().sum()
mis_val_percent = 100 * df.isnull().sum() / len(df)
mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
mz_table = mz_table.rename(
columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
mz_table['Data Type'] = df.dtypes
mz_table = mz_table[
mz_table.iloc[:,1] != 0].sort_values(
'% of Total Values', ascending=False).round(1)
print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
"There are " + str(mz_table.shape[0]) +
" columns that have missing values.")
# mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
return mz_table
missing_zero_values_table(train_sampled)# Fill with the most occurent value in each column
train_sampled = train_sampled.fillna(train_sampled.mode().iloc[0])

Training and Testing — with Cross-Validation

The XGBoost algorithm was implemented.

Why XGBoost?

  1. Execution Speed.
  2. Model Performance

The XGBoost is an algorithm that has recently been dominating applied machine learning and for . XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

  • When discussing execution speed, the XGBoost is fast. Really fast when compared to other implementations of gradient boosting.
  • XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.
from sklearn.model_selection import KFold
import xgboost as xgb
from sklearn.metrics import mean_squared_error
X, y = train_sampled.drop(['Field_ID'],axis=1), train_df['Yield']kf = KFold(n_splits =5,shuffle=True,random_state=160)
feats = pd.DataFrame({'features': X.columns})
gbm_predictions = []
cv_score_ = 0
oof_preds = np.zeros((train_df.shape[0],))
for i,(tr_index,test_index) in enumerate(kf.split(X,y)):
print()
print(f'######### FOLD {i+1} / {kf.n_splits} ')
X_train,y_train = X.iloc[tr_index,:],y[tr_index]
X_test,y_test = X.iloc[test_index,:],y[test_index]

gbm = xgb.XGBRegressor(eval_metric = 'rmse',n_estimators = 2000,learning_rate = 0.001,seed=162,random_state = 162,colsample_bytree=0.65)
gbm.fit(X_train,y_train,eval_set = [(X_test, y_test)],early_stopping_rounds = 200,verbose=100)cv_score_ += mean_squared_error(y_test, gbm.predict(X_test), squared=False) / kf.n_splits
oof_preds[test_index] = gbm.predict(X_test)
preds = gbm.predict(test_sampled[X_train.columns])
gbm_predictions.append(preds)
feats[f'Fold {i}'] = gbm.feature_importances_feats['Importances'] = feats.mean(axis=1)
print( ' CV RMSE : ',cv_score_)
preds_xgb = np.average(gbm_predictions, axis=0)
print(preds_xgb.shape)

Other gradient boosted decision trees such as can be implemented. However, to further improve the accuracy of the crop yield prediction model, more tricks are needed to be done to generate more significant features, and model hyperparameter tuning would also add to the predictive capability of the model.

A big thank you to and also to the Winners of the Zindi CGIAR Crop Yield Prediction Challenge.

CodeX

Everything connected with Tech & Code