Predicting Crop Production in Punjab, Pakistan for Top 5 Most Produced Crops

Muhammad Bhatti
11 min read · Jun 22, 2023


Time Series Machine Learning Model using Python — Muhammad Bhatti and Wajih Sami Siddiqui

Inspiration

Agriculture has long been the backbone of the Pakistani economy, accounting for almost a fifth of Pakistan’s total GDP, with Punjab being the most prominent contributor. The importance of this sector is felt by millions of Pakistanis on a daily basis. Accurately predicting production outputs can therefore be a vital tool not just for farmers, but for officials as a whole. From a policy perspective, forecasts of output for coming years can be used to identify which areas of the province require extra attention or subsidisation. Moreover, predicting the yearly output of farms across Punjab would help estimate future crop prices.

The inspiration for this machine learning project stemmed from the challenges faced by farmers and policymakers in effectively planning and managing crop production. Furthermore, technological advancements in data science and machine learning continue to expand their application potential. By utilising these tools, it becomes possible to gain valuable insights into the factors that influence crop production and develop more accurate forecasts.

Literature Review

Aslam et al. conducted a thorough analysis of wheat production in Pakistan to determine how production was impacted by numerous indicators. They revealed that the most important features in wheat production were the prevailing trends in production, namely its momentum and volatility. This study served as an inspiration for us, as we wanted to conduct a deep dive into the Punjab region and see how trends varied across different crops and different districts.

Dataset

The dataset chosen for this article was taken from the data resource centre at the Lahore University of Management Sciences (LUMS). Two datasets, one for various farm inputs and one for farm outputs, were chosen for the project.

Input Dataset

The input dataset had a total of 918 rows and 9 columns. The dataset contained values for all 34 of Punjab’s districts between 1993 and 2019. The main input features included Tube_wells (number of tube-wells used for agricultural purposes), Tractors (number of tractors used for agricultural purposes), Fertilizer_off_take (sale of fertilizers in tonnes), Land (total area sown during the Rabi and Kharif seasons), and Labor (yearly man-days in thousands dedicated to performing agricultural tasks).

Output Dataset

The output dataset had a total of 45,356 rows and 9 columns. This dataset also contained values for all 34 of Punjab’s districts, in this case between 1990 and 2018. The data for each year was sorted by district and then by each of the 46 individual crops. In addition, it housed our dependent variable, production in tonnes, as well as yearly price.

Data Cleaning

Merging Datasets

Owing to the difference in the number of rows between the two datasets, they could not be merged without some changes. As the inputs dataset contained one row per year per district, we made the assumption that the inputs would be the same for every crop within a given district and year. A total of 46 duplicate rows (one per crop) were therefore created for each instance in the inputs dataset. Moreover, all observations from the years 1990, 1991, 1992, and 2019 were dropped from the output dataset before the two were finally merged. Redundant features, such as duplicate columns for year and district, were also removed. The merged dataset had 40,604 rows and 9 columns.
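Under the assumption that both tables share year and district columns (the column names and values below are illustrative, not the actual LUMS schema), the merge step can be sketched with pandas as follows:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
inputs = pd.DataFrame({
    "Year": [1993, 1993],
    "District": ["Lahore", "Multan"],
    "Tractors": [1200, 900],
})
outputs = pd.DataFrame({
    "Year": [1993, 1993, 1993, 1990],
    "District": ["Lahore", "Lahore", "Multan", "Lahore"],
    "Crop": ["Wheat", "Sugarcane", "Wheat", "Wheat"],
    "Production": [500.0, 800.0, 450.0, 400.0],
})

# Drop years absent from the inputs dataset (1990-1992 and 2019 in the article).
outputs = outputs[outputs["Year"].between(1993, 2018)]

# A many-to-one merge duplicates each input row once per crop,
# mirroring the per-crop duplication described above.
merged = outputs.merge(inputs, on=["Year", "District"], how="left")
print(merged.shape)  # (3, 5)
```

Each output row picks up the input values of its district-year pair, so a district-year with 46 crops repeats its inputs 46 times.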

Removing Unnecessary Features

The yearly price column added very little value to our dataset and was removed, alongside the province and crop category columns.

Dummy Encoding Categorical Variables

We had two main features that needed to be encoded as dummies: District and Crop. Since there were 46 unique crops and 34 unique districts, 45 dummies were created for crops and 33 for districts, following the n-1 rule.
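The n-1 rule can be implemented with pandas’ `get_dummies` using `drop_first=True`, as in this tiny illustrative frame (the real data has 46 crops and 34 districts):

```python
import pandas as pd

# Illustrative frame; one level of each variable becomes the implicit baseline.
df = pd.DataFrame({
    "Crop": ["Wheat", "Sugarcane", "Rice"],
    "District": ["Lahore", "Multan", "Lahore"],
})

# drop_first=True drops the first category of each variable,
# leaving n-1 dummy columns per categorical feature.
encoded = pd.get_dummies(df, columns=["Crop", "District"], drop_first=True)
print(encoded.columns.tolist())
```

With 46 crops and 34 districts this yields exactly the 45 + 33 dummy columns described above.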

Adding Features

A dummy variable was created for flood affected districts. Floods have historically been responsible for the greatest loss in agricultural output. Therefore, data was collected from official Pakistani sources including the Provincial Disaster Management Authority of Punjab, and post-disaster reports published by the Government of Punjab. Years when major flooding impacted parts of Punjab include 1995, 2010, 2011, 2012, 2013, 2014, and 2016. Only districts heavily impacted by floods were assigned a value of 1.
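One possible way to build such a flood dummy is from a hand-collected set of (year, district) pairs; the pairs below are purely illustrative, not the actual list from the PDMA reports:

```python
import pandas as pd

# Hypothetical flood-affected (year, district) pairs, as collected from
# official reports; illustrative values only.
flood_hit = {(2010, "Muzaffargarh"), (2014, "Jhang")}

df = pd.DataFrame({
    "Year": [2010, 2010, 2014],
    "District": ["Muzaffargarh", "Lahore", "Jhang"],
})

# Assign 1 only where both the year and the district were heavily affected.
df["Flood"] = [
    1 if (year, district) in flood_hit else 0
    for year, district in zip(df["Year"], df["District"])
]
print(df["Flood"].tolist())  # [1, 0, 1]
```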

Changing the Year Column

The year column contained years from 1993 to 2018. A simple map function was used to change the values into a categorically encoded variable to allow for better results in the time series model. The encoding was as follows: 1993 was assigned 1, 1994 was assigned 2, and so on up to 2018, which was assigned 26.
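The mapping can be sketched with a dictionary and pandas’ `map` (the column name is assumed):

```python
import pandas as pd

df = pd.DataFrame({"Year": [1993, 1994, 2018]})

# Encode 1993 -> 1, 1994 -> 2, ..., 2018 -> 26.
year_map = {year: i + 1 for i, year in enumerate(range(1993, 2019))}
df["Year"] = df["Year"].map(year_map)
print(df["Year"].tolist())  # [1, 2, 26]
```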

Final Dataset

The final cleaned dataset had 40,604 rows and 83 columns.

Exploratory Data Analysis

We wanted to see whether we could analyse the historical data to deduce reasonable patterns from it. To do this we created a number of graphs to aid visualisation of our data.

Visualisations

First, we wanted to see which crops were most produced across all the districts of Punjab. To do this, we summed the production in tonnes across all years, grouping first by district and then by crop. Next, the crop with the highest total production for each district was added to a separate data frame, which was used to create a pie chart.
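Assuming columns named District, Crop, and Production, the per-district top-crop table could be derived like this (toy values):

```python
import pandas as pd

# Toy production table; the real data spans 34 districts and 46 crops.
df = pd.DataFrame({
    "District": ["Lahore", "Lahore", "Multan", "Multan"],
    "Crop": ["Wheat", "Sugarcane", "Wheat", "Sugarcane"],
    "Production": [500.0, 300.0, 200.0, 600.0],
})

# Sum production per (district, crop), then keep each district's top crop.
totals = df.groupby(["District", "Crop"])["Production"].sum().reset_index()
top = totals.loc[totals.groupby("District")["Production"].idxmax()]
print(top[["District", "Crop"]].to_dict("records"))
```

Counting how often each crop appears in this top-crop frame gives the shares shown in the pie chart.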

The pie chart below shows that for each of the 34 districts, the most produced crop was either Sugarcane or Wheat. Sugarcane was the most produced crop in 18 districts (52.9%), and Wheat in 16 districts (47.1%).

Most Popular Crops for All Districts (Summed up)

Next, we wanted to see how the total production in tonnes for all districts changed between 1993 and 2018. The production values were summed and grouped according to each year and plotted against years.

The line plot below shows that production generally increased each year and was the highest in 2017.

Change in Total Production between 1993 and 2018

We also wanted to visualise how production of all the various crops was different at both ends of our time series. Two plots were created, one showing the individual crop levels summed across all districts for 1993, and the second showing the same for 2018.

The plots below show that in both years, Sugarcane and Wheat dominated production among all other crops, and they illustrate the scale at which these two crops are produced relative to the rest. Most crops, including Maize, Potato, Rice, Sugarcane, and Wheat, saw a large increase in 2018 relative to their 1993 production levels.

Crop Production Summed for all Districts for 1993 and 2018

To further illustrate the production levels of all the crops over time, a line plot was created. The graph below shows that more or less the same trend continued as in the previous plot: Sugarcane and Wheat were dominant among all other crops in every year, and a general positive trend was observed across all the crops.

Crop Production Summed for all Districts between 1993 and 2018

Filtering Top 5 Crops

The previous two visualisations helped us deduce that the top 5 crops, i.e. Sugarcane, Wheat, Rice, Potato, and Maize, are produced in significantly greater quantities across all parts of Punjab and across all years. Hence, we removed all crops other than these five.
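Assuming the crop labels are still available in a single column at this stage (i.e. the filter runs before dummy encoding, or on a retained crop column), the filtering step might look like:

```python
import pandas as pd

# Toy frame; the real data has 46 crops.
df = pd.DataFrame({
    "Crop": ["Wheat", "Cotton", "Maize", "Barley"],
    "Production": [500.0, 100.0, 250.0, 20.0],
})

# Keep only the five most produced crops identified in the EDA.
top5 = ["Sugarcane", "Wheat", "Rice", "Potato", "Maize"]
df = df[df["Crop"].isin(top5)]
print(df["Crop"].tolist())  # ['Wheat', 'Maize']
```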

The graph below illustrates the change in crop production for the top 5 most produced crops between 1993 and 2018.

Production for Top 5 Crops Summed for all Districts between 1993 and 2018

Furthermore, by only keeping the top 5 most produced crops, the correlations between the target variable and the feature variables significantly improved.

Correlations for Feature Variables with Target Variable (Production)

Removing Missing Values for Target Variable

The final step before proceeding with the Machine Learning model was to remove any rows which included missing values for our target variable. A total of 186 rows were removed.
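Assuming the target column is named Production, the row removal can be sketched with `dropna`:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing target value.
df = pd.DataFrame({
    "Production": [500.0, np.nan, 250.0],
    "Tractors": [1200, 900, 1100],
})

# Drop only the rows where the target variable is missing.
df = df.dropna(subset=["Production"])
print(len(df))  # 2
```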

The final dataset had 4,325 rows and 46 columns.

Final Dataset Used for Machine Learning Models

Machine Learning

The modelling process was broken down into two main parts:

  1. Using Models like Linear Regression, Elastic Net, Tree Regression, Random Forest Regressor and Stochastic Gradient Descent on a simple time series train-test split of the data
  2. Using the above models on a Rolling Origin Cross Validation Approach.

This approach allowed us to holistically pick the model that worked best for our objective. For the first part, we used the following simple time series train-test split, involving standardisation of the features and scaling down of our target variable.

Using Models without Rolling Origin Cross Validation

Time Series Train-Test Split

The training dataset was chosen to be from 1993 to 2013 and the testing dataset was chosen to be from 2014 to 2018.

Standardising Training and Testing Datasets

Both the training and testing datasets were further split into x_train, y_train, x_test, and y_test, with x_train and x_test containing our features, and y_train and y_test containing our target variable. The StandardScaler from the sklearn library was chosen to compute the scaled datasets. All continuous variables in x_train and x_test, i.e. all columns excluding the dummies and the year column, were separated, standardised, and concatenated with the remaining columns for both the training and testing datasets. The y_train and y_test arrays were also divided by 10,000 to express production in tens of thousands of tonnes.
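A minimal sketch of this scaling step, with one continuous column and one dummy standing in for the real feature set (column names are illustrative); the scaler is fitted on the training data only:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# "Tractors" stands in for the continuous inputs; "Crop_Wheat" for a
# dummy column that must not be scaled.
x_train = pd.DataFrame({"Tractors": [1000.0, 1200.0, 1400.0], "Crop_Wheat": [1, 0, 1]})
x_test = pd.DataFrame({"Tractors": [1300.0], "Crop_Wheat": [0]})

continuous = ["Tractors"]
scaler = StandardScaler()
# Fit on the training data only, then apply the same transform to the test set.
x_train[continuous] = scaler.fit_transform(x_train[continuous])
x_test[continuous] = scaler.transform(x_test[continuous])

# Rescale the target from tonnes to ten-thousand tonnes.
y_train = pd.Series([120000.0, 80000.0, 150000.0]) / 10_000
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into training.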

Evaluation Metrics

All five machine learning models were fitted using the scaled training datasets and evaluated using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-Squared Score (R2 Score).
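The three metrics can be computed with sklearn as below (toy numbers, not the article’s results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(round(rmse, 3), round(mae, 3), round(r2, 3))  # 0.645 0.5 0.844
```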

Results

The Random Forest Regressor model yielded the best results with an R2 Score of 0.991, RMSE of 7.018, and MAE of 2.109 when predicting on the training dataset, and an R2 Score of 0.526, RMSE of 82.759, and MAE of 20.505 when predicting on the testing dataset. The worst performing model was the Simple Linear Regression with an R2 Score of 0.421, RMSE of 56.682, and MAE of 32.125 when predicting on the training dataset, and an R2 Score of 0.249, RMSE of 104.203, and MAE of 45.502 when predicting on the testing dataset.

Using Models with Rolling Origin Cross Validation

As our data formed a time series, regular Cross Validation using the sklearn library was not possible. Taking this into account, we chose a Rolling Origin Cross Validation technique.

This method works by splitting the dataset into two parts: the training portion and the validation set. The size of the validation set is determined by the number of steps we choose to predict into the future. In the visualisation below, the Rolling Origin Cross Validation uses a minimum training size of 6 and steps equal to one. With every iteration, the training set grows to include the previously used training instances. This differs from regular Cross Validation, which randomly splits the data into training and validation sets.

Rolling Origin Cross Validation with One Step

Defining the Rolling Origin Cross Validation Function

As there is no inbuilt function for Rolling Origin Cross Validation, we had to create the entire function from scratch. Moreover, we decided to have it fit all five of our models and store the average results of all three metrics mentioned previously in a dictionary.

The function takes the training dataset, minimum train size, steps, and the number of unique years in our training dataset as inputs. Next, it creates empty lists for the scores of all five models across all three evaluation metrics. Finally, it creates a loop which generates the training and validation sets for each iteration.

Rolling Origin Cross Validation Function (Part 1/6)

Next, it scales both the training and validation datasets in each fold using StandardScaler, alongside dividing the target variable arrays by 10,000.

Rolling Origin Cross Validation Function (Part 2/6)

Next, it fits all five of our models using the training sets and predicts the value of the target variable using the validation set.

Rolling Origin Cross Validation Function (Part 3/6)

Once the models are fitted, it calculates the results of all three evaluation metrics using the predicted target variable values and stores them in separate arrays.

Rolling Origin Cross Validation Function (Part 4/6)

The mean of all our results is calculated within the function but outside the initial for loop, as shown below.

Rolling Origin Cross Validation Function (Part 5/6)

The results are then finally stored in a dictionary which is returned once the function ends.

Rolling Origin Cross Validation Function (Part 6/6)
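The six parts above can be condensed into a single-model sketch. The real function fits all five models and rescales the features within each fold, which is omitted here for brevity; all names and the fold-advance scheme are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def rolling_origin_cv(x, y, years, min_train_size, steps):
    """Rolling origin CV for a single model; averages RMSE, MAE, and R2."""
    unique_years = np.sort(np.unique(years))
    rmse_scores, mae_scores, r2_scores = [], [], []

    # Grow the training window by `steps` years each fold and validate
    # on the next `steps` years.
    start = min_train_size
    while start + steps <= len(unique_years):
        train_mask = np.isin(years, unique_years[:start])
        val_mask = np.isin(years, unique_years[start:start + steps])

        model = RandomForestRegressor(random_state=0)
        model.fit(x[train_mask], y[train_mask])
        pred = model.predict(x[val_mask])

        rmse_scores.append(np.sqrt(mean_squared_error(y[val_mask], pred)))
        mae_scores.append(mean_absolute_error(y[val_mask], pred))
        r2_scores.append(r2_score(y[val_mask], pred))
        start += steps

    # Average each metric across folds, as in Parts 5 and 6 above.
    return {
        "RMSE": float(np.mean(rmse_scores)),
        "MAE": float(np.mean(mae_scores)),
        "R2": float(np.mean(r2_scores)),
    }
```

With `min_train_size=5` and `steps=2`, the first fold trains on years 1 to 5 and validates on years 6 and 7, then the origin rolls forward two years at a time.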

Tuning Parameters of Cross Validation Function

Once the function was defined, we needed to decide on the minimum training size and the number of steps. After repeated testing, we found that a minimum training size of 5 years alongside steps equal to 2 years yielded sufficient results. Choosing steps equal to 2 years lets us make predictions 2 years ahead with fairly strong results.

Results

The Random Forest Regressor model again yielded the best results, with an average R2 Score of 0.857, average RMSE of 29.125, and average MAE of 10.164. Our worst performing model was Stochastic Gradient Descent, with an average R2 Score of 0.333, average RMSE of 64.173, and average MAE of 36.585.

Visualising Results

Finally, we settled on R2 Score and RMSE as our chosen evaluation metrics and compared them across all models involved in cross validation. We found that, among these, the Random Forest Regressor yielded the best results, with the highest R-Squared and lowest RMSE, which is why we decided to proceed with it as our final model. Although the Random Forest Regressor without cross validation yielded a lower RMSE and higher R2, the cross-validated model is still more generalisable to unseen data.

R2 Scores and RMSE of Random Forest Regressor Model using Rolling Origin Cross Validation with Minimum Training Size = 5 and Steps = 2

Fine Tuning the Model

We wanted to see whether changing the max_depth parameter of the Random Forest Regressor model would yield any significant improvement. To test this, we created a simple for loop that varied max_depth from 1 to 9 and compared the results against the default value (max_depth = None). We found that the default parameters yielded the best results and required no tuning.
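A sketch of such a tuning loop on synthetic data (the real comparison used the project’s training and validation folds):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression data standing in for the crop dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
x_train, x_val = x[:150], x[150:]
y_train, y_val = y[:150], y[150:]

# Compare max_depth = 1..9 against the default (max_depth=None).
results = {}
for depth in list(range(1, 10)) + [None]:
    model = RandomForestRegressor(max_depth=depth, random_state=0)
    model.fit(x_train, y_train)
    results[depth] = r2_score(y_val, model.predict(x_val))

best = max(results, key=results.get)
```

The key with the highest validation R2 is kept; in the project’s case that was the default, unbounded depth.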

Results with Test Dataset

Finally, we fitted our Random Forest Regressor model with the training set, and predicted the value for our target variable using the testing set. The final scores were stored and represented in the graph shown below.

RMSE and R2 Scores(%) for Training and Testing Datasets Using Random Forest Regressor Model

Conclusion

As shown, although the model had a relatively high R2 Score and low RMSE, the gap between RMSE(Test) and RMSE(Train), along with a lower test R-Squared, shows that our model currently lacks generalisability. Therefore, in future iterations, bias needs to be introduced to reduce overfitting and make the model more generalisable, or more data is needed on these specific crops to make our model more robust.
