Deep learning for Crop Yield Prediction (Pt.1 — Model)
Boost crop yield and optimize irrigation: A deep learning approach to multivariate analysis
In this article, I present a project on Crop Yield Prediction and Irrigation Optimization using deep learning techniques.
Deep learning is a powerful approach for multivariate analysis, especially when dealing with complex datasets with many variables. This technique can capture intricate patterns in the data, providing a robust solution for problems involving multiple factors and interactions.
The purpose of this project is to provide a complete example of how to apply deep learning in a practical scenario, step by step, covering everything from data preparation to model building and evaluation.
We will explore each stage together, focusing on strategic decisions and technical justifications that support model development.
If you enjoy this article and find the content useful, please leave your feedback and give 👏. This helps to value the work I am doing here and allows more people to access this knowledge. Let’s get started…
Data Dictionary:
The data we are working with in this project is fictitious and was created to demonstrate how to apply deep learning in practice for crop yield prediction and irrigation optimization.
Implementation:
When we run the project, it will generate the model file with the .keras extension. For this project, we will use TensorFlow and, in addition, create a scaler for the data.
TensorFlow Setup:
When working with deep learning, we typically need a framework that allows us to build the layers of an artificial neural network. If you want to work with deep learning, the best options are TensorFlow and PyTorch.
We define an environment variable to set the log level of TensorFlow:
%env TF_CPP_MIN_LOG_LEVEL=3
TensorFlow is verbose, meaning it generates a lot of output messages, cluttering the notebook.
By setting the log level to 3, we configure it to show only error messages. If you prefer to see all messages, including warnings, do not run this cell.
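The %env syntax is a Jupyter magic command. If you run the code as a plain Python script instead of a notebook, an equivalent approach (a sketch; the variable must be set before importing TensorFlow) is:

# Set the TensorFlow log level before importing tensorflow
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'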
Importing Necessary Packages:
# Imports
import joblib
import sklearn
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import warnings
warnings.filterwarnings('ignore')
- joblib: Used to save the scaler to disk.
- sklearn: train_test_split to divide data into training and testing sets, and StandardScaler for scaling.
- tensorflow and keras: The Keras API simplifies model building with TensorFlow.
About TensorFlow and PyTorch:
Keras was originally developed as an independent API to simplify model building over TensorFlow, which was complex to code directly.
The Google team, which maintains TensorFlow, eventually integrated Keras into the framework. Therefore, we will use it to create the sequence of layers for our model, working with dense layers, dropout layers, and callbacks.
PyTorch is excellent for building sequences simply and quickly, making it ideal for experimentation. TensorFlow, however, stands out in terms of performance, being more suitable when production and performance are critical. Choose the ideal tool based on your project needs.
Loading the Data:
For this project, we will work with fictitious data. The dataset is not very large but sufficient to carry out our work.
Let’s start by loading the dataset using the pandas read_csv function:
# Load the dataset
df = pd.read_csv('dataset.csv')
print(df.shape)
df.head()
The purpose of this project is to use a series of measurements related to the field, soil, and agribusiness environment to predict soil humidity.
In Machine Learning, we work with historical data, that is, data that has already occurred. We collect this data, define the target variable, and the other variables can be used as predictors.
In our case, the dataset has 12 columns. One of these columns will be the target variable, humidity, while the other 11 are potential predictor variables.
Important Note:
Not all variables will necessarily act as predictors. We need to make additional checks or decide on the use of each one. For now, we know that one column is the target variable and the other 11 are candidates as predictors.
Exploratory Analysis:
In this project, my focus is on building the deep learning model, so I won’t spend much time on exploratory analysis, as I’ve already covered this step in several other projects. Nevertheless, let’s carry out some basic checks.
# Check the data types
df.dtypes
We see that we have one variable of type object, which is the date column, two variables of type int (integer), and most of the variables of type float (decimal). Next, let’s check the column names:
# Display the columns of the dataset
df.columns
'''
Index(['date', 'veg_index', 'soil_capacity', 'co2_level', 'nutrient_level',
'fertilizer_index', 'root_depth', 'solar_radiation', 'precipitation',
'growth_stage', 'yield_history', 'humidity'],
dtype='object')
'''
Identifying Non-Numeric Columns:
# Identify non-numeric columns
non_numeric_columns = df.select_dtypes(include=['object']).columns
print(f'Non-numeric columns: {non_numeric_columns}')
At this point, we observe that the only non-numeric column is date.
In Machine Learning, we essentially work with numbers, so we need to transform object-type variables (strings) into a numeric format; otherwise, we cannot use them to train the model.
Int and float columns can be kept as they are, with only scaling applied later.
Treating the date Column:
The date column must be processed in some way. My decision for this project is to remove this column, and I’ll explain why.
The data ranges from 2012 to 2023, with each row representing the first day of a month, without repeating dates.
Although we could convert the column to datetime type and then to an integer, in this context the date acts as a kind of ID, uniquely identifying each row.
If we were conducting a time series analysis, this column would be essential, because the date would be part of the primary information. However, as we are dealing with a supervised learning problem where the input and output data do not depend on chronological order, the date column behaves just like an ID, without relevance to the model’s learning.
IDs should not be included in models because they do not provide useful information for learning.
Therefore, my decision is to remove the date column. This is a decision that you should also make in each project, always justifying it. It’s not an issue if you later find that the decision wasn’t the best; the important thing is to document it and move forward. If necessary, you can always go back and adjust.
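For reference, if you did want to keep the date as a numeric feature, a minimal sketch (assuming the column parses with pandas’ default date format) could look like this. We will not use it in this project:

# Convert the date column to an integer ordinal (not used in this project)
df['date'] = pd.to_datetime(df['date'])
df['date_ordinal'] = df['date'].map(pd.Timestamp.toordinal)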
Reflection on Decision-Making in Analysis:
Decisions like this are part of data science. There is no guarantee that all initial decisions will be the best, as experimentation is a fundamental part of the process.
Learning comes from practice and correcting possible mistakes.
So, make your decisions based on the available information, document your justifications, and continue the process. If something doesn’t work well, review and adjust as needed.
Cleaning and Transformation:
Let’s now clean and transform the data. First, we remove the date variable, which we identified as non-numeric.
This action is simple and requires only one line of code:
# Remove non-numeric columns (if not needed)
df = df.drop(columns=non_numeric_columns)
Next, we check if the humidity column contains only numeric values, as this will impact the type of model we will build:
# Check if the 'humidity' column contains numeric values
if df['humidity'].dtype == 'object':
    df['humidity'] = pd.to_numeric(df['humidity'], errors='coerce')
Here, we use an if block to check whether the humidity column is of type object. If so, we convert it to numeric.
After execution, if everything went well, the humidity variable should be in float64 format, indicating that it contains numeric values.
Model Decision: Regression or Classification?
The decision to transform the humidity variable to numeric is important because it directly impacts the type of Machine Learning model we will build.
Looking at the target variable humidity, we can define the problem as regression, since we want to predict a numerical value, not a class. Therefore, we will create a deep learning model geared toward regression.
Removing Missing Values:
Although there are no missing values in the dataset, we ensure everything is clean by removing any rows that may contain NA:
# Remove rows with missing values
df = df.dropna()
Even if the dataset does not contain missing values, it is always good practice to ensure that no missing rows are present to avoid problems in model training.
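If you want to confirm this explicitly, a quick check (not part of the original notebook) is to count the missing values per column:

# Count missing values in each column
print(df.isna().sum())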
With these steps, the data is in the proper format to proceed with the modeling process.
Standardization:
Many Machine Learning algorithms benefit from data being on the same scale because it makes training smoother.
Are our data on the same scale? No. So, since I will work with deep learning for regression, I need to put the data on the same scale.
Otherwise, model training becomes extremely unstable, which can lead to problems and, consequently, a suboptimal model. That is, you do not reach the ideal model: training ends up settling on intermediate performance simply because of the scale differences in the data.
Which Variables Will Be Used as Predictors?
This also leads us to another question: which variables will I use as predictors? I could do some feature selection work or apply dimensionality reduction.
Since I want to emphasize deep learning, I will use all predictor variables except the date variable as input variables. So, I will put the humidity variable in y, which is the target variable. Everything else goes into X, and the date variable has already been deleted.
Do I know at this point in the project whether all these variables will help me predict the humidity variable? I don’t know; there’s no way to know, since I haven’t created the model yet.
When I create the model and analyze its performance, that’s when I will have a first idea. If the model performs well, these variables seem to explain y. If the performance is poor, there are likely problems with these variables. Then I go back, apply variable selection, and use many of the techniques I’ve taught you in several previous projects.
# Separate the predictors and the target variable
X = df.drop(columns='humidity')
y = df['humidity']
Splitting Data into Training and Testing Sets:
I will split X and y, separating the input variables (predictors) and the target variable. I will make the split into training and testing sets with an 80–20 ratio: 80% for training and 20% for testing. I used random_state so you can reproduce the same results.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
Creating the Scaler:
Then, I create the scaler using StandardScaler.
# Create the scaler
scaler = StandardScaler()
Next, you know what I need to do, right?
Fit and transform the training data, and only transform the testing data.
# Scale the data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Thus, we scale the data.
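Under the hood, StandardScaler transforms each column to z = (x - mean) / std, with the mean and standard deviation computed on the training data. After scaling, each training column should have a mean close to 0 and a standard deviation close to 1; a quick sanity check:

# Verify the scaled training data: mean ~ 0 and std ~ 1 per column
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))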
We also save the scaler to disk. If the file already exists, it will be overwritten.
# Save the scaler to disk
joblib.dump(scaler, 'scaler.joblib')
Importance of the Scaler in Deployment:
Every time you run the project, it will overwrite the file. Why am I saving the scaler to disk? Because I will deploy this model later.
Therefore, every time I use the model, I will need to prepare the data for the model in the same format as I prepared the training data.
Any transformation applied to the training data must be applied to the testing data and new data.
So, when I use the model in deployment, I will provide it with new data. This new data will arrive in its raw format, as shown above. I then have to standardize the data and deliver it to the model in deployment.
How will I standardize the data? By using this scaler. This scaler, once created and trained with training data, is the one you will use not only on the testing data but also on new data. And so, I save the file to disk so that we can deploy it.
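As a minimal sketch of what that looks like in deployment (new_data here is a hypothetical DataFrame with the same predictor columns as X):

# Load the saved scaler and model in the deployment environment
import joblib
from tensorflow.keras.models import load_model

scaler = joblib.load('scaler.joblib')
model = load_model('model.keras')

# new_data is a hypothetical DataFrame with the same predictor columns as X
new_data_scaled = scaler.transform(new_data)
predictions = model.predict(new_data_scaled)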
Considerations on Deep Learning:
We now have everything we need regarding the data, so we can work on the deep learning architecture. Deep learning is the leading technique today for artificial intelligence applications. In general machine learning projects, you don’t necessarily need deep learning; it is most useful when the data is complex and the volume of data is high.
If you have a small volume of data, deep learning may perform worse than other, simpler models. In other words, deep learning is not the solution to every problem. If you have a complex problem, with complexity in the relationships within the data, and a large amount of data to train an artificial neural network, then deep learning can be a good option.
You should always test it alongside the other algorithms you try. For example, suppose you have a problem in the medical field with complex data. You create a deep learning model, and it achieves, say, 85% accuracy. For the same dataset and the same problem, you apply Naive Bayes, a much simpler probabilistic algorithm, and reach 95%. Yes, this can happen. Why? Because the volume of data was not large enough for the neural network to capture the data’s patterns. So perhaps deep learning is not the best option.
It’s always important to test alternatives. If, on the other hand, you are going to work on very complex problems, such as computer vision and natural language processing, then it’s very likely that no other technique will be as effective as deep learning. But that comes with a price: you need large amounts of data to get good performance.
For our project, I want to walk you through a complete example. Since our data volume is small, that’s no problem; I’ll build the architecture and aim for good performance, as the objective is to demonstrate and explain the entire process.
Model Construction:
We have a sequence of layers for constructing our deep learning model:
# Define the model architecture
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dense(1)
])
The artificial neural network is composed of layers, and each layer is a set of mathematical operations.
There are various types of layers, and here I am using dense (fully connected) layers and dropout layers in a sequential manner.
It is worth noting that there are other types of layers, such as those in CNNs (Convolutional Neural Networks), LSTMs (Long Short-Term Memory networks), and transformers, built around attention modules, which are among the most recent and advanced architectures in the field of artificial intelligence.
Challenges in Constructing Deep Learning Architectures:
Building a deep learning architecture is not a trivial task. Unlike other algorithms, such as Random Forest, which do not require the construction of layers, deep learning requires you to configure the layers and define hyperparameters.
With Random Forest, you only need to adjust the hyperparameters and train the model. However, in deep learning, in addition to tuning hyperparameters, you also need to define the sequence of layers, which significantly increases the complexity of the model design process.
Defining Layers and Hyperparameters:
In deep learning, you define the layers, which in Keras are Python classes, and their hyperparameters.
For instance, the number of neurons is a hyperparameter, as is the input_shape, which specifies the number of columns (i.e., the features feeding into the neural network).
In our first dense layer, we specify 64 neurons — an adjustable hyperparameter.
We continue with dense layers, gradually reducing the number of neurons until reaching the final layer, which has only 1 neuron. Why?
The goal is to generate a numerical prediction, specifically the value of humidity. Therefore, a single neuron in the last layer is sufficient.
Activation Function:
The activation function plays a crucial role in deep learning. Each dense layer reduces to matrix operations, which can produce negative values. This is where the ReLU activation function comes in.
If there are negative values, ReLU transforms them into zero, keeping only the positive values. This introduces the non-linearity that allows the network to learn complex patterns.
There are various activation function options, and the choice depends on the type of problem. Here we use ReLU in the hidden layers, while the final layer has no activation (a linear output), which is the standard setup for regression problems like ours.
Dropout (Balancing Learning and Generalization):
Dropout is a regularization strategy used to prevent overfitting, which occurs when the model learns too much about the training data, capturing specific patterns that do not generalize well to new data.
Dropout deactivates a percentage of neurons during training. In our case, 30% of neurons are deactivated in each dropout layer.
It’s like the model learns something, and when dropout is applied, part of that learning is “forgotten,” forcing the model to learn more generically and preventing it from becoming overly specialized in the details of the training data.
By repeating this process, the model learns in a stable manner, balancing learning and generalization. Note that dropout is only active during training; at prediction time, all neurons are used.
Conclusion of the Architecture:
This is the architecture we are using for the project. The construction of the layers, the choice of activation functions, and the use of dropout regularization are fundamental elements to ensure a robust model that learns effectively without overfitting the training data.
The goal is to ensure that the model has stable learning and genuinely understands the patterns in the data in an effective manner.
Understanding the Next Step in Model Construction:
To understand the next step in building the model, let’s analyze what’s happening. The data enters the architecture, which is essentially the algorithm.
We have a dense layer, which in practice is a matrix that receives the input data and begins the multiplication process. This is where the training starts.
# Defining the model architecture
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dense(1)
])
The data is multiplied by weights. At the start of training, we don’t know the ideal values of these weights. We begin with something random, and the model learns as it performs these multiplications.
So, the data goes into the model, the first operations are done, and the model learns something, finding a pattern, which is the result of these multiplications. Then, we apply dropout, reducing the learning by 30%.
The model continues, persistent. It learns more in the next multiplication stage. Dropout is applied again, trimming the learning further. But it doesn’t stop; it continues learning until it generates a prediction.
Imagine feeding in the first row of data, without the date column, which we have already removed. The model takes these standardized values and generates a prediction for humidity.
Ideally, it would generate something close to 77.95, because if the original data produces this value, the expectation is for the prediction to match. But it might make a different prediction, because it is still learning.
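You can see the forward pass in action even before training: a freshly built (and still untrained) model will produce a prediction from its random initial weights. A minimal sketch:

# Push one standardized row through the untrained network
sample = X_train_scaled[:1]   # first row, shape (1, n_features)
print(model.predict(sample))  # a random-weight prediction for humidity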
Forward Pass and Error Evaluation:
So far, this was the forward pass. But how do we tell the model if it’s right or wrong? How can it adjust and improve? Just like a teacher tests a student’s knowledge, we need a way to evaluate the model. This evaluation is done by compiling the model, where we define the optimizer, loss function, and performance metrics.
# Compiling the model
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
We constructed the architecture with the layers for the forward pass, bringing the data to the prediction.
After this, we compare this prediction with the actual value and calculate the model’s error using the mean squared error (MSE), which measures the difference between the prediction and the actual value.
This error is passed to the Adam optimizer, which performs backpropagation (the backward pass).
Backpropagation and Weight Adjustment:
Backpropagation adjusts the weights based on derivatives, guiding the model to reduce the error in each new iteration.
Essentially, the model adjusts the weights to try to improve with each cycle, repeating the process of prediction, error, and adjustment.
# Compiling the model with optimizer, loss function, and metrics
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
Backpropagation adjusts the weights so that the model learns.
It recalculates and adjusts, using mathematics to improve the model’s performance. With the compilation, we define the optimizer, loss function, and evaluation metrics.
Each of these elements is chosen by the AI engineer, who determines everything from the structure of the layers to how the model will be evaluated.
With the model compiled, we are ready to configure the callbacks that will help control the training process.
Pause…
Let me ask you a very useful question to test a professional’s knowledge in machine learning: What is the ideal time to train a machine learning model, specifically a deep learning model in our case? What is the ideal time? The answer is simple: I don’t know the ideal time. In fact, there’s no way to know it beforehand.
How long should I train the model? For one minute, 10 minutes, 10 hours? Should I train for one epoch or a thousand epochs? I can’t know that at this point, especially since I haven’t even trained the first version of the model yet.
Of course, as you gain experience, you start to get an idea, but you will never have an exact answer. Fortunately, there are tools that help us control the training time, even without knowing the ideal duration. This is called a callback.
Nowadays, frameworks like TensorFlow, PyTorch, and others already have ready-made functions for this. One such example is EarlyStopping, which we will use here:
# Callbacks
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('model.keras', save_best_only=True)
What is Early Stopping?
EarlyStopping is part of the callbacks in Keras within TensorFlow. Its function is to monitor the error on validation data, which is data set aside for evaluation during training.
We configure a parameter called patience equal to 10. What does this mean? It means that if the model does not improve its performance for 10 consecutive epochs, the callback considers that it has stopped learning and ends the training.
So, notice: I don’t need to know how long the model needs to train. I set a number of epochs (training passes), for example, 100, and instruct the callback to monitor the learning process.
If the model stops improving for 10 epochs, it halts the training and restores the best weights learned so far, thanks to restore_best_weights=True. This prevents unnecessary training, saving computational resources and, most importantly, time.
ModelCheckpoint: Saving the Model During Training:
Another callback we will use is ModelCheckpoint. During training, the model resides in the computer’s memory; if I lose the session, I lose the model. To avoid this, ModelCheckpoint saves the model to disk as it learns.
We can choose to always save the best version, as we are doing here with save_best_only=True, or save multiple checkpoints (version 1, version 2, version 3, and so on).
Each time the model improves, it saves that version to disk. This ensures that if the training is interrupted at any time, you already have the best version saved and ready to be used.
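For reference, if you wanted one file per checkpoint rather than a single best file, Keras allows the epoch number in the filename (a sketch, not used in this project):

# Save a separate checkpoint file per epoch (model_epoch_01.keras, model_epoch_02.keras, ...)
checkpoint_all = ModelCheckpoint('model_epoch_{epoch:02d}.keras', save_best_only=False)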
Conclusion of the Model Architecture:
These are the callbacks: tools that help control and enhance the training process. With EarlyStopping, we avoid excessive training, and with ModelCheckpoint, we ensure that the model is securely saved during training.
In our notebook, I have detailed everything happening in each cell.
With this, we have completed the construction of the model architecture. We created the layers for the forward pass, defined the compilation with backpropagation, and configured the callbacks. The architecture is ready!
What to Use in Training?
We can now train the model. I have added the summary here for you to have an overview of what we will use in the training:
model.summary()
Notice that this is a sequential model, a sequence of neural layers, where we have dense layers and dropout layers. In the dropout rows of the summary, no learning occurs, which is why the parameters column shows zero. The learning happens in the dense (fully connected) layers.
The parameters on the right are the weights or coefficients that the model will learn during training. Although these elements may be referred to by different names, they all ultimately represent the same concepts: numerical values that express the mathematical relationship between the input data and the output.
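As a sanity check, you can reproduce these parameter counts by hand: each dense layer has inputs × neurons weights plus one bias per neuron. A sketch, assuming the 10 predictor columns left after dropping date and humidity:

# Parameters per dense layer = inputs * neurons + biases
n_features = 10                    # 12 columns minus date and humidity
layer1 = n_features * 64 + 64      # 704
layer2 = 64 * 32 + 32              # 2080
layer3 = 32 * 16 + 16              # 528
output = 16 * 1 + 1                # 17
print(layer1 + layer2 + layer3 + output)  # 3329 total trainable parameters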
Training the Model:
# Train the model
history = model.fit(
    X_train_scaled,
    y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    callbacks=[early_stopping, model_checkpoint]
)
To train the model, we call the fit method, passing the standardized training data (X_train_scaled) and the target variable (y_train). Normally, it is not necessary to standardize y, unless there is a specific reason, such as correcting an issue with the variable. Standardization is primarily applied to the input data (X).
Next, we define the validation_split. We have already divided our data into training and testing sets, but during training we also want to evaluate the model in real time. To achieve this, we split the training data into two samples: one for training and one for validation. This way, we can evaluate the model as it trains and later perform the final evaluation with the test data.
Defining the Number of Epochs and the Callback:
I set the number of epochs (training iterations) to 100. Is this the ideal number? I don’t know, but that doesn’t matter, because we configured the early_stopping callback.
If the model stops learning before reaching 100 epochs, the callback will halt the training. If more epochs are needed, we can adjust in a future version of the model. The key is that we are always experimenting; after all, this is science.
Batch Size:
I also set the batch_size to 32. When working with deep learning, we have numerous mathematical operations, specifically matrix multiplications, happening constantly.
Delivering all the data to the model at once is not feasible because it could exhaust the available RAM. Therefore, we work with smaller batches, feeding the model blocks of 32 data points at a time, which helps manage memory and facilitates learning.
Even though the data in this example is small and could fit in memory, it is still best practice to train in mini-batches. This approach mirrors real-world scenarios with large data volumes and helps optimize the model training process.
Monitoring the Error:
We monitor the loss, mean squared error (MSE), along with mean absolute error (MAE) as a metric to evaluate performance. What do we expect to happen with the error over time?
We expect it to decrease, indicating that the model is learning. Both training and validation errors tend to decrease, even with some stability along the way, which is normal.
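If you want to inspect the learning curves yourself, a quick sketch (assuming matplotlib is installed):

# Plot training vs. validation MAE over the epochs
import matplotlib.pyplot as plt

plt.plot(history.history['mae'], label='train MAE')
plt.plot(history.history['val_mae'], label='validation MAE')
plt.xlabel('Epoch')
plt.ylabel('MAE')
plt.legend()
plt.show()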
Notice that the model stopped training at epoch 64, triggered by early_stopping. For our example, this is sufficient. Now that the model has been trained, we can proceed to evaluation.
Evaluation:
The model has been trained, and we already have the best version saved on disk as model.keras. We also saved the scaler. Now we can evaluate the model to conclude the work and prepare for deployment. To do this, we will use the model that is already in memory.
# Evaluate the model on the test set
test_loss, test_mae = model.evaluate(X_test_scaled, y_test)
At this moment, the model is in the memory of the Jupyter Notebook session, and we can use it directly from here. If needed, we can also load the model from disk to validate that the file is correct.
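For example, a quick way to confirm that the saved file works (a sketch) is to reload it and evaluate it again:

# Load the best saved model from disk and re-evaluate it
from tensorflow.keras.models import load_model

best_model = load_model('model.keras')
best_model.evaluate(X_test_scaled, y_test)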
We use the evaluate method, passing the scaled test data (X_test_scaled) and y_test, which holds the true values. The model will make predictions and compare them with the actual values to check its performance, returning two metrics: test_loss and test_mae.
print(f'Test Loss: {test_loss}')
print(f'Test MAE: {test_mae}')
We reached a loss (MSE) of about 153 and an MAE of about 10 on the test set. The lower these values, the better. Since we have only one version of the model, it’s hard to say whether it’s the best possible, but the objective here is to demonstrate the whole process.
The model clearly learned: at the start of training, the MAE was around 72 for both training and validation, and it fell steadily, reaching roughly 7.6 on the validation data by the end of training.
This concludes our project, where the main goal was to present a deep learning architecture, showing how this approach can be extremely useful when working with multivariate analysis.
Deep learning generally handles data complexity well, something simpler algorithms like linear regression may struggle with due to their limited mathematical operations.
With deep learning, we have more operations that allow the model to learn complex patterns. Complex data with many variables and dimensions is common, and deep learning is a good option for such cases.
However, to fully leverage this technique, a large volume of data is necessary, usually in gigabytes or more. This is because deep learning models have a high learning capacity and require substantial data to generalize effectively.
If the data volume is insufficient, other approaches may be more effective. Always consider adjusting hyperparameters, testing variable selections, and refining the model architecture.
Modifications can improve performance and better fit the model to the specific problem you are solving.
Thank you very much. 🐼❤️
All images, content, and text are created and authored by Leonardo A.