ChatGPT for Data Science

Dhanasree Rajamani
8 min read · Sep 3, 2023

Data Science in Crop Recommendation

Agriculture is one of the most important pillars of the global economy and a major source of employment in many countries. Food is a basic human necessity, and agriculture is central to providing it. Several factors affect crop yield, including rainfall, soil nutrient levels, weather conditions, and fertilizer use. Analyzing data on these factors helps us understand how they influence production and identify the environment best suited to a specific crop, which in turn improves overall yield.

This project aims to incorporate data science in farming. We use various data features to determine the most suitable crop to grow in a given environment. This helps farmers make informed decisions about farming strategies.

The dataset consists of the following parameters:

  • N: Nitrogen content in soil.
  • P: Phosphorus content in soil.
  • K: Potassium content in soil.
  • Temperature: Temperature in degree Celsius.
  • Humidity: Relative humidity in percentage.
  • pH: pH value of the soil, a measure of its acidity or alkalinity.
  • Rainfall: Rainfall in mm.

The target label denotes the most suitable crop to grow in the given environment.

Dataset: https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset

Using ChatGPT to perform various phases of CRISP-DM for the Crop Recommendation problem.

Using ChatGPT, we work through the following phases of the CRISP-DM methodology:

Figure 1. The problem statement given to ChatGPT as a prompt, and the start of its response.

Business Understanding: Understand the objectives and requirements of the project.

Data Understanding: Explore the data to get familiar with it and identify the quality of the data.

Data Preparation: Clean and preprocess the data to make it ready for modeling.

Modeling: Use appropriate algorithms to create predictive models.

Evaluation: Assess the models to ensure they meet the business objectives.

Deployment: Deploy the model to a production environment.

Business Understanding

The primary goal is to provide farmers and agricultural businesses with recommendations on the most suitable crops to grow in specific conditions. By leveraging data on soil composition, weather conditions, and other environmental factors, the system aims to optimize crop yields, reduce risks associated with crop failure, and ensure sustainable farming practices.

Value Proposition

Optimized Yields: By recommending the most suitable crops based on specific conditions, the system can help optimize crop yields, leading to increased productivity and profitability for farmers.

Reduced Risks: Unpredictable weather patterns and changing environmental conditions pose risks to crops. A data-driven recommendation can reduce the chances of crop failure.

Sustainable Farming: Recommending crops that are best suited for specific soil compositions can lead to sustainable farming practices, ensuring soil health is maintained over multiple farming cycles.

Economic Growth: Increased yields and reduced crop failures can boost the agricultural sector’s contribution to the economy, leading to job creation and economic stability in regions heavily reliant on agriculture.

Key Insights from Analysis

Balanced Dataset: Our dataset was well-balanced across different crops, which means the model will not be inherently biased towards any specific crop due to data imbalance.

Feature Importance: All the provided features (soil nutrients, weather conditions, etc.) were deemed essential based on domain understanding. Our Random Forest model can further rank these features based on their importance in making predictions.

Model Performance: The Random Forest Classifier emerged as the best-performing model, with an accuracy of about 99.32% on the test set. This high accuracy suggests that the model can make reliable crop recommendations for conditions like those in the dataset.
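
As a rough illustration of how a Random Forest can rank features, the sketch below fits scikit-learn’s RandomForestClassifier and sorts its feature_importances_. The data comes from make_classification, not the crop dataset, and the column names are only borrowed from the feature list above, so the ranking it prints says nothing about real crops. The hyperparameters are assumptions too.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed column names, borrowed from the dataset's feature list.
feature_names = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]

# Synthetic stand-in data: 7 features, 3 classes (NOT the real crop data).
X, y = make_classification(
    n_samples=300,
    n_features=7,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances: non-negative values that sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
ranking = importances.sort_values(ascending=False)
print(ranking)
```

On the real data, the same two lines at the end would apply to whichever fitted Random Forest is used; only X, y, and the column names would change.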

Recommendations

Model Deployment: Given its high accuracy, the Random Forest Classifier should be considered for deployment. The system could be integrated into web or mobile applications for easy access by farmers.

Continuous Monitoring: Once deployed, the model should be continuously monitored for performance. Feedback from end-users can be invaluable in understanding any discrepancies between recommendations and real-world results.

Data Collection: As conditions change (climate change effects, evolving farming practices, etc.), there’s a need for ongoing data collection to keep the model updated.

Extension of System: Beyond just crop recommendation, the system can be extended to provide advice on pest control, irrigation methods, and other farming best practices.

Potential Challenges

Data Quality: The quality of predictions heavily relies on the quality of input data. Inaccurate data can lead to incorrect recommendations.

Complexity of Real-world Scenarios: While the model performs well on the provided dataset, real-world conditions can be more complex. Factors not considered in this dataset might influence crop selection.

The crop recommendation system presents a promising tool for modernizing and optimizing agricultural practices. Proper implementation and continuous refinement, backed by reliable data, can make it an indispensable asset for the agricultural sector.

Data Understanding and Data Visualization

Summary of Data Understanding

Dataset Shape: The dataset consists of 2200 rows and 8 columns.

Data Types:

Numerical: N, P, K, temperature, humidity, pH, rainfall

Categorical: label (which indicates the recommended crop)

Missing Values: There are no missing values in the dataset.

Summary Statistics:

For the nutrients (N, P, K), the values seem to vary within a reasonable range.

Temperature ranges from approximately 8.83°C to 43.68°C.

Humidity ranges from approximately 14.26% to 99.98%.

pH values, which measure the acidity or alkalinity of the soil, range from 3.5 to 9.93.

Rainfall varies from 20.21 mm to 298.56 mm.
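
The checks summarized above (shape, data types, missing values, and value ranges) map onto a handful of pandas calls. The sketch below is a minimal illustration on a small synthetic table: the column names are assumed to match the feature list, and the values are random placeholders rather than the real data. With the downloaded Kaggle file, the DataFrame would instead come from a pd.read_csv call on the CSV path.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the crop dataset (assumed column names, random values).
rng = np.random.default_rng(0)
n_rows = 60
df = pd.DataFrame({
    "N": rng.integers(0, 141, n_rows),
    "P": rng.integers(5, 146, n_rows),
    "K": rng.integers(5, 206, n_rows),
    "temperature": rng.uniform(8.0, 44.0, n_rows),
    "humidity": rng.uniform(14.0, 100.0, n_rows),
    "ph": rng.uniform(3.5, 10.0, n_rows),
    "rainfall": rng.uniform(20.0, 300.0, n_rows),
    "label": np.repeat(["rice", "maize", "apple"], n_rows // 3),
})

shape = df.shape                              # (rows, columns)
dtypes = df.dtypes                            # numerical vs. categorical columns
missing = df.isnull().sum()                   # missing values per column
summary = df.describe().loc[["min", "max"]]   # value ranges of numeric columns

print(shape)
print(dtypes)
print(missing)
print(summary)
```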

Distribution of Numerical Features: The histograms below show the distribution of each numeric feature in the dataset. These distributions provide insights into the range, central tendency, and spread of the data for each feature.

Figure 2. Distribution of numeric features in the crop dataset.

Distribution of Crops: The bar plot reveals the number of data points available for each crop. This visualization confirms that the dataset is well-balanced across different crops.

Figure 3. Distribution of data for various crops in the crop dataset.
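
Class balance can also be checked numerically rather than read off a bar plot. The sketch below counts rows per crop and reports the ratio between the largest and smallest class. The label column here is a made-up example with three crops of 100 rows each, not the real dataset.

```python
import pandas as pd

# Hypothetical label column: 3 crops x 100 rows each.
labels = pd.Series(
    ["rice"] * 100 + ["maize"] * 100 + ["apple"] * 100,
    name="label",
)

counts = labels.value_counts()

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1.0 is what "well-balanced" means in practice; a much larger value would call for stratified splitting or class weighting.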

Correlation Matrix: The heatmap of the correlation matrix provides a visual representation of how different numerical features relate to one another. For instance, we can observe some correlation between temperature and humidity, which is a common relationship in climatology.

Figure 4. Correlation matrix representing the relationship between the numerical features in the dataset.
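
The numbers behind a correlation heatmap come from a single pandas call. The sketch below computes a Pearson correlation matrix on synthetic data in which humidity is deliberately constructed to track temperature, purely so there is something to see. The strength of the relationship here is an artifact of that construction and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

temperature = rng.normal(25, 5, n)
# Humidity is built to rise with temperature, for illustration only.
humidity = 50 + 1.5 * (temperature - 25) + rng.normal(0, 3, n)
rainfall = rng.uniform(20, 300, n)  # independent of the other two

df = pd.DataFrame({
    "temperature": temperature,
    "humidity": humidity,
    "rainfall": rainfall,
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

Passing corr to a plotting function such as seaborn’s heatmap would produce the visual version.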

Pairwise Relationships: The pair plots show pairwise relationships between numerical features for three selected crops (rice, maize, and apple). These plots help visualize how the conditions vary for different crops and how they might influence crop growth.

Figure 5. Pairwise relationships between the numerical features for rice, maize, and apple.

Data Preparation

In the data preparation phase, we perform:

Data cleaning

Outlier analysis and processing

Feature selection

Data preprocessing

Outlier Analysis Observations:

From the boxplots, we can observe potential outliers in the following features:

N, P, K: These nutrients show some points outside the whiskers, which may be outliers.

Rainfall: This feature shows several data points that are notably higher than the rest.

Outlier Processing Strategy:

For agricultural data, outliers might represent extreme but genuine conditions. For instance, a very high rainfall might be indicative of a specific region or year. Instead of removing these outliers, we’ll retain them, since they could be critical for certain crop recommendations.

Figure 6. Outlier Analysis.
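
Points "outside the whiskers" of a boxplot are conventionally those more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to a made-up rainfall series and flags the extreme value without removing it, in line with the strategy above. The 1.5 multiplier is the usual boxplot default and is an assumption about how the plots in this article were drawn.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical rainfall values in mm, with one extreme reading.
rainfall = pd.Series(
    [60.0, 75.0, 80.0, 90.0, 95.0, 100.0, 110.0, 120.0, 290.0],
    name="rainfall",
)

mask = iqr_outlier_mask(rainfall)

# Flag the outliers for inspection, but keep every row in the data.
print(rainfall[mask])
print(f"{mask.sum()} flagged, {len(rainfall)} rows retained")
```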

Data preprocessing involves two steps:

Encoding Categorical Variables: The ‘label’ column is categorical and needs to be encoded into numerical values for modeling.

Feature Scaling: As different features have different scales, we’ll standardize the features so they have a mean of 0 and a standard deviation of 1.

Figure 7. Dataset after preprocessing.
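
Both preprocessing steps are available in scikit-learn. The sketch below encodes crop names with LabelEncoder and standardizes the features with StandardScaler. The four rows of feature values are invented for illustration, and the column order (N, P, K, temperature, humidity, pH, rainfall) is assumed.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up feature rows: N, P, K, temperature, humidity, pH, rainfall.
X = np.array([
    [90.0, 42.0, 43.0, 20.9, 82.0, 6.5, 202.9],
    [71.0, 54.0, 16.0, 22.6, 63.7, 5.7, 87.8],
    [24.0, 128.0, 196.0, 22.8, 90.7, 5.5, 110.4],
    [85.0, 58.0, 41.0, 21.8, 80.3, 7.0, 226.7],
])
labels = ["rice", "maize", "apple", "rice"]

# Encode crop names as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Standardize each feature to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(X_scaled.round(2))
```

One design note: when the data is later split into training and testing sets, fitting the scaler on the training portion only and reusing it on the test portion avoids leaking test-set statistics into training.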

Modeling

For this multiclass classification problem, candidate classifiers include:

Random Forest Classifier: Known for its robustness and ability to handle non-linearities.

Support Vector Machines (SVM): Effective in high-dimensional spaces.

K-Nearest Neighbors (KNN): Simple algorithm that works on the principle of distance between data points.

To evaluate the models, we’ll split the data into training and testing sets. We’ll train the models on the training set and evaluate their performance on the test set.

The data is split 80/20 into training and testing sets:

Training set: 1760 samples

Testing set: 440 samples
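
The sizes above correspond to an 80/20 split of 2200 rows. A minimal sketch with scikit-learn’s train_test_split reproduces them on placeholder arrays. The random seed, the use of stratification, and the assumption of 22 equally sized classes are illustrative choices, not details confirmed by the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's dimensions: 2200 rows, 7 features.
# Assumption: 22 classes of 100 rows each, to mimic a balanced dataset.
X = np.zeros((2200, 7))
y = np.repeat(np.arange(22), 100)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,     # 20% of rows held out for testing
    random_state=42,   # fixed seed so the split is repeatable
    stratify=y,        # keep class proportions equal in both sets
)

print(f"Training set: {len(X_train)} samples")
print(f"Testing set: {len(X_test)} samples")
```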

Summary of Model Evaluation:

Random Forest Classifier: 99.32% accuracy

Support Vector Machines (SVM): 96.82% accuracy

K-Nearest Neighbors (KNN): 95.68% accuracy
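
A comparison like this one can be sketched in a few lines of scikit-learn. The example below trains the three classifiers on synthetic clustered data from make_blobs and reports test accuracy for each. The data, the hyperparameters, and the scaling pipelines are all assumptions, so the accuracies it prints will not match the figures above.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 600 samples, 7 features, 6 classes (NOT the crop data).
X, y = make_blobs(
    n_samples=600, centers=6, n_features=7, cluster_std=2.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# SVM and KNN are distance-based, so they get a scaling step.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from best to worst.
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2%}")
```

For a sense of scale, if the reported accuracies are computed on the 440-sample test set, 99.32% corresponds to 437 correct predictions, 96.82% to 426, and 95.68% to 421.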

Evaluation

We can select a few samples from the test set, display their original features, and then predict the crops based on our best-performing model (Random Forest Classifier).

Figure 8. Predictions on sample test data using the Random Forest Classifier.

To cross-check the predictions, we compare them against the actual crop labels in the test data.

Figure 9. Comparison between actual and predicted crops for the selected samples.
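
A side-by-side table of this kind can be built by decoding the integer labels back into crop names and lining up the two columns. In the sketch below, both the actual and the predicted labels are hypothetical values chosen to include one mistake; they are not taken from the model in this article.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on a hypothetical set of crop names.
encoder = LabelEncoder()
encoder.fit(["apple", "maize", "rice"])

# Hypothetical encoded labels for five test samples (one misprediction).
y_actual = np.array([2, 1, 0, 2, 1])
y_pred = np.array([2, 1, 0, 1, 1])

comparison = pd.DataFrame({
    "actual": encoder.inverse_transform(y_actual),
    "predicted": encoder.inverse_transform(y_pred),
})
comparison["match"] = comparison["actual"] == comparison["predicted"]

# Fraction of the selected samples predicted correctly.
accuracy = comparison["match"].mean()

print(comparison)
print(f"accuracy on these samples: {accuracy:.0%}")
```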

Conclusion

Figure 10. ChatGPT’s summary of the project, given in response to a user prompt.

Using ChatGPT for data understanding, processing, and analysis offers various benefits:

  • ChatGPT offers an interactive environment where users can request analysis on the fly, which is especially useful during exploratory data analysis.
  • ChatGPT allows people to quickly test hypotheses, visualize data, and apply different models to see preliminary results. This rapid feedback is invaluable to data science projects.
  • For people who are not familiar with programming or data science tools, ChatGPT provides access to advanced data analysis through an intuitive interface.
  • Since ChatGPT has been trained on a wide range of topics and texts, it can offer insights from both a data science perspective and a domain or industry perspective.

Thus, ChatGPT democratizes data science, making advanced analysis accessible and interactive, fostering a deeper understanding of data, and facilitating informed decision-making.
