HELPING FARMERS WITH CLIMATE DATA

By John Lehner, Brice Tikum, Luis Alvaro, Levis Forbang, Andrew Lynch

John Lehner
INST414: Data Science Techniques
19 min read · May 17, 2023

--

Introduction

The question we sought to address: how, and to what extent, can different weather indicators help farmers plan their planting and harvesting? Can one city be compared with another to help make this determination? Can clusters be identified to isolate the best time to plant? Can future climate indicators be predicted to gain insight into the coming season’s climate?

To answer these questions we pulled the necessary data from the API at climate.azavea.com. The technologies we used are Python, Anaconda/VSCode, Pandas, and predictive models in Sklearn. Many stakeholders could use our work and models to make decisions, but farmers in Adelphi, MD are the primary stakeholders we elected to focus on in the scope of our project. They can look at our work to see how their crops might be affected by increasing or decreasing levels of drought, flooding, and frost. Done right, the data should let us plot historical trends and visualize what we expect to happen in the future using time-series models: acquire the historical data, develop a numerical metric, and predict the next weeks, months, or years of data with that metric. Finally, using four specific cities, we will suggest the one that would be most beneficial for a farmer to examine. For example, if one city’s climate is more similar to Adelphi’s than another’s, Adelphi farmers could look for crops that are successfully grown in that similar climate and attempt to grow them themselves. We used Euclidean distance, Jaccard similarity, k-means clustering, supervised learning, and data cleaning to achieve this.

Data Collection

Based on the topic, gathering weather data from an API made the most sense. To collect the raw data we used the Azavea Climate API, which processes climate datasets in NetCDF format and serves the data in the universal JSON format. We therefore used the json and requests packages in Python to gather the data and parse the JSON for further processing. The API required us to request a key for authentication and access to the data. The API has endpoints for numerous weather indicators; for the scope of our project, we elected to focus on the five we believed would be most relevant to farmers: “Frost Days”, “Max High Temperature”, “Min Low Temperature”, “Percentile Precipitation”, and “Dry Spells”. Each API request needs three parts: the city ID, the scenario, and the indicator parameter. We kept the scenario the same throughout the different requests, and the API documentation lists all the indicator parameters that can be used, so to begin we needed to find the IDs of the cities on which we would be collecting data. To do this we made a simple API request to “https://app.climate.azavea.com/api/city”, which returned JSON detailing each city the API has data on. This JSON was particularly hard to work with, so we used the pandas json_normalize() function to clean the dataset up and make it easier to read. We chose cities from different regions within the United States to provide geographically diverse choices. After collecting the IDs of the four cities we would be working with, Adelphi, Ackerman, Albany, and Alachua, we were ready to begin collecting the data. We recycled the same API request for each city; all we needed to do was change the city ID and indicator parameter in the URL. The collected data was returned as JSON.
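As a sketch of the city-lookup step (the “Token” authorization scheme and the payload’s “features” key are assumptions about this API, not confirmed details), the request might look like:

```python
import requests
import pandas as pd

CITY_URL = "https://app.climate.azavea.com/api/city"

def fetch_city_table(api_key: str) -> pd.DataFrame:
    """Download the city list and flatten the nested JSON with json_normalize."""
    # The "Token" scheme and the "features" key are assumptions about
    # this API's payload shape, for illustration only.
    resp = requests.get(CITY_URL, headers={"Authorization": f"Token {api_key}"})
    resp.raise_for_status()
    return pd.json_normalize(resp.json()["features"])

# Usage (requires a real key from the API signup):
# city_df = fetch_city_table("YOUR_KEY")
# print(city_df.head())
```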
Wanting to make it both easier to read and work with, we used json.loads() to convert it from JSON into a Python dictionary, which we then put into a pandas data frame. The file containing the final data frame is 10,348 bytes, and the data frame houses 816 rows and 5 columns, giving us 4,080 data points on which to perform our analysis. Data exploration and cleaning took a fair bit of work. The initial data frame for each city contained the year and month in one column in year-month format (e.g., 2022-1), and the min, avg, and max for the given indicator in their own respective columns. We decided the most useful approach would be to keep only the averages, as we thought those would be the most useful information for this project. We then combined the separate data frames for each city and indicator into one data frame, and used a dictionary and the split method to separate the year and month into their own columns as well as convert the months from numeric values to their actual names: January, February, March, etc. The year, month, city, and state were then set as a multi-index, as we would only be using the averages as data points. Additionally, the API includes predicted data going all the way up to the year 2100. Because we want to make our own predictions, we filtered the data to only include years prior to 2022. Upon completing our analysis, we will be able to compare and validate our predictions against those housed in the API.
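A minimal sketch of this cleaning pipeline on a toy frame (the column names here are illustrative, not the project’s exact ones):

```python
import calendar
import pandas as pd

# Toy frame mimicking the raw API output described above
raw = pd.DataFrame({
    "date": ["2021-1", "2021-2", "2022-1"],
    "avg": [3.2, 2.8, 3.5],
})
raw["city"], raw["state"] = "Adelphi", "MD"

# Split "year-month" into separate columns
raw[["year", "month"]] = raw["date"].str.split("-", expand=True)
raw["year"] = raw["year"].astype(int)

# Map month numbers to their names (January, February, ...)
raw["month"] = raw["month"].astype(int).map(lambda m: calendar.month_name[m])

# Keep only data prior to 2022 so our own predictions can be validated
clean = raw[raw["year"] < 2022].drop(columns="date")

# Year, month, city, and state become a multi-index
clean = clean.set_index(["year", "month", "city", "state"])
print(clean)
```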

Key Ideas

There are many concepts and techniques from class that we used throughout this project. The first was data hygiene, or data cleaning. Our data comes from an API, so there was no guarantee that it was properly formatted or free of corrupted records. We checked for null values, inconsistent data, incorrect data types, outliers, human errors, etc. Checking these things ensures our datasets are as accurate and consistent as possible; if we ignored data hygiene, we would risk making decisions based on inaccurate information. We also used supervised learning techniques to make predictions about different weather points: since we know the structure of our data and have continuous labels, our aim here is to predict. Finally, we applied an 80/20 train/test split and model evaluation techniques such as mean absolute error and R² calculations.

Analysis

Overall Climate Similarity

Which city in our collected data has an overall climate most similar to Adelphi, MD? This is the first question we wanted to ask in our analysis. The reason: if a farmer in Adelphi is looking to expand the selection of crops they plant and wants to tap into a product not already widely accessible in the state, they may want to look at other cities with a similar climate and see what is grown there. Having identified such a city, they can then examine more specific climate data, how similar the precipitation, temperature, etc. are, to determine whether a given crop is likely to thrive. Determining the overall similarity took multiple steps. First, the data frame was transformed using the “groupby” and “mean” methods to get the average value of each indicator in each city for each year. Then, using the scipy Python library, the Euclidean distance between each city’s yearly indicators and Adelphi’s was calculated and placed into a new data frame.

Following this, we found the average similarity for each city using the “mean” method (average_similarity = similarAdelphi.mean(axis=0)). The resulting similarity scores were as follows.
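A small sketch of the distance step with made-up indicator values (two indicators instead of five, for brevity):

```python
import pandas as pd
from scipy.spatial.distance import euclidean

# Illustrative per-year indicator averages (values are made up);
# the real frames come from the groupby/mean step described above.
adelphi = pd.DataFrame(
    {"frost_days": [10.0, 9.5], "dry_spells": [3.0, 3.2]},
    index=[2006, 2007],
)
ackerman = pd.DataFrame(
    {"frost_days": [6.0, 6.5], "dry_spells": [2.5, 2.9]},
    index=[2006, 2007],
)

# Euclidean distance between each city's yearly indicator vector and Adelphi's
yearly = pd.Series(
    {year: euclidean(adelphi.loc[year], ackerman.loc[year]) for year in adelphi.index},
    name="Ackerman",
)

# Averaging over the years gives the single similarity score
# (lower = more similar), mirroring similarAdelphi.mean(axis=0)
print(yearly.mean())
```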

To give farmers a visual, we plotted these scores as a bar graph.

It is worth noting that with similarity ratings, the lower the score, the more similar the cities. Of the three cities tested, Ackerman, MS has the best similarity rating. Ideally, the score would be as close to zero as possible, but we believe a seven is low enough to recommend looking further into Ackerman’s climate to determine crops that would be feasible to grow in Adelphi.

Finally, we realized that overall similarity may not be a good predictor on its own if the climates have been drifting apart over time. For example, if Ackerman had a very similar climate in 2006 but has been growing less similar to Adelphi ever since, the farmer may not want to use it in the decision-making process. Therefore, the last visualization was a line graph plotting the change in similarity year by year.

From this, it can be seen that despite some slight crests and troughs, Ackerman’s similarity has remained relatively close to its overall similarity rating. It appears that there may have been an outlying event that caused the similarity to stray farther away from Adelphi around 2020. However, in the following year, there was a noticeable dip back close to our recorded overall similarity.

Jaccard Similarity

We used the Jaccard similarity method to compare the cities to one another, as well as the weather indicators of each city to those of another city. We had four techniques for approaching this: first, an isolated 1-on-1 method; second, an all-at-once method; third, a method involving averages of the weather indicators; and lastly, a method involving lists of lists. Some of these techniques proved successful, and some not so much.

For the isolated 1-on-1 comparison, the weather indicators for a city are isolated and the Jaccard function we created is run to determine similarity in that indicator category. In the example below, Dry Spells for Adelphi is compared to Dry Spells for Albany, and the Jaccard similarity is output.

Next, the all-at-once comparison: in this method we compare all of the weather indicators for one city to all of those for another, displaying the individual similarities for each comparison in one function. With this technique, however, we encountered an error when comparing the last three weather indicators: Precipitation Percentile, Max Temperature, and Min Temperature. As shown in the example, all of Adelphi’s indicators are compared to all of Albany’s; a Jaccard similarity is output for the Dry Spells and Frost Days comparisons, but the rest of the indicators output a 0. We theorized that this was because our Jaccard function does not account for close values, only exactly equal ones.

Moving on to the averages method: we took all the values for a weather indicator in a city over the years and averaged them into a single value, did the same for another city, and then used our Jaccard function to compare the two values. This was a major failure, as every comparison returned 0; we go over why in the conclusion.

Lastly, we had the list-of-lists method, which proved the most effective. Here we stored all the values of each weather indicator of a city inside a list, then stored all those lists inside one larger list per city. Our Jaccard function then compares the two cities’ large lists and outputs a single similarity. As shown in the example below, the two large lists for Adelphi and Albany are compared, returning a Jaccard similarity of 0.0234009360374415.
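A sketch of a set-based Jaccard function consistent with the exact-match behaviour described above; the flattening in the list-of-lists step is one plausible reading of the approach, and the values are toy data:

```python
def jaccard(a, b):
    """Set-based Jaccard: |intersection| / |union|.

    Close-but-unequal values do not count as shared, which is why
    continuous indicators often score 0 with this function.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Isolated 1-on-1 style: one indicator per city (toy values)
adelphi_dry = [2, 3, 3, 4, 5]
albany_dry = [3, 4, 6, 7]
print(jaccard(adelphi_dry, albany_dry))  # shared {3,4} over {2,3,4,5,6,7} -> 0.333...

# List-of-lists style: pool every indicator list for a city, then compare
adelphi_all = [adelphi_dry, [0, 1, 1]]
albany_all = [albany_dry, [1, 2]]

def flatten(lists):
    return [v for sub in lists for v in sub]

print(jaccard(flatten(adelphi_all), flatten(albany_all)))  # -> 0.5
```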

Making Weather Predictions

Weather forecasting plays an important role in how farmers plan their farming, crop development, and crop yield. Making accurate weather predictions helps farmers better prepare for the future and plan their harvesting and planting. Each crop grows within a different temperature range; for instance, tomatoes’ optimal temperature ranges from 65 to 85 degrees Fahrenheit. So depending on the predicted max or min temperature, farmers can take the steps they believe best reduce the devastating impact of extreme temperatures, deciding what to plant and when. We have already collected weather data for each city dating from 2006, so our general aim here is to use this historical data to create a tool/model that predicts the next month’s weather.

Further cleaning was not necessary for the collected data frames of each city. However, to make the best possible predictions we fetched additional data for each month of this year, adding extra rows for January through March 2023. We then double-checked the updated data frames for null values; none of the rows had any. Our aim is to predict different weather indicators for each city, so the features we will be using are avg max temperature, avg min temperature, avg frost days, avg dry spell, and avg precipitation percentile. Since all of these features are continuous variables, we used regression as our supervised learning method, and decided ridge regression would be the best option for our prediction model. Ridge regression is an extension of linear regression that takes multicollinearity into account. Checking the correlation matrix of our features with corr() shows that several features, or independent variables, are highly correlated with each other, so ridge regression makes sense. For example, the correlation matrix for Adelphi is below.
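A toy illustration, with synthetic data, of the correlation check that motivated ridge regression (the column names mirror the indicators above; the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 60

# Synthetic monthly averages: min temp tracks max temp closely,
# which is exactly the multicollinearity ridge regression handles
max_t = rng.normal(75, 10, n)
df = pd.DataFrame({
    "avg_max_temp": max_t,
    "avg_min_temp": max_t - 20 + rng.normal(0, 2, n),
    "avg_frost_days": rng.poisson(3, n).astype(float),
})

# Highly correlated predictors show up off the diagonal
print(df.corr().round(2))

# Ridge adds an L2 penalty that stabilises coefficients under multicollinearity
model = Ridge(alpha=1.0)
```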

To make our predictions for next month, we need to create a target column for what we are trying to predict. For instance, if we are predicting max temperature, we copy that column and set it as our target. Because we are using past data to predict next month, we then shift the target column back one row, so that each month’s target value is next month’s observation and the model can learn from it. The edited data frame below is from Adelphi, with avg max high temperature as the target column.
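The shift can be sketched in pandas like this (toy values; the real column is a city’s monthly averages):

```python
import pandas as pd

df = pd.DataFrame({"avg_max_temp": [40.0, 45.0, 55.0, 66.0]})

# Copy the column we want to predict and shift it one row back,
# so each month's target is *next* month's value
df["target"] = df["avg_max_temp"].shift(-1)

# The final row has no "next month", so it is dropped before training
df = df.dropna(subset=["target"])
print(df)
```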

Now that we have our selected features and model, the next step was to split the data into training and testing sets with an 80/20 split. Since this is time-series data, it was difficult to split with the usual Sklearn method, so we split it manually with pandas iloc. With five weather metrics to predict and four cities, there were a total of 20 different models. It was therefore more efficient to write a function that takes a list of features, a city’s data frame, and the model to use.

This function, called predict(), splits the data and trains the passed model using the passed features and the target column. It makes predictions on the testing set and validates the model by calculating the mean absolute error and R² and plotting the predictions. The output of Adelphi’s avg max high temperature prediction is below.
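A minimal sketch of such a predict() helper (plotting omitted; the synthetic frame stands in for a real city’s indicators):

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score

def predict(features, df, model):
    """Chronological 80/20 split with iloc, fit, and evaluate."""
    split = int(len(df) * 0.8)
    train, test = df.iloc[:split], df.iloc[split:]

    model.fit(train[features], train["target"])
    preds = model.predict(test[features])

    mae = mean_absolute_error(test["target"], preds)
    r2 = r2_score(test["target"], preds)
    return preds, mae, r2

# Usage with a synthetic frame; the real frames hold one city's averages
df = pd.DataFrame({"avg_max_temp": [float(x) for x in range(30)]})
df["target"] = df["avg_max_temp"].shift(-1)
df = df.dropna()

preds, mae, r2 = predict(["avg_max_temp"], df, Ridge(alpha=1.0))
print(f"MAE: {mae:.3f}, R^2: {r2:.3f}")
```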

After making the predictions and evaluating each model, we found that some models were not as accurate as we expected. For instance, the avg precipitation percentile prediction for Albany gave an R² of 0.005, which is very low. Due to time constraints, we were not able to improve these models, but our initial thought was that taking a closer look at which features to include could improve them. Finally, we used each model to predict May’s weather for each city. The output is below.

SARIMA

To make the predictions more accurate, we tested another model, SARIMA, which stands for seasonal auto-regressive integrated moving average. There are four components to understand. Autoregression (AR) uses past values of a variable to predict the next value, which is useful here because the weather of tomorrow depends largely on the weather of yesterday. The integration component (I) transforms a non-stationary series into a stationary one by differencing consecutive values. For example, if every January there is an increase in temperature by 5 degrees, we can take the difference between months to remove trends and make the data stationary, so that the other parts of the model are more accurate. The moving average (MA) component factors in past errors to further improve accuracy. Finally, the seasonal component (S) tells the model the period over which to apply the differencing.

We know the seasonal period will be 12 for monthly weather data. We will use frost-day data from Adelphi to further understand the model. We use autocorrelation, which measures how the series correlates with lagged copies of itself, to verify that the data repeats every 12 months.

We could use autocorrelation to choose the autoregression and moving average terms as well, but here we simply fit several models with different values and compare them using the Akaike information criterion (AIC), which the model reports. For this implementation of SARIMA we use SARIMAX from the statsmodels library.

For our values of (S, AR, I, MA), we see that (12, 2, 1, 2) has the lowest AIC, so we use these values. This model gives a nice fit for Adelphi’s frost days.

This model has a mean squared error of 0.55 for this specific data, compared to 19.6 for the ridge model. Further analysis shows that SARIMA is very accurate for smoothly cyclic weather patterns, but not as much for dry spells and precipitation. We can therefore use this model in place of the ridge model for temperature and frost day predictions.

K-Means Clustering

This section focuses on our use of clustering to analyze the data. The insight we wanted to extract was which dates share the same weather characteristics. From the results of this analysis, farmers, our stakeholders, can decide (1) which month(s) are best to plant or harvest crops in all the cities at the same time, and (2) which crop(s) to plant at the same time in all the cities based on the similarity of weather characteristics. There can be severe agricultural losses when farmers’ decisions are not backed by accurate weather data analysis; shifts in soil processes, changes in nitrogen uptake, and pest populations are a few indirect effects of weather on agricultural crops.

K-means clustering is the technique we used for this analysis: a method for identifying clusters in data. More specifically, we used partitional clustering, which ensures that no cluster shares members with another and every cluster has at least one member. To use this technique you must specify the number of clusters; we specified three for this analysis. We created clusters of dates ranging from 2006 to 2022 for the four cities based on five weather entities. Before clustering with the sklearn.cluster KMeans package, we had to create a pandas data frame containing the average of the weather entities for each date in each city. The columns of the data frame are the weather entities (average dry spell, average frost days, average max temperature, average low temperature, and average precipitation percentage), and the dates of the cities were set as the indices.
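A minimal sketch of the clustering step on a toy frame (the column names mirror the five weather averages; the values are synthetic, and the real frame has 816 date rows):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame: rows are (city, date) observations, columns are the
# five weather averages described above
df = pd.DataFrame({
    "avg_dry_spell": [3, 2, 8, 7, 1, 2],
    "avg_frost_days": [10, 12, 0, 1, 11, 9],
    "avg_max_temp": [40, 38, 90, 88, 42, 41],
    "avg_min_temp": [25, 22, 70, 69, 26, 24],
    "avg_precip_pct": [0.4, 0.5, 0.1, 0.2, 0.6, 0.5],
})

# Partitional clustering: every row lands in exactly one of k clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)
df["cluster"] = km.labels_

# Cluster sizes, analogous to the date counts reported in the findings
print(df["cluster"].value_counts())
```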

The following are our findings after analyzing the data with k-means clustering. With k set to 3, we extracted three clusters from the data. The first cluster comprised 134 dates, the second 326 dates, and the third, the largest, 356 dates. To examine each cluster at a micro level, we drew a sample of eight dates from each cluster, created separate data frames with their weather information, and plotted line graphs for each. Average dry spells and average precipitation sit at similar levels in all three clusters. Average frost days are much higher in the first cluster (cluster 0) than in the rest. Average maximum temperature is mostly at the same level in all three clusters. Lastly, average minimum temperature is much higher in the first cluster than in the others. The months of March, January, and November showed up multiple times for Adelphi, Maryland in the first cluster, and the month of May showed up multiple times for Florida in the second. A farmer living in Maryland might decide not to plant crops in January and March because of the high average minimum temperature and high average frost days.

Decisions

As mentioned earlier these methods can be extremely beneficial to farmers of certain cities as they learn to apply similar strategies and countermeasures for their crops based on what has worked for other farmers in similar cities.

As a base for finding crops, of the three other cities analyzed, we feel confident suggesting that Adelphi farmers look first to Ackerman, MS. The overall climate similarity is close enough that crops grown there may reasonably be grown in Adelphi.

From the k-means cluster analysis, farmers, our stakeholders, can decide (1) which month(s) are best to plant or harvest crops in all the cities at the same time, and (2) which crop(s) to plant at the same time in all the cities based on the similarity of weather characteristics.

In terms of Jaccard similarities, comparing the individual weather indicators of each city to our target city of Adelphi with the 1-on-1 method, we found the city most similar to Adelphi for each indicator. For Dry Spells and Frost Days, Adelphi was closest to Alachua. This changes for the remaining three indicators, Precipitation Percentile, Max Temperature, and Min Temperature, where Ackerman is closest to Adelphi. Comparing the cities with the list-of-lists method, Ackerman is again the closest overall. We would therefore recommend Adelphi farmers look to Ackerman for issues relating to max and min temperature and precipitation, and to Alachua for frost days and dry spells.

With the May predictions in hand, and focusing solely on Adelphi, farmers can decide what crops to harvest or plant. They can use information like the predicted avg max high temperature of 89 degrees Fahrenheit to find crops whose optimal temperature range includes it. The avg precipitation percentile prediction was 0.039, so farmers should expect little to no precipitation in May and plant more drought-tolerant crops for the month to avoid heavy losses. As noted earlier, some models unfortunately did not give accurate results; the avg frost days and avg dry spell predictions seem far off from what can be expected in May, so they should not be taken into serious consideration until the models are refined or improved.

Conclusion

It is worth noting some limitations of our analysis. First, our analysis dealt with averaged data and does not reflect seasonal change. For example, a city could have a good similarity score because it has an identical spring season, and a relatively similar summer, but a completely different fall. Therefore, within this limitation, it may be hard to determine a crop based on the season without further data collection and analysis.

Overall, the list-of-lists method worked best for our Jaccard similarity, and the 1-on-1 comparison was the second most effective of the four. As for why two of the four methods failed, we found the problem in the intersection portion of the Jaccard similarity. Jaccard similarity compares the similarity and dissimilarity between two sets, defined as the size of the sets’ intersection over the size of their union. For example, for the sets {1,2,3,4} and {3,4,5,6}, the intersection is the shared values {3,4} and the union is all values from both sets listed once, {1,2,3,4,5,6}; the similarity is then intersection over union.
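The worked example translates directly into code, and also shows why continuous indicators collapse to 0 (the temperature values below are hypothetical):

```python
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

# Intersection over union, exactly as in the worked example above
similarity = len(a & b) / len(a | b)
print(similarity)  # 2 shared values over 6 distinct values -> 0.333...

# Continuous indicators rarely share exact values, so the
# intersection is empty and the score is always 0
c = {75.2, 80.1, 68.4}
d = {75.3, 79.9, 68.5}
print(len(c & d) / len(c | d))  # -> 0.0
```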

In our use of Jaccard, we found that when taking intersections between some weather indicators of one city and those of another, some pairs had no shared values. With an empty intersection, the result is always 0 regardless of the union. This is what happened when comparing Precipitation Percentile, Max Temperature, and Min Temperature: these indicators had no exactly shared values, so they displayed a Jaccard similarity of 0.

Each row of our data represents a month; the API could not provide daily data, which would have been a better option for predictions, giving us more data to feed into the training set and thus more accurate results. As mentioned before, some of our models did not evaluate well, so farmers can only rely on certain predictions. For instance, farmers in Adelphi can use max high temperature and precipitation percentile, but not dry spells.

The git repository can be found here

Appendix

John Lehner: Data Collection and overall climate similarity using Euclidean Distance.

Brice Tikum: City Weather Indicator Similarity using Jaccard

Luis Alvaro: Predicting weather metrics for each city using ridge regression.

Levis Forbang: K-means clustering

Andrew Lynch: SARIMA
