Weather Prediction using storm images (Part 2: Model training and feature training)

Shuvam Das · Published in deepkapha notes · 8 min read · Apr 11, 2023

Model

# Data handling and numerical libraries
import pandas as pd
import numpy as np

# Gaussian process regression and kernels
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, DotProduct

# Plotting and animation
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns

# Design-of-experiments sampling and distributions
from pyDOE import *
from scipy.stats.distributions import norm
from datetime import time

# Seaborn styling
sns.set_style(
    style='darkgrid',
    rc={'axes.facecolor': '.9', 'grid.color': '.8'}
)
sns.set_palette(palette='deep')
sns_c = sns.color_palette(palette='deep')
%matplotlib inline

The code above is a Python script that imports several libraries and performs various operations related to data analysis, machine learning, and visualization. The code begins by importing Pandas and NumPy libraries using the ‘pd’ and ‘np’ aliases, respectively. The Pandas library is used to manipulate and analyze tabular data, while NumPy is used for mathematical operations and array manipulation. The script also imports several other libraries, including sklearn.gaussian_process, matplotlib, seaborn, pyDOE, and scipy.stats.distributions.

The next lines of code import specific classes from the ‘sklearn.gaussian_process’ library: GaussianProcessRegressor and the RBF, WhiteKernel, and DotProduct kernels. These are used for building a Gaussian process regression model, a powerful machine learning method that can model complex non-linear relationships between variables.

The script then sets the style of the seaborn visualization library, which is used for creating plots and graphs. The style is set to ‘darkgrid’, and some customizations are made to the axes and grid colors to achieve a specific aesthetic.

Python Script for Loading and Inspecting Training Data Set using Pandas and NumPy

# Load the training features and inspect the unique storms
train_set_features = pd.read_csv("nasa-data/training_set_features.csv")
unique_storms = np.unique(train_set_features["Storm ID"])
print(len(unique_storms))
unique_storms[0]

# Load the labels and merge them with the features on Image ID
train_set_labels = pd.read_csv("nasa-data/training_set_labels.csv")
train_set_labels.head()
int_df = pd.merge(train_set_features, train_set_labels, how='inner', on='Image ID')
int_df.head()

The next lines of code read in two CSV files using the Pandas library. The first file is named ‘training_set_features.csv’, which contains the features of the training data set. The second file is named ‘training_set_labels.csv’, which contains the labels or target values of the training data set. The script then creates a NumPy array of unique storm IDs from the training set features using the ‘np.unique’ method and prints the length of the array to the console. Finally, the script prints the first element of the unique_storms array.

The ‘head’ method called on the training set labels returns the first five rows of the data set by default, allowing the user to quickly inspect the data and make sure it has been loaded correctly.

In summary, this code imports several libraries for data analysis, machine learning, and visualization; reads in two CSV files containing the features and labels of a training data set; creates a NumPy array of unique storm IDs; and prints the length of that array and its first element to the console. It also displays the first few rows of the training set labels. The next segment of code merges the previously loaded train_set_features and train_set_labels DataFrames using the pd.merge() function. The resulting DataFrame int_df contains only the rows whose Image ID appears in both DataFrames; the how='inner' parameter ensures that rows without a matching Image ID are dropped.

Next, a loop iterates through all unique values of Storm ID in the int_df DataFrame. For each unique storm, a new DataFrame nhe is created that contains only the rows for that storm, and a plot is drawn with the pandas plot() method, with Relative Time on the x-axis and Wind Speed on the y-axis. The titlestr variable is updated to include the current Storm ID and is used as the plot title. The plot is then saved as an image in the image-storm directory, named after the Storm ID of the current storm, and plt.close() closes the current figure.

The next segment of code defines the number of samples n as 70257 and creates a NumPy array t using np.arange(n). A DataFrame data_df is then created with a single column labeled t containing the values of t. This DataFrame is intended as a placeholder for the synthetic data.

In this code segment, the primary purpose of the for-loop is to create a plot for each unique storm ID and save it as an image. The resulting images visualize the relationship between Relative Time and Wind Speed for each storm, which may help identify patterns and trends in the data. The synthetic data_df DataFrame is not used here but reappears in later segments. A sketch of this step is shown below.
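Since the article describes this loop and the data_df placeholder but does not show the code, here is a minimal sketch. It assumes the column names (Storm ID, Relative Time, Wind Speed), the image-storm output directory, and n = 70257 mentioned in the text; the exact implementation in the original notebook may differ.

import os

# Hypothetical reconstruction of the per-storm plotting loop described above.
os.makedirs("image-storm", exist_ok=True)
for storm_id in unique_storms:
    nhe = int_df[int_df["Storm ID"] == storm_id]           # rows for this storm only
    ax = nhe.plot(x="Relative Time", y="Wind Speed")       # wind speed over relative time
    titlestr = "Storm ID: " + str(storm_id)                # assumed title format
    ax.set_title(titlestr)
    plt.savefig("image-storm/" + str(storm_id) + ".png")   # one image per storm
    plt.close()

# Placeholder frame later used to hold the target series.
n = 70257
t = np.arange(n)
data_df = pd.DataFrame({'t': t})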

Python Script for Generating Seasonal Component of Target Variable Using Pandas and NumPy

def seasonal(int_df, unique_storms):
    """Concatenate the Wind Speed series of every storm into one target series."""
    actual_speed = []
    for i in range(494):  # 494 unique storms in the training set
        actual_speed.append(int_df[int_df["Storm ID"] == unique_storms[i]]["Wind Speed"])
    # Flatten the list of per-storm series into a single list
    y1 = [item for sublist in actual_speed for item in sublist]
    return y1

# Define target variable.
data_df['y1'] = seasonal(int_df, unique_storms)

# Let us plot this seasonal variable:
fig, ax = plt.subplots()
sns.lineplot(x='t', y='y1', data=data_df, color=sns_c[0], label='y1', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='Seasonal Component', xlabel='t', ylabel='');

The code begins by defining a function called seasonal, which takes two arguments, int_df and unique_storms. Its purpose is to build the target series for Wind Speed that plays the role of a seasonal component: the function loops through each unique storm in the dataset, extracts the Wind Speed data for that storm, and appends it to a list called actual_speed. Once the function has looped through all the storms, it returns the flattened list of Wind Speed values for all storms.

Next, the code defines the target variable y1 as the series generated by the seasonal function. Recall that data_df was created earlier with a single column t containing the values 0 to n-1, where n is the total number of data points; the code now adds a y1 column to this DataFrame with the values returned by seasonal.

Plotting the Seasonal Component and Preparing Data for Model Training

To visualize the seasonal component, the code creates a matplotlib figure with a single subplot using the subplots() function. The lineplot() function from the seaborn library is used to plot the seasonal component against the time variable t. The color argument is set to sns_c[0] which is the first color in the seaborn color palette. The label argument is set to ‘y1’ and the ax argument is set to ax to ensure that the plot is drawn on the subplot created earlier. The legend() function is then used to add a legend to the plot and the set() function is used to set the title and axis labels for the plot.

The code then defines the predictor variable X as the t values in a numpy array with shape (n, 1) where n is the total number of data points. The target variable y is defined as the y1 values in a numpy array with the same shape as X. The code then defines a variable prop_train which represents the proportion of the data to use for training the model. In this case, prop_train is set to 0.7. The code then defines n_train as the total number of data points to use for training, which is calculated by rounding the product of prop_train and n. However, the next line of code overwrites this value and sets n_train to 10000.

The code then creates two numpy arrays X_train and y_train which contain the first n_train values of X and y respectively. The remaining data points are used to create the numpy arrays X_test and y_test. In this case, X_test contains the next 100 data points after the training data and y_test contains the corresponding y1 values for those data points.
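The code for this preparation step is not shown in the article; a minimal sketch, using the variable names and values given in the text (prop_train = 0.7, n_train overridden to 10000, a 100-point test window), might look like this:

# Predictor and target as (n, 1) arrays.
X = data_df['t'].values.reshape(n, 1)
y = data_df['y1'].values.reshape(n, 1)

prop_train = 0.7
n_train = round(prop_train * n)  # roughly 70% of the data ...
n_train = 10000                  # ... but immediately overridden, as noted above

X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:n_train + 100]  # the next 100 points after the training window
y_test = y[n_train:n_train + 100]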

Setting up a Gaussian Process (GP) Regression Model

from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared, ConstantKernel

# Observation-noise (white) kernel
k0 = WhiteKernel(noise_level=0.3**2, noise_level_bounds=(0.1**2, 0.5**2))

# Periodic kernel scaled by a constant amplitude
k1 = ConstantKernel(constant_value=2) * \
    ExpSineSquared(length_scale=1.0, periodicity=40, periodicity_bounds=(35, 45))

kernel_1 = k0 + k1

from sklearn.gaussian_process import GaussianProcessRegressor
gp1 = GaussianProcessRegressor(
    kernel=kernel_1,
    n_restarts_optimizer=10,
    normalize_y=True,
    alpha=0.0
)

This code block is setting up a Gaussian Process (GP) regression model for the given data. The GP model is a probabilistic machine learning model that can be used for regression or classification tasks. In this specific case, the model is being set up for regression. The first line of code imports necessary modules for the model setup, including the WhiteKernel and ExpSineSquared kernels from the Gaussian process package, the ConstantKernel, and the GaussianProcessRegressor. The next few lines of code define the kernel that will be used for the GP model. A kernel is a function that measures the similarity between two inputs in the input space. The kernel_1 variable is defined as a combination of two kernels — a white noise kernel and an exponential sine squared kernel.

The white kernel is used to model observation noise, which is the variance of the noise that cannot be explained by the model. The noise_level parameter of the WhiteKernel is set to 0.3², an estimate of the noise level based on prior knowledge, and the noise_level_bounds parameter is set to (0.1², 0.5²), the lower and upper bounds for the noise level.

Setting up the Gaussian Process Regression Model

The exponential sine squared kernel is a periodic kernel that captures periodicity in the data. It is scaled by a ConstantKernel, which multiplies the exponential sine squared kernel by a constant value of 2. The length_scale parameter of the exponential sine squared kernel is set to 1.0, which determines the smoothness of the kernel. The periodicity parameter is set to 40, the assumed period of the seasonal component in the data, and the periodicity_bounds parameter is set to (35, 45), the lower and upper bounds for the period.

# Draw 100 samples from the GP prior and plot them against the training target.
gp1_prior_samples = gp1.sample_y(X=X_train, n_samples=100)

fig, ax = plt.subplots()
for i in range(100):
    sns.lineplot(x=X_train[:, 0], y=gp1_prior_samples[:, i], color=sns_c[1], alpha=0.2, ax=ax)
sns.lineplot(x=X_train[:, 0], y=y_train[:, 0], color=sns_c[0], label='y1', ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set(title='GP1 Prior Samples', xlabel='t');

Finally, the GaussianProcessRegressor is initialized with the kernel_1 defined above. The n_restarts_optimizer parameter is set to 10, meaning the optimizer is restarted 10 times to search for the best kernel hyperparameters. The normalize_y parameter is set to True, which standardizes the target values before fitting the model. The alpha parameter is set to 0.0; alpha is the value added to the diagonal of the kernel matrix during fitting and acts as a regularization (jitter) term, which can be left at zero here since the WhiteKernel already accounts for observation noise.

Visualizing Prior Samples from a Gaussian Process Regression Model

This block of code relates to Gaussian process regression. Earlier, the data were prepared for training and testing: the values of the time variable t and of y1 were reshaped to 2D arrays of shape (n, 1), where n is the number of samples, and a portion of the data was selected for training (the first n_train samples), with the following 100 samples held out for testing.

Next, the code generates prior samples from the Gaussian process using the sample_y() function of gp1. This function generates prior samples from the Gaussian process with the specified kernel and input variables. Here, 100 samples are generated and plotted in a graph along with the training data. This is done to show the range of values that the Gaussian process can predict before any data is observed. The prior samples are plotted as light-colored lines with low alpha to show their uncertainty.

Part 3 of this article will cover parameter tuning, accuracy scores, and related evaluation.

The GitHub repo link is here: https://github.com/deepkapha/EarthScanWebinar
