Improving meteorological and ocean models with Machine Learning
Part 2: Applying deep learning to enhance the visibility variable predicted by the meteorological model
As we saw in the previous article (Part 1), the model's performance for the visibility variable is poor. We are going to evaluate that performance and try to improve it with deep learning.
We start by loading a data frame from my GitHub account:
import pandas as pd
# load the master CSV from GitHub
url = "https://raw.githubusercontent.com/granantuin/LEVX_class/master/maestro.csv"
master = pd.read_csv(url, index_col="datetime", parse_dates=True)
The variables of the data frame are explained in Part 0 of this series. Our dependent variable will be "visibility_o", and the independent variables will be those predicted by the meteorological model, which carry the suffix "_p". Therefore:
master_f = master[['dir_p', 'lhflx_p', 'mod_p', 'prec_p', 'rh_p', 'visibility_p',
                   'mslp_p', 'temp_p', 'cape_p', 'cfl_p', 'cfm_p', 'cin_p',
                   'conv_prec_p', 'visibility_o']]
master_f contains only the variables we are interested in. We can plot the correlation matrix using the Seaborn library:
import seaborn as sns
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.heatmap(master_f.corr(), annot=True, linewidths=.5)
And the results are:
The correlation between observed and predicted visibility is weak. We can assess our meteorological model with the same metrics we would use to assess a machine learning algorithm. It is vital to forecast when visibility drops below a threshold, so we can frame this as a binary classification problem: our goal is to detect when visibility falls below a given threshold. First, we define four thresholds, 50, 500, 1000, and 5000 meters, and evaluate the model's capacity to predict each of them.
from sklearn.metrics import confusion_matrix, classification_report

for threshold in [50, 500, 1000, 5000]:
    master_f["visibility_o_" + str(threshold)] = [c <= threshold for c in master_f.visibility_o]
    master_f["visibility_p_" + str(threshold)] = [c <= threshold for c in master_f.visibility_p]
    print("**** Confusion matrix threshold: " + str(threshold) + "m ****")
    print(confusion_matrix(master_f["visibility_o_" + str(threshold)],
                           master_f["visibility_p_" + str(threshold)],
                           labels=[False, True]))
    print("***************")
    target_names = [">" + str(threshold) + "m", "<=" + str(threshold) + "m"]
    print(classification_report(master_f["visibility_o_" + str(threshold)],
                                master_f["visibility_p_" + str(threshold)],
                                target_names=target_names))
The results are:
From the output above we can read several metrics of the meteorological model's performance. For instance, at the 500-meter threshold (visibility below 500 meters), the support column shows 3034 hours with visibility of 500 meters or less and 58222 hours with more. Looking at the confusion matrix, the model correctly predicted visibility below 500 meters 548 times, issued 2189 false alarms (forecasting below 500 meters when it was not), and missed 2486 hours when visibility actually was below 500 meters.
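As a sanity check, precision and recall at the 500-meter threshold can be recomputed directly from those confusion-matrix counts (the three numbers below are taken from the report above):

```python
# Counts from the 500 m confusion matrix: hits, false alarms, misses
tp, fp, fn = 548, 2189, 2486

precision = tp / (tp + fp)   # fraction of <=500 m forecasts that verified
recall = tp / (tp + fn)      # fraction of actual <=500 m hours that were caught

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Both values are around 0.2, which confirms how weakly the raw model handles low-visibility events.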
I plot the results as ROC curves and compute the AUC to assess the model for several thresholds (visibility ranges).
from itertools import cycle
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

# ROC curves for each visibility range in one figure
ranges = [50, 500, 1000, 5000]
fprl = []
tprl = []
thresholdsl = []
roc_aucl = []

for vis_range in ranges:
    y_data = [False if c <= vis_range else True for c in master_f.visibility_o]
    y_pred = master_f.visibility_p
    fpr, tpr, thresholds = metrics.roc_curve(y_data, y_pred)
    roc_auc = auc(fpr, tpr)
    fprl.append(fpr)
    tprl.append(tpr)
    thresholdsl.append(thresholds)
    roc_aucl.append(roc_auc)

plt.figure(figsize=[12, 10])
n_ranges = len(ranges)
colors = cycle(['aqua', 'darkorange', 'cornflowerblue', 'red'])
for i, color in zip(range(n_ranges), colors):
    plt.plot(fprl[i], tprl[i], color=color, lw=2,
             label='ROC curve of range {0} (area = {1:0.2f})'
             .format(ranges[i], roc_aucl[i]))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC range:")
plt.legend(loc="lower right")
plt.show()
We can build a neural network to improve the results. We define the dependent and independent variables: the dependent variable is the observed visibility, and the independent variables are those forecast by the model. Let's do it:
threshold1 = 500
y_data = pd.DataFrame({"datetime": master_f.index,
                       "visibility_o": [1 if c <= threshold1 else 0 for c in
                                        master_f["visibility_o"]]}).set_index("datetime")

x_data = master_f[['dir_p', 'lhflx_p', 'mod_p', 'prec_p', 'rh_p', 'visibility_p',
                   'mslp_p', 'temp_p', 'cape_p', 'cfl_p', 'cfm_p', 'cin_p',
                   'conv_prec_p']]
We build a neural network and display the results:
import numpy as np

# neural network
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, cross_validate
import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model, load_model, Sequential
from tensorflow.keras.layers import Input, Dense, Dropout, AlphaDropout
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# scale the features (x_data or PCA vectors)
scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(x_data)

x_train, x_test, y_train, y_test = train_test_split(scaled_df, y_data.visibility_o,
                                                    test_size=0.2)

# weight each class by the frequency of the other to offset the class imbalance
class_weight = {0: (sum(y_train == 1) / len(y_train)),
                1: (sum(y_train == 0) / len(y_train))}

mlp = Sequential()
mlp.add(Input(shape=(x_train.shape[1], )))
mlp.add(Dense(24, activation='relu'))
mlp.add(Dropout(0.5))
mlp.add(Dense(24, activation='relu'))
mlp.add(Dropout(0.5))
mlp.add(Dense(1, activation='sigmoid'))
mlp.summary()

mlp.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
            loss='binary_crossentropy',
            metrics=['accuracy', tf.keras.metrics.Recall()])

history = mlp.fit(x=x_train,
                  y=y_train,
                  batch_size=128,
                  epochs=50,
                  validation_data=(x_test, y_test),
                  class_weight=class_weight,
                  verbose=1).history

pd.DataFrame(history).plot(grid=True, figsize=(12, 12),
                           yticks=np.linspace(0.0, 1.0, num=11))
y_pred = mlp.predict(x_test)
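The class_weight dictionary above weights each class by the frequency of the opposite class, so the rare low-visibility hours carry more weight in the loss than the abundant clear hours. A minimal sketch of the same scheme on a hypothetical label vector:

```python
import numpy as np

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_train = np.array([0] * 90 + [1] * 10)

# Same scheme as above: each class is weighted by the other class's frequency,
# so the rare positive class contributes roughly equally to the total loss.
class_weight = {
    0: np.sum(y_train == 1) / len(y_train),  # weight for the majority class
    1: np.sum(y_train == 0) / len(y_train),  # weight for the minority class
}
print(class_weight)  # {0: 0.1, 1: 0.9}
```

Without this weighting, a network trained on such data can reach high accuracy simply by always predicting the majority class.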
It seems the neural network performs well: the validation loss goes down, and the validation recall goes up. Now we plot two box plots. One depicts the values the network outputs when y_test is one, meaning visibility below the threshold (500 meters in our case); the other shows its outputs when y_test is zero (visibility above 500 meters). Ideally, the network should output values near one in the first case and near zero in the second. The code is:
result = pd.DataFrame({"y_test": y_test.values,
                       "y_pred": y_pred.reshape(1, -1)[0],
                       "datetime": y_test.index}).set_index("datetime")
pd.DataFrame({"y_pred test==1": result["y_pred"][result.y_test == 1],
              "y_pred test==0": result["y_pred"][result.y_test == 0]}).plot(kind="box")
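The separation the box plots show can also be summarized numerically, for instance by comparing the median network output per class. A small sketch on hypothetical predictions (not the article's data):

```python
import pandas as pd

# Hypothetical stand-in for the `result` frame built above: well-separated
# predictions cluster near 1 for y_test == 1 and near 0 for y_test == 0.
result = pd.DataFrame({
    "y_test": [1, 1, 1, 0, 0, 0],
    "y_pred": [0.9, 0.7, 0.6, 0.2, 0.4, 0.1],
})

# Median predicted value for each true class
summary = result.groupby("y_test")["y_pred"].agg(["mean", "median"])
print(summary)
```

The gap between the two medians suggests where a discriminating cut-off can sit, which is exactly what we read off the box plot next.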
We are looking for a threshold that discriminates between 1 and 0. I chose 0.55, following the figure above: if the neural network outputs a value below 0.55, we forecast visibility above 500 meters, and otherwise below. The code:
# select threshold_nor
threshold_nor = 0.55
y_pred_nor = [0 if c <= threshold_nor else 1 for c in result.y_pred]
target_names = [">" + str(threshold1) + "m", "<=" + str(threshold1) + "m"]
print(classification_report(y_test.values, y_pred_nor, target_names=target_names))
print("**** Confusion matrix ****")
print(confusion_matrix(y_test, y_pred_nor))
We get:
It seems these metrics are slightly better than the meteorological model's. Finally, we plot the ROC curve and calculate the area under the curve (AUC):
# ROC of the neural network
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC")
plt.legend(loc="lower right")
plt.show()
And get:
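Instead of reading the cut-off from the box plot by eye, one could also pick the point on the ROC curve that maximizes Youden's J statistic (tpr minus fpr). A minimal sketch on synthetic scores (not the article's data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic example: scores that mostly separate the two classes
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.45, 0.7, 0.8, 0.85, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = np.argmax(tpr - fpr)  # index of the point with maximum Youden's J
print("threshold with max Youden's J:", thresholds[best])
```

This gives an objective operating point that balances hit rate against false alarms, which could replace the hand-picked 0.55 above.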
Conclusion
Perhaps this is a step toward improving a meteorological model with deep learning algorithms…
Thank you!