Improving meteorological and ocean models with Machine Learning (part 1: assessing the meteorological model)

Jorge Robinat
Published in Analytics Vidhya
4 min read · Sep 13, 2019

In the previous part (part 0), we built a data frame whose columns are the meteorological variables: those observed at the meteorological station ("_o" suffix) and those predicted by the mathematical model ("_p" suffix). Each row corresponds to one hour and is indexed by its datetime.

First, we get the data frame and import some libraries to plot results:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# load the data from the GitHub repository
url = "https://raw.githubusercontent.com/granantuin/LEVX_class/master/maestro.csv"
master = pd.read_csv(url, index_col="datetime", parse_dates=True)

The observed variables dir_o, mod_o, wind_gust_o, mslp_o, temp_o, rh_o, and visibility_o each have a predicted counterpart (dir_p, mod_p, and so on).
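As a quick sanity check on that naming convention, here is a minimal sketch that lists the variables having both an observed and a predicted column. The helper name paired_variables and the tiny demo frame are made up for illustration; they are not part of the original data set.

```python
import pandas as pd

def paired_variables(df):
    """Return base names that have both an observed (_o) and a predicted (_p) column."""
    observed = {c[:-2] for c in df.columns if c.endswith("_o")}
    predicted = {c[:-2] for c in df.columns if c.endswith("_p")}
    return sorted(observed & predicted)

# Tiny made-up frame that follows the same naming convention (illustrative values only)
demo = pd.DataFrame({
    "temp_o": [285.0, 286.5], "temp_p": [284.2, 287.0],
    "mslp_o": [101300.0, 101250.0], "mslp_p": [101280.0, 101240.0],
    "rh_o": [80.0, 70.0],  # no rh_p here, so "rh" is not reported as a pair
})
print(paired_variables(demo))  # ['mslp', 'temp']
```

On the real data frame, the same call would be paired_variables(master).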

Let’s start with the temperature variable and display a scatter plot with temperature observed and predicted. Type:

sns.scatterplot(x="temp_o", y="temp_p", data=master)

and we get:

temperature predicted (temp_p) vs. temperature observed (temp_o)

Oops! There is a clear outlier at the upper center of the plot. We should filter the data frame:

master_f=master[master.temp_o < 310].copy()

The correlation looks good: approximately 0.948.
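The post does not show how the correlation was computed. One plausible way is the pandas Series.corr method, which returns the Pearson coefficient by default. The sketch below uses made-up values, with one spurious reading to mimic the outlier, to show how much a single bad point can distort the coefficient.

```python
import pandas as pd

# Made-up temperatures; the last observed value mimics a spurious station reading
demo = pd.DataFrame({
    "temp_o": [283.0, 285.0, 287.0, 289.0, 999.0],
    "temp_p": [283.5, 284.5, 287.5, 288.5, 286.0],
})

# Same kind of filter as in the post: drop physically implausible observations
demo_f = demo[demo.temp_o < 310].copy()

r_raw = demo["temp_o"].corr(demo["temp_p"])            # Pearson, outlier included
r_filtered = demo_f["temp_o"].corr(demo_f["temp_p"])   # Pearson, outlier removed
print(f"with outlier: {r_raw:.3f}  filtered: {r_filtered:.3f}")
```

On the real data, the equivalent call would be master_f["temp_o"].corr(master_f["temp_p"]).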

The code below compares two sets of differences: observed temperature (temp_o) minus the temperature predicted by the model itself (temp_p), and observed temperature minus a new predicted variable (temp_p_lr). The new variable is obtained by fitting a linear regression of the observed temperature on the predicted temperature.

from sklearn.linear_model import LinearRegression
# fit temp_o as a linear function of temp_p
X = master_f["temp_p"].values.reshape(-1, 1)
reg = LinearRegression().fit(X, master_f["temp_o"].values)
master_f["temp_p_lr"] = reg.predict(X)
# differences for the raw model and for the regression-corrected forecast
master_f["dif_temp"] = master_f["temp_o"] - master_f["temp_p"]
master_f["dif_temp_lr"] = master_f["temp_o"] - master_f["temp_p_lr"]
master_f[["dif_temp", "dif_temp_lr"]].plot(kind="box", grid=True, figsize=(8, 8))

The summary statistics from the describe() method are:

And the box plot:

We can say the model performs well. The mean difference between observed and predicted temperature is nearly zero (0.1357), with a standard deviation of about 1.9. Linear regression brings the mean to almost 0 while leaving the standard deviation about the same.
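The reason the mean moves to almost 0 is a general property of least squares: when the fit includes an intercept, the residuals average to zero on the training data, and their spread cannot be larger than that of the raw differences. Here is a minimal sketch on synthetic data (the numbers are invented, and np.polyfit stands in for LinearRegression to keep the example dependency-light):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for temp_p and temp_o: observation = slightly biased forecast + noise
temp_p = rng.uniform(275.0, 300.0, size=500)
temp_o = 0.97 * temp_p + 9.0 + rng.normal(0.0, 1.9, size=500)

# Least-squares line: temp_o ~ slope * temp_p + intercept
slope, intercept = np.polyfit(temp_p, temp_o, deg=1)
temp_p_lr = slope * temp_p + intercept

dif_temp = temp_o - temp_p          # raw model error
dif_temp_lr = temp_o - temp_p_lr    # error after the linear correction
for name, d in [("raw", dif_temp), ("corrected", dif_temp_lr)]:
    print(f"{name}: mean={d.mean():.4f} std={d.std():.4f}")
```

Note that these statistics are computed on the same data used for the fit, so they are optimistic; a held-out period would give a fairer picture.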

The next variable that performs well is pressure reduced to sea level, with a correlation of 0.98. That is difficult to improve! We can display the result with the Seaborn library. There is an outlier here too; to filter it out, use:

master_f=master[master.mslp_o>90000].copy()

With Seaborn, we fit a linear regression and aggregate the points into 10 bins. We get:

The code to obtain the plot above is:

sns.regplot(x=master_f.mslp_o/100, y=master_f.mslp_p/100, x_bins=10, data=master_f)

We divide by 100 to get a friendlier unit, millibars. It seems that the model performs well, and we don't need any Machine Learning algorithm to improve it.
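To make the binning idea concrete, the sketch below groups synthetic pressures into 10 equally populated bins with pd.qcut and averages each bin. This is only a rough analogue of what x_bins does, not Seaborn's exact procedure, and the data are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic pressures in pascals (illustrative, not the station data)
mslp_o = rng.normal(101500.0, 800.0, size=1000)
mslp_p = mslp_o + rng.normal(0.0, 150.0, size=1000)

# Pa -> millibars, as in the post
demo = pd.DataFrame({"mslp_o": mslp_o / 100, "mslp_p": mslp_p / 100})

# Split the observed pressure into 10 equally populated bins and average each bin
demo["bin"] = pd.qcut(demo["mslp_o"], q=10)
binned = demo.groupby("bin", observed=True)[["mslp_o", "mslp_p"]].mean()
print(binned.round(1))
```

If the model is unbiased, the two columns of binned stay close to each other across the whole pressure range, which is what the binned regression plot shows visually.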

Now let's see whether the model performs well on the other variables: wind direction, wind intensity, relative humidity, and visibility. The code would be:

# run each line in its own cell (or call plt.figure() first) so the plots do not overlap
sns.scatterplot(x="dir_p", y="dir_o", data=master[master.dir_o != -1])
sns.scatterplot(x="mod_p", y="mod_o", data=master)
sns.scatterplot(x="rh_p", y="rh_o", data=master)
sns.scatterplot(x="visibility_p", y="visibility_o", data=master[master.visibility_p < 10000])

We filter out variable wind direction (coded as -1 in dir_o) and put a threshold of 10000 meters on the visibility variable because that is the maximum visibility reported at the airport station.

The scatter plots for these variables are:

wind direction observed vs. wind direction predicted
wind intensity observed vs. wind intensity predicted
visibility observed versus visibility predicted
humidity observed vs. humidity predicted

The correlation coefficients are:

  • wind intensity (mod variable): 0.63
  • wind direction (dir variable): 0.33
  • relative humidity (rh variable): 0.76
  • visibility (visibility variable): 0.36
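The post does not include the code behind these numbers. A minimal sketch of one way to compute them, assuming Pearson correlation and the same row filters used for the scatter plots, could look like this. The helper pair_correlations and the synthetic frame are hypothetical.

```python
import numpy as np
import pandas as pd

def pair_correlations(df, names, filters=None):
    """Pearson correlation between <name>_o and <name>_p for each name.

    filters maps a name to a function returning a boolean row mask for df.
    """
    filters = filters or {}
    out = {}
    for name in names:
        sub = df[filters[name](df)] if name in filters else df
        out[name] = sub[f"{name}_o"].corr(sub[f"{name}_p"])
    return pd.Series(out)

# Synthetic data with the same column naming (invented values)
rng = np.random.default_rng(2)
n = 300
demo = pd.DataFrame({"mod_p": rng.uniform(0, 15, n), "dir_p": rng.uniform(0, 360, n)})
demo["mod_o"] = demo["mod_p"] + rng.normal(0, 3, n)
demo["dir_o"] = demo["dir_p"] + rng.normal(0, 40, n)
demo.loc[::10, "dir_o"] = -1  # variable wind direction, coded as -1

filters = {"dir": lambda d: d.dir_o != -1}
result = pair_correlations(demo, ["mod", "dir"], filters)
print(result.round(2))
```

Keep in mind that wind direction is an angle, so a plain Pearson coefficient treats 359 and 1 degrees as far apart; the value for dir should be read with that caveat.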

Conclusions and outlook

The model performs well for pressure and temperature. Pressure is predicted with high accuracy (a correlation of 0.98). Temperature, with a correlation of 0.95, can be slightly improved using a one-variable linear regression.

Other predicted variables, such as wind speed, wind direction, and visibility, need more powerful machine learning algorithms because the model's results are messy. Relative humidity sits in the middle, and perhaps we can improve its forecast with only a linear regression.

The good news is that the model performs poorly on four variables, so we can enhance it by applying all kinds of machine learning techniques. We could set up binary classification models to detect dangerous thresholds such as high wind intensity or low visibility.
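As a starting point for such a binary problem, one could turn a continuous variable into an event label and check how often the raw model already detects the event. The sketch below does this for low visibility; the 1000-meter threshold, the helper event_table, and the sample values are assumptions made for illustration only.

```python
import pandas as pd

def event_table(observed, predicted, threshold, below=True):
    """2x2 contingency table for the event 'value below (or above) threshold'."""
    obs = observed < threshold if below else observed > threshold
    pred = predicted < threshold if below else predicted > threshold
    labels = {True: "event", False: "no event"}
    return pd.crosstab(obs.map(labels), pred.map(labels),
                       rownames=["observed"], colnames=["predicted"])

# Invented visibilities in meters
vis_o = pd.Series([300, 800, 5000, 9000, 600, 10000])
vis_p = pd.Series([500, 4000, 6000, 700, 900, 10000])

table = event_table(vis_o, vis_p, threshold=1000)  # hypothetical threshold
print(table)
```

The cell (event, event) counts hits, (event, no event) counts misses, and (no event, event) counts false alarms. A classifier trained on the model outputs would aim to beat this baseline table.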

We could also set up classification problems that forecast the wind direction only as a main direction, such as North, North East, East, and so on. Another kind of question is the regression problem, which applies to all of the continuous variables. As inputs, we must either select some of the model's predicted variables directly or transform them using principal component analysis.
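For the wind direction classes, a minimal sketch of the labeling step might look like the following. It assumes directions are given in degrees with 0 meaning north and increasing clockwise, and it uses 8 sectors of 45 degrees each; both are assumptions, and the function name to_sector is made up.

```python
import numpy as np

SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def to_sector(degrees):
    """Map wind direction in degrees (0 = north, clockwise) to one of 8 sectors."""
    deg = np.atleast_1d(np.asarray(degrees, dtype=float)) % 360
    # Shift by half a sector so that N covers 337.5..22.5 degrees
    idx = np.floor((deg + 22.5) / 45).astype(int) % 8
    return [SECTORS[i] for i in idx]

print(to_sector([0, 44, 90, 200, 350]))  # ['N', 'NE', 'E', 'S', 'N']
```

Rows with variable wind (coded as -1 in dir_o) would need to be removed, or given their own class, before applying this mapping.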

In the next post, I will try to predict the visibility threshold using deep learning. I forgot the bad news. Thanks for reading!
