Improving meteorological and ocean models with Machine Learning

Part 0: Set up a data frame

Jorge Robinat
Analytics Vidhya
5 min read · Sep 8, 2019

Meteorological and ocean models are the essential tools for forecasting variables such as wind intensity, air temperature, cloud cover, rain intensity, and wave height. A wide variety of applications and web pages display this meteorological and oceanic information to everyone.

Meteorological models are mathematical representations of the atmosphere. A model calculates the values of meteorological or oceanic variables for several time intervals and several places. Let's look at an example.

The picture above shows output from the GFS model. The variable displayed is wind speed and direction. The model takes the initial conditions (06Z Tuesday 3 September 2019) and calculates the wind 45 hours later (03Z Thursday 5 September 2019). Models forecast the values of meteorological variables only at certain places, and those places form the model grid. The figure shows a model grid and the position of Vigo airport (Spain).

The points labeled 0 to 8 are the grid points where the meteorological model defines the forecast values. The meteorological station is at Vigo airport (ICAO code LEVX). We aim to build a machine learning model whose inputs are the outputs of the meteorological model, which forecasts all the meteorological variables at points 0 to 8 for different times in the future.

First, we need to set up a database where rows are dates and columns are the observed variables (from the meteorological station) and the forecast variables (from the meteorological model).

The meteorological model that I used to build the database is maintained by Meteogalicia (a public meteorological service). Meteogalicia runs a WRF model over the Galicia region (northwest Spain). They publish it through THREDDS (Thematic Real-time Environmental Distributed Data Services), a connectivity tool between scientific data providers and end users. We can get historical model runs from the Meteogalicia WRF archives.

The actual meteorological data come from the meteorological station at Vigo airport. Iowa State University provides a database of airport meteorological reports; check this link.

You can get the data frame from my repository on GitHub. The code would be:

And you can visualize the data frame with:

The data frame has 33 features (columns) and 61256 observations (rows, one per time step).

The columns are named:

Columns ending in “_o” are data observed at the meteorological station. Columns ending in “_p” are variables predicted by the model for that time. The model that I use forecasts from 0 to 72 hours ahead. The dates in the data frame correspond to the forecasts from 24 to 48 hours.
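The suffix convention makes it easy to separate targets from inputs. A minimal sketch in plain Python follows. The list holds only a sample of the column names described in this post, not the full set of 33, and the helper name split_by_suffix is hypothetical.

```python
# Sample of column names taken from the descriptions in this post
# (not the full list of 33 columns).
columns = [
    "dir_o", "mod_o", "visibility_o", "temp_o",
    "dir_p", "mod_p", "visibility_p", "temp_p",
]


def split_by_suffix(names):
    """Return (observed, predicted) column names based on their suffix."""
    observed = [name for name in names if name.endswith("_o")]
    predicted = [name for name in names if name.endswith("_p")]
    return observed, predicted


observed_cols, predicted_cols = split_by_suffix(columns)
print("observed: ", observed_cols)
print("predicted:", predicted_cols)
```

The same two lists can later be used to select the target and input columns of the data frame.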

We can get the index with:

And we get:

Rows are of DateTime type, and times are in UTC. As an example, the first row (index) 2011-08-22 20:00:00 means that the variables predicted by the model for 2011-08-22 20:00:00 came from the model run with analysis time 2011-08-21 00:00:00, so that row is an H+44 forecast. The data frame contains only the forecasts in the 24 to 48 hour window: rows with time 00:00:00 are H+24 forecasts and rows with time 23:00:00 are H+47 forecasts. I chose spatial point 3 from the figure above, approximately 5 km from the airport meteorological station.
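The lead time of a row is simply the difference between its valid time and the analysis time of the model run. Here is a minimal sketch, assuming (as in the example above) that each run is issued at 00:00 UTC on the day before the valid date. The function names are hypothetical and not part of the original code.

```python
from datetime import datetime, timedelta


def analysis_time_for(valid_time):
    """Assumption: the run is issued at 00:00 UTC on the day before valid_time."""
    midnight = valid_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=1)


def lead_time_hours(valid_time, analysis_time):
    """Whole hours between the analysis time and the forecast valid time."""
    return int((valid_time - analysis_time) / timedelta(hours=1))


first_row = datetime(2011, 8, 22, 20, 0, 0)
run = analysis_time_for(first_row)
print(run, "-> H+%d" % lead_time_hours(first_row, run))
```

For the timestamps quoted above, the run is 2011-08-21 00:00:00 and the difference is 44 hours.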

Now we explain the meaning and units of every variable. Let's begin with the observed variables (suffix “_o”).

metar_o: The raw meteorological report issued every 30 minutes at the Vigo station. We don't use the half-hour reports because the model doesn't produce output for half hours. You can see more information about the METAR report here.

dir_o: Observed wind direction, measured clockwise from north. Units are degrees. -1 means variable direction.

mod_o: Observed wind intensity. Units are meters per second. All wind measurements are taken at a height of 10 meters.

wind_gust_o: Wind gust. Units are meters per second. -1 means no wind gust reported.

visibility_o: Visibility in meters. The minimum visibility reported is 48.280319 meters and the maximum is 9994.026301 meters (full visibility). Sorry for the decimal places; they come from converting units several times.

wxcodes_o: Present weather codes (space separated). Check the link about the METAR report for the list of present weather codes.

skyc1_o, skyc2_o, skyc3_o, skyc4_o: Sky level coverage at several levels (roughly speaking, the amount of cloud). Categorical data. M means no cloud coverage.

skyl1_o, skyl2_o, skyl3_o, skyl4_o: Sky level altitude of the cloud cover in meters at several levels. -1 means no cloud cover.

temp_o: Air temperature in kelvins at 2 meters.

dwp_o: Dew point temperature in kelvins at 2 meters.

rh_o: Relative Humidity

mslp_o: Sea Level Pressure in pascals
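Because the observed columns use SI units and a -1 marker for "not reported", a few small helpers are handy before plotting or modeling. The sketch below uses hypothetical function names and made-up sample values. Replacing -1 with NaN is only one possible choice, made here so the marker is not mistaken for a real measurement.

```python
import math


def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvins to degrees Celsius."""
    return kelvin - 273.15


def pascal_to_hpa(pascals):
    """Convert a pressure from pascals to hectopascals."""
    return pascals / 100.0


def ms_to_knots(speed_ms):
    """Convert a wind speed from meters per second to knots (1 kt = 1852 m/h)."""
    return speed_ms * 3600.0 / 1852.0


def sentinel_to_nan(value, sentinel=-1):
    """Replace the -1 marker (e.g. no gust reported) with NaN."""
    return math.nan if value == sentinel else value


# Made-up sample values, for illustration only.
print(kelvin_to_celsius(288.15), pascal_to_hpa(101325.0), ms_to_knots(10.0))
print([sentinel_to_nan(gust) for gust in [-1, 12.5, -1, 8.0]])
```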

Columns with the “_p” suffix are variables forecast by the model. You can get more information about the variables predicted by the WRF model here. Let's describe each one in the order they appear in the data frame:

lhflx_p: Surface downward latent heat flux. Units are watts per square meter.

dir_p: Predicted wind direction, measured clockwise from north. Units are degrees. Unlike dir_o, no variable wind is forecast (no -1 values).

mod_p: Forecast wind intensity. Units are meters per second.

prec_p: Total rainfall accumulated between consecutive model outputs (every hour in our case). Units are kilograms per square meter.

rh_p: Relative Humidity

visibility_p: Visibility in air. Units are meters. The minimum visibility is 26.028316 meters and the maximum is 24235.000000 meters.

wind_gust_p: Wind gust. Units are meters per second. Unlike wind_gust_o, a value is always forecast (no -1 values).

mslp_p: Sea Level Pressure in pascals

temp_p: Air Temperature in Kelvin at 2 meters

cape_p: Convective available potential energy. Units are joules per kilogram. Check this link for more information.

cin_p: Convective inhibition. Units are joules per kilogram. Click here for more information.

cfl_p: Cloud area fraction in the low atmosphere layer. I found 1251 samples with values higher than 1, so perhaps we shouldn't trust this feature too much.

cfm_p: Cloud area fraction in the mid atmosphere layer. I also found 37 samples with values higher than 1.

conv_prec_p: Total convective rainfall accumulated between consecutive model outputs (every hour in our case).
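A cloud area fraction should lie between 0 and 1, so the samples above 1 in cfl_p and cfm_p are worth flagging. This is a minimal sketch of such a range check, with hypothetical helper names and made-up sample values. Clipping to 1 is just one possible treatment and not necessarily the one used later in this series.

```python
def out_of_range(values, low=0.0, high=1.0):
    """Return the positions of values that fall outside [low, high]."""
    return [i for i, value in enumerate(values) if not (low <= value <= high)]


def clip_fraction(values, low=0.0, high=1.0):
    """Force every value into [low, high]."""
    return [min(max(value, low), high) for value in values]


# Made-up cloud fractions, for illustration only.
cfl_sample = [0.0, 0.35, 1.0, 1.2, 0.8]
bad = out_of_range(cfl_sample)
print("suspicious positions:", bad)
print("clipped:", clip_fraction(cfl_sample))
```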

With this data frame, we will build classification and regression problems in which the independent variables are the “_p” columns, the outputs of the meteorological model. The dependent (target) variables are the “_o” columns, the variables observed at the station.

We will also assess the meteorological model itself, applying the same metrics that we use to evaluate the machine learning model. Sometimes it will be challenging to beat the meteorological model. For instance, the correlation between observed and forecast pressure is more than 0.9.
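To make that comparison concrete, the raw model forecast can be scored against the observations as if it were any other predictor. The sketch below computes the mean absolute error and the Pearson correlation in plain Python. The pressure values are made up for illustration, and these two metrics are my choice of example rather than a fixed list for the series.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up sea level pressures in pascals (observed vs. forecast).
mslp_o = [101300.0, 101250.0, 101100.0, 100900.0]
mslp_p = [101350.0, 101200.0, 101150.0, 100950.0]

mae = mean_absolute_error(mslp_o, mslp_p)
r = pearson_correlation(mslp_o, mslp_p)
print("MAE (Pa):", mae)
print("Pearson r:", round(r, 3))
```

A machine learning model trained on the “_p” columns only adds value if it scores better than this baseline on the same metric.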

Conclusion and outlook

The database is the starting point to assess the meteorological models and create machine learning models that improve their accuracy. In the next post, I will start with the visibility variable. I hope that many people will be interested in this scientific field. Feel free to submit your comments. Thanks for reading the post!
