Using Open Source Data & Machine Learning to Predict Ocean Temperatures

The NOAA Ocean Data in this Tutorial Covers Florida & the Gulf of Mexico.

In this tutorial, we’re going to show you how to take open source data from the National Oceanic and Atmospheric Administration (NOAA), clean it, and forecast future temperatures using no-code machine learning methods.

This particular data comes from the Harmful Algal BloomS Observation System (HABSOS). There are several interesting questions to ask of this data, such as: what is the relationship between algal blooms and water temperature fluctuations? For this tutorial, we're going to start with a basic question: can we predict what temperatures will be over the next five months?

The first part of this tutorial deals with acquiring and cleaning the dataset. There are a lot of approaches to this; what is shown below is just one approach. Further, if your dataset is already clean, you can skip all that “data engineering” and jump straight into no-code AI bliss :)

Step 1: Download & Clean the Data

First, we download the data from the HABSOS site linked above. For convenience, we are posting the file here as well.

This CSV has 21 columns, which we discovered with this bash command.

$ awk '{print NF}' habsos_20200310.csv | sort -nu | tail -n 1
21
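
If you'd rather stay in Python, a quick pandas check gives the same count. This is a minimal sketch, assuming habsos_20200310.csv is in your working directory:

import pandas as pd
# Read only the header row to count columns without loading the full file
columns = pd.read_csv('habsos_20200310.csv', nrows=0).columns
print(len(columns))  # 21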

We’ll explore the rest of the data in subsequent tutorials, but, of these 21 columns, the only ones we’re interested in for now are:

  • sample_date
  • sample_depth
  • water_temp

In addition to only needing a subset of the columns in the data, there are other issues to deal with in order to get the data ready for analysis. We need to:

  • Remove rows with NaN values (i.e. empty values) in the water_temp column,
  • Select only the measurements made at a depth of 0.5 meters (to remove temperature variability due to ocean depth), and
  • Regularize the data periods by turning the datetime values into date values.
import pandas as pd

df = pd.read_csv('habsos_20200310.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
pd.set_option('display.max_rows', None)
# Get only the columns we care about
dfSub = df[['sample_date', 'sample_depth', 'water_temp']]
# Remove the rows with NaN values
dfClean = dfSub.dropna()
# Select the 0.5-meter depth measurements only (values are strings because of dtype='unicode')
dfClean2 = dfClean.loc[dfClean['sample_depth'] == '0.5'].copy()
# Split the datetime values, keeping only the date portion
dfClean2['sample_date'] = dfClean2['sample_date'].str.split(expand=True)[0]
dfClean2.to_csv(r'/PATH/TO/YOUR/OUTPUT/out.csv', index=False)

There’s another big problem with this data: on certain days, there are multiple sensor readings; on other days, there are no sensor readings. Sometimes there are entire months without readings.

These problems are quicker to address in spreadsheets by using pivot tables. And, now that we have reduced the size of the data with the preceding Python script, we are able to load it into a Google Sheet.

What we ended up doing was making a pivot table for each month of each year (1954 to 2020) and taking the median water temperature for that month. We used the median instead of the average in case wild outlier measurements skewed our summarized data.

Our results are available for viewing in the third tab of this Google Sheet.
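
If you'd rather do this step in Python instead of a spreadsheet, the same monthly-median summary can be sketched with pandas. This assumes the out.csv produced by the cleaning script above; monthly_median.csv is just a hypothetical output filename:

import pandas as pd

# Load the cleaned output from the previous step
df = pd.read_csv('out.csv', parse_dates=['sample_date'])
df['water_temp'] = pd.to_numeric(df['water_temp'], errors='coerce')
# Collapse the irregular readings into one median value per month;
# months with no readings simply come out as NaN
monthly = (
    df.set_index('sample_date')['water_temp']
      .resample('M')
      .median()
)
monthly.to_csv('monthly_median.csv', header=True)

As in the spreadsheet, the median is used because it is less sensitive to wild outlier readings than the mean.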

Let’s take those results and bring them into Monument!

Step 2: Chart the Data & Use No-Code Machine Learning to Generate a Forecast

To chart the data, we’re first going to load it into Monument (www.monument.ai). Monument is an artificial intelligence/machine learning platform that allows you to use advanced algorithms without touching a line of code.

First, we’re going to import our freshly cleaned data into Monument as a CSV file. In the INPUT tab, you’ll see the data as it exists in the source file on the top and the data as it will be imported into Monument on the bottom. If you’re satisfied with how it will be imported, click OK in the bottom right.

Load the data!

When you click OK, you’ll be brought into the MODEL tab. You can drag the “data pills” from the far left into the COLS(X) and ROWS(Y) areas to chart the data. You will clearly see the gaps in the data, where there were months with no temperature readings.

Monument’s algorithms can handle missing data.

This data has a visually recognizable pattern: it resembles a sine wave. In general — and especially when data has a repetitive pattern — it’s good to start an analysis with AutoRegression (AR). AR is one of the more “primitive” algorithms, but it often learns obvious patterns quickly.

When we apply AR to the water temperature data by dragging it into the chart, we see a spiked divergence from the actual historical data early in the training period, but the algorithm quickly gets a handle on what is occurring in the dataset.

By the end of the training data, it almost perfectly overlays onto the training set. When an algorithm does a good job anticipating known historical data in the training period, it can be an indication that the algorithm will do well forecasting the future. (However, a concern is “overfitting,” which we will explore in future articles.)

Off to a good start!
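
Monument does all of this without code, but if you want a feel for what an autoregressive model is doing under the hood, here is an illustrative sketch using statsmodels' AutoReg. This is not Monument's implementation, just a rough stand-in; monthly_median.csv is the hypothetical output of the aggregation sketch above, and the gap-filling step is only there because AutoReg cannot handle missing values:

import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Monthly median water temperatures (month-end index)
series = (
    pd.read_csv('monthly_median.csv', index_col='sample_date', parse_dates=True)['water_temp']
      .asfreq('M')
      .interpolate()  # fill the missing months for this illustration
)
# 12 lags roughly captures one yearly seasonal cycle in monthly data
model = AutoReg(series, lags=12).fit()
# Forecast the next five months (p1 through p5 in Monument's OUTPUT tab)
forecast = model.predict(start=len(series), end=len(series) + 4)
print(forecast)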

Now, let’s try a Dynamic Linear Model (DLM). DLM is a slightly more complex algorithm — let’s see if it gets us even better results. When we drag DLM into the chart, we notice immediately that something seems off: DLM appears out of sync with the training data. It has trouble anticipating where the peaks and troughs are in the historical data.

Uh oh…

If we zoom in by dragging the windowing widget below the chart and mute the AR results by clicking the color box above the chart, the effect is even more pronounced. The historical data and DLM are out of sync, so it's unlikely that the forecasted results beyond the historical data will be reliable.

Not looking good…

Let’s try Time-Varying AutoRegression (TVAR). It looks like it produces similar results to AR.

Looking good.

Now, let’s try Long Short-Term Memory (LSTM). This is way off! An LSTM often produces great results for “noisier” data that has less regular patterns. However, on highly patterned data like this dataset, it has trouble.

There are ways to improve the performance of the LSTM (and any algorithm) by adjusting the algorithm’s parameters, but we already have algorithms performing well, so it doesn’t seem worth the effort.

The LSTM has forsaken us…

Now, let’s zoom in to see what we are working with by using the windowing widget on the bottom of the chart. Let’s also click the circles icon in the top right of Monument and select “forecast” to remove the training period and only show the prediction.

TVAR had looked good when zoomed out, but up close it is the odd one out: all of our other algorithms agree with one another, while TVAR diverges. Let's drop TVAR.

TVAR does not look so good up close.

Let’s bring back “training+forecast,” remove everything but AR, and apply the Gaussian Dynamic Boltzmann Machine (G-DyBM). Things are looking pretty good now :)

The sweet spot.

Let’s flip over to the OUTPUT tab and scroll to the bottom to see our forecasts. Because we made our data periods monthly, p1, p2, p3, p4, and p5 are Month-1, Month-2, Month-3, Month-4, and Month-5 into the future.

In this tutorial, we took open source data from the internet, cleaned it, loaded it into Monument, and — in minutes! — used advanced data science methods to get forecasts for future median monthly water temperatures in the Gulf of Mexico at a depth of 0.5 meters.

You can download the .mai file of our results from this link.

In the next tutorial, we’ll look deeper at the error rates for each of the algorithms we tried above and discuss why we might select one algorithm over another. We’ll also calculate the standard deviation for the outliers and discuss why this is important.

Interested in learning more about Monument? Book a free introductory Zoom call here.
