Data science can do a lot across many scientific fields. Let’s look at an application in physical oceanography, with a focus on the multivariate wave climate!
Over the last decade, long-term wave databases built from numerical models have improved our knowledge of the deep-water wave climate, especially at locations where no instrumental data is available, much to the delight of surfers around the world.
How can data science be useful here?
Within data science, machine learning, and more precisely unsupervised clustering techniques, are of particular interest. The purpose here is not to explain clustering algorithms in depth, but simply to show what they are useful for. These techniques extract features from the original N data points, giving a more compact and manageable representation of some important properties contained in the data. They can be used for clustering but also for anomaly detection, for example to flag unrealistic model predictions or wave storm events.
“If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake.”
Yann LeCun, computer scientist.
In coastal engineering and coastal management, the wave climate is crucial. The process is usually much the same: the available information, generally located in deep water, is transferred to shallow water using a state-of-the-art wave propagation model capable of simulating the most important wave transformation processes. The most common methodology in engineering offices consists of analyzing all the available wave climate data and summarizing it into a small number of representative sea states, which are then propagated to shallow-water areas. The success of the coastal study depends on the correct selection of these sea-state scenarios.
But how do we choose the best representation of the deep-water wave climate?
The common approach is a basic statistical analysis, using plots such as histograms or correlogram tables to choose representative sea states. Examples are presented below.
Selecting representative sea-state scenarios can be relatively complex: it requires cross-reading several plots, which is tricky and prone to visualization or interpretation errors. In this example we work in two dimensions (wave amplitude and wave period). The approach gets more complex in three dimensions (if we add wave direction).
And what if I told you that clustering algorithms can do the job for you, and probably more precisely? Pretty cool, isn’t it!
Welcome to the wonderful world of the K-means algorithm!
First, what does the K-means algorithm do? It assigns every instance in the dataset to the cluster whose centroid is closest. The algorithm starts by placing the centroids randomly. It then labels the instances, updates the centroids, labels the instances, updates the centroids, and so on until the centroids stop moving [Aurélien Géron].
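The label/update loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used in the Wave_tools module: the function names `kmeans_sketch` and `assign_labels` are hypothetical, the random start is assumed to pick k samples from the data, and the demo data is synthetic.

```python
import numpy as np

def assign_labels(X, centroids):
    # Distance from every sample to every centroid, shape (n_samples, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Each sample is labelled with the index of its closest centroid.
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Label/update loop; returns (centroids, labels, history of centroids)."""
    rng = np.random.default_rng(seed)
    # Assumption: the random start picks k distinct samples as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = [centroids.copy()]
    for _ in range(n_iter):
        labels = assign_labels(X, centroids)
        # Move each centroid to the mean of its samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        history.append(centroids.copy())
        if converged:  # the centroids have stopped moving
            break
    return centroids, assign_labels(X, centroids), history

# Synthetic demo: two well-separated groups of points (made-up values).
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal([1.0, 8.0], 0.2, size=(100, 2)),
    rng.normal([3.0, 14.0], 0.2, size=(100, 2)),
])
centroids, labels, history = kmeans_sketch(X_demo, k=2)
print(np.round(centroids, 1))
```

The `history` list keeps the centroid positions after each iteration, which is handy for plotting how they move before settling.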
For the mathematicians: the K-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μj of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X, although they live in the same space [source: scikit-learn website]. The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion.
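With the notation above (N samples x_i and centroids μj), the inertia criterion given in the scikit-learn documentation can be written as:

```latex
\sum_{i=1}^{N} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)
```

In words: for each sample, take the squared distance to its closest centroid, then sum over all samples.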
A simple wave toolbox that I wrote is available on my GitHub. The module is written in Python and the link is below.
The module contains tools for analyzing ocean wave time series. Unsupervised clustering algorithms have also been implemented…
Let’s try it with data from the MARC wave propagation model from IFREMER (a French institute); the model is available at https://marc.ifremer.fr/. The extraction point is located offshore, in front of the town of Les Sables-d’Olonne, France.
In this case, the data needs to go through a standardization step before being clustered. K-means relies on distances between points, so a variable with a large numerical range (such as direction in degrees) would otherwise dominate the others. The process consists of subtracting the mean value (so standardized values always have zero mean) and dividing by the standard deviation, so that the resulting distribution has unit variance. The standardization is done automatically in the Wave_tools module.
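As a minimal sketch of that step (not the Wave_tools implementation, which is not shown here), standardization can be written directly with NumPy. The helper name `standardize` is hypothetical and the values are illustrative, not taken from the MARC extraction.

```python
import numpy as np

def standardize(X):
    """Return (Z, mean, std) with Z = (X - mean) / std, column by column."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std, mean, std

# Illustrative rows of (Hs, Tp, Direction); made-up values.
X_demo = np.array([
    [2.4, 13.9, 246.0],
    [0.5, 10.0, 254.0],
    [1.1,  8.5, 210.0],
    [1.7, 12.7, 250.0],
])
Z, mean, std = standardize(X_demo)

print(np.round(Z.mean(axis=0), 6))  # each column is centred on zero
print(np.round(Z.std(axis=0), 6))   # each column has unit standard deviation

# The mean and std are kept so that results can be mapped back to
# physical units afterwards: X = Z * std + mean.
X_back = Z * std + mean
```

Keeping `mean` and `std` matters later: centroids found in standardized space have to be converted back before they can be read as wave heights, periods and directions.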
The K-means method requires the number of clusters (k) as a hyperparameter. The Wave_tools module provides several plots to help choose the best k: the inertia versus the number of clusters k, and the silhouette score versus the number of clusters. The silhouette plot is also produced automatically.
Let’s go back to the data. The time series contains 26,280 hourly records of significant wave height (Hs), peak period (Tp) and wave direction, between 2016 and 2018.
                           Hs        Tp  Direction
2016-01-01 00:00:00  2.355760  13.87020    245.966
2016-01-01 01:00:00  2.327610  13.87020    245.675
2016-01-01 02:00:00  2.317590  13.87020    245.664
2016-01-01 03:00:00  2.317530  13.88890    245.944
2016-01-01 04:00:00  2.335020  13.88890    246.426
...                       ...       ...        ...
2018-12-30 19:00:00  0.468191  10.00000    254.370
2018-12-30 20:00:00  0.489997   9.90099    254.241
2018-12-30 21:00:00  0.509803   9.80392    254.221
2018-12-30 22:00:00  0.525608   9.70874    254.293
2018-12-30 23:00:00  0.537219  12.65820    254.281
[26280 rows x 3 columns]
Out: (26280, 3)
Now, let’s find the optimal number of clusters. As I wrote above, the Wave_tools module contains functions to do that.
Wave_analyse = Wave_Tools(df_Data, seasons_split=False, clustering='kmean', n_k=range(2, 11))
The left plot shows the inertia for each number of clusters tested. The inertia is the sum of squared distances between each instance and its closest centroid. However, inertia is not a good performance metric for choosing k, because it keeps decreasing as k increases: the more clusters there are, the closer each instance is to its closest centroid, and therefore the lower the inertia. The purpose of this plot is instead to locate the elbow of the inertia curve, the value of k beyond which adding clusters brings only a marginal decrease in inertia.
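For readers who want to reproduce this kind of diagnostic without the Wave_tools module, here is a rough sketch using scikit-learn directly. It runs on synthetic data (three made-up groups standing in for standardized wave parameters), and it is not the module’s own code. It computes both the inertia and the silhouette score for k from 2 to 10; unlike inertia, a higher silhouette score indicates better-separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for standardized data: three well-separated groups.
centers = np.array([[-2.0, -2.0], [0.0, 2.0], [2.0, -1.0]])
X_demo = np.vstack([rng.normal(c, 0.3, size=(150, 2)) for c in centers])

ks = range(2, 11)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(km.inertia_)  # sum of squared distances to closest centroid
    silhouettes.append(silhouette_score(X_demo, km.labels_))

# Inertia is read by eye (elbow); the silhouette score can be maximised.
best_k = ks[int(np.argmax(silhouettes))]
for k, inertia, sil in zip(ks, inertias, silhouettes):
    print(f"k={k:2d}  inertia={inertia:9.2f}  silhouette={sil:.3f}")
print("k with the highest silhouette score:", best_k)
```

On real wave data the groups are rarely this clean, so the elbow and the silhouette maximum may disagree; in that case the choice of k remains a judgment call.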
With the number of clusters chosen, all that remains is to extract the centroids.
         Hs         Tp   Direction
0  0.704627  10.005830  256.922928
1  1.713324  13.182331  246.245376
2  0.940236   8.564546  208.884613
There we are! From a time series of 26,280 records of Hs, Tp and wave direction, the K-means algorithm has found the three most representative wave climates of the series.
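The whole chain (standardize, cluster, convert the centroids back to physical units) can be sketched with scikit-learn and pandas as follows. This is an assumption about how such an extraction could be done, not the Wave_tools code: the function name `representative_sea_states` is hypothetical and the data frame is synthetic, generated around three made-up sea states.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def representative_sea_states(df, k, seed=0):
    """Cluster standardized data and return the centroids in original units."""
    scaler = StandardScaler()
    Z = scaler.fit_transform(df.to_numpy(dtype=float))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z)
    # Centroids live in standardized space; map them back to physical units.
    centroids = scaler.inverse_transform(km.cluster_centers_)
    return pd.DataFrame(centroids, columns=df.columns)

# Synthetic series scattered around three made-up sea states.
rng = np.random.default_rng(1)
seeds = np.array([[0.7, 10.0, 257.0], [1.7, 13.2, 246.0], [0.9, 8.6, 209.0]])
spread = np.array([0.1, 0.3, 3.0])
df_demo = pd.DataFrame(
    np.vstack([rng.normal(c, spread, size=(200, 3)) for c in seeds]),
    columns=["Hs", "Tp", "Direction"],
)

states = representative_sea_states(df_demo, k=3)
print(states.round(2))
```

The cluster numbering returned by K-means is arbitrary, so the rows may come out in any order; sorting by Hs makes runs easier to compare.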
To simplify the visualisation, let’s take the same time series but without the direction, which means staying in two dimensions (Hs and Tp). For the example, we use k = 4. The figure shows the centroid positions at each K-means iteration. At the first iteration, the centroids are placed randomly. After each iteration the centroid positions are updated, until they stop moving.
The K-means algorithm is a powerful tool for wave climate analysis. It is easy to set up, fast and scalable. It is well suited to the automatic selection of a subset of sea states representative of the deep-water wave climate, as part of a methodology for transferring the wave climate to coastal areas.