Detecting Trends in Time Series Data using Python

Unleashing Python’s Time Series Analysis: Uncovering Hidden Trends Amidst the Data Deluge and Finding Needles in the Haystack of Time Series.

Oui Wein Jien
VorTECHsa
9 min read · May 19, 2023


In this article, we will discuss how to detect trends in time series data using Python. Automated trend detection can surface interesting patterns among thousands of time series, especially in the sophisticated oil and gas market. With the combined forces of Linear Regression and Kendall tau statistics, we can navigate this complexity and unlock valuable insights.

Introduction

Time series analysis is an important topic in data analytics, as it is used to study and predict patterns in data that vary over time. One of the most useful patterns that can be found in time series data is a trend: a gradual change in the data over time. Detecting trends can be helpful for making trading decisions.

However, analyzing data for each flow in the oil and gas market can be a tedious and time-consuming process, as there are thousands of possible combinations of origin, destination, and product to consider. As a result, an automated way to detect trends could be immensely helpful in saving time and reducing the workload. By leveraging the power of data analysis tools and statistical methods, we can quickly identify trends and patterns that would be difficult to detect manually. This can help oil and gas companies make better decisions and stay competitive in the market.

Data Preparation

Part 1. Extract data from VortexaSDK

In this article, we will be using Vortexa data as our data source.

Russia has been actively engaging in ship-to-ship (STS) transfer activity and has been a market spotlight since last year. In this article, we will analyze recent trends in Russian STS activity as our case study.

We will be utilizing Vortexa Cargo Movements data to extract the trends in STS activity. The code below downloads diesel cargoes loading from Russia between 1 December 2022 and 31 March 2023, extracts the movements containing STS events, and groups them by the month in which the STS took place.

# Extract data from the CargoMovements endpoint
# Assumed imports: the Vortexa SDK (aliased here as v) and datetime helpers
from datetime import datetime, timezone
import pandas as pd
import vortexasdk as v

# 'diesel' and 'russia' are lists of Vortexa product/geography IDs,
# assumed to have been looked up beforehand via the SDK's reference endpoints
df1 = v.CargoMovements().search(
    filter_activity='loading_end',
    filter_time_min=datetime(2022, 12, 1),
    filter_time_max=datetime(2023, 3, 31, 23, 59, 59),
    filter_products=diesel,
    filter_origins=russia
).to_df(columns='all')

# Filter for cargo movements that include an STS event
df1_sts = df1[df1['events.cargo_sts_event.0.from_vessel_id'] != ""]

# Group by STS start date on a monthly basis
df_ts = df1_sts.set_index('events.cargo_sts_event.0.start_timestamp')\
    .groupby([pd.Grouper(freq='MS'),
              'events.cargo_sts_event.0.location.sts_zone.label'])\
    .agg({'quantity': 'sum'})\
    .unstack(level=1).fillna(0).stack().reset_index()

# Shorten the SDK column names used in the rest of the article
# (an assumed step, implied by the names referenced below)
df_ts = df_ts.rename(columns={
    'events.cargo_sts_event.0.start_timestamp': 'transit_month',
    'events.cargo_sts_event.0.location.sts_zone.label': 'sts_zone',
})

# Analyze the trend up to the end of March 2023
df_ts_filter = df_ts[
    df_ts['transit_month'] < datetime(2023, 3, 31, 23, 59, tzinfo=timezone.utc)
]
Fig 1. Transiting quantity for different STS zones (df_ts_filter)

Part 2. Data Cleaning to analyze top STS location

Assuming we are only interested in the top STS locations, we take the top seven zones by total quantity and remove any STS zone that has only one STS event. The following code identifies the top STS zones of interest.

# Filter for the top 7 STS locations by total transiting quantity
top_7 = df_ts_filter.groupby('sts_zone')\
    .agg({'quantity': 'sum'})\
    .sort_values('quantity', ascending=False)\
    .index[:7].tolist()

# Filter out STS zones that have only one STS event
# ('exclude_list' is assumed to have been built beforehand)
top_location = list(set(top_7) - set(exclude_list))

Detecting Trends in Time Series

Method 1. Linear Regression

To detect an increasing trend using linear regression, you can fit a linear regression model to the time series data and perform a statistical test on the estimated coefficient (slope). If the coefficient is significantly positive, it indicates that the time series has an increasing trend. On the other hand, if the coefficient is significantly negative, it indicates a decreasing trend.
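For illustration, here is a minimal, self-contained sketch of such a test using scipy.stats.linregress, which reports a p-value for the null hypothesis that the slope is zero. The toy series below is purely illustrative, and this differs from the sklearn-based appendix code:

import numpy as np
from scipy.stats import linregress

# Purely illustrative monthly quantities; positions 0..n-1 serve as the time axis
y = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 21.0])
x = np.arange(len(y))

result = linregress(x, y)
print(f"slope = {result.slope:.2f}, p-value = {result.pvalue:.4f}")

# A significantly positive slope suggests an increasing trend
if result.pvalue < 0.05 and result.slope > 0:
    print("Significant increasing trend detected")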

Linear Regression generally works well even on smaller datasets (such as monthly time series). Here is example code that demonstrates this approach:

*Code for linear regression: see Appendix

# Perform Linear Regression and calculate the gradient/slope
lr_df = calculate_slope(df_ts_filter, location_col='sts_zone',
                        quantity_col='quantity', location_list=top_location)

# Visualizing the output
lr_df[['r2_score']].style.bar(subset='r2_score', align='left', color=['#d65f5f', '#5fba7d'])
lr_df[['mape']].style.bar(subset='mape', align='right', color=['#d65f5f', '#5fba7d'])
lr_df[['gradient']].style.bar(subset='gradient', align='mid', color=['#d65f5f', '#5fba7d'])
Fig 2. R² score, MAPE and gradient of each STS zone
Fig 3. Fitted lines of linear regression on the top STS zones

To validate the result, we first look at the R² score and MAPE of our fitted regression. The R² score measures how well the linear regression model fits the data: it represents the proportion of variance in the dependent variable that can be explained by the independent variables in the model.

A higher R² score indicates a better fit between the model and the data.

The MAPE measures the accuracy of the model’s predictions. It represents the average percentage difference between the actual values and the predicted values.

A lower MAPE indicates that the model is more accurate in its predictions.
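For reference, here is a minimal NumPy sketch of both metrics on hypothetical arrays y_true and y_pred. Note that the appendix code approximates MAPE as the MAE divided by the series mean, rather than the strict per-point definition below:

import numpy as np

y_true = np.array([10.0, 12.0, 15.0, 14.0, 18.0])  # hypothetical actuals
y_pred = np.array([10.5, 12.2, 14.1, 15.0, 17.4])  # hypothetical predictions

# R²: proportion of variance explained by the model
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# MAPE: average absolute percentage error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100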

In our case, Kalamata STS, Augusta STS, and Taman STS all have high R² scores and low MAPEs, so their regression results carry higher confidence.

From the slope/gradient results above, Kalamata STS [GR] emerges as the hottest recent STS location, followed by the Augusta STS zone and then Taman STS [RU], as all three have high gradients.

Method 2. Kendall Tau Statistics

Kendall tau measures the strength of the association between two variables by comparing the number of concordant and discordant pairs of observations. In the context of time series data, this means comparing the order of the values for the variable being measured at different points in time.

A pair of observations is considered concordant if the later observation in time has the larger value, i.e. the value moves in the same direction as time. A pair is considered discordant if the later observation has the smaller value, i.e. the value moves against time.

To use Kendall tau to detect trends in time series, we first calculate the Kendall tau coefficient between time and the variable being measured. This involves counting the number of concordant and discordant pairs of observations. By doing so, the Kendall tau coefficient can determine the direction and strength of the association between time and the variable being measured; a step-by-step counting sketch follows below.

If there are more concordant pairs than discordant pairs, this indicates a positive association between time and the variable being measured, suggesting a positive trend. Conversely, if there are more discordant pairs than concordant pairs, this indicates a negative association between time and the variable being measured, suggesting a negative trend.
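To make the pair counting concrete, here is a minimal sketch on a toy series. Because the time index is already sorted, each pair's concordance is decided by the values alone; with no ties, the manual count matches scipy.stats.kendalltau:

from itertools import combinations
import numpy as np
from scipy.stats import kendalltau

# Toy series; positions 0..n-1 act as the (already sorted) time axis
y = np.array([3.0, 5.0, 4.0, 8.0, 9.0])

concordant = discordant = 0
for i, j in combinations(range(len(y)), 2):
    if y[j] > y[i]:
        concordant += 1   # value moved in the same direction as time
    elif y[j] < y[i]:
        discordant += 1   # value moved against time

n_pairs = len(y) * (len(y) - 1) / 2
tau_manual = (concordant - discordant) / n_pairs   # (9 - 1) / 10 = 0.8

tau, p_value = kendalltau(np.arange(len(y)), y)    # tau == 0.8 as well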

It’s important to note that Kendall tau is a non-parametric method, meaning it doesn’t assume any specific distribution of the data. This makes it useful for detecting trends in time series that may not follow a normal distribution or may contain outliers. That said, Kendall tau is best used in conjunction with other methods, such as visual inspection or linear regression, to get a more complete picture of the trend in the data.

*Code for relative order testing: see Appendix

# Compute Tau statistics
tau_df = calculate_tau(df_ts_filter, location_col='sts_zone',
                       quantity_col='quantity', location_list=top_location)

# Visualizing the output
tau_df[['pvalue']].style.bar(subset='pvalue', align='right', color=['#d65f5f', '#5fba7d'])
tau_df[['tau']].style.bar(subset='tau', align='mid', color=['#d65f5f', '#5fba7d'])
Fig 4. p-value and tau statistic for each STS location

We need to choose a threshold (alpha) for the p-values returned by Kendall tau. If the p-value is smaller than our chosen alpha, we reject the null hypothesis and accept the alternative hypothesis: a significant trend exists in the series. Increasing the alpha level allows a greater chance of committing a Type I error, that is, of detecting a trend that does not in fact exist. Nevertheless, we will use a relatively generous alpha of 0.1, as we do not wish to miss any interesting trends.
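Applied to the tau_df produced above, the thresholding is a short sketch:

# Keep only the zones whose trends are significant at alpha = 0.1
alpha = 0.1
significant = tau_df[tau_df['pvalue'] < alpha].sort_values('tau', ascending=False)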

The Kendall tau p-values (Fig. 4) suggest that there are significant trends in Augusta STS [IT], Taman STS [RU], and Kalamata STS [GR]. Their tau statistics (Fig. 4) are all approaching 1, meaning that all three STS zones are experiencing increasing trends.

Validation of results

Fig 5. Actual trends in the STS locations derived by the two methods

The chart above shows that the three STS locations flagged by both methods do in fact have an obvious increasing trend. Checking the other STS zones, none has a more obvious increasing trend than these three. This helps verify the capability of the two methods in detecting trends in time series. However, there are some parameters you may need to tune, such as the significance level (alpha) and a MAPE threshold, to reduce the risk of detecting spurious trends; one way to combine them is sketched below.
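As a minimal sketch of that combination, assuming lr_df and tau_df share the same zone index as in the code above, and with purely illustrative threshold values:

# Flag a zone only if it passes the Kendall tau significance test
# AND its regression fit is accurate enough to trust the gradient
alpha, mape_threshold = 0.1, 0.3  # illustrative values, tune per use case

combined = lr_df.join(tau_df)
flagged = combined[
    (combined['pvalue'] < alpha)
    & (combined['mape'] < mape_threshold)
    & (combined['gradient'] > 0)
]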

Conclusion

Using two different methods, Linear Regression and the Kendall tau test, we have successfully shown the existence of a recent trend in STS activity in three STS zones, and also obtained the magnitude of that trend. Our results were validated by examining the R² and MAPE values of the linear regression model and the Kendall tau coefficient, all of which indicate a significant trend in the time series data. These methods can save analysts significant time by automatically detecting meaningful trends in the energy world, where there are millions of combinations of routes and products.

Appendix

Linear Regression

import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score


def calculate_slope(df, location_col, quantity_col, location_list):
    '''
    Loop through the location list and calculate the slope of the
    linear regression line of best fit for each location.

    Parameters:
    df: time series dataframe
    location_col: the group-by location column name
    quantity_col: aggregated quantity column name by certain frequency
    location_list: locations to loop through
    '''
    location_dict = {}
    for location in location_list:
        df2 = df[df[location_col] == location].reset_index(drop=True)
        df2[quantity_col] = df2[quantity_col].replace('', 0)

        # Use the row position as the time axis, reshaped to 2D for sklearn
        X = df2.index.values.reshape(-1, 1)

        # Extract the time series values
        y = df2[quantity_col].values.reshape(-1, 1)

        # Fit a linear regression model to the data
        reg = LinearRegression().fit(X, y)

        # Get the prediction
        y_pred = reg.predict(X)

        # Get the fitting metrics
        mae = mean_absolute_error(y, y_pred)
        r2 = r2_score(y, y_pred)

        # Get the slope of the trend line
        slope = reg.coef_[0][0]

        mean = np.mean(y)

        # Store each location's metrics in the dictionary
        location_dict[location] = (slope, mae, mean, r2)

    lr_df = pd.DataFrame(location_dict).transpose()
    lr_df = lr_df.rename(columns={0: 'gradient', 1: 'mae', 2: 'mean', 3: 'r2_score'})
    # MAE normalized by the series mean, used here as a MAPE-style metric
    lr_df['mape'] = lr_df['mae'] / lr_df['mean']
    return lr_df


def visualize_location_gradient(df):
    '''
    Take location-gradient pairs and give a bar chart visualization.
    '''
    df = df.sort_values('gradient', ascending=False)
    fig = px.bar(df, x=df.index, y='gradient', title='Gradient for each location')
    fig.show()

Kendall Tau statistics

import scipy.stats as stats


def calculate_tau(df, location_col, quantity_col, location_list):
    '''
    Loop through the location list and compute the Kendall tau
    statistic for each zone.

    Parameters:
    df: time series dataframe
    location_col: the group-by location column name
    quantity_col: aggregated quantity column name by certain frequency
    location_list: locations to loop through
    '''
    location_pvalue_dict = {}
    for location in location_list:
        df2 = df[df[location_col] == location].reset_index(drop=True)
        df2[quantity_col] = df2[quantity_col].replace('', 0)

        # Use the row position as the time axis (1D array for kendalltau)
        X = df2.index.values

        # Extract the time series values as a 1D array
        y = df2[quantity_col].values

        # Compute the tau statistic and its p-value
        tau, p_value = stats.kendalltau(X, y)

        # Store each location's (tau, p-value) pair in the dictionary
        location_pvalue_dict[location] = (tau, p_value)

    tau_df = pd.DataFrame(location_pvalue_dict).transpose()
    tau_df = tau_df.rename(columns={0: 'tau', 1: 'pvalue'})
    return tau_df
