Predicting Time-Series with SARIMAX

A Kaggle Notebook Guide to COVID-19 New Cases forecast in Italy

Gianpiero Andrenacci
Data Bistrot
14 min read · Sep 3, 2024

Forecasting Time-Series with Python — All rights reserved

In this article, we will explore a Kaggle notebook that predicts new Covid-19 cases in Italy using the SARIMAX model. The notebook demonstrates how to forecast time-series data effectively by leveraging the power of SARIMAX. This conceptual explanation will guide you through the key steps involved, from data preparation to model evaluation, without diving into the code, which can be accessed directly in the Kaggle notebook.

What is SARIMAX?

SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors) is a robust statistical model used for time series forecasting. It extends the ARIMA model by incorporating seasonal effects and external factors (exogenous variables) that might influence the target variable. Here’s a brief breakdown of its components:

  • Seasonal (S): Accounts for repeating patterns or cycles in the data, such as daily, weekly, or monthly effects.
  • AutoRegressive (AR): Uses the dependency between an observation and several lagged observations.
  • Integrated (I): Applies differencing to the raw observations to make the time series stationary.
  • Moving Average (MA): Incorporates the dependency between an observation and a residual error from a moving average model applied to lagged observations.
  • eXogenous Regressors (X): Includes external factors that might affect the target variable, such as holidays or other relevant events.

What Will You Learn?

By following the steps outlined in the Kaggle notebook, you will learn how to:

  1. Prepare Data for SARIMAX: Understand how to process and organize your time series data, including the creation of relevant features and handling of exogenous variables.
  2. Analyze ACF and PACF Plots: Use Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to identify the appropriate parameters for your model.
  3. Find Optimal Hyperparameters: Employ techniques such as auto_arima to automate the selection of the best hyperparameters for your SARIMAX model, ensuring accurate and reliable forecasts.
  4. Predict Future Values: Generate forecasts for future time periods and visualize the predictions to assess their alignment with actual observed data.
  5. Evaluate Model Performance: Use various error metrics and diagnostic plots to evaluate the model’s accuracy and diagnose any potential issues, ensuring robust and reliable forecasts.

This conceptual guide will provide you with a solid understanding of the steps involved in building and evaluating a SARIMAX model for time series forecasting. For detailed implementation, you can refer to the actual code in the Kaggle notebook. By the end of this article, you will have a clearer insight into how to use SARIMAX to forecast time-series data effectively.

1. Introduction

Covid-19 New Cases Prediction with SARIMAX

The first cases of local transmission of the coronavirus in Italy appeared in the northern regions of Lombardy, Veneto, and Emilia-Romagna on February 20, 2020.

Coronaviruses are a family of viruses named after their spiky, crown-like appearance. The novel coronavirus, also known as SARS-CoV-2, is a contagious respiratory virus that was first reported in Wuhan, China.

On February 11, 2020, the World Health Organization designated the name COVID-19 for the disease caused by this novel coronavirus. As the virus rapidly spread across the globe, data analysis and projections became pivotal in understanding and combating the pandemic.

This article aims to explore COVID-19 through data analysis and projections, focusing specifically on predicting new cases in Italy.

Data Collection and Sources

Data collection started on February 24, 2020, ensuring a robust dataset for analysis. The geographical areas are sourced from the ISTAT website, which details the Italian territory comprising 19 regions and 2 autonomous provinces (Trento and Bolzano).

For simplicity, we will treat the 2 autonomous provinces as one region, called Trentino-Alto Adige. The primary data source for this analysis is the Protezione Civile’s GitHub repository, which provides comprehensive and up-to-date COVID-19 case data.

Model Selection: SARIMAX

To achieve accurate predictions, we will employ the SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous factors) model. SARIMAX is an extension of the ARIMA model that supports seasonal components and exogenous variables, making it particularly well-suited for time series data with seasonal patterns and external influences.

Objective

The primary goal of this notebook is to predict the number of new COVID-19 cases in Italy for the upcoming seven days. By utilizing the SARIMAX model, we aim to provide actionable insights that can aid in public health decision-making and resource allocation.

This article will conceptually explain the steps and methodology used in the Kaggle notebook.

1.1 Import Libraries

The first step in any data science project is to import the necessary libraries. This notebook makes use of various Python libraries that are essential for data manipulation, visualization, and modeling. These include:

  • pandas and numpy: For data manipulation and numerical operations.
  • matplotlib and seaborn: For data visualization.
  • sklearn: For machine learning metrics.
  • statsmodels: For time series analysis and SARIMAX modeling.
  • pmdarima: For automated ARIMA model selection.
  • geopandas and shapefile: For geographic data manipulation and plotting.

These libraries provide a comprehensive toolkit for handling the different aspects of the project, from data preparation to model building and evaluation.

1.2 Custom Functions

Custom functions are defined to streamline repetitive tasks and enhance the modularity of the code. In this notebook, several custom functions are created, such as:

  • create_features(df): This function generates time series features from the date column, which include day of the week, month, year, day of the year, and other time-related features. This is essential for capturing the temporal patterns in the data.
  • my_plot(df, column_name, ylabel, title, start, end, navigate): A plotting function that allows for flexible visualization of the data, including options for date range navigation and customizable titles and labels.
  • last_n_months(df, n_months): This function filters the dataframe to include only the data from the last ’n’ months, which is useful for focusing on recent trends.
  • adf_test(series, title): Performs the Augmented Dickey-Fuller test to check for stationarity in the time series data.
  • error_metrics(y_test, y_pred): Calculates common forecast error metrics such as R², MSE, RMSE, MAE, and MAPE to evaluate the model’s performance.

These functions encapsulate specific tasks, making the code cleaner and more maintainable.
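As a rough illustration of what two of these helpers might look like, here is a minimal sketch of create_features and last_n_months. It is not the notebook's code: it assumes the dataframe already carries a DatetimeIndex (rather than a separate date column), and the exact set of features and the toy data are assumptions made for the example.

```python
import pandas as pd


def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with calendar features derived from its DatetimeIndex."""
    out = df.copy()
    idx = out.index
    out["dayofweek"] = idx.dayofweek   # Monday=0 ... Sunday=6
    out["month"] = idx.month
    out["quarter"] = idx.quarter
    out["year"] = idx.year
    out["dayofyear"] = idx.dayofyear
    return out


def last_n_months(df: pd.DataFrame, n_months: int) -> pd.DataFrame:
    """Keep only the rows that fall within n_months of the most recent date."""
    cutoff = df.index.max() - pd.DateOffset(months=n_months)
    return df.loc[df.index > cutoff]


# Toy daily series standing in for the real data (values are arbitrary).
dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")
toy = pd.DataFrame({"new_cases": range(len(dates))}, index=dates)

features = create_features(toy)
recent = last_n_months(toy, 3)
print(features.head())
print(recent.index.min(), recent.index.max())
```

Returning a copy instead of modifying the input keeps the helper safe to call repeatedly on the same dataframe.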

1.3 Load the Data

Loading the data is a crucial step in the data analysis process. The data for this notebook is sourced from the Italian Protezione Civile’s GitHub repository, which provides updated Covid-19 statistics for Italian regions. The dataset includes various columns such as date, new cases, intensive care admissions, total hospitalizations, recovered cases, and more.

The data is loaded into a pandas dataframe and the relevant columns are selected and renamed for clarity. Missing values are checked and handled appropriately. The dataset is then prepared for further analysis by filtering the necessary features and creating new date-related columns using the custom functions defined earlier.

Dataset link: dpc-covid19-ita-regioni.csv

1.4 Dataset Description

The dataset used in this notebook comprises daily records of Covid-19 new cases and other related metrics across different regions in Italy. Key features include:

  • date: The date of the record.
  • new_cases: The number of new Covid-19 cases reported on that date.
  • region: The name of the region in Italy where the cases were reported.
  • holiday: A binary indicator of whether the date is a public holiday in Italy.

The data spans from the initial outbreak in February 2020 and is updated regularly to provide the most current information. This dataset allows for detailed time series analysis and forecasting, which is crucial for understanding the trends and predicting future cases.

By understanding these foundational steps, we set the stage for deeper exploration and modeling of the data, which includes data preparation, exploratory data analysis (EDA), and building and evaluating the SARIMAX model.

2. Data Preparation and Exploration

Data preparation and exploratory data analysis (EDA) are essential steps in any data science project. They help in understanding the data, identifying patterns, and preparing it for modeling. In this section, we will discuss how data is prepared and explored in the notebook.

2.1 Data Preparation

Data preparation involves several steps to ensure the dataset is ready for analysis and modeling. Here are the main steps taken in this notebook:

  • Aggregation and Grouping: The dataset is aggregated to get the total number of new cases per day. This involves summing up the new cases across all regions for each date.
  • Feature Creation: Additional features are created using the custom create_features function. These features include day of the week, month, year, day of the year, and other time-related attributes. These features help in capturing the temporal patterns in the data.
  • Handling Missing Values: The dataset is checked for any missing values. If any are found, appropriate measures are taken to handle them, such as imputation or removal, to ensure the dataset is complete and ready for analysis.
  • Setting the Index: The date column is set as the index of the dataframe. This is important for time series analysis as it allows for easy manipulation and plotting of the data over time.
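The steps above can be sketched on a tiny in-memory table. This is only an illustration under assumptions: the column names date, region, and new_cases follow the renamed columns described earlier, the numbers are made up, and filling a missing value with zero is just one possible way to handle gaps.

```python
import pandas as pd

# Tiny stand-in for the regional file; the values are invented for illustration.
raw = pd.DataFrame(
    {
        "date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02", "2020-03-03"],
        "region": ["Lombardia", "Veneto", "Lombardia", "Veneto", "Lombardia"],
        "new_cases": [100, 40, 120, None, 90],
    }
)

raw["date"] = pd.to_datetime(raw["date"])

# Check for missing values, then handle them (here: treat a gap as zero cases).
missing = raw.isna().sum()
raw["new_cases"] = raw["new_cases"].fillna(0)

# Sum over regions to get one national total per day, indexed by date.
daily = raw.groupby("date")["new_cases"].sum().to_frame().sort_index()

print(missing)
print(daily)
```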

2.2 EDA with Pandas-Profiling

Exploratory Data Analysis (EDA) is performed to understand the data better and identify any patterns, trends, or anomalies. In this notebook, pandas-profiling (now distributed as ydata-profiling) is used for EDA:

  • Generating a Report: The pandas-profiling library generates a detailed report of the dataframe. This report includes information about the number of rows and columns, data types, missing values, and the distribution of values in each column.
  • Statistical Measures: The report provides statistical measures such as mean, median, and standard deviation for numeric columns. It also includes plots of the distribution of each column, helping to visualize the data.
  • Insights and Patterns: The report helps in identifying any patterns, trends, or anomalies in the data. For example, it may highlight periods with unusually high or low new cases, or identify columns with a significant number of missing values.

Related article: Pandas Profiling (ydata-profiling) in Python: A Guide for Beginners

2.3 Geographic Data Analysis

Geographic data analysis is conducted to understand the spatial distribution of Covid-19 cases across different regions in Italy. This involves:

  • Merging Data: The Covid-19 dataset is merged with geographic shapefiles of Italian regions. These shapefiles are obtained from the Italian National Institute of Statistics (ISTAT) and provide the geographic boundaries of each region.
  • Mapping: The merged data is used to create maps that visualize the number of new cases in each region. These maps help in identifying geographic patterns and regions with higher or lower numbers of cases.
  • Temporal Analysis: The geographic data is analyzed over time to see how the distribution of cases changes. This can highlight regions that have become hotspots or show improvements over time.

2.4 Exploratory Data Analysis

Further exploratory data analysis is conducted to gain deeper insights into the data:

  • Time Series Plots: The number of new cases is plotted over time to visualize trends and patterns. This helps in understanding the overall trajectory of the pandemic and identifying any seasonal patterns.
  • Heatmaps: Heatmaps are used to visualize the distribution of new cases on different weekdays and months. This can highlight days or periods with higher numbers of cases.
  • Pareto Analysis: A Pareto diagram is created to apply the 80–20 rule, identifying the most significant time periods or categories that account for the majority of new cases.
  • Distribution Plot: A distribution plot shows how often different daily counts of new COVID-19 cases were reported in Italy over the last three months. It summarizes the frequency and spread of the case counts and gives a quick view of the typical range and any extreme values.
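To make the heatmap and Pareto ideas concrete, the tables behind such plots can be built with pandas alone. The sketch below uses synthetic counts; grouping by weekday and month for the heatmap and by month for the Pareto ranking are assumptions about how the notebook slices the data.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts standing in for the national series.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", "2021-06-30", freq="D")
daily = pd.DataFrame({"new_cases": rng.poisson(1000, len(dates))}, index=dates)

tmp = daily.assign(weekday=daily.index.dayofweek, month=daily.index.month)

# Table behind a weekday-by-month heatmap: mean new cases per cell.
heat = tmp.pivot_table(index="weekday", columns="month", values="new_cases", aggfunc="mean")

# Table behind a Pareto diagram: months ranked by total cases,
# with the cumulative share of all cases they account for.
monthly = tmp.groupby("month")["new_cases"].sum().sort_values(ascending=False)
pareto = (monthly.cumsum() / monthly.sum()).rename("cumulative_share")

print(heat.round(0))
print(pareto.round(3))
```

The heat table can be passed to seaborn.heatmap, and the pareto series read off against the 80% mark to see how many periods account for most of the cases.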

These steps ensure that the data is well-understood and ready for the next phase of building and evaluating the SARIMAX model. By performing thorough data preparation and exploration, the notebook sets a solid foundation for accurate and reliable time series forecasting.

3. Build SARIMAX Model

In this section, we will discuss how to build the SARIMAX model for predicting Covid-19 new cases in Italy. The SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors) model is a powerful tool for time series forecasting, especially when there are seasonal patterns and external factors influencing the data.

3.1 Data Preparation for SARIMAX

Before building the SARIMAX model, it is essential to prepare the data appropriately. This involves the following steps:

  • Creating Endogenous and Exogenous Variables: The target variable, which is the number of new Covid-19 cases, is defined as the endogenous variable. Additionally, exogenous variables such as holidays are included to account for external factors that might influence the number of new cases. These variables help the model to better capture the underlying patterns in the data.
  • Splitting the Data: The data is split into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance. Typically, the most recent data points are reserved for testing to assess how well the model can predict future values.
  • Setting the Frequency: The date index is set with a daily frequency to ensure the model treats the data as a daily time series. This is important for capturing the temporal structure and seasonality in the data.
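A minimal sketch of these three steps follows, assuming a daily dataframe with a new_cases column, a holiday flag built from a short hand-picked list of dates, and a seven-day test window. The variable names and the toy values are illustrative, not taken from the notebook.

```python
import pandas as pd

# Toy daily data; the values are arbitrary.
dates = pd.date_range("2021-11-01", "2021-12-31", freq="D")
df = pd.DataFrame({"new_cases": range(len(dates))}, index=dates)

# Binary holiday indicator from a small, hand-picked list of dates (assumption).
holidays = pd.to_datetime(["2021-11-01", "2021-12-08", "2021-12-25", "2021-12-26"])
df["holiday"] = df.index.isin(holidays).astype(int)

# Make the daily frequency explicit on the index.
df = df.asfreq("D")

# Endogenous target and exogenous regressors.
endog = df["new_cases"]
exog = df[["holiday"]]

# Chronological split: the last `horizon` days are held out for testing.
horizon = 7
endog_train, endog_test = endog.iloc[:-horizon], endog.iloc[-horizon:]
exog_train, exog_test = exog.iloc[:-horizon], exog.iloc[-horizon:]

print(endog_train.index.max(), endog_test.index.min())
```

The split is done by position rather than at random because shuffling would leak future information into the training set.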

3.2 SARIMA ACF/PACF

The SARIMA model is an extension of the ARIMA model that includes seasonal components. To determine the appropriate parameters for the SARIMA model, we use the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots:

  • ACF Plot: The ACF plot shows the correlation of the time series with its own lagged values. It helps to identify the order of the Moving Average (MA) component. A slow, gradual decline suggests that the series is non-stationary and needs differencing. A sharp cut-off after lag q suggests an MA component of order q, and spikes at multiples of the seasonal period point to seasonal terms.
  • PACF Plot: The PACF plot shows the partial correlation of the time series with its own lagged values, after controlling for the correlations at shorter lags. It helps to identify the order of the AutoRegressive (AR) component. A sharp cut-off after lag p suggests an AR component of order p.
  • Seasonal Decomposition: The time series is decomposed into trend, seasonal, and residual components. This helps to visualize the underlying patterns and confirm the presence of seasonality, which is crucial for choosing the seasonal parameters of the SARIMA model.

By analyzing the ACF and PACF plots, along with the seasonal decomposition, we can make informed decisions about the parameters of the SARIMA model.

3.3 Find Optimal Hyperparameters

Finding the optimal hyperparameters for the SARIMAX model is a critical step to ensure the model’s accuracy and reliability. This involves selecting the values for the following parameters:

  • p: The order of the autoregressive component (AR).
  • d: The order of differencing needed to make the series stationary.
  • q: The order of the moving average component (MA).
  • P: The order of the seasonal autoregressive component (SAR).
  • D: The order of seasonal differencing.
  • Q: The order of the seasonal moving average component (SMA).
  • m: The number of periods in each season (e.g., 7 for weekly seasonality in daily data).
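For reference, these orders can be read off a common textbook form of the model, a regression with SARIMA errors. The notation is an assumption introduced here and does not come from the notebook: B is the backshift operator (B y_t = y_{t-1}), y_t is the number of new cases, x_t the vector of exogenous regressors with coefficients β, and ε_t white noise.

```latex
\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\,\bigl(y_t - \beta^\top x_t\bigr)
  = \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t,
\qquad
\begin{aligned}
\phi_p(B)   &= 1 - \phi_1 B - \dots - \phi_p B^p, &
\theta_q(B) &= 1 + \theta_1 B + \dots + \theta_q B^q,\\
\Phi_P(B^m) &= 1 - \Phi_1 B^m - \dots - \Phi_P B^{mP}, &
\Theta_Q(B^m) &= 1 + \Theta_1 B^m + \dots + \Theta_Q B^{mQ}.
\end{aligned}
```

With m = 7 and D = 1, for example, the factor (1 - B^7) compares each day with the same weekday one week earlier.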

To automate the selection of these hyperparameters, the auto_arima function from the pmdarima library is used. It searches over a range of candidate values (by default with a stepwise search rather than an exhaustive grid) and selects the combination that minimizes an information criterion such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). These criteria reward goodness of fit while penalizing the number of parameters.

Using auto_arima, we can efficiently find a well-fitting set of hyperparameters for the SARIMAX model, one that captures the essential patterns in the data while limiting the risk of overfitting.

By following these steps, we build a SARIMAX model for predicting the future number of Covid-19 new cases in Italy. The prepared data, the analysis of ACF/PACF plots, and the optimized hyperparameters all contribute to the model’s effectiveness and reliability.

4. Predict and Evaluate the Model

Once the SARIMAX model is built and trained, the next steps are to use the model for making predictions and to evaluate its performance. This section will cover the conceptual aspects of these steps as demonstrated in the Kaggle notebook.

4.1 Predict

With the SARIMAX model trained on historical data, we proceed to make predictions for future values of new Covid-19 cases. The process involves:

  • Forecasting Future Values: The model generates forecasts for a specified number of future time periods. For instance, in this notebook, the model is used to predict new cases for the next seven days. The model takes into account the historical data, seasonal patterns, and exogenous variables (such as holidays) to make these predictions.
  • Generating Forecasts with Exogenous Variables: If the model includes exogenous variables, future values of these variables must also be provided to generate accurate forecasts. This ensures that the model considers all relevant factors affecting the time series during the prediction period.
  • Visualization of Predictions: The predicted values are typically visualized alongside the actual values to provide a clear comparison. This helps in assessing how well the model is performing and whether the predictions align closely with the observed data. Visualization can include line plots showing the predicted and actual values over time.

4.2 Evaluate the Model

Evaluating the model’s performance is a critical step to understand its accuracy and reliability. This involves several key metrics and techniques:

  • Error Metrics: Various error metrics are used to quantify the model’s performance. Common metrics include:
      • Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values. Lower values indicate better model performance.
      • Root Mean Squared Error (RMSE): The square root of MSE, providing error in the same units as the data.
      • Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values.
      • Mean Absolute Percentage Error (MAPE): Provides error as a percentage of the actual values, useful for comparing performance across different scales.
      • R-squared (R²): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. Higher values indicate better model performance.
  • Residual Analysis: Examining the residuals (differences between actual and predicted values) helps in diagnosing any patterns or biases in the model. Ideally, residuals should be randomly distributed with no discernible patterns, indicating a well-fitted model.
  • Visualization of Actual vs. Predicted Values: Plotting the actual values against the predicted values provides a visual assessment of the model’s accuracy. This can highlight periods where the model performs well and periods where it may struggle.
  • Model Diagnostics: Additional diagnostic plots can be used to assess the model’s assumptions and performance. These plots might include:
      • Residual Plots: To check for homoscedasticity (constant variance of residuals).
      • Q-Q Plots: To check if residuals follow a normal distribution.
      • Autocorrelation Plots of Residuals: To ensure residuals are not autocorrelated.
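The error metrics listed above are short enough to write out by hand. Here is a minimal sketch of an error_metrics helper using only NumPy; the notebook itself may rely on sklearn for some of these, and the dictionary return format and sample numbers are assumptions.

```python
import numpy as np


def error_metrics(y_test, y_pred) -> dict:
    """Compute R², MSE, RMSE, MAE, and MAPE (in percent) for a forecast."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_test - y_pred

    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                    # unexplained variation
    ss_tot = float(np.sum((y_test - y_test.mean()) ** 2))     # total variation

    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": float(np.mean(np.abs(residuals))),
        # MAPE is undefined when an actual value is zero.
        "MAPE": float(np.mean(np.abs(residuals / y_test)) * 100),
    }


# Made-up actual and predicted values.
metrics = error_metrics([100, 200, 300, 400], [110, 190, 310, 390])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For the diagnostic plots, a fitted statsmodels SARIMAX result offers plot_diagnostics(), which draws the standardized residuals, a histogram, a Q-Q plot, and a correlogram in one figure.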

By thoroughly evaluating the model using these metrics and techniques, we can gain confidence in its predictions and identify areas for potential improvement. This comprehensive approach ensures that the SARIMAX model is robust and reliable for forecasting Covid-19 new cases in Italy.

This completes the conceptual explanation of the notebook for predicting Covid-19 new cases using SARIMAX. For detailed code and implementation, please refer to the Kaggle notebook itself.

Gianpiero Andrenacci
Data Bistrot

AI & Data Science Solution Manager. Avid reader. Passionate about ML, philosophy, and writing. Ex-BJJ master competitor, national & international titleholder.