Revealing The Power of Time Series Forecasting: Predicting Ecuador’s Grocery Sales with Precision

Isaac Rambo a.k.a Data Rambo
9 min read · Sep 20, 2023


Elevating Retail Analytics: Predicting Ecuador’s Grocery Sales — The Favorita Forecasting Challenge!

Click here to view Super Sales Dashboard

Introduction

Welcome to the exciting world of time series forecasting, where we embark on a journey to predict store sales for Favorita, one of Ecuador’s largest and most prominent grocery retailers. In this data-driven project, we delve deep into the intricate art of predictive analytics, armed with historical sales data, cutting-edge machine learning techniques, and the drive to optimize the future of retail.

In this machine learning regression project, the goal is to develop a model that can accurately predict the value of the dependent variable based on the values of the independent variables. The model is developed by training the algorithm on a dataset of historical data. The algorithm learns from the data and identifies patterns that can be used to predict the value of the dependent variable.

Once the model is trained, it can be used to predict the value of the dependent variable for new data points. This can be used to make decisions about future outcomes, such as predicting sales, forecasting demand, or assessing risk.

Imagine the ability to foresee sales trends, optimize inventory management, and ensure customer satisfaction with pinpoint accuracy. As we explore this captivating time series forecasting problem, we’ll unravel the mysteries of Ecuador’s retail landscape, exploring trends, patterns, and hidden insights that will empower Favorita to thrive in an ever-evolving market.

Join me as we navigate through this compelling data-driven adventure, utilizing advanced analytics to make precise predictions, maximize efficiency, and shape the future of grocery retail in Ecuador. Fasten your seatbelts as we embark on a journey to harness the power of data for the benefit of Favorita and the entire retail industry.

Business Understanding

This is a time series forecasting problem. In this project, we’ll predict store sales using data from Corporación Favorita, a large Ecuador-based grocery retailer.

Specifically, we are to build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores.

The training data includes dates, store and product information, whether an item was being promoted, and the sales numbers. Additional files include supplementary information that may be useful in building your models.

IMPORTS

1. Data Access and Database Connectivity:

  • pyodbc: This library allows data professionals to connect to various database management systems, facilitating the retrieval and manipulation of data directly from databases.
  • dotenv: This library helps in managing environment variables, which can be crucial for securely storing database credentials and other sensitive information.

2. Data Manipulation:

  • numpy and pandas: These libraries are essential for data manipulation, offering powerful tools for handling and transforming data, from numerical arrays to structured data frames.

3. Data Visualization:

  • matplotlib, plotly, seaborn, and missingno: These visualization libraries provide an extensive toolkit for creating a wide range of static and interactive plots, graphs, and visualizations to convey data insights effectively.

4. Time Series Analysis:

  • statsmodels.tsa.seasonal: Time series decomposition is vital in understanding temporal data patterns, and this library offers tools for decomposing time series data into its constituent components (e.g., trend, seasonality, and noise).

5. Statistical Analysis:

  • scipy.stats and statsmodels.stats.weightstats: Statistical analysis is fundamental in drawing meaningful conclusions from data. These libraries provide functions for hypothesis testing, calculating various statistical measures, and performing t-tests.

6. Time Series Modeling:

  • pmdarima.arima and arch.unitroot: Time series modeling libraries are used to build forecasting models and assess stationarity and unit root properties of time series data, which are crucial for model selection and validation.

7. Miscellaneous:

  • random and warnings: These libraries allow for random number generation and control over warnings, respectively, which can be important in ensuring the reproducibility and reliability of data analysis code.

The import section of the notebook also uses the dotenv library to keep sensitive information such as database credentials out of the code, and the warnings library to filter out unnecessary warnings so that the output stays readable.
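As a minimal sketch of the credential handling described above, the snippet below assembles a connection string from environment variables. The variable names (DB_SERVER, DB_NAME, DB_USER, DB_PASSWORD) and the helper `build_connection_string` are hypothetical, since the article does not list the real ones, and the driver clause is left out because the article does not name a driver. In the notebook, `load_dotenv()` from the dotenv package would normally populate the environment from a `.env` file first.

```python
import os
import warnings

# Hypothetical variable names; the article does not list the real ones.
REQUIRED_KEYS = ("DB_SERVER", "DB_NAME", "DB_USER", "DB_PASSWORD")


def build_connection_string(env=os.environ):
    """Assemble an ODBC-style connection string from environment variables."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    # A real ODBC string also needs a DRIVER={...} clause for the installed driver.
    return (
        f"SERVER={env['DB_SERVER']};DATABASE={env['DB_NAME']};"
        f"UID={env['DB_USER']};PWD={env['DB_PASSWORD']}"
    )


# Silence noisy library warnings so notebook output stays readable.
warnings.filterwarnings("ignore")
```

Failing early on a missing variable tends to be easier to debug than a vague login error from the database server.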

LOADING DATASET

In this project, the dataset has been sourced from three distinct locations. The initial records are stored in a remote database, requiring remote access to retrieve the data.

# Query statement to fetch oil, holidays_events and stores data from the remote server
oil_query = 'SELECT * FROM dbo.oil'
holidays_query = 'SELECT * FROM dbo.holidays_events'
stores_query = 'SELECT * FROM dbo.stores'

The next set of records is contained in two CSV files, test.csv and sample_submission.csv, accessible from OneDrive. Lastly, the final portion, the train dataset, is stored as a compressed CSV file in a GitHub repository.

# Read all data from different sources
import pandas as pd

# Note: pandas expects `connection_string` to be an open database connection
# or a SQLAlchemy URL that it can query
oil = pd.read_sql_query(oil_query, connection_string, parse_dates=['date'])
holidays_events = pd.read_sql_query(holidays_query, connection_string, parse_dates=['date'])
stores = pd.read_sql_query(stores_query, connection_string)

# Files downloaded from OneDrive and GitHub are read from the local data folder
transactions = pd.read_csv('./data/transactions.csv', parse_dates=['date'])
train = pd.read_csv('./data/train.csv', parse_dates=['date'])
test = pd.read_csv('./data/test.csv', parse_dates=['date'])

This diverse collection of data sources underscores the complexity of real-world scenarios, where data analysts must adeptly navigate multiple platforms to gather comprehensive datasets for analysis.

UNDERSTANDING THE DATA

“File Descriptions and Information about Data Fields

train.csv

This file contains the training data, consisting of time series records that include essential features such as store_nbr, family, and onpromotion, as well as the target variable: sales.

- store_nbr: Identifies the specific store where the products are sold.
- family: Indicates the product type being sold.
- sales: Represents the total sales for a product family at a particular store on a given date. It’s important to note that sales values can be fractional, reflecting the sale of fractional units (e.g., 1.5 kg of cheese) rather than whole units (e.g., 1 bag of chips).
- onpromotion: Denotes the total count of items within a product family that were under promotion at a store on a specific date.

test.csv

Similar to the training data, the test data also includes features identical to those in the training set. In this file, your task is to predict the target variable, sales, for the provided dates. Notably, the dates in the test data span the 15 days following the last date in the training dataset.

transactions.csv

This dataset contains information regarding transactions, featuring date, store_nbr, and the total number of transactions recorded on that particular date.

sample_submission.csv

A sample submission file is included, showcasing the correct format for submitting your predictions.

stores.csv

Store metadata is provided in this file, encompassing details such as the store’s city, state, type, and cluster. Cluster categorizes similar stores into groups for analysis.

oil.csv

Daily oil prices are included in this dataset, spanning both the train and test data timeframes. Given Ecuador’s heavy reliance on oil exports, these oil prices are of paramount significance for understanding the country’s economic health and its susceptibility to oil price fluctuations.

holidays_events.csv

This dataset contains comprehensive information about holidays and special events, complemented by relevant metadata.”

CLICK HERE TO GET DATASETS

HYPOTHESIS

Null Hypothesis (H0): The number of products under promotion does not influence sales in supermarkets.

Alternative Hypothesis (H1): The number of products under promotion significantly influences sales in supermarkets.

RATIONALE

The justification behind conducting these hypothesis tests is to assess whether there exists concrete empirical evidence substantiating the notion that promotions exert a significant influence on supermarket sales.

By subjecting these hypotheses to scrutiny and exploring the relationship between promotions and sales, enterprises can acquire valuable insights into the intricate mechanisms governing supermarket sales. These insights, rooted in empirical data, can subsequently guide businesses in formulating promotional strategies based on concrete evidence, enhancing their decision-making processes.
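One way such a test could be set up is Welch's t-test, comparing sales on days with promotions against days without. The sketch below is an illustration under stated assumptions, not the project's actual test: the sales figures are made up, the helper `welch_t_test` is hypothetical, and the p-value uses a normal approximation that is only reasonable for large samples. In practice, `scipy.stats.ttest_ind(..., equal_var=False)` would be the usual route.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (t statistic, approximate two-sided p-value) for two samples."""
    # Standard error of the difference in means, without assuming equal variances.
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    t_stat = (mean(sample_a) - mean(sample_b)) / standard_error
    # Normal approximation to the t distribution (acceptable for large samples only).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Made-up daily sales, for illustration only.
sales_with_promo = [12.0, 15.0, 14.0, 16.0, 13.0, 17.0]
sales_without_promo = [9.0, 10.0, 8.0, 11.0, 9.0, 10.0]

t_stat, p_value = welch_t_test(sales_with_promo, sales_without_promo)
print(f"t = {t_stat:.2f}, approximate p = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would lead us to reject H0 in favour of H1.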

ANALYTICAL QUESTIONS

The questions below are to be answered. Note that you are free to draw further hypotheses from the data.

  1. Is the train dataset complete (has all the required dates)?
  2. Which dates have the lowest and highest sales for each year?
  3. Did the earthquake impact sales?
  4. Are certain groups of stores selling more products? (Cluster, city, state, type)
  5. Are sales affected by promotions, oil prices and holidays?
  6. What analysis can we get from the date and its extractable features?
  7. What is the difference between RMSLE, RMSE, MSE (or why is the MAE greater than all of them?)
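The first question, whether the train dataset covers every date, can be checked by comparing the observed dates against a full calendar range. The sketch below uses a toy list in place of the train `date` column, and the helper name `find_missing_dates` is an assumption introduced for illustration.

```python
from datetime import date, timedelta


def find_missing_dates(observed_dates):
    """Return the calendar days between the first and last date that never appear."""
    observed = set(observed_dates)
    start, end = min(observed), max(observed)
    all_days = (start + timedelta(days=offset) for offset in range((end - start).days + 1))
    return [day for day in all_days if day not in observed]


# Toy example: 25 December is absent from the sequence, and one date is repeated.
sample_dates = [date(2016, 12, 23), date(2016, 12, 24), date(2016, 12, 26), date(2016, 12, 24)]
print(find_missing_dates(sample_dates))
```

An empty result would mean the date range is complete; any returned dates are gaps that need to be explained or filled before modeling.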

see more here

DATA QUALITY CHECK

There were missing data points in the dcoilwtico column (the daily oil price) of the oil dataset.
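The article does not say how these gaps were handled. One common option, shown here purely as an assumption, is linear interpolation between the nearest known prices, with gaps at the edges copying the closest known value. The helper `interpolate_gaps` and the prices are illustrative; pandas' `Series.interpolate` offers the same idea on a real column.

```python
def interpolate_gaps(values):
    """Fill None entries linearly; gaps at the edges copy the nearest known value."""
    known = [i for i, value in enumerate(values) if value is not None]
    if not known:
        raise ValueError("need at least one known value")
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if not before:
            filled[i] = values[after[0]]
        elif not after:
            filled[i] = values[before[-1]]
        else:
            left, right = before[-1], after[0]
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Made-up oil prices with gaps at the start, middle, and end.
prices = [None, 93.0, None, None, 96.0, None]
print(interpolate_gaps(prices))
```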

EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis (EDA) is a critical initial phase in the data analysis process. It is a systematic approach used by data professionals to gain a deeper understanding of a dataset before undertaking formal statistical modeling or hypothesis testing. EDA involves the use of statistical and visual techniques to summarize, visualize, and explore data.

HYPOTHESIS TESTING

Hypothesis testing plays a fundamental role in time series analysis, a field of statistical analysis that deals with data points collected, recorded, or measured at sequential points in time. Time series data often exhibit temporal dependencies, trends, and seasonality, making hypothesis testing a powerful tool to uncover meaningful patterns and relationships. In this project it was used to test whether promotions influence sales. See more on my GitHub.

DATA PREPROCESSING

Data preprocessing is a crucial step in the data analysis pipeline. It involves a series of operations and transformations applied to raw data to make it suitable for analysis and modeling. Data preprocessing helps ensure data quality, enhances the performance of machine learning algorithms, and enables the extraction of meaningful insights.
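A typical preprocessing step for time series data is to break each date into calendar features that a model such as XGBoost can consume. The sketch below is one possible version; the feature names and the helper `extract_date_features` are assumptions, not the exact features used in the project.

```python
from datetime import date


def extract_date_features(day):
    """Break a date into calendar features a model can consume."""
    return {
        "year": day.year,
        "month": day.month,
        "day_of_month": day.day,
        "day_of_week": day.weekday(),  # Monday = 0, Sunday = 6
        "is_weekend": day.weekday() >= 5,
    }


print(extract_date_features(date(2017, 8, 15)))
```

Features like these let a tree-based model pick up weekly and yearly patterns that a raw date value would hide.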

MODELLING

Time series models are statistical and machine learning techniques designed to analyze and forecast data points collected at regular intervals over time. Each of these models has its unique strengths and applications, making them valuable tools in time series analysis and forecasting. In this project, we explored ARIMA, ETS, SARIMA, and XGBoost models.

After carefully assessing the performance of our models using key evaluation metrics, it is evident that the XGBoost model stands out as the most effective choice for our dataset. See more on GitHub.

MODEL EVALUATION

Evaluating time series models is crucial to assess their performance, reliability, and accuracy in forecasting future values. Whether you’re using statistical models like ARIMA, machine learning models like XGBoost, or other forecasting techniques, robust evaluation ensures that your model’s predictions are trustworthy.
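To make the metrics from the analytical questions concrete, here is a small sketch of MAE, MSE, RMSE, and RMSLE in plain Python. The actual and predicted values are made up for illustration and are not results from this project.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average size of the errors, in sales units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error: average squared error, in squared sales units."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of MSE, back in sales units."""
    return math.sqrt(mse(actual, predicted))


def rmsle(actual, predicted):
    """Root mean squared logarithmic error: RMSE computed on log(1 + value)."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(actual))


# Made-up values, for illustration only.
actual = [0.0, 10.0, 100.0]
predicted = [1.0, 12.0, 80.0]

for metric in (mae, mse, rmse, rmsle):
    print(f"{metric.__name__}: {metric(actual, predicted):.3f}")
```

On the same data, MAE never exceeds RMSE, because squaring gives large errors more weight. MSE is in squared units, so it can fall below MAE when the errors are smaller than 1. RMSLE works on a log scale, so it reflects relative rather than absolute errors and is usually a much smaller number than the others.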

Recommendations

  1. Promotion Optimization
  2. Focus on High-Performing Cities
  3. Cluster-Centric Approach
  4. Cross-Analysis Opportunities

Read more of Recommendations here

To Sum Up

The time series project successfully addressed the objectives of analyzing and forecasting sales trends in a retail environment. The combination of statistical and machine learning models, along with thorough data preprocessing and EDA, enabled the generation of actionable insights. These insights empower Favorita to make data-driven decisions and enhance its overall operational efficiency. The project highlights the significance of time series analysis in optimizing grocery sales processes and improving decision-making in retail and similar industries.

Let’s Connect on My LinkedIn Profile below:

https://www.linkedin.com/in/isaac-agbogah/

You can also reach out to me on Instagram @fantasticrambo

Special Thanks

To God Almighty for strength, and to my teammate Solomon. I would also like to express my sincere gratitude to the Azubi Africa team for their support on this project, and to thank all of my readers for taking the time to read and react to it. Your feedback has been invaluable, and I have learned a great deal from it. Don’t forget to click on the “clap” icon below if you have enjoyed reading this article. Thank you for your time.


Hi there! I'm Isaac, a Data Analyst, YouTuber, Python programmer, teaching assistant, web designer, and content creator. It's nice to meet you! Connect with me!