5 Cool Ways to Enrich ML Models with Open Data for Free: An In-depth Review of Python Libraries

With code examples

Soner Yıldırım
Published in Geek Culture
14 min read, Sep 1, 2022

Photo by Towfiqu barbhuiya on Unsplash

Many businesses use machine learning algorithms to solve forecasting problems. Predicting the values of time series data is especially common and can create real business value. For example, a typical task in retail is to forecast sales in individual stores or in certain categories of goods. Another example is forecasting the demand for rail and air tickets to certain destinations.

All these forecasting problems are strongly tied to people’s behavior, which is influenced by factors such as weather, holidays, the state of the national economy, and global processes. These factors must be taken into account if you want to build a predictive model that produces accurate and robust results.

The first requirement for such a model is, of course, data. In this article, we will focus on the task of collecting data and explore 5 popular libraries that retrieve the necessary data for given dates. Based on the obtained data, we will then construct derived features and enrich the training set with them.

We will review in depth 5 of the most popular libraries that provide access to different data types:

  • 📚holidays — holidays in different countries
  • 📚yfinance — stock data from Yahoo Finance
  • 📚meteostat — weather data from weather stations around the world
  • 📚pandas-datareader — stock data and economic statistics from many sources around the world
  • 📚upgini — ready-made features based on many sources

All these libraries can be installed via pip:

pip install holidays yfinance meteostat pandas-datareader upgini
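
Before looking at each library, it may help to see the general enrichment pattern in isolation: join an external, date-keyed table onto the training rows, then derive features from the result. The sketch below uses only pandas. The function name enrich_by_date, the column names, and all values are hypothetical and made up for illustration; in practice the external table would come from one of the libraries listed above.

```python
import pandas as pd


def enrich_by_date(train: pd.DataFrame, external: pd.DataFrame) -> pd.DataFrame:
    """Left-join a date-keyed external table and add simple derived features.

    Hypothetical helper for illustration; expects a 'date' column in both
    frames and a 'holiday_name' column in the external one.
    """
    # A left join keeps every training row, even when no external data exists.
    out = train.merge(external, on="date", how="left")
    # Derived feature: 1 if the external table had an entry for that date.
    out["is_holiday"] = out["holiday_name"].notna().astype(int)
    # Calendar features computed from the date itself (Monday = 0).
    out["day_of_week"] = out["date"].dt.dayofweek
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    return out


# Made-up training rows: one sales figure per date.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-24", "2021-12-25", "2021-12-26", "2021-12-27"]),
    "sales": [120, 45, 80, 95],
})

# Made-up external table keyed by date.
external = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-25"]),
    "holiday_name": ["Christmas Day"],
})

enriched = enrich_by_date(sales, external)
print(enriched)
```

The left join is a deliberate choice here: rows without external data stay in the training set with missing values instead of being dropped, so the number of training examples does not change.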

Creating the training data

Let’s imagine we need to solve a very typical problem: predict the volume of sales for different categories of products in different stores across several countries. We will start by generating a dataset that resembles real sales statistics using pandas.
