Sliding Window Price Predictions

Lauren Faulds
Published in Analytics Vidhya · 6 min read · Jul 21, 2020

The code used here is available in its original repository in .ipynb format. You can download it & fiddle with it in Jupyter Notebook on your own device.

Intro

Today we’ll see how historic produce prices can be used to make predictions over a twenty-year period. This will be done in Python using a simple linear regression model. Beautiful Soup 4 helps with parsing the observations from an online source. This data will then be accessed & manipulated through a Pandas dataframe. Training will be done on a sliding window; this and model fitting will be conducted using scikit-learn.

Our package imports:

from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
from pandas import concat
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt

This post covers the core functions for linear regression using a sliding window on time series data. To see more on visualizing the results of our linear regression with Yellowbrick, as well as other findings, please see the original repository.

✧ Outline of this Post:

  1. Preparation — Getting model-ready data with acquisition, cleaning & feature engineering
  2. Model Fitting — Linearly regressing time series splits
  3. About the Data — More on produce

Acquisition

The function created for acquiring the produce data is tailored to one particular website: the Produce Price Index. Here’s a look at the sort of information we’ll be extracting.

Commodity, Record Date, Farm Price, Atlanta Retail, Chicago Retail, Los Angeles Retail, New York Retail, Average Spread

We’ll need to parse the HTML to save each row’s unique values. At 638 pages, there are about 15,950 rows. Luckily, with the package Beautiful Soup (v4) it’s easy enough to distill what we need from this resource (see the Beautiful Soup documentation for the installation process). Values are held in specific HTML classes; a produce name like Strawberries sits in a “fix-width-16” cell. These classes’ contents will be extracted, then zipped together as tuples.

def getproducedata(inpages):
    '''
    The number argument inpages is the page the function finishes on.
    Each row's values are stored in a tuple & added to the returned
    list tuple_data.
    '''
    tuple_data = []
    url = "http://www.producepriceindex.com/produce-price-index?field_ppi_commodity_target_id=All&field_ppi_date_value%5Bmin%5D&field_ppi_date_value%5Bmax%5D&page={}"

    for page_num in range(inpages):
        page_url = url.format(page_num)
        rawhtml = requests.get(page_url).text
        body = BeautifulSoup(rawhtml, "lxml").tbody

        # Produce names are held in the "fix-width-16" cells
        producelist = []
        for produceitem in body.find_all("td", {"class": "fix-width-16"}):
            producelist.append(produceitem.text.strip())

        # Dates & the five prices are held in the "fix-width-12" cells
        pricedatelist = []
        for pricedate in body.find_all("td", {"class": "fix-width-12"}):
            pricedatelist.append(pricedate.text.strip())
        # The flat list is broken into chunks of 6 values, one chunk per row
        pricedate_list = [pricedatelist[x:x+6] for x in range(0, len(pricedatelist), 6)]

        # Spaces are removed before the spread value is added
        spreadlist = []
        for each in body.find_all("td", {"class": "views-field views-field-expression-4"}):
            spreadlist.append(each.text.strip())

        # Combine the spread with the rest of the pricedate list
        pricedate_list = [x + [y] for x, y in zip(pricedate_list, spreadlist)]

        # Combine produce names & pricedates
        producetuples = list(zip(producelist, pricedate_list))
        tuple_data += producetuples

    return tuple_data

We’ve successfully retrieved the data by formatting the URL & collecting the relevant information from each page. Next we’ll work it into a usable dataframe.

Cleaning


We’ll use Pandas to pop our tuples into a dataframe and get it cleaned up. We’ll be fixing our columns so each carries a single value. Additional columns will also be created that carry information like the averaged price & date attributes.
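
First the scraped tuples need to go into a frame. A minimal sketch, assuming the two columns are named Produce & Attris (the names the cleaning function below expects) and using the 638-page count from the scrape above:

# Build a two-column dataframe from the scraped tuples
tuple_data = getproducedata(638)
produceframe = pd.DataFrame(tuple_data, columns=['Produce', 'Attris'])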

def cleanproduce(dataframe):
    '''
    Values are divided into appropriate columns,
    errors are removed in the process,
    & the Date becomes the index for time series analysis.
    '''
    # The 2nd column's data is broken up, & the 2nd column removed
    dataframe[['Date', 'Farm_Price', 'Atlanta_Retail', 'Chicago_Retail',
               'LA_Retail', 'NY_Retail', 'AVGspread']] = pd.DataFrame(
        dataframe["Attris"].to_list(),
        columns=['Date', 'Farm_Price', 'Atlanta_Retail', 'Chicago_Retail',
                 'LA_Retail', 'NY_Retail', 'AVGspread'])
    dataframe = dataframe.drop('Attris', axis=1)

    # Columns made numeric with the leading non-number character removed
    for each in ['Farm_Price', 'Atlanta_Retail', 'Chicago_Retail', 'LA_Retail', 'NY_Retail']:
        dataframe[each] = pd.to_numeric(dataframe[each].str[1:])
    # The spread made numeric & converted from a percentage to a decimal
    dataframe['AVGspread'] = dataframe['AVGspread'].replace({'%': '', ',': ''}, regex=True).astype(float)
    dataframe['AVGspread'] = dataframe['AVGspread'] / 100
    # Average retail price across the four cities
    dataframe['AVGprice'] = dataframe[['Atlanta_Retail', 'Chicago_Retail', 'NY_Retail', 'LA_Retail']].mean(axis=1)

    # Removing data intake errors
    dataframe.dropna(inplace=True)
    dataframe = dataframe[dataframe.AVGprice < 9]

    # Order the data chronologically
    dataframe = dataframe.iloc[::-1]
    # The frame can be made into a time series; a numeric index is preserved
    dataframe['Date'] = pd.to_datetime(dataframe.Date)
    dataframe['Year'] = dataframe.Date.dt.year
    dataframe['Month'] = dataframe.Date.dt.month
    dataframe["Num_Index"] = range(1, len(dataframe) + 1)
    dataframe = dataframe.set_index('Date')

    # Dummy variables are created for the produce category
    dataframe["Produce_"] = dataframe["Produce"]
    dataframe = pd.get_dummies(dataframe, columns=["Produce_"], prefix='Produce', drop_first=True)
    return dataframe

With dummy variables, each produce category is now a binary feature usable for linear regression. The original categorical column is still preserved, which enables the groupby functions that assist in our feature engineering.
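
A tiny illustration of that design choice, with made-up rows (Avocados here is just a placeholder name):

demo = pd.DataFrame({'Produce': ['Strawberries', 'Avocados', 'Strawberries']})
demo['Produce_'] = demo['Produce']
demo = pd.get_dummies(demo, columns=['Produce_'], prefix='Produce', drop_first=True)
print(demo.columns.tolist())  # ['Produce', 'Produce_Strawberries']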

Creating Features

The time series features we add now are simple but effective. Each produce price observation will include the previous recorded price & an average of the last three. This creates a small number of missing values wherever an item has no previous observation, or fewer than three; those rows will be removed afterwards.

def timeseriesfeatures(dataframe):
    # Last price as well as the average of the last 3 prices saved
    previousprice = dataframe.groupby('Produce', as_index=False)['AVGprice'].shift(1)
    avg3prev = previousprice.rolling(3).mean()

    # A dataframe of these values is created & indexed
    prevsavgs = concat([avg3prev, previousprice], axis=1)
    prevsavgs.columns = ['Window_AVG', 'Prev_AVG']

    # The original dataframe acquires the new features & drops the resulting NAs
    dataframe = concat([dataframe, prevsavgs], axis=1)
    dataframe = dataframe.dropna()

    return dataframe

Our model uses these previous observations to estimate future price.
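
Chaining the steps so far gives a model-ready frame (a usage sketch, carrying the produceframe name over from above):

# Clean the scraped frame, then attach the lag features
produceframe = cleanproduce(produceframe)
produceframe = timeseriesfeatures(produceframe)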

Modeling


When we’re doing time series predictions, we want machine learning that uses only the information available at a given point in time. The data will be divided into folds for training on past information & testing novel predictions.

We might be given one training fold to test one testing fold: this would be a sliding window. Or we might progressively accumulate our folds to test with more data: this would be an expanding window. scikit-learn offers a TimeSeriesSplit object that easily provides an expanding window.
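
On a handful of dummy samples you can watch the training indices accumulate (a quick sketch):

# Expanding window: each successive training set grows
for tr, te in TimeSeriesSplit(n_splits=3).split(np.arange(8)):
    print(tr, te)
# [0 1] [2 3]
# [0 1 2 3] [4 5]
# [0 1 2 3 4 5] [6 7]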

You may be wondering: why pick a sliding window?

Our dataset stretches across a long period of time & experiences produce price volatility, so older patterns may hinder training for newer ones. I found smaller test errors with a sliding window. The linear regression model only sees a selected frame before the event & can learn from the most relevant trends.

So we’ll build a sliding window from our expanding window object to make predictions as well as analyze our mistakes.
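
The function below expects tscv & y to already exist in the notebook’s scope. A minimal setup sketch: the AVGprice target & the ten splits match the results discussed later, while the exact feature columns are an assumption (the code only confirms that Year is among them).

# Target variable & a hypothetical feature selection; 'Year' is
# confirmed by the function's print statement, the rest is assumed
y = produceframe['AVGprice']
feature_cols = (['Year', 'Month', 'Num_Index', 'Prev_AVG', 'Window_AVG']
                + [c for c in produceframe.columns if c.startswith('Produce_')])
X = produceframe[feature_cols]

# Expanding-window splitter; 10 splits for the 10 iterations reported below
tscv = TimeSeriesSplit(n_splits=10)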

def timeseriesregression(X):
    '''
    Builds the tscv expanding window into a sliding window,
    fitting to train/test splits & storing test scores &
    indices in lists. Expects tscv & y in the surrounding scope.
    '''
    trained_on = []
    r_scores = []
    rmse_scores = []
    prev_train = 0
    for train_index, test_index in tscv.split(X):
        # An array of indices is created that starts at the finish
        # of the previous training set & ends at the start of the
        # current test set
        last_step = test_index[0]
        a_train_index = np.arange(prev_train, last_step)
        # Train & test portions are then allocated
        X_train, X_test = X.iloc[a_train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[a_train_index], y.iloc[test_index]
        # A linear regression model is fit to the current training window
        lrm = LinearRegression()
        lrm.fit(X_train, y_train)
        print('Test For Rows:', test_index)
        print('For Years:', X_test['Year'].unique())
        # Validation scores are saved with their indices for later analysis
        r_score = lrm.score(X_test, y_test)
        r_scores.append(r_score)
        rmse_score = sqrt(mean_squared_error(y_test, lrm.predict(X_test)))
        rmse_scores.append(rmse_score)
        print(r_score, rmse_score, '\n')
        trained_on.append(a_train_index)
        # The final training row is saved for the next iteration
        prev_train = train_index[-1]

    return trained_on, r_scores, rmse_scores
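
Calling the function & averaging the fold scores (the call itself isn’t shown in the article, so this is a usage sketch built on the setup above):

trained_on, r_scores, rmse_scores = timeseriesregression(X)
print('Mean R^2:', np.mean(r_scores))
print('Mean RMSE:', np.mean(rmse_scores))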

Looking at the results, this model gives us an average R² score of 0.91 over 10 iterations. Performance is best on the first couple of iterations as well as the last couple. Further analysis could show what needs optimizing for the middle of the dataset, where performance is lower.

About the Dataset

Produce Price Index

Starting in 1999, Western Growers took on documenting produce prices at the farm & at retail in Atlanta, Chicago, Los Angeles & New York. Two decades later, in 2019, Western Growers concluded their price-tracking venture, at least for the time being. They are advocates for producers in the agricultural sector, & more on their work can be found on their website.
