SARIMA Modelling for Car Sharing — Basic Data Pipelines — Applications with Python Pt. 1

Data scientists don’t tell you all the steps involved in a project when you’re first starting out, especially when you’re learning from previous examples, cross-validated articles, and clean(ish) datasets from Kaggle or UCI.

Here I want to run through all the steps involved in finding, getting, hosting, cleaning, and analyzing a dataset. The git repo can be found here, and will be updated through the life of the project.

The subject of our data science project

Finding The Data

I wanted to find a live public API that I could connect to, save a series of data over time, and apply a statistical model on. Because I would be collecting time and geographical data, I thought of a local car-sharing service in Vancouver, BC called Modo. And luckily, they also had a public API I could connect to.

Modo has a bunch of different data streams to choose from:

  • A list, description, and features of each car available;
  • Where the cars are based: lat-long locations and a location description;
  • Which cars are available right now, or at any time in the future;
  • How much a booking would cost.

An interesting question to look at would be how different areas in Vancouver change their supply and demand of available cars over a 10-day period.

Examining The Data

Let’s take a look at the actual data at this endpoint. I’m using a Chrome extension called JSONview so I can quickly look at the data and get the correct nesting level for the data I want to extract.

# Here is some example output from the car location endpoint
{
  Status: "Success",
  Checksum: "00aa83873957f7eee0eed65f82492587",
  Response: {
    Locations: {
      3: {
        ID: "3",
        Name: "W 2nd & Cypress",
        ShortDescription: "Vancouver - West 2nd Avenue & Cypress Street",
        Region: "Lower Mainland",
        City: "Vancouver",
        Neighbourhood: "Kitsilano",
        Latitude: "49.269861",
        Longitude: "-123.148762",
        ExceptionClass: null
      },
      4: {
        ID: "4",
        Name: "Main Street SkyTrain",
        ShortDescription: "Vancouver - Terminal Avenue & Quebec Street",
        Region: "Lower Mainland",
        City: "Vancouver",
        Neighbourhood: "False Creek",
        Latitude: "49.272521",
        Longitude: "-123.100764",
        ExceptionClass: null
      },
      ...

Looks like there are some parts of the data I want and others I can discard. Let’s just get the following:

  • The current date and time;
  • The car id;
  • The region;
  • The city;
  • The neighbourhood, and;
  • The lat-lng values.

Finding the nested locations of the data in the JSON is quite easy if you’re using the extension above. These locations are included in our pipeline file below, so there is no need to do the searching yourself.

A Basic Pipeline

There is a file in the repo called pythonScript.py which is our basic data pipeline responsible for connecting to the Modo API, cleaning the data, and saving our results to a .csv for use later in the project.

Here are the main points of the pipeline script:

  • Create your logging file;
  • Create an empty dataframe to fill;
  • Determine how long you want the script to run for;
  • Attempt a data extraction loop, and if successful save the data;
  • Sleep until the next iteration, and;
  • Once all iterations are complete, close the connection.
# Here is part of the pythonScript.py that deals with looping
# through the JSON object returned
...
# Create an array of the current datetime, one entry per car
currentDateTime = datetime.datetime.now().isoformat()
logging.info('Current iteration is happening at %s' % currentDateTime)
dateArray = np.array([currentDateTime for i in range(lengthOfCars)])

# Pull the fields we want out of each location record
for record in keyToIterate:
    id_val = data2['Response']['Locations'][record]['ID']
    name_val = data2['Response']['Locations'][record]['Name']
    city_val = data2['Response']['Locations'][record]['City']
    neighbourhood_val = data2['Response']['Locations'][record]['Neighbourhood']
    region_val = data2['Response']['Locations'][record]['Region']
    lat_val = data2['Response']['Locations'][record]['Latitude']
    lng_val = data2['Response']['Locations'][record]['Longitude']
...
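For context, the surrounding extraction loop looks roughly like the sketch below. The endpoint URL, filename, and logging messages here are illustrative stand-ins, not the exact values in pythonScript.py:

import logging
import time

import pandas as pd
import requests

# Placeholder endpoint, not the actual Modo URL used in pythonScript.py
API_URL = 'https://example.com/modo/locations'
NUM_ITERATIONS = 960        # 4 pulls/hour * 24 hours * 10 days
SLEEP_SECONDS = 15 * 60     # 15 minutes between pulls

df = pd.DataFrame()         # empty dataframe to fill

for i in range(NUM_ITERATIONS):
    try:
        data2 = requests.get(API_URL).json()
        # ... extract the fields shown above and append rows to df ...
        df.to_csv('modo_data.csv', index=False)   # save after every pull
        logging.info('Iteration %s saved' % i)
    except Exception as e:
        logging.error('Iteration %s failed: %s' % (i, e))
    time.sleep(SLEEP_SECONDS)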

In my opinion, the most important part of the script is the logging file. Because you’re running this outside of a controlled environment like IntelliJ or Jupyter, you’ll lose much of the functionality that comes with those tools. So you need to be able to diagnose errors which could (and will) pop up while running your script.

import datetime
import logging

# Create logging file named with today's date
currentDate = datetime.date.today().isoformat()
logFileName = '%s_modo_logging_file.log' % currentDate
# Set logging file attributes
logging.basicConfig(filename=logFileName, level=logging.INFO,
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    format='%(asctime)s %(levelname)s %(message)s')
# Write to the logging file
logging.info('Starting script')

By setting up a logging file, it will make debugging much easier if something fails within the pipeline. Save yourself some headache, and practice good programming principles.

Setting Up A Server

To capture all this information, we’re going to set up a simple server in GCloud.

As set by the params in pythonScript.py, the script saves the data collected from the API every 15 minutes over 10 days. This is going to give us data for 960 time periods (4 * 24 * 10), which should be enough to run our time series model. We could do all of this from our local machines; however, using GCloud gives us an additional understanding of how to host projects outside your machine and the steps involved in doing so.

If you don’t have an account already, it’s free to sign up and they give you $500 in cloud credits. Even if you’ve burned through your free credits already, this project only cost me around $4.60 in server costs. Not much at all.

To set up a simple VM, log into GCloud and go to Compute Engine >> VM Instances >> Create Instance. You don’t need much attached to it besides a small amount of storage and some processing power, so if you want to reduce the number of CPUs attached to it, there should be no problem doing so. Now you can SSH into your VM and set up the project. Choose “Open in Browser Window” and you’ll see a new window pop up with your server running in it.

Connect to the VM instance (Left); The terminal window after SSH’ing into the VM (Right)

Now we need to get our pythonScript.py onto the server, and the best way to do this is to create a GitHub repo first, and then clone everything into the VM. To do this, you will need to register your VM with your GitHub account, and set up the SSH keys between the two. The instructions to do that can be found here.

Once you get your script into your VM, you need to make sure all the correct packages are installed for your python script to run correctly. I’ve included a requirements.txt file in the repo, and installed it by running this code in the directory:

pip install -r requirements.txt

There are only three requirements for this project:

requests==2.9.1
numpy==1.14.2
pandas==0.22.0

Now check whether your pythonScript.py can be executed through the terminal by checking the permissions on the file with ls -la. If you see an x in the permissions string on the left side of the file listing, the file is executable. If it is not there, grant the executable permission with sudo chmod +x pythonScript.py.

pythonScript.py is executable

Finally, we need to ensure our script keeps running even after we log out of the VM. We do this with the following command:

nohup python ~/path-to-script/pythonScript.py &

The nohup command runs the specified program even after the terminal is closed. E.g. If you are logged in remotely to the server and the connection drops, or if you close your terminal emulator, the script will still run.

You can check if the command is running with ps -A | grep python and you can check the output of your logging file to make sure everything is running correctly. Just write the contents of the logging file to the terminal with:

cat name-of-log-file.log

You should see that the script has started to run and is sleeping on its current cycle. Great! We’re getting the data we need.

Confirmation our script is running and it is collecting and saving our data

The Time Series Model

Let’s look at the model we’re going to use on our data, a Seasonal ARIMA (AutoRegressive Integrated Moving Average) model from the statsmodels package. I’ve chosen this model because I’m assuming there is some sort of seasonal component — in this case our season is a day — with car sharing.

I think it can reasonably be assumed that throughout a day there is some pattern in which high values always tend to occur in particular hours and low values always tend to occur in others.

So, what exactly are SARIMA models?

SARIMA models are a general class of time series models used to analyze and forecast data that have an additional seasonal component. We derive values for p, d, and q in order to make the time series stationary. A stationary series has a constant mean and variance.

Consider a data series that increases over time. The mean does not stay constant, so estimates based on today’s mean will underestimate future periods. Any estimates we make based on this mean will therefore be incorrect, and so will their correlations with other variables. We’ll come back to this later.

Below is a general explanation of a SARIMA model.

Non-seasonal (Left); Seasonal (Right)

The AR portion (p) implies the output variable at time t is a weighted sum of past values plus a stochastic term, allowing us to incorporate the effect of yesterday (or earlier) on today. This value is the number of lags we will use. E.g. the output today depends linearly on its own previous values.

The I portion (d) is the number of non-seasonal differences applied to the series to make it stationary. The model takes past values and subtracts them from the current value. E.g. the series is integrated to order 2 if taking repeated differences twice creates a stationary process, and your model would be ARIMA(0, 2, 0).

The MA portion (q) is a moving average of previous error terms. MA processes are used to model shocks to a system, which gradually dissipate over time. E.g. the output in period t responds only to random errors that occurred in previous periods.

The Seasonal portion (P, D, Q)m has the same structure as the non-seasonal parts. It may, but does not have to, include an AR factor, an MA factor, and/or an I factor. In this part of the model, all of these factors operate across the number of periods (m) in your season. This seasonality is a regular pattern of changes that repeats over m time periods. E.g. if there is a seasonal factor of 3 over a time series of monthly data, the pattern repeats every quarter.

Put these all together and you get a SARIMA(p, d, q) x (P, D, Q)m model, which is what we’re looking to build.
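Written out with the lag operator L, the combined model takes the standard form:

\phi_p(L)\,\Phi_P(L^m)\,(1 - L)^d\,(1 - L^m)^D\, y_t = \theta_q(L)\,\Theta_Q(L^m)\,\varepsilon_t

where \phi_p and \theta_q are the non-seasonal AR and MA polynomials, \Phi_P and \Theta_Q are their seasonal counterparts, and \varepsilon_t is white noise.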

The exact model we’ll be using from the python package statsmodels is SARIMAX, where the X stands for an exogenous explanatory variable. We’re not going to get into adding exogenous variables to our model, but it’s best to explain it so there is no confusion. The default when running the model is no exogenous variables, and we don’t have any extra variables to include at the moment.

An exogenous variable’s value is independent of the states of other variables included. Its value is determined by factors outside our model. E.g. Rainfall is exogenous to the demand for car sharing networks, as the more it rains the fewer people want to walk around and will probably use other means of transportation.
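If we did have such a variable, it would be passed to SARIMAX through the exog argument. A sketch, assuming some series y and an aligned, made-up rainfall series (the orders here are placeholders):

import numpy as np
import statsmodels.api as sm

# rainfall is a made-up series here, aligned index-for-index with y
rainfall = np.random.gamma(shape=2.0, scale=1.5, size=len(y))

# Exogenous regressors are passed through the exog argument
mdl = sm.tsa.statespace.SARIMAX(y, exog=rainfall,
                                order=(1, 0, 0),
                                seasonal_order=(0, 0, 0, 12))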

Simple SARIMA Model

Let’s look at a randomly generated random-walk time series y to get us primed for looking over the Modo data we’re extracting in our pipeline. y models the percentage return of a stock over the past decade. This is just a randomly generated series and is not drawn from any particular index, so I don’t expect to get any substantial analysis from this process. However, we want to plot it to see if it is stationary, look at some methods to transform the data if it is not, and see if we can do some basic predictions.
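A series like this can be simulated with numpy and pandas. This is a sketch under my own assumptions (daily frequency, a small upward drift), not the exact series used here:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulate a random walk with a small upward drift (my assumption,
# chosen so the mean visibly increases over time)
np.random.seed(42)
idx = pd.date_range('2008-01-01', '2017-12-31', freq='D')
steps = np.random.normal(loc=0.05, scale=1.0, size=len(idx))
y = pd.Series(steps.cumsum(), index=idx)

y.plot()
plt.show()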

While the variance is mostly stable, our mean is increasing over time, so our data is not stationary. We can verify this with a Dickey-Fuller test, which examines the series for a unit root. The H0 of the test is that the data can be represented by a unit root and is not stationary; rejecting the null (H1) implies the data does not have a unit root and is stationary.
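A minimal sketch of running the test with statsmodels’ adfuller, assuming y is the series above:

from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test: result[0] is the test statistic,
# result[1] the p-value
result = adfuller(y)
print('Test Statistic %f' % result[0])
print('p-value %f' % result[1])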

# Run a DF test
Results of Dickey-Fuller Test:
Test Statistic 1.113542
p-value 0.995331 # Fail to reject

I have to briefly talk about autocorrelation because it relates to the strength of our model and is related to stationarity. Autocorrelations are values between -1 and 1 indicating how a data series is related to itself over time; specifically, how strongly values a given number of periods (lags) apart are correlated with each other throughout the series. E.g. an autocorrelation at lag 1 measures how values 1 period apart are correlated with one another throughout the series. Similarly, an autocorrelation at lag 2 measures how values two periods apart are correlated throughout the series.

By graphing the autocorrelations at specific lags, we can see how the data relates to itself either positively or negatively, and therefore find out how to create a stationary series.
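A quick way to produce these graphs, using statsmodels’ plotting helpers (the lag count is my choice):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot the ACF and PACF out to 40 lags
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(y, lags=40, ax=axes[0])
plot_pacf(y, lags=40, ax=axes[1])
plt.show()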

We can see slow, significant positive decay in the ACF, and a spike at lag 1 in the PACF. Remember, we need stationary data so we can correctly estimate the parameters of our model.

I’m going to quote directly from Duke’s time series forecasting site:

If the PACF displays a sharp cutoff while the ACF decays more slowly (i.e., has significant spikes at higher lags), we say that the stationarized series displays an “AR signature,” meaning that the autocorrelation pattern can be explained more easily by adding AR terms than by adding MA terms … the AR(1) term in this model will turn out to be equivalent to a first difference.

That sounds exactly like our simulated data.

So we can start modelling our SARIMA with these parameters (1, 0, 0) x (0, 0, 0, 0). It’s not a lot, but it is a start.
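A minimal sketch of fitting that specification with statsmodels:

import statsmodels.api as sm

# Fit a SARIMA(1, 0, 0) x (0, 0, 0, 0) -- effectively an AR(1) -- on y
mdl = sm.tsa.statespace.SARIMAX(y, order=(1, 0, 0),
                                seasonal_order=(0, 0, 0, 0))
results = mdl.fit()
print(results.summary())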

But before we take the difference of our data, let’s look at the QQ and PP plots. These show us whether our series is normally distributed by plotting its distribution against a normally distributed series. If it is normally distributed, we should see our blue dots fall along the red line without significant deviation. What we have instead is quite a bit of deviation, implying our data is both skewed and heavy-tailed.
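These plots can be produced with statsmodels’ ProbPlot (a sketch; fit=True standardizes the data against a fitted normal distribution):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Compare the series against a normal distribution
pp = sm.ProbPlot(y.values, fit=True)
pp.qqplot(line='45')   # quantile-quantile plot
pp.ppplot(line='45')   # probability-probability plot
plt.show()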

There is a whole list of rules for identifying different techniques to apply to your time series to make it stationary. Right now, we’re just going to transform our data by taking the first difference, which returns values transformed by y'_t = y_t - y_{t-1}

You can do this quickly by using the diff() method and plot the results:

y_diff = y.diff().dropna()  # differencing leaves a NaN in the first position

The results below are much, much better than our previous series. We’ll confirm the series is stationary with a DF test below. However, if we look at the values past 2016 we can see some heteroscedasticity: the variability of our dependent variable becomes unequal across some range. It is not enough variability, though, to keep the DF test from rejecting the null. We also get an AIC of -246.71, which we’ll come back to later when we’re specifying models.

SARIMAX(1, 0, 0) x (0, 0, 0, 0)
# New test on our differenced data
Results of Dickey-Fuller Test:
Test Statistic -5.487494
p-value 0.000002 # Reject H0

Now look at the QQ and PP plots to see if we’re closer to a normal distribution. It’s not perfectly normal, but it is much closer.

However, while this model’s errors look much better than the previous one there is still room for improvement. But how can we do this without having to look at the ACF and PACF graphs? How can we specify the strength of our model?

How To Find The Best Model

How do we choose the correct SARIMA model? We could do some guessing, or use the auto.arima function in R, but that’s no fun. Similar to GridSearchCV in scikit-learn, we can search over the set of possible parameter values to see which is best for our data, given a specific comparable output. Essentially we’re performing a hyperparameter optimization for the model using grid search.

Similar to other regression problems, we need a baseline metric to compare models against. Here we’re choosing among AIC, BIC, and RMSE.

AIC adds a penalty for the complexity of the model, so as complexity increases, so does the penalty. Its aim is to balance complexity against accuracy. It measures the goodness of fit of the model while assigning an informational cost to the number of parameters to estimate.
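Formally, with k the number of estimated parameters and \hat{L} the maximized likelihood (lower values are better):

\mathrm{AIC} = 2k - 2\ln(\hat{L})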

E.g. consider two models: ARIMA_1 (p1, i1, q1), which could (and does) offer a better fit than ARIMA_2 (p2, i2, q2), where p1 > p2 and q1 > q2. However, the information gained from the increased fit is traded off against the increase in complexity imposed by the additional AR and MA terms of ARIMA_1.

AIC tries to select the model that most adequately describes an unknown future, given the stochastic nature of the data. In a sense, it tries to take out “what is” and determine “what could be.”

Similarly to AIC, BIC — which could have been used here — tries to find the “what is” by best selecting a model to evaluate the current data rather than future data.

BIC does take the model’s complexity into account; in fact its penalty, k ln(n), grows with the sample size, so it tends to favour simpler models than AIC does.

We won’t use RMSE, as extending the lag length in our model will always yield a better RMSE. By adding complexity to the model, RMSE improves, which, as stated above, might make for an incorrect, overfit model.

So let’s run through different combinations for our SARIMA model using a modified grid search.

import itertools
import statsmodels.api as sm

# Build example grids of (p, d, q) and (P, D, Q, m) combinations to search
# (the exact ranges searched are a choice; m = 12 matches the result below)
p = d = q = range(0, 2)
param_grid = list(itertools.product(p, d, q))
seasonal_param_grid = [(P, D, Q, 12) for P, D, Q in itertools.product(p, d, q)]

# Set variables to populate
lowest_aic = None
lowest_parm = None
lowest_param_seasonal = None

# Grid search the hyperparameters of p, d, q and P, D, Q, m
for param in param_grid:
    for param_seasonal in seasonal_param_grid:
        try:
            mdl = sm.tsa.statespace.SARIMAX(y, order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=True,
                                            enforce_invertibility=True)
            results = mdl.fit()

            # Store results
            current_aic = results.aic
            # Set baseline for aic
            if lowest_aic is None:
                lowest_aic = current_aic
            # Compare results
            if current_aic <= lowest_aic:
                lowest_aic = current_aic
                lowest_parm = param
                lowest_param_seasonal = param_seasonal
            print('SARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:
            continue

print('The best model is: SARIMA{}x{} - AIC:{}'.format(lowest_parm, lowest_param_seasonal, lowest_aic))
# Show the results
The best model is: SARIMA(1, 0, 0)x(0, 0, 0, 12) - AIC:-260.40166312

By using the inputs of (1, 0, 0) x (0, 0, 0, 12) for p, d, q, P, D, Q, and m, we now have the inputs which best specify our model. We can also use plot_diagnostics on our fitted SARIMA results object to quickly see some of the results we need to evaluate.

We need to make sure our residuals are uncorrelated and normally distributed with mean = 0. If we don’t get these results, we need to consider improving our model because we won’t get the correct coefficient estimates. We will have a higher risk of Type 1 (false positive) errors for our coefficients.
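A sketch of producing those diagnostics, refitting with the parameters found above:

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Refit with the grid-search winner and plot residual diagnostics
mdl = sm.tsa.statespace.SARIMAX(y, order=(1, 0, 0),
                                seasonal_order=(0, 0, 0, 12))
results = mdl.fit()
results.plot_diagnostics(figsize=(12, 8))
plt.show()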

SARIMA(1, 0, 0) x (0, 0, 0, 12)

However, we can be confident our residuals are uncorrelated and normally distributed because of the graphs below:

  • Our QQ plot fits linearly along the sampled normal distribution.
  • The Kernel Density Estimation (KDE) plot of the residuals, which estimates their probability density, is consistent with a normal distribution.
  • Our residuals still show the heteroscedasticity after 2016, but are otherwise fitting well and are roughly a white noise process.
  • And finally, the autocorrelations shown in the correlogram are low at lags > 0.

It seems that our SARIMA model is working fine. Not perfect, but fine.

Results from plot_diagnostics

The real test comes when we try to predict future values, and this is where our model falls down. Hard.
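For reference, out-of-sample predictions like the ones plotted below can be produced with get_forecast (a sketch, assuming results is the fitted model from the diagnostics step; the horizon is my choice):

# Forecast the next 30 periods from the fitted model, with a
# 95% confidence interval (the default for conf_int)
forecast = results.get_forecast(steps=30)
mean_forecast = forecast.predicted_mean
conf_int = forecast.conf_int()
print(mean_forecast.head())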

We were way off …

Additionally, it seems the heteroscedasticity around 2016 has thrown our predictions way off, which is not beyond anything we could have expected. It is completely random data, after all.

Next Steps

Seeing as this is dummy data, I would not go much further into analyzing it or trying to fit additional models. However, if I wanted to try additional methods on this data, I might look at change point analysis to see exactly where a shock was introduced to the system, or try to model the series with a GARCH. I would suspect a GARCH model would be our best choice, as it can model the volatility clusters — our heteroscedasticity — clearly exhibited in our data past 2016.

So now that we have our server set up, our pipeline running, our model selected, and an understanding of what to look for, we can actually model our data.

The modelling of the Modo data pulled from the pipeline, a breakdown of the SARIMA model we’re going to use, and the results of the analysis will be posted in Part 2.

As always, I hope this helped fill in some gaps in your knowledge.

Cheers,

Additional Reading

https://stats.stackexchange.com/questions/84076/negative-values-for-aic-in-general-mixed-model

https://stats.stackexchange.com/questions/40905/arima-model-interpretation/63072#63072