# SARIMA Modelling for Car Sharing — Basic Data Pipelines — Applications with Python Pt. 1

Data scientists don't tell you all the steps involved in a project when you first start out, especially when you're learning from previous examples, Cross Validated posts, and clean(ish) datasets from Kaggle or UCI.

Here I want to run through *all* the steps involved in finding, getting, hosting, cleaning, and analyzing a dataset. The git repo can be found here, and will be updated through the life of the project.

**Finding The Data**

I wanted to find a live public API that I could connect to, save a series of data from over time, and apply a statistical model to. Because I would be collecting time and geographical data, I thought of a local car-sharing service in Vancouver, BC called Modo. And luckily, they also had a public API I could connect to.

Modo has a bunch of different data streams to choose from:

- A list, description, and features of each car available.
- Where the cars are based: lat-long locations and a location description.
- Which cars are available right now, or at any time in the future.
- How much a booking would cost.

An interesting question to look at: how does the supply and demand of available cars change across different areas of Vancouver over a 10-day period?

**Examining The Data**

Let’s take a look at the actual data at this endpoint. I’m using a Chrome extension called JSONview so I can quickly look at the data and find the correct nesting level for the fields I want to extract.

```
# Here is some example output from the car location endpoint
{
  Status: "Success",
  Checksum: "00aa83873957f7eee0eed65f82492587",
  Response: {
    Locations: {
      3: {
        ID: "3",
        Name: "W 2nd & Cypress",
        ShortDescription: "Vancouver - West 2nd Avenue & Cypress Street",
        Region: "Lower Mainland",
        City: "Vancouver",
        Neighbourhood: "Kitsilano",
        Latitude: "49.269861",
        Longitude: "-123.148762",
        ExceptionClass: null
      },
      4: {
        ID: "4",
        Name: "Main Street SkyTrain",
        ShortDescription: "Vancouver - Terminal Avenue & Quebec Street",
        Region: "Lower Mainland",
        City: "Vancouver",
        Neighbourhood: "False Creek",
        Latitude: "49.272521",
        Longitude: "-123.100764",
        ExceptionClass: null
      },
      ...
```

Looks like there are some parts of the data I want and others I can discard. Let’s just get the following:

- The current date and time;
- The car id;
- The region;
- The city;
- The neighbourhood, and;
- The lat-lng values.

Finding the nested locations of the data in the JSON is quite easy if you’re using the extension above. These locations are included in our pipeline file below, so there is no need to do the searching yourself.
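As a minimal sketch (not taken from the repo), pulling a field out of the nested response means walking down the dictionary level by level, exactly as JSONview displays it:

```python
# A dict shaped like the API payload shown above (hardcoded here;
# in the pipeline it comes from the parsed API response)
data = {
    "Response": {
        "Locations": {
            "3": {
                "ID": "3",
                "Neighbourhood": "Kitsilano",
                "Latitude": "49.269861",
                "Longitude": "-123.148762",
            }
        }
    }
}

# Walk down the nesting: Response -> Locations -> record id -> field
record = data["Response"]["Locations"]["3"]
neighbourhood = record["Neighbourhood"]
lat = float(record["Latitude"])  # lat-lng arrive as strings, so cast them

print(neighbourhood, lat)  # Kitsilano 49.269861
```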

**A Basic Pipeline**

There is a file in the repo called `pythonScript.py`, which is our basic data pipeline responsible for connecting to the Modo API, cleaning the data, and saving our results to a `.csv` for use later in the project.

Here are the main points of the pipeline script:

- Create your logging file;
- Create an empty `dataframe` to fill;
- Determine how long you want the script to run for;
- Attempt a data extraction loop, and if successful save the data;
- Sleep until the next iteration, and;
- Once all iterations are complete, close the connection.

```python
# Here is the part of pythonScript.py that deals with looping
# through the JSON object returned
...

# Create an array of the current datetime
currentDateTime = datetime.datetime.now().isoformat()
logging.info('Current iteration is happening at %s' % currentDateTime)
dateArray = np.array([currentDateTime for i in range(lengthOfCars)])

for record in keyToIterate:
    id_val = data2['Response']['Locations'][record]['ID']
    name_val = data2['Response']['Locations'][record]['Name']
    city_val = data2['Response']['Locations'][record]['City']
    neighbourhood_val = data2['Response']['Locations'][record]['Neighbourhood']
    region_val = data2['Response']['Locations'][record]['Region']
    lat_val = data2['Response']['Locations'][record]['Latitude']
    lng_val = data2['Response']['Locations'][record]['Longitude']
...
```

In my opinion, the most important part of the script is the `logging` file. Because you’re running this outside of a controlled environment like an IDE or Jupyter, you lose much of the functionality that comes with those tools, so you need to be able to diagnose the errors which could (and will) pop up while running your script.

```python
# Create logging file
logFileName = '%s_modo_logging_file.log' % currentDate

# Set logging file attributes
logging.basicConfig(filename=logFileName,
                    level=logging.INFO,
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    format='%(asctime)s %(levelname)s %(message)s')

# Write to the logging file
logging.info('Starting script')
```

Setting up a logging file makes debugging much easier if something fails within the pipeline. Save yourself some headache and practice good programming principles.

**Setting Up A Server**

To capture all this information, we’re going to set up a simple server in GCloud.

From the params in `pythonScript.py`, the project is supposed to save the data collected from the API every 15 minutes over 10 days. This is going to give us data for 960 time periods (4 * 24 * 10), which should be enough to run our time series model. We could do all of this from our local machines; however, using GCloud gives us an additional understanding of how to host projects outside our own machines and the steps involved in doing so.
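The iteration count is just the schedule worked out in code:

```python
# One pull every 15 minutes: 4 per hour, 24 hours a day, for 10 days
pulls_per_hour = 60 // 15
iterations = pulls_per_hour * 24 * 10
print(iterations)  # 960
```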

If you don’t have an account already, it’s free to sign up and they give you $500 in cloud credits. Even if you’ve burned through your free credits already, this project cost me around $4.60 in server costs. Not much at all.

To set up a simple VM, log into GCloud and go to `Compute Engine >> VM Instance >> Create Instance`. You don’t need much attached to it besides a small amount of storage and some processing power, so if you want to reduce the number of CPUs attached to it, there should be no problem in doing so. Now you can SSH into your VM and set up the project: choose “Open in Browser Window” and you’ll see a new window pop up with your server running in it.

Now we need to get our `pythonScript.py` into the server, and the best way to do this is to create a GitHub repo first and then clone everything into the VM. To do this, you will need to register your VM with your GitHub account and set up SSH keys between the two. The instructions to do that can be found here.

Once you get your script into your VM, you need to make sure all the correct packages are installed for your Python script to run correctly. I’ve included a `requirements.txt` file in the repo, which you can install by running this command in the directory:

```
pip install -r requirements.txt
```

There are only three requirements for this project:

```
requests==2.9.1
numpy==1.14.2
pandas==0.22.0
```

Now check that `pythonScript.py` can be executed from the terminal by inspecting the file permissions with `ls -la`. If you see `x` among the permission flags on the left side of the listing, the file can be executed. If it is not there, grant the executable permission with `sudo chmod +x pythonScript.py`.

Finally, we need to ensure our script keeps running even if we log out of the VM. We do this with this command:

```
nohup python ~/path-to-script/pythonScript.py &
```

The `nohup` command keeps the specified program running even after the terminal is closed. E.g. if you are logged in to the server remotely and the connection drops, or if you close your terminal emulator, the script will still run.

You can check that the script is running with `ps -A | grep python`, and you can check the output of your logging file to make sure everything is working correctly. Just write the contents of the log file to the terminal with `cat name-of-log-file.log`. You should see that the script has started to run and is sleeping on its current cycle. Great! We’re getting the data we need.

**The Time Series Model**

Let’s look at the model we’re going to use on our data: a Seasonal ARIMA (AutoRegressive Integrated Moving Average) model from the `statsmodels` package. I’ve chosen this model because I’m assuming car sharing has some sort of seasonal component; in this case, our season is a day. I think it can reasonably be assumed that throughout a day there is some trend in which high values always tend to occur in particular hours and low values always tend to occur in others.

So, what exactly are SARIMA models?

SARIMA models are a general class of time series model used to analyze and forecast data with an additional seasonal component. We derive values for `p`, `d`, and `q` in order to make the time series `stationary`; a `stationary` series has a constant mean and variance.

Consider a data series that increases over time: the mean does not stay constant, so estimates made today will underestimate future periods as the series keeps increasing. Any estimates we make based on this mean will therefore be incorrect, and so will their correlations with other variables. We’ll come back to this later.

Below is a general explanation of a SARIMA model.

The `AR` portion `(p)` implies the output variable at time `t` is a weighted sum of *past values* plus some stochastic term, allowing us to incorporate the effect of yesterday (or earlier) on today. This value is the number of lags we will use. E.g. the output today depends linearly on its own previous values.

The `I` portion `(d)` is the number of non-seasonal differences applied to the series to make it stationary. The model takes past values and subtracts them from the current value. E.g. the series is integrated of order 2 if taking repeated differences twice creates a stationary process; that model would be `ARIMA(0, 2, 0)`.

The `MA` portion `(q)` is a moving average of previous *error terms*. MA processes are used to model shocks to a system which gradually dissipate over time. E.g. the output in period `t` responds only to random errors that occurred in previous periods.

The `Seasonal` portion `(P, D, Q)m` has the same structure as the non-seasonal parts: it may, but does not have to, include an AR factor, an MA factor, and/or an I factor. In this part of the model, all of these factors operate across the number of periods `(m)` in your season. The seasonality is a regular pattern of changes that repeats over `m` time periods. E.g. with a seasonal factor of 3 over a time series of monthly data, the pattern repeats every quarter.

Put these all together and you get a `SARIMA (p, d, q) x (P, D, Q)m`, which is what we’re looking to build.

The exact model we’ll be using from the Python package `statsmodels` is `SARIMAX`, where the `X` stands for the optional exogenous explanatory variable. We’re not going to get into adding exogenous variables to our model, but it’s best to explain the term so there is no confusion. The default is `None` when running the model, and we don’t have any extra variables to include at the moment.

An exogenous variable’s value is independent of the states of the other variables included; its value is determined by factors outside our model. E.g. rainfall is exogenous to the demand for car-sharing networks: the more it rains, the fewer people want to walk around, and they will probably use other means of transportation.

**Simple SARIMA Model**

Let’s look at a randomly generated random walk of time series data, `y`, to get us primed for looking over the Modo data we’re extracting in our pipeline. `y` models the percentage return of a stock over the past decade. This is just a randomly generated series, not drawn from a particular index, so I don’t expect to get any substantial analysis from this process. However, we want to plot it to see if it is `stationary`, look at methods to transform the data if it is not, and see if we can do some basic predictions.

While the variance is mostly stable, our mean is increasing over time, so our data is not `stationary`. We can verify this with a Dickey-Fuller test, which examines the series for a unit root. The `H0` of the test is that the data can be represented by a unit root and is not stationary; `H1` (rejecting the null hypothesis) implies the data does not have a unit root and is stationary.

```
# Run a DF test
Results of Dickey-Fuller Test:
Test Statistic    1.113542
p-value           0.995331  # Fail to reject H0
```

I have to briefly talk about `autocorrelation` because it relates to the strength of our model and to stationarity. Autocorrelations are values between -1 and 1 indicating how a data series is related to itself over time; specifically, how strongly values a given number of periods (lags) apart are correlated throughout the series. E.g. the autocorrelation at `lag 1` measures how values one period apart are correlated with one another throughout the series, while the autocorrelation at `lag 2` measures how values two periods apart are correlated throughout the series.

By graphing the autocorrelations at specific lags, we can see how the data relates to itself, positively or negatively, and from there work out how to create a stationary series.

We can see significant positive decay in the `ACF`, and a spike at `lag 1` in the `PACF`. Remember, we need stationary data so we can correctly estimate the parameters of our model.

I’m going to quote directly from Duke’s time series forecasting site:

> If the PACF displays a sharp cutoff while the ACF decays more slowly (i.e., has significant spikes at higher lags), we say that the stationarized series displays an “AR signature,” meaning that the autocorrelation pattern can be explained more easily by adding AR terms than by adding MA terms … the AR(1) term in this model will turn out to be equivalent to a first difference.

That sounds exactly like our simulated data.

So we can start modelling our SARIMA with the parameters `(1, 0, 0) x (0, 0, 0, 0)`. It’s not a lot, but it is a start.

But before we take the difference of our data, let’s look at the QQ and PP plots. These will show us whether our series is normally distributed by plotting its distribution against a normally distributed series. If it is normally distributed, we should see our blue dots fall along the red line without significant deviation. However, what we have is quite a bit of deviation, implying our series is both skewed and heavy-tailed.

There is a whole list of rules for identifying different transformations to apply to your time series to make it stationary. Right now, we’re just going to take the first difference, which transforms the series as `y'_t = y_t - y_{t-1}`. You can do this quickly with the `diff()` method and plot the results:

```python
y_diff = y.diff()
```

The results below are much, *much* better than our previous series, and we’ll confirm the series is `stationary` with a DF test below. However, if we look at the values past 2016 we can see some `heteroscedasticity`: the variability of our dependent variable becomes unequal across part of the range. It is not enough variability to stop the DF null from being rejected, though. We also get an `AIC` of -246.71, which we’ll come back to later when we’re specifying models.

```
# New test on our differenced data
Results of Dickey-Fuller Test:
Test Statistic   -5.487494
p-value           0.000002  # Reject H0
```

Now look at the QQ and PP plots to see if we’re closer to a normal distribution. It’s not a perfect fit, but it is much closer.

However, while this model’s errors look much better than the previous one’s, there is still room for improvement. But how can we do this without having to eyeball the `ACF` and `PACF` graphs? How can we quantify the strength of our model?

**How To Find The Best Model**

How do we choose the correct `SARIMA` model? We could do some guessing, or use the `auto.arima` function in `R`, but that’s no fun. Similar to `GridSearchCV` in `sklearn`, we can search over the set of possible parameter values to see which is best for our data given a specific comparable output. Essentially, we’re performing hyperparameter optimization for the model using a grid search.

As with other regression problems, we need a baseline to compare each model against. Here we’re choosing between `AIC`, `BIC`, and `RMSE`.

`AIC` adds a penalty for the complexity of the model, so as complexity increases, so does the penalty. Its aim is to balance model complexity against accuracy. It measures the goodness of fit of the model while assigning an informational cost to the number of parameters to estimate.

E.g. consider two models, `ARIMA_1 (p1, i1, q1)` and `ARIMA_2 (p2, i2, q2)`, where `p1 > p2` and `q1 > q2`, and where `ARIMA_1` could (and does) offer a better fit. The information gained from the increased fit is traded off against the increase in complexity imposed by the additional `AR` and `MA` terms of `ARIMA_1`.

`AIC` tries to select the model that most adequately describes an unknown future, given the stochastic nature of the data. In a sense, it looks past “what is” and tries to determine “what could be.”

Similarly to `AIC`, `BIC` (which could also have been used here) tries to find the “what is” by selecting the model that best describes the current data rather than future data. `BIC` actually penalizes model complexity more heavily than `AIC`, with a penalty that grows with the sample size, so it tends to favour simpler models.

We won’t use `RMSE`, as extending the lag length in our model will always yield a better `RMSE` value. By adding complexity to the model, `RMSE` improves, which, as stated above, can make for an incorrect model.

So let’s run through different combinations for our `SARIMA` model using a modified grid search.
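The loop below assumes `param_grid` and `seasonal_param_grid` already exist; one way to build them (these ranges are illustrative, not necessarily the repo’s exact settings) is with `itertools.product`:

```python
import itertools

# Candidate values for each hyperparameter; widen these ranges to
# search more models at the cost of a longer run
p = d = q = range(0, 2)
m = 12  # number of periods in one season

# Every (p, d, q) combination, and every (P, D, Q, m) combination
param_grid = list(itertools.product(p, d, q))
seasonal_param_grid = [(P, D, Q, m) for P, D, Q in itertools.product(p, d, q)]

print(len(param_grid), len(seasonal_param_grid))  # 8 8
```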

```python
# Set variables to populate
lowest_aic = None
lowest_parm = None
lowest_param_seasonal = None

# Grid search over the hyperparameters p, d, q and P, D, Q, m
for param in param_grid:
    for param_seasonal in seasonal_param_grid:
        try:
            mdl = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=True,
                                            enforce_invertibility=True)
            results = mdl.fit()

            # Store results
            current_aic = results.aic

            # Set baseline for aic
            if lowest_aic is None:
                lowest_aic = results.aic

            # Compare results
            if current_aic <= lowest_aic:
                lowest_aic = current_aic
                lowest_parm = param
                lowest_param_seasonal = param_seasonal

            print('SARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:
            continue

print('The best model is: SARIMA{}x{} - AIC:{}'.format(lowest_parm, lowest_param_seasonal, lowest_aic))
```

```
# Show the results
The best model is: SARIMA(1, 0, 0)x(0, 0, 0, 12) - AIC:-260.40166312
```

By using the inputs `(1, 0, 0) x (0, 0, 0, 12)` for `p`, `d`, `q`, `P`, `D`, `Q`, and `m`, we now have the inputs which best specify our model. We can also call `plot_diagnostics` on our fitted SARIMA object to quickly see some of the results we need to evaluate.

We need to make sure our residuals are uncorrelated and normally distributed with mean = 0. If we don’t get these results, we need to consider improving our model, because we won’t get correct coefficient estimates and we will run a higher risk of Type 1 (false positive) errors on our coefficients.

However, we can be confident our residuals are uncorrelated and normally distributed because of the graph below:

- Our QQ plot fits linearly along the sampled normal distribution.
- The Kernel Density Estimation (KDE) plot of the residuals, which estimates their PDF, is consistent with a normal distribution.
- Our residuals still show the heteroscedasticity after 2016, but otherwise fit well and are roughly a white noise process.
- And finally, the autocorrelations shown in the correlogram are low at `lags > 0`.

It seems that our ARIMA model is working fine. Not perfect, but fine.

The real test comes when we try to predict future values, and this is where our model falls down. Hard. It seems the heteroscedasticity around 2016 has thrown our predictions way off, which is not beyond anything we could have expected; it is completely random data, after all.

**Next Steps**

Seeing as this is dummy data, I would not go much further into analyzing it or trying to fit additional models. However, if I wanted to try additional methods on this data, I might look at change-point analysis to see exactly where a shock was introduced to the system, or try to model the series with a GARCH. I suspect a GARCH model would be our best choice, as it can model volatility clusters (our heteroscedasticity), which are clearly exhibited in our data past 2016.

So now that we have our server set up, our pipeline running, our model selected, and an understanding of what to look for, we have to actually model our data.

The modelling of the Modo data pulled from the pipeline, a breakdown of the SARIMA we’re going to use, and the results of the analysis will be posted in Part 2.

As always, I hope this helped fill in some gaps in your knowledge.

Cheers,

**Additional Reading**

**Is there any reason to prefer the AIC or BIC over the other?**


https://stats.stackexchange.com/questions/84076/negative-values-for-aic-in-general-mixed-model

https://stats.stackexchange.com/questions/40905/arima-model-interpretation/63072#63072