Predict the Future: Python vs BigQuery in Time Series Forecasting!

Do you want to create your own Machine Learning algorithms, but are not sure where to start? How hard is it to set up a basic algorithm? Read on to learn how to set up a basic time series model in Python or with SQL in BigQuery.

Time series

Time series? Never heard of it, is it new on Netflix? Nice try, but no. A time series is, well, a series of data points spread out over time. For example, the historic sales per day of an ice cream vendor. With time series you can make awesome predictions for the future: a model searches for patterns and tries to predict what happens next. These patterns can depend on a bunch of things. For an ice cream vendor, a big factor would be the weather. Warm weather = more ice creams sold (at least, in general).

In this blog we will not look at the sales of an ice cream vendor, but we will try to predict the temperature in New Delhi.

Data analysis

So, the temperature in New Delhi, cool. I mean warm… right? Let’s just have a look. In the graph below, we can see the monthly average temperature in New Delhi between January 2013 and December 2016. This data was taken from Kaggle; you can download and play with it yourself over here. (Note: the data is actually daily; for simplicity I’ve taken the average temperature per month, to have fewer data points and less noise.) I also have another data file with the average temperature of January 2017 up to and including April 2017. This data will be used to see if the predictions the model makes are any good, i.e. close to what actually happened.
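For reference, that daily-to-monthly step is a few lines of pandas. A minimal sketch (the date and meantemp column names match the Kaggle file; the three months of placeholder values below are made up for illustration):

```python
import pandas as pd

# Made-up daily data in the same shape as the Kaggle file
daily = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=90, freq="D"),
    "meantemp": [15.0] * 31 + [18.0] * 28 + [24.0] * 31,
})

# Collapse the daily readings to one average temperature per month
daily["yearmonth"] = daily["date"].dt.to_period("M").astype(str)
monthly = (daily.groupby("yearmonth", as_index=False)["meantemp"]
                .mean()
                .rename(columns={"meantemp": "temperature"}))
print(monthly)
```

The resulting table has one row per month, which is exactly the shape the training file used below.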

Image created with Python

So what can we see from this picture? Well, firstly, since the image is created in Python, and Python likes to start counting from 0, the first month (January 2013) is visualized as ‘Month 0’. Furthermore, we see seasonality occurring. Seasonality is behavior that repeats at regular intervals. In this case, as you could expect, the temperature shows peaks around months 6, 18 and 30, and valleys in the winter. We can see the same pattern happening every 12 months, which ‘coincidentally’ follows the pattern of the seasons. We also see that during the winters, the temperature drops to around 15 degrees, which frankly is warmer than the average temperature for May in the Netherlands. Cosy 🙂

Next up we’ll start diving into the fun stuff: Code and Machine Learning!

Python

Now, let’s start making the weathercaster obsolete, we can predict it ourselves after this! Before we start, let me note one disclaimer: during these technical blocks I’ll generalize and simplify a lot. Actual weathercasters (should) use way more data than just the historic behavior. For example humidity, wind direction, cloud cover, etc.

Firstly, let’s load the data into Python. We’re going to load two data files: the training data and the test data. The training data is described in the previous section. The test data, which contains the next four months, will be used to check the accuracy of our model. This part is simply done by reading two CSV files with the pandas package.

import pandas as pd

# Training data: monthly average temperatures, Jan 2013 - Dec 2016
data = pd.read_csv("DailyDelhiClimateTrain.csv").sort_values('yearmonth')

# Test data: Jan 2017 - Apr 2017, used to validate the forecast
test = pd.read_csv("DailyDelhiClimateTest.csv").sort_values('yearmonth')

Once this is done, it’s time to grab a Python package that supports time series forecasting. For this demo, we’ll use the open source package pmdarima. Even though this package and its functions offer a lot of flexibility in terms of parameters, I’ll stick to the most basic ones. All we specify to the model is that there is a strong seasonality (seasonal=True), and that it repeats every 12 data points (m=12, i.e. every 12 months). With this, training the model and predicting is literally done in two lines of Python code (excluding importing the package).

import pmdarima as pm

# Train the model on the monthly temperatures
model = pm.auto_arima(data['temperature'], seasonal=True, m=12)

# Predict the next 12 months
forecast = model.predict(n_periods=12)

Now, that was easy, wasn’t it? But how good are the predictions? Again, for simplicity, I’m only looking at the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE). These are two of many possible metrics to describe how well a model is performing. I will not bore you with the exact details, you can check them out on Wikipedia. All you need to know for now is that, in general, you want these numbers to be low, as this indicates the model has a lower overall prediction error.

So, let’s plot the predictions, add the test data (the actual weather between January 2017 and April 2017, which we did not include in the training dataset) to the plot as well, and print the RMSE and MAE while we’re at it.

So, the green line is what actually happened, and the red line is what the forecast returned. The green line is almost exactly on the red line, which is a good thing: it means the predictions are very close to what actually happened. The RMSE and MAE confirm this. The MAE of 0.55 can be read as: on average, the model is 0.55 degrees Celsius off from what actually happened. To me, that is close enough and seems like an acceptable prediction.

Well, I have to admit, Python did a great job at this. In the next section we’ll see how BigQuery does.

Google BigQuery

In Google BigQuery, we can also create a time series forecast, but without Python knowledge! Sure, you’d need some SQL knowledge instead, but generally SQL is considered easier to learn and requires less technical knowledge. So, let’s repeat the above, starting with uploading the same dataset to BigQuery. This process can be done in a nice UI using no code, so I’ll skip the screenshots of this process.

Once the two tables are made, it’s time to train and predict, just like we did above. In BigQuery, this is also a very short statement.
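A sketch of what that statement looks like (the dataset, table and column names — BQ_ML, Train_data, yearmonth, temperature — follow the naming used in this blog; adjust them to your own setup):

```sql
-- Train an ARIMA_PLUS time series model on the monthly training data
CREATE OR REPLACE MODEL `BQ_ML.forecast_arima_plus`
OPTIONS(
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'yearmonth',
  time_series_data_col = 'temperature'
) AS
SELECT
  yearmonth,
  temperature
FROM
  `BQ_ML.Train_data`;
```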

So what do we see here? We create a model, which we call forecast_arima_plus, in the BQ_ML dataset. We tell BigQuery which model type we want (ARIMA_PLUS), which column contains the timestamp and which column contains the value we want to predict. Then, we end the statement by selecting these columns from a table called Train_data. You could add SQL logic in the select statement, or exogenous variables if you’re feeling fancy, but again, for simplicity, we’ll leave those out for now.

Once the model is trained, let’s predict 12 months as well. Again, this is a one-liner.
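That one-liner uses BigQuery’s ML.FORECAST function; roughly like this, with the model name from the training step above:

```sql
-- Forecast the next 12 months with the trained model
SELECT *
FROM ML.FORECAST(MODEL `BQ_ML.forecast_arima_plus`,
                 STRUCT(12 AS horizon));
```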

So, how did this model perform? You could generate a bunch of metrics via SQL with the ML.EVALUATE function, but we’ll just add the prediction, the RMSE and the MAE to the Python plot from above. This results in the graph below.

The BigQuery prediction is quite good as well. However, looking at the smallest of details, we see that the MAE is a little higher than the Python MAE: BigQuery is on average 0.98 degrees off the actual values. Now, this does not say much yet, since both the training and test sets are very small, but it is enough for now. Time to wrap this blog up.

(Careful) Conclusion

This conclusion is not going to call a winner (this blog never intended to do so). It would also be impossible based on the above, as it is just one test case, with frankly very little data. But there are some (comparative) takeaways that we can get from it.

  • You can definitely create decent forecasts both in Python and in SQL with BigQuery ML. This is nice, since it allows a wide range of people with different skill sets to set something up.
  • BigQuery ML picks up some tasks automatically, under the hood, without you having to specify them. For example, if you scroll up, you’d notice we did not tell BigQuery the seasonality of our data (yearly, repeating every 12 months), which in Python we did with the parameters seasonal=True, m=12. It figured that out itself, which is very nice, especially for starters who might not think of these things.
  • Most Python packages that you can use for time series forecasting have more parameters you can tweak, so with enough experience and knowledge, you could probably create more accurate models in Python.
  • Assuming your data is already saved in BigQuery, BigQuery ML offers a huge benefit, in the sense that no data movement is needed, no compute engine has to be set up to run your Python script, etc. It’s generally way less of a hassle in terms of architecture to set up a running time series forecast model in BigQuery ML. In most cases it could also be a cheaper option.

Want to learn more about BigQuery or BigQuery ML? Via Incentro I offer different (online) classes around BigQuery. Check out this link.
