Forecasting multiple time-series using Prophet in parallel

A short story about multiprocessing

For a few weeks I have been using Facebook's Prophet library. It's a great tool for forecasting time series, because it is simple to use and the forecasted results are pretty good! However, it doesn't run the process in parallel, so if you want to forecast multiple time series the whole process can take a lot of time. This is how we reduced the forecasting time significantly using the multiprocessing package from Python.

Generating time-series

We are going to generate 500 random time series. The purpose of this post is not to evaluate the effectiveness of Prophet's predictions, but the time required to produce them.

So I wrote a function that generates a random time series over a given time period:

import pandas as pd
import numpy as np

def rnd_timeserie(min_date, max_date):
    time_index = pd.date_range(min_date, max_date)
    dates = pd.DataFrame({'ds': pd.to_datetime(time_index.values)},
                         index=range(len(time_index)))
    y = np.random.random_sample(len(dates)) * 10
    dates['y'] = y
    return dates

One of our random time series looks like this:

A random time series.

Let's generate 500 series:

series = [rnd_timeserie('2018-01-01','2018-12-30') for x in range(0,500)]

We have generated our time series; now it's time to run Prophet.

Forecasting using Prophet

Let's create a simple Prophet model. For this we define a function called run_prophet that takes a time series, fits a model to the data, and then uses that model to predict the next 90 days.

from fbprophet import Prophet

def run_prophet(timeserie):
    model = Prophet(yearly_seasonality=False, daily_seasonality=False)
    model.fit(timeserie)
    forecast = model.make_future_dataframe(periods=90, include_history=False)
    forecast = model.predict(forecast)
    return forecast

For example, we can run this function with the first generated time series:

f = run_prophet(series[0])
f.head()

We can see our forecasted results for that series:

Prophet output

Now let's add a timer and run Prophet for the 500 time series without using any kind of multiprocessing tool. I'm using tqdm so I can check the progress:

import time
from tqdm import tqdm

start_time = time.time()
result = list(map(lambda timeserie: run_prophet(timeserie), tqdm(series)))
print("--- %s seconds ---" % (time.time() - start_time))

The previous code took 12.53 minutes to run, and the processor usage looked like this the whole time:

Processors usage running Prophet

Now, let's add multiprocessing to our code. The idea here is to launch a process for each time-series forecast, so we can run our run_prophet function in parallel while we map over the list.

For this we are going to use a Pool of processes. Quoting the documentation:

The Pool object which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism).

from multiprocessing import Pool, cpu_count

p = Pool(cpu_count())
start_time = time.time()
predictions = list(tqdm(p.imap(run_prophet, series), total=len(series)))
p.close()
p.join()
print("--- %s seconds ---" % (time.time() - start_time))

With the previous code, we launch N processes, where N is the number of CPUs our machine has, and then run the run_prophet function for each time series distributed among the CPUs.

The code took 3.04 minutes to run, and the CPU usage over the whole run looked like this:

CPU usage in multiprocess

So we got a speedup of 4.12x, which is pretty good! My machine only has 8 CPUs; if we wanted to run this faster, we could use a machine with more CPUs.

As we can see, multiprocessing is a great way to forecast multiple time series faster, and in many problems it can help reduce the execution time of our code.

In a real-world problem, we decreased the forecasting of 29,000 time series from 13 hours to 45 minutes using multiprocessing on a large-CPU machine on Google Cloud.

spikelab

Our writings on Data Science, predictive modeling, big data analytics, and more.

Matias Aravena Gamboa

Written by

Data Engineer and machine learning enthusiast. Valdivia, Chile
