
Time Series Data Analysis using Datalab, Pandas & Prophet

Vinu Kumar
Mar 18, 2019

A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.

- Wikipedia

There are a number of tools available to analyse time-series data, plot it and generate insights. This post outlines my experience with one such tool, Pandas. Pandas is a software library for the Python programming language that offers data structures and operations for analysing time series.

Setup

datalab create <vm-name>

A new VM is created and launched, and port forwarding is set up so the notebook can be opened in a local browser.

Figure 1: Launching Datalab

Reading, parsing and merging CSV files using Pandas

Figure 2: Read CSV from Pandas

When running Python in Jupyter, IPython is used under the hood. IPython is a rich toolkit for running Python interactively, and it provides ‘magic commands’ that work much like command-line tools run from a shell. Datalab adds its own magic commands for easy access to Google Cloud resources such as BigQuery, Google Cloud Storage and Bigtable.
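As a rough sketch (the bucket and object names are placeholders, and the exact magic flags may differ between Datalab versions), reading a CSV object from Google Cloud Storage into Pandas can look like this. It is also one way the data variable used further below can be populated.

%%gcs read --object gs://my-bucket/readings.csv --variable data

Then, in the next cell:

from io import BytesIO
import pandas as pd

# 'data' now holds the raw bytes of the object
df = pd.read_csv(BytesIO(data))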

Pandas DataFrame

Figure 3: Parse CSV into Pandas Dataframe

If you have a compressed CSV file, Pandas can read that into a DataFrame as well:

from io import BytesIO
import pandas as pd

# 'data' holds the gzip-compressed bytes read from storage
df = pd.read_csv(BytesIO(data), compression='gzip', usecols=['col1', 'col2', 'col3', 'date_time_utc'])

The DataFrame info function prints a concise summary of the frame.
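For one of the frames (df here stands for whichever frame you have just read), the call is simply:

df.info()

and produces output like the following: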

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310370 entries, 0 to 310369
Data columns (total 2 columns):
Date_Time 310370 non-null datetime64[ns]
Energy (J) 310370 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 4.7 MB

Once we have individual frames, the next step is to merge all of them together.

Figure 4: Merge Dataframes

The variable frames is a list of DataFrames, and the Pandas function concat merges them into a single DataFrame. The next line sets the column Date_Time as the index. The head function returns the first 5 records by default.
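Since the merge code itself is only shown as an image (Figure 4), here is a minimal sketch of what it looks like; the names frames and data are taken from the surrounding text:

import pandas as pd

# frames is a list of the individual DataFrames read earlier
data = pd.concat(frames)

# use the timestamp column as the index for time-based operations
data = data.set_index('Date_Time')

# head() returns the first 5 rows by default
data.head()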

If you are running Jupyter and want to access files from Google Cloud Storage, the library gcsfs can be used: https://github.com/dask/gcsfs
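A minimal sketch of reading a CSV through gcsfs (the project, bucket and object names are placeholders):

import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/readings.csv') as f:
    df = pd.read_csv(f)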

The DataFrame object provides an easy way to calculate standard summary statistics.

Figure 5: Mean, median and mode
data.dropna().describe()

The dropna function drops rows with missing values, and describe computes the standard statistics such as count, mean, standard deviation and quartiles.

Plotting

Plotting is an important capability in a Jupyter notebook. There are a number of frameworks, such as matplotlib, Seaborn, mpld3, bokeh, Altair and others. matplotlib is the de facto standard. Seaborn builds on matplotlib and makes its plots richer. Below is a plot using Seaborn, which shows a summary of three columns resampled to weekly averages.

Figure 6: Seaborn Plot Code
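The plotting code itself is only available as an image, so here is a sketch of how a weekly summary plot along these lines can be produced with Seaborn (the column names col1, col2 and col3 mirror the earlier read_csv example and stand in for the three columns being summarised):

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # apply Seaborn styling to matplotlib plots

# resample the three columns to weekly means
weekly = data[['col1', 'col2', 'col3']].resample('W').mean()

fig, ax = plt.subplots(figsize=(12, 6))
sns.lineplot(data=weekly, ax=ax)
ax.set_xlabel('Week')
plt.show()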

The resulting plot looks like this:

Figure 7: Seaborn Plot

Another example uses matplotlib to overlay two plots, which gives an indication of anomalies.

Figure 8: Matplotlib plot code
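Again, the original snippet is an image; below is a sketch of one way to overlay two plots and surface anomalies (the raw-versus-rolling-mean comparison is an assumption about what is being overlaid):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))

# raw readings plus a 7-day rolling mean; points that diverge
# sharply from the rolling mean stand out as potential anomalies
ax.plot(data.index, data['Energy (J)'], alpha=0.5, label='raw')
ax.plot(data.index, data['Energy (J)'].rolling('7D').mean(), label='7-day rolling mean')

ax.set_ylabel('Energy (J)')
ax.legend()
plt.show()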

The resulting plot:

Figure 9: Matplotlib plot

Facebook Prophet for prediction

The code snippet below demonstrates how to resample the Pandas DataFrame so it can be used with Prophet.

Figure 10: DataFrame for Prophet
Figure 11: Resampled DataFrame

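Prophet expects a two-column DataFrame with ds (timestamp) and y (value), so the resampling step looks roughly like this (the daily sum and column choice are assumptions; input_frame matches the name used in the fitting step below):

# resample the energy readings to daily totals
daily = data['Energy (J)'].resample('D').sum()

# Prophet wants the columns named 'ds' and 'y'
input_frame = daily.reset_index()
input_frame.columns = ['ds', 'y']
input_frame.head()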

The next step is to create a Prophet object and fit it to the DataFrame:

from fbprophet import Prophet
m = Prophet()
m.fit(input_frame)

The next step creates a DataFrame with future dates (182 days, roughly six months):

future = m.make_future_dataframe(periods=182)

Run a prediction using the fitted model:

forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

This gives the output below:

Figure 12: Predictions

Now using this, we can simply plot the prediction or plot the seasonality component.
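Plotting the forecast itself is a single call to Prophet’s built-in plot method:

fig1 = m.plot(forecast)

and the seasonality breakdown comes from plot_components: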

fig2 = m.plot_components(forecast)

This gives the following plot:

Figure 13: Seasonality Components

Conclusion

Jupyter is a great interactive tool to explore, transform, visualise and share analyses. It has a very rich ecosystem of modules for exploring data across various sources and optimising machine learning models for deployment.
