A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
A number of tools are available to analyse and plot time-series data and to generate insights from it. This post outlines my experience with one such data analysis tool, Pandas. Pandas is a software library for the Python programming language that offers data structures and operations for analysing time series.
My tool of choice for working with Pandas was the Jupyter notebook (the name refers to Julia, Python and R, the three core languages supported). Jupyter can be installed through the Anaconda distribution or with pip. Google Cloud provides a hosted version of Jupyter called Datalab, which is what I used for my prototype. Simply start a Cloud Shell and run the command:
datalab create <vm-name>
This creates and launches a new VM and sets up port forwarding to it.
Reading, parsing and merging CSV files using Pandas
The source files are in CSV format, one per month. The files were uploaded to Google Cloud Storage (GCS) for easy analysis. Use the library below to access GCS. The commands below retrieve the CSV file ‘data-export-site-2018–09-Sep18–5m.csv’ from the bucket ‘demo-bucket-horizonx’ and return the path.
When Python runs in Jupyter, it runs on IPython, a rich toolkit for running Python interactively. IPython provides ‘magic commands’, which work much like command-line tools and can be run from within the shell. Datalab adds magic commands for easy access to Google Cloud resources such as BigQuery, Google Cloud Storage and Bigtable.
To access GCS, Datalab provides a magic command called “gcs”. It reads the CSV from the GCS URI into a variable named data, which is then converted to a Pandas DataFrame using the function read_csv. In the snippet below, df_sept is a Pandas DataFrame. The procedure is repeated for each available month.
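A minimal sketch of the conversion step, assuming `data` holds the raw CSV bytes read from GCS. The sample bytes are made up for illustration, and the column names are borrowed from the info output shown later in this post.

```python
from io import BytesIO

import pandas as pd

# Stand-in for the bytes read from GCS (made-up sample values).
data = (
    b"Date_Time,Energy (J)\n"
    b"2018-09-01 00:00:00,120\n"
    b"2018-09-01 00:05:00,135\n"
)

# Parse the CSV bytes and convert the timestamp column to a datetime dtype.
df_sept = pd.read_csv(BytesIO(data), parse_dates=["Date_Time"])
print(df_sept.dtypes)
```

Parsing the timestamps at read time saves a separate conversion step before the column is used as an index.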
If you have a compressed CSV file, Pandas can read that into a DataFrame as well:
from io import BytesIO
import pandas as pd
df = pd.read_csv(BytesIO(data), compression='gzip', usecols=['col1', 'col2', 'col3', 'date_time_utc'])
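To try the compressed read without a real file, the sketch below builds a small gzip-compressed CSV in memory. The sample rows and the `extra` column are made up; only the `read_csv` arguments mirror the line above.

```python
import gzip
from io import BytesIO

import pandas as pd

# Build a small gzip-compressed CSV in memory (made-up sample values).
raw = (
    b"col1,col2,col3,extra,date_time_utc\n"
    b"1,2,3,x,2018-09-01 00:00:00\n"
    b"4,5,6,y,2018-09-01 00:05:00\n"
)
data = gzip.compress(raw)

# usecols keeps only the listed columns; 'extra' is dropped while parsing.
df = pd.read_csv(
    BytesIO(data),
    compression="gzip",
    usecols=["col1", "col2", "col3", "date_time_utc"],
)
print(list(df.columns))
```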
The DataFrame info function prints a summary of the frame: the index range, each column's non-null count and dtype, and the memory usage.
RangeIndex: 310370 entries, 0 to 310369
Data columns (total 2 columns):
Date_Time 310370 non-null datetime64[ns]
Energy (J) 310370 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 4.7 MB
Once we have individual frames, the next step is to merge all of them together.
The variable frames is a list of DataFrames, and the Pandas function concat merges them into a single DataFrame. The next line sets the column Date_Time as the index. The head function returns the first 5 rows by default.
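The merge step can be sketched as follows. The two small frames and the name `df_oct` are made up and stand in for the monthly data; `ignore_index=True` is my own choice to renumber the rows after stacking.

```python
import pandas as pd

# Two made-up monthly frames standing in for the per-month CSV data.
df_sept = pd.DataFrame({
    "Date_Time": pd.to_datetime(["2018-09-30 23:50:00", "2018-09-30 23:55:00"]),
    "Energy (J)": [120, 135],
})
df_oct = pd.DataFrame({
    "Date_Time": pd.to_datetime(["2018-10-01 00:00:00", "2018-10-01 00:05:00"]),
    "Energy (J)": [140, 128],
})

frames = [df_sept, df_oct]

# Stack the monthly frames row-wise; ignore_index renumbers the rows 0..n-1.
df = pd.concat(frames, ignore_index=True)

# Use the timestamp column as the index so time-based operations work.
df = df.set_index("Date_Time")
print(df.head())
```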
If you are running Jupyter and want to access files from Google Cloud Storage, the library gcsfs can be used: https://github.com/dask/gcsfs
The DataFrame object provides an easy way to calculate summary statistics.
The dropna function drops rows with missing values, and the describe function calculates the standard summary statistics: count, mean, standard deviation, minimum, quartiles and maximum.
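A small sketch of the two calls together, using made-up readings with one missing value:

```python
import numpy as np
import pandas as pd

# Made-up readings with one missing value.
df = pd.DataFrame({"Energy (J)": [10.0, 20.0, np.nan, 30.0]})

# dropna() removes the row containing NaN; describe() then summarises the rest.
stats = df.dropna().describe()
print(stats)
```

Note that describe skips missing values on its own; dropping them first matters mostly when a row should be excluded entirely because any one of its fields is empty.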
Plotting is an important capability of the Jupyter notebook. There are a number of plotting libraries, such as matplotlib, Seaborn, mpld3, Bokeh and Altair. matplotlib is the de facto standard. Seaborn is built on matplotlib and makes its plots richer. Below is a plot using Seaborn, which shows a summary of three columns averaged by week.
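The weekly averaging behind such a plot can be sketched with Pandas alone. The column names `a`, `b` and `c` and their values are placeholders, not the columns from my data; the resulting frame could then be handed to a plotting library.

```python
import pandas as pd

# Made-up daily values over two full weeks (Monday 2018-09-03 to Sunday 2018-09-16).
idx = pd.date_range("2018-09-03", periods=14, freq="D")
df = pd.DataFrame(
    {"a": range(14), "b": range(14, 28), "c": [1.0] * 14},
    index=idx,
)

# 'W' bins end on Sunday by default; mean() averages each column per week.
weekly = df.resample("W").mean()
print(weekly)
```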
The resulting plot looks like this:
Another example, using matplotlib, overlays two plots to give an indication of anomalies.
The resulting plot:
Facebook Prophet for prediction
Prophet is a forecasting tool for Python and R. It always takes a DataFrame with two columns, ‘ds’ (timestamp) and ‘y’ (the value to forecast), and provides two methods, fit and predict.
The code snippet below demonstrates how to resample the Pandas DataFrame for use with Prophet.
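A minimal sketch of that reshaping, assuming 5-minute readings that are summed per day. The daily frequency, the choice of sum over mean, and the name `df_prophet` are assumptions made for illustration.

```python
import pandas as pd

# Made-up 5-minute energy readings spanning two full days (288 readings per day).
idx = pd.date_range("2018-09-01", periods=576, freq="5min")
df = pd.DataFrame({"Energy (J)": [1] * 576}, index=idx)

# Aggregate to one row per day (sum is an assumption; mean may suit other data).
daily = df["Energy (J)"].resample("D").sum()

# Prophet expects a 'ds' timestamp column and a 'y' value column.
df_prophet = daily.rename("y").rename_axis("ds").reset_index()
print(df_prophet)
```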
The next step is to create a Prophet object and fit it to the DataFrame:
from fbprophet import Prophet
m = Prophet()
m.fit(df)  # df: the resampled frame with 'ds' and 'y' columns (variable name assumed)
The next step creates a DataFrame with future dates (182 days, roughly 6 months):
future = m.make_future_dataframe(periods=182)
Run a prediction using the fitted model:
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
This gives the output below:
Using this forecast, we can plot the prediction or plot its seasonality components.
fig2 = m.plot_components(forecast)
This gives the following plot:
Pandas is a rich library that fills a gap Python has in data analysis. It is easy to use without much programming, and it allows easy filtering, slicing and plotting of data as series or data frames.
Jupyter is a great interactive tool to explore, transform, visualise and share the analysis. It has a very rich ecosystem of modules to explore data across various sources and optimise machine learning models for deployments.