Grouping and Sampling Time Series Data
How to group data by time intervals in Python Pandas
What is Time-Series Data?
Time-series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. The data points are collected at different points in time, typically as successive measurements from the same source over a time interval, and are used to track change over time.
What analysis do we need for Time-series Data?
When analyzing time-series data, we often need to aggregate it into intervals such as a day, a week, or a month.
How Can We Do this?
We will solve this using just two Pandas APIs: resample() and groupby().
Resample():
The resample() function is a convenience method for frequency conversion and resampling of time-series data. The object must have a DateTime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or you can pass DateTime-like values to the on or level keyword.
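As a quick illustration (on a small synthetic series, not the Covid dataset we use below), resample() can downsample hourly readings into daily totals:

```python
import pandas as pd

# 10 days of hourly readings on a DatetimeIndex
idx = pd.date_range("2021-01-01", periods=240, freq="H")
s = pd.Series(range(240), index=idx)

# Downsample hourly data to daily totals
daily = s.resample("D").sum()
print(daily.head())
```

Each of the 10 daily bins holds the sum of its 24 hourly values.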
GroupBy():
GroupBy is a pretty simple concept: we split the data into groups of categories and apply a function to each group. We need it when dealing with large amounts of data and trying things over and over, so for efficiency we use groupby. The concept is really important because of its ability to summarize, aggregate, and group data efficiently.
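A minimal sketch of the idea, on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["A", "A", "B", "B", "B"],
    "cases": [1, 2, 3, 4, 5],
})

# Split rows by state, then sum cases within each group
totals = df.groupby("state")["cases"].sum()
print(totals)
```

This split-apply-combine pattern is exactly what we will use on the Covid dataset, just with real state names and dates.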
Let’s Get our Hands Dirty:
Importing and Installing:
# pip install openpyxl
# pip install pandas_profiling
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
About the Dataset:
The dataset I used is the Covid-19 Cases of India dataset. You can find it here.
COVID-19 in India
Dataset on Novel Corona Virus Disease 2019 in India
Read the Database and Combine Date-Time
We can combine the Date and Time columns directly using parse_dates while reading our CSV file.
data = pd.read_csv('covid-19 INDIA.csv',parse_dates=[['Date','Time']])
data.head()
Converting to DateTime Format
Now, the first step in dealing with any time-series data is to make sure the date column is in DateTime format:
data['Date_Time'] = pd.to_datetime(data['Date_Time'])
Setting Index
data.set_index("Date_Time",inplace=True)
Drop Unnecessary Columns
data.drop(axis=1, columns=["Sno","ConfirmedForeignNational","ConfirmedIndianNational"], inplace=True)
data.tail()
data
Generating a Report of the Dataset: a Quick View of the Complete Data
report = pandas_profiling.ProfileReport(data)
report
Let’s view the data ourselves by state and cases
We can zoom in on our plot using %matplotlib notebook:
# %matplotlib notebook
plt.figure(figsize=(10,5));
plt.scatter(x=data["State/UnionTerritory"],y=data.Confirmed);
plt.show()
Find out which unique states/UTs we have data for
data["State/UnionTerritory"].unique()

Output:
array(['Kerala', 'Telengana', 'Delhi', 'Rajasthan', 'Uttar Pradesh',
       'Haryana', 'Ladakh', 'Tamil Nadu', 'Karnataka', 'Maharashtra',
       'Punjab', 'Jammu and Kashmir', 'Andhra Pradesh', 'Uttarakhand',
       'Odisha', 'Puducherry', 'West Bengal', 'Chhattisgarh',
       'Chandigarh', 'Gujarat', 'Himachal Pradesh', 'Madhya Pradesh',
       'Bihar', 'Manipur', 'Mizoram', 'Andaman and Nicobar Islands',
       'Goa', 'Unassigned', 'Assam', 'Jharkhand', 'Arunachal Pradesh',
       'Tripura', 'Nagaland', 'Meghalaya',
       'Dadra and Nagar Haveli and Daman and Diu',
       'Cases being reassigned to states', 'Sikkim', 'Daman & Diu',
       'Lakshadweep', 'Telangana'], dtype=object)

data["State/UnionTerritory"].nunique()

Output: 40
I don’t think these “Cases being reassigned to states” rows are necessary. Let’s take a look:
data.loc[data["State/UnionTerritory"]=="Cases being reassigned to states"].head(10)
Grouping the cases data by state
data.groupby('State/UnionTerritory').sum().head()
Total number of instances state-wise
data.groupby('State/UnionTerritory').size().head(40)

Output:
State/UnionTerritory
Andaman and Nicobar Islands 401
Andhra Pradesh 415
Arunachal Pradesh 393
Assam 395
Bihar 405
Cases being reassigned to states 60
Chandigarh 408
Chhattisgarh 408
Dadra and Nagar Haveli and Daman and Diu 360
Daman & Diu 1
Delhi 425
Goa 401
Gujarat 407
Haryana 423
Himachal Pradesh 406
Jammu and Kashmir 418
Jharkhand 395
Karnataka 418
Kerala 457
Ladakh 420
Lakshadweep 141
Madhya Pradesh 406
Maharashtra 418
Manipur 403
Meghalaya 382
Mizoram 402
Nagaland 349
Odisha 411
Puducherry 409
Punjab 418
Rajasthan 424
Sikkim 342
Tamil Nadu 420
Telangana 1
Telengana 424
Tripura 389
Unassigned 3
Uttar Pradesh 423
Uttarakhand 412
West Bengal 409
dtype: int64
In Pandas, resampling groups rows by their time or date information and aggregates them into bins at a chosen frequency, producing a new series from the actual data.
We can apply various frequencies when resampling our time-series data. The most commonly used frequency aliases are:
- W: weekly frequency
- M: month-end frequency
- SM: semi-month-end frequency (15th and end of the month)
- Q: quarter-end frequency
- H: hourly frequency
- D: daily frequency
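To see how the aliases behave, here is a small sketch on a synthetic series of 90 daily values (one count per day), resampled at three of the frequencies above:

```python
import pandas as pd

# 90 days of daily data: Jan 1 through Mar 31, 2021
idx = pd.date_range("2021-01-01", periods=90, freq="D")
s = pd.Series(1, index=idx)

print(s.resample("W").sum().head())  # weekly bins (week-ending Sundays)
print(s.resample("M").sum())         # month-end bins: 31, 28, 31
print(s.resample("Q").sum())         # one quarter-end bin covering all 90 days
```

The same series, same data, just binned at different granularities.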
Sampling Monthly Data
Let’s group our data into monthly case totals per state:
data.groupby('State/UnionTerritory').resample("M").sum()
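An equivalent way to express this combined grouping is pd.Grouper, which lets you put the time frequency directly inside groupby. A minimal sketch on made-up data (the column names here are illustrative, not the Covid dataset's):

```python
import pandas as pd

idx = pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-10", "2021-02-15"])
df = pd.DataFrame({"state": ["A", "B", "A", "B"],
                   "cases": [1, 2, 3, 4]}, index=idx)

# Group by state AND month-end bins of the DatetimeIndex in one call
monthly = df.groupby(["state", pd.Grouper(freq="M")])["cases"].sum()
print(monthly)
```

Both forms produce per-state, per-month sums; pd.Grouper is handy when you want several grouping keys at once.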
Sampling by Day
data.groupby('State/UnionTerritory').resample("D").sum()
Quarterly Sampling
data_q = data.groupby('State/UnionTerritory').resample("Q").sum().head(30)
data_q
data_q.columns

Output: Index(['Cured', 'Deaths', 'Confirmed'], dtype='object')
Grouping by Hours
data.groupby('State/UnionTerritory').resample("H").sum().tail(5)
We only have data at 8:00, so for all other hourly rows the values are 0.
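This zero-filling is what happens whenever you upsample to a finer frequency than the data actually has: resample creates a bin for every hour, and sum() fills the empty bins with 0. A small sketch on synthetic data:

```python
import pandas as pd

# Two daily readings, both taken at 8:00
idx = pd.to_datetime(["2021-01-01 08:00", "2021-01-02 08:00"])
s = pd.Series([5, 7], index=idx)

# Upsampling to hourly creates 25 bins; empty ones become 0 under sum()
hourly = s.resample("H").sum()
print(hourly.head())
```

Only the first and last bins hold real values; the 23 hours in between are 0.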
This was all about the basics of resampling and grouping for a time-series dataset. Try experimenting on your own with some cool time-series datasets, including weather updates, census data, etc.
I hope this tutorial helped you in understanding the concept! Don’t forget to experiment on your own and keep learning.
….and if you like this blog, don’t forget to leave a few hearty likes.
Happy Learnings!