Grouping and Sampling Time Series Data

Shelvi Garg
Nerd For Tech
Published in
5 min readMay 18, 2021

How to group data by time intervals in Python Pandas

Image Ref: Unsplash

What is Time-Series Data:

Time series data also referred to as time-stamped data, is a sequence of data points indexed in time order. Time-stamped is data collected at different points in time. These data points typically consist of successive measurements made from the same source over a time interval and are used to track change over time.

What analysis do we need for Time-series Data?

While dealing with time-Series data analysis we need to combine data into certain intervals like with each day, a week, or a month.

How Can We Do this?

We will solve these using only 2 Pandas APIs i.e. resample() and GroupBy().

Resample():

The resample() function is used to resample time-series data. Convenience method for frequency conversion and resampling of time series. The object must have a DateTime-like index(DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass DateTime-like values to the on or level keyword.

GroupBy():

GroupBy is a pretty simple concept. We can create a grouping of categories and apply a function to the categories. We need it when dealing with large amounts of data and trying things over and over, so for efficiency, we use the Groupby concept. The GroupBy concept is really important because of its ability to summarize, aggregate, and group data efficiently.

Let’s Get our Hands Dirty:

Importing and Installing:

#pip install openpyxl
#pip install pandas_profiling
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns

About the Dataset:

The dataset I used is Covid-19 Cases Of India Dataset. You can find it here.

COVID-19 in India

Dataset on Novel Corona Virus Disease 2019 in India

www.kaggle.com

Read the Database and Combine Date-Time

We can combine data and time directly using parse_dates while reading our CSV File.

data = pd.read_csv('covid-19 INDIA.csv',parse_dates=[['Date','Time']]) 
data.head()

Converting to DateTime Format

Now, the first step to deal with any time series data is to convert the date column into a DateTime format:

data['Date_Time'] = pd.to_datetime(data['Date_Time'])

Setting Index

data.set_index("Date_Time",inplace=True)

Drop Unnecessary Columns

data.drop(axis=1, columns=["Sno","ConfirmedForeignNational","ConfirmedIndianNational"],inplace=True)data.tail()
data

Generating Report of Dataset : Quick View of Complete Dataset

report = pandas_profiling.ProfileReport(data)report

Let’s view the data ourselves by state and cases

We can zoom-in our data using matplotlib notebook

# %matplotlib notebookplt.figure(figsize=(10,5));
plt.scatter(x=data["State/UnionTerritory"],y=data.Confirmed);
plt.show()

Find out the unique states/UTs data we have

data["State/UnionTerritory"].unique()Output:array(['Kerala', 'Telengana', 'Delhi', 'Rajasthan', 'Uttar Pradesh',
'Haryana', 'Ladakh', 'Tamil Nadu', 'Karnataka', 'Maharashtra',
'Punjab', 'Jammu and Kashmir', 'Andhra Pradesh', 'Uttarakhand',
'Odisha', 'Puducherry', 'West Bengal', 'Chhattisgarh',
'Chandigarh', 'Gujarat', 'Himachal Pradesh', 'Madhya Pradesh',
'Bihar', 'Manipur', 'Mizoram', 'Andaman and Nicobar Islands',
'Goa', 'Unassigned', 'Assam', 'Jharkhand', 'Arunachal Pradesh',
'Tripura', 'Nagaland', 'Meghalaya',
'Dadra and Nagar Haveli and Daman and Diu',
'Cases being reassigned to states', 'Sikkim', 'Daman & Diu',
'Lakshadweep', 'Telangana'], dtype=object)data["State/UnionTerritory"].nunique()Output: 40

I don’t think these “Cases being reassigned to states” rows are necessary. Lets lookout

data.loc[data["State/UnionTerritory"]=="Cases being reassigned to states"].head(10)

Grouping the cases data by states

data.groupby('State/UnionTerritory').sum().head()

Total no of instances state-wise

data.groupby('State/UnionTerritory').size().head(40)State/UnionTerritory
Andaman and Nicobar Islands 401
Andhra Pradesh 415
Arunachal Pradesh 393
Assam 395
Bihar 405
Cases being reassigned to states 60
Chandigarh 408
Chhattisgarh 408
Dadra and Nagar Haveli and Daman and Diu 360
Daman & Diu 1
Delhi 425
Goa 401
Gujarat 407
Haryana 423
Himachal Pradesh 406
Jammu and Kashmir 418
Jharkhand 395
Karnataka 418
Kerala 457
Ladakh 420
Lakshadweep 141
Madhya Pradesh 406
Maharashtra 418
Manipur 403
Meghalaya 382
Mizoram 402
Nagaland 349
Odisha 411
Puducherry 409
Punjab 418
Rajasthan 424
Sikkim 342
Tamil Nadu 420
Telangana 1
Telengana 424
Tripura 389
Unassigned 3
Uttar Pradesh 423
Uttarakhand 412
West Bengal 409
dtype: int64

Resampling generates a unique sampling distribution on the basis of the actual data. It groups rows by some time or date information

We can apply various frequencies to resample our time series data. Most commonly used time series frequency is –

  • W: weekly frequency
  • M: month end frequency
  • SM: semi-month end frequency (15th and end of the month)
  • Q: quarter end frequency
  • H : Hourly
  • D : Per Day

Sampling Monthly Data

Let’s group our data based upon the monthly cases data

data.groupby('State/UnionTerritory').resample("M").sum()

Sampling by Day

data.groupby('State/UnionTerritory').resample("D").sum()

Quaterly Sampling

data_q = data.groupby('State/UnionTerritory').resample("Q").sum().head(30)
data_q
data_q.columnsIndex(['Cured', 'Deaths', 'Confirmed'], dtype='object')

Grouping By hours

data.groupby('State/UnionTerritory').resample("H").sum().tail(5)

We have data at 8:00 clock thus for all other rows the values are 0.

This was all about the basics of resampling and grouping for a time-series dataset. Try and experiment your own with some cool time series dataset including weather updates, census data, etc

I hope this tutorial helped you in understanding the concept! Don’t forget to experiment on your own and keep learning.

….and if you like this blog, don’t forget to leave a few hearty likes.

Happy Learnings!

--

--