Grouping and Sampling Time Series Data
How to group data by time intervals in Python Pandas
What is Time-Series Data?
Time-series data, also referred to as time-stamped data, is a sequence of data points indexed in time order. The data points are collected at different points in time, typically as successive measurements from the same source over a time interval, and are used to track change over time.
What analysis do we need for Time-series Data?
When analyzing time-series data, we often need to aggregate it into intervals such as a day, a week, or a month.
How Can We Do this?
We will solve this using just two Pandas APIs: resample() and groupby().
Resample():
The resample() function is a convenience method for frequency conversion and resampling of time-series data. The object must have a DateTime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or you can pass DateTime-like values to the on or level keyword.
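As a quick illustration (on a small synthetic series, not the Covid dataset we use below), resample() can downsample hourly readings into daily totals:

```python
import pandas as pd

# 10 days of hourly readings on a DatetimeIndex
idx = pd.date_range("2021-01-01", periods=240, freq="H")
s = pd.Series(range(240), index=idx)

# Downsample hourly data to daily totals
daily = s.resample("D").sum()
print(daily.head())
```

Each of the 10 daily bins holds the sum of its 24 hourly values.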
GroupBy():
GroupBy is a pretty simple concept: we split the data into groups of categories and apply a function to each group. We need it when dealing with large amounts of data and trying things over and over, so for efficiency we use groupby. The concept is really important because of its ability to summarize, aggregate, and group data efficiently.
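A minimal sketch of the idea, on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["A", "A", "B", "B", "B"],
    "cases": [1, 2, 3, 4, 5],
})

# Split rows by state, then sum cases within each group
totals = df.groupby("state")["cases"].sum()
print(totals)
```

This split-apply-combine pattern is exactly what we will use on the Covid dataset, just with real state names and dates.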
Let’s Get our Hands Dirty:
Importing and Installing:
# pip install openpyxl
# pip install pandas_profiling
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
About the Dataset:
The dataset I used is the Covid-19 Cases of India dataset. You can find it here.
COVID-19 in India
Dataset on Novel Corona Virus Disease 2019 in India
Read the Database and Combine Date-Time
We can combine the Date and Time columns directly using parse_dates while reading our CSV file.
data = pd.read_csv('covid-19 INDIA.csv',parse_dates=[['Date','Time']])
data.head()
Converting to DateTime Format
Now, the first step in dealing with any time-series data is to make sure the date column is in DateTime format:
data['Date_Time'] = pd.to_datetime(data['Date_Time'])
Setting Index
data.set_index("Date_Time",inplace=True)
Drop Unnecessary Columns
data.drop(axis=1, columns=["Sno","ConfirmedForeignNational","ConfirmedIndianNational"], inplace=True)
data.tail()
data
Generating a Report of the Dataset: a Quick View of the Complete Data
report = pandas_profiling.ProfileReport(data)
report
Let’s view the data ourselves by state and cases
We can zoom in on our plot using %matplotlib notebook:
# %matplotlib notebook
plt.figure(figsize=(10,5));
plt.scatter(x=data["State/UnionTerritory"],y=data.Confirmed);
plt.show()
Find out which unique states/UTs we have data for
data["State/UnionTerritory"].unique()

Output:
array(['Kerala', 'Telengana', 'Delhi', 'Rajasthan', 'Uttar Pradesh',
       'Haryana', 'Ladakh', 'Tamil Nadu', 'Karnataka', 'Maharashtra',
       'Punjab', 'Jammu and Kashmir', 'Andhra Pradesh', 'Uttarakhand',
       'Odisha', 'Puducherry', 'West Bengal', 'Chhattisgarh',
       'Chandigarh', 'Gujarat', 'Himachal Pradesh', 'Madhya Pradesh',
       'Bihar', 'Manipur', 'Mizoram', 'Andaman and Nicobar Islands',
       'Goa', 'Unassigned', 'Assam', 'Jharkhand', 'Arunachal Pradesh',
       'Tripura', 'Nagaland', 'Meghalaya',
       'Dadra and Nagar Haveli and Daman and Diu',
       'Cases being reassigned to states', 'Sikkim', 'Daman & Diu',
       'Lakshadweep', 'Telangana'], dtype=object)

data["State/UnionTerritory"].nunique()

Output: 40
I don’t think these “Cases being reassigned to states” rows are necessary. Let’s take a look:
data.loc[data["State/UnionTerritory"]=="Cases being reassigned to states"].head(10)
Grouping the cases data by state
data.groupby('State/UnionTerritory').sum().head()
Total number of instances state-wise
data.groupby('State/UnionTerritory').size().head(40)

Output:
State/UnionTerritory
Andaman and Nicobar Islands 401
Andhra Pradesh 415
Arunachal Pradesh 393
Assam 395
Bihar 405
Cases being reassigned to states 60
Chandigarh 408
Chhattisgarh 408
Dadra and Nagar Haveli and Daman and Diu 360
Daman & Diu 1
Delhi 425
Goa 401
Gujarat 407
Haryana 423
Himachal Pradesh 406
Jammu and Kashmir 418
Jharkhand 395
Karnataka 418
Kerala 457
Ladakh 420
Lakshadweep 141
Madhya Pradesh 406
Maharashtra 418
Manipur 403
Meghalaya 382
Mizoram 402
Nagaland 349
Odisha 411
Puducherry 409
Punjab 418
Rajasthan 424
Sikkim 342
Tamil Nadu 420
Telangana 1
Telengana 424
Tripura 389
Unassigned 3
Uttar Pradesh 423
Uttarakhand 412
West Bengal 409
dtype: int64
In Pandas, resampling groups rows by their time or date information and aggregates them into bins at a chosen frequency, producing a new series from the actual data.
We can apply various frequencies when resampling our time-series data. The most commonly used frequency aliases are:
- W: weekly frequency
- M: month-end frequency
- SM: semi-month-end frequency (15th and end of the month)
- Q: quarter-end frequency
- H: hourly frequency
- D: daily frequency
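To see how the aliases behave, here is a small sketch on a synthetic series of 90 daily values (one count per day), resampled at three of the frequencies above:

```python
import pandas as pd

# 90 days of daily data: Jan 1 through Mar 31, 2021
idx = pd.date_range("2021-01-01", periods=90, freq="D")
s = pd.Series(1, index=idx)

print(s.resample("W").sum().head())  # weekly bins (week-ending Sundays)
print(s.resample("M").sum())         # month-end bins: 31, 28, 31
print(s.resample("Q").sum())         # one quarter-end bin covering all 90 days
```

The same series, same data, just binned at different granularities.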
Sampling Monthly Data
Let’s group our data into monthly case totals per state:
data.groupby('State/UnionTerritory').resample("M").sum()
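An equivalent way to express this combined grouping is pd.Grouper, which lets you put the time frequency directly inside groupby. A minimal sketch on made-up data (the column names here are illustrative, not the Covid dataset's):

```python
import pandas as pd

idx = pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-10", "2021-02-15"])
df = pd.DataFrame({"state": ["A", "B", "A", "B"],
                   "cases": [1, 2, 3, 4]}, index=idx)

# Group by state AND month-end bins of the DatetimeIndex in one call
monthly = df.groupby(["state", pd.Grouper(freq="M")])["cases"].sum()
print(monthly)
```

Both forms produce per-state, per-month sums; pd.Grouper is handy when you want several grouping keys at once.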
Sampling by Day
data.groupby('State/UnionTerritory').resample("D").sum()
Quarterly Sampling
data_q = data.groupby('State/UnionTerritory').resample("Q").sum().head(30)
data_q
data_q.columns

Output: Index(['Cured', 'Deaths', 'Confirmed'], dtype='object')
Grouping by Hours
data.groupby('State/UnionTerritory').resample("H").sum().tail(5)
We only have data at 8:00, so for all other hourly rows the values are 0.
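This zero-filling is what happens whenever you upsample to a finer frequency than the data actually has: resample creates a bin for every hour, and sum() fills the empty bins with 0. A small sketch on synthetic data:

```python
import pandas as pd

# Two daily readings, both taken at 8:00
idx = pd.to_datetime(["2021-01-01 08:00", "2021-01-02 08:00"])
s = pd.Series([5, 7], index=idx)

# Upsampling to hourly creates 25 bins; empty ones become 0 under sum()
hourly = s.resample("H").sum()
print(hourly.head())
```

Only the first and last bins hold real values; the 23 hours in between are 0.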
This was all about the basics of resampling and grouping for a time-series dataset. Try experimenting on your own with some cool time-series datasets, including weather updates, census data, etc.
I hope this tutorial helped you in understanding the concept! Don’t forget to experiment on your own and keep learning.
….and if you like this blog, don’t forget to leave a few hearty likes.
Happy Learnings!