Meteorological Data Analysis
Analysis of weather data for Finland to check whether the monthly average temperature and humidity increased between 2006 and 2016.
This notebook analyzes meteorological data. The dataset can be obtained from the following link: https://www.kaggle.com/muthuj7/weather-dataset
Problem Statement
Determine whether the monthly average apparent temperature and the monthly average humidity increased between 2006 and 2016.
Importing Libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Loading Data
weather_df = pd.read_csv("weatherHistory.csv")
weather_df.head()
Understanding the data
weather_df.shape
(96453, 11)
We have 96453 rows and 11 columns, but we are interested in only three of them: Formatted Date, Apparent Temperature (C) and Humidity. Let's use just these three columns from here on.
weather_df.columns
Index(['Formatted Date', 'Summary', 'Precip Type', 'Temperature (C)',
       'Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
       'Wind Bearing (degrees)', 'Visibility (km)', 'Pressure (millibars)',
       'Daily Summary'],
      dtype='object')
# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
weather_final_df = weather_df[['Formatted Date', 'Apparent Temperature (C)', 'Humidity']].copy()
weather_final_df.head()
Let's see some basic info about our dataset
weather_final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 3 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Formatted Date            96453 non-null  object
 1   Apparent Temperature (C)  96453 non-null  float64
 2   Humidity                  96453 non-null  float64
dtypes: float64(2), object(1)
memory usage: 2.2+ MB
We have two numeric columns and one object (string) column, and none of them contains null values. The data types of the numeric columns are correct, but Formatted Date is stored as text when it should be a datetime. We will convert it later.
The statistical description of our dataset
weather_final_df.describe()
We don't see a large difference in the mean and median values for both of our numeric columns so they don't require any transformation.
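One way to make the mean-versus-median comparison explicit is to place the two statistics side by side. The sketch below uses a small synthetic frame named `demo_df` (the name and the values are made up for illustration and are not taken from the dataset); with the real data, `weather_final_df` would take its place. Scaling the gap by the standard deviation is an assumption of this sketch, chosen so that columns with different units can be compared.

```python
import pandas as pd

# Synthetic stand-in for weather_final_df; values are illustrative only.
demo_df = pd.DataFrame({
    "Apparent Temperature (C)": [-3.0, 2.5, 10.0, 18.5, 24.0],
    "Humidity": [0.45, 0.60, 0.75, 0.85, 0.95],
})

# One row per column, with its mean and median side by side.
summary = demo_df.agg(["mean", "median"]).T

# Gap between mean and median, expressed in standard deviations.
summary["gap_in_std"] = (summary["mean"] - summary["median"]).abs() / demo_df.std()
print(summary)
```

A gap that is small relative to the spread supports the reading that the column is roughly symmetric.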
Estimating Skewness and Kurtosis
weather_final_df.skew(numeric_only=True)
Apparent Temperature (C)   -0.057302
Humidity                   -0.715880
dtype: float64
weather_final_df.kurt(numeric_only=True)
Apparent Temperature (C)   -0.706844
Humidity                   -0.462170
dtype: float64
Parsing the Formatted Date column
weather_final_df['Formatted Date'] = pd.to_datetime(weather_final_df['Formatted Date'],utc=True)
weather_final_df = weather_final_df.set_index('Formatted Date')
data = weather_final_df[['Apparent Temperature (C)','Humidity']].resample('MS').mean()
data
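The parsing and resampling steps can be tried out on a few rows first. The sketch below is only an illustration: the frame `raw` and its timestamps are hypothetical, written in the same style as the Formatted Date column (local time followed by a UTC offset that changes with daylight saving time).

```python
import pandas as pd

# Hypothetical rows in the style of 'Formatted Date'; values are made up.
raw = pd.DataFrame({
    "Formatted Date": [
        "2006-01-15 12:00:00.000 +0100",
        "2006-01-20 12:00:00.000 +0100",
        "2006-07-15 12:00:00.000 +0200",
    ],
    "Humidity": [0.80, 0.90, 0.60],
})

# utc=True converts every timestamp to UTC, so rows with different
# offsets end up in one consistent timezone-aware datetime column.
raw["Formatted Date"] = pd.to_datetime(raw["Formatted Date"], utc=True)

# Resample into month-start ('MS') bins and average within each month.
# Months without any rows appear as NaN.
monthly = raw.set_index("Formatted Date")["Humidity"].resample("MS").mean()
print(monthly.dropna())
```

Note that the conversion shifts each reading by its offset, so a local-time reading close to midnight on the first day of a month can fall into the previous month once expressed in UTC.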
Making new columns
In this step, we split the Formatted Date column into separate Year and Month columns, because the problem statement compares the same month across years.
weather_final_df = weather_final_df.reset_index()
weather_final_df['Formatted Date'] = pd.to_datetime(weather_final_df['Formatted Date'],utc = True)
weather_final_df['Month'] = weather_final_df['Formatted Date'].dt.month
weather_final_df['Year'] = weather_final_df['Formatted Date'].dt.year
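# 'index' and 'level_0' exist only if reset_index() was run more than once; errors='ignore' avoids a KeyError otherwise
weather_final_df.drop(columns=['index', 'level_0'], errors='ignore', inplace=True)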
weather_final_df.head()
weather_final_grouped_df = weather_final_df.groupby(['Month','Year']).mean()
weather_final_grouped_df.head(11)
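The grouped frame is indexed by (Month, Year), and any code that later slices it by position relies on every month having the same number of years. A hedged way to check this is to pivot the result into a Year by Month table and look at its shape and missing cells. The sketch below uses a synthetic frame `demo_df` with made-up values; the column names mirror the notebook, everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one reading per month for 2006-2008 (made-up values).
dates = pd.date_range("2006-01-01", "2008-12-01", freq="MS", tz="UTC")
demo_df = pd.DataFrame({
    "Formatted Date": dates,
    "Humidity": np.linspace(0.5, 0.9, len(dates)),
})
demo_df["Month"] = demo_df["Formatted Date"].dt.month
demo_df["Year"] = demo_df["Formatted Date"].dt.year

# Average per (Year, Month), then pivot months into columns:
# rows are years, columns are months 1..12.
humidity_table = (
    demo_df.groupby(["Year", "Month"])["Humidity"].mean().unstack("Month")
)

# Any NaN flags a month/year combination with no data.
print(humidity_table.shape)
print(humidity_table.isna().sum().sum())
```

On the real data, an unexpected extra row or a NaN cell would indicate that positional slices such as `[:11]` no longer line up with the intended months.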
Variation in apparent temperature in different months
plt.figure(figsize=(21,12))
x_axis = np.arange(2006,2017)
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][:11].values, label = 'January')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][11:22].values, label = 'February')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][22:33].values, label = 'March')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][33:44].values, label = 'April')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][44:55].values, label = 'May')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][55:66].values, label = 'June')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][66:77].values, label = 'July')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][77:88].values, label = 'August')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][88:99].values, label = 'September')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][99:110].values, label = 'October')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][110:121].values, label = 'November')
plt.plot(x_axis,weather_final_grouped_df['Apparent Temperature (C)'][121:132].values, label = 'December')
plt.legend(loc = 'lower right')
plt.title('Variation in apparent temperature in different months')
plt.show()
- Looking at the above visualization we can tell whether the apparent temperature for a given month from 2006 to 2016 increased or decreased.
- For example, if we consider the month of December then we can tell that it has increased.
- While if we see the month of October then it has actually decreased.
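Reading a trend off a line plot is subjective, so it can help to put a number on it. One simple option, not used in the original analysis, is to fit a straight line through the yearly averages of a single month and look at the sign of the slope. The values in `december_avg` below are hypothetical and only show the mechanics.

```python
import numpy as np

# Hypothetical yearly averages for one month across 2006-2016
# (made-up numbers, not taken from the dataset).
years = np.arange(2006, 2017)
december_avg = np.array([-1.0, -0.5, 0.2, -2.0, -4.5, 0.8, -2.8, 0.5, 1.2, 1.0, -0.3])

# A degree-1 polyfit returns (slope, intercept); the slope is the
# average change in degrees C per year over the period.
slope, intercept = np.polyfit(years, december_avg, 1)
direction = "increased" if slope > 0 else "decreased"
print(f"Trend: {slope:+.3f} C/year ({direction})")
```

A fitted slope is only a rough summary: with eleven noisy points it says nothing about statistical significance, so it should be read alongside the plot rather than instead of it.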
Variation in Humidity in different months
plt.figure(figsize=(21,12))
x_axis = np.arange(2006,2017)
plt.plot(x_axis,weather_final_grouped_df['Humidity'][:11].values, label = 'January')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][11:22].values, label = 'February')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][22:33].values, label = 'March')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][33:44].values, label = 'April')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][44:55].values, label = 'May')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][55:66].values, label = 'June')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][66:77].values, label = 'July')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][77:88].values, label = 'August')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][88:99].values, label = 'September')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][99:110].values, label = 'October')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][110:121].values, label = 'November')
plt.plot(x_axis,weather_final_grouped_df['Humidity'][121:132].values, label = 'December')
plt.legend(loc = 'lower right')
plt.title('Variation in humidity in different months')
plt.show()
- Looking at the above visualization we can tell whether the average humidity for a given month from 2006 to 2016 increased or decreased.
- For example, if we consider the month of February then we can tell that it has increased.
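The two plotting cells above repeat twelve nearly identical calls and select each month by row position. As an alternative sketch, the same figure can be drawn in a loop that selects each month by its index label, which does not depend on row order or on every month having exactly eleven rows. The frame `grouped` below is a synthetic stand-in for `weather_final_grouped_df` with random values; the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import calendar

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for weather_final_grouped_df: a (Month, Year) index
# with made-up humidity values.
index = pd.MultiIndex.from_product(
    [range(1, 13), range(2006, 2017)], names=["Month", "Year"]
)
rng = np.random.default_rng(0)
grouped = pd.DataFrame({"Humidity": rng.uniform(0.5, 0.9, len(index))}, index=index)

fig, ax = plt.subplots(figsize=(21, 12))
for month in range(1, 13):
    # Select one month by label; the result is indexed by Year.
    series = grouped["Humidity"].xs(month, level="Month")
    ax.plot(series.index, series.values, label=calendar.month_name[month])
ax.legend(loc="lower right")
ax.set_title("Variation in humidity in different months")
```

Swapping the column name and the title gives the apparent temperature plot from the same loop.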