Meteorological Data Analysis

An analysis of weather data from Finland to check whether the monthly average apparent temperature and humidity increased between 2006 and 2016.

Aditya Jetely
Oct 12, 2020

In this notebook I analyze meteorological data. The dataset can be downloaded from the following link: https://www.kaggle.com/muthuj7/weather-dataset

Problem Statement

Determine whether the average apparent temperature for each calendar month increased from 2006 to 2016, and whether the average humidity increased over the same period.

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Loading Data

weather_df = pd.read_csv("weatherHistory.csv")
weather_df.head()

Understanding the data

weather_df.shape
(96453, 11)

We have 96453 rows and 11 columns, but we are interested in only three of them: Formatted Date, Apparent Temperature (C) and Humidity. So let's use just these three columns from here on.

weather_df.columns
Index(['Formatted Date', 'Summary', 'Precip Type', 'Temperature (C)',
       'Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
       'Wind Bearing (degrees)', 'Visibility (km)', 'Pressure (millibars)',
       'Daily Summary'],
      dtype='object')

# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
weather_final_df = weather_df[['Formatted Date', 'Apparent Temperature (C)', 'Humidity']].copy()
weather_final_df.head()

Let's see some basic info about our dataset

weather_final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 3 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Formatted Date            96453 non-null  object
 1   Apparent Temperature (C)  96453 non-null  float64
 2   Humidity                  96453 non-null  float64
dtypes: float64(2), object(1)
memory usage: 2.2+ MB

We have two numeric columns and one object (string) column, and none of them contains null values. The data types of the numeric columns are correct, but Formatted Date is stored as strings and should be a datetime. We will deal with that later.

The statistical description of our dataset

weather_final_df.describe()

For both numeric columns the mean and the median are close to each other, so no transformation is required.
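The mean-versus-median comparison can also be made numerically instead of by reading the describe() table. The following is a minimal sketch on made-up toy data, not on weatherHistory.csv; the helper name mean_median_gap and the column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the weather dataset.
toy_df = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 30.0],
})

def mean_median_gap(df):
    """Return |mean - median| of each numeric column, in units of its standard deviation."""
    numeric = df.select_dtypes("number")
    return (numeric.mean() - numeric.median()).abs() / numeric.std()

gap = mean_median_gap(toy_df)
print(gap)
```

A gap near zero suggests a roughly symmetric column, while a larger gap hints at skew that might justify a transformation. Where to draw the line is a judgment call and not something this sketch decides.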

Estimating Skewness and Kurtosis

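# numeric_only=True skips the string-typed Formatted Date column
weather_final_df.skew(numeric_only=True)
Apparent Temperature (C)   -0.057302
Humidity                   -0.715880
dtype: float64
weather_final_df.kurt(numeric_only=True)
Apparent Temperature (C)   -0.706844
Humidity                   -0.462170
dtype: float64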

Parsing the Formatted Date column

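# Parse the date strings; utc=True puts every timestamp on a single UTC timeline
weather_final_df['Formatted Date'] = pd.to_datetime(weather_final_df['Formatted Date'], utc=True)
weather_final_df = weather_final_df.set_index('Formatted Date')
# Resample to month-start ('MS') frequency and average the values within each month
data = weather_final_df[['Apparent Temperature (C)', 'Humidity']].resample('MS').mean()
data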
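One side effect of parsing with utc=True is worth keeping in mind: a timestamp at local midnight on the first day of a month moves into the previous month once it is converted to UTC. The sketch below illustrates this with two sample strings that are assumed to resemble the dataset's timestamp format; the values are chosen for illustration and are not read from the file.

```python
import pandas as pd

# Sample strings assumed to resemble the Formatted Date format (illustrative values).
raw = pd.Series([
    "2006-01-01 00:00:00.000 +0100",  # local midnight with a +01:00 offset
    "2006-07-01 00:00:00.000 +0200",  # local midnight with a +02:00 offset
])

# utc=True converts both offsets to UTC, giving one timezone-aware column.
parsed = pd.to_datetime(raw, utc=True)

years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
print(parsed)
print(years, months)
```

If the dataset contains such midnight readings, the monthly averages computed from UTC timestamps will include one extra group at the start of the range and will assign a few boundary hours to the neighbouring month.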

Making new columns

In this step we split the Formatted Date column into separate Year and Month columns, because the problem statement asks for a month-by-month comparison across years.

weather_final_df = weather_final_df.reset_index()
# 'Formatted Date' was already parsed to datetime above, so it does not need to be parsed again
weather_final_df['Month'] = weather_final_df['Formatted Date'].dt.month
weather_final_df['Year'] = weather_final_df['Formatted Date'].dt.year
# Drop leftover index columns if repeated reset_index() calls created them
weather_final_df = weather_final_df.drop(columns=['index', 'level_0'], errors='ignore')
weather_final_df.head()

# Average the two measurements for every (Month, Year) pair
weather_final_grouped_df = weather_final_df.groupby(['Month', 'Year'])[['Apparent Temperature (C)', 'Humidity']].mean()
weather_final_grouped_df.head(11)
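The grouped result has one row per (Month, Year) pair. For a month-by-month comparison it can be easier to read as a table with one row per year and one column per month. Here is a minimal sketch of that reshaping on made-up values; the toy DataFrame is hypothetical and only mirrors the Year, Month and Humidity column names used above.

```python
import pandas as pd

# Hypothetical readings; the numbers are made up for illustration.
toy = pd.DataFrame({
    "Year":     [2006, 2006, 2006, 2007, 2007, 2007],
    "Month":    [1, 1, 2, 1, 2, 2],
    "Humidity": [0.80, 0.90, 0.70, 0.60, 0.50, 0.70],
})

# Same grouping as in the article: mean per (Month, Year) pair.
monthly = toy.groupby(["Month", "Year"])["Humidity"].mean()

# Move the Month level into columns: rows are years, columns are months.
table = monthly.unstack("Month")
print(table)
```

Each column of the resulting table is then the yearly series for one calendar month, which is the quantity the problem statement asks about.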

Variation in apparent temperature in different months

import calendar

plt.figure(figsize=(21, 12))
apparent_temp = weather_final_grouped_df['Apparent Temperature (C)']
# One line per calendar month; each month's values are plotted against their own Year labels
for month in range(1, 13):
    monthly = apparent_temp.xs(month, level='Month')
    plt.plot(monthly.index, monthly.values, label=calendar.month_name[month])
plt.legend(loc='lower right')
plt.title('Variation in apparent temperature in different months')
plt.show()
  • The visualization above shows, for each month, whether the average apparent temperature rose or fell between 2006 and 2016.
  • For example, the line for December shows an increase over the period.
  • The line for October, in contrast, shows a decrease.
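Reading a rise or fall off a line plot can be backed by a number, for instance the slope of a straight line fitted to each month's yearly values. The sketch below assumes hypothetical yearly series and is not the method used in this notebook; the helper name yearly_trend and the sample values are made up for illustration.

```python
import numpy as np

years = np.arange(2006, 2017)

# Hypothetical yearly averages for two months (made-up values).
rising = 1.0 + 0.2 * (years - 2006)
falling = 5.0 - 0.1 * (years - 2006)

def yearly_trend(years, values):
    """Least-squares slope of values against years, in units per year."""
    centered = years - years.mean()  # centering keeps the fit well conditioned
    slope, _intercept = np.polyfit(centered, values, deg=1)
    return slope

rising_slope = yearly_trend(years, rising)
falling_slope = yearly_trend(years, falling)
print(rising_slope, falling_slope)
```

A positive slope indicates an overall increase across the years and a negative slope a decrease. With only eleven points per month, such a slope is a rough summary and not a statistical test.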

Variation in Humidity in different months

import calendar

plt.figure(figsize=(21, 12))
humidity = weather_final_grouped_df['Humidity']
# One line per calendar month; each month's values are plotted against their own Year labels
for month in range(1, 13):
    monthly = humidity.xs(month, level='Month')
    plt.plot(monthly.index, monthly.values, label=calendar.month_name[month])
plt.legend(loc='lower right')
plt.title('Variation in humidity in different months')
plt.show()
  • The visualization above shows, for each month, whether the average humidity rose or fell between 2006 and 2016.
  • For example, the line for February shows an increase over the period.

Aditya Jetely

Final Year Electronics and Communication Engineering Student with a keen interest in data science and open source. https://www.linkedin.com/in/aditya-jetely