Making Sense of Stockholm Historical Temperatures Data. Some Hacker Statistics in Python (Part 1)
This week I completed my first month living in Sweden and, as I had read many times, the weather is always a hot topic here (did I say hot? 😆). As a data practitioner, I felt obliged to build some visualizations and a sound statistical analysis of what the temperatures here look like, to make an impression at the next Fika.
By reading this article you will have a sense of how to:
- Build and plot Empirical Cumulative Distribution Functions (ECDF) using Matplotlib.
- Visually verify whether your data are normally distributed by drawing samples from a normal distribution using Numpy in order to derive a CDF and compare with the ECDF built from empirical data.
- Calculate useful summary statistics by grouping the data in a meaningful way using Pandas.
We will use data from the Bolin Centre Database at Stockholm University, which goes back to 1756.
Let’s import the Python packages with their usual aliases.
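A minimal sketch of the imports, assuming the conventional aliases np, pd and plt for the three packages this article relies on:

```python
import numpy as np               # arrays and random sampling
import pandas as pd              # tabular data handling
import matplotlib.pyplot as plt  # plotting
```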
From Pandas version 0.19.2 on we can pass a URL directly to the function pandas.read_csv() and get a dataframe back.
We now have a Pandas dataframe loaded in our Jupyter Notebook. Let’s dive in :)
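As a rough sketch of the loading step: the address below is only a placeholder (substitute the real CSV link from the Bolin Centre Database), and the column names Year, Month and Day are my assumption. Only Temperature_Processed_2 is a name taken from the data set description. To keep the example runnable offline, the same function is fed a tiny made-up CSV held in memory.

```python
import io

import pandas as pd

# Placeholder address, NOT the real data location.
DATA_URL = "https://example.org/stockholm_daily_temperatures.csv"


def load_temperatures(source):
    """Read daily observations from a URL, a local path or a file-like object."""
    return pd.read_csv(source)


# Offline demonstration with invented rows; with network access one would
# call load_temperatures(DATA_URL) instead.
sample_csv = io.StringIO(
    "Year,Month,Day,Temperature_Processed_2\n"
    "1756,1,1,-8.7\n"
    "1756,1,2,-9.2\n"
    "1756,1,3,-8.6\n"
)
df = load_temperatures(sample_csv)
print(df.head())
```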
We have 95694 daily observations of temperatures organized by year and month. By reading the database website one can learn that the column Temperature_Processed_2 has the data with the proper measurement corrections.
Daily observations are too granular for the kind of questions we want to answer, so we will aggregate the data set into monthly temperature averages.
Below we can see the data for the year 1756.
The data set now contains the monthly average temperature for each combination of month and year. We also gave the temperature column a more informative name: Monthly_Average_Temperature.
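One way to do this aggregation is a groupby on year and month followed by a mean. The sketch below uses a handful of invented daily rows, and the grouping column names Year and Month are assumptions about the file layout.

```python
import pandas as pd

# Invented daily rows, just to show the shape of the transformation.
daily = pd.DataFrame(
    {
        "Year": [1756, 1756, 1756, 1756],
        "Month": [1, 1, 2, 2],
        "Temperature_Processed_2": [-8.0, -6.0, -3.0, -1.0],
    }
)

# Average the daily values within each (year, month) pair and rename
# the result column to something more descriptive.
monthly = (
    daily.groupby(["Year", "Month"], as_index=False)["Temperature_Processed_2"]
    .mean()
    .rename(columns={"Temperature_Processed_2": "Monthly_Average_Temperature"})
)
print(monthly)
```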
Okay!
For the coming plots, we also translate the month numbers (1 to 12) into more meaningful names.
Now we are in a position to inspect the data visually. Let’s plot the monthly average temperatures and get a feeling for what they look like.
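A small sketch of that translation, using a plain dictionary and Series.map. The column name Month_Name is hypothetical, and the rows are invented.

```python
import pandas as pd

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Map 1 -> "January", 2 -> "February", ..., 12 -> "December".
number_to_name = dict(enumerate(MONTH_NAMES, start=1))

monthly = pd.DataFrame(
    {
        "Year": [1756, 1756, 1756],
        "Month": [1, 2, 12],
        "Monthly_Average_Temperature": [-7.0, -2.0, -1.5],  # invented values
    }
)
monthly["Month_Name"] = monthly["Month"].map(number_to_name)
print(monthly)
```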
Visually we see the bells indicating normality (in a statistical sense) of the data 🔔!
But there are also some long left tails, especially in January and February, which could indicate that the normal distribution does not apply to these months. It is too early to jump to any serious conclusion, though.
The statsmodels module has a great tool for checking normality, the qqplot, but in this article we are statistical hackers willing to build customized tools to fit our very specific needs 💛.
The cumulative distribution function (CDF) is a way to visualize how a random variable is distributed by evaluating the probability of it having a certain value or less. The mathematical function to describe this is
F(x) = P(X <= x)
where F is the CDF and x is a value that the random variable X might assume.
The function above describes the true distribution of X, which we do not know. What we have is a finite sample of 3144 data points, specifically the monthly average temperatures from 1756 until 2017. From these observations we will derive an empirical distribution function (f) as
f(t) = (number of elements in the sample <= t)/n
where t is any value the random variable might assume and n is the number of data points.
Armed with these definitions, shall we code a function to compute the empirical CDF?
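A minimal sketch of such a function, following the definition of f(t) above: sort the observations, then pair the i-th smallest value with the fraction i/n. The name ecdf is my own choice.

```python
import numpy as np


def ecdf(data):
    """Return sorted values x and the empirical CDF y evaluated at them.

    For distinct observations, y[i] is the fraction of points <= x[i].
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    y = np.arange(1, n + 1) / n
    return x, y


x, y = ecdf([3.0, 1.0, 2.0, 4.0])
print(x)  # sorted observations
print(y)  # cumulative fractions, ending at 1.0
```

When several observations share the same value, each tied point gets its own step, so the value of f at that temperature is the highest y among the ties.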
It is time to get an empirical CDF for a month and plot it, January for instance.
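A sketch of the plotting step with Matplotlib. The January values here are synthetic stand-ins drawn at random (the location and spread are arbitrary), so the resulting curve only illustrates the mechanics and not the real Stockholm data.

```python
import matplotlib.pyplot as plt
import numpy as np


def ecdf(data):
    """Sorted values and their cumulative fractions."""
    x = np.sort(np.asarray(data, dtype=float))
    return x, np.arange(1, x.size + 1) / x.size


# Synthetic stand-in for the January monthly averages (arbitrary parameters).
rng = np.random.default_rng(0)
january = rng.normal(loc=-3.0, scale=3.0, size=262)

x, y = ecdf(january)

fig, ax = plt.subplots()
ax.plot(x, y, marker=".", linestyle="none")  # one dot per observation
ax.set_xlabel("Monthly average temperature (°C)")
ax.set_ylabel("ECDF")
ax.set_title("January")
# In a script, call plt.show() to display the figure.
```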
With this plot we can answer questions such as: what is the probability of a January in Stockholm with an average temperature below -2.5 C?
By inspecting the plot you can see that, over the more than 260 years of records, this probability is about 0.60, or 60%.
You can also compute the answer directly from the data instead of reading it off the plot.
Similarly, the empirical probability of having a January in Stockholm with an average between -5 C and -2.5 C is 31.3%.
The general formula to get the probability of a random variable lying in an interval by using a CDF is
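A sketch of that calculation: the empirical probability of a value at or below t is simply the fraction of observations that satisfy the condition. The eight temperatures below are invented, so the printed number is not the real January figure.

```python
import numpy as np


def empirical_probability(data, t):
    """Fraction of observations that are <= t, i.e. the ECDF evaluated at t."""
    data = np.asarray(data, dtype=float)
    return float(np.mean(data <= t))


# Invented January averages, for illustration only.
january = np.array([-6.1, -4.0, -3.2, -2.5, -1.0, 0.4, -5.5, 1.2])

p = empirical_probability(january, -2.5)
print(p)  # 5 of the 8 invented values are <= -2.5
```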
P(a < X <= b) = f(b) - f(a)
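The interval formula can be sketched in a few lines. The helper names are hypothetical and the data are invented, so the result illustrates the formula rather than reproducing the figure quoted above.

```python
import numpy as np


def ecdf_at(data, t):
    """Empirical CDF f(t): fraction of observations <= t."""
    return float(np.mean(np.asarray(data, dtype=float) <= t))


def interval_probability(data, a, b):
    """Empirical P(a < X <= b) = f(b) - f(a)."""
    return ecdf_at(data, b) - ecdf_at(data, a)


# Invented January averages, for illustration only.
january = np.array([-6.1, -4.0, -3.2, -2.5, -1.0, 0.4, -5.5, 1.2])

p_interval = interval_probability(january, -5.0, -2.5)
print(p_interval)  # f(-2.5) - f(-5.0) = 0.625 - 0.25
```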
Now that we know how to derive an empirical CDF, it is time to verify whether the process generating the Stockholm average temperatures follows a normal distribution by comparing the CDFs.
To do so we have to know how to draw samples from a normal distribution.
The NumPy package always has good tools to save us.
We can draw 10000 samples from a normal distribution with a mean of 10 units and a standard deviation of 1 unit.
With the samples in hand, one can derive the CDF.
It is no coincidence that the CDF value at 10 is approximately 50% (0.5): 10 is the mean of the distribution, and for a normal distribution the mean is also the median, so about half of the 10000 samples fall below it.
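A sketch of the sampling step using NumPy's Generator API. The seed is an arbitrary choice of mine, there only to make the run reproducible; the older np.random.normal(10, 1, 10000) call would serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility

# loc is the mean, scale is the standard deviation.
samples = rng.normal(loc=10.0, scale=1.0, size=10_000)

print(samples.mean(), samples.std())
```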
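To check this numerically, one can build the CDF of the samples and evaluate it at 10. This is a sketch with an arbitrary seed; the value comes out close to 0.5 but not exactly, because the samples are random.

```python
import numpy as np


def ecdf(data):
    """Sorted values and their cumulative fractions."""
    x = np.sort(np.asarray(data, dtype=float))
    return x, np.arange(1, x.size + 1) / x.size


rng = np.random.default_rng(42)  # arbitrary seed
samples = rng.normal(loc=10.0, scale=1.0, size=10_000)

x_norm, y_norm = ecdf(samples)

# CDF value at 10: the fraction of samples that are <= 10.
cdf_at_mean = float(np.mean(samples <= 10.0))
print(cdf_at_mean)
print(np.median(samples))
```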
Now we can verify if our monthly average temperatures are normally distributed by plotting the empirical CDF against the ones from the normal distribution samples.
To draw those samples, we first need the mean and the standard deviation of the observed temperatures for each month, since they become the parameters of that month's normal distribution.
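A sketch of both steps with Pandas and NumPy: group by month, aggregate mean and standard deviation, then draw 10000 normal samples per month with those parameters. The column name Month_Name and the six observations are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented monthly averages for two months.
monthly = pd.DataFrame(
    {
        "Month_Name": ["January"] * 3 + ["July"] * 3,
        "Monthly_Average_Temperature": [-4.0, -2.0, -3.0, 17.0, 18.0, 16.0],
    }
)

# One row per month with the sample mean and sample standard deviation.
stats = monthly.groupby("Month_Name")["Monthly_Average_Temperature"].agg(
    ["mean", "std"]
)

rng = np.random.default_rng(42)  # arbitrary seed
normal_samples = {
    month: rng.normal(loc=row["mean"], scale=row["std"], size=10_000)
    for month, row in stats.iterrows()
}
print(stats)
```

Note that the Pandas std aggregation uses the sample standard deviation (ddof=1), while np.std defaults to the population version (ddof=0). With hundreds of observations per month the difference is tiny.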
Time to plot the comparisons! ⌚️
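A sketch of the CDF comparison for a single month. The observed values are synthetic stand-ins with arbitrary parameters, so the agreement in this toy figure is guaranteed by construction; with the real data, the overlay is what tells you how well the normal model fits.

```python
import matplotlib.pyplot as plt
import numpy as np


def ecdf(data):
    """Sorted values and their cumulative fractions."""
    x = np.sort(np.asarray(data, dtype=float))
    return x, np.arange(1, x.size + 1) / x.size


rng = np.random.default_rng(1)  # arbitrary seed

# Synthetic stand-in for one month's observed averages.
observed = rng.normal(loc=-3.0, scale=3.0, size=262)

# Normal samples parameterised by the observed mean and standard deviation.
theoretical = rng.normal(
    loc=observed.mean(), scale=observed.std(ddof=1), size=10_000
)

x_obs, y_obs = ecdf(observed)
x_theo, y_theo = ecdf(theoretical)

fig, ax = plt.subplots()
ax.plot(x_theo, y_theo, label="Normal samples")
ax.plot(x_obs, y_obs, marker=".", linestyle="none", label="Observed")
ax.set_xlabel("Monthly average temperature (°C)")
ax.set_ylabel("CDF")
ax.legend()
# In a script, call plt.show() to display the figure.
```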
On the left side, the plot shows the CDFs derived from the data set compared with the ones we got from the 10000 normal distribution samples; on the right side, it shows the average temperature distributions themselves. One can see a good agreement between the empirical CDFs and the normal ones.
I am now much more confident about quoting probabilities in my forecasts to work colleagues at the next Fikas 👏
The normal distribution seems to explain reasonably well the monthly average temperatures in Stockholm.
In the next article we will dig deeper and write code to make inferences on this data by using a method named Bootstrap.
All the code is available in my GitHub.
Feel free to also reach out to me on LinkedIn. Thanks for reading!