Statistical analysis using F# and Jupyter notebooks
Some time ago I took the Coursera “Data Science” specialization. Data science is not something I get paid for professionally. It’s just a hobby, so bear with me if I make some mistakes or use the wrong vocabulary.
This post is part of the F# Advent Calendar 2017 initiative; make sure to check out the rest of the posts as well.
One of the homework assignments I had to do for the specialization was a comparison of the exponential distribution with the Central Limit Theorem. At the time, I did the analysis in R, which was used throughout the specialization for an obvious reason: R is very convenient for statistics and data analysis, as it was designed for that purpose. Its many built-in statistical functions and its great support for plotting with ggplot2 keep the analysis focused on the domain problem rather than on fighting with the language, tools and libraries. This was exactly the problem I faced at the time when I tried to do the same analysis in F#. The language itself was great, but the surrounding tooling and integration with external libraries bore no comparison with R at the time.
More than two years have passed. In the meantime, F# tooling for data analysis has improved a lot. Jupyter-based Azure Notebooks now support F#, so I decided to give it another try and redo my analysis. If you are interested in trying it yourself, go to Azure Notebooks at https://notebooks.azure.com. Let’s start the analysis.
Comparison of exponential distribution and Central Limit Theorem with F#
The goal of this article is to investigate the exponential distribution in F# and to compare it with the Central Limit Theorem. The exponential distribution can be simulated in F# with MathNet.Numerics using Exponential.Samples(new Random(randomSeed), lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. I set lambda = 0.2 for all of the simulations. I will investigate the distribution of averages of 40 exponentials, which requires a thousand simulations.
This post illustrates the properties of the distribution of the mean of 40 exponentials. It will:
- Show the sample mean and compare it to the theoretical mean of the distribution
- Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution
- Show that the distribution is approximately normal
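For reference, the theoretical values behind the three points above follow from the properties already stated (mean and standard deviation both equal to 1/lambda, with lambda = 0.2 and 40 observations per average). This is the standard derivation, assuming independent draws; the symbols \bar{X}_n, \mu and \sigma are notation introduced here for illustration:

```latex
\begin{align*}
\mu &= \frac{1}{\lambda} = \frac{1}{0.2} = 5, \qquad \sigma = \frac{1}{\lambda} = 5,\\
\mathbb{E}[\bar{X}_n] &= \mu = 5,\\
\operatorname{Var}(\bar{X}_n) &= \frac{\sigma^2}{n} = \frac{25}{40} = 0.625,\\
\bar{X}_n &\overset{\text{approx.}}{\sim} \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right)
\quad \text{for large } n \text{ (Central Limit Theorem).}
\end{align*}
```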
Analysis
Let’s explore several aspects of the exponential distribution. First of all, we need to load some libraries into the notebook.
What we need is a library for statistical analysis. The best one out there for the .NET platform is MathNet.Numerics. We also need something for plotting. XPlot is the most widely used F# data visualisation package. Azure Notebooks also supports the Paket dependency manager, which is used for referencing and managing all the needed dependencies. You have to #load the F# scripts for Paket and XPlot to make the dependencies visible in the notebook, but I didn’t dig deeper to check what they do under the hood.
Sample mean vs theoretical mean
The first step is to explore the sample mean and compare it to the theoretical mean. To have enough data, we need to run 1000 simulations, each containing 40 observations, so that we can take the mean of each simulation. We can achieve this by evaluating Exponential.Samples(new Random(randomSeed), lambda) |> Seq.take (n * nbsim) |> List.ofSeq and arranging the result in a 1000x40 matrix:
I’m using the MathNet.Numerics Exponential.Samples(...) method here to generate 1000 samples of 40 observations each. MathNet.Numerics provides many additional classes and functions for statistical analysis. Here I’m using the Matrix and Vector types to arrange my samples in a 1000x40 matrix. Note that I’m setting randomSeed = 1111. It’s just a constant value that guarantees the same samples are generated on every run; you can change it to another value if you want. The key is that it should remain constant so other people can reproduce your experiment.
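The sampling step described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's original cell: it assumes a plain F# script run with dotnet fsi, where the package is referenced with #r "nuget: ..." instead of Paket, and the names samples, m and means are chosen here for illustration.

```fsharp
// Assumption: F# script with NuGet references (dotnet fsi);
// the notebook itself loads its packages through Paket instead.
#r "nuget: MathNet.Numerics"

open System
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra

let lambda = 0.2        // rate parameter of the exponential distribution
let n = 40              // observations per simulation
let nbsim = 1000        // number of simulations
let randomSeed = 1111   // fixed seed so the run is reproducible

// Exponential.Samples yields an endless stream; keep the first n * nbsim values.
let samples =
    Exponential.Samples(Random(randomSeed), lambda)
    |> Seq.take (n * nbsim)
    |> Array.ofSeq

// Arrange the values in an nbsim x n matrix: row i holds simulation i.
let m = Matrix<double>.Build.Dense(nbsim, n, fun i j -> samples.[i * n + j])

// One average per simulation (row sum divided by the row length).
let means = m.RowSums().ToArray() |> Array.map (fun s -> s / float n)

printfn "mean of the %d averages: %f" means.Length (Array.average means)
printfn "theoretical mean: %f" (1.0 / lambda)
```

The exact numbers printed depend on the random number generator of the runtime, so they may differ from the values quoted in this post.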
The next step is to check the mean of the 1000 averages of the exponential distribution:
Let’s compare it to the theoretical mean of the exponential distribution:
The simulated mean of the exponential distribution, 4.997430667, is very close to the theoretical mean of 5.
The next step is to look at the distribution of the simulated means. I’m going to use XPlot.Plotly to draw a histogram of the distribution with a line marking the theoretical mean. Here is the F# code to draw the chart:
The code is quite straightforward and easy to understand. Note that XPlot will draw two overlapping charts: one for the histogram and another for the mean line. This is very handy because each chart can be constructed independently and combined in a list just before calling the Chart.Plot function. All of the plots are documented on the XPlot page with many examples. This is really a great resource to learn from.
Let’s look at our plotted histogram:
What do you think? It’s rather cool. What’s more, it’s interactive. You can hover with your mouse to show some interesting values:
Sample variance vs theoretical variance
The next step is to compare the variance present in the sample means of the 1000 simulations to the theoretical variance of the population.
The sample variance is:
And the theoretical variance is:
The theoretical and sample variances are quite similar.
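The two quantities being compared can be sketched as below. This is a dependency-free illustration, not the notebook's original code (which relies on MathNet.Numerics): sampleVariance and theoreticalVariance are hypothetical helper names, the unbiased estimator (dividing by count - 1) is an assumption, and exampleMeans is a made-up stand-in for the 1000 simulated means.

```fsharp
// Unbiased sample variance, written out by hand (divides by count - 1).
let sampleVariance (xs: float[]) =
    let mean = Array.average xs
    let sumSq = xs |> Array.sumBy (fun x -> (x - mean) ** 2.0)
    sumSq / float (xs.Length - 1)

// Theoretical variance of the mean of n exponentials: (1/lambda)^2 / n.
let theoreticalVariance (lambda: float) (n: int) =
    (1.0 / lambda) ** 2.0 / float n

// Hypothetical stand-in data; in the analysis this would be the simulated means.
let exampleMeans = [| 4.2; 5.1; 5.9; 4.8; 5.0 |]

printfn "sample variance: %f" (sampleVariance exampleMeans)
printfn "theoretical variance: %f" (theoreticalVariance 0.2 40)
```

With lambda = 0.2 and 40 observations per average, the theoretical value works out to 25/40 = 0.625.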
Now we need to compute two normal distributions for the means: one with mean equal to the sample mean and one with mean equal to the theoretical mean. Their standard deviations are the sqrt of the sample variance and of the theoretical variance, respectively. We can then compare how similar they are.
Again, for that I’m using the Normal.PDF function from the MathNet.Numerics package. Let’s plot those distributions along with the previous histogram so we can visually check that they are almost the same:
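As a rough sketch of how such a curve can be evaluated before handing it to the plotting library, the snippet below computes the theoretical normal density on a grid of x values. It assumes a plain F# script with a NuGet reference; the grid from 2 to 8 and the names xs, density and theoreticalCurve are illustrative choices, not taken from the notebook.

```fsharp
// Assumption: F# script with NuGet references (dotnet fsi).
#r "nuget: MathNet.Numerics"

open MathNet.Numerics.Distributions

let lambda = 0.2
let n = 40
let theoreticalMean = 1.0 / lambda
let theoreticalSd = sqrt ((1.0 / lambda) ** 2.0 / float n)

// Grid of 61 x values from 2.0 to 8.0 in steps of 0.1 (assumed plotting range).
let xs = Array.init 61 (fun i -> 2.0 + 0.1 * float i)

// Density of a normal distribution with the given mean and sd at each grid point.
let density (mean: float) (sd: float) =
    xs |> Array.map (fun x -> Normal.PDF(mean, sd, x))

let theoreticalCurve = density theoreticalMean theoreticalSd
```

The same density function can be called with the sample mean and the sqrt of the sample variance to obtain the second curve. If the histogram shows raw counts rather than densities, the curves need to be rescaled (by the number of means times the bin width) before they line up with the bars.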
And here is our plot:
Conclusion of the analysis
The analysis showed that the average over 1000 simulations of 40 observations from the exponential distribution is very close to the theoretical mean of the distribution. The same holds for the variance, which is also very close. The distribution of the means is also approximately normal. I could have shown that on a Q-Q plot, but unfortunately XPlot (or probably the underlying Plotly API) doesn’t support it.
Conclusion about F# and Azure Notebooks
I was pleasantly surprised by how good the integration of F# is inside Azure Notebooks. While writing, you get live type checking, IntelliSense and tooltips, which is what you would expect when working with F#. The plots look really good too and are very easy to generate. This was a simple analysis with just MathNet.Numerics and XPlot as dependencies, but I’m sure it would also work in more complex scenarios. Azure Notebooks works very well. On the other hand, I had a hard time trying to make F# Jupyter notebooks work on my Mac with my local Anaconda environment. I hope that with the recent release of .NET Core 2.0 things will change in the long run and there will be no difference between running .NET code on Mac and on Windows. For now, enjoy Azure Notebooks and F#, and don’t hesitate to give them a try in your free time.
References
Here are some references you might find useful:
- F# notebook for this analysis: https://notebooks.azure.com/tjaskula/libraries/simple
- The same analysis I’ve done before using R: https://github.com/tjaskula/Coursera/blob/master/Specializations/Data%20Science/Statistical%20Inference/CompareExpDistribWithClt.pdf
- Azure Notebooks: https://notebooks.azure.com/
- MathNet.Numerics: https://numerics.mathdotnet.com/
- XPlot: http://tahahachana.github.io/XPlot/
- F# Foundation: http://fsharp.org/