The basics of plotting with Python

Thomas Gamsjäger
May 11, 2018


In almost any kind of data analysis, it is of paramount importance to look at your data in graphical form. The sheer numbers you get from descriptive statistical analyses can sometimes be manifestly misleading. Anscombe’s quartet springs to mind. So let’s dive right in.
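To make the point about Anscombe’s quartet concrete, here is a minimal sketch that hard-codes the four datasets as published by Anscombe (1973) and prints a few summary statistics. The variable names (`quartet`, `summary`) are chosen here for illustration only.

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y
    summary[name] = (x.mean(), y.mean(), r)
    print(f"{name:>3}: mean(x)={x.mean():.2f}  mean(y)={y.mean():.2f}  r={r:.3f}")
```

The means and correlations come out nearly identical for all four datasets, yet a scatter plot of each looks entirely different, which is exactly why plotting matters.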

First, we need some data. For the sake of simplicity and convenience, we create a dataset of, say, 10,000 normally distributed random numbers. (More on generating random numbers and distributions will be left for another article.)

Python proper is not terribly good at handling vectors or arrays, and its built-in random module draws normally distributed numbers one at a time rather than as whole arrays. Therefore, we call in some assistance by way of the numpy library.

import numpy as np

Now for the random numbers. The randn function needs only one argument here, the number of values to draw (10,000 in our case), and pulls them from the standard normal distribution (i.e. with a mean of 0 and a standard deviation of 1).

sample1 = np.random.randn(10000)
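Because the numbers are random, every run produces a different sample. If reproducible results are wanted, one option is numpy's newer Generator interface (available from NumPy 1.17 on) with a fixed seed. This is only a sketch: the seed value 42 and the names `rng` and `sample_seeded` are arbitrary choices made here, not part of the original walkthrough.

```python
import numpy as np

# A seeded generator yields the same "random" numbers on every run.
rng = np.random.default_rng(seed=42)
sample_seeded = rng.standard_normal(10_000)  # standard normal: mean 0, sd 1

# A second generator with the same seed reproduces the sample exactly.
sample_again = np.random.default_rng(seed=42).standard_normal(10_000)
print(np.array_equal(sample_seeded, sample_again))
```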

To make sure that we are on the right track, first some numerical statistics:

print('mean:', np.mean(sample1))
print('standard deviation:', np.std(sample1))

With the results (yours will differ slightly, since the numbers are random):

mean: 0.010788647575314076
standard deviation: 1.0063352118906288
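How close to 0 should the mean be? A rough yardstick is the standard error of the mean, sigma / sqrt(n). The following sketch plugs in sigma = 1 (the theoretical value, an assumption made here) and the mean reported above; the names `standard_error` and `observed_mean` are illustrative.

```python
import numpy as np

n = 10_000
sigma = 1.0  # theoretical standard deviation of the standard normal
standard_error = sigma / np.sqrt(n)

observed_mean = 0.010788647575314076  # the value printed above
z = observed_mean / standard_error    # distance from 0 in standard errors

print("standard error:", standard_error)
print("observed mean in standard errors:", z)
```

The observed mean lies roughly one standard error away from 0, which is entirely unremarkable for a random sample. Note also that np.std divides by n by default (ddof=0); with 10,000 values the difference from the sample estimate (ddof=1) is negligible.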

Not too bad for a start. But now for the real thing: plotting. Again, Python itself has other strengths, but there is yet another handy library by the rather apt name of matplotlib.

import matplotlib.pyplot as plt

The only thing missing is the plotting command itself (in a plain script, follow it with plt.show() to display the figure):

plt.hist(sample1)

Voilà! But… doesn’t this plot appear a little crude given the fact that we have thrown in no fewer than 10,000 numbers? I would say the answer is yes: by default, the data are lumped into just 10 bins. But help is on its way. Its name: Seaborn.
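Before switching libraries, it is worth noting that matplotlib itself can do better once it is told to use more bins. A minimal sketch, in which the choice of 50 bins and the axis labels are arbitrary assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

sample1 = np.random.randn(10000)

fig, ax = plt.subplots()
# hist returns the bin counts, the bin edges and the drawn bar patches.
counts, bin_edges, _ = ax.hist(sample1, bins=50)
ax.set_xlabel("value")
ax.set_ylabel("count")
# In a plain script, call plt.show() here to display the figure.
```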

Seaborn is a much more modern Python visualization library (based, actually, on matplotlib), which generates somewhat fancier plots out of the box than the library it builds on.

import seaborn as sns

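At the time of writing, the standard seaborn command for a histogram is distplot(), which by default also overlays a kernel density estimate (KDE). Whatever that may be, let’s just try it. (Note that seaborn 0.11 and later deprecate distplot() in favor of histplot() and displot().)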

sns.distplot(sample1)

Much better. Very neat and sciencey looking compared to matplotlib of old.

And without this mysterious KDE (but if you are interested, here is more):

sns.distplot(sample1, kde=False)

So far, we have always directed Python to generate a single plot. It would not be too far-fetched to assume that several plotting commands in one Python script yield several plots. Alas, it does not work this way: successive commands draw into the same figure, one on top of the other. A little more effort is needed, as the script below shows. With multiple plots, titles are in order, so we add these as well. For overall clarity, the script also repeats the commands for importing the necessary libraries and for generating the random numbers:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sample1 = np.random.randn(10000)

plt.figure(1)
sns.distplot(sample1, kde=False)
plt.title('Seaborn histogram')

plt.figure(2)
sns.distplot(sample1)
plt.title('Seaborn histogram with KDE')

plt.show()

plt.figure() provides the containers for the different plots, and plt.show() displays them all at the end.
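An alternative to separate figures is to place several plots side by side in a single figure with plt.subplots(). The sketch below uses plain matplotlib histograms; the bin counts, titles and figure size are arbitrary choices made here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

sample1 = np.random.randn(10000)

# One figure containing a 1 x 2 grid of axes.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

ax_left.hist(sample1, bins=10)
ax_left.set_title("10 bins")

ax_right.hist(sample1, bins=50)
ax_right.set_title("50 bins")

fig.tight_layout()  # keep titles and tick labels from overlapping
# In a plain script, call plt.show() here to display the figure.
```

Working with explicit axes objects like this makes it unambiguous which plot each command draws into.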

With that, we can conclude our first foray into the vast lands of data visualization with Python. Happy coding!

Featured image: Gushes of water spilling over the edge of the dam at the hydropower plant on the Danube river near Altenwörth, Austria.

Originally published at antreith.wordpress.com on May 11, 2018.
