Data Visualisations: Part III

Mankind invented a system to cope with the fact that we are so intrinsically lousy at manipulating numbers. It’s called the graph. - Charlie Munger

Introduction

In the third article of our data visualisation series with Python, we introduce line and box plots as powerful tools for representing quantitative values. Line plots are mostly used for time-series data, whereas box plots summarise the statistics of the data. We continue with the same dataset as in our previous articles.

Line Graphs

Line graphs are best suited to analysing quantitative data with respect to time, and they are mostly used to identify trends. The insights are readily apparent: an upward-sloping line indicates an increase in the numbers, and a downward-sloping line indicates a decrease.

Plotting single line plots

Plotting a single line graph in Python is as simple as calling one function, plot(), from the matplotlib library. As in our last article, we will use the historic data for Ireland's Local Electoral Areas (LEAs) with their 14-day incidence rates per 100k population. We will follow the steps below to plot a line graph:

  1. Read the data with the read_csv() function from pandas.
  2. Extract the months and years from the EventDate column of the data. This can be done with the DatetimeIndex() function from the pandas library, with the EventDate column as the argument.
  3. Filter the dataset to get the data for the month of April 2021.
  4. The dataset now contains only the rows for April 2021.
  5. Extract the data for 5th April 2021 and plot the line graph with the plot() function.

Comparative analysis of incidence rates across LEAs for 5th April, 2021
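
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the original notebook code: the rows are made-up sample values read from an in-memory string instead of the real CSV file, and the column name LEA is a hypothetical stand-in for whichever column holds the area names. Only EventDate and P14_100k are taken from the article.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows standing in for the real CSV file.
# "LEA" is a hypothetical column name; the values are illustrative only.
csv_text = """EventDate,LEA,P14_100k
2021-03-29,Tullamore,450.0
2021-04-05,Tullamore,410.0
2021-04-05,Balbriggan,380.5
2021-04-05,Sample LEA,120.3
2021-04-12,Tullamore,350.2
2021-05-03,Tullamore,90.0
"""

# Step 1: read the data (pass a file path instead of StringIO for a real file).
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: extract the month and year from the EventDate column.
dates = pd.DatetimeIndex(df["EventDate"])
df["month"] = dates.month
df["year"] = dates.year

# Steps 3 and 4: keep only the rows for April 2021.
april = df[(df["month"] == 4) & (df["year"] == 2021)]

# Step 5: select 5th April 2021 and draw one line across the LEAs.
day = april[pd.DatetimeIndex(april["EventDate"]).day == 5]
fig, ax = plt.subplots()
ax.plot(day["LEA"], day["P14_100k"], marker="o")
ax.set_xlabel("LEA")
ax.set_ylabel("14-day incidence rate per 100k")
ax.set_title("Incidence rates across LEAs, 5th April 2021")
# Call plt.show() here when running interactively.
```

With real data there are many more LEAs on the x-axis, so rotating the tick labels (for example with ax.tick_params(axis="x", rotation=90)) usually helps readability.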

Plotting multiple line plots

Multiple line plots show several data series as separate lines in one graph, so that each line can be compared with the others. With a large number of lines the graph risks looking cluttered, so it is best to plot at most three or four lines.

To plot multiple lines in a single graph, we follow the same steps as in the approach above and repeat the final plotting step once per date. We will plot the incidence rates for three weeks in April 2021. In the example below, we plot the incidence rates for 5th April, 12th April and 19th April 2021 across LEAs. By default, matplotlib assigns a different colour to each line; alternatively, we can set the colour ourselves with the color argument of plot().

Comparative analysis of incidence rates across LEA for the month of April, 2021
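
A minimal sketch of such a multiple line plot is given below. The data frame is built from made-up values, and the column name LEA is again a hypothetical placeholder. Each date becomes one line, and matplotlib picks the colours automatically.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; "LEA" is a hypothetical column name.
df = pd.DataFrame({
    "EventDate": ["2021-04-05"] * 3 + ["2021-04-12"] * 3 + ["2021-04-19"] * 3,
    "LEA": ["Tullamore", "Balbriggan", "Sample LEA"] * 3,
    "P14_100k": [410.0, 380.5, 120.3, 350.2, 300.1, 100.0, 280.4, 220.6, 80.9],
})
df["EventDate"] = pd.to_datetime(df["EventDate"])

fig, ax = plt.subplots()
# One line per date; pass color="..." to plot() to override the default colours.
for date, group in df.groupby("EventDate"):
    ax.plot(group["LEA"], group["P14_100k"], marker="o",
            label=date.strftime("%Y-%m-%d"))
ax.set_ylabel("14-day incidence rate per 100k")
ax.legend(title="Date")
# Call plt.show() here when running interactively.
```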

Plots with area under curve

Filling the area under the curve is another way of presenting a line graph. These graphs are useful for visualising data when there is a significant difference between the values of two line graphs.

To plot a line graph with the area under the curve filled, matplotlib provides the fill_between() function. Its main arguments are x, y1 and y2, and it fills the area between the curves (x, y1) and (x, y2). In the graph below, we observe a significant decrease in the 14-day incidence rate from 5th April to 19th April 2021.

Comparative analysis of incidence rates across LEA for the month of April, 2021 with area plot
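
As a rough sketch, the area between two dates can be shaded as shown below. The rates are made-up values, and the third area name is a hypothetical placeholder.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up rates for illustration; "Sample LEA" is a hypothetical name.
leas = ["Tullamore", "Balbriggan", "Sample LEA"]
x = np.arange(len(leas))
rate_5_april = np.array([410.0, 380.5, 120.3])
rate_19_april = np.array([280.4, 220.6, 80.9])

fig, ax = plt.subplots()
ax.plot(x, rate_5_april, label="5th April 2021")
ax.plot(x, rate_19_april, label="19th April 2021")
# Shade the region between the two curves (x, y1) and (x, y2).
ax.fill_between(x, rate_5_april, rate_19_april, alpha=0.3)
ax.set_xticks(x)
ax.set_xticklabels(leas)
ax.set_ylabel("14-day incidence rate per 100k")
ax.legend()
# Call plt.show() here when running interactively.
```

If y2 is omitted, fill_between() fills down to zero, which gives the classic area-under-one-curve plot.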

Box Plots

Box plots are primarily used for exploratory data analysis. They are a powerful way of reading off summary values such as the median, the quartiles, and the minimum and maximum of the data. They help in understanding the shape of the data through its distribution: one can easily see how widely the values spread around the central value. They are also a powerful mechanism for detecting outliers in the data. Box plots are best used when we want to show comparisons among multiple groups.

A box plot is drawn with the boxplot() function in pandas, passing the column to plot as the argument.

Comparative analysis of incidence rates across LEA for the month of April, 2021 with boxplots

From the graph, we can infer that some of the LEAs had an incidence rate below the mean, while most of the LEAs were above the mean. However, two LEAs, Tullamore and Balbriggan, had an incidence rate beyond the upper whisker of the box plot, which marks them clearly as outliers.
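
The outlier reading can also be checked numerically. The sketch below uses made-up rates (only the names Tullamore and Balbriggan come from the article; the other area names and the LEA column are hypothetical). It draws the box plot with pandas and then applies the 1.5 × IQR rule, which is matplotlib's default for placing the whiskers.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rates for illustration; "LEA" and the "LEA n" names are hypothetical.
df = pd.DataFrame({
    "LEA": ["Tullamore", "Balbriggan", "LEA 1", "LEA 2",
            "LEA 3", "LEA 4", "LEA 5", "LEA 6"],
    "P14_100k": [400.0, 390.0, 100.0, 110.0, 120.0, 130.0, 140.0, 150.0],
})

fig, ax = plt.subplots()
df.boxplot(column="P14_100k", ax=ax)

# Points further than 1.5 * IQR from the box are drawn as outliers.
q1 = df["P14_100k"].quantile(0.25)
q3 = df["P14_100k"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = df[(df["P14_100k"] < lower_fence) | (df["P14_100k"] > upper_fence)]
print(outliers["LEA"].tolist())
# Call plt.show() here when running interactively.
```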

Box plots come in several flavours, as shown below.

Notch Boxplots

Notched box plots are useful for depicting the confidence interval around the median. If the notches of two boxes do not overlap, this is interpreted as strong evidence, at roughly the 95% confidence level, that their medians differ.

Comparative analysis of incidence rates across LEA for the month of April, 2021 with notch boxplots

A notched box plot is drawn with the same boxplot() function, simply by passing the argument notch=True.
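
A minimal sketch of a notched box plot per date is shown below. It assumes a data frame with date and P14_100k columns; the values are randomly generated for illustration and are not the real incidence rates.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic rates for two dates; randomly generated, not real data.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "date": ["2021-04-05"] * 40 + ["2021-04-19"] * 40,
    "P14_100k": np.concatenate([rng.normal(300, 40, 40),
                                rng.normal(200, 40, 40)]),
})

fig, ax = plt.subplots()
# One notched box per date; notch=True is passed through to matplotlib.
df.boxplot(column="P14_100k", by="date", notch=True, ax=ax)
ax.set_ylabel("14-day incidence rate per 100k")
# Call plt.show() here when running interactively.
```

With very few observations per group the notches can extend past the box edges, so this variant works best when each group has a reasonable number of values.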

Violin Plots

Violin plots are similar to traditional box plots in that they display the same summary statistics, but they additionally depict the shape of the data distribution.

We can draw a violin plot with the seaborn library as follows:

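import seaborn as sns
# df1 is assumed to hold the filtered rows with 'date' and 'P14_100k' columns
ax = sns.violinplot(x="date", y="P14_100k", data=df1)  # distribution of rates per date
ax = sns.stripplot(x="date", y="P14_100k", data=df1)  # overlay the individual observations

Violin plot for incidence rate across LEAs for 5th April, 2021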

Takeaways

This article introduced line and box plots as powerful means of gaining insights from numerical data. We discussed different variants of both plot types and highlighted the context in which each is useful. The code for the visualisations can be found here.

Do you have any questions?

Kindly ask your questions via email or comments and we will be happy to answer :)

Insights on Modern Computation
Perspectives on data science

A communal initiative by Meghana Kshirsagar (BDS | Lero | UL, Ireland) and Gauri Vaidya (Intern | BDS). Each concept is accompanied by sample datasets and Python code.