Where do Data Scientists begin when analyzing data?

Learn to understand and interpret the shape of your data

Utsav Chatterjee
The Startup
6 min read · Apr 25, 2020

In my quest to learn more about analytics over the past seven years, I have learnt to appreciate the basic concepts of statistics that help analysts and data scientists take the first steps towards understanding what they’re working with. Until a few years ago, I didn’t quite understand how data scientists know where to start looking for trends and inferences in data. Is there a checklist that they follow? Or is it intuition, and they just know what to look for?

Understanding the data we are working with usually begins with finding out what the center of the data looks like and how ‘spread out’ the data points are. Measures of central tendency such as the mean, median and mode give us an idea of the middle values in the data, but they are not sufficient to give us an overall picture of the entire dataset. Measures of spread such as the range, interquartile range, standard deviation and variance tell us how ‘spread out’ the data is. However, it is only when we understand the shape and visualize the distribution of our data that we get a much clearer picture of what the entire dataset looks like. These are some of the very first steps that data scientists and analysts take to understand their data.
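As a minimal sketch of these measures, the snippet below computes each of them with Python’s standard `statistics` module. The sample values are made up for illustration, and the quartiles use the module’s "inclusive" method, which is only one of several common conventions.

```python
import statistics

# Hypothetical sample, chosen only so the results are easy to check by hand.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of center
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)

# Measures of spread
value_range = max(sample) - min(sample)
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1                             # interquartile range
variance = statistics.pvariance(sample)   # population variance
std_dev = statistics.pstdev(sample)       # population standard deviation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, IQR={iqr}, variance={variance}, std dev={std_dev}")
```

Even on this tiny sample the three measures of center disagree slightly (5, 4.5 and 4), which is a first hint that the shape of the data matters.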

The focus of this article is to:

  • Explain how easily anyone can interpret quantitative data and spot outliers using simple visualizations.
  • Illustrate examples with Python code.

Shape

Histograms are undoubtedly the most popular visual for quantitative data. Consider the following histograms.

Histograms that have shorter bars on the left and taller bars on the right are said to be left-skewed.

Histograms that have shorter bars on the right and taller bars on the left are said to be right-skewed.

Any distribution where you can draw a line down the middle, and the right side mirrors the left side, is considered symmetric. One of the most common symmetrically shaped distributions is the normal distribution. It is also called the bell curve.

The shape of the distribution can tell us a lot about the measures of center and measures of spread. In symmetric distributions like the bell curve, the mean is typically equal to the median, which in turn is equal to the mode (the tallest bar in our histogram).

Mean = Median = Mode

Let us see how this can be illustrated in Python using the matplotlib library:

If we were to run the above function using data below:

The output would look as follows:
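A minimal sketch of what such a function might look like is given here; it is an illustrative reconstruction, not the original listing. The function name `plot_histogram`, the sample values and the bin count are all assumptions. matplotlib is imported inside the function so that the summary statistics can be checked even where the library is not installed.

```python
import statistics


def plot_histogram(values, bins=5, title="Histogram"):
    """Draw a histogram and mark the mean and median (hypothetical helper)."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins, edgecolor="black")
    ax.axvline(statistics.mean(values), color="red", linestyle="--", label="mean")
    ax.axvline(statistics.median(values), color="green", linestyle=":", label="median")
    ax.set_xlabel("value")
    ax.set_ylabel("frequency")
    ax.set_title(title)
    ax.legend()
    return fig


# Made-up, perfectly symmetric data: the counts rise to a peak at 3 and fall again.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

mean = statistics.mean(symmetric)
median = statistics.median(symmetric)
mode = statistics.mode(symmetric)
print(f"mean={mean}, median={median}, mode={mode}")
```

For this sample the mean, median and mode all equal 3. To see the picture, call `plot_histogram(symmetric)` and then `matplotlib.pyplot.show()`; the two vertical lines should sit on top of each other at the tallest bar.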

When we have skewed distributions, the mean is pulled towards the tail of the distribution, while the median stays closer to the mode.
For example, in a right-skewed distribution, the mean will be pulled higher than the median. Some examples of right-skewed distributions are the amount of alcohol left in your bloodstream over time, human athletic abilities with age, etc.

Alternatively, in a left-skewed distribution, the mean will be pulled lower than the median. Examples of such distributions include the GPA of students in a class, age at death, etc.
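The pull of the tail on the mean can be checked with a few lines of Python. The sketch below uses made-up data and a hypothetical helper, `skew_direction`, that simply compares the mean with the median. This comparison is a rule of thumb rather than a formal test of skewness, and it can mislead on unusual distributions.

```python
import statistics


def skew_direction(values):
    """Rough rule of thumb: compare the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right-skewed"
    if mean < median:
        return "left-skewed"
    return "roughly symmetric"


# Made-up data with a long tail to the right.
right_tail = [1, 1, 1, 2, 2, 3, 4, 6, 10, 20]
# Mirror image of the same data, so the long tail is on the left.
left_tail = [21 - value for value in right_tail]

for name, values in [("right_tail", right_tail), ("left_tail", left_tail)]:
    print(name, statistics.mean(values), statistics.median(values), skew_direction(values))
```

In the right-tailed sample the mean (5) sits well above the median (2.5); in its mirror image the mean (16) sits below the median (18.5).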

Typically, if something results from a lot of small influences that are not too correlated with each other, you get a Normal Distribution. Height, for example, is controlled by a lot of genes, nutrition and other factors that work more or less independently.
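This idea can be simulated. In the sketch below, each value is the sum of many small, independent random nudges; the number of influences, the sample size and the uniform range are arbitrary assumptions chosen for illustration. If the result is close to a bell curve, the mean and median should nearly coincide, and roughly 68% of the values should fall within one standard deviation of the mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

N_INFLUENCES = 50   # assumed number of small independent factors
N_SAMPLES = 5000    # assumed number of simulated individuals

# Each simulated value is the sum of many small, independent nudges.
totals = [
    sum(rng.uniform(-1, 1) for _ in range(N_INFLUENCES))
    for _ in range(N_SAMPLES)
]

mean = statistics.mean(totals)
median = statistics.median(totals)
std_dev = statistics.pstdev(totals)

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
within_one_sd = sum(abs(t - mean) <= std_dev for t in totals) / N_SAMPLES

print(f"mean={mean:.3f}, median={median:.3f}")
print(f"share within one std dev={within_one_sd:.3f}")
```

No single nudge is normally distributed here (each one is uniform), yet their sum takes on the familiar bell shape.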

Outliers

The final aspect used to describe and understand quantitative variables is outliers. These are data points that lie very far away from the rest of the values in our data set. There are several methods to determine how far is “very far”, but to begin with, you could just look at a histogram and see whether any value sits far away from the rest of the data points. A quick plot of your data can often help you understand a lot in a short amount of time.

Let us consider an example where we look at the revenue generated by a few companies that are headquartered in Vancouver (values are in millions of dollars).

45, 68, 92, 53, 105, 56, 23, 15, 155, 500

A histogram for this data would look as follows:
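One quick way to see this shape without any plotting library is a text histogram. The sketch below groups the revenue figures into bins and prints one `#` per company; the bin width of 100 is an assumption made for illustration.

```python
from collections import Counter

# Revenue in millions of dollars, copied from the example above.
revenue = [45, 68, 92, 53, 105, 56, 23, 15, 155, 500]
BIN_WIDTH = 100  # assumed bin width, in millions of dollars

# Map each value to the lower edge of its bin, e.g. 155 -> 100.
counts = Counter((value // BIN_WIDTH) * BIN_WIDTH for value in revenue)

lines = []
for lower in range(0, max(revenue) + BIN_WIDTH, BIN_WIDTH):
    bar = "#" * counts[lower]  # a missing bin counts as zero
    lines.append(f"{lower:>3}-{lower + BIN_WIDTH - 1:<3} | {bar}")

print("\n".join(lines))
```

Seven companies land in the 0-99 bin and two in 100-199, followed by three empty bins and a lone bar at 500-599. That isolated bar is the visual signature of an outlier.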

The mean (which is a measure of center) = 111.2

This mean is not representative of the data as a whole: eight of the ten values fall below it, and most are nowhere near it. A better measure of the center, in this case, would be the median.

Median = 62

This is a better indication of the kind of revenue that is likely generated by companies that are headquartered in Vancouver.

Let us look at the standard deviation:

Standard Deviation (population) = 135.35
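These figures can be reproduced with Python’s standard `statistics` module. In the sketch below, the helper name `summarize` is an assumption, and the population standard deviation (`pstdev`) is used because that is the variant that matches 135.35. The second call leaves out the largest value to show how much it alone moves each number.

```python
import statistics

# Revenue in millions of dollars, copied from the example above.
revenue = [45, 68, 92, 53, 105, 56, 23, 15, 155, 500]


def summarize(values):
    """Return (mean, median, population standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.pstdev(values),
    )


mean, median, std_dev = summarize(revenue)
print(f"all companies: mean={mean:.1f}, median={median:.1f}, std dev={std_dev:.2f}")

# Same summary with the single largest value left out, to show its pull.
without_largest = sorted(revenue)[:-1]
mean_wo, median_wo, std_wo = summarize(without_largest)
print(f"without 500:   mean={mean_wo:.1f}, median={median_wo:.1f}, std dev={std_wo:.2f}")
```

Dropping the one large value moves the mean from 111.2 to 68 and the standard deviation from about 135 to about 41, while the median only shifts from 62 to 56.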

Taken at face value, this suggests that the data is really spread out, which is not true of most of the values. Both the mean and the standard deviation are heavily influenced by the one data point that lies really far from the rest.
In such cases, reporting the Five Number Summary (minimum, first quartile, median, third quartile and maximum) is often a better indication than measures like mean and standard deviation.
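A minimal sketch of the five-number summary for the revenue data follows. Two choices here are assumptions rather than anything prescribed above: the quartiles use the "inclusive" interpolation method of `statistics.quantiles` (other conventions give slightly different quartiles), and outliers are flagged with the common 1.5 × IQR rule of thumb, which is just one of several ways to decide how far is “very far”.

```python
import statistics

# Revenue in millions of dollars, copied from the example above.
revenue = [45, 68, 92, 53, 105, 56, 23, 15, 155, 500]

# Quartiles via linear interpolation between data points ("inclusive" method).
q1, median, q3 = statistics.quantiles(revenue, n=4, method="inclusive")
five_number_summary = (min(revenue), q1, median, q3, max(revenue))

# Rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in revenue if v < lower_fence or v > upper_fence]

print("five-number summary:", five_number_summary)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Under these assumptions the middle half of the companies falls between 47 and 101.75, and only the 500 value lies beyond the upper fence of 183.875. The summary describes the bulk of the data without being distorted by that single point.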

Summary

  • While working with data, you can always build a quick plot to take a look at the shape. Distributions can be symmetric or skewed (left or right).
  • The shape of the distribution can tell us a lot about the measures of center and measures of spread.
  • When we have skewed distributions, the mean is pulled towards the tail of the distribution, while the median stays closer to the mode.
  • Outliers are data points that lie very far away from the rest of the values in our data set.
  • In cases where the mean and standard deviation are highly influenced by outliers, it might be better to report the Five Number Summary values to represent the nature, spread and shape of the dataset.

Utsav Chatterjee
The Startup

Data Engineering Manager @ Tally | Vancouver, Canada 🇨🇦