Visualizing Data

A quick fix

Kalpa Vrikshika
5 min read · Jun 8, 2018

‘Knowledge is the art of piling of facts; wisdom lies in their simplification’

On finishing the bigger part of the Data Science course, “Visualizing Data”, I realized I needed to brush through it once more. On that note, I have tried my best to simplify the first few lessons in the hope of a better visualization and a short summary.

This is for the Udacity Data Science scholars who need a quick summary on the measures of center.

Before the number game begins, let’s brush up and understand the elements of visualizing data.

The distribution of numerical data is often best visualized through a histogram.

Histogram
How a histogram looks.

You might say ‘Hey, that looks like a bar graph’, but a histogram is NOT a bar graph, for the several reasons below:

Histogram v/s bar graphs

Differences

The bin size is critical for the histogram: when the bin size is too small (the range covered by each bin is too narrow), the underlying pattern will not be visible, and when it’s too big (the range covered by each bin is too wide), the required details cannot be seen. The bin size should be chosen with care.
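To make the effect of bin size concrete, here is a minimal Python sketch. The list of ages is made up for illustration, and `bin_counts` is a hypothetical helper written for this example, not something from the course material.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall in each bin of the given width.

    Bins are half-open intervals [start, start + bin_width), keyed by start.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up sample of ages, used only for illustration.
ages = [21, 22, 22, 23, 25, 27, 28, 31, 33, 34, 35, 36, 38, 41, 45, 52]

narrow = bin_counts(ages, 1)    # almost every value gets its own bin
medium = bin_counts(ages, 10)   # the overall shape becomes visible
wide = bin_counts(ages, 100)    # everything collapses into a single bin

# Print a rough text histogram for each bin width.
for label, counts in [("width 1", narrow), ("width 10", medium), ("width 100", wide)]:
    print(label)
    for start, n in counts.items():
        print(f"  {start:>3} | {'#' * n}")
```

With a width of 1 the bars are nearly flat, with a width of 100 there is only one bar, and a width of 10 shows that most of the ages sit in the twenties and thirties.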

Reference: https://statistics.laerd.com/statistical-guides/understanding-histograms.php

From a histogram, we can outline a distribution curve that tells us about the nature of the data.

Let’s look at the three main ones:

Some different histograms

A normal distribution is a bell-shaped curve: the data is concentrated in the middle and roughly balanced on both sides. In a right-skewed (positively skewed) distribution, the tail stretches to the right and the mean is typically greater than the median. A left-skewed (negatively skewed) distribution has the opposite properties: the tail stretches to the left and the mean is typically less than the median. Outliers are data points that are abnormal in comparison to the rest of the trend in the data set. Sometimes outliers happen due to errors in measurement, or perhaps the data set naturally has a few extremes.

Outliers can pull the tail of a distribution to one side and make it skewed.
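A small sketch of how a single outlier moves the mean away from the median, using Python's standard `statistics` module. The three lists are made up for illustration.

```python
from statistics import mean, median

# Made-up values, used only for illustration.
symmetric = [2, 3, 4, 5, 6, 7, 8]
right_skewed = [2, 3, 4, 5, 6, 7, 50]   # one unusually large value
left_skewed = [-40, 3, 4, 5, 6, 7, 8]   # one unusually small value

for name, data in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    print(f"{name:>12}: mean = {mean(data):.2f}, median = {median(data)}")
```

In all three lists the median stays at 5, while the mean is dragged toward the outlier: above the median for the right-skewed list and below it for the left-skewed one.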

The second part of this quick fix-up is to briefly understand the measures of center.

“Measures of spread describe how similar or varied the set of observed values are for a particular variable (data item)” — Australian Bureau of Statistics

Measures of center are used to make better decisions and to make comparisons. You only have to know the following for now: mean, median, and mode (the three big M’s).

Whether a distribution is normal, left-skewed, or right-skewed (seen above) affects each of these to some extent. To know how, let’s understand the properties of each.

The measures of center are mostly affected by certain factors, such as:

Key points on how distribution curves affect the relationship between the measures of center.

Now, let’s put to practice the measures of center with two examples:

Example 1: If we have a random data set, we can find the measures of center using Excel functions:

(i) Finding the mean: Type in the function =AVERAGE(cell range) and select the cells you want it applied to (like A1:A14). Alternatively, divide =SUM(cell range) by =COUNT(cell range).

(ii) Finding the mode: Use the function =MODE(cell range) to find the mode, as shown below.

(iii) Finding the median: To find the median by hand, you first have to order the data, either from small to big or vice versa. To do this, highlight your data set, go to ‘Data’ and click on ‘Sort range’. Once it’s sorted, the median is the middle value (or the average of the two middle values). In the spreadsheet, the function =MEDIAN(cell range) gives the same result whether or not the data is sorted.
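For anyone who prefers code to spreadsheets, the same three measures can be sketched with Python's standard `statistics` module. The data below is made up for illustration and is not the data set from the spreadsheet example.

```python
from statistics import mean, median, mode

# Made-up data set, deliberately left unsorted.
data = [12, 7, 3, 7, 15, 9, 7, 20, 3, 11]

data_mean = mean(data)      # sum of the values divided by how many there are
data_median = median(data)  # middle value; the function handles the ordering
data_mode = mode(data)      # most frequent value

print("mean:  ", data_mean)
print("median:", data_median)
print("mode:  ", data_mode)
```

Because there are ten values here, the median is the average of the two middle values after sorting.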

IQR

The interquartile range (IQR) for this data set is Q3 - Q1, the difference between the upper and lower quartiles.

Therefore, the IQR for this is 19.5.

The lower quartile is the 25th percentile, the median is the 50th percentile and the upper quartile is the 75th percentile.
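A minimal sketch of the quartiles and the IQR in Python (3.8 or later). The data is made up for illustration, so the result is not the 19.5 from the example above. Choosing `method="inclusive"` is also an assumption: different tools use slightly different quartile conventions and can give slightly different answers.

```python
from statistics import quantiles

# Made-up data set, used only for illustration.
data = [3, 5, 7, 8, 12, 13, 14, 18, 21]

# n=4 splits the data into four parts and returns the three cut points:
# Q1 (25th percentile), Q2 (the median) and Q3 (75th percentile).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1  # interquartile range: the spread of the middle half of the data

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("IQR:", iqr)
```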

The above example can also be done on paper: start with the mean formula, search for the mode, then sort the data to get the median. I used the spreadsheet method as I figured it’s faster and it matches the way our course work has us practice.

Closely related to the median are the first quartile and the third quartile (the median itself is the second quartile). To find these, the sorted data is divided into four equal parts.

Let’s try a different set of data to calculate relative frequencies and proportions. Proportions give you insight into how a particular element in the data set compares to the rest.

Example 2: Just for the fun of it, I took a random sample of my favorite Harry Potter characters to see how many times they appear in the first half of a movie.

P.S. Except Harry Potter himself (that would be too obvious 😐).

Number of times characters appear in the first half of Harry Potter (Except Harry)

To find the proportion of the film that each character has, we first need the frequency (the number of appearances). Dividing each frequency by the total gives the relative frequency, and multiplying that by 100 gives the proportion as a percentage.
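The calculation can be sketched in a few lines of Python. The appearance counts below are invented for illustration and are NOT the figures from my table; only the arithmetic is the point.

```python
# Invented appearance counts, used only to illustrate the arithmetic.
appearances = {
    "Hermione": 10,
    "Ron": 9,
    "Hagrid": 8,
    "Dumbledore": 8,
    "Snape": 8,
    "Draco": 7,
}

total = sum(appearances.values())

# Relative frequency = frequency / total (a number between 0 and 1).
relative_frequency = {name: count / total for name, count in appearances.items()}

# Proportion as a percentage = 100 * frequency / total.
percentage = {name: 100 * count / total for name, count in appearances.items()}

for name in appearances:
    print(f"{name:<12} {appearances[name]:>3} {relative_frequency[name]:.2f} {percentage[name]:.1f}%")
```

The relative frequencies always add up to 1 and the percentages to 100, which is a handy check on the arithmetic.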

Relative frequencies and Proportion

From the proportions we can see that Hermione (with 20%) leads in appearances in the first half of the film. She is closely followed by Ron Weasley. Of course, this isn’t surprising, but hey, it’s always fun to confirm with data!

Thanks for reading through. I’ll try to write the second part of this data visualization summary soon.


Kalpa Vrikshika

~Data foundations graduate~ ~Udacity Bertelsmann Data Science Scholar~ ~Believing until I become it~ ~Happy place~