Histograms

Ashish Agarwal
Published in Analytics Vidhya
4 min read · Sep 18, 2020

If you are new to statistics, histograms can be a little puzzling. What are they? Are they the same as bar graphs? Why do we need them?

First of all, let’s build an intuition.

Fig: Number line

We learned about the number line in school, where every number has its own place on the line. The number line is continuous in the sense that every point on it represents a distinct real number. Each point also represents only one number, i.e. numbers are never stacked on top of one another at the same point. Here, we have represented our data (the infinite set of real numbers) as a 1-D line.

Now consider a situation where we need to describe data that has age as its only attribute. Clearly, the line above cannot help us with this problem. Why?

Fig: Stacked points on 2D
Fig: Stacked points on 3D

As the amount of data increases, we find ourselves stacking points (each point represents a single person in this case) on top of each other, because many people share the same age. This does not present the data clearly and does not help with further analysis. So, we describe the data with a bar graph instead, where the height of each bar is the number of people with that age. Remember that this problem deals with discrete data.

Fig: Bar Graph

Woah! This graph can be used for useful analysis.
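To make the bar-graph idea concrete, here is a minimal sketch in plain Python. The ages are made-up values for illustration, and `text_bar_chart` is a hypothetical helper name, not something from a plotting library. It simply counts how many people share each age and draws one `#` per person.

```python
from collections import Counter

# Hypothetical sample: ages of ten people (made-up values for illustration).
ages = [21, 22, 21, 23, 22, 21, 24, 22, 21, 23]

# Count how many people share each age; each distinct age becomes one bar.
age_counts = Counter(ages)


def text_bar_chart(counts):
    """Return one line per distinct value, with one '#' per person."""
    return [f"{value}: {'#' * counts[value]}" for value in sorted(counts)]


for line in text_bar_chart(age_counts):
    print(line)
```

Because age is recorded here as a whole number, every person falls exactly on one of a small set of values, which is why simple counting is enough.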

Now, that was easy. But how would we solve a problem with continuous data points? For example, say we want to analyze people's height as the only attribute. Height can take infinitely many values, which is similar to our first problem with the number line. However, several people can also have (nearly) the same height and end up stacked, which is similar to our second problem with the bar graph.

Hmm. So, this problem is a hybrid version of both our initial problems. How do we tackle this to describe the data?

Fig: Histogram

Now comes the topic of our discussion: the histogram to the rescue!

So, a histogram is basically a bar graph for continuous data: we model the continuous data as discrete data by grouping values into intervals. For example, say the minimum height is 130 cm and the maximum height is 190 cm. We can divide the data into intervals of 10 cm, i.e. heights of 130–140 cm belong to one bar, 140–150 cm to the next, and so on, giving six bars. Note: in a histogram, the bars are referred to as bins.
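The 130 to 190 cm example can be sketched in a few lines of plain Python. The height values below are invented for illustration, and `bin_heights` is a hypothetical helper. One assumption worth stating: each bin includes its lower edge and excludes its upper edge, except the last bin, which also includes 190 cm.

```python
def bin_heights(heights, low=130, high=190, width=10):
    """Count heights per bin of the given width between low and high.

    Each bin covers [edge, edge + width); the last bin also includes `high`.
    Values outside [low, high] are ignored.
    """
    n_bins = (high - low) // width
    counts = [0] * n_bins
    for h in heights:
        if not (low <= h <= high):
            continue
        # Clamp so that h == high lands in the last bin instead of overflowing.
        index = min(int((h - low) // width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical heights in cm (made-up values for illustration).
heights = [131.5, 138.2, 142.0, 147.9, 150.0, 155.3, 158.8,
           161.1, 164.4, 166.0, 169.9, 172.5, 178.3, 190.0]

counts = bin_heights(heights)
for i, count in enumerate(counts):
    start = 130 + i * 10
    print(f"{start}-{start + 10} cm: {'#' * count}")
```

Each printed row is one bin, so the output is a histogram turned on its side.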

So, the histogram changes as we change the number of bins. There is no hard and fast rule about the number of bins. Taking bins=1 puts all your data in a single bin, which is not useful at all. Taking bins=(number of data samples) gives roughly every data point its own bin, which is also not helpful. Hence, the general rule of thumb is to try a few different numbers of bins until the histogram shows a useful picture of the distribution.

Fig: Histogram with bins=1
Fig: Histogram with bins=N
Fig: Histogram with bins=2
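The effect of the bin count can be tried out with a small sketch. This is an illustration under stated assumptions, not a definitive implementation: the data is made up, `equal_width_counts` is a hypothetical helper, and the bins are assumed to be of equal width between the smallest and largest value.

```python
def equal_width_counts(data, n_bins):
    """Split [min(data), max(data)] into n_bins equal-width bins and count.

    Assumes the data contains at least two distinct values.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp so that the maximum value lands in the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical heights in cm (made-up values for illustration).
heights = [130, 134, 141, 147, 152, 155, 158, 161, 163, 166, 169, 174, 181, 190]

# Try the two extremes and one value in between.
for n_bins in (1, len(heights), 6):
    print(f"bins={n_bins}: {equal_width_counts(heights, n_bins)}")
```

With one bin, the list holds a single number (the sample size). With as many bins as samples, the counts are mostly zeros and ones. A value in between is what reveals where the data is concentrated.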

The histogram is a very useful tool in data analytics and can yield powerful insights into continuous data. If we want to use a distribution to approximate our data, a histogram is a very good way to justify that choice.

More about distributions in the next blog. 😀
