# How to Describe Quantitative Data?

Before we get started with this question, recall that the primary question guiding us when **describing Qualitative Data** was “**What is the frequency of different categories?**”, and we plotted those frequencies using Frequency Tables, Frequency Plots, Relative Frequency Plots, and so on.

Now the point is: “**Does the same question (as for Qualitative Data) make sense for Quantitative Data?**” To explain what we mean by this, let’s take an example and start with some data from the ODIs played by Sachin Tendulkar.

This dataset (a subset is depicted above) has **many fields**, such as the **runs scored** in a particular match, the **minutes spent on the crease**, the **Opposition team**, and so on.

There are **several Quantitative attributes** here. For example, **Runs scored is a Discrete Quantitative attribute**; it is Discrete because it cannot take fractional values. Similarly, **minutes on the crease, in this case at least, is Discrete** (we are not taking seconds into consideration), and **balls faced is also Discrete**. But we also have other attributes in the data, such as **Strike Rate, which is a Continuous Quantitative attribute**.

Let’s focus on the Discrete Quantitative attributes first.

Now, for the **Runs scored** attribute, which is a **Discrete Quantitative data item**, does it make sense to ask questions of the form “**How many times did he get out on 0 or 40 or 99?**” What we are asking here is the frequency of each of the unique values of the **Runs scored** attribute in the dataset, which is the same question we were interested in when analyzing Qualitative Data: “**What is the frequency of different values?**”

It just so happens that right now we are dealing with Quantitative Data (Discrete attributes) instead of Qualitative Data. Also, given the domain of cricket, we know that **in ODIs** the **range of runs is 0 to 200**, at least for Sachin Tendulkar, so we know that **there are 201 possible unique values here**. This is not a very large set, and asking for the frequency makes sense, at least for some of these values, like the **90s**, or **99**, or even, say, **149**.

So, it is reasonable to ask “**What is the frequency of different values?**” The main point is that the same question still holds for Quantitative Data as well (at least for this example), and to answer it, we use Histograms.
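As a quick sketch of this question in code, here is how we might count the frequency of each unique score using the standard library’s `Counter`; the `runs` list below is hypothetical, not the actual dataset:

```python
from collections import Counter

# Hypothetical list of runs scored across matches (illustrative only)
runs = [0, 12, 99, 40, 0, 7, 99, 143, 40, 0]

# "What is the frequency of different values?" for a Discrete attribute
freq = Counter(runs)
print(freq[0])   # number of matches with a score of 0 -> 3
print(freq[99])  # number of matches with a score of 99 -> 2
```

This is exactly the same computation we used for Qualitative Data; only the values being counted are now numbers.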

## Histogram

This is what a histogram looks like. Here the **x-axis** reflects the ‘**runs scored**’ attribute, and there are 201 distinct values (although the axis labels are shown at intervals of 10 rather than for every value).

The **y-axis reflects the frequency of the corresponding x-axis value**, that is, the number of matches in which that particular score appears in the dataset; for example, the number of matches in which the score 0 appears, and so on.

In the case of Qualitative Data, there was no natural ordering among the categories, so we sorted the frequency plot by the frequency of the different categories, which made the plot easier to read. Here, however, **the labels themselves have a natural ordering, so we sort the data according to the natural ordering of the labels and not by the frequency of the labels**.

Most of the tall bars are at the beginning of the plot, say under 10 or under 20, and there are very few tall bars as we move towards the right on the **x-axis**.

Every bar corresponds to a score; for example, the height of the bar at score 0 is 30+, which means the player scored 0 more than 30 times, and so on for the other labels on the x-axis.

We can make some interesting observations from the Histogram by **looking at where the missing values are**. For example, in the above plot, if we look at the labels from 0 to 100, we can see that there are 6 scores/values on which Sachin never got dismissed.

In some domains, it might make sense to look for these missing values and analyze them further. So, a Histogram reveals some interesting insights at a quick glance.

Let’s look at some of the issues with this plot:

- **It is a very, very long figure**, in the sense that there are 201 unique values here, and the number of unique values could be even larger in other cases. For example, if we plot Brian Lara’s Test scores, the x-axis would go all the way up to 400, and **it would become very difficult to read**.
- From a domain point of view, are we really interested in knowing how many times the player scored exactly 4, or any other particular value from the set of possible unique values? Does that really make sense, or are we okay with just knowing how many times he scored less than 5? After all, there is not much distinction between 1, 2, 3, 4, and 5, at least from a cricket point of view, as all of these are low scores.

Similarly, for the scores in the range of 11 to 20, we don’t want to look at exactly how many times he scored, say, 14 or 16.

These fine distinctions don’t make sense, and they in fact make it harder for us to answer some interesting questions, like how many times he scored in single digits, as that would require summing up the frequencies of multiple values.

**The solution for this is to use class intervals or bins**.

Again, the **x-axis** reflects the ‘**runs scored**’ attribute and the y-axis represents the corresponding frequency, but now we **group the x-axis values into bins**. So instead of saying that this value/label is 4 or this is 5, we just say that this is a group, or a bin, or a class interval from 0 to 9, and **the height of the bar corresponding to the bin/group from 0 to 9 is the sum of the frequencies of all values from 0 to 9**.

So, essentially each bar will be the sum of frequencies of all the values contained in that class interval.
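This grouping step can be sketched in a few lines. The data below is hypothetical, but the logic (map each value to the left edge of its class interval, then sum the frequencies) is exactly what binning does:

```python
from collections import Counter

# Hypothetical runs data (illustrative only)
runs = [0, 3, 7, 14, 18, 23, 41, 67, 99, 143]

bin_size = 10
# Map each value to the left edge of its class interval, e.g. 14 -> 10,
# then count: the 0-9 bar is the sum of frequencies of all values from 0 to 9
binned = Counter((r // bin_size) * bin_size for r in runs)
print(binned[0])   # scores 0, 3 and 7 fall in the 0-9 bin -> 3
```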

So, this is what the histogram would look like:

Now from this plot, we can say that he got dismissed on a single-digit score in around 140 matches out of the total ODIs he played. And we can easily answer questions like “How many times was he dismissed in the 90s?” and so on.

**Bin Size** (also called **Class Interval**) is an important concept for Histograms, and the key question when drawing them is: **how do we decide the class interval?**

It is quite obvious that a bin size of 1 would be quite cumbersome (at least for this example), as the plot would be tedious to read: with 201 unique values on the x-axis, it becomes hard to answer some questions of interest.

A bin size of 10 seems reasonable; it does not hide a lot of detail and at the same time reflects some interesting patterns in the data.

The question that arises is: which way should we go? Should we try a bin size of 5, or 20, or 30, and so on? **What would be a good bin size, and how do we decide that?**

Let’s look at the plot for each of these bin sizes and arrive at a conclusion about what a good bin size would be. Let’s start with a bin size of 5:

A bin size of 5 is again a bit subjective. Some of us might feel that with a bin size of 5 there are too many bins (for 201 unique values there are around 40 bins). It may or may not be okay, and it boils down to the question of whether we care about the distinction between 0 to 5 and 5 to 10, or whether we just want to look at them together as 0 to 10.

So, this is a bit subjective. Some of us might find it okay; it is not too cumbersome, and we can still answer questions like ‘**the number of matches in which he scored in single digits**’ by just adding up two bars, which is not very difficult. So, this may be okay.

But then, should we use larger bin sizes? 5 looks okay, 10 looks okay; should we go for 20, 30, or even 40? Would that be better?

So, there is a trade-off there with a larger bin size.

As we increase the bin size, the granularity is compromised. When we operate at a bin size of 1, that is the maximum granularity we have for discrete data; we can’t have a bin size of 0.75 or 0.5 or anything like that. That is the maximum zoom-in we can do on the data.

**Now the more we increase the bin size, the more we are zooming out and that’s the trade-off**.

Now, with a bin size of 20, if we look at the interval 20 to 40, it contains the score 21, and it also contains the score 39. Someone might categorize 21 as a low score, but 39 is a respectable, decent score; at least in the initial years of Sachin’s career, when he used to play in the middle order, a score of 39 when the overall team score is around 230 or 240 is not that bad. So, 39 is a good score and, say, 21 is bad, and yet both of these are grouped into the same interval.

That does not look meaningful from an analysis point of view; we are losing a lot of granularity, and some of the distinctions and patterns we would like to see in the data can no longer be seen with a larger bin size.

This becomes even more apparent when we increase the bin size to 40: now 1 and 39 are grouped in the same interval. 1 is clearly a bad score and 39 is a decent score, yet we are looking at them as one single bucket; the granularity is clearly compromised, and we can’t make any interesting observations from this plot.

So, in this case, increasing the bin size from 1 to 5 was good, and from 5 to 10 was also okay, but going to 20 or 40 looks bad.

**There is clearly a tradeoff here, we don’t want to look at very small bin sizes but at the same time, we don’t want to look at very large bin sizes as well.**

And if we take it to the extreme of a bin size of 100, then nothing is revealed by the plot; there is absolutely no interesting observation we can make from it. A bin size of 1, on the other hand, arguably has a lot of detail and lets us answer a lot of questions, but it is just a bit cumbersome to read and understand.

Let’s look at another example, this time from the Agriculture domain, where again we are interested in the right bin size.

So, we have **data about the total yield of farms**. Let’s describe this data first:

Let’s say we have some 10000 farms in the dataset. For every farm, we know the total yield (a value rounded off to the nearest integer), and this total yield ranges from 0 to 579441.

Now, if we use a bin size of 10 here, then we will have around 57,944 unique bins on the x-axis, which is just impossible to read. So, **the bin size of 10 that was good for the previous example, in the batting domain within Cricket, does not seem to be a good bin size for this Agricultural dataset**.

Even with a bin size of 1000, there will be around 579 bins on the **x-axis**, which is still far too many to read and understand. So, in this case, it probably makes sense to use a bin size of 10000. Let’s see what the plot looks like with a bin size of 10000:

This is what the plot looks like. If we look at the tallest bar, which is the first bar, it includes all farms that had production from 0 to 10000 units, and it looks like some 450 farms in the dataset had production in this range. That means most of the farms are smaller farms, and then there were a few bigger farms; there was even a farm with a yield close to 580000 units. These larger farms were very few in number, which is why we see short bars as we move towards the right of the **x-axis**, where the total yield keeps increasing.

We could also have tried a bin size of 5000, because in agriculture a yield of 0 to 5000 might be significantly different from a yield of 5000 to 10000 units, so a little more granularity here might help the analysis; even a bin size of 2500 might be okay.

In practice, we need to plot histograms with different bin sizes, see for ourselves which one makes the most sense, and then choose accordingly.
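A minimal sketch of this try-several-bin-sizes workflow, using `numpy.histogram` to compute the bar heights for each candidate bin size; the skewed data here is simulated as a stand-in for the farm yields, not the real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "yields": many small farms, a few large ones
yields = rng.exponential(scale=50000, size=1000)

for bin_size in (10000, 5000, 2500):
    # Bin edges spaced bin_size apart, covering the full range of the data
    edges = np.arange(0, yields.max() + bin_size, bin_size)
    counts, _ = np.histogram(yields, bins=edges)
    # Every farm falls into exactly one bin, so the counts sum to 1000
    print(f"bin size {bin_size}: {len(counts)} bins, tallest bar = {counts.max()}")
```

Swapping `np.histogram` for `plt.hist` would draw the actual plots; the counts are the same either way.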

The bin size that was useful for the Cricket domain is very impractical for this Agricultural dataset. So, **it is clear that the right bin size varies across domains; it depends on the range of the data and also on the kind of patterns we are interested in analyzing**.

## Histograms for Continuous Data

Let’s again look at the Cricket domain. Here is a subset of the same dataset that captures each of the ODIs played by Sachin Tendulkar and reflects the different attributes of interest:

The attribute ‘**Strike Rate**’ happens to be a Continuous data item with fractional values as well.

If we draw the frequency plot of this attribute, we can’t really find any interesting patterns or answer meaningful questions like “In how many matches did he score at a low pace, say a strike rate of less than 60, or at a brisk pace of, say, greater than 90 or 100?”

There are so many small bars and we cannot do anything interesting with this level of granularity.

Again, the solution is to bin these values into intervals, as we did for discrete data. We don’t really care about the distinction between a Strike Rate of 20.25 or 20.5 or 22.5; we would rather say all of them belong to the interval 20 to 30. The 20 to 30 bin represents a low scoring rate, and at that granularity it makes sense for us to look at the data.
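A small sketch of this binning for continuous values; the strike rates below are made up, but they show how 20.25, 20.5, and 22.5 all land in the same 20 to 30 interval:

```python
# Hypothetical strike-rate values (illustrative only)
strike_rates = [20.25, 20.5, 22.5, 87.3, 95.0, 101.7]

bin_size = 10
# Floor-divide to the left edge of the interval: 22.5 -> 20, 95.0 -> 90, ...
left_edges = [int(sr // bin_size) * bin_size for sr in strike_rates]
print(left_edges)  # -> [20, 20, 20, 80, 90, 100]
```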

Now, this pattern makes sense to us. Most of the tall bars in the above plot lie in the range of 60 to 100, which means in most matches Sachin scored at a good pace; in some matches he scored at a very brisk pace, going beyond 100, and there were also matches in which he scored at a very slow pace.

If we use a bin size of 20, it is again debatable whether it is a good bin size or not. We still get the overall idea from the histogram, and the answer depends very much on domain knowledge; for example, would it make sense to club a record with a strike rate of 51 together with a record with a strike rate of 69?

If we increase the bin size to 40, then that is clearly not a good choice, because all the values from 40 to 80 would be grouped into one bucket. So, once again, as we increase the bin size, we lose granularity and detail, and some of the interesting patterns get hidden. As always, we want to neither hide too many details nor reveal too many, and in this example a bin size of 10 to 20 seems reasonable.

Let’s take another example from the Cricket domain and this time let’s analyze the Economy Rate of Zaheer Khan. This is what the sample data looks like:

‘**Economy Rate**’ is also a continuous data attribute with fractional/decimal values. If we plot it as a Histogram with a bin size of 5, then all the records with an Economy Rate of 0 to 5 get clubbed into one bucket, and so on. This is what the plot looks like:

Now, this is very bad, because we know that an ‘**Economy Rate**’ of 2 is actually very good, while an ‘**Economy Rate**’ of 5, at least in the older days of the 1990s, was considered bad; clubbing 2 and 5 into one interval does not reveal any interesting patterns.

It just tells us that in most of the matches his economy rate was in this range, which does not help us see the finer distinctions between these economy rates. So, a bin size of 5 clearly looks bad in this case.

If we use a bin size of 3, then we get groups from 0 to 3, 3 to 6, and 6 to 9, which again tells us that in most of the matches his economy rate was between 3 and 6. But 3 is actually a good economy rate and 6 is leaning towards the bad side, so in this case as well we have a bucket that clubs a good value and a bad value together, and clearly we don’t want buckets of this type. So, we try to reduce the bucket size further instead of increasing it.

Let’s plot it out with a bin size of 2.

From the above plot, we can say that a bin size of 2 seems reasonable, and it actually reveals a very interesting pattern: in most of the matches his economy rate was in the 2 to 4 range, which is considered good; there were a few matches in which it was lower, and proportionally the number of matches in which the economy rate was greater than 4 is much, much smaller.

We could go even further and use a bin size of 1, and this granularity is also okay. It still allows us to answer questions like “In how many matches was his economy rate between 2 and 4?”, and it also zooms in a bit: 2 to 4 is a slightly broad range, and if we really want to know in how many matches he was on the lower side of it, i.e. between 2 and 3, the plot below shows that there are quite a few matches there, while 3 to 4 is the most dominant, which is again good.

So, in this scenario, a bin size of 1 seems to be the most reasonable choice, and a bin size of 2 was also okay. The main point here is that the class interval is a very, very important thing when we are plotting histograms, and it clearly depends on the domain of interest and the range of the data.

We haven’t answered one question yet. Let’s say we have a class interval that goes from 2 to 4 and another class interval from 4 to 6; the question is: which group/bucket does the value 4 go into?

There is a simple way of resolving this, known as the **Left end inclusion convention**, which says that **a class contains its left end boundary but not its right end boundary**. What that means is that if we have the exact value 4, it won’t go into the 2 to 4 bucket; it will actually go into the 4 to 6 interval.
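The convention amounts to making each class a half-open interval [left, right). Here is a small sketch using the standard library’s `bisect`; the edge values are just for illustration:

```python
from bisect import bisect_right

edges = [0, 2, 4, 6]  # class intervals 0-2, 2-4, 4-6

def interval(x):
    # Left end inclusion: each class is [left, right), so x == 4 lands in 4-6
    i = bisect_right(edges, x) - 1
    return (edges[i], edges[i + 1])

print(interval(3.9))  # -> (2, 4)
print(interval(4))    # -> (4, 6)
```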

In this article, we discussed histograms for both Discrete and Continuous Quantitative Data. The right bin size depends on the domain and the range of the data, and it really needs to be tuned so that we get interesting patterns. Here is the overall procedure for creating a histogram:

- Sort the values in increasing order (based on the **x-axis** value and not on frequency)
- Choose the class intervals such that all values are covered
- Compute the frequency of each interval
- Draw a bar for each interval such that the height of the bar is proportional to the frequency
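The four steps above can be sketched as a small function; the data and the text-bar rendering below are just illustrative:

```python
from collections import Counter

def histogram(values, bin_size):
    values = sorted(values)                                     # 1. sort by value
    freq = Counter((v // bin_size) * bin_size for v in values)  # 2-3. intervals + frequencies
    lo, hi = values[0], values[-1]
    for left in range((lo // bin_size) * bin_size, hi + 1, bin_size):
        # 4. bar height proportional to frequency (rendered here with '#' characters)
        print(f"{left:>4}-{left + bin_size - 1:<4} {'#' * freq[left]}")
    return freq

freq = histogram([0, 3, 7, 14, 18, 23, 41, 45], bin_size=10)
```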

References: PadhAI