Statistics 101: Basic Visualization - It’s good to be ‘seen’!

Rohan Bali
Published in Analytics Vidhya
5 min read · Dec 5, 2019

Welcome to the second blog of the statistics series!

Before we dive into graphs, charts and visualization techniques, let’s consider why visualization matters!

Images tend to stay in human memory longer than plain text. When data is represented in the form of a graph or chart, we say that the data is visualized. Data visualization is crucial, especially for someone working in the field of data science. A visual aid helps us identify:

(i) the trend and pattern of the data.

(ii) outliers present in the data.

(iii) the diversity of the population.

(iv) the level of correlation among different variables in the data.

(v) and more.

Graphs look Cool!!

So, now let's understand some basic terminology before jumping into making super-awesome graphs and charts.

The following are very basic formal definitions of some of the terms that will be used while we are plotting the data. [These are damn easy!] Let’s begin.

(i) Frequency Distribution: The most basic requirement for a plot is the distribution of the data with its respective frequencies. For example, suppose we have an employee database with class intervals based on the ages of employees (21–25, 26–30, 31–35, …). The number of people in each interval is the frequency of that interval.

(ii) Range: As the name suggests, the range of the data is the gap between the largest and the smallest value present in it. For example, the range of the employees’ ages is the difference between the age of the oldest employee and the age of the youngest employee.

(iii) Class Mid-point: Also known as the class mark. It is the center position (the midpoint) of an interval. For example, the class midpoints for the age intervals ‘21–25’, ‘26–30’ and ‘31–35’ are 23, 28 and 33 respectively.

Just two more, sit tight!

(iv) Relative Frequency: It is part of the whole! The frequency of a particular class interval relative to the total frequency in the data is known as the relative frequency of that class interval. For example, let the number of employees in the age class interval ‘21–25’ be 15 and the total number of employees in the firm be 100. Then the relative frequency of the class interval ‘21–25’ is 15/100 = 0.15.

(v) Cumulative Frequency: It is the running total of the frequencies of all intervals up to and including the current one.

Let's make the above easier and more practical!

NOTE: The total frequency always equals the cumulative frequency of the last class interval.
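To make the five terms concrete, here is a minimal sketch in plain Python that builds a small frequency table. The data is an assumption made for illustration: only the count of 15 for ‘21–25’ and the total of 100 come from the example above, while the remaining intervals, counts and the `ages` list are hypothetical.

```python
# Hypothetical data: only the 21-25 count (15) and the total (100)
# come from the example in the text; everything else is made up.
intervals = [(21, 25), (26, 30), (31, 35), (36, 40)]
frequencies = [15, 35, 30, 20]
ages = [21, 24, 28, 33, 40]  # hypothetical raw ages, used only for the range

# Range: largest value minus smallest value
age_range = max(ages) - min(ages)

total = sum(frequencies)
running = 0
rows = []
for (low, high), freq in zip(intervals, frequencies):
    running += freq  # cumulative frequency up to and including this interval
    rows.append({
        "interval": f"{low}-{high}",
        "midpoint": (low + high) / 2,   # class mark
        "frequency": freq,
        "relative": freq / total,       # share of the whole
        "cumulative": running,
    })

print("range of ages:", age_range)
for row in rows:
    print(row)
```

The last row's cumulative value matches the total frequency, which is exactly what the note above states.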

Now, let's have a look at some of the basic yet very useful graphs!

(i) Histograms: They are a sequence of adjacent rectangles. Each rectangle represents the frequency of its respective class interval.

(ii) Frequency Polygons: Similar to histograms, they also represent the frequency of class intervals. However, instead of rectangles, they use a dot for each class interval (placed at the class midpoint) to represent its frequency, and the dots are joined by line segments.

(iii) Ogives: An ogive is an extension of the frequency polygon in which cumulative frequencies are plotted instead of individual frequencies. This graph lets us visualize how the frequencies build up across the class intervals.

NOTE: Unlike the frequency polygon, an ogive begins at zero frequency, since nothing has been accumulated before the first class interval.
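The three graphs so far differ only in which points they plot. The sketch below computes those points for a hypothetical set of age intervals (the counts are assumptions, not data from this blog) and leaves the actual drawing to whichever plotting library you prefer. Starting the ogive at zero at the lower limit of the first class is one common convention, assumed here.

```python
from itertools import accumulate

# Hypothetical class intervals and frequencies
intervals = [(21, 25), (26, 30), (31, 35), (36, 40)]
frequencies = [15, 35, 30, 20]

# Histogram: one rectangle per class interval, height = frequency
histogram_bars = [(f"{lo}-{hi}", f) for (lo, hi), f in zip(intervals, frequencies)]

# Frequency polygon: one dot per interval at (class midpoint, frequency)
polygon_points = [((lo + hi) / 2, f) for (lo, hi), f in zip(intervals, frequencies)]

# Ogive: cumulative frequency against the upper limit of each interval,
# starting from zero at the lower limit of the first interval
ogive_points = [(intervals[0][0], 0)] + [
    (hi, cum) for (_, hi), cum in zip(intervals, accumulate(frequencies))
]

# A rough text histogram: one '#' per 5 employees
for label, freq in histogram_bars:
    print(f"{label} | {'#' * (freq // 5)}")
print("polygon:", polygon_points)
print("ogive:  ", ogive_points)
```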

(iv) Dot Plots: A dot plot places each value of an attribute on the horizontal axis, and the number of dots above a value represents its count. It is very useful for understanding the overall shape of the data. When we have multiple entries of the same value, the dots get piled on top of each other.
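A dot plot is simple enough to sketch as text. The example below uses made-up values and, for convenience, prints the plot rotated: each value gets its own row and the dots pile up to the right instead of upwards.

```python
from collections import Counter

# Hypothetical values of a single attribute
values = [2, 3, 3, 4, 4, 4, 5, 5, 7]
counts = Counter(values)

# One row per possible value; repeated values pile up as extra dots
lines = [
    f"{v} | {'o' * counts[v]}"
    for v in range(min(values), max(values) + 1)
]
for line in lines:
    print(line)
```

The empty row for 6 shows a gap in the data, which is the kind of shape information a dot plot makes visible.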

(v) Stem and Leaf: This is a very interesting plot. Each value is split into two parts, the stem and the leaf. The stems, on the left, hold the higher-valued digits. The leaves, on the right, hold the lower-valued digit. For example, if we have a data point (say, 34), then 3 will be the stem and 4 will be the leaf. The following represents the marks scored by 50 students in a class.

It helps in visualizing the overall spread of the data.

So, the first entry in the stem and leaf representation is (1 | 024678). This represents the following marks: 10, 12, 14, 16, 17, 18.
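Here is a small sketch of how such a representation can be built. The first six marks are the ones listed above; the remaining marks are hypothetical, and the helper name `stem_and_leaf` is made up for this example. It assumes non-negative whole-number marks, with the last digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by stem (all digits but the last) and list the leaves."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # 34 -> stem 3, leaf 4
        groups[stem].append(leaf)
    return {
        stem: "".join(str(leaf) for leaf in leaves)
        for stem, leaves in sorted(groups.items())
    }

# First six marks come from the text; the rest are hypothetical
marks = [10, 12, 14, 16, 17, 18, 21, 25, 25, 34]
plot = stem_and_leaf(marks)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```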

(vi) Pie Chart: It is a circular depiction of the given data. Different categories of a single attribute are shown as slices of the circle, with each slice sized in proportion to its frequency. The circle is divided among the categories in terms of percentages or degrees.
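The division into percentages or degrees is just a proportion of 100 or 360. A quick sketch, assuming a hypothetical count of employees per department:

```python
# Hypothetical number of employees in each department
counts = {"Sales": 40, "Engineering": 35, "HR": 15, "Finance": 10}
total = sum(counts.values())

# Each slice gets a share of 100 percent and of the full 360 degrees
slices = {
    name: {"percent": 100 * count / total, "degrees": 360 * count / total}
    for name, count in counts.items()
}

for name, share in slices.items():
    print(f"{name}: {share['percent']:.1f}% = {share['degrees']:.1f} degrees")
```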

(vii) Bar Graphs: These are either horizontal or vertical bars that represent the frequency of the data. Unlike histograms, they are not continuous in nature: the bars are typically separated by gaps.

(viii) Pareto Graphs: They are the same as bar graphs with an extra feature: a cumulative frequency line drawn from the first bar in the graph to the last bar. The bars in a Pareto graph are arranged in descending order, with the longest bar on the left and the shortest one on the right side of the representation space.
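The two ingredients of a Pareto graph, the descending order and the cumulative line, can be sketched in a few lines. The department counts below are hypothetical, and expressing the cumulative line as a percentage of the total is a common choice assumed here.

```python
from itertools import accumulate

# Hypothetical number of employees in each department
counts = {"HR": 12, "Engineering": 45, "Legal": 8, "Sales": 25, "Finance": 10}

# Bars: sorted from the largest count to the smallest
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Cumulative line: running total of the sorted counts, as a percentage
total = sum(counts.values())
running_totals = accumulate(count for _, count in ordered)
pareto = [
    (name, count, round(100 * running / total, 1))
    for (name, count), running in zip(ordered, running_totals)
]

for name, count, cumulative_pct in pareto:
    print(f"{name:<12} {count:>3}  {cumulative_pct:>5}%")
```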

(ix) Scatter Plot: These are used to study the relationship between two variables in the given data. They are very useful when we deal with correlation and regression (to be discussed in future blogs).

Dot plots represent only one variable while a scatter plot represents two variables.

The above plot shows the average mileage (miles per gallon) of cars against car weight. We can see that as the weight increases, the miles per gallon decreases.
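As a rough illustration of that relationship, the sketch below pairs up hypothetical car weights and mileage figures (these numbers are made up, not the data behind the plot) into the points a scatter plot would show. It also computes the Pearson correlation coefficient by hand as a sneak peek, where a value close to -1 indicates a strong decreasing trend.

```python
from math import sqrt

# Hypothetical data: car weight (lbs) and mileage (miles per gallon)
weights = [2000, 2500, 3000, 3500, 4000]
mpg = [38, 32, 27, 22, 18]

# Each car becomes one (x, y) point on the scatter plot
points = list(zip(weights, mpg))

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson_r(weights, mpg)
print(points)
print("correlation:", round(r, 3))
```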

So, that's all!

Coming up: Statistics 101: Grouped and Ungrouped Data- Let’s talk with data!

Previous Blog: Statistics 101: Let’s not be ‘mean’ always!

Rohan Bali
Analytics Vidhya

Data Analytics professional with majors in Computer Science Engineering. Enjoys problem-solving and propelling data-driven decisions.