Data Analytics Using Python (Part_1)

Published in Budding Data Scientist

10 min read · Apr 4, 2020

This is the first in a series of 12 posts on Data Analytics using Python, an online course offered by IIT Roorkee through the NPTEL (Swayam) portal. In this post, we will learn the basics of data analytics.

Index

Importance of Data Analysis
Types of Variables
Visual Representation of Data
Measures of Central Tendency
Measures of Dispersion
Measures of Shape

You might know how to work with real data and might have learned many different methodologies, but choosing the right methodology is what matters. The real threat is a lack of fundamental understanding of:

–Why to use a particular technique or procedure

–How to use it correctly and

–How to correctly interpret the result

First let us look into the properties of data and why it is so important.

As per Wikipedia ‘Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data is a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum is a single value of a single variable.’

Data is important since:

•Data helps one make better decisions

•Data helps one solve problems by finding the reasons for under-performance

•Data helps one evaluate performance

•Data helps one improve processes

•Data helps one understand consumers and the market

Data Analytics is defined as “the scientific process of transforming data into insights for making better decisions”.

“Analytics is the use of data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers gain improved insight about their business operations and make better, fact-based decisions.” (James Evans)

Data analytics has many applications, such as determining credit risk, developing new medicines, finding more efficient ways to deliver products and services, preventing fraud, uncovering cyber threats, and retaining the most valuable customers, and the list goes on.

Based on the phase of workflow and the kind of analysis required, there are four major types of data analytics.

•Descriptive analytics: Descriptive analytics is the conventional form of business intelligence and data analysis. It provides a depiction or “summary view” of facts and figures in an understandable format. Descriptive statistics summarize raw data and convert it into a form that people can easily understand.

•Diagnostic analytics: Diagnostic analytics is a form of advanced analytics that examines data or content to answer the question “Why did it happen?”. Diagnostic tools help an analyst dig deeper into an issue to arrive at the source of a problem. It uses techniques such as data discovery, data mining, and correlations.

•Predictive analytics: Predictive analytics helps forecast trends based on current events. Predictive models estimate the probability that an event will happen in the future, or the time at which it is likely to happen.

•Prescriptive analytics: Prescriptive analytics is a set of techniques for indicating the best course of action. It tells us what decision to make to optimize the outcome.

Types of Variables

The picture below explains all the types of variables.

Visual Representation of Data

We have various kinds of graphical representations for plotting different kinds of data. A few of them are:

Histogram: A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data.

Frequency Polygon: A frequency polygon is a graph constructed by using lines to join the midpoints of each interval, or bin. The heights of the points represent the frequencies. A frequency polygon can be created from the histogram or by calculating the midpoints of the bins from the frequency distribution table.

Ogive: An ogive is the graph of a cumulative frequency distribution. It shows data values on the horizontal axis and either the cumulative frequencies, the cumulative relative frequencies, or the cumulative percent frequencies on the vertical axis.

Pie Chart: A pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice is proportional to the quantity it represents.

Scatter Plot: A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.

There are several other kinds of visual representations of data, which will be discussed as the series of posts continues.
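
The histogram, frequency polygon, and ogive described above are all drawn from the same frequency table. As a minimal sketch (the `ages` sample and the `frequency_table` helper are made up for illustration and are not part of the course material), the numbers behind each plot could be computed like this:

```python
# Hypothetical sample: ages of 12 survey respondents.
ages = [21, 22, 22, 23, 25, 27, 31, 33, 34, 34, 36, 39]


def frequency_table(data, start, width, n_bins):
    """Build one row per bin [lower, upper) with the values each plot needs."""
    table = []
    cumulative = 0
    for i in range(n_bins):
        lower = start + i * width
        upper = lower + width
        freq = sum(1 for x in data if lower <= x < upper)
        cumulative += freq
        table.append({
            "bin": (lower, upper),
            "midpoint": (lower + upper) / 2,  # x-value for the frequency polygon
            "frequency": freq,                # bar height in the histogram
            "cumulative": cumulative,         # y-value for the ogive
        })
    return table


table = frequency_table(ages, start=20, width=5, n_bins=4)
for row in table:
    print(row)
```

Plotting `frequency` as bars gives the histogram, joining `(midpoint, frequency)` points gives the frequency polygon, and joining the upper bin edges against `cumulative` gives the ogive.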

Measures of Central Tendency

Arithmetic Mean:

It is commonly called ‘the mean’. It is the average of a group of numbers. It is applicable for interval and ratio data but not for nominal or ordinal data. It is affected by each value in the data set, including extreme values. It is computed by summing all values in the data set and dividing the sum by the number of values in the data set.

Population Mean: For a population of size N, the population mean is

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Sample Mean: For a sample of size n, the sample mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Mean of Grouped Data: For class midpoints M and corresponding frequencies f, the mean of grouped data is given by

$$\mu_{\text{grouped}} = \frac{\sum f M}{\sum f}$$

Weighted Average: If x is a data value and w is the weight assigned to that data value, with the sums taken over all data values, the weighted average is given by

$$\bar{x}_w = \frac{\sum w x}{\sum w}$$
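
To make the three kinds of mean concrete, here is a small sketch using only Python's standard library. All the numbers (`values`, `midpoints`, `frequencies`, `scores`, `weights`) are invented for illustration.

```python
from statistics import mean

# Simple arithmetic mean: sum of the values divided by their count.
values = [4, 8, 15, 16, 23, 42]
simple_mean = mean(values)

# Mean of grouped data: sum(f * M) / sum(f),
# where M are class midpoints and f the class frequencies.
midpoints = [5, 15, 25, 35]
frequencies = [2, 5, 8, 5]
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / sum(frequencies)

# Weighted average: sum(w * x) / sum(w).
scores = [80, 90, 70]
weights = [5, 3, 2]
weighted_average = sum(w * x for w, x in zip(weights, scores)) / sum(weights)

print(simple_mean, grouped_mean, weighted_average)
```

Note that the grouped mean is only an approximation of the true mean, since every observation in a class is treated as if it sat at the class midpoint.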

Median:

It is the middle value among an ordered array of numbers. It is applicable for ordinal, interval and ratio data but not for nominal data. It remains unaffected by extremely large or small values.

After arranging the terms in an ordered array, the median is found by:

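a) For an odd number of terms, the median is the term at position (n+1)/2.

b) For an even number of terms, the median is the average of the (n/2)th and (n/2 + 1)th terms.

c) Median of grouped data: the median is interpolated within the median class, i.e. the class that contains the (N/2)th observation:

$$\text{Median} = L + \frac{\frac{N}{2} - cf_p}{f_{med}} \times W$$

where L is the lower boundary of the median class, N is the total number of observations, cf_p is the cumulative frequency of the classes before the median class, f_med is the frequency of the median class, and W is the class width.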
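
The rules above can be sketched in a few lines of Python. The grouped version assumes the usual interpolation formula, median = L + ((N/2 - cf) / f) × W, where L is the lower boundary of the median class, cf the cumulative frequency before it, f its frequency, and W the class width. The function names and sample classes are hypothetical.

```python
def median_ungrouped(values):
    """Median of raw data: middle term, or average of the two middle terms."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n/2 and n/2 + 1


def median_grouped(lower_bounds, frequencies, width):
    """Interpolated median for equal-width classes."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("no observations")
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, freq in zip(lower_bounds, frequencies):
        if cumulative + freq >= total / 2:        # this is the median class
            return lower + (total / 2 - cumulative) / freq * width
        cumulative += freq


print(median_ungrouped([7, 1, 5]))      # odd number of terms
print(median_ungrouped([7, 1, 5, 3]))   # even number of terms
# Classes 0-10, 10-20, 20-30, 30-40 with made-up frequencies.
print(median_grouped([0, 10, 20, 30], [2, 5, 8, 5], width=10))
```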

Mode:

The most frequently occurring observation in the data is called the mode. It is applicable to all data types. Data with two modes is called bimodal, and data with more than two modes is called multimodal. To find the mode of a data set, we look for the data point with the highest frequency of occurrence. For grouped data, the modal class is the class with the highest frequency, and we find the mode using the formula

$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times w$$

where:

L is the lower class boundary of the modal group
d1 is the difference between the frequency of the modal group and the frequency of the previous group
d2 is the difference between the frequency of the modal group and the frequency of the next group
w is the group width
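
As a rough sketch of both cases, `statistics.multimode` returns every most-frequent value in raw data, and a small helper applies the grouped formula Mode = L + d1 / (d1 + d2) × w with the symbols defined above. The helper name and sample classes are hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption made here for the edge classes.

```python
from statistics import multimode

# Raw data: multimode returns all values tied for the highest frequency.
print(multimode([1, 2, 2, 3, 3, 4]))  # two modes, so the data is bimodal


def mode_grouped(lower_bounds, frequencies, width):
    """Mode of grouped data for equal-width classes."""
    # Index of the modal class (first class with the highest frequency).
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    f_modal = frequencies[i]
    # Assumption: a missing neighbour (first or last class) counts as frequency 0.
    f_prev = frequencies[i - 1] if i > 0 else 0
    f_next = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1 = f_modal - f_prev
    d2 = f_modal - f_next
    return lower_bounds[i] + d1 / (d1 + d2) * width


# Classes 0-10, 10-20, 20-30, 30-40 with made-up frequencies.
print(mode_grouped([0, 10, 20, 30], [2, 5, 8, 5], width=10))
```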

Percentiles:

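A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which 20% of the observations may be found. It is applicable for ordinal, interval, and ratio data but not for nominal data. To locate the Pth percentile in an ordered data set of size n, we use the formula

$$i = \frac{P}{100} \times n$$

If i is a whole number, the Pth percentile is the average of the values at positions i and i+1; otherwise, it is the value at the position given by the whole-number part of i, plus 1.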
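
There are several conventions for locating a percentile. The sketch below assumes one common textbook rule: compute i = (P / 100) × n on the sorted data; if i is a whole number, average the values at positions i and i + 1, otherwise take the value at the position given by the whole-number part of i, plus 1. The function name and sample data are hypothetical, and library functions (such as `statistics.quantiles`) may use different interpolation rules and give slightly different answers.

```python
def percentile(values, p):
    """P-th percentile using the location rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # divmod keeps the arithmetic exact for integer p: i = whole + remainder / 100.
    whole, remainder = divmod(p * n, 100)
    whole = int(whole)
    if remainder == 0:
        # i is a whole number: average the values at 1-based positions i and i + 1.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise: value at 1-based position (whole-number part of i) + 1.
    return ordered[whole]


data = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile(data, 30))
print(percentile(data, 25))
print(percentile(data, 50))
```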

Measures of Dispersion

Common measures of variability or dispersion are Range, Inter-quartile range, Mean Absolute Deviation, Variance, Standard Deviation, Z scores and Coefficient of Variation. Let us look at each of them in more detail.

1. Range: It is the difference between the largest and the smallest values in a set of data, and it is very simple to compute.

2. Quartiles: These are measures of position that divide an ordered group of data into four subgroups:

•Q1: 25% of the data set is below the first quartile (25th percentile)

•Q2: 50% of the data set is below the second quartile (50th percentile)

•Q3: 75% of the data set is below the third quartile (75th percentile)

3. Interquartile Range: The range of values between the first and third quartiles, i.e., IQR = Q3 - Q1.

4. Mean Absolute Deviation: It is the average of the absolute deviations from the mean.

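5. Population Variance: It is the average of the squared deviations from the arithmetic mean of the population values.

6. Sample Variance: It is the sum of the squared deviations from the arithmetic mean of the sample values, divided by n - 1 rather than n.

7. Population/Sample Standard Deviation: It is the square root of the corresponding variance.

Mean Absolute Deviation, Population Variance, Sample Variance:

$$\text{MAD} = \frac{\sum |x_i - \mu|}{N}, \qquad \sigma^2 = \frac{\sum (x_i - \mu)^2}{N}, \qquad s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$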

8. Coefficient of Variation: It is the ratio of the standard deviation to the mean, expressed as a percentage. It is also called the measurement of relative dispersion.
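
The measures listed above can all be computed with Python's built-in `statistics` module. This is only a sketch on a made-up data set; note that `statistics.quantiles` uses its own interpolation rule, so quartiles may differ slightly from hand calculations, and the population standard deviation is used for the coefficient of variation and z scores purely as an illustrative choice.

```python
from statistics import mean, pstdev, pvariance, quantiles, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

data_range = max(data) - min(data)

# Quartiles and interquartile range (default "exclusive" method).
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Mean absolute deviation: average absolute distance from the mean.
mu = mean(data)
mad = sum(abs(x - mu) for x in data) / len(data)

pop_var = pvariance(data)    # divides by N
sample_var = variance(data)  # divides by n - 1
pop_sd = pstdev(data)
sample_sd = stdev(data)

# Coefficient of variation, as a percentage of the mean.
cv = pop_sd / mu * 100

# Z scores: how many standard deviations each value lies from the mean.
z_scores = [(x - mu) / pop_sd for x in data]

print(data_range, iqr, mad, pop_var, sample_var, pop_sd, sample_sd, cv)
```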

Empirical rule and Chebyshev’s Theorem

The empirical rule applies when the histogram is bell-shaped (approximately normally distributed). It states that:

Approximately 68% of all observations fall within one standard deviation of the mean.
Approximately 95% of all observations fall within two standard deviations of the mean.
Approximately 99.7% of all observations fall within three standard deviations of the mean.

A more general interpretation of the standard deviation is derived from Chebyshev’s Theorem, which applies to all shapes of histograms (not just bell shaped).

Chebyshev’s Theorem: The proportion of observations in any sample that lie within k standard deviations of the mean is at least

$$1 - \frac{1}{k^2}, \quad k > 1$$

For k = 2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a “lower bound” that holds for any distribution, compared to the Empirical Rule’s approximation (95%), which holds only for bell-shaped data.
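
As a quick check of the bound 1 - 1/k², the sketch below compares it with the fraction of observations that actually fall within k standard deviations of the mean. The helper names and the deliberately skewed sample are made up for illustration.

```python
from statistics import mean, pstdev


def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def fraction_within(data, k):
    """Observed proportion of values strictly within k standard deviations."""
    mu = mean(data)
    sigma = pstdev(data)
    return sum(1 for x in data if abs(x - mu) < k * sigma) / len(data)


skewed = [1, 1, 2, 2, 3, 3, 4, 50]  # far from bell-shaped
for k in (2, 3):
    print(k, chebyshev_lower_bound(k), fraction_within(skewed, k))
```

Even for this heavily skewed sample, the observed fraction never drops below the Chebyshev bound, whereas the empirical rule's percentages would not be reliable here.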

Measures of Shape

1. Skewness

Skewness refers to distortion or asymmetry in a symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified as a representation of the extent to which a given distribution varies from a normal distribution.

The skewness of a distribution is measured by comparing the relative positions of the mean, median and mode.

• A symmetrical distribution implies Mean = Median = Mode.

• A right-skewed distribution implies that the median lies between the mode and the mean, and the mode is less than the mean.

• A left-skewed distribution implies that the median lies between the mode and the mean, and the mode is greater than the mean.

Coefficient of Skewness: A summary measure of skewness is given by

$$S = \frac{3(\mu - M_d)}{\sigma}$$

where μ is the mean, M_d is the median, and σ is the standard deviation.

•If S < 0, the distribution is negatively skewed (skewed to the left)

•If S = 0, the distribution is symmetric (not skewed)

•If S > 0, the distribution is positively skewed (skewed to the right)
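
One widely used summary measure of this kind is Pearson's coefficient of skewness, S = 3 × (mean - median) / standard deviation. The sketch below assumes that definition and uses the population standard deviation; both the function name and the sample data are hypothetical.

```python
from statistics import mean, median, pstdev


def pearson_skewness(data):
    """Pearson's coefficient of skewness: 3 * (mean - median) / std dev."""
    return 3 * (mean(data) - median(data)) / pstdev(data)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]       # long tail to the right
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]  # long tail to the left

for name, sample in [("symmetric", symmetric),
                     ("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)]:
    print(name, round(pearson_skewness(sample), 3))
```

The sign of the result follows the rule above: the extreme value pulls the mean away from the median, in the direction of the long tail.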

2. Kurtosis

Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.

There are three kurtosis types.

– Leptokurtic: The distribution is high and thin, with a sharp peak and heavy tails.

– Mesokurtic: The distribution is normal in shape.

– Platykurtic: The distribution is flat and spread out, with light tails.
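
The post does not give a formula for kurtosis, so the sketch below assumes one common moment-based definition: excess kurtosis = m4 / m2² - 3, where m2 and m4 are the second and fourth central moments. Under this definition a normal distribution scores about 0. The function names, the tolerance, and the samples are all hypothetical.

```python
from statistics import mean


def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (about 0 for normal data)."""
    n = len(data)
    mu = mean(data)
    m2 = sum((x - mu) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mu) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2**2 - 3


def kurtosis_type(data, tolerance=0.1):
    """Label the shape; the tolerance around 0 is an arbitrary choice."""
    k = excess_kurtosis(data)
    if k > tolerance:
        return "leptokurtic"
    if k < -tolerance:
        return "platykurtic"
    return "mesokurtic"


flat = [1, 2, 3, 4, 5]                        # evenly spread values
heavy_tails = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # mostly central, two extremes

print(kurtosis_type(flat), excess_kurtosis(flat))
print(kurtosis_type(heavy_tails), excess_kurtosis(heavy_tails))
```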

Box and Whisker Plot

In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot or box-and-whisker diagram.

Five specific values are used:

–Median, Q2

–First quartile, Q1

–Third quartile, Q3

–Minimum value in the data set

–Maximum value in the data set
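
These five values are often called the five-number summary. A minimal sketch of computing them with the standard library follows; the function name and data are made up, and the `"inclusive"` quartile method is just one of the conventions `statistics.quantiles` supports, so other tools may report slightly different quartiles.

```python
from statistics import quantiles


def five_number_summary(data):
    """Minimum, Q1, median (Q2), Q3, and maximum of a data set."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": min(data),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(data),
    }


summary = five_number_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```

In a box plot, the box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.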