# 8 Important Statistics topics for Data Science

## The “must learn” topics for aspiring data scientists.


Do you aspire to be a data scientist? Do you want to ace your data science interviews? Do you enjoy experimenting with data, analysing it, and drawing conclusions from it? If you answered “YES!” to all of these questions, this story will help you become a better-prepared data scientist.

There are some statistical topics that you will frequently use while working on any data science projects. In this story, I will highlight eight of them that you must learn, and I will provide a little introduction to all the topics.

# Regression:

Regression is a statistical method used in finance, investment, and other fields to identify the strength and character of a relationship between one dependent variable (typically represented by Y) and a sequence of other variables (known as independent variables).

Types of Regression:

Regression is broadly classified into 2 major types:

1. Linear Regression: Linear regression is perhaps the most basic of all statistical methods. It’s a way of modelling the relationship between one variable and another, using a line to describe the relationship. You can use linear regression to predict values of the dependent variable from known values of the independent variable(s). The equation for simple linear regression is y = mx + b, where y is the predicted value of the dependent variable, x is the independent variable, m is the slope and b is the intercept.
2. Logistic Regression: Logistic regression is a statistical technique that takes continuous or categorical inputs and outputs the probability of a categorical target. It’s useful as a classification tool when the target is binary (such as whether a patient has disease X or not, or whether an email is spam or not).
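As a quick sketch of the linear case, here is a simple least-squares fit in Python. The experience/salary numbers are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: years of experience (x) vs. salary in $K (y).
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([40, 45, 52, 58, 63], dtype=float)

# Fit y = mx + b by ordinary least squares.
m, b = np.polyfit(x, y, deg=1)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")

# Predict the salary for 6 years of experience.
y_pred = m * 6 + b
print(f"predicted salary: {y_pred:.1f}K")
```

The fitted slope and intercept are exactly the m and b from the equation above, so a prediction is just a matter of plugging a new x into the line.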

# Central Tendency:

A measure of central tendency (also known as a measure of centre or central location) is a summary measure that aims to summarize an entire collection of data with a single number representing the middle or centre of its distribution.

In statistics, the three most common measures of central tendency are the mean, median and mode.

Mean: The mean is calculated by adding up all the values in a set of data and dividing by the number of values. For example, consider a set of scores on an exam: 70, 80, 90, 95, 100. The mean (average) score is found by adding up all these numbers and dividing by 5: (70 + 80 + 90 + 95 + 100) / 5 = 87.

Median: The median is the middle value in a list of ordered scores or measurements. For example, if we were looking at salaries for five employees in an organization, their salaries might be $35K, $45K, $50K, $60K and $80K. The median salary would be $50K, since it is the middle value of the ordered list.

Mode: The mode is the value that occurs most frequently in a set of data; a dataset can have two or more modes when several values tie for the highest frequency. If every value occurs equally often, there is no mode.
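All three measures are available in Python’s standard statistics module. The scores and salaries below are the examples from this section; the mode data is made up:

```python
from statistics import mean, median, mode

# Mean: exam scores from the example above.
scores = [70, 80, 90, 95, 100]
print(mean(scores))      # (70 + 80 + 90 + 95 + 100) / 5 = 87

# Median: salaries in $K; the middle of the ordered list.
salaries = [35, 45, 50, 60, 80]
print(median(salaries))  # 50

# Mode: the most frequently occurring value.
data = [2, 3, 3, 5, 7]
print(mode(data))        # 3 occurs twice, everything else once
```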

# Dispersion Measure:

Dispersion measures describe the spread of data around a central value (mean, median or mode). They indicate the degree of variability in the data.

The most commonly used measures of dispersion are the range, variance and standard deviation.

The range is the simplest measure of dispersion. It’s simply the difference between the largest and smallest values in your dataset. For example, if you have the values {10, 20, 40} then the range is 40–10= 30.

The variance and standard deviation are similar, but they take into account how far each value is from the mean, rather than just looking at how far apart the two extremes are. For the same dataset {10, 20, 40}, the mean is (10 + 20 + 40) / 3 ≈ 23.33, and the sample variance is calculated as follows:

s² = \frac{(10 - 23.33)² + (20 - 23.33)² + (40 - 23.33)²}{3 - 1}

s² ≈ \frac{466.67}{2} ≈ 233.33

Taking the square root gives the sample standard deviation, s ≈ 15.28. A standard deviation of roughly 15 around a mean of roughly 23 tells us the values are quite spread out relative to their size.
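The same measures can be computed with the statistics module, using the dataset {10, 20, 40} from above:

```python
from statistics import stdev, variance

data = [10, 20, 40]

# Range: difference between the largest and smallest values.
print(max(data) - min(data))  # 40 - 10 = 30

# Sample variance and sample standard deviation.
print(variance(data))  # ≈ 233.33
print(stdev(data))     # ≈ 15.28
```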

# Estimation:

Estimation is the process of determining an accurate value for a set of data points by using their mean and standard deviation. The mean is used to determine the centre of the data set, while the standard deviation is used to determine how far individual values are from the mean.

The data points in a sample have been taken from a larger population of data points, so we can use them to estimate what the whole population would look like if we could measure it directly. In this way, we can use samples to draw conclusions about populations.

In statistics, there are two types of estimation: point estimation and interval estimation. Point estimation uses only one number to represent an unknown quantity, while interval estimation uses two numbers — one representing an upper bound and one representing a lower bound.
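As a sketch, here is a point estimate and a 95% interval estimate of a population mean, computed from a made-up sample using the usual normal-approximation formula (mean ± 1.96 standard errors):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample drawn from a larger population.
sample = [48, 52, 51, 49, 53, 50, 47, 52, 54, 50]
n = len(sample)

point_estimate = mean(sample)        # one number for the unknown mean
se = stdev(sample) / sqrt(n)         # standard error of the mean

# 95% interval estimate: lower and upper bounds around the point estimate.
lower = point_estimate - 1.96 * se
upper = point_estimate + 1.96 * se
print(f"point estimate: {point_estimate:.2f}")
print(f"95% interval: ({lower:.2f}, {upper:.2f})")
```

The point estimate is the single number; the interval gives the lower and upper bounds described above.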

# Hypothesis Testing:

Hypothesis testing is a type of statistical inference that uses sample data to draw conclusions about a population parameter or probability distribution. First, a tentative assumption (the hypothesis) is formed regarding the parameter or distribution.

It is used to determine whether a particular hypothesis is supported by the data, or whether it should be rejected. One common type of hypothesis test is the z-test, which compares a sample mean against a hypothesized population mean, or the means of two groups (such as an experimental group and a control group), when the population standard deviation is known.

The hypothesis test for comparing two means uses the following null and alternative hypotheses:

Null hypothesis — The mean difference between the two groups is equal to zero (H₀).

Alternative hypothesis — The mean difference between the two groups is not equal to zero (H₁).

In order to conduct a hypothesis test, you must first calculate your test statistic. The test statistic is then compared with its critical value from a table of critical values for that particular test statistic and α level. If your calculated value falls in the rejection region (the tail area beyond the critical value), then you reject H₀; otherwise, H₀ cannot be rejected.
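Here is a minimal one-sample z-test in Python. The sample, the hypothesized mean μ₀ = 100, and the known population standard deviation σ = 15 are all made up for illustration:

```python
from math import erf, sqrt

# Test H0: mu = 100 against H1: mu != 100, with known sigma = 15.
sample = [112, 109, 104, 98, 107, 110, 103, 101, 96, 108]
mu0, sigma = 100, 15
n = len(sample)

sample_mean = sum(sample) / n
z = (sample_mean - mu0) / (sigma / sqrt(n))

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Two-sided p-value: probability of a statistic at least this extreme.
p_value = 2 * (1 - normal_cdf(abs(z)))
print(f"z = {z:.3f}, p = {p_value:.4f}")

# At alpha = 0.05, reject H0 when the p-value is below alpha.
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

Comparing the p-value to α is equivalent to checking whether the test statistic falls in the rejection region.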

# Population:

In statistics, a population is the pool of individuals from which a statistical sample is drawn for a study.

Most statistical work is concerned with a population. The term population refers to a group of people or items that have something in common. For example, in an election, we can take the total number of people who voted as our population. In a survey, we can take the total number of people who were interviewed as our population.

Sometimes we have a sample instead of a population. A sample is a part of the population that we use to make inferences about the whole population. For example, if we want to know how people will vote in an election, we can take a sample of voters and interview them about their preferred candidates. We cannot conclude that every voter will behave exactly like the sample, since the sample may carry biases (gender bias, religious bias and so on), but we can still make reasonable inferences about how the full population would vote based on the sample’s responses.
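The idea of estimating a population quantity from a sample can be sketched in a few lines of Python. The population and its 60% support rate are invented for illustration:

```python
import random

random.seed(42)  # make the sketch reproducible

# Hypothetical population of 10,000 voters:
# 1 means "votes for candidate A", 0 means otherwise (true support: 60%).
population = [1] * 6000 + [0] * 4000
random.shuffle(population)

# Draw a sample and use its mean to estimate the population proportion.
sample = random.sample(population, 200)
estimate = sum(sample) / len(sample)
print(f"sample estimate: {estimate:.2f} (true value: 0.60)")
```

The sample estimate will not equal the true proportion exactly, but with a reasonably sized random sample it lands close to it.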

# Scatter Plot:

A scatter plot is a collection of points plotted on two axes, horizontal and vertical. Scatter plots are useful in statistics because they illustrate the extent, if any, of correlation between the values of observed quantities or phenomena (called variables).

It is a graph that shows the relationship between two variables. The data points are plotted on a graph, with the x-axis representing an independent variable and the y-axis representing a dependent variable.

In scatter plots, each point represents a pair of values. The position of an individual point is given by its x-coordinate, which represents one variable, and its y-coordinate, which represents the other. For example, a scatter plot might compare students’ exam scores with their grade point averages (GPA). If higher exam scores tend to go with higher GPAs, there is a positive correlation between the two variables, and you would expect most points on the scatter plot to fall near a line of best fit (also called a trend line) connecting high scores with high GPAs.
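A minimal version of the exam-score/GPA example, with made-up data, might look like this using matplotlib and numpy:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: exam scores (x) vs. GPA (y).
scores = np.array([55, 62, 70, 75, 81, 88, 93])
gpa = np.array([2.1, 2.4, 2.8, 3.0, 3.2, 3.6, 3.8])

plt.scatter(scores, gpa)
plt.xlabel("Exam score")
plt.ylabel("GPA")
plt.title("Exam score vs. GPA")
plt.savefig("scatter.png")

# The correlation coefficient quantifies the strength of the trend:
# values near +1 indicate a strong positive correlation.
r = np.corrcoef(scores, gpa)[0, 1]
print(f"correlation r = {r:.3f}")
```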

# Forecasting:

Simply put, statistical forecasting is the use of statistics based on historical data to anticipate what might happen in the future. This can be applied to any quantitative data, such as stock market outcomes, sales, GDP, housing sales, and so on.

Forecasting is often used in business, finance, and everyday decision-making: any situation where historical data can inform a view of the future.

It involves predicting future outcomes based on past events. The easiest way to think about it is as a prediction of what will happen next based on what happened before. For example, if you wanted to predict how many customers will walk into your store tomorrow, you might look at historical data from previous days and use that information to make your prediction for tomorrow.

There are two main types of forecasting models: time series models and regression models. Time series models predict future values from the past behaviour of the series itself, including patterns such as seasonality (the fact that certain events tend to happen at certain times) and trends (the long-term increasing or decreasing pattern). Regression models predict an outcome from its relationship with one or more explanatory variables, rather than from the passage of time alone.
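The store-customers example above can be sketched as a moving-average forecast, one of the simplest time series models. The daily counts are made up:

```python
# Hypothetical daily customer counts for the past two weeks.
customers = [120, 132, 101, 134, 90, 80, 125,
             118, 130, 104, 138, 95, 85, 129]

def moving_average_forecast(series, window=7):
    """Forecast the next value as the mean of the last `window` values."""
    recent = series[-window:]
    return sum(recent) / len(recent)

forecast = moving_average_forecast(customers)
print(f"forecast for tomorrow: {forecast:.1f} customers")
```

Averaging over a full week smooths out the weekly seasonality in the data; more sophisticated time series models extend this idea with explicit trend and seasonal components.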

# Outro:

I will be sharing stories about programming languages, Data Science, Machine Learning, Artificial Intelligence, and Blockchain. If you like my work, follow me on my socials to stay updated with my life and my work.