Basic Statistics Every Data Scientist Should Know

Data science, Six Sigma, analytics, and business intelligence are all different sides of the same multi-sided polygon. Each has its own tools, vocabulary, projects, and certifications.

However, they all serve the business to reduce costs and increase revenues. These are practical tools that help businesses be more effective at what they do!

Each also comes with its own management practices, and leadership needs to understand the actual value that can be gained from using these tools properly.

We wanted to create a quick guide to help management, and to refresh data scientists’ memories, on some of the concepts that data science relies on.

If you are a data scientist, you should have a basic understanding of statistics. Perhaps you just need to be able to describe a few basic algorithms at a dinner party.

We want to help arm you with some concepts, equations and theorems that will make it sound like you aced your advanced statistical computing course in college.

For instance, what is the probability density function? How about a joint distribution function and what role do those play in modern data science?

The key to knowing any subject well is understanding its base parts.

Libraries like scikit-learn and TensorFlow abstract almost all the complex math away from the user. This is great, in the sense that you don’t have to worry about accidentally forgetting to carry the 1 or remembering how each rule in calculus operates.

It is still great to have a general understanding of some of the equations you can utilize, distributions you can model and general statistics rules that can help clean up your data!

If you would like to read a more in-depth look at some of these statistics, check out this post!

Statistics For Data Science Teams And Leadership

Discrete vs. Continuous

We need to quickly lay out some definitions.

In this post we will talk about discrete variables. If you have not heard the term before, it refers to variables that can only take values from a limited (countable) set.

That set could actually include decimals, depending on the variables you are using. However, the allowed values need to be established up front.

For instance, you can’t have 3.5783123 medical procedures in real life. That doesn’t happen. Even if it is the average, it is misleading.

You can’t really ask what the probability is that someone will have 3.5783123 procedures in real life. They either had 3 or 4.

That is not how we count procedures.

You would actually have to say something along the lines of

In comparison, a continuous variable can’t truly be laid out in a tabular way as above. Instead, a continuous variable has to be described by a formula, as the number of possible values is infinite.

An input variable could be 1, 1.9, 1.99, 1.999, 1.99999, and so on.

Examples of this could be weight, age, etc. You’re not just 25 years old. You are 25 years, 200 days, 1 hour, 3 seconds, 2 milliseconds old, and so on. Technically, it could be any moment in time, and each interval contains infinitely many smaller intervals.

Statistical Distributions

Poisson Distribution

The Poisson distribution is used to calculate the probability of a given number of events occurring in a fixed interval of time or space. For instance, how many phone calls might arrive in a particular time period, or how many people might show up in a queue.

This is really an easy equation to memorize:

P(X = k) = (λ^k · e^(−λ)) / k!

The funny-looking symbol in this equation, λ, is called lambda. It represents the average number of events that occur per interval, and k is the number of events you are asking about.

Another good example, which could be used to calculate loss in manufacturing, would be a machine producing sheets of metal that has X flaws per yard. Say, for instance, the error rate was 2 errors per yard of sheet metal.

What would be the probability that exactly two errors would occur in a yard?

Here is a quick graph that shows the probability of a specific number of errors happening over a specific interval.
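To make the sheet metal example concrete, here is a minimal Python sketch of the Poisson formula. The function name `poisson_pmf` and the text bar chart are our own illustration, not part of any library, and the average of 2 flaws per yard comes from the example above.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average is lam per interval."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

lam = 2.0  # average flaws per yard, taken from the sheet metal example
probabilities = {k: poisson_pmf(k, lam) for k in range(7)}

for k, p in probabilities.items():
    # A crude text "graph": one '#' per percentage point
    print(f"{k} flaws: {p:6.2%} {'#' * round(p * 100)}")
```

With an average of 2, the probability of exactly 2 flaws in a yard works out to roughly 27%.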

Binomial Distribution

The binomial distribution is very common and is one of the first distributions taught in a basic statistics class.

Let’s say you had an experiment, like flipping a coin.

To be more specific, you were running an experiment where you flipped a coin 3 times.

What is the probability distribution of the number of heads you will get?

First, based on combinatorics, we can find out that there are 2³ or 8 possible combinations of results.

Now, suppose we were to plot the probabilities that there would be 0 heads, 1 head, 2 heads, and finally 3 heads as a result.

That would give you your binomial distribution. When graphed, you will notice it looks very similar to your typical normal distribution.

That is because, as the number of trials grows, the binomial distribution gets closer and closer to the normal distribution.

One is the discrete version (e.g. we only had 3 coin flips, so there was a limit to the possible outcomes); the other is continuous.
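The three-flip experiment is small enough to compute directly. The sketch below assumes a fair coin (probability of heads 0.5), which the text implies but does not state, and `binomial_pmf` is a helper name we made up for illustration.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips = 3
p_heads = 0.5  # assumption: a fair coin

distribution = [binomial_pmf(k, n_flips, p_heads) for k in range(n_flips + 1)]

for k, prob in enumerate(distribution):
    # prob * 8 is the number of the 2**3 = 8 outcomes that contain k heads
    print(f"{k} heads: {prob:.3f} ({round(prob * 8)} of 8 outcomes)")
```

The counts of outcomes (1, 3, 3, 1) are what give the plot its symmetric, bell-like shape.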

Probability Density Functions And Cumulative Distribution Functions

Probability Density Function (PDF)

The probability density function, also known as the PDF, is a function that you actually know better than you think if you have taken a basic statistics course.

Do you remember standard deviations?

Do you remember calculating the probability between the average and one standard deviation? Did you know you were kind of implementing a calculus concept called integration?

What is the area underneath the curve?

In this case, the curve could span from −∞ to +∞, or a limited set of numbers such as the faces of a six-sided die.

The total area underneath that curve, though, is 1. Thus, you are calculating the area underneath the curve between 2 points.

Let’s go back to the Poisson example.

We could ask, what is the probability that exactly 2 errors occur in this case? Well, this is kind of a trick question. These variables are discrete rather than continuous.

If the value were continuous, the probability of any single exact value would be 0%!

But because this value is discrete, it is a whole integer. So there are no values in between 1 and 2 or between 2 and 3. Instead, it is about 27% for just 2.

Now if you were to ask about 2 to 3 errors, what would it be? You would simply add the probabilities for 2 and for 3, which comes to about 45%.

The PDF, as well as the next function we will talk about, the cumulative distribution function, can take on both a discrete and a continuous form. (In the discrete case, the PDF is usually called a probability mass function.)

Either way, the purpose is to figure out how much probability falls on a discrete point or within a range of points.
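The difference between the continuous and discrete cases can be sketched in a few lines of Python. This assumes a standard normal curve for the continuous case and the Poisson example with an average of 2 for the discrete case. The helper names are ours.

```python
import math

def normal_cdf(x: float, mean: float = 0.0, std: float = 1.0) -> float:
    """Area under the normal curve from minus infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

# Continuous: probability is the area between two points on the curve
area_mean_to_one_std = normal_cdf(1.0) - normal_cdf(0.0)

# Continuous: a single exact value has zero width, so zero area
area_single_point = normal_cdf(2.0) - normal_cdf(2.0)

# Discrete: add up the probabilities of the individual whole numbers
p_two_or_three = poisson_pmf(2, 2.0) + poisson_pmf(3, 2.0)

print(f"Mean to one standard deviation: {area_mean_to_one_std:.4f}")
print(f"Exactly one point (continuous): {area_single_point:.4f}")
print(f"2 or 3 errors (Poisson, avg 2): {p_two_or_three:.4f}")
```

The first number is the familiar 34% that sits between the average and one standard deviation on a normal curve.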

Cumulative Distribution Function

The cumulative distribution function (CDF) is the integral of the PDF. Both the PDF and the CDF are used to describe the distribution of a random variable.

Cumulative distribution functions tell us the probability that a random variable is less than or equal to a certain value.

As the name suggests, this graph displays the cumulative probability. Thus, when referring to discrete variables, such as a six-sided die, we would have a staircase-looking graph. Each upward step would add ⅙ to the previous probability.

By the end, the 6th step would be at 100%. This states that each face has a ⅙ chance of landing face up, and at the end there is a total of 100% (a CDF should always end at 1, or 100%).
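The staircase for the six-sided die can be built by keeping a running total of the individual probabilities. This is a small sketch using only the Python standard library, and it assumes a fair die.

```python
from fractions import Fraction
from itertools import accumulate

faces = range(1, 7)
pmf = [Fraction(1, 6)] * 6    # assumption: a fair die, each face has a 1/6 chance
cdf = list(accumulate(pmf))   # running total: each step adds 1/6 to the previous one

for face, cumulative in zip(faces, cdf):
    print(f"P(roll <= {face}) = {cumulative} ({float(cumulative):.1%})")
```

Using `Fraction` keeps the steps exact, so the last step lands on exactly 1 rather than a rounded float.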

Accuracy Analysis And Testing Data Science Models

ROC Curve Analysis

The ROC (receiver operating characteristic) curve is very important both in statistics and in data science. It signifies the performance of a test or model by plotting its sensitivity (true positive rate) against its fall-out (false positive rate).

This is crucial when determining the viability of a model.

However, like many great leaps in technology, this was developed due to war…

In World War 2, radar operators needed to be able to detect enemy aircraft. ROC analysis has since moved into multiple fields. It has been used to detect similarities of bird songs, the response of neurons, the accuracy of tests, and much, much more.

How does ROC work?

When you run a machine learning model, you get some inaccurate predictions. Some of these predictions are inaccurate because they should have been labeled true but were labeled false instead.

Others were labeled true when they should have been false.

What is the probability your prediction is correct, then? Predictions and statistics are really just very well supported guesses.

It is important to have an idea of how right you are!

Using the ROC curve, you can see how accurate your predictions are, and by looking at the two curves (the score distributions of the positive and negative classes) you can figure out where to put your threshold.

Your threshold is where you decide whether your binary classification is positive or negative, true or false.

Moving the threshold is also what produces the X and Y values for your ROC curve.

As the two curves get closer and closer together, your ROC curve will lose the area underneath it.

This means your model is less and less accurate, no matter where you put your threshold.

The ROC curve is one of the first tests used when modeling with most algorithms. It helps detect problems early on by telling you whether or not your model is accurate.
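To show how moving the threshold traces out the curve, here is a minimal pure-Python sketch. The labels and scores are made-up illustration data, and `roc_points` and `auc` are hypothetical helper names rather than functions from any library. In practice you would likely reach for a library implementation instead.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest score down to the lowest
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up data: 1 = actually positive, 0 = actually negative
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

curve = roc_points(labels, scores)
print(curve)
print(f"Area under the curve: {auc(curve):.4f}")
```

An area of 1.0 means the two classes are perfectly separated, while an area near 0.5 means the model does no better than guessing.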

Some Theorems and Algorithms

We are not going to spend a lot of time here. Google can point you to plenty of material on every algorithm under the sun!

There are classification algorithms, clustering algorithms, decision trees, neural networks, basic deduction, boolean, and so on. If you have specific questions, let us know!

Bayes Theorem

Alright, this is probably one of the most popular ones, and one that most computer-focused people should know about!

There have been several books in the last few years that have discussed it heavily.

What we personally like about Bayes’ theorem is how well it simplifies complex concepts.

It distills a lot about statistics into very few simple variables.

It fits in with “conditional probability” (e.g. if this has happened, it plays a role in some other action happening).

What we enjoy about it is the fact that it lets you predict the probability of a hypothesis when given certain data points.

Bayes could be used to look at the probability of someone having cancer based on their age, or whether an email is spam based on the words in the message.

The theorem is used to reduce uncertainty. It was used in World War 2 to help predict the locations of U-boats, as well as to work out how the Enigma machine was configured in order to decrypt German codes.

As you can see, it is quite heavily relied on. Even in modern data science we use Bayes and its many variants for all sorts of problems and algorithms!
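The spam example can be worked through with a few lines of Python. Every number below is invented purely for illustration (the share of spam, and how often a given word shows up in each kind of email), and `posterior` is a helper name of our own.

```python
def posterior(prior: float, likelihood: float, likelihood_if_false: float) -> float:
    """Bayes' theorem: P(hypothesis | evidence)."""
    # Total probability of seeing the evidence at all
    evidence = likelihood * prior + likelihood_if_false * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers, not real measurements
p_spam = 0.20              # 20% of all email is spam
p_word_given_spam = 0.60   # the word appears in 60% of spam
p_word_given_ham = 0.05    # the word appears in 5% of legitimate email

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(f"P(spam | word appears) = {p_spam_given_word:.2f}")
```

Under these assumed numbers, seeing the word moves the probability of spam from 20% to 75%, which is the sense in which the theorem reduces uncertainty.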

K-Nearest Neighbor Algorithm

K nearest neighbor is one of the easiest algorithms to understand and implement.

Wikipedia even references it as a type of “lazy learning”.

The concept is less based on statistics and more based on reasonable deduction.

In layman’s terms, it looks for the points closest to each other.

If we are using k-NN on a two-dimensional model, then it typically relies on something we call Euclidean distance (Euclid was a Greek mathematician from very long ago!). An alternative measure is called the Manhattan distance.

That name refers specifically to the 1-norm distance, as it references square city blocks and the fact that cars can only move along one direction at a time.

The point is, the objects and models in this example live in two dimensions, like your classic x, y graph.

k-NN looks at the labeled points closest to the point you want to classify and lets them vote. The number of nearest neighbors it consults is k.

There are specific methodologies for figuring out how large k should be, as this is an input variable that the user or automated data science system must decide.

This model in particular is great for basic market segmentation, feature clustering, and seeking out groups amongst specific data entries.

Most data science libraries allow you to implement this in a few lines of code.
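For readers who want to see the moving parts, here is a small from-scratch sketch of k-NN in two dimensions. The training points and labels are made up, and `knn_predict` is our own helper rather than a library function. It supports both distance measures mentioned above.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)

def manhattan(a, b):
    """City-block (1-norm) distance: sum of the moves along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3, distance=euclidean):
    """Label a query point by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2D training data: ((x, y), label)
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
         ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B")]

print(knn_predict(train, (1.2, 1.4)))                      # near the "A" group
print(knn_predict(train, (6.4, 6.1), distance=manhattan))  # near the "B" group
```

Notice there is no training step at all: the model simply stores the data and does its work at prediction time.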

Bagging/Bootstrap aggregating

Bagging involves creating multiple decision trees, each trained on a different bootstrap sample of the data. Because bootstrapping involves sampling with replacement, some of the original data is left out of each tree’s sample.

Consequently, the decision trees are built from different samples, which helps address the problem of overfitting to any one sample. Ensembling decision trees in this way helps reduce the total error because variance continues to decrease with each new tree added without an increase in the bias of the ensemble.

A bag of decision trees that uses subspace sampling is referred to as a random forest. Only a selection of the features is considered at each node split, which decorrelates the trees in the forest.

Another advantage of random forests is that they have a built-in validation mechanism. Because only a percentage of the data is used for each model, an out-of-bag estimate of the model’s error can be calculated using the roughly 37% of the sample left out of each model.
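The 37% figure is easy to check with a quick simulation. This sketch only covers the bootstrap sampling step, not the trees themselves. The dataset size, the number of repetitions, and the random seed are arbitrary choices of ours.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n row indices with replacement from a dataset of n rows."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag_fraction(n, rng):
    """Fraction of rows that never made it into one bootstrap sample."""
    in_bag = set(bootstrap_sample(n, rng))
    return 1 - len(in_bag) / n

rng = random.Random(0)  # arbitrary seed so the run is repeatable
n_rows = 1000           # arbitrary dataset size

fractions = [out_of_bag_fraction(n_rows, rng) for _ in range(25)]
average = sum(fractions) / len(fractions)

print(f"Average out-of-bag fraction: {average:.3f}")
print(f"Theoretical value (1 - 1/n)^n: {(1 - 1 / n_rows) ** n_rows:.3f}")
```

Each row has a (1 − 1/n)^n chance of being skipped by a sample of size n, which approaches 1/e, or about 37%, as n grows.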

A Basic Data Science Refresher, Now What?

This was a quick run-down of some basic statistical concepts that can help a data science program manager and/or executive have a better understanding of what is running underneath the hood of their data science teams.

Truthfully, some data science teams purely run algorithms through Python and R libraries. Most of them don’t even have to think about the underlying math.

However, being able to understand the basics of statistical analysis gives your teams a better approach.

Having insight into the smallest parts allows for easier manipulation and abstraction.

We do hope this basic data science statistical guide gives you a decent understanding. Please let us know if our team can help you any further!
