Basic Statistics Every Data Scientist Should Know
Statistics for data science teams and leadership
Data science, Six Sigma, analytics, and business intelligence are all different sides of a multi-sided polygon. Each has different tools, vocabularies, projects, and certifications.
However, they all serve the business to reduce costs and increase revenues. These are practical tools that help businesses be more effective at what they do!
And each comes with its own styles of management practice, and leadership needs to understand the actual value that can be gained from using these tools properly.
We wanted to create a quick guide to help management and to refresh data scientists' memories on some of the concepts that data science uses.
If you are a data scientist, you should have at least a basic understanding of statistics. At a minimum, you should be able to describe a few basic algorithms at a dinner party.
We want to arm you with concepts, equations, and theorems that will make it sound like you aced your advanced statistical computing course in college.
For instance, what is the probability density function? What about a joint distribution function, and what role do these play in modern data science?
The key to knowing any subject well is understanding its base parts.
Libraries like scikit-learn and TensorFlow abstract almost all the complex math away from the user. This is great, in the sense that you don’t have to worry about accidentally forgetting to carry the one or remembering how each rule in calculus operates.
It is still great to have a general understanding of some of the equations you can utilize, distributions you can model and general statistics rules that can help clean up your data!
Discrete vs. Continuous
We need to quickly lay out some definitions.
In this piece, we will talk about discrete variables. If you have not heard the term before, it refers to variables that can take only a limited set of values.
That set could actually include decimals, depending on the variables you are using. However, the allowed values need to be established up front.
For instance, you can’t have 3.5783123 medical procedures in real life. That doesn’t happen. Even if it is the average, it is misleading.
You can’t really ask what the probability is that someone will have 3.5783123 procedures in real life. That is not how we count procedures. They either had 3 or 4.
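You would actually have to state the probability of each whole number of procedures separately: the probability of exactly 3, the probability of exactly 4, and so on, which can be laid out as a table with one row per value.
By comparison, a continuous variable can’t truly be laid out in a table like that.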
Instead, a continuous variable has to be described by a formula, because it can take infinitely many values.
An input variable could be 1, 1.9, 1.99, 1.999, 1.99999, and so on.
Examples of this could be weight, age, etc. You’re not simply 25 years old. You are 25 years, 200 days, 1 hour, 3 seconds, 2 milliseconds old, and so on. Technically, it could be any moment in time, and each interval contains infinitely many smaller intervals.
Poisson Distribution
The Poisson distribution is used to model the number of events that occur in a fixed interval of continuous time. For instance, how many phone calls might arrive in a particular time period, or how many people might show up in a queue.
This is really an easy equation to memorize.
The funny-looking symbol in this equation, λ, is called lambda. It represents the average number of events that occur per interval.
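For reference, the standard Poisson probability mass function can be written as follows, where k (a symbol introduced here for illustration) is the number of events observed in the interval:

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots
```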
Another good example, one that could be used to calculate losses in manufacturing, is a machine producing sheets of metal with X flaws per yard. Say, for instance, the error rate was 2 errors per yard of sheet metal.
What would be the probability that exactly two errors occur in a yard?
The above graph shows the probability of a specific number of errors happening over a specific interval.
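The sheet-metal question can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `poisson_pmf` and the variable `flaws_per_yard` are chosen here for illustration and are not from any particular library.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average is lam per interval."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)


# Sheet-metal example from the text: an average of 2 flaws per yard.
flaws_per_yard = 2.0
for k in range(6):
    print(f"P({k} flaws in one yard) = {poisson_pmf(k, flaws_per_yard):.4f}")
```

With an average of 2 flaws per yard, the probability of exactly 2 flaws comes out at roughly 0.27.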
Binomial Distribution
The binomial distribution is very common and is one of the first distributions taught in a basic statistics class.
Let’s say you had an experiment, like flipping a coin.
To be more specific, you were conducting an experiment where you flipped a coin 3 times.
What is the probability distribution of the number of times your coin lands on heads?
First, based on combinatorics, we can find that there are 2³ or 8 possible sequences of results.
Now, suppose we plot the probabilities of getting 0 heads, 1 head, 2 heads, and finally 3 heads.
That gives you your binomial distribution. When graphed, it looks very similar to your typical normal distribution.
That is because, as the number of trials grows, the binomial distribution is closely approximated by the normal distribution.
One is discrete (e.g., we only had 3 coin flips, so there is a limit to the outcomes); the other is continuous.
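The three-flip experiment can be worked through in code. The sketch below assumes a fair coin (probability of heads 0.5); `binomial_pmf` is a helper name made up for this example.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))


# Three flips of a fair coin, as in the text.
n_flips, p_heads = 3, 0.5
distribution = {k: binomial_pmf(k, n_flips, p_heads) for k in range(n_flips + 1)}
for heads, prob in distribution.items():
    print(f"P({heads} heads) = {prob}")
```

Of the 8 equally likely sequences, 1 has no heads, 3 have one head, 3 have two heads, and 1 has three heads, which is where the symmetric, bell-like shape comes from.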
Probability Density Function and Cumulative Distribution Function
Probability Density Function (PDF)
The probability density function, also known as PDF, is a function that you actually know better than you think if you have taken a basic statistics course.
Do you remember standard deviations?
Do you remember calculating the probability that a value falls between the mean and one standard deviation above it? Did you know you were kind of implementing a calculus concept called the integral?
What is the area underneath the curve?
In this case, the curve can extend from -∞ to +∞ or cover a limited set of numbers.
Either way, the total area underneath the curve is one. Thus, you are calculating the area underneath the curve between two points.
Let’s go back to the Poisson example.
We could ask, what is the probability that exactly two errors occur in this case? Well, this is kind of a trick question. These variables are discrete rather than continuous.
If the value were continuous, the probability of any single exact value would be 0%!
But because this value is discrete, it is a whole integer. So there are no values in between 1 and 2 or between 2 and 3. Instead, the probability is about 27% for exactly 2.
Now, if you were to ask about the range from 2 to 3, what would it be?
The PDF, as well as the next function we will talk about, the Cumulative Distribution Function, can take on both discrete and continuous forms (the discrete form of the PDF is usually called a probability mass function).
Either way, the purpose is to describe how much probability falls on a discrete point or within a range of points.
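To make the "area under the curve" idea concrete, here is a rough sketch that numerically integrates a PDF. It assumes a standard normal distribution (mean 0, standard deviation 1) as the example curve, and the helper names `normal_pdf` and `area_under_pdf` are invented for illustration.

```python
import math


def normal_pdf(x: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Density of the normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def area_under_pdf(a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of the standard normal PDF on [a, b] (trapezoid rule)."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a) + normal_pdf(b))
    total += sum(normal_pdf(a + i * width) for i in range(1, steps))
    return total * width


print(f"P(mean <= X <= mean + 1 sd) ~ {area_under_pdf(0.0, 1.0):.4f}")
print(f"P(X is exactly 1.0)         = {area_under_pdf(1.0, 1.0):.4f}")
```

The first line gives the familiar figure of about 34% between the mean and one standard deviation, while the second shows why a single exact value of a continuous variable has probability zero: an interval of zero width has no area.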
Cumulative Distribution Function
The cumulative distribution function is the integral of the PDF (or, for a discrete variable, the running sum of its probabilities). Both the PDF and CDF are used to describe the distribution of a random variable.
Cumulative Distribution Functions tell us the probability that a random variable is less than or equal to a certain value.
As the name suggests, its graph displays the cumulative probability. Thus, for a discrete variable, such as the roll of a six-sided die, we would have a graph resembling a staircase. Each upward step would add ⅙ to the previous probability.
By the end, the sixth step would be at 100%. This states that each face has a ⅙ chance of landing face up and that, at the end, the total is 100% (a CDF should always end at 1, or 100%).
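The staircase for a fair six-sided die can be built as a running total. This is a small sketch using exact fractions from the standard library; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * len(faces)   # each face is equally likely
cdf = list(accumulate(pmf))           # running total of the probabilities

for face, cumulative in zip(faces, cdf):
    print(f"P(roll <= {face}) = {cumulative}")
```

Each step is ⅙ higher than the one before it, and the last step reaches exactly 1.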
Accuracy Analysis and Testing Data Science Models
ROC Curve Analysis
The ROC (receiver operating characteristic) curve is very important both in statistics and in data science. It shows the performance of a test or model by plotting its sensitivity (true positive rate) against its fall-out (false positive rate).
This is crucial when determining the viability of a model.
Like many great leaps in technology, this was developed due to war.
In World War II, it was needed to help detect enemy aircraft. Its usage has since spread into multiple fields. It has been used to detect similarities of bird songs, the response of neurons, the accuracy of tests, and much, much more.
How does ROC work?
When you run a machine learning model, some of its predictions will be inaccurate. Some are inaccurate because the item should have been labeled true but was instead labeled false (a false negative).
Others were labeled true when they should have been false (a false positive).
Since predictions and statistics are really just very well supported guesses, what is the probability your prediction is correct?
It is important to have an idea of how right you are!
Using the ROC curve, you can see how accurate your predictions are, and by looking at the two overlapping distributions of scores (one for the positive class and one for the negative class) you can figure out where to put your threshold.
Your threshold is where you decide whether your binary classification is positive or negative, true or false.
Moving it is also what produces the X and Y values of your ROC curve.
As the two distributions get closer and closer together, your curve will lose the area underneath it.
This means your model is less and less accurate, no matter where you put your threshold.
The ROC curve is one of the first tests used when modeling with most algorithms. It helps detect problems early on by telling you whether or not your model is accurate.
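The threshold sweep described above can be sketched in plain Python. The labels and scores below are hypothetical values invented for this example, and `roc_points` and `auc` are illustrative helper names rather than any library's API.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start above every score, so nothing is predicted positive: the point (0, 0).
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels (1 = positive) and model scores, invented for illustration.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(f"AUC = {auc(curve):.4f}")
```

For this made-up data the area under the curve is 0.75. A model that guesses at random would sit near 0.5, and a perfect one would reach 1.0.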
Theorems and Algorithms
We are not going to spend a lot of time here. Google has loads of information on every algorithm under the sun!
There are classification algorithms, clustering algorithms, decision trees, neural networks, basic deduction, boolean, and so on. If you have specific questions, let us know!
Bayes' Theorem
Alright, this is probably one of the most popular theorems, and one that most computer-focused people should know about!
There have been several books in the last few years that have discussed it heavily.
What we personally like about Bayes' theorem is how well it simplifies complex concepts.
It distills a lot about statistics into a few simple variables.
It fits in with “conditional probability” (e.g., if this has happened, it affects the probability of some other event happening).
What we enjoy about it is the fact that it lets you predict the probability of a hypothesis when given certain data points.
Bayes could be used to look at the probability of someone having cancer based on their age or if an email is spam based on the words in the message.
The theorem is used to reduce uncertainty. It was used in World War II to help predict the location of U-boats, as well as to work out how the Enigma machine was configured to encrypt German messages.
As you can see, it is quite heavily relied on. Even in modern data science, we use Bayes and its many variants for all sorts of problems and algorithms!
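As a rough illustration of the spam example, Bayes' theorem says P(H | D) = P(D | H) × P(H) / P(D). The sketch below applies it with hypothetical numbers that were invented purely for this example; the function name `posterior` is also just illustrative.

```python
def posterior(prior: float, likelihood: float, false_alarm: float) -> float:
    """P(H | D) from P(H), P(D | H) and P(D | not H) via Bayes' theorem."""
    evidence = likelihood * prior + false_alarm * (1 - prior)  # P(D)
    return likelihood * prior / evidence


# Hypothetical numbers, invented for illustration only:
p_spam = 0.2               # P(spam): share of all email that is spam
p_word_given_spam = 0.6    # P(word | spam): spam messages containing the word
p_word_given_ham = 0.05    # P(word | not spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(f"P(spam | message contains the word) = {p_spam_given_word:.3f}")
```

Under these assumed rates, seeing the word raises the probability of spam from 20% to 75%, which is the sense in which the theorem reduces uncertainty as data arrives.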
K-Nearest Neighbor Algorithm
K nearest neighbor is one of the easiest algorithms to understand and implement.
Wikipedia even describes it as a type of “lazy learning”.
The concept is less based on statistics and more based on reasonable deduction.
In layman's terms, it looks for the known data points closest to a new one and assumes the new point belongs with them.
If we are using k-NN on a two-dimensional model, then it typically relies on something called Euclidean distance (Euclid was a Greek mathematician from very long ago!).
The alternative is 1-norm distance, also known as Manhattan or taxicab distance, so called because it resembles travel along square city streets, where cars can only move in one direction at a time.
The point is, the objects and models in this space rely on two dimensions, like your classic x, y graph.
k-NN classifies a new point by looking at a specified number of its closest labeled neighbors and taking a majority vote. That specified number of neighbors is k.
There are specific methodologies for figuring out how large k should be, as this is an input that the user or automated data science system must decide.
This model, in particular, is great for basic classification tasks, such as assigning a new customer to an existing market segment based on the most similar known customers.
Most data science libraries allow you to implement this in one to two lines of code.
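To show what such a library call does underneath, here is a minimal from-scratch sketch of k-NN classification with Euclidean distance and a majority vote. The data points, labels, and function names are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line (2-norm) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: euclidean(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D data, invented for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))
print(knn_predict(points, labels, query=(6.0, 6.5), k=3))
```

The first query sits among the "A" points and the second among the "B" points, so each takes the label of its neighbors. Sorting every training point for each query is fine for a sketch, though real libraries typically use faster neighbor-search structures.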
Bagging
Bagging involves creating multiple models of a single algorithm, such as a decision tree, each trained on a different bootstrap sample of the data. Because bootstrapping involves sampling with replacement, some of the data is left out of the sample used for each tree.
Consequently, the decision trees are built from different samples, which helps address the problem of overfitting to any one sample. Ensembling decision trees in this way helps reduce the total error because variance tends to decrease as trees are added, without a corresponding increase in the bias of the ensemble.
A bag of decision trees that uses subspace sampling is referred to as a random forest. Only a selection of the features is considered at each node split, which decorrelates the trees in the forest.
Another advantage of random forests is that they have an in-built validation mechanism. Because only a portion of the data is used for each model, an out-of-bag estimate of the model’s error can be calculated using the roughly 37% of the sample left out of each model.
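The 37% figure can be checked with a quick simulation. This sketch draws a single bootstrap sample from a hypothetical dataset of 10,000 row ids (the size and seed are arbitrary choices made here) and counts how many rows were never picked.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_rows = 10_000
row_ids = range(n_rows)

# A bootstrap sample draws n_rows rows *with replacement*.
bootstrap = [random.choice(row_ids) for _ in range(n_rows)]
out_of_bag = set(row_ids) - set(bootstrap)

oob_fraction = len(out_of_bag) / n_rows
print(f"Out-of-bag fraction: {oob_fraction:.3f}")
print(f"Theoretical value (1 - 1/n)^n: {(1 - 1 / n_rows) ** n_rows:.3f}")
```

Each row is missed by a single draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e, or about 0.368, as n grows.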
A Basic Data Science Refresher, Now What?
This was a basic run-down of some basic statistical concepts that can help a data science program manager or executive better understand what is running underneath the hood of their data science teams.
Truthfully, some data science teams purely run algorithms through Python and R libraries. Most of them don’t even have to think about the underlying math.
However, being able to understand the basics of statistical analysis gives your teams a better approach.
Having insight into the smallest parts allows for easier manipulation and abstraction.
We do hope this basic data science statistical guide gives you a decent understanding. Please let us know if our team can help you any further!