Linear regression is an extremely powerful tool for making predictions. But how do we know that we can trust the results we get from linear regression?

We get the best, most reliable results only if three conditions are met. Without them, we will get biased and unreliable predictions. In this blog, I will discuss the three assumptions and how you can check for them to ensure that you can trust the results when performing linear regression. For simplicity's sake, I will assume a simple linear regression.

The first assumption may be the most obvious. Linearity means that there…
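One common way to check linearity is to fit a line and look at the residuals. This is a minimal sketch under my own assumptions, not necessarily the check the full post uses. If the relationship really is linear, the residuals sit around zero with no pattern; a curve in the residuals suggests the assumption is violated. The helper name `fit_line_residuals` and the toy data are hypothetical.

```python
def fit_line_residuals(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares; return residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical toy data: one truly linear relationship, one curved.
xs = list(range(11))
linear_ys = [2 * x + 1 for x in xs]
curved_ys = [x ** 2 for x in xs]

linear_residuals = fit_line_residuals(xs, linear_ys)
curved_residuals = fit_line_residuals(xs, curved_ys)

# Linear data leaves residuals of essentially zero; the curved data leaves a
# U-shaped pattern (positive at the ends, negative in the middle).
print([round(r, 2) for r in linear_residuals])
print([round(r, 2) for r in curved_residuals])
```

In practice you would plot the residuals against the fitted values and look for this kind of systematic shape.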

In my most recent blog, I discussed the two most common metrics in decision trees: entropy/information gain and the Gini index. In this post, I will discuss how to use Python to code a decision tree and the dangers that can arise when using decision trees.

To begin coding our trees, let’s assume that we have a Pandas data frame called `df` with a categorical target variable. In addition to Pandas, you should also import the following to create the decision tree.

`from sklearn.model_selection import train_test_split`

`from sklearn.tree import DecisionTreeClassifier`

Note that in this particular case I am using a…
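Putting the two imports to work might look like the sketch below. The contents of `df`, the column names, the split size, and the `max_depth` value are all assumptions made for illustration; the original post's data and settings are not shown in this excerpt.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for `df`: two numeric features and a categorical target.
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "feature_b": [5, 3, 6, 2, 7, 1, 8, 2, 9, 1, 7, 3],
    "target": ["no"] * 6 + ["yes"] * 6,
})

# Separate the features from the target column.
X = df.drop(columns="target")
y = df["target"]

# Hold out a quarter of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# max_depth is an assumed setting; capping depth is one common guard
# against a tree that memorizes the training data.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

predictions = tree.predict(X_test)
accuracy = tree.score(X_test, y_test)
print(predictions, accuracy)
```

Scoring on held-out rows rather than on the training rows is what lets you notice when a tree has overfit.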

In the simplest way possible, a decision tree is a flow chart that your computer has generated to make a prediction. That is a nice overview of decision trees, but how do we create this flow chart, and why is it important? This is my introductory blog on decision trees, where I will give an overview of what decision trees are and the two main metrics used to create them.

A decision tree is a directed acyclic graph (DAG). This means that all information flows in one direction and it never backtracks on itself. …
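As a rough illustration of the two metrics, entropy and the Gini index, here is a minimal sketch using their standard textbook definitions. The function names and the toy label lists are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )


def gini(labels):
    """Gini index: 1 - sum(p ** 2) over the class proportions."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


# Hypothetical nodes: one evenly mixed, one containing a single class.
mixed = ["yes", "yes", "no", "no"]
pure = ["yes", "yes", "yes", "yes"]

print(entropy(mixed), gini(mixed))
print(abs(entropy(pure)), gini(pure))
```

Both metrics are at their highest for an evenly mixed node and fall to zero for a pure one, which is why a tree prefers splits that lower them.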

The Z-test and T-test are both used for hypothesis testing. Using a quick block of code, it is easy to generate results from these tests, but what is going on behind the scenes? In this blog, I will discuss the math and assumptions behind each of these tests. I will assume you understand the basics of running a hypothesis test so that I can focus on the math.

A Z-test is used when we want to measure whether a sample comes from a specified population. In other words, is the sample different from what we would expect from a given population?
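For a concrete feel, here is a minimal sketch of a two-sided one-sample Z-test using the standard formula: the sample mean minus the population mean, divided by the population standard deviation over the square root of the sample size. It assumes the population standard deviation is known. The function name and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, population_mean, population_std):
    """Return the z statistic and two-sided p-value for a one-sample Z-test."""
    standard_error = population_std / sqrt(len(sample))
    z = (mean(sample) - population_mean) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical sample, compared with a population of mean 100 and std 6.
sample = [102, 98, 105, 110, 99, 104, 101, 107, 103]
z, p_value = one_sample_z_test(sample, population_mean=100, population_std=6)
print(round(z, 3), round(p_value, 3))
```

A small p-value would suggest the sample is unlikely to have come from the specified population.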

…

There are a huge number of statistical tools and machine learning models that assume a normal distribution. What is the normal distribution, and what other distributions are out there? In this blog post, I will describe some of the most common types of distributions.

Every distribution will either be discrete or continuous so it is important to define these two terms first.

A distribution that is discrete is one whose outcomes can take only distinct, countable values. Think of rolling a die. When you roll a die you can have 1, 2, 3, 4, 5, or 6. However, you cannot roll a 1.9…
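The die example can be simulated in a few lines. This is only a sketch: the seed and the number of rolls are arbitrary choices of mine.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed so the run is repeatable

# Simulate rolling a fair six-sided die many times.
rolls = [random.randint(1, 6) for _ in range(6000)]
counts = Counter(rolls)

# Empirical probability mass function: the share of rolls landing on each face.
pmf = {face: counts[face] / len(rolls) for face in sorted(counts)}
print(pmf)
```

Every roll lands on one of six whole-number faces, and the shares for those faces add up to one. A continuous variable, by contrast, could take any value in a range.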

Data scientists generally work in the back end. Building an interactive app is not a strength of mine, but it can be a great way to show off the work that you can do. In this post, I will discuss some tips for creating a Flask app that I learned while building a website for my Travel Recommendation System.

Before you even start thinking about beautifying your app, you should have all your functions in a separate .py file as opposed to within a Jupyter Notebook. …

If you have ever watched Mythbusters, you may have noticed that rather than diving straight into the myth, they run a control test first. They run the control so that they can see how much of a difference the conditions of the myth make to the result. In a sense, they are attempting to see if the myth truly does cause the result that is experienced, or if it would happen without the myth conditions.

While a Mythbusters experiment is a bit more extreme than everyday business, a control is extremely important when a business wants to see an observable…

Imagine that you are a lab scientist and, after running one test, you have shown that the accepted value for gravity is wrong. Are you going to accept those results? No! One of the principles of science is that your results should be reproducible. So instead of accepting a new value for gravity, you will re-run your test to validate that your new value for gravity is correct. Or you may find that something was wrong with that first test because it was not reproducible, and throw out your findings. …

Living in the age of the internet, recommender systems are all around us. Social media, ads, online retailers, music/video streaming, even services such as Stitch Fix use recommender systems to deliver results to their customers. If you are an entrepreneur with items/products/services to sell, you may also want to use a recommender system. One of the most well-known types of recommender systems is the alternating least-squares (ALS) model, available through Spark. While it is a very effective model, there is a major downside: you need to have information about which products/services your intended users are already interested in. This can be quite…

In biology, it is understood that a single gene is usually not the sole reason for a specific phenotype. Rather, it is a collection of genes working together that creates that phenotype. It is potentially more valuable to discover this collection of genes than to find one gene that is correlated with an outcome. With this collection of genes identified, more research can be completed to uncover how these genes interact to create that specific outcome.

Should a single gene be determined to be significant, it is much harder to determine which other genes interact with it. In addition…

Data Scientist