For those catching up here, bootstrap sampling refers to sampling a given dataset ‘with replacement’, and this is where most people get lost: you take many such samples, compute your statistic on each, and build a distribution from which to mark your confidence interval.

Let’s take a quick example.

Let’s say that you want to find out how the general population at a college feels about cryptocurrency. You likely won’t be able to gather responses from everybody in the school; what will probably happen is that you’ll distribute a survey and get back a handful of responses that you hope are indicative of…
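The resampling procedure described above can be sketched in a few lines of Python. The survey data here is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative "survey": 100 responses on a 1-10 sentiment scale
responses = rng.integers(1, 11, size=100)

# Draw many bootstrap samples (with replacement) and record each mean
boot_means = [
    rng.choice(responses, size=len(responses), replace=True).mean()
    for _ in range(10_000)
]

# The 2.5th and 97.5th percentiles mark a 95% confidence interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```

The key detail is `replace=True`: each bootstrap sample is the same size as the original, but individual responses can appear more than once.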

Logistic regression can be pretty difficult to understand! As such, I’ve put together a very intuitive explanation of the why, what, and how of logistic regression. We’ll start with some building blocks that should lend themselves to a clearer understanding, so hang in there! Through the course of the post, I hope to send you on your way to understanding, building, and interpreting logistic regression models. Enjoy!

Logistic regression is a very popular approach to predicting or understanding a binary variable (hot or cold, big or small, this one or that one — you get the idea). …
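As a quick concrete sketch of what that looks like in practice, here’s a minimal logistic regression fit with scikit-learn; the data is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: one feature, with the positive class
# becoming more likely as x grows
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Predicted probability of the positive class for a new observation
print(model.predict_proba([[1.0]])[0, 1])
```

Rather than predicting the binary label directly, the model outputs a probability between 0 and 1, which is what makes it so useful for interpretation.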

If you’re not using XTS objects to perform your forecasting in R, then you are likely missing out! The major benefit, which we’ll explore throughout, is that these objects are a lot easier to work with when it comes to modeling, forecasting, & visualization.

XTS objects are composed of two components: the first is a date index, and the second is a traditional data matrix.
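The post itself works in R, but the idea of pairing a regular date index with a data matrix translates across languages; purely for illustration, here is the analogous structure sketched in Python with pandas (all names here are made up):

```python
import numpy as np
import pandas as pd

# A regular monthly date index, analogous to the date index of an xts object
idx = pd.date_range(start="2020-01-01", periods=12, freq="MS")

# Pair the index with a column of data, analogous to the data matrix
sales = pd.Series(np.arange(12.0), index=idx, name="sales")
print(sales.head())
```

Once the dates live in the index rather than in an ordinary column, slicing, resampling, and plotting by time all become one-liners, which is the same convenience xts buys you in R.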

Whether you want to predict churn, sales, demand, or whatever else, let’s get to it!

The first thing you’ll need to do is create your date index. We do so using the `seq`

…

Regression is a staple in the world of data science, and as such it’s useful to understand it in its simplest form.

I recently wrote a post that went into more detail on regression. You can find that here. Following on the ideas we explored there, today we will be exploring the creation of regression models where the explanatory variable is categorical.

As I mentioned, it’s important to have a good understanding of the application & methodology from the ground up. …
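To make the idea concrete, here’s a minimal sketch of regressing a numeric response on a categorical explanatory variable, using a tiny made-up dataset. The standard trick is to one-hot encode the category (dropping one level as the baseline):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: a categorical explanatory variable and a numeric response
df = pd.DataFrame({
    "color": ["red", "blue", "blue", "red", "green", "green"],
    "price": [10.0, 20.0, 22.0, 11.0, 15.0, 14.0],
})

# One-hot encode the category, dropping one level to avoid collinearity
X = pd.get_dummies(df["color"], drop_first=True)
model = LinearRegression().fit(X, df["price"])

# Each coefficient is the mean difference from the dropped baseline level
print(dict(zip(X.columns, model.coef_)))
```

With a single categorical predictor, each fitted coefficient is simply the difference between that group’s mean response and the baseline group’s mean, which makes these models very easy to interpret.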

Version control is all about managing changes to files and directories by one or many contributors. Git is an incredibly popular system for version control and the one we will be running through for this course.

There are many benefits to version control, and to Git specifically: a view of the historical changes made to your project; automatic notification of conflicting work, where two individuals effectively write conflicting lines of code; and support for collaboration across many individuals, which allows teams to grow.

Version control is a staple of software engineering and something that is slowly being adopted across data science teams…

Welcome to this lesson on calculating p-values.

Before we jump into how to calculate a p-value, it’s important to think about what the p-value is really for.

Without going into too much detail for this post, when establishing a hypothesis test, you will determine a null hypothesis. Your null hypothesis represents the world in which the two variables you’re assessing don’t have any relationship. Conversely, the alternative hypothesis represents the world where there is a statistically significant relationship, such that you’re able to reject the null hypothesis in favor of the alternative.
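As a quick taste of what calculating a p-value looks like in code, here’s a sketch using a two-sample t-test on made-up data, where the null hypothesis is that the two group means are equal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two illustrative samples: group b's mean is shifted up by 1.0
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=1.0, scale=1.0, size=100)

# Null hypothesis: the two group means are equal
t_stat, p_value = stats.ttest_ind(a, b)

# A small p-value is evidence against the null hypothesis
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Because the data was generated with genuinely different means, the p-value here comes out well below the conventional 0.05 threshold, and we’d reject the null.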

Before we move on from the…

When it comes to your typical product or engineering org, team members are often left wondering whether the thing they did had an impact, or whether the option they chose among many different designs was actually the best. As these organizations move towards data-informed design decisions, AB testing is first in line.

AB testing is a methodology for comparing multiple versions of a feature, a page, a button, etc. by showing the different versions to customers or prospective customers and assessing the quality of interaction by some metric (click-through, purchase, following any call to action…
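To show where the statistics comes in, here’s a minimal sketch of comparing click-through counts for two versions with a chi-squared test of independence; the counts are invented for illustration:

```python
from scipy.stats import chi2_contingency

# Illustrative counts: clicks and non-clicks for versions A and B
a_clicks, a_total = 120, 2400
b_clicks, b_total = 165, 2400

table = [
    [a_clicks, a_total - a_clicks],
    [b_clicks, b_total - b_clicks],
]

# Chi-squared test of independence: does version affect click-through?
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value here suggests the difference in click-through rate between the two versions is unlikely to be due to chance alone.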

Over the course of this post, we’re going to learn about using simulation to understand probability and we’ll use the classic example of the Monty Hall gameshow problem.

Monty Hall had a gameshow back in the day, where he showcased the following problem. If you’re not familiar with him or the game, it was also referenced in 2008’s 21. Anyways, here’s the problem.

He would give his contestant three doors to choose from. Behind two of the doors were goats… but behind one of them was a sports car.

So let’s say it’s you on the…
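To preview where the simulation lands, here’s a minimal sketch of the game in Python: pick a door, have Monty open a goat door, and then either stay or switch, over many trials:

```python
import random

random.seed(0)

def play(switch, trials=100_000):
    """Simulate the Monty Hall game; return the fraction of wins."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)   # door hiding the car
        pick = random.randrange(3)  # contestant's first pick
        # Monty opens a goat door that isn't the contestant's pick
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print("stay:  ", play(switch=False))  # ≈ 1/3
print("switch:", play(switch=True))   # ≈ 2/3
```

The simulation lets you see the famous result directly: staying wins about a third of the time, while switching wins about two-thirds of the time.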

Today I want to break down the central limit theorem and how it relates to so much of the work that a data scientist performs.

First things first, a core tool for any data scientist is a very simple chart type called a histogram. While you’re sure to have seen many a histogram, we often look past its significance. The core purpose of a histogram is to understand the distribution of a given dataset.

As a refresher, a histogram represents the number of occurrences on the y-axis of different values of a variable, found on the x-axis.

Here is an…
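The pattern the post is building toward can be sketched quickly: take many samples from a decidedly non-normal population, and look at the distribution of the sample means. The population here is synthetic, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# A decidedly non-normal population: exponential (skewed right)
population = rng.exponential(scale=2.0, size=100_000)

# Take many samples and record the mean of each
sample_means = [
    rng.choice(population, size=50).mean() for _ in range(5_000)
]

# The sample means cluster tightly around the population mean, and
# a histogram of them looks approximately normal (bell-shaped)
print(np.mean(sample_means), np.std(sample_means))
```

Even though the population itself is heavily skewed, a histogram of `sample_means` comes out bell-shaped, which is the central limit theorem in action.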

Without diving into the specifics just yet, it’s important that you have some foundational understanding of decision trees.

From the evaluation approach of each algorithm to the algorithms themselves, there are many similarities.

If you aren’t already familiar with decision trees I’d recommend a quick refresher here.

With that said, get ready to become a bagged tree expert! Bagged trees are famous for improving the predictive capability of a single decision tree and are an incredibly useful algorithm for your machine learning tool belt.

The main idea behind bagged trees is that rather than depending on a single decision tree, you…
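As a quick sketch of the comparison the post sets up, here is a single decision tree next to an ensemble of bagged trees in scikit-learn, on a synthetic classification dataset (scikit-learn’s `BaggingClassifier` bags decision trees by default):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A single decision tree vs. an ensemble of bootstrapped (bagged) trees
tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=100, random_state=0)

print("single tree :", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```

Each of the 100 trees is trained on a bootstrap sample of the data, and their predictions are averaged, which typically reduces the variance of a single deep tree.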

I’m a Senior Data Scientist & Data Science Leader sharing lessons learned & tips of the trade! Twitter: @data_lessons, Linkedin: linkedin.com/in/robertwoodiii