I’ve been going over many, *many* Reddit posts, Glassdoor salary profiles, and forum threads to find one place where everyone can see exactly how much a Data Scientist actually makes. Because right now, I don’t think there is a good answer that is easily available. And the answers I’ve been seeing vary *a lot.*

Is $80,000 a good place to start as a junior? What about my total package? Or should I be pushing for $150,000 and the corner office?

So with these questions in mind, why not have a place where everyone can see and contribute, so we’re in…

**Preface**

Cleaning data is just something you’re going to have to deal with in analytics. It’s not great work, but it has to be done so you can *produce* great work.

I’ve spent so much time writing and rewriting functions to help me clean data that I wanted to share some of what I’ve learned along the way. If you haven’t gone over this post on how to better organize data science projects, check it out, as it will help frame some of the concepts I’m going over below.

After starting to organize my code better, I’ve started keeping…

One thing prevalent in most data science departments is messy notebooks and messy code. There *are* examples of beautiful notebooks out there, but for the most part notebook code is rough… really rough. Not to mention all the files, functions, visualizations, reporting metrics, etc. scattered through files and folders with no real structure to them.

I have been guilty of going back to a previous project, searching for the `matplotlib` function I’ve written 50 times before that creates a certain graph with a certain metric. And I found it in cell 200 after searching through a few different notebooks…
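This is exactly the kind of function that belongs in a shared module instead of being rewritten in every notebook. As a small sketch of what I mean (the helper name and the "metric over time" plot are my own placeholder example, not the original project's code):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe outside notebooks
import matplotlib.pyplot as plt

def plot_metric(values, metric_name, ax=None):
    """Reusable line plot for a single metric over time.

    Lives in one shared module, so every notebook imports it
    instead of redefining it in cell 200.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    ax.plot(range(len(values)), values, marker="o")
    ax.set_xlabel("step")
    ax.set_ylabel(metric_name)
    ax.set_title(f"{metric_name} over time")
    return ax
```

In a notebook this becomes a one-liner: `plot_metric(history, "loss")`.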

**Preface**

If you’re not familiar with PyData, it is a non-profit organization that provides a forum for developers to share ideas and learn from each other, specifically around Python and the topic of data.

They are also a great resource for online tutorials and lectures about everything to do with data and Python, which you can find here.

The video I want to go over is titled “So you want to be a Python expert” by James Powell, from his talk at PyData Seattle 2017. …

**Preface**

While I was thinking about this post I came across this post and subsequent repos, which helped solidify a lot of the concepts here. William Koehrsen, in particular, has some great posts on Medium, and I highly suggest you check out his other articles. He was also quite helpful in working through some of the problems I had when trying to use distributed computing with `featuretools`, so thank you for all your help, William.

**Preprocessing Takes Time**

So the reality of data science, or any sort of work that revolves around data, is that a majority of the time is spent…

I had a colleague ask me recently what the purpose of an F-test was, and I wanted to provide him with some additional resources along with my answer. After diving a bit deeper into some of the subjects, I thought a post on the assumptions of the F-test and how to digest its results would provide a better answer.

**One-Sided F-test**

Just as you would perform a t-test to determine whether a sample mean (test group) came from another distribution (control group) with the same mean, an F-test can compare the means of several groups and determine whether they are…
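To make that comparison concrete, here is a minimal hand-rolled sketch of the one-way F-statistic, with made-up control and test groups for illustration (in practice you would reach for `scipy.stats.f_oneway`, which also gives you the p-value):

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic: between-group variance over within-group variance."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total observations
    grand = sum(x for g in groups for x in g) / n

    # Between-group sum of squares: how far each group mean sits from the grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: the spread inside each group
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    # Mean squares use k-1 and n-k degrees of freedom respectively
    return (ssb / (k - 1)) / (ssw / (n - k))

control = [4.8, 5.1, 5.0, 4.9]
test = [6.2, 6.0, 6.4, 6.1]
print(f_statistic([control, test]))  # a large F suggests the group means differ
```

A large F means the variation *between* the groups dwarfs the variation *within* them, which is evidence against the groups sharing a mean.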

I came across this interesting algorithm called the Josephus Problem while doing some research, and it really started to make me think about how I approach problem-solving. At first the problem seemed quite difficult to solve; however, the more I worked through it, the more I found the solution could be elegantly expressed with a few lines of code.

**The Josephus Problem**

The problem is named after Flavius Josephus, a Jewish historian who reportedly came up with it after a battle between Roman and Jewish forces in the 1st century.

He, a companion, and others, ended up…
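Those “few lines of code” hold up; here is one common iterative formulation of the recurrence (my own sketch, zero-indexed, with every k-th person eliminated, not code from the post):

```python
def josephus(n, k):
    """Zero-indexed position of the survivor when every k-th person
    in a circle of n people is eliminated."""
    pos = 0                       # the survivor of a 1-person circle
    for size in range(2, n + 1):
        pos = (pos + k) % size    # re-map the survivor as the circle grows
    return pos

# Josephus's own reported scenario: 41 people, every 3rd eliminated
print(josephus(41, 3))
```

The trick is solving the tiny circle first and mapping that answer back up through larger circles, rather than simulating the eliminations directly.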

**TL;DR**

It describes the probability of the waiting time between events, i.e., the chance an event occurs within a given span of time. Easy.

**A Bit More Than TL;DR**

Suppose we have some random variable X, which can be distributed through a Poisson process. From our observations we can see:

- This variable has been occurring at some fixed rate over a period of time;
- The event occurring in the past has no effect on the chance of it occurring again, and;
- The event never occurs more than once per interval.

With these observations in hand, we can assume it is an independent sequence of random variables, and we start to estimate the…
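Under those assumptions, the waiting time between events is exponentially distributed, and its CDF is simple enough to compute by hand. A quick sketch (the 2-events-per-hour rate below is an invented example, not a figure from the post):

```python
import math

def prob_event_within(t, rate):
    """P(waiting time <= t) for an exponential distribution,
    where rate is the average number of events per unit time."""
    return 1 - math.exp(-rate * t)

# If events arrive at 2 per hour on average, the chance of seeing
# at least one within the next 30 minutes:
print(prob_event_within(0.5, 2.0))
```

Note the memorylessness from the second observation above: the probability depends only on the window length `t`, not on how long you have already waited.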

Bagging, boosting, and stacking are basic parts of a data scientist’s toolkit and part of a family of statistical techniques called ensemble methods. The GitHub for this project can be found here.

There are three main terms describing the ensemble (combination) of various models into one more effective model:

- **Bagging**, to decrease the model’s variance;
- **Boosting**, to decrease the model’s bias, and;
- **Stacking**, to…

**What is an ensemble method?**

The idea here is to train multiple models, each with the objective to predict or classify a set of results.
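As a toy sketch of that idea, here is bagging from scratch: bootstrap resamples, one model per resample, and a majority vote. The threshold "stump" base learner and the data are my own stand-ins for illustration (a real project would use something like scikit-learn's `BaggingClassifier`):

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Trivial base learner: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
    threshold = (mean0 + mean1) / 2
    return lambda x: int(x > threshold)

def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
    """Bagging: train each stump on a bootstrap resample, then majority-vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]      # bootstrap sample
        model = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        votes.append(model(x_new))
    return Counter(votes).most_common(1)[0][0]

xs = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]
ys = [0,   0,   0,   1,   1,   1  ]
print(bagged_predict(xs, ys, 3.9))  # majority vote over 25 bootstrap stumps
```

Each stump sees a slightly different resample of the data, so their individual errors partially cancel in the vote, which is exactly the variance reduction bagging promises.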

Most of the…

The first part of the series is here.

Building off the work I did in the previous part, this article is going to cover these main points:

- Understanding how to make a similar churn model yourself;
- Making a front-end application using Flask, and;
- Deploying the model to Heroku.
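A minimal sketch of the Flask front end might look like the following. The route name and the stubbed `predict_churn` scorer are placeholders I've invented; the real app would load the trained model from the project folder and Heroku would supply the port:

```python
import os

from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_churn(features):
    """Placeholder scorer; the real version would call the trained churn model."""
    return 0.5

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    score = predict_churn(payload.get("features", []))
    return jsonify({"churn_probability": score})

if __name__ == "__main__":
    # Heroku injects the port via $PORT; fall back to 5000 locally.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 5000)))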

**New Project Structure**

Here is the new structure of our Flask app. Because I wanted to keep everything in one place, I simply moved `/churn_project` from the previous post into a separate folder within the new project structure.

It’s been kept so that, should I ever want to update the model, add new…

Sometimes Chef ◦ Sometimes Data Scientist ◦ Sometimes Developer