
I’ve been going through many Reddit posts, Glassdoor salary profiles, and forum threads to find one place where everyone can see exactly how much a Data Scientist actually makes. Right now, I don’t think there is a good answer that is easily available, and the answers I’ve been seeing vary a lot.

Is $80,000 a good place to start as a junior? What about my total package? Or should I be pushing for $150,000 and the corner office?

So with these questions, why not just have a place where everyone can see and contribute so we’re in…


Preface

Cleaning data is just something you’re going to have to deal with in analytics. It’s not great work, but it has to be done so you can produce great work.

I’ve spent so much time writing and rewriting functions to help me clean data that I wanted to share some of what I’ve learned along the way. If you have not gone over this post on how to better organize data science projects, check it out, as it will help form some of the concepts I’m going over below.

After starting to organize my code better, I’ve started keeping…


One thing prevalent in most data science departments is messy notebooks and messy code. There are examples of beautiful notebooks out there, but for the most part notebook code is rough … really rough. Not to mention all the files, functions, visualizations, reporting metrics, etc. scattered through files and folders with no real structure to them.

I have been guilty of going back to previous projects, searching for the matplotlib function I’ve written 50 times before that creates a certain graph with a certain metric. And I found it in cell 200 after searching through a few different notebooks…


Preface

If you’re not familiar with PyData, it is a non-profit organization which provides a forum for developers to share ideas and learn from each other, specifically using Python and on the topic of data.

They are also a great resource for online tutorials and lectures about everything to do with data and Python, which you can find here.

The video I want to go over is titled “So you want to be a Python expert” by James Powell from his lecture (talk?) at PyData Seattle 2017. …


Preface

While I was thinking about this post I came across this post and subsequent repos, which helped solidify a lot of the concepts here. William Koehrsen, in particular, has some great posts on Medium, and I highly suggest you check out his other articles. He was also quite helpful in working through some of the problems I had when trying to use distributed computing with featuretools, so thank you for all your help William.

As with most things, getting to the final product means processing raw material through different stages

Preprocessing Takes Time

So the reality of data science, or any sort of work that revolves around data, is that a majority of the time is spent…


I had a colleague ask me recently what the purpose of an F-test was, and I wanted to provide him with some additional resources along with my answer. After diving a bit deeper into some of the subjects, I thought a post on the assumptions of the F-test and how to interpret its results would provide a better answer.

One-Sided F-test

Just as you would perform a t-test to determine if a sample mean (test group) came from another distribution (control group) with the same mean, an F-test can compare the means of various groups and determine if they are…


I came across an interesting problem called the Josephus Problem while doing some research, and it really started to make me think about how I approach problem-solving. At the start, the problem seemed quite difficult to solve; however, the more I worked through it, the more I found the solution could be elegantly expressed with a few lines of code.
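As a rough illustration of how few lines the solution can take, here is a minimal recursive sketch. It assumes the common textbook formulation, which is not spelled out in this excerpt: `n` people stand in a circle, every `k`-th person is removed, and we want the 1-indexed position of the last person standing. The function name and parameters are my own choices for illustration.

```python
def josephus(n: int, k: int) -> int:
    """Return the 1-indexed position of the survivor.

    Assumed formulation: n people in a circle, every k-th person is removed.
    """
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive integers")

    def survivor(m: int) -> int:
        # 0-indexed survivor for a circle of m people.
        if m == 1:
            return 0
        # After the first removal the circle has m - 1 people and the
        # count restarts k positions further along, hence the shift by k.
        return (survivor(m - 1) + k) % m

    return survivor(n) + 1


print(josephus(7, 3))
print(josephus(41, 3))
```

The recursion depth grows with `n`, so for very large circles an iterative loop over the same recurrence would be the safer choice.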

Oh look … a tree made from recursion

The Josephus Problem

The problem is named after Flavius Josephus, a Jewish historian who reportedly came up with it after a battle between Roman and Jewish forces in the 1st century.

He, a companion, and others, ended up…


TL;DR

The exponential distribution describes the probability of an event occurring between two points in time. Easy.

An exponential distribution with different values for lambda.
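To make the TL;DR concrete, here is a small sketch that computes the probability of the event landing between two points in time for a few values of lambda. It uses the standard exponential CDF, 1 - exp(-lambda * t); the function names and the example rates are assumptions chosen for illustration, not taken from the original post.

```python
import math


def exponential_cdf(t: float, lam: float) -> float:
    """P(T <= t) for an exponential distribution with rate lam."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    # The waiting time cannot be negative, so the CDF is 0 for t <= 0.
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0


def prob_between(start: float, end: float, lam: float) -> float:
    """P(start <= T <= end): chance the event lands between two points in time."""
    return exponential_cdf(end, lam) - exponential_cdf(start, lam)


# Illustrative rates only.
for lam in (0.5, 1.0, 2.0):
    print(f"lambda={lam}: P(1 <= T <= 2) = {prob_between(1.0, 2.0, lam):.4f}")
```

A larger lambda means events arrive faster, so more of the probability mass sits close to zero.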

A Bit More Than TL;DR

Suppose we have some random variable X that arises from a Poisson process. From our observations we can see:

  • The event has been occurring at some fixed rate over a period of time;
  • The event having occurred in the past has no effect on it occurring again, and;
  • The event never occurs more than once per interval.

With these observations in hand, we can assume it is an independent sequence of random variables, and we start to estimate the…


Bagging, boosting, and stacking are basic parts of a data scientist’s toolkit and belong to a family of statistical techniques called ensemble methods. The GitHub repository for this project can be found here.

Decisions …

There are three main terms describing the ensemble (combination) of various models into one more effective model:

  • Bagging to decrease the model’s variance;
  • Boosting to decrease the model’s bias, and;
  • Stacking to increase the predictive force of the classifier.

What is an ensemble method?

The idea here is to train multiple models, each with the objective to predict or classify a set of results.
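As a minimal sketch of that idea, the snippet below shows bagging only: several simple models are each trained on a bootstrap resample of the data, and their predictions are combined by majority vote. The one-feature threshold "stump", the toy data, and all function names are assumptions made for illustration; they are not taken from the project's repository.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold t (predict 1 when x >= t) with the fewest training errors."""
    # An infinite threshold lets a stump predict 0 everywhere if needed.
    candidates = sorted(set(xs)) + [float("inf")]

    def errors(t):
        return sum((x >= t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Sample n indices with replacement (a bootstrap sample).
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Combine the individual stumps by majority vote."""
    votes = Counter(int(x >= t) for t in models)
    return votes.most_common(1)[0][0]


# Toy data: label 0 for small x, label 1 for large x.
xs = [float(i) for i in range(10)]
ys = [0] * 5 + [1] * 5
models = fit_bagged_stumps(xs, ys)
print(predict(models, 0.0), predict(models, 9.0))
```

Each stump sees slightly different data and so picks a slightly different threshold; averaging their votes is what smooths out the variance of any single model.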

Most of the…


The first part of the series is here.

Building off the work I did in the previous part, this article is going to cover these main points:

  • Understanding how to make a similar churn model yourself;
  • Making a front-end application using Flask, and;
  • Deploying the model to Heroku.

Now on to part 2 …

New Project Structure

Here is the new structure of our Flask app. Because I wanted to keep everything in one place, I simply moved /churn_project from the previous post into a separate folder within the new project structure.

It’s been kept so should I ever want to update the model, add new…

Robert R.F. DeFilippi

Sometimes Chef ◦ Sometimes Data Scientist ◦ Sometimes Developer
