A Guide to Git for Data Scientists

You don’t have to be terrified of git anymore.

For quite some time, git had been this nebulous, terrifying, thing for me. It was kind of like walking on a tight rope holding a bunch of fine china. Yeah I knew how to `git add`, `git commit`, and `git push`. But if I had to do anything outside of that, I'd quickly lose my balance, drop the fine china, and git would inevitable break my project into unrecognizable pieces.

If you’re a developer, you probably know git pretty well. But git is now becoming an inescapable skill for anyone in a field involving programming and collaboration, especially data science.

Understanding Maximum Likelihood Estimation

Let’s say you collect some data from some distribution. As you might know, each distribution is just a function with some inputs. If you change the value of these inputs, the outputs will change (which you can clearly see if you plot the distribution with various sets of inputs).

It so happens that the data you collected were outputs from a distribution with a specific set of inputs. The goal of Maximum Likelihood Estimation (MLE) is to estimate which input values produced your data. It’s a bit like reverse engineering where your data came from.

In reality, you don’t actually…

Understanding how the perceptron really works

The perceptron model is a binary classifier whose classifications are based on a linear model. So, if your data is linearly separable, this model will find the hyperplane that separates it. The model works as such:

A Practical Look at Vectors and Your Data

I remember a feeling of utter confusion when I first learned vector spaces in my first course of Linear Algebra. What’s a space? And what are these vectors, really? Lists? Functions? Kittens? And how the hell does that relate to all this data that I analyze for some scientific endeavor? In this article, I attempt to explain the topic of vectors, vector spaces, and how they relate to data in a way that my former self would have appreciated.

The Utility of the Vector

Imagine a two-by-two graph as you might have seen in school. A vector is an object that’s typically seen in the…

Why Doing Good Science is Hard and How to Do it Better

Doing good science is hard and a lot of experiments fail. Although the scientific method helps to reduce uncertainty and lead to discoveries, its path is full of potholes. In this post, you’ll learn about common p-value misinterpretations, p-hacking, and the problem with performing multiple hypothesis tests. Of course, not only are the problems presented, but their potential solutions as well. By the end of the post, you should have a good idea of some of the pitfalls of hypothesis testing, how to avoid them, and an appreciation for why doing good science is so hard.

P-Value Misinterpretations

There are many ways…

Essentials of Hypothesis Testing and the Mistakes to Avoid

Hypothesis testing is the bedrock of the scientific method and by implication, scientific progress. It allows you to investigate a thing you’re interested in and tells you how surprised you should be about the results. It’s the detective that tells you whether you should continue investigating your theory or divert efforts elsewhere. Does that diet pill you’re taking actually work? How much sleep do you really need? Does that HR-mandated team-building exercise really help strengthen your relationship with your coworkers?

Social media and the news saturates us with “studies show this” and “studies show that” but how do you know…

How to Really Prove Something

“Prove it” is a phrase thrown around with glee. But rarely are you really challenging a friend to provide a rigorous proof of some statement you disagree with. In science, there is no proof. Instead, the scientific method encourages you to form a hypothesis, collect data, and test that hypothesis. Repeating this process advances you closer and closer to the truth, but nowhere in this process can you say you’ve proven a hypothesis. This would be impossible as you would have to exhaustively collect all data in the known universe to see if your hypothesis still holds and even then…

Docker for Data Scientists

Docker is hot in the developer world and although data scientists aren’t strictly software developers, Docker has some very useful features for everything from data exploration and modeling to deployment. And since major services like AWS support Docker containers, it’s even easier to implement Continuous Integration Continuous Delivery with Docker. In this post, I’ll show you how to use Docker in a data science context.

What is Docker?

It’s a software container platform that provides an isolated container for us to have everything we need for our experiments to run. Essentially, it’s a light-weight VM that’s built from a script that can be…

A Few Useful Things to Know About Machine Learning

A Few Useful Things to Know about Machine Learning is a high-level machine learning paper written by Pedro Domingos of the computer science and engineering department at the University of Washington. His paper details some useful machine learning guidelines and the following are some highlights I took from it.

When building models, we’re after generalization so there are a few things to note:

• Cross-validation is a must — that is, randomly dividing your training data into n subsets, holding out each subset while training on the rest, validating your model on the held-out subset and then averaging the results.
• Although…

Remote Computing with Jupyter Notebooks

Awhile ago, I had AWS set up to provide me a unique URL that I could navigate to and use Jupyter Notebooks. I admired the convenience and the ability to just start a computation and close my laptop knowing full well my computations continued working away. However, using an AWS P2 instance can get very costly depending on your usage, which for me would be around \$600 per month. So, I figured I could just build a computer with that kind of money which could serve as a deep learning rig along with the occasional video gaming.

This post describes…

Bobby Lindsey

ML Specialist @ AWS. Author of bobbywlindsey.com. @bobbywlindsey. Views are my own.

Get the Medium app