Gradient boosting algorithms have proved to be some of the most successful and accurate machine learning algorithms. XGBoost, for example, has proved invaluable in Kaggle competitions. In this tutorial we’ll be developing our own gradient boosted trees from scratch.

Gradients will turn up a few times in this tutorial, so it’s important we get our terminology straight.

Gradient boosting is an ensemble learning algorithm. That means it combines multiple models together to arrive at a prediction.

The very first model that a boosting algorithm creates is a constant model. …
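As a minimal sketch of that starting point, assuming a squared-error loss (the excerpt doesn't say which loss the tutorial uses), the best constant model is simply the mean of the targets. The function names and toy data below are hypothetical, chosen only for illustration.

```python
from statistics import mean


def fit_constant_model(targets):
    """Return the constant that minimises squared error: the mean."""
    return mean(targets)


def squared_error(targets, prediction):
    """Total squared error of predicting the same value for every target."""
    return sum((y - prediction) ** 2 for y in targets)


# Hypothetical toy targets, for illustration only.
y = [3.0, 5.0, 7.0, 9.0]
f0 = fit_constant_model(y)

# The residuals are what the next model in the ensemble would be fitted to.
# For squared error they are proportional to the negative gradient of the loss.
residuals = [yi - f0 for yi in y]
```

Every later model in the ensemble then nudges this constant prediction towards the targets by fitting what the previous models got wrong.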

The support-vector machine is one of the most popular classification algorithms. The SVM approach to classifying data is elegant, intuitive and includes some very cool mathematics. In this tutorial we’ll take an in-depth look at the different SVM parameters to get an understanding of how we can tune our models.

Before we can develop our understanding of what the parameters do, we have to understand how the algorithm itself works.

Support-vector machines work by finding data points of different classes and drawing boundaries between them. The selected data points are called the support vectors and the boundaries are called hyperplanes.

The algorithm considers each pair of data points until it finds the closest pair that are in different classes and draws a straight line (or plane) midway between them. …
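The simplified picture above can be sketched in a few lines of plain Python. This is only an illustration of that description, not a real SVM solver: actual implementations find the maximum-margin boundary by solving an optimisation problem over all the points. The helper names and the toy points are hypothetical.

```python
from itertools import product
from math import dist


def closest_opposite_pair(class_a, class_b):
    """Return the closest pair (a, b) with a from class_a and b from class_b."""
    return min(product(class_a, class_b), key=lambda pair: dist(*pair))


def midway_boundary(a, b):
    """Perpendicular bisector of a and b, as (w, bias) with w . x + bias = 0."""
    w = tuple(bi - ai for ai, bi in zip(a, b))
    midpoint = tuple((ai + bi) / 2 for ai, bi in zip(a, b))
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def side(w, bias, x):
    """Positive on b's side of the boundary, negative on a's side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + bias


# Hypothetical 2D points for two classes.
class_a = [(0.0, 0.0), (1.0, 1.0)]
class_b = [(3.0, 1.0), (5.0, 4.0)]

a, b = closest_opposite_pair(class_a, class_b)
w, bias = midway_boundary(a, b)
```

Calling `side(w, bias, point)` then tells you which side of the line a new point falls on.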

So you’ve heard of data science and you’ve heard of Python.

You want to explore both but have no idea where to start — data science is pretty complicated, after all.

Don’t worry — Python is one of the easiest programming languages to learn. And thanks to the hard work of thousands of open source contributors, **you** can do data science, too.

If you look at the contents of this article, you may think there’s a lot to master, but this article has been designed to gently increase the difficulty as we go along.

One article obviously can’t teach you everything you need to know about data science with Python, but once you’ve followed along you’ll know exactly where to look to take the next steps in your data science journey. …

Churn prediction is difficult. Before you can do anything to prevent customers leaving, you need to know everything from who’s going to leave and when, to how much it will impact your bottom line. In this post I’m going to explain some techniques for churn prediction and prevention using survival analysis.

The way many data analysts try to model this problem is by thinking in black-and-white terms: churn vs no-churn. It’s really easy to view the problem in this way as it’s a pattern we all know — supervised classification.

But doing so leaves out a lot of the nuance of the churn prediction problem — the risk, the timelines, the cost of a customer leaving. …
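To make the "when" part concrete, here is a rough sketch of one standard survival-analysis tool, the Kaplan-Meier estimator. The excerpt doesn't name the techniques the post goes on to use, so treat the choice of estimator, the function name and the toy data as assumptions.

```python
def kaplan_meier(durations, churned):
    """Return [(t, S(t))] at each time t where at least one churn was observed.

    durations: how long each customer was observed for.
    churned:   True if the customer left at that time, False if they were
               still active when observation ended (censored).
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Customers still being observed just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Customers who actually churned at time t.
        events = sum(1 for d, c in zip(durations, churned) if d == t and c)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical data: months observed, and whether the customer churned.
durations = [2, 3, 3, 5, 8, 8]
churned = [True, True, False, True, False, False]
curve = kaplan_meier(durations, churned)
```

Unlike a churn vs no-churn label, the resulting curve estimates the probability that a customer is still around at each point in time, and it makes use of the customers who haven't left yet instead of discarding them.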

Imbalanced learning problems often stump those new to dealing with them. When the ratio between classes in your data is 1:100 or more extreme, early attempts to model the problem are rewarded with very high accuracy but very low sensitivity to the minority class. You can tackle this sensitivity problem in a few different ways:

- You can naively weight the classes, making your model preferential to the minority class.
- You can use undersampling, oversampling or a combination of the two.
- You can switch your goal from trying to balance the dataset, to trying to predict the minority class using outlier detection techniques.

In this post, I’ll show you how, and more importantly, when to use the last of these methods and compare the results to the weighting and rebalancing approaches. …
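To see why accuracy is so misleading here, consider a small sketch with hypothetical data at a 1:100 ratio and a "model" that always predicts the majority class. The last helper shows one common inverse-frequency heuristic for class weights; it is an assumption for illustration, not necessarily the weighting used later in the post.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(label, y_true, y_pred):
    """Fraction of the true `label` examples that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


def inverse_frequency_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * n) for label, n in counts.items()}


# Hypothetical labels: 10 minority examples (1) for 1000 majority examples (0).
y_true = [1] * 10 + [0] * 1000
# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
minority_recall = recall_for(1, y_true, y_pred)
weights = inverse_frequency_weights(y_true)
```

The accuracy comes out at roughly 99% even though not a single minority example is found, which is exactly the trap described above.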

A lot of the projects I work on are time-bound in one way or another. My clients need to know the churn rate next week, the risk of fraud next month, their anticipated revenue next quarter. But what features does a model need to do this well?

Feature engineering is one of the most creatively challenging aspects of a data science project. When you follow a tutorial or read a book, it’s easy to forget that someone had to go through the difficult work of creating the features you use in your model.

In practice, creating features from raw data requires a great amount of foresight and some intuition about what would help your model do its job best. One of the best ways I’ve found to increase the accuracy of a time-based predictive model (especially one that’s trained on an imbalanced data set) is to use slopes. …
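As a minimal sketch of what a slope feature could look like, assume "slope" means the least-squares trend of some metric over a recent window of equally spaced observations. The excerpt doesn't spell out how the slopes are computed, so the function and the `weekly_logins` data below are hypothetical.

```python
def slope(values):
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0  # A single observation has no trend.
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical metric: logins per week for one customer, oldest first.
weekly_logins = [10, 8, 6, 4]
login_slope = slope(weekly_logins)
```

A single number like `login_slope` tells the model whether the metric is rising or falling, which a snapshot of the latest value can't do.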

Tomorrow I’ll be going to have a look around the CS department that I’ll be doing my PhD in.

The past few weeks have been hectic as I’ve been getting my work and home situations set up to ease the transition. On the eve of what will essentially be the first day, though, I thought I’d take some time to write about how I got into the programme in the first place!

Most getting-into-graduate-school stories are pretty straightforward: you get a degree and then maybe a master’s, you apply and you’re in!

I may be oversimplifying, but in any case that’s not how my story went. …