Giving ML Models a Turn at Customer Churn
Project #3 started with a flurry of intense googling to find a good dataset. The project had to be a classification project, and I wasn’t sure what I wanted to classify. I looked high and low on Kaggle, UCI, and other data repositories for a project I thought would be interesting.
My first idea was to try to predict colony collapse in honey bee hives, but I wasn’t able to find any good datasets for that. It seems no one is keeping that level of detail on honey bee hives yet. At least not publicly.
I also looked for a nice soil dataset. I thought maybe I could strike it rich by identifying unused soils that would be good for growing wine grapes. That didn’t pan out either, but that was a stroke of luck, since another student did a wine quality project and noted that wine consumption is actually in decline. It’s hard to strike it rich in a shrinking market.
I then stumbled upon a telecom company dataset on BigML with a listed objective of predicting customer churn. Before I set off on this adventure to become a Data Scientist and attend Metis, my previous employer had asked me about predicting customer churn, and honestly, I wasn’t sure how to do it at the time. This seemed like a good opportunity to try it out.
I looked at the data and it seemed to have a decent number of features, so I decided to give it a go. My proposal was accepted, and off I went to see if I could save the telecom industry from going extinct.
As I began tearing into my data and doing some exploratory data analysis, I noticed some oddness. All of the area codes were from the San Francisco Bay Area, but the people with those phone numbers were living in all 50 states and the District of Columbia. Not only were they in all 50 states, but the distribution was fairly even. Maybe all these people had moved away from San Francisco, and that was why they were in the dataset, but I wouldn’t have expected former SF inhabitants to disperse themselves evenly across all 50 states. I started to have an ominous feeling about my dataset.
I pressed forward because I didn’t have a lot of time and didn’t want to start over when I was already a few days into a short timeline. The next thing I noticed was that not only were the customers fairly evenly distributed across the states, but the area codes were evenly distributed between the states as well. More ominous feelings crept in. There wasn’t much information about this dataset online. It supposedly came from the UCI Machine Learning Repository originally, but I couldn’t find it there.
I decided to try some feature engineering to see if I could add some extra dimensionality to the data. I found a listing of economic regions of the US and decided to add those to the data. After joining the regions to the data, I found there wasn’t all that much variation in churn even among economic regions. I began to wonder if I’d be able to predict anything at all from this data.
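Adding a region feature like that is just a lookup on the state column. Here’s a minimal pandas sketch with a hypothetical, partial state-to-region mapping (the real listing covers all 50 states and DC):

```python
import pandas as pd

# Toy stand-in for the telecom data: one state abbreviation per customer.
df = pd.DataFrame({"state": ["CA", "NY", "TX", "WA", "GA"]})

# Partial, illustrative state-to-economic-region mapping; the full
# mapping used in the project covered every state.
state_to_region = {
    "CA": "Far West",
    "WA": "Far West",
    "NY": "Mideast",
    "TX": "Southwest",
    "GA": "Southeast",
}

# Map each state to its region to create the new feature column.
df["economic_region"] = df["state"].map(state_to_region)
print(df)
```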
Moving forward, I began modeling. I started with logistic regression, which could predict the no-churn customers to a degree, but the recall (true positives / (true positives + false negatives)) on the churn customers was terrible. The dataset was imbalanced, and we had learned how to rebalance it. I tried RandomOverSampler, ADASYN, SMOTE, and SMOTENC in an attempt to get better results. None of these had much of a beneficial effect. I couldn’t beat the naive model with logistic regression.
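For anyone curious, the idea behind the simplest of those techniques, random oversampling, can be sketched by hand with numpy and scikit-learn. This is toy data standing in for the churn set, not the actual project code, and the key detail is that resampling happens on the training set only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Stand-in for the churn data: a heavily imbalanced binary problem.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.86], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Random oversampling by hand: duplicate minority-class rows (with
# replacement) in the training set until the classes are balanced.
rng = np.random.default_rng(42)
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])

# Fit on the balanced training set, evaluate on the untouched test set.
model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
r = recall_score(y_test, model.predict(X_test))
print("churn recall:", round(r, 2))
```

SMOTE and ADASYN go further by synthesizing new minority points instead of duplicating existing ones, but the train-only resampling principle is the same.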
I also tried SVCs and Naive Bayes, both with oversampling techniques. These didn’t perform very well either, but an SVC in a GridSearchCV loop does turn your laptop into a good hand warmer, so it wasn’t without some benefit.
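That laptop-warming loop looks something like this sketch; the toy data and the small grid are stand-ins (the heat output scales with the size of the grid and the dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy imbalanced data standing in for the churn set.
X, y = make_classification(n_samples=500, weights=[0.86], random_state=42)

# Exhaustive search over C and gamma, scored on minority-class recall.
# Every (C, gamma) pair is refit cv times -- hence the handwarmer effect.
grid = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    scoring="recall",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```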
All this testing, and not getting any good results, really had me in a pinch. I was afraid I was going to have to get up and give a presentation on how no model could predict churn from my dataset. But there was one more trick left in the bag, and I just hadn’t learned about it yet. Random forests! There should be a random forest superhero. They really saved the day. A single trial run with the random forest model, and I finally saw some good results: a recall of 0.54. Not all that impressive by itself, but it was leaps and bounds better than any other model. I was thrilled! I wouldn’t have to give a presentation on how bad my dataset was. With further parameter tuning, I was able to get my recall into a respectable range of 0.74. Hooray!
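A baseline random forest run along those lines can be sketched like this, again on toy data standing in for the churn set (the 0.54 and 0.74 recalls above came from the real data and tuning, not this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the churn set.
X, y = make_classification(n_samples=2000, weights=[0.86], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# class_weight="balanced" is one of the knobs worth trying when the
# goal is recall on the minority (churn) class; n_estimators and
# max_depth are the usual next candidates for tuning.
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                            random_state=42)
rf.fit(X_train, y_train)
r = recall_score(y_test, rf.predict(X_test))
print("churn recall:", round(r, 2))
```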
Next on the menu was Flask. I didn’t know that I would be learning this when I came to Metis, but since I came from a small company it did answer a lurking question I had: how would I deploy a machine learning model if I worked for a small company without other infrastructure people to push it out to the rest of the world? Flask to the rescue! I set a goal of using Flask in this project, and I decided I wanted to create a whole integrated website to demo, over the last weekend before we had to give presentations.
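The core of the Flask idea is small: pickle a trained model, load it at startup, and expose a prediction route. Here’s a minimal self-contained sketch; the `/predict` route, the payload shape, and the stand-in model are assumptions for illustration, not the project’s actual app:

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train and pickle a stand-in model so the sketch runs on its own;
# in a real project the pickle is produced ahead of time during modeling.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
with open("model.pkl", "wb") as f:
    pickle.dump(RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y), f)

app = Flask(__name__)

# Load the pickled model once at startup, not per request.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # here, a list of 4 numbers
    return jsonify({"churn": int(model.predict([features])[0])})

# app.run(debug=True)  # uncomment to start the local dev server
```

On Heroku, the same app would be started by a WSGI server such as gunicorn rather than the dev server.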
These big ideas caused me to go down some wrong paths and waste part of my weekend. Note to self: remember these letters the next time you start a project with new tools: M.V.P. What is the minimum viable product? I spent too much time on peripheral aspects of my Flask design, and that slowed me down on actually deploying my model. I was still able to get it done and demo it on presentation day.
On the whole, this turned out to be a great project. I was able to experiment with multiple models, do oversampling, build an ugly but functional Flask application on Heroku that incorporated a pickled machine learning model, and not have a heart attack on presentation day.
If I had it to do over again, I would make the following changes:
- Investigate my data more thoroughly before submitting it for a project.
- Create a better plan for my Flask app so I wouldn’t go down a dead-end path before having an actual MVP.
- Spend more time investigating feature importances. Getting lost in my Flask app prevented me from getting back to those and getting a better understanding of the reasons why customers might decide to leave a company.
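For reference, pulling feature importances out of a fitted random forest is only a couple of lines. A sketch with placeholder feature names (the real dataset’s columns would go here instead):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names standing in for the real dataset's columns.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X, y)

# feature_importances_ sums to 1; sorting shows which features the
# forest leaned on most when splitting.
importances = (pd.Series(rf.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances)
```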
Humor: my first attempt to create a Plotly pie chart gone wrong!

