This article is not an exhaustive machine learning tutorial, but rather outlines general guidelines on how to approach a problem with data. We’ll use the example of determining application seriousness to prioritize interviews here at Flatiron School. Our admissions team does their darnedest to read through the many awesome applications we receive every day and schedule interviews in a prompt manner. To support their efforts, our tech team decided to help sort applications using a machine learning pipeline triggered as applications come in. The tools we’ll use for this walkthrough are Python, Pandas and scikit-learn.
What’s Your Problem?
The first, and oftentimes hardest, part of any analytical project is clearly defining the problem you’d like to solve. It’s particularly important to check your assumptions and deepen your project understanding before jumping into data exploration. If you have external stakeholders, this is the point where you should gather background information, define terminology and discuss what a successful project looks like.
Write a Solution Statement
With this groundwork laid, it’s time to write a solution statement. For the application priority problem we’re solving, it could be something as simple as “When a student submits a high priority application to Flatiron School, our admissions team is notified so they can schedule an interview with the student immediately.”
Ponder the Problem
Using your own domain knowledge and the information gathered from stakeholders, you will want to think about potential factors impacting the problem’s outcome. In the case of application prioritization, we investigated the FlatironSchool.com application form and thought about how students interact with Learn prior to applying.
Determine the Problem Type
Another key aspect of an analytical project is determining what kind of problem you’re solving, which broadly translates to whether you’re solving a supervised or unsupervised learning problem. Unsupervised learning is trained on unlabeled data and is generally harder to implement than supervised learning, which requires a pre-labeled dataset and solves for a pre-defined outcome. Supervised learning comprises regression (fitting the data) and classification (separating the data). In the case of our application seriousness example, we chose a supervised approach with a binary classification target of conversion. Binary classification means sorting the elements of a given dataset into two groups; here, conversion is defined as acceptance into Flatiron School.
Get to Know Your Data
While the data exploration phase is oftentimes breezed over in an eagerness to start testing models, spending ample time sifting through your data and understanding how it relates to the real world is paramount to a successful data project! Looping back on this phase can be done at any point during your analysis; in fact, returning to it often suggests a thoughtful and nuanced analysis.
Data exploration includes investigating each individual field in your core dataset, looking for complementary datasets and thinking about how a feature relates back to the problem you’re solving. I normally approach this on a field-by-field basis, looking at the value counts for each field and keeping a close eye on NaN values. Acquainting yourself with the data can also include engineering new features from existing fields or a combination of fields. For example, our student application data is relatively limited, but we derived a new feature using the language-check package to determine an error/length ratio. Our thinking was that this may correlate with bots and non-proofread applications.
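A field-by-field pass like this is easy to sketch in pandas. The column names below are illustrative, not our actual application schema, and the essay-length feature is a simplified stand-in for the language-check ratio:

```python
import pandas as pd

# Hypothetical slice of application data (columns are made up for illustration).
applications = pd.DataFrame({
    "campus": ["NYC", "NYC", "Online", None, "DC"],
    "essay": ["I love code", "", "Teach me", "Hi", None],
})

# Field-by-field pass: value counts (including NaN) plus a missing-value tally.
for col in applications.columns:
    print(applications[col].value_counts(dropna=False))
    print(f"{col}: {applications[col].isna().sum()} missing\n")

# A simple engineered feature: essay length, a crude stand-in for the
# error/length ratio described above.
applications["essay_len"] = applications["essay"].str.len()
```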
Rough Feature Importances
After defining your features, you can investigate how each feature impacts the problem’s outcome. It’s important to note that there are many techniques to quantify feature relevance and it’s ultimately determined by the model you use. This first, rough pass is not a definitive answer to which features matter, but rather provides a high-level view (like peering out of an airplane window at your data landscape). It can also serve as an important weeding out step, removing features that appear to be weak predictors and allowing you to focus on the variables that matter.
To get a sense of which features are strong indicators of application seriousness, we ran a classifier over our dataset and looked at its feature_importances_ attribute. A classifier is a model or algorithm that sorts input data into categories. For this particular classifier, ExtraTreeClassifier, the strongest features were a student’s lab completion count on Learn, the application form’s free response lengths and the engineered language-check feature.
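A rough importance pass looks roughly like the following. The data here is synthetic (in our case the features were lab counts, response lengths and the language-check ratio), and we use scikit-learn’s ensemble ExtraTreesClassifier for the sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the application dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# One score per feature; higher means more predictive by this
# tree ensemble's notion of importance. Scores sum to 1.
for i, score in enumerate(clf.feature_importances_):
    print(f"feature_{i}: {score:.3f}")
```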
Side note on interpretability: In the context of machine learning, interpretability means a model that is intelligible to a human and is transparent about how it solved a given problem. Decision trees are highly interpretable because their branching logic can be explained and visualized, in contrast to more opaque techniques like neural networks and deep learning.
Cinderella Story: Trying on Different Models
With your problem defined, data explored and feature set refined, it’s time to start trying out different models. Since we’re approaching the interview priority problem using binary classification, we can use one of scikit-learn’s many supervised classification models. To start, we split the dataset into train and test sets with 80% allocated to training and 20% for testing.
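In scikit-learn, that 80/20 split is one call (stratifying on the label keeps the class balance the same in both sets; the synthetic data here stands in for our application features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# 80% train / 20% test; stratify preserves the class ratio in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```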
We used our training set to try out several models including LogisticRegression, DecisionTreeClassifier, RandomForestClassifier and SVC and then compared how each model performed on the test set using the binary classification metrics of precision, recall and F1 score.
Precision
- Definition: a model’s ability to identify only the relevant data points
- Formula: true positives divided by (true positives + false positives)
Recall
- Definition: a model’s ability to find all the relevant cases within a dataset
- Formula: true positives divided by (true positives + false negatives)
F1 Score
- The F1 score is the harmonic mean of precision and recall, which balances the two metrics and is especially useful when you have imbalanced classes.
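All three metrics are one call each in scikit-learn. The toy labels below are made up to show the arithmetic (1 = “will convert”, 0 = “won’t”):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth and predictions: 3 true positives, 1 false positive,
# 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 3/4
f = f1_score(y_true, y_pred)         # harmonic mean of the two
```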
When deciding on performance metrics, you should consider what you’re optimizing for and balance the tradeoffs between false positives and false negatives. For our interview priority model, a false positive occurs when the model predicts a student will be admitted to Flatiron School, but isn’t, and a false negative occurs when the model predicts a student will not be admitted, but is! While prioritizing interviews is relatively low stakes, there are some tasks such as medical screening and self-driving cars where the repercussion of false predictions can be quite serious.
If the Model Fits, Tune its Parameters
While we were happy with the overall performance of the RandomForestClassifier, there is no excuse not to do a little parameter tuning. The parameter tuning process runs over various parameter combinations using a cross-validation procedure to optimize model performance. scikit-learn provides a handy class called GridSearchCV, which automates this process and provides the additional benefit of cross-validating your model. Cross-validation involves splitting your dataset into a fixed number of folds, running your model on each fold and then averaging the overall error estimate. It’s a strong indicator of how well your model will generalize to new data and protects against overfitting.
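A minimal grid search might look like this. The parameter grid below is illustrative, not the exact one we searched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Every combination of these values is tried (4 candidates here).
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# cv=5: each candidate is scored by 5-fold cross-validation,
# and the best-scoring combination is refit on the full data.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```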
Deployment and Monitoring
Once you have a working model, it’s time to release it into the wild to see how it performs against new cases. Even though the hard work is done, a model is a living thing and should be continually monitored and assessed for accuracy. There are many ways to manage model deployment; in our interview priority example, we deployed our model using a Google Cloud Function triggered by an HTTP POST request each time a new student application is submitted.
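Stripped of the cloud plumbing, the handler behind such a trigger can be sketched as below. The request shape, the predict_priority helper and the notification step are all hypothetical stand-ins, not our production code:

```python
def predict_priority(application: dict) -> int:
    """Stand-in for the trained classifier's predict() call."""
    return 1 if application.get("labs_completed", 0) > 10 else 0

def handle_application(request_json: dict) -> dict:
    """Score an incoming application payload and flag high-priority ones.

    In the real deployment this runs inside a Google Cloud Function
    invoked by an HTTP POST on each new submission.
    """
    priority = predict_priority(request_json)
    if priority == 1:
        pass  # in production: notify the admissions team
    return {"priority": priority}
```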
Happy modeling 🚀
Thanks for reading! Want to work on a mission-driven team that empowers their software engineers to play with data? We’re hiring!