I’m constantly fascinated by machine learning and always excited to find new projects for it. But as trendy as ML has become, sometimes a SQL query or IF statement can accomplish the same job as an ML model in much less time. I wanted to gauge interest in this topic before diving in, so I sketched a quick flowchart while on a plane and posted it on Twitter:
I guess this is something a lot of people are thinking about! In this post I’ll go through the paths in the flowchart with specific examples using real datasets. This series will focus on supervised learning — using labeled data to train a model. This is Part 1, so I won’t cover all the paths in the flowchart just yet. Let’s dive in.
How will you use your data?
Starting from the top of the chart, the first question to ask when solving any problem with data is how you'll use the data once the task is done. ML is primarily for using data to make future predictions. If you're only looking at historical trends in your data, there's no need to build a machine learning model.
Also, chances are that if you're analyzing historical data, you'll want to know why certain events occurred and exactly how two pieces of data are related. A machine learning model, on the other hand, creates a sort of black box between your inputs and predictions. For example, it may be able to predict that there's an 80% likelihood a specific person will buy X brand of shoes given their age, past purchase history, and interests. But the model won't be able to tell you how it came to that conclusion.
What's the best approach if we only care about analyzing historical trends? Let's take this public domain dataset of U.S. Congressional bills as an example. It contains data on ~400k bills that have gone through Congress since 1947. For each bill, it includes a description, whether it was passed, data on who introduced it, the topic, and much more. I want to figure out which factors have contributed to passing a bill in the past. I've put this data on BigQuery so I can easily run SQL queries against it. First, I want to see the percentage of all bills in the dataset that have passed. I can do this with a simple SQL query:
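To make this concrete, here's a sketch of the kind of query I mean, run against a tiny in-memory SQLite stand-in for the real table (the `bills` table name and `passed` column are assumptions on my part; on BigQuery the SQL is essentially the same):

```python
import sqlite3

# Toy stand-in for the BigQuery bills table; schema and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (topic TEXT, passed INTEGER)")
conn.executemany("INSERT INTO bills VALUES (?, ?)", [
    ("Public Lands", 1), ("Public Lands", 0),
    ("Labor", 0), ("Labor", 0), ("Macroeconomics", 0),
])

# Percentage of all bills in the table that passed
query = """
SELECT ROUND(100.0 * SUM(passed) / COUNT(*), 2) AS pct_passed
FROM bills
"""
print(conn.execute(query).fetchone()[0])  # → 20.0 on this toy data
```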
Overall, only 4.03% of proposed bills pass. Let's see if the topic of a bill affects whether or not it will pass:
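Breaking the pass rate down by topic is a `GROUP BY` on the same aggregation. Again, here's a hedged sketch against a toy SQLite table (column names assumed) rather than the real BigQuery dataset:

```python
import sqlite3

# Toy stand-in for the bills table; real analysis runs on BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (topic TEXT, passed INTEGER)")
conn.executemany("INSERT INTO bills VALUES (?, ?)", [
    ("Public Lands", 1), ("Public Lands", 1), ("Public Lands", 0),
    ("Social Welfare", 0), ("Social Welfare", 0),
    ("Labor", 0), ("Labor", 1),
])

# Pass percentage per topic, highest first
query = """
SELECT topic,
       COUNT(*) AS num_bills,
       ROUND(100.0 * SUM(passed) / COUNT(*), 2) AS pct_passed
FROM bills
GROUP BY topic
ORDER BY pct_passed DESC
"""
for row in conn.execute(query):
    print(row)
```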
Here are the results, visualized using Data Studio:
It looks like certain topics do indeed have a higher percentage of passing. Bills related to Public Lands and Government Operations are much more likely to pass than bills about Social Welfare, Macroeconomics, or Labor. It would also be worth investigating whether the percentage of bills passed for each of these categories has changed over time, or whether any data related to the person who proposed the bill affects its likelihood of passing.
It's worth reiterating that the dataset we choose doesn't dictate whether or not it's a good fit for ML — it all depends on what you want to do with the data. The bills dataset I used for SQL analysis above could also be used for ML if I define a clear set of inputs and outputs for my model. For example, I could use the bill title to train a model that predicts the topic of a new bill. Or I could use several data points for each bill to predict whether or not a new bill will pass.
If we do decide to use our data to train an ML model, the analysis above is an important first step. We want to understand any relationships between the data we’re feeding into our model and the thing we’re predicting before we ML-ify it, especially if we’re dealing with structured data like the example above (more on that in the next post).
What type of data are you working with?
Next let’s move along to the right side of the flowchart. For the rest of the post I’ll assume you want to use your data to generate future predictions.
Videos, images, audio
With unstructured data like videos, images, or audio files, machine learning will typically be your best bet. To understand why, think about how you’d go about analyzing these types of files without machine learning. If you wanted to build a model to determine whether an image contains an apple or an orange, you’d need to iterate over all the pixels in the image with a series of IF statements or rules you’ve specified in advance. You’ll quickly discover quite a few edge cases that don’t fit into the rules you’ve defined, like black & white images, or an image of a cross-section of an orange. That’s exactly why ML is a good fit — you can train a model on thousands of images of apples and oranges at all different angles and it’ll be able to generalize on images it hasn’t seen before.
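To see why the rule-based approach breaks down so quickly, here's a toy sketch: a hand-written color rule that distinguishes idealized apple and orange pixels but immediately fails on a black & white image. The thresholds are arbitrary assumptions, purely for illustration:

```python
def classify_fruit(pixels):
    """Naive rule-based classifier: count orange-ish vs. red-ish pixels.

    pixels is a list of (r, g, b) tuples. The thresholds below are
    hand-picked guesses, which is exactly the problem with this approach.
    """
    orange = sum(1 for r, g, b in pixels if r > 200 and 100 < g < 180 and b < 100)
    red = sum(1 for r, g, b in pixels if r > 150 and g < 100 and b < 100)
    return "orange" if orange > red else "apple"

# Works on idealized colors...
print(classify_fruit([(255, 140, 0)] * 10))    # → orange
print(classify_fruit([(200, 30, 30)] * 10))    # → apple
# ...but a grayscale photo of an orange matches neither rule:
print(classify_fruit([(128, 128, 128)] * 10))  # → apple (wrong)
```

Every edge case (grayscale, cross-sections, odd lighting) needs yet another hand-written rule, which is what a trained model learns to generalize over instead.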
I haven’t yet made a distinction between using already available pre-trained models vs. building your own custom deep learning models. This largely depends on the type of task you’re solving. Let’s take this image of an orange as an example:
I could run this through a pre-trained model or build my own from scratch; it really depends on what I want the output of my model to be. If I'm building some sort of orange photo contest app and simply need my model to tell me whether or not an image contains an orange, I could use a pre-trained model like the one provided by the Cloud Vision API. Here are the labels I get back:
The Vision API will work great for my orange / not orange use case. But this image is actually of a blood orange, and now I want my app to identify what type of orange is in the image. I can’t reliably get this back from the Vision API, but I could use a tool like AutoML Vision to train my own custom model to identify blood oranges, navel oranges, and mandarin oranges. I don’t need to write any of the underlying model code to use AutoML, I just need to upload 10+ images of each type of orange into the UI and click a train button. AutoML makes use of a technique called Neural Architecture Search, which uses an ML model to generate different types of model architectures to find the optimal one for a particular task. But the great thing about AutoML is that I don’t need to understand how this works. When my model training completes I can evaluate its accuracy in the UI:
When I’m happy with the results I can generate predictions via a custom REST API endpoint.
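For illustration, the request body for such an endpoint might be built like this. The project and model IDs in the URL are placeholders, and the exact JSON shape is an assumption based on AutoML Vision's REST interface, so check the current docs before relying on it:

```python
import base64
import json

# Hypothetical endpoint — substitute your own project and model IDs.
ENDPOINT = ("https://automl.googleapis.com/v1/"
            "projects/my-project/locations/us-central1/"
            "models/my-model-id:predict")

def build_predict_request(image_bytes):
    """Build the JSON body the predict endpoint expects: a base64-encoded image.
    (Field names here are an assumption; verify against the API reference.)"""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"payload": {"image": {"imageBytes": encoded}}})

# Fake bytes standing in for a real photo of an orange
body = build_predict_request(b"\x89PNG fake image bytes")
print(json.loads(body)["payload"]["image"]["imageBytes"][:8])
```

You'd POST this body to the endpoint with an OAuth bearer token; the response contains the predicted labels and confidence scores.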
So far I haven’t had to write any of the model code since the prediction I want from my orange image hasn’t required it. But now I’m getting a little fancier, and I want my app to take the same types of images as input and then highlight the regions of the image that contain orange rind. We’re getting pretty specific here, so this is not a task for your average pre-trained model.
There are many ways to solve this, but one approach would be to build a TensorFlow model that makes use of transfer learning. With transfer learning, we can take a model that’s already been trained on lots of images to accomplish a task similar to ours (like identifying regions, or “masks” in an image), and then use that as a starting point for our training. This way we don’t need as much training data as we would if we were building a model from scratch. TensorFlow has a few models built for this exact use case, listed at the bottom of the table here. This blog post provides a great walkthrough of how to implement this in TensorFlow. If you’re looking to build a similar model with bounding boxes (identify boxed regions in the image instead of masks), I’ve got a post on doing this to detect Taylor Swift.
Video and audio data
I haven’t touched on generating predictions from video or audio data with machine learning, but the approach is very similar to what I’ve outlined above for images. If I wanted to analyze videos without machine learning, I’d need to take the pixel approach I mentioned above one step further and apply it across each frame in a video. This would quickly get messy with a rule-based approach. To get started analyzing videos with machine learning, you can check out the Cloud Video Intelligence API to get scene-level video annotations. For a deep dive on building your own custom model for analyzing videos, check out this Medium post.
Audio data is also unstructured, but the process for analyzing it and generating predictions on it without machine learning is different from looking at the pixels in an image. Let's say I want to write a rule-based program to determine whether or not an audio file contains a human speaking. First, I'd need to read the raw audio data and run it through a Fourier transform (FT). I am not an expert on FTs, but suffice it to say it transforms the data from the time domain to the frequency domain. Once I have the frequency data (measured in hertz), I can find the average frequency and compare that number to the range of frequencies a human voice typically occupies (85–255 Hz).
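To make the frequency-domain idea concrete, here's a toy sketch: a naive discrete Fourier transform (a real implementation would use an FFT library) that finds the dominant frequency of a synthesized tone and checks it against the 85–255 Hz range. This is illustrative, not production audio code:

```python
import math

SAMPLE_RATE = 2000          # samples per second (toy value)
VOICE_RANGE = (85, 255)     # typical human voice fundamental, in Hz

def synth_tone(freq, duration=1.0):
    """Generate a pure sine wave standing in for real audio samples."""
    n = int(SAMPLE_RATE * duration)
    return [math.sin(2 * math.pi * freq * t / SAMPLE_RATE) for t in range(n)]

def dominant_frequency(samples, candidates=range(50, 501, 5)):
    """Naive DFT: correlate the signal with sin/cos at each candidate
    frequency and return the one with the largest magnitude."""
    best, best_mag = None, -1.0
    for f in candidates:
        re = sum(s * math.cos(2 * math.pi * f * t / SAMPLE_RATE)
                 for t, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * f * t / SAMPLE_RATE)
                 for t, s in enumerate(samples))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best, best_mag = f, mag
    return best

def looks_like_voice(samples):
    """The rule-based check: is the dominant frequency in the voice range?"""
    return VOICE_RANGE[0] <= dominant_frequency(samples) <= VOICE_RANGE[1]

print(looks_like_voice(synth_tone(120)))  # → True  (inside 85–255 Hz)
print(looks_like_voice(synth_tone(440)))  # → False (outside the range)
```

This works for clean tones, which is precisely where the trouble starts, as the next paragraph shows.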
Sounds simple enough, right? You could write a script to calculate the average frequency of a file and then check whether it falls within the 85–255 Hz human voice range. But what about cases where a person's voice falls slightly outside that range? Even at frequencies only slightly below 85 Hz or above 255 Hz, the rule-based approach would classify the speaker as "not human." Or take a file with a person speaking and a cat meowing: if the meows take up more of the recording, they will skew the average and the script will classify the file as "not human" as well. Many other sounds also occupy frequencies similar to a human voice. You get the idea.
It would be better if we could train a program to learn what the data associated with a human voice looks like, or how to identify a meow, or the word “go.” For a simple way to convert audio to text or text to audio, try the Cloud Speech APIs. And to get started building your own model for custom audio analysis (like training a model to predict the specific person speaking), check out this TensorFlow tutorial.
Stay tuned for part 2 where I’ll cover when to use machine learning to analyze text, numerical, and categorical data. And for more info on the tools I’ve covered in this post, check out:
I’d love to hear what you think about this approach for evaluating whether to solve a problem with ML. Find me on Twitter @SRobTweets or leave a comment below.