This is the second post in a two-part series going through my “when to use machine learning” flowchart:
The first post focused on two things:
- Figuring out whether you want to use your data for generating future predictions or historical trend analysis. If it’s the latter, you don’t need ML
- Why ML is often a good solution for generating predictions on video, image, or audio data
In this post I’ll focus on the bottom quadrant of the flowchart: taking text, numerical, and categorical data and deciding whether ML is a good fit for your prediction task. This is the tricky part. With the unstructured nature of images and audio, it’s almost always obvious that ML is the best choice since defining rules like “if the majority of pixels are orange” quickly gets out of hand. But with text, numerical, and categorical data the answer isn’t always so obvious.
Finding patterns in your data
For this section let’s assume you’ve found yourself at this step in the flowchart:
First, use SQL or other data analysis tools to identify patterns between your inputs and the thing you’re predicting.
We can’t get much further in this discussion without looking at some actual data. Let’s get started using log files as an example. I’m choosing log files because logs generate tons of data, which might lead people to think they’d be a good fit for running through an ML model. Are they? It depends.
“The thing you’re predicting” is the important piece here. Let’s say I want to flag error logs generated by my application. Since I’ve already got lots of log data, I could classify each log as “error” or “not_error,” build a binary classification model with a framework like TensorFlow, train the model, figure out where to serve it, and then make an API call to my model whenever new logs come in. That sounds like a lot of trouble, which is why we should see if there’s a simpler approach before we go the ML route.
Time to test this with real data. I stumbled upon this jackpot of log datasets on GitHub. For this example I’ll look at the BlueGene/L logs — a dataset of logs generated by the BlueGene/L supercomputer. Disclaimer: I am not an expert in analyzing logs or HPC 😮
The GitHub repo provides a small preview of the original dataset. Each log consists of a message, timestamp, ID, location, type, and more (all detailed here). The original dataset is a newline delimited text file, with a log statement on each line:
Before I write a script to split the log strings into a more structured format, let’s look at the data to see if we can spot any trends. It looks like nearly every log with an error contains the word “error,” “fatal,” or “failed.” Rather than building out an entire ML system, it would be way easier to run new logs through something like this:
In the sample file with 2000 logs, the above IF statement is able to flag 207 of them as errors. By clearly defining what I wanted to predict and doing a quick search for patterns in my data, I’ve saved myself lots of time and built a solution in five lines of code.
The same type of analysis could be applied to any dataset, not just log files. For example, I if I worked at a clothing company and had a dataset of customer demographic and purchasing data, I’d want to do the same “what am I predicting?” analysis before jumping to ML. If all I want to do is figure out which customers to email about a new winter coat, it might be as simple as “IF user lives in cold climate, promote the coat.” But if I want to use their past purchasing history to recommend new shirts they might like, it might be time for ML.
ML on text, numerical, and categorical data
Working with the same log file dataset as we did above, I’d now like to change the thing I’m predicting. Instead of flagging logs that indicate errors, I want to identify logs that indicate anomalies. The dataset is already labeled for this, using a “-” in the first character of each log to indicate “non-alert” logs. Logs without a dash symbol are alerts. This makes it a good dataset for alert detection, but I’m still not 100% convinced that machine learning is the right tool for the job. Before I decide, I want to compare alert vs. non-alert logs to see if I can spot any patterns.
To do this I’ve written a script to extract a few pieces of data from each log (using the full BlueGene dataset here) and write it to a CSV. I’ve then uploaded it to BigQuery for exploration. The result is a 340MB table with 4.7M rows and four columns that look like this:
alert column contains the code for the alert, or the string
non_alert. With a simple query I can see how many of the 4.7M rows contain alerts. There are 348,698 alerts, about 7% of the total dataset. Let’s see if we can spot any trends by doing some queries on the alert logs. In this phase of data analysis, simple SQL aggregations are almost always the solution — sometimes with
COUNT(), and sometimes with
AVG(). In this query we’ll look at a breakdown of the location of alerts using
And here’s the result:
It looks like most of the alerts occurred at the kernel or app level. Does that mean I could write something like
IF location == 'KERNEL' or 'APP', it’s an alert? Let’s look at all the logs from the KERNEL level:
Hmmm, over 4 million of the ‘KERNEL’ logs are not alerts, so it doesn’t look like we can draw any conclusions from the location. We can repeat the same analysis for log type. Here’s a breakdown of log types for only the alert messages:
Almost all of them are FATAL. But if we look at a breakdown of alert type for the FATAL logs, most of them are not alerts:
We can conclude that FATAL logs have a higher likelihood of being alerts, but we definitely cannot say that if a log is FATAL, it is an alert.
Next I’ll look at the log message field. Since this is free-form text, it won’t be quite as easy to extract patterns as the
type fields in the log data. But it turns out there are many repeated log messages within the alert logs:
Can we conclude that logs with a “data TLB error interrupt” error message are more likely to be alerts? Let’s break down the alert type for logs with that message:
Boom! Now we’re getting somewhere. All of the “data TLB error interrupt” messages are alerts. Looking more closely at my
GROUP BY message query above, there are 49,042 rows in the output — meaning there are 49,042 unique log messages associated with alerts. Since the top error message accounts for nearly half of all alerts, in theory I could write an IF statement to classify all logs with ‘data TLB error interrupt’ as alerts. But that would only classify 45% of new error messages correctly.
For a moment let’s pretend that this approach for classifying logs is a “machine learning model”:
Of course it’s not actually an ML model, it’s just an IF statement. But pretend for a moment that the classifications it’s generating are coming from an ML model. If I only care about my model’s ability to correctly identify logs that are alerts (also known as precision), this model is doing very well. Precision measures: for all logs my model classified as alerts, what % of those were correct? In terms of precision, this model achieves 100%. Pretty impressive!
But what about all those logs with other messages that my “model” missed? That’s what recall measures — what % of all alerts did my model identify correctly? My model’s recall is 45%, which is not so great. But I still may not need machine learning. First I need to decide which I care more about:
- Option 1: My system doesn’t flag any non-alerts as alerts. In the process, it may miss some logs that are alerts.
- Option 2: My system flags a high percentage of possible alerts. In the process, it may surface some false positives (alerts that actually aren’t alerts).
This is often a tradeoff. If I prefer option #1, I can likely proceed without machine learning. I could even improve my system by adding more of the top alert log messages to my IF statement, along with some common features of non-alerts:
This rule-based approach will classify alerts and non-alerts correctly, but if I want it to be all-encompassing it’ll quickly get long and messy.
If I want my system to flag as many potential alerts as possible (Option 2), a machine learning approach may work best with this dataset. And since I’ve already done some data analysis to determine which factors influence the likelihood that a log contains an alert, I know the pieces of data I’ll be using to train my model.
We can use BigQuery ML to take existing data stored in BigQuery and use it to train a machine learning model. With BQML we can choose between three different types of models:
- Linear regression: predict a numerical value (like revenue)
- Binary logistic regression: predict the probability that your input belongs to one of two classes
- Multi-class logistic regression: predict multiple possible classes for a particular input
Our case of predicting whether or not a log is an alert is a great fit for binary logistic regression since we’ll have 2 classes:
non_alert. For our model we’ll use the log message, type of log, and log location to predict whether or not it’s an alert. We can create and train our model with a single SQL query in BQML:
Let’s break down what’s happening in our query:
- In the first line we create the model and give it a name
- Then we tell BQML the type of model we’re building (logistic regression) and the column from our table that will be the label: this is the thing our model is predicting. In our case, it’s the
- The rest is a regular old SQL query where we extract all of the data that will be used as input and output to our model. Remember that the alert column had more than 2 values:
non_alertor the particular alert messages if the log was an alert. I’ve used a
CASEstatement to convert this data to two classes
The model took about 10 minutes to train, and once it finished we can see the loss (error) reported after each iteration of training:
Here we want to focus on the Evaluation Data Loss, since this measures our model’s error on our test set. When we train our BQML model, it’ll automatically split our data: using the majority of the data to train the model and then reserving a subset of the data to see how our model performs on data it hasn’t seen before. We want to see loss decreasing as our model trains for more iterations, which is exactly what we get here.
We’ve got a trained model, now what? Let’s use it to generate a prediction, which we can also do with SQL. Here I’ll feed it one example for prediction, using a log with one of the most infrequently occurring alert messages (our model should predict
alert). When we run predictions, BQML creates a new field with the name of our output column prefixed with
And here’s the result:
The model has correctly predicted it’s an alert! We’re seeing very high accuracy here since BQML treats free-form text as categorical data. The model associates every unique log message with a numerical value, and it can memorize the combination of certain messages, locations, and log types that result in alerts. In other words, it is handling each message in its entirety rather than learning patterns within the text. If we pass it the same log message as the one above with a period at the end of the sentence, it would likely classify it as
non_alert. To train a model that is better at generalizing, we should build a custom model using some natural language approaches to learning patterns in our text (more on that in a future post).
There you have it — with the same dataset, we’ve solved one problem with a simple
IF statement and another with an ML model. Along the way we used some data analysis to help us figure out which use case was a better fit for ML.
Hopefully you’ve now got a better idea of the process you might go through to determine if ML is a good fit. Want more of what was covered here? Check out these resources:
- This ML crash course has good descriptions of precision and recall
- Lak has a great post on this topic
- Get started with BigQuery
- Train a model in BQML
- Learn how to build a custom model for text classification with this post
Got feedback on this post or something you’d like to see covered in the future? Leave a comment or find me on Twitter @SRobTweets.