ML.NET vs. Create ML: Toxic Text
A quick, low-level text-based comparison
Recently, I got hyped about Apple’s new Create ML GUI tool and gave it a quick whirl. When it came down to it, I was really impressed with how simple it was to input training data and have it spit out an actual working model.
Making a model using a Playground, I’ve come to learn, really isn’t that much more work. But I think the new tool is about making machine learning more accessible, even if it only seems that way.
And, clearly, it worked with me.
So, after working with images, the next logical test for me was to try and mess with text.
Thinking about how to test this, I remembered an ML.NET tutorial I had found that walks through a simple binary classification model. The Wikipedia Detox use case the tutorial takes you through is actually great.
It was simple to take Apple’s own text classifier tutorial and fit the detox use case into it.
All of a sudden, I had a lightbulb moment and realized this also presented an opportunity to compare and contrast the two frameworks in yet another truly epic Apple vs. Microsoft showdown!
And, interestingly, it was an opportunity to compare and (sort of) rate both of their introductory tutorials in this area. So, after slight alterations, I created my own test data, ran it through both implementations, and compared the results.
The first thing I compared was what each framework was telling me about its own level of confidence. A model reporting its own confidence is interesting, and a little curious.
Stats are always welcome, but my skepticism lies in wondering whether the test data is truly trustworthy, and how vulnerable machine learning assumptions are to human error.
But I digress. We want results! First, ML.NET CLI:
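(For reference, the ML.NET CLI run is a single command. The shape below is a sketch modeled on the tutorial; the dataset filename, label column, and train time here are my own placeholders, not the exact invocation:)

```shell
# Hypothetical invocation modeled on the ML.NET CLI classification tutorial;
# dataset name and label column are placeholders for the Wikipedia Detox data.
mlnet classification --dataset "wikipedia-detox-250-line-data.tsv" \
    --label-col "Sentiment" --has-header true --train-time 60
```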
Then, Create ML through Playgrounds (note: I kept the default random split for training and test data at 80%):
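For context, the Create ML side fits in a handful of Playground lines. Here’s a rough sketch under my own assumptions — the CSV path and the “text”/“label” column names are placeholders, not the tutorial’s exact values:

```swift
import CreateML
import Foundation

// A sketch of the Create ML Playground steps; the file path and
// column names ("text", "label") are placeholders, not from the tutorial.
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/detox.csv"))

// The default 80/20 random split I kept for training and test data
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 5)

let classifier = try MLTextClassifier(trainingData: trainingData,
                                      textColumn: "text",
                                      labelColumn: "label")

// Evaluate on the held-out 20%
let evaluation = classifier.evaluation(on: testingData)
print(1.0 - evaluation.classificationError)

// Save the Core ML model for use in an app
try classifier.write(to: URL(fileURLWithPath: "/path/to/ToxicText.mlmodel"))
```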
At first glance, the numbers suggest it’s a tight race with Apple having a slight lead.
There are a few variables to this, though. First, that’s based on Apple’s validation accuracy, which the output admits is self-generated from 10% of the data.
This is wonky because I had already split the 250 lines of data into training data (80% of the data) and testing (20%). If I were to change that random split percentage, the accuracy could technically change with it.
This is not necessarily me just being nit-picky. I’m basing my opinions on what is immediately shown to me by the respective tutorials. That means that, while there may be ways of configuring the ML.NET model further, I’m choosing to go off whatever each tutorial surfaces by default.
I would probably want to consider the respective parameters and options more carefully if this were any other project, but this is essentially a comparison of beginner tutorials on the same use case. Consider that another of the “loose controls” in this experiment.
My Own Test
Now, the real test to me is whether or not the models actually work, based on my own input. Granted, the dataset has some pretty garbled comments that my models are now trained on.
Writing simple sentences may not necessarily yield the results I’m looking for (which makes it interesting that Microsoft would choose this limited set for an introductory tutorial).
There are 10 statements. The key, according to me, is:
- I hate you | Toxic.
- I love you | Non-toxic.
- Thank you so much | Non-toxic.
- Thank you so much! | Non-toxic (testing punctuation).
- This article sucks | Toxic.
- Go hurt yourself | Toxic.
- Get yourself some help. | Non-toxic (pulled from dataset).
- I saw it before watching the episode. Oh well. | Non-toxic (pulled from dataset).
- I agree | Non-toxic (pulled from dataset).
- That is rude | Toxic (example from tutorial, though without an expected result, so this is my own categorization).
And Create ML:
Interestingly, by my count, ML.NET had one fewer miss but, arguably, Create ML’s misses include a double: “Thank you so much” with and without an ‘!’. So really, the accuracy is similar by the numbers.
Then, there’s the context of the misses. Both models apparently don’t have a thing for “love”. Also, ML.NET was being a jerk when it said the comment about this article was Non-Toxic.
And, as for the “getting help” statement, that could actually go either way without more context. Create ML was more glass-half-full with that statement, but tripped over the “thank you” statements and “I agree”, which seem simpler to pick up (I can’t say what the model learned about them, but ML.NET classified both correctly).
It’s really difficult to declare a winner. What we’re really judging here is who has the better framework at the most basic, introductory, tutorial level.
And, while that could make a conclusion easy to brush off, there’s something to be said about that being a valid perspective to judge from.
I have to give it to ML.NET. The reason is that, ultimately, on basic tests with very little context, it seemed to do better.
“Thanks” and “agree” could’ve been said sarcastically in the dataset, which, in turn, Create ML could’ve picked up on. But that wasn’t the context in which they were stated, and ML.NET agreed there was nothing more to suggest they were Toxic.
While, again, this was a truly basic comparison test, it could speak volumes. These tutorials are possibly the first bit of exposure a curious developer has to either framework.
When deciding which to use for my projects, I want to come away from the guided tutorial with a level of confidence in the framework. Ultimately, that confidence will carry a lot of weight in determining whether or not the framework lands in my project.
Of course, there’s more that goes into the decision, like does ML.NET provide better fine-tuning of my model or would I prefer to use Create ML in my native iOS app?
As an example, I whipped up a quick iOS app that consumes the Create ML model from the test. It took just 21 lines of code (most of it SwiftUI) and a matter of minutes to create (see the gif above for it in action).
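A minimal sketch of what that consumption looks like, assuming the exported model was added to the Xcode project under the hypothetical class name `ToxicText` (this is roughly the same shape as my 21-line app, not the exact code):

```swift
import SwiftUI
import NaturalLanguage

// A sketch, not the exact app from the article.
// "ToxicText" is a hypothetical name for the generated Core ML model class.
struct ContentView: View {
    @State private var comment = ""
    @State private var verdict = ""

    var body: some View {
        VStack(spacing: 16) {
            TextField("Type a comment", text: $comment)
                .textFieldStyle(RoundedBorderTextFieldStyle())
            Button("Classify") {
                // Wrap the compiled Core ML model in an NLModel for text prediction
                if let model = try? NLModel(mlModel: ToxicText(configuration: .init()).model) {
                    verdict = model.predictedLabel(for: comment) ?? "Unknown"
                }
            }
            Text(verdict)
        }
        .padding()
    }
}
```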
But, as the saying goes, first impressions are lasting.
Maybe that’s why Apple uses a different data set/scenario. It may also be why Microsoft suggests you try a phrase (“That is rude”), without giving you an expected outcome.
All in all, I’m already impressed by machine learning but there’s still plenty more learning for ME to do.