iOS 11: Are Apple’s new NLP capabilities game changers?

Why on-device NLP still lags behind server-side NLP

Due for public release in a few weeks, iOS 11 will introduce a handful of much-anticipated machine learning frameworks: Vision, CoreML, and Language. On the NLP side, Apple builds upon its `NSLinguisticTagger` class (it has actually been around for some time, but has been reimplemented from the ground up) and gives developers access to higher-level NLP features, in particular named entity recognition (NER). Yay 👩‍💻!

Named entity recognition is a super useful information extraction task in which we seek to locate and classify mentions of real-world objects –persons 👩‍⚖️, locations 🏝, organizations 🏬, products 🎮– in some text.

As we use a lot of NER, we were curious about the accuracy Apple’s on-device framework would give us. The most straightforward way of evaluating those models is to use the canonical CoNLL (2002 and 2003) shared task datasets.

TL;DR: Apple’s on-device NLP gets average accuracy on the CoNLL datasets. Out of the box, spaCy (a server-side, Python-based NLP framework) consistently gets better precision and recall.
Benchmarking code (Jupyter notebook and Xcode Playground) is on GitHub here:

In the “Natural Language Processing and your Apps” 2017 WWDC session, the Apple engineer presents a pretty typical Apple performance slide 😊:

Apparently, Apple’s NLP gets 85% accuracy on named entity recognition in English — but we don’t know which accuracy metric they are talking about 😂. Let’s find out ourselves!

Each year, the CoNLL (“Conference on Computational Natural Language Learning”) participants agree on a shared task that researchers are going to compete on. Back in 2003 the shared task was an NER task, and the annotated dataset they produced is still widely used as a canonical dataset.

The actual dataset looks like this:

began VBD B-VP O
the DT B-NP O
defence NN I-NP O
of IN B-PP O
their PRP$ B-NP O
title NN I-NP O
. . O O

Each token is annotated with a named entity type, using the IOB format: B marks the beginning of an entity, I a token inside an entity, and O a token outside any entity.
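To make the IOB format concrete, here is a minimal sketch (our own helper, not part of the benchmark code) that collapses a sequence of IOB tags into typed entity spans:

```python
def iob_to_spans(tags):
    """Collapse IOB tags (e.g. B-PER, I-PER, O) into a list of
    (entity_type, start, end) spans, with `end` exclusive."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        # Close the open entity on O, on a fresh B-, or on a type change.
        if tag == "O" or tag.startswith("B-") or (current and tag[2:] != current):
            if current is not None:
                spans.append((current, start, i))
                current = None
        # Open a new entity on B-, or on I- with no entity open (IOB1 style).
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            current, start = tag[2:], i
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans
```

For example, `iob_to_spans(["B-PER", "I-PER", "O", "B-LOC"])` yields one PER span covering the first two tokens and one LOC span for the last token.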

In an Xcode playground, we instantiate an `NSLinguisticTagger` object, and we extend Apple’s `NSLinguisticTag` to map Apple’s tags to canonical tags (PER for person names, LOC for location names, ORG for companies and organizations):
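The setup can be sketched roughly like this (a minimal version of the extension described above; the property name `conllTag` is our own):

```swift
import Foundation

// Collapse Apple's name-type tags into the canonical CoNLL tags.
extension NSLinguisticTag {
    var conllTag: String {
        switch self {
        case .personalName:     return "PER"
        case .placeName:        return "LOC"
        case .organizationName: return "ORG"
        default:                return "O"
        }
    }
}

// The tagger itself, configured for the name-type scheme.
let tagger = NSLinguisticTagger(tagSchemes: [.nameType], options: 0)
```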

Note that `NSLinguisticTagger` doesn’t output MISC (miscellaneous) entities, so we’ll just exclude them from the dataset when computing F1 scores.

Starting with iOS 11, `NSLinguisticTagger` has a method that takes any token’s position in the input string and returns its entity tag. This is a life-saver when comparing outputs from different tokenizers, as it lets us work around mismatches in token boundaries.

We use this to get the named entities from the tokens, and output them in CoNLL format again:
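A minimal sketch of that loop, assuming the tokens are rejoined with single spaces (the helper `conllTag(for:)` is our own shorthand for the mapping described earlier; a real run would also add the B-/I- prefixes):

```swift
import Foundation

// Hypothetical helper collapsing Apple's name types to PER/LOC/ORG.
func conllTag(for tag: NSLinguisticTag?) -> String {
    switch tag {
    case .some(.personalName):     return "PER"
    case .some(.placeName):        return "LOC"
    case .some(.organizationName): return "ORG"
    default:                       return "O"
    }
}

let sentence = "Tim Cook visited Paris"
let tagger = NSLinguisticTagger(tagSchemes: [.nameType], options: 0)
tagger.string = sentence

// Query the tag at each token's character offset and emit "token TAG" lines.
var offset = 0
for token in sentence.components(separatedBy: " ") {
    let tag = tagger.tag(at: offset, unit: .word, scheme: .nameType, tokenRange: nil)
    print("\(token) \(conllTag(for: tag))")
    offset += token.utf16.count + 1
}
```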

The CoNLL dataset ships with `conlleval`, a Perl script that evaluates a model’s accuracy. Use it on the output file and you’ll get results that look like this:

Here’s the verdict: on CoNLL 2003, Apple’s NLP framework gets an F1 score of 54%, while the state-of-the-art F1 score is 90.9%.

The F1 score is a standard evaluation measure in machine learning that takes into account both the precision (the model doesn’t produce many false positives) and the recall (it doesn’t miss many true entities) of a model. See Wikipedia for a great visual explanation.
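Entity-level F1 in the spirit of `conlleval` can be sketched as follows (our own minimal reimplementation, not the actual Perl script): a predicted entity only counts as correct if both its type and its exact boundaries match a gold entity.

```python
def f1_score(gold_spans, pred_spans):
    """Entity-level F1: a prediction is correct only on an exact
    (type, start, end) match with a gold entity."""
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, with two gold entities of which one is predicted with the right boundaries but the wrong type, both precision and recall are 0.5, so F1 is 0.5.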

Important disclaimer: we are evaluating the model on a dataset that is different from the one it was trained on.

NER is a general task, and the CoNLL dataset is designed to be generic enough and close to most use cases, so a good general NER model should perform well on the dataset. However, it’s not completely fair to directly compare F1 scores with those of models that were trained on this data (and only this data). Still, it gives us directionally correct indications.

Let’s now contrast this with a well-known server-side NLP toolkit, spaCy, which ships with its own small NER model for English.

Extracting a document’s Named entities is just a matter of writing these four lines of code:
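In essence (exact model name depends on your spaCy version; here we assume the small English model has been installed with `python -m spacy download en_core_web_sm`):

```python
import spacy

# Load the small English pipeline, run it on some text, and collect entities.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new store in Paris.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
```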

Here, the mapping to the canonical NER tags is a bit more tedious as spaCy’s model exposes more entity types by default. We map them manually.
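The mapping boils down to a lookup table along these lines (a sketch; the exact choices, e.g. folding GPE into LOC, are judgment calls):

```python
# Collapse spaCy's entity labels into the three CoNLL types we keep;
# every other label (DATE, MONEY, etc.) is treated as outside (O).
SPACY_TO_CONLL = {
    "PERSON": "PER",
    "GPE": "LOC",  # geopolitical entities: countries, cities, states
    "LOC": "LOC",  # non-GPE locations
    "ORG": "ORG",
}

def to_conll(label):
    return SPACY_TO_CONLL.get(label, "O")
```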

Running the `conlleval` script on the output files gives an out-of-the-box F1 score of 62%. This is not a huge difference (especially when comparing both to the state-of-the-art on this dataset), but it is still significant.

In the coming weeks, and as iOS 11 gets released, we’ll investigate how Apple’s NER model turns out in real-world use cases. Subscribe to be notified when we find out more!

Relevant references:
