TLDR: Don’t use componentsSeparatedByCharactersInSet: to split a string by words in Swift. Depending on your needs, use either CFStringTokenizer, enumerateSubstringsInRange:options:usingBlock: or NSLinguisticTagger. But be aware of some subtle differences.
NOTE: This post is from 2015 and Swift has evolved a lot since then, so keep that in mind when reading it. See Apple’s video on Natural Language Processing and your Apps from WWDC 2017. (Thanks for the heads up, Rob Phillips.)
I will keep this fairly short. Splitting a string by words is a common task. If you’re a native English speaker (or Danish like me), there’s a good chance your first thought is to separate by whitespace and/or punctuation characters. But don’t. It will not work well for English. And it will not work at all for languages like Japanese or Chinese that do not use whitespace to separate words.
Do Not Separate by Whitespace and Punctuation
In case you are wondering why splitting by whitespace and punctuation characters does not work well for English, here is an example. Take the string “You shouldn’t do that”. Splitting by whitespace and punctuation will result in the following elements:
["You", "shouldn", "t", "do", "that"]
Note that “shouldn’t” was tokenized as “shouldn” and “t”. That is most likely not what you wanted. To remedy this, you could separate by all whitespace and punctuation characters except the apostrophe. But then problems arise if the apostrophe is used as a quotation mark in the input. Further, it does not help for languages like Japanese and Chinese, which do not use whitespace to separate words. Unfortunately I speak neither Chinese nor Japanese, but a Google translation of the example sentence above will do. Google Translate is impressive but not always 100% meaningful, so the Japanese translation below may not make much sense. Nevertheless, it is an example of how words are not separated by whitespace in Japanese:
あなたはそれを行うべきではありません
Splitting the above Japanese sentence by whitespace and punctuation will yield the following array:
["あなたはそれを行うべきではありません"]
In other words: no split at all. Just to make it clear that this is not the desired output, here is an example of what we might have wanted:
["あなた", "は", "それ", "を", "行う", "べき", "では", "あり", "ませ", "ん"]
Note that the sentence has not simply been split into elements containing exactly one character each. In the section below, I will show you how I split the string. But first, in case you really want to separate by whitespace and punctuation, here is how you can do it in Swift:
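In current Swift, componentsSeparatedByCharactersInSet: is spelled components(separatedBy:). The following is a minimal sketch of the naive approach, assuming Foundation is imported. The method name naiveWords is hypothetical and only used for illustration.

```swift
import Foundation

extension String {
    /// Naive split on whitespace and punctuation (hypothetical helper name).
    func naiveWords() -> [String] {
        let separators = CharacterSet.whitespacesAndNewlines
            .union(.punctuationCharacters)
        // Adjacent separators produce empty strings, so filter those out.
        return components(separatedBy: separators).filter { !$0.isEmpty }
    }
}

print("You shouldn't do that".naiveWords())
// ["You", "shouldn", "t", "do", "that"]
```

Because the Japanese example contains neither whitespace nor punctuation, the same call returns it as a single element.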
Enumerating Substrings by Words
With Foundation imported, the String type in Swift has a method called enumerateSubstringsInRange:options:usingBlock: (enumerateSubstringsInRange from now on) which according to Apple’s documentation “[e]numerates the substrings of the specified type in the specified range of the string”. Note that the method enumerates substrings of a specified type. One of the types you can specify is NSStringEnumerationOptions.ByWords. Using the String extension below, we can easily split a string by words:
And there you have it: enumerateSubstringsInRange is what I used above when creating the ‘desired’ split for the Japanese string. Here is the result again:
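In current Swift the method is spelled enumerateSubstrings(in:options:_:) and the option is .byWords. Here is a minimal sketch of such an extension, assuming Foundation on an Apple platform. The method name wordsByEnumeration is hypothetical.

```swift
import Foundation

extension String {
    /// Splits the string into words using enumerateSubstrings with .byWords.
    /// (Hypothetical helper name, chosen for illustration.)
    func wordsByEnumeration() -> [String] {
        var words: [String] = []
        enumerateSubstrings(in: startIndex..<endIndex, options: .byWords) {
            substring, _, _, _ in
            // substring is nil only when .substringNotRequired is passed.
            if let word = substring {
                words.append(word)
            }
        }
        return words
    }
}

print("You shouldn't do that".wordsByEnumeration())
```

The closure also receives the substring range, the enclosing range and a stop flag. They are ignored here because only the words themselves are needed.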
["あなた", "は", "それ", "を", "行う", "べき", "では", "あり", "ませ", "ん"]
The method works on our English string as well:
["You", "shouldn't", "do", "that"]
Another way to split a string by words is to use CFStringTokenizer. As the prefix CF indicates, this is part of Core Foundation. But do not fret, it is not too complicated. Here’s a simple String extension:
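A minimal sketch of such an extension in current Swift follows, assuming an Apple platform where Core Foundation is available through Foundation. The method name wordsByTokenizer is hypothetical, and the tokenizer uses the current locale.

```swift
import Foundation

extension String {
    /// Splits the string into word tokens using CFStringTokenizer.
    /// (Hypothetical helper name, chosen for illustration.)
    func wordsByTokenizer() -> [String] {
        let text = self as NSString
        let tokenizer = CFStringTokenizerCreate(
            kCFAllocatorDefault,
            text as CFString,
            CFRangeMake(0, text.length),
            CFOptionFlags(kCFStringTokenizerUnitWord),
            CFLocaleCopyCurrent()
        )
        var words: [String] = []
        // The returned token type has raw value 0 once no tokens remain.
        while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
            let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
            let nsRange = NSRange(location: range.location, length: range.length)
            words.append(text.substring(with: nsRange))
        }
        return words
    }
}

print("You shouldn't do that".wordsByTokenizer())
```

Token ranges are expressed in UTF-16 code units, which is why the sketch goes through NSString instead of converting to String.Index.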
Our English sentence will be tokenized in the same way as with enumerateSubstringsInRange. The Japanese translation will be tokenized like this:
["あなた", "は", "それ", "を", "行", "う", "べき", "では", "あ", "り", "ませ", "ん"]
Now, let’s compare this to the result from above. Below, the first column is the result of using enumerateSubstringsInRange and the second column is the result of using CFStringTokenizer. I’ve matched up each of the tokens from the two tokenizations. For some pairs of characters, CFStringTokenizer created two tokens where enumerateSubstringsInRange created only one containing both characters. For example, “行う” is left as one token when using enumerateSubstringsInRange but it is split in two when using CFStringTokenizer. Those cases are shown as “none*” below:
enumerateSubstringsInRange    CFStringTokenizer
あなた    あなた
は    は
それ    それ
を    を
行う    行
none*    う
べき    べき
では    では
あり    あ
none*    り
ませ    ませ
ん    ん
In the code above which uses CFStringTokenizer, a locale is created using CFLocaleCopyCurrent(). Of course, the resulting locale will vary depending on what the current locale is. But even when experimenting with different locales I have not been able to make CFStringTokenizer match the output of enumerateSubstringsInRange for the Japanese string. As far as I can tell, Apple’s documentation does not shed any light on the differences between how CFStringTokenizer and enumerateSubstringsInRange work.
Finally it is time to look at how you can enumerate strings using NSLinguisticTagger. As the name suggests, NSLinguisticTagger can do a lot more than enumerating the words in a string. (In fact, it can do a lot more than tagging too.) But since NSLinguisticTagger is made to be used for natural language processing, it enumerates in a way you might not expect. Here is how it will split our English example string:
["You", "should", "n't", "do", "that"]
Note that “shouldn’t” is tokenized as “should” and “n’t”. This is because NSLinguisticTagger is created for part-of-speech-tagging (POS-tagging) and for that purpose it is relevant to separate “n’t” so it can have its own POS-tag. That way NSLinguisticTagger can let you do fancy stuff like extracting all the nouns from a string. But it may not be what you want if you are ‘just’ trying to separate the words. The example below shows how you can use NSLinguisticTagger:
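A minimal sketch using the unit-based NSLinguisticTagger API from macOS 10.13 / iOS 11 follows. The method name wordsByLinguisticTagger is hypothetical. The exact tokens may differ between OS versions, and Apple has since deprecated NSLinguisticTagger in favor of the NaturalLanguage framework, so this is an illustration and not a recommendation.

```swift
import Foundation

extension String {
    /// Splits the string into word tokens using NSLinguisticTagger.
    /// (Hypothetical helper name, chosen for illustration.)
    func wordsByLinguisticTagger() -> [String] {
        let text = self as NSString
        let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
        tagger.string = self
        let fullRange = NSRange(location: 0, length: text.length)
        // Skip whitespace and punctuation so only word-like tokens remain.
        let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]
        var words: [String] = []
        tagger.enumerateTags(in: fullRange, unit: .word,
                             scheme: .tokenType, options: options) { _, tokenRange, _ in
            words.append(text.substring(with: tokenRange))
        }
        return words
    }
}

print("You shouldn't do that".wordsByLinguisticTagger())
```

Passing a scheme such as .lexicalClass instead of .tokenType would make the tag in the closure carry the part of speech, which is the use case the tagger was designed for.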
We have seen how we can use either enumerateSubstringsInRange, CFStringTokenizer or NSLinguisticTagger to split our simple example string and its Japanese (Google) translation and how each method gives us slightly different results. But if we add an emoji or two to the string there are even more subtleties. Take the following string:
I'm not a 🐥.
Below are the results using the three different methods:
// enumerateSubstringsInRange (by words):
["I'm", "not", "a"]
// NSLinguisticTagger:
["I", "'m", "not", "a", "🐥"]
// CFStringTokenizer:
["I'm", "not", "a", "🐥"]
As you can see, the 🐥 is removed from the output generated by enumerateSubstringsInRange but is included in the output from NSLinguisticTagger and CFStringTokenizer. Interestingly, if you add emojis to a Japanese string, they are not removed by enumerateSubstringsInRange. The Google translation of “I'm not a 🐥.” is “私は🐥ないよ。” and running enumerateSubstringsInRange on that yields this array:
["私", "は", "🐥", "ない", "よ"]
There is one more thing to consider when deciding on how to tokenize your strings: performance. I’ve made some very simple measurements on my MacBook Pro. I took some English text from Wikipedia and tokenized substrings of it using each method described above. I varied the substring length to see how the running time depends on the input length. The results are shown in the chart below.
As you can see, NSLinguisticTagger takes way longer than the other methods. This is expected because it does a lot more work (remember, it creates POS-tags too). Interestingly, CFStringTokenizer runs quite a bit faster than enumerateSubstringsInRange. And it is even a bit faster than componentsSeparatedByCharactersInSet which means there really is no excuse for using the naive method.
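A rough sketch of how such a measurement could be set up is shown below. It is not the harness used for the chart: the helper averageSeconds, the placeholder input text, the substring lengths and the run count are all assumptions, and only two of the methods are timed to keep it short.

```swift
import Foundation

/// Average wall-clock seconds per call of `work` (hypothetical helper).
func averageSeconds(runs: Int, _ work: () -> Void) -> Double {
    let start = Date()
    for _ in 0..<runs { work() }
    return Date().timeIntervalSince(start) / Double(runs)
}

// Placeholder input; substitute real text for meaningful numbers.
let sample = String(repeating: "You shouldn't do that. ", count: 2_000)
let separators = CharacterSet.whitespacesAndNewlines
    .union(.punctuationCharacters)

for length in [1_000, 10_000, 40_000] {
    let text = String(sample.prefix(length))

    let naive = averageSeconds(runs: 20) {
        _ = text.components(separatedBy: separators)
    }
    let byWords = averageSeconds(runs: 20) {
        text.enumerateSubstrings(in: text.startIndex..., options: .byWords) {
            _, _, _, _ in
            // Nothing to do; only the enumeration itself is timed.
        }
    }
    print("length \(length): naive \(naive)s, byWords \(byWords)s")
}
```

Absolute numbers will vary with hardware, OS version and build settings, so only the relative trend across input lengths is worth comparing.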
Below are a few suggestions for further reading.
- Apple String Programming Guide: https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/Strings/Strings.pdf
- Article about NSCharacterSet on NSHipster: http://nshipster.com/nscharacterset/
- Article about CFStringTransform on NSHipster: http://nshipster.com/cfstringtransform/
- Article about NSString and Unicode on objc.io: http://www.objc.io/issue-9/unicode.html
- Discussion about string tokenization on Cocoabuilder: http://www.cocoabuilder.com/archive/cocoa/322350-splitting-cjk-text-into-words.html
- Discussion about string tokenization on Apple Mailing Lists: http://lists.apple.com/archives/cocoa-dev/2008/Feb/msg00330.html