Strings, characters, and performance in Swift — a deep dive
You might be writing extremely inefficient code without even realizing it.
The content of this article applies to the current Swift version at the time of this writing (Swift 3.0.1). A future version of Swift could change some or all of the details discussed here.
A few weeks ago, I was running some benchmarks on a Swift project to which I contribute. A subset of our code base, which involves string scanning/parsing, was taking about twelve seconds to parse approximately 150K of text. Something obviously didn’t smell right. What follows is a deep dive into the internals of Swift strings that lead to an understanding of why the performance was so unsatisfactory and a solution that shaved it down to a fraction of a second without large-scale changes—with conclusions that you can adopt in your own code.
Before we jump in, let’s define some Unicode terminology that will show up throughout this discussion:
- Code point: the unique 21-bit numerical value assigned to a position in the Unicode code space. A code point can represent things like standalone characters (U+0065, Latin Small Letter E, “e”), glyphs that combine with the preceding character (U+0300, Combining Grave Accent, ◌̀), emoji (U+1F355, Slice of Pizza, 🍕), or control characters (U+000A, Line Feed), to name a few. Code points in the ranges U+0000…U+D7FF and U+E000…U+10FFFF are represented in Swift with the
UnicodeScalar
type. - Surrogate code points: Code points in the range U+D800…U+DFFF that are reserved for use by UTF-16 to encode code points that are larger than 16 bits. A pair consists of a “high surrogate” (U+D800…U+DBFF) followed by a “low surrogate” (U+DC00…U+DFFF).
- Code unit: the sequence of bits that corresponds to a code point in a specific encoding. UTF-8 uses 8-bit code units; UTF-16 uses 16-bit code units. A single code point may be represented by multiple code units; for example, the Greek lowercase letter phi (φ) is encoded in UTF-8 as the two-byte hex sequence <CF 86>. These are represented in Swift with types like
UTF8.CodeUnit
andUTF16.CodeUnit
(which are just aliases forUInt8
andUInt16
, respectively). - Extended grapheme cluster: A sequence of code points that form a single human-readable character, such as “a” + ◌̀, which together form “à”. Another example is the set of emoji that represent nations’ flags—for example, the Italian flag (🇮🇹) is represented by a pair of code points: U+1F1EE (Regional Indicator Symbol Letter I) followed by U+1F1F9 (Regional Indicator Symbol Letter T). This concept is represented in Swift with the
Character
type. - Canonical equivalence: Two sequences of code points are considered canonically equivalent if they represent the same character. For example, code point U+00E0 (Latin Small Letter A with Grave, “à”) is equivalent to the code point sequence <U+0061 U+0300> (Latin Small Letter A, “a”; Combining Grave Accent, ◌̀).
Swift string views
Swift’s String
type provides four different ways to access the underlying data, called “views”:
characters
, a collection of extended grapheme clusters, represented by the typeString.CharacterView
unicodeScalars
, a collection of (non-surrogate) Unicode code points, represented by the typeString.UnicodeScalarView
utf8
, a collection of 8-bit unsigned integers representing the UTF-8 encoding of the string, represented by the typeString.UTF8View
utf16
, a collection of 16-bit unsigned integers representing the UTF-16 encoding of the string, represented by the typeString.UTF16View
If you have experience with other programming languages, you’ve come to think of a string as a collection of characters, so characters
feels like the natural view to use in most cases. It is arguably also biased by the fact that its name sounds the most “friendly” of all the string views.
“Character” is a bit of an overloaded term whose meaning differs depending on the programming language. For example, char
in C/C++ is an 8-bit integer for which the language does not specify any encoding. Characters in Java and Objective-C (as returned by String.charAt()
and -[NSString characterAtIndex:]
, respectively) are actually UTF-16 code units.
In languages like those above, a single “human-readable character” may actually be represented by multiple values of its character type. This means that a word like “città” could report having six characters if the last character is encoded as “a” + ◌̀, even though it looks like five characters to our eyes. Swift’s Character
type is designed to represent that “human-readable character” concept and would treat those final two code points as a single character, which also makes it more attractive to use.
Since String.characters
is essentially the “obvious” way to work with elements of strings in Swift, let’s spend some time designing an algorithm around it and see how it performs.
Designing the tokenizer
We’ll recreate a similar problem to the one I faced. Imagine that we have some textual data about U.S. states, their populations, and capital cities:
"California", 39250017, "Sacramento"; "Texas", 27862596, "Austin"; "Florida", 20612439, "Tallahassee"; "New York", 19745289, "Albany"; "Illinois", 12801539, "Springfield"; (...and so on)
The data is made up of the following tokens:
- Strings, which start with double quotes and include everything up until the next double quote. (To keep this example simple, we don’t allow escaped characters like
\n
or\"
.) - Integers, which are sequences of one or more digits 0–9.
- The comma and the semicolon.
- Whitespace between tokens is allowed and ignored. (We’ll only worry about space, newline, carriage return, and tab.)
(If you’re looking at the data format above and wonder why I’m not just calling String.components(separatedBy:)
, bear with me—it’s a contrived example. There are a couple reasons we might not want to do that here: it wouldn’t work if we allow commas and semicolons inside the quoted strings, or we might just want to scan the input string no more than once.)
Given the description above, we want to write a tokenizer that will return each component one-by-one, like so:
string("California")
comma
integer(39250017)
comma
string("Sacramento")
semicolon
(...and so on)
Since the focus of this article is on performance rather than writing a lexical analyzer, I won’t belabor the details of specific components like the token and error types. The entire project can be viewed on Github.
Character-based scanning
Tokenizing the input string means scanning it element-by-element and returning meaningful tokens based on what we see. So, the natural thing that probably comes to mind is to use an iterator over the string’s characters
to build up those tokens. A snippet of the character-based tokenizer is shown below (the full source, with comments, can be seen here):
(The nextCharacter
method is a helper function that calls iterator.next()
but also handles a “push-back” case needed when scanning the end of an integer. The integerToken
and stringToken
methods scan those tokens specifically and have been factored out to keep the main loop small.)
Notice that our case
patterns use string literals. This is an important feature of Swift; unlike languages in the C family that distinguish character literals with single quotes, the Swift compiler allows a string literal containing a single extended grapheme cluster to be treated as if it were a Character
, based on context.
Let’s measure execution time. The benchmark is set up to fully tokenize our state population/capital data 1,000 times, and the measurement is repeated five times to smooth out noise. (It’s also executed a couple times without measuring it first, to “prime” the benchmark and ensure that the first measured run isn’t artificially longer because of static data that is lazily loaded.)
My sample data is 1,694 characters long. Scanning that 1,000 times gave me these results:
CharacterBasedTokenizer:
..... 2164.5144464 ms ± 6.15726058362173 ms (mean ± SD)
Over two seconds to make one pass over 1.6 MB of text on modern hardware? That can’t be right.
Instruments to the rescue
To figure out exactly where the tokenizer is spending all of that time, we can run the Time Profiler template in Instruments. If we do this and then focus on the CharacterBasedTokenizer.nextToken
method, we see something shocking:
A full two-thirds of the time taken to scan tokens is spent in an initializer for the Character
type. What is going on?
The first argument label of the initializer gives a small hint: _builtinExtendedGraphemeClusterLiteral
. Note the word “literal.” This initializer isn’t used to extract Character
values from the string I’m iterating over; it’s used to create Character
values from literal text elsewhere in the source code. The only place where those are found beneath nextToken
is in my tokenizer’s switch/case
patterns. Are they really causing that much overhead?
We can have the Swift compiler emit the “canonical SIL” (Swift Intermediate Language) using the -emit-sil
option to take a closer look at how those case
patterns are being compiled to lower-level code. (I’ve included a small script in the repository that does this and also demangles the Swift symbols.) Let’s find the SIL corresponding to the case
pattern where we match the comma character (reformatted with line numbers and scopes omitted for brevity):
%448 = string_literal utf8 ","
%449 = apply %26(%448, %23, %24, %25) :
$@convention(method) (Builtin.RawPointer, Builtin.Word,
Builtin.Int1, @thin Character.Type) -> @owned Character
The method signature shown on the second and third lines of %449
gives it away, but we can confirm it by looking back up at value %26
, which is the function being invoked:
%26 = function_ref @Swift.Character.init (
_builtinExtendedGraphemeClusterLiteral : Builtin.RawPointer,
utf8CodeUnitCount : Builtin.Word,
isASCII : Builtin.Int1) -> Swift.Character :
$@convention(method) (Builtin.RawPointer, Builtin.Word,
Builtin.Int1, @thin Character.Type) -> @owned Character
What does this all mean?
It means that every time through the scanning loop (that is, for every character in the string), Swift is invoking this initializer to create each character in the case
patterns that is compares the current character against until a match is found—and that initializer is significantly more expensive than you might expect if you’re used to languages where the character type is really just a numeric code unit.
By looking at the Swift standard library source, we can see the sequence of actions that occurs when a string literal like ","
is converted to a Character
:
- The compiler embeds the UTF-8 encoded representation of the literal in the executable’s data segment.
- At the point where the literal is used, Swift invokes
Character.init(_builtinExtendedGraphemeClusterLiteral:utf8CodeUnitCount:isASCII)
, passing it the address of the string data embedded in step 1 (source). - This initializer, in turn, invokes the same initializer on
String
. Eventually, aStringBuffer
is allocated (source). - Finally, the
String
is converted into back into aCharacter
value (source). Depending on the size of its UTF-8 representation (remember that a single “character” can contain multiple code units), it will either be stored compactly as an integer no larger than 63 bits, or the entire string buffer will be retained (source).
In other words, just to ask the question “does ch == ","
?” we end up allocating a string in memory that we don’t even need, because ","
easily fits within eight bits, let alone 63.
Why is this so complicated?
It’s the price we pay for Unicode correctness and for having a character model that corresponds to what users see rather than the string’s internal representation. It is both powerful and convenient to be able to take a string like “città” and know that it is five characters long, regardless of whether it is five or six code points long (“citt” + “à” vs. “citt” + “a” + combining grave accent). Likewise, it is desirable to be able to compare those strings and determine that they are canonically equivalent even when their internal representations differ.
Could the compiler do a better job here?
Possibly. In contexts where a string literal is being converted to a Character
, I would imagine that with a bit more work the compiler could determine that the value would fit in the compact representation and emit that code directly, bypassing the string-to-character dance described above.
You don’t have to live with execution times like this.
Let’s talk about things you can do to improve your run-time that don’t depend on changes to the compiler.
First, if the majority of the tokenizer’s time is spent initializing the Character
values being matched against, what if we preïnitialized them ourselves?
After this change alone, the new execution time is about one-third of what it was before:
CharacterBasedTokenizer:
..... 686.6116408 ms ± 14.3905875334629 ms (mean ± SD)
Looking at the time profile again, we can see that most of the time now is spent in the CharacterView
's iterator:
CharacterView
inherits the default IndexingIterator
from Collection
, so each call to next
does the following:
- Uses
subscript
to get the current character, which constructs a newString
from a slice of the buffer and then aCharacter
from that (source). - Advances its index to the next character, which is a relatively complex algorithm in order to achieve Unicode correctness (source).
This is still a lot of overhead—and pulling the characters being matched into separate variables is a pretty big sacrifice to code readability that we shouldn’t have to make. The set of characters we need to recognize in this example is fairly small, but a tokenizer for a more complex language could have many more. Can we do better?
Do you need characters at all? If not, use something else.
The power of Character
is that it treats code point sequences as a single “human-perceived character” and that it cleanly handles canonical equivalence, but it is not the natural representation of string elements.
Swift String
values contain an instance of the _StringCore
type, which is optimized to store either ASCII or UTF-16 encoded text. When Swift gives you a CharacterView
, it needs to transform that underlying representation into Character
values. Even though it does this lazily as you traverse the collection, if you iterate over the entire string, CharacterView
is doing a lot of memory allocations and complex calculations to give you those values.
So, if your use case doesn’t need the advanced capabilities that Character
provides—our scanner doesn’t—then you’ll get much better performance by using something closer to the string’s internal representation. We still have three string views to consider: UTF8View
, UTF16View
, and UnicodeScalarView
.
Digging into string internals
Let’s start by taking a closer look at _StringCore
. First, notice the leading underscore: this is one of those types that was made public as an implementation detail, but which we shouldn’t touch in our code. The Swift team can—and likely will—change it in ways that would break us if we accessed it directly. That said, it’s beneficial to understand how it works and affects performance. We won’t use _StringCore
directly in any code that we write, but we’ll explore its implementation so that we can make informed decisions about how to make our scanner more efficient.
As I mentioned above, _StringCore
is optimized to handle both ASCII and UTF-16 encoded text—it does so by using some clever bit-twiddling and arithmetic tricks. An instance variable stores a pointer to the underlying bytes that make up the string content. Another instance variable keeps track of the count of ASCII or UTF-16 code units in the string, but the most significant bit of this count is special. _StringCore
calls it elementShift
: if it equals 0, the buffer contains ASCII data; if it equals 1, the buffer contains UTF-16 data. In other words, adding 1 to the value of this bit gives us the number of bytes that we need to advance to get from one code unit to the next: 1 for ASCII and 2 for UTF-16. _StringCore
also ensures that internal consistency is maintained during other string operations; for example, if you append a string with UTF-16 data to one with ASCII data, the ASCII data will be widened to UTF-16.
Since _StringCore
can store both kinds of text, how does it know which one to use for a given string? The compiler examines string literals and detects whether it contains only 7-bit ASCII (code units in the range 0 to 127) or if it contains Unicode, and it uses different initializers to create the String
values. Let’s consider these two strings:
let s1 = "foo"
let s2 = "föö"
The first can be represented in 7-bit ASCII, while the second cannot. If we compile this and examine the SIL, we see that slightly different code is generated:
%13 = string_literal utf8 "foo"
%14 = integer_literal $Builtin.Word, 3
%15 = integer_literal $Builtin.Int1, -1
%16 = metatype $@thin String.Type
%17 = function_ref @Swift.String.init (
_builtinStringLiteral : Builtin.RawPointer,
utf8CodeUnitCount : Builtin.Word,
isASCII : Builtin.Int1) -> Swift.String :
$@convention(method) (Builtin.RawPointer, Builtin.Word,
Builtin.Int1, @thin String.Type) -> @owned String
%18 = apply %17(%13, %14, %15, %16) :
$@convention(method) (Builtin.RawPointer, Builtin.Word,
Builtin.Int1, @thin String.Type) -> @owned String%22 = string_literal utf16 "föö"
%23 = integer_literal $Builtin.Word, 3
%24 = integer_literal $Builtin.Int1, 0
%25 = metatype $@thin String.Type
%26 = function_ref @Swift.String.init (
_builtinUTF16StringLiteral : Builtin.RawPointer,
utf16CodeUnitCount : Builtin.Word) -> Swift.String :
$@convention(method) (Builtin.RawPointer, Builtin.Word,
@thin String.Type) -> @owned String
%27 = apply %26(%22, %23, %25) :
$@convention(method) (Builtin.RawPointer, Builtin.Word,
@thin String.Type) -> @owned String
The String
initializers in %17
and %26
call _StringCore.init
with the appropriate value for elementShift
to indicate whether the string is ASCII or UTF-16 (source).
Armed with this knowledge, we might guess that either the UTF8View
or the UTF16View
would be quite fast, depending on what kind of string we have. That last part is tricky, though: there’s no fast way to determine whether we have an ASCII string or a UTF-16 one. It’s an implementation detail of _StringCore
and not exposed by the public API (nor should it be). Even if we did know, how would we use that information? Would we write two separate scanners—one for ASCII text and one for UTF-16? That’s not a realistic approach, either.
Instead of jumping right into the details of how UTF8View
and UTF16View
are implemented, first let’s consider the four possible cases and use our intuition to hypothesize about how efficient they might be:
- A
String
with ASCII data accessed throughUTF8View
: Since ASCII is a subset of UTF-8, the data matches the way we’re viewing it, so this should be fast. - A
String
with UTF-16 data accessed throughUTF8View
: The UTF-16 data has to be transcoded to UTF-8 and the iterator must maintain state about where it currently sits within a character’s UTF-8 code units, so the computational overhead would make this somewhat slower. - A
String
with ASCII data accessed throughUTF16View
: ASCII characters can be cheaply widened to 16 bit integers that are valid UTF-16 code units, so this should be fairly fast. - A
String
with UTF-16 data accessed throughUTF16View
: The data matches the way we’re viewing it, so this should be fast.
Since there’s a case where we believe UTF8View
will perform poorly, we’ll set it aside. On the other hand, UTF16View
looks promising, so let’s try converting our tokenizer to one that uses it. We can start by switching from String.characters
to String.utf16
and updating uses of Character
to UTF16.CodeUnit
. A snippet is shown below:
There’s just one problem—this won’t compile yet. There are two big issues:
- Unlike with
Character
, Swift does not let single-character string literals be used as UTF-16 code units. We would have to replace the literals in ourcase
patterns with their raw numeric UTF-16 values, which isn’t readable or easily maintainable. - In the
integerToken
andstringToken
methods, we collect the token text by appending each character to a new string. There is no overload ofString.append
that can take a UTF-16 code unit,UTF16View
is not mutable, and converting a single UTF-16 code unit to aString
which could be appended is not always feasible (for example, if it is part of a surrogate pair). We would have to use an alternate approach, like collecting the UTF-16 code units in an array and then decoding that into aString
after the fact.
In other words, to use UTF16View
, the code we would have to write would be fairly low-level compared to what we originally had. Swift is not a low-level language and we shouldn’t have to make that sacrifice. Let’s reject this approach.
Help me, UnicodeScalarView—you’re my only hope.
With all this in mind, the only option left is UnicodeScalarView
. Either I’ve saved the best for last, or this article is going to have a very disappointing ending. Let’s look at what UnicodeScalarView
and its element type UnicodeScalar
are capable of.
First, UnicodeScalar
conforms to ExpressibleByUnicodeScalarLiteral
, so the compiler will let us create a UnicodeScalar
using a string literal that contains a single scalar. In other words, we can still write case ","
and have our code be readable.
In addition to the readability benefit, initializing a UnicodeScalar
from a string literal is fast. UnicodeScalar
is a wrapper around a 32-bit unsigned integer, and there are no string buffer allocations involved like we saw with Character
. Let’s look at the generated SIL for a simple assignment:
// let scalar: UnicodeScalar = "ö"%12 = global_addr @StringTest.scalar : Swift.UnicodeScalar :
$*UnicodeScalar
%13 = integer_literal $Builtin.Int32, 246
%14 = struct $UInt32 (%13 : $Builtin.Int32)
%15 = struct $UnicodeScalar (%14 : $UInt32)
store %15 to %12 : $*UnicodeScalar
The compiler has already extracted the scalar from the string (246 is the decimal Unicode code point for “Latin Small Letter O with Diaeresis”), and the remaining instructions are simple value type conversions. The resulting machine code will be nothing more than an instruction that loads the number 246 into a register or memory address.
Next, we can’t append UnicodeScalar
directly to a String
, but unlike UTF8View
and UTF16View
, UnicodeScalarView
is mutable and the append(UnicodeScalar)
operation can be found there. This makes the implementation of the integerToken
and stringToken
methods work as they did before with only minor modifications.
Finally, iterating over the UnicodeScalarView
is very fast as well. The iterator delegates most of its behavior to the UTF16
codec type, which decodes the underlying ASCII or UTF-16 code units in the string buffer and produces UnicodeScalar
values. The decode
method can be seen here and is fairly straightforward. The vast majority (96.875%) of UTF-16 code units are numerically equivalent to their Unicode scalar (those in the range 0x0000–0xD7FF or 0xE000–0xFFFF). The remaining ones are the surrogate code units, and well-formed UTF-16 surrogate pairs consist of one high surrogate followed immediately by one low surrogate. The decoder never has to look more than one position ahead to convert any UTF-16 code unit or surrogate pair into a UnicodeScalar
, and the conversion is simple bitwise arithmetic that does not require any additional memory to be allocated.
A downside is that if we switch to UnicodeScalar
, each element we process no longer necessarily corresponds to a single human-recognizable character and we lose the ability to easily check for canonical equivalence. Fortunately, this does not affect our tokenizer. The basic tokens we recognize, like comma and semicolon, are simple ASCII characters. The only location where Unicode characters might appear is inside double quoted strings, and those scalars will simply be appended to another string that we get back in the returned token.
All of this seems promising, so let’s test it out by creating a UnicodeScalarBasedTokenizer
. The conversion from CharacterBasedTokenizer
is very simple (view source). Running the benchmark:
UnicodeScalarBasedTokenizer:
..... 108.4403786 ms ± 2.50400540742841 ms (mean ± SD)
That’s a significant improvement—50 times faster than the version that used CharacterView
, with only minor changes.
Summary
Swift’s String
type is a powerful abstraction that provides access to different encodings through a set of “views.” In particular, CharacterView
and the Character
type provide a clean and convenient API that solves a long-standing problem with text processing—identifying clusters of code points that form a single human-recognizable character and determining canonical equality without having to roll your own algorithms to handle low-level Unicode details.
This power comes with a computational cost, however. As they are currently implemented, Character
values involve time-consuming heap allocations that aren’t immediately obvious to callers—especially for Character
literals. Although these allocations are often short-lived, the recurring allocate-and-free churn can significantly drag down the performance of string processing in tight loops.
Since String.characters
is the “obvious” way to access a string’s elements, it’s easy to write code that performs sub-optimally when you don’t need to take advantage of those features.
To improve the efficiency of your code when you don’t need the features that Character
provides, use String.unicodeScalars
instead. UnicodeScalarView
is a happy medium between the very high-level CharacterView
and the low-level UTF8View
and UTF16View
, and as we’ve seen above, very minor changes to your code can yield significant performance gains.