Taming Duplicate Strings

Norm Bryar
6 min readJul 9, 2022

After a couple of our containers hit out-of-memory, we wondered Where did all the duplicate strings come from? And what should we do to reduce them?

You may know, each object in .Net has a preamble, 16 bytes on a 64-bit cpu. Each UTF-16 char is 2-bytes, so “United States” takes 16+2*13 = 42 bytes. Loading a database with a million “United States” strings thus eats 42MB.

memory overhead of an object

(An enclosing object with a Country field has another 8 bytes for the reference to the string, so 1 million ‘Row’ objects holding distinct “United States” strings really eats 50MB. Perhaps better solved by picking a ushort to encode country-names, but … lets run with this example anyway.)

First Thought: String.Intern

We can explore how we determined string-proliferation was a problem in a later post, but here we’re tackling “now what?” You may have heard of BCL’s string.Intern(). It’s kind of a Fly-Weight caching for strings, one that includes strings stored in the data-segment of the assembly as well as ones we can programmatically populate.

A simple string-caching wrapper might then look like this:

public struct FrequentString
{
public string Value { get; }
public FrequentString(string repeatedString)
=> Value = string.Intern(repeatedString);
// Allow assignments to string w/o extra ceremony
public static implicit operator string(FrequentString fs)
=> fs.Value;
// equality/hashcode, ToString, etc. omitted for brevity
}

The string.Intern(…) call will lookup if the string’s been interned before and return that instance, or add this one.

Yay! Memory footprint is down by a factor of a million! (Well, neglecting the local vars used in the stream-read, or whatever, as those million strings are Gen0 GC’d as you go, so not hurting you if your million-row table is long-lived.) We still, of course, have the 8MB from the million string references, but the instances are all amortized.

One downside of string.Intern() is that it never forgets. Misjudging the cardinality of your inputs (e.g. accidentally caching billions of unique request correlation-ids instead of 244 unique country-names), that’s a memory-leak which will cause the OutOfMemory events we hoped caching would avoid.

Now, your domain may lend itself to a priori knowledge of the needed strings. You know the cardinality and the values; you’d pick string.IsInterned().

// All string-literals are automatically 'interned'
const string unitedStates = "United States";
// Reading from a stream, say, makes unshared strings. Boo!
string countryRead = new StringBuilder()
.Append("United ").Append("States").ToString();
// Add a step to prefer (not require) the cached version.
countryRead = string.IsInterned(countryRead) ?? countryRead;

We’ll assume our domain is not so well proscribed. A couple other considerations arise: BCL’s string intern mechanism may not be as fast as you want. You may need the string-set to evolve at runtime (vs. compile-time) and be independent of your deployment cadence. You may want instrumentation on hit-rates (which a string.IsInterned() + string.Intern() pair can give you, but at double the lookup cost). And you may want case-insensitive / normalized strings (“uNiTed sTATES” => “United States”).

Many domains’ strings may follow a power-law, such as a Zipf distribution, lending themselves to an LRU or even asbolute-expiry cache strategy. Some apps may see every string they’d need within the first N minutes of running (e.g. first model download event), after which the cache can switch to read-only mode. Or maybe you’re ok with a single-use, ephemeral cache. And some domains let you guess the count of unique strings, even if the content varies (e.g. 2000, but values specific to language-culture).

Basically, we want FrequentString pools taking a pluggable caching-policy.
That’s hard with structs, which don’t allow inheritance, and in our case we don’t want to bloat the instance with references to pool objects or lambdas. Probably want collaborating pool classes.

FrequentStringPool

// Stripped of logic, this now serves mostly as dev prompt,
// hinting to code-reviewers at intended cardinality
public struct FrequentString
{
// Non-public if we want only pools to instantiate them.
internal FrequentString(string input)
=> Value = input;
...
}
// -----public interface IFrequentStringPool
{
FrequentString Get(string input);
// With interface in same assembly as struct,
// the internal FrequentString ctor is callable.
protected static FrequentString FromString(string s)
=> new FrequentString(s);
}
// ----// Some example implementation.
public class FixedSizePool : IFrequentStringPool
{
private string?[] _pool;
private int _indexMask;
// 8192 * 8bytes/ref is under the 80KB LOH limit
// so seemed like a good general default.
public FixedSizePool(int capacity = 8192)
{
// Power-of-2 sizing enables index masking,
// faster than idiv+imul of (hash % capacity)
int nextPow2 =
(int) Math.Ceiling( Math.Log2( capacity ) );
this.Capacity = 1 << nextPow2;
// Could also ArrayPool<string>.Shared.Rent()
this._pool = new string?[ this.Capacity ];
this._indexMask = this.Capacity - 1;
}
public int Capacity { get; private set; }

public FrequentString Get(string input)
{
// Ordinal is fastest, if input source casing trusted.
StringComparison comp =
StringComparison.OrdinalIgnoreCase;
// Note: hash varies by AppDomain so do not
// persist/marshal the _pool array and expect same ordinals.
int hash = input.GetHashCode(comp);
int index = hash & _indexMask;
ref string? value = ref _pool[ index ]; if (string.Compare(value, input, comp) != 0)
{
// if (value is not null)
// { some cache-miss logic }
value = input;
}
return IFrequentStringPool.FromString(value!);
}
}

Our FixedSizePool aims for speed, but it’s lossy (hash+mask collisions) and not really thread-safe. It may make for a good transient pool though, like in a batch-job that might benefit from reduced mem but not so much that we want those strings loitering forever in our working-set.

The lossiness might be ok if the source is naturally partitioned, e.g. all “California”s occur before all “Colorado”s. But in general we’d want to compensate, perhaps by chaining FrequentStringPools together.

A ‘tier1’ pool’s cache-miss logic can consult the larger, thread-safe, ‘tier2’ pool-implementation. Thus if “Fred” evicts “Joe”, the “Fred” is sought in tier2 and we still get the one unique cached “Fred” string, even if Fred and Joe alternate all the time in the source data.

(We can also extend our interface with TryGet, allowing a batch-job to give Hannah and Joe the same lastName “Montana” instance as our global one, but the global cache is unpolluted by Joe DiMaggio’s lastName.)

Anyway, I hope this has been an interesting/diverting read. Though I know it’s not a greatly profound one.

— — — —

Aside:

In Semantic Primitives in C#, I described wrapper-classes giving a semantic context to strings (or ints or …). Whether you’re intrigued by the self-documenting and type-safety aspects of that approach or not, you can see its parallels with this article’s behavioral- or capability-encapsulation.

Here’s another variation on this theme: Say most of your system uses Guid identifiers but many components (logging, billing, authz, …) use strings. You’ve N different layers doing conversions between the two forms. Wrapper to the rescue:

public struct GuidString
{
private string? _str;
public Guid Id { get; init; } public string AsString => _str ??= Id.ToString(); // ... add string ctor to Id=Guid.Parse, etc...
}

One could combine this with the FrequentGuidStringidea, caching your B2B tenant-id Guids’ strings (’cause 1000 QPS/tenant of redundant Guid.ToString() may tempt you to sand that rough-edge all the way down).

Taming Duplicate Boxed Objects

Added thought: finding a duplicate strings issue, your app might also see a duplicate boxing issue. Say you’ve a Dictionary<string, object> feature-vector/row holding bool values, each of which incurring its own 16 byte object-preamble (pad, lock, method-table). A million trues, 16MB. The same ‘interning’ or FlyWeight-pattern idea applies.

public static class BoxIntern
{
private static readonly object _boxedTrue = (object) true;
private static readonly object _boxedFalse = (object) false;

public object BoxBool( bool b ) => b
? _boxedTrue
: _boxedFalse;
... maybe also for the top N integers you app boxes, etc...
}

--

--

Norm Bryar

A long-time, back-end web service developer enamored with .Net and C#, code performance, and techniques taming drudgery or increasing insight.