Lord of the Buckets — A Dictionary of Lists in C#

Fil
.Net Programming
Published in
12 min readSep 12, 2024
Photo by Joel Muniz on Unsplash

Whenever your code sorts things into a variable number of (named) buckets and keeps an order of said stuff inside the buckets, then you are working with a dictionary of lists. The most straightforward approach to this is to use a Dictionary<string, List<T>> to map from a bucket name to the contents, there is a better way!

In this post, I want to show how to implement a better dictionary of lists in C# with cleaner semantics and an interface that is easier to use correctly. This is quite useful in some applications and an interesting exercise in the design of data structures. In the process, we will also learn a thing or two about the proxy pattern and weak references. So let’s get started!

The data structure in this design is in fact in real use within the advanced Git merge engine of Diversion, where it keeps track of files (the list) associated with commits (the keys).

All code for this post (and unit tests) can be found on GitHub at https://github.com/divversion/dictionaryoflists.

Prologue

What’s wrong with Dictionary<string, List<T>> Do you ask? Nothing! But it can be a bit cumbersome to use, particularly in algorithms that require a lot of adding and removing lists (for instance, keeping track of objects associated with an ID like a database key). Let's examine this in a bit more detail. Consider this:

var listDict = new Dictionary<string, List<string>>();
listDict["the bucket"].Add("kick it!");

This will give you a runtime error KeyNotFoundException: The given key 'the bucket' was not present in the dictionary. Indeed, you need to explicitly create the list with key the bucket First:

var listDict = new Dictionary<string, List<string>>();
listDict["the bucket"] = new List<string>();
listDict["the bucket"].Add("kick it!");

We can be clever about this kind of thing and declare the following extension method for any Dictionary:

public static class DictionaryGetValueOrAddExtension
{
public static TValue GetValueOrAdd<TKey, TValue>(this IDictionary<TKey, TValue> dict, TKey key)
where TValue : new()
{
if (dict.TryGetValue(key, out var item)) return item;
item = new TValue();
dict.Add(key, item);
return item;
}
}

Then we can write

var listDict = new Dictionary<string, List<string>>();
listDict.GetValueOrAdd("the bucket").Add("kick it!");

But what about removing items?

var listDict = new Dictionary<string, List<string>>();
listDict["the bucket"] = new List<string>();
listDict["the bucket"].Add("kick it!");
listDict["the bucket"].Clear();

We still have

listDict.ContainsKey("the bucket")

a return true when "logically" the list for the key "the bucket" is now empty, which we would like to count as "not having the key at all". The above call Clear() makes it pretty obvious that we should instead remove the key via

listDict.Remove("the bucket");

However, if instead Clear() we called

listDict["the bucket"].Remove(0);

it would be much harder to see that the list may now be empty. If we forget to explicitly remove keys, this keeps the empty lists alive for the lifetime of the dictionary. Unfortunately, List<T> never decreases in size by itself - you have to call TrimExcess() to force the memory to be freed. Alternatively, you can add code like this:

if (listDict["the bucket"].Count == 0)
listDict.Remove("the bucket");

But you would have to do this after every operation that possibly removes the last item in a list! This is easy to forget and even if we remember to always check, it potentially clutters our algorithm code with many lines of housekeeping.

Abe Lincoln once said:

Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

Well, maybe he did not say that, but the sentiment holds: Improving one’s tools can be a massive advantage in the medium to long term.

Often, this is less about saving code lines, but about lowering the mental load — everything I don’t have to think about frees up my grey cells to do something more productive! Anyway, let’s sharpen our axe and think if we can implement a data structure that gives us a dictionary of lists with automatic management of the lists.

In particular, we want to be able to write the following:

var listDict = new DictionaryOfLists<string, string>>();
listDict["the bucket"].Add("kick it!");
listDict["the bucket"].Clear();

After this, it should hold that listDict.ContainsKey("the bucket") is false, so no manual management of the bucket lists is necessary. Implementing these semantics correctly, however, turns out to be a little bit more tricky than it may seem at first sight...

First attempt: a naive implementation

Our dictionary-of-lists class will have the following declaration:

public class DictionaryOfLists1<TKey, TItem> : IDictionary<TKey, IList<TItem>> where TKey : notnull
{
...
}

Note that we inherit from IDictionary<TKey, IList<TItem>>. The IList<T> generic interface describes the minimal interface that a list must implement (see here for the official docs).

Why do we not use List<T> Directly? Because, as we saw above, we need to be notified when the last element is deleted from the list and the standard List<T> has no such functionality.

In UML this looks as follows:

UML diagram for the Proxy pattern
UML diagram for the Proxy pattern — traced by User:Stannered, created by en:User:TravisHein, CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons

Thus we need the concept of the proxy pattern: We will define our list of T's by creating a class that implements IList<T>. This proxy list just forwards all operations to a private List<T> field, but adds the logic we need in addition.

The most convenient way is to use a nested class for this so the following definition is part of the DictionaryOfLists1<TKey, TItem> Body:

internal sealed class ListProxy : IList<TItem>
{
private readonly TKey _key;
private readonly IList<TItem> _list;
private readonly DictionaryOfLists1<TKey, TItem> _owner;

public ListProxy(DictionaryOfLists1<TKey, TItem> owner, TKey key, IList<TItem> list)
{
_owner = owner;
_key = key;
_list = [..list]; // (*) <-- see below
}

public ListProxy(DictionaryOfLists1<TKey, TItem> owner, TKey key)
{
_owner = owner;
_key = key;
_list = [];
}

public IEnumerator<TItem> GetEnumerator() { return _list.GetEnumerator(); }
IEnumerator IEnumerable.GetEnumerator() { return ((IEnumerable)_list).GetEnumerator(); }
public void Add(TItem item) { _list.Add(item); }
public void Insert(int index, TItem item) { _list.Insert(index, item); }
public bool Contains(TItem item) { return _list.Contains(item); }
public int IndexOf(TItem item) { return _list.IndexOf(item); }
public void CopyTo(TItem[] array, int arrayIndex) { _list.CopyTo(array, arrayIndex); }
public int Count => _list.Count;
public bool IsReadOnly => false;

public TItem this[int index]
{
get => _list[index];
set => _list[index] = value;
}

public bool Remove(TItem item)
{
var result = _list.Remove(item);
if (_list.Count == 0)
_owner.Remove(_key);
return result;
}

public void RemoveAt(int index)
{
_list.RemoveAt(index);
if (_list.Count == 0)
_owner.Remove(_key);
}

public void Clear()
{
_list.Clear();
_owner._dict.Remove(_key);
}
}

Mostly, this is just forwarding boilerplate code, with one additional piece of functionality added: When the last item is removed, then the key itself is removed. For this, the ListProxy keeps track of its owner, which is passed in the constructor.

You will see that the constructor of ListProxy the passed list is always copied. Indeed the collection expression [..list] involving the spread element .. (which is available since C# 12) is functionally equivalent to new List<TItem>(list) and always copies the list. This copying may seem a bit odd at the moment - we will come back to this point later.

With the above proxy at hand, we can now fill in our dictionary-of-list implementation. For brevity, only part of the definition is shown. You can download the whole code from the GitHub repo https://github.com/divversion/dictionaryoflists accompanying this post. Most methods are just simple forwarding methods:

public class DictionaryOfLists1<TKey, TItem> : IDictionary<TKey, IList<TItem>> where TKey : notnull
{
private readonly Dictionary<TKey, ListProxy> _dict;

public DictionaryOfLists1() { _dict = new Dictionary<TKey, ListProxy>(); }
public DictionaryOfLists1(DictionaryOfLists1<TKey, TItem> other) { _dict = new Dictionary<TKey, ListProxy>(other._dict); }
public bool Remove(TKey key) { return _dict.Remove(key); }
public void Clear() { _dict.Clear(); }

// ...

// ... definition of ListProxy here ...
}

Some methods are a bit special, however. First, for TryGetValue() We implement additional logic that an empty list is created on the fly if no element exists. Recall that our specification called that if a key is not found, then we need to create a new list proxy to return.

public bool TryGetValue(TKey key, out IList<TItem> value)
{
if (_dict.TryGetValue(key, out var listProxy))
{
value = listProxy;
return listProxy.Count != 0;
}
else
{
var newCollection = new ListProxy(this, key);
_dict[key] = newCollection;
value = newCollection;
return false;
}
}

Also in Add() we need to be a bit more careful:

public void Add(TKey key, IList<TItem> value)
{
if (_dict.ContainsKey(key))
throw new ArgumentException("An item with the same key has already been added.");
if (value.Any())
_dict.Add(key, new ListProxy(this, key, value));
}

Note that we do not add empty lists (that’s also why we need to separately test for the existence of the key).

Finally, the indexer also needs to respect the new semantics:

public IList<TItem> this[TKey key]
{
get
{
if (_dict.TryGetValue(key, out var listProxy))
{
return listProxy;
}
else
{
var newCollection = new ListProxy(this, key);
_dict[key] = newCollection;
return newCollection;
}
}

set
{
if (value.Any())
_dict[key] = new ListProxy(this, key, value);
else
Remove(key);
}
}

Note here that in case we set it to an empty list, we instead remove the key (since no key just means an empty list).

Let us now come back to the point about copying the list upon creating the ListProxy. At first sight, this seems inefficient, but imagine if we just stored (a reference to) the original list. Then, if we kept another reference to the original list somewhere else, we could still change the original list, possibly violating our list semantics! For instance, we could clear the list without the key being automatically removed (the Keys property should only contain the keys with non-empty lists).

As an aside, one could imagine a better version of the dictionary-of-lists class where this copying issue is remedied. For instance, any access to a property could go through all lists and check if they have become empty. But this is pretty inefficient!

So, I chose the above version, which is safe and efficient as long as we do not insert very long pre-existing lists. This seems a reasonable choice, but of course, variations are possible for special use cases.

Unfortunately, this implementation has a flaw! 💣

Consider the following code (this is the unit test. ListProxy_ClearAndAdd_HasValue):

var listPre = new List<string> { "A", "B", "C" };
var listDict = new DictionaryOfLists1<int, string>();
listDict.Add(1, listPre);
var list = listDict[1];
list.Clear();
list.Add("A");

So, we kept a reference to the list (proxy) around after a Clear() (or equivalently, Remove(0)) and then added another element to this list again. What do you think listDict.TryGetValue(1, out _) returns? Unfortunately, this is now. false!

Why? The moment we cleared list the connection of this list to listDict was severed, so when we add "A" back into it, this change is not reflected in listDict.

This kind of bug is exactly the thing that you can spend hours hunting down in complex code, where it may be hard to reproduce… So let’s avoid this particular gotcha. But how?

Second attempt: keeping around empty lists

Short of a better idea, we could try to just keep around empty lists. This leads to DictionaryOfLists2. For instance, Remove() becomes the following:

public bool Remove(TKey key)
{
if (_dict.TryGetValue(key, out var listProxy))
{
listProxy.ClearInternal();
return true;
}
else
{
return false;
}
}

Here we have added the following method to ListProxy:

internal void ClearInternal()
{
_list.Clear();
}

Indeed, the original Clear() method calls Remove() of the owner dictionary of lists, potentially leading to an infinite loop and then a stack overflow.

This approach now satisfies all the semantic requirements of the data structure — yay! All unit tests now pass, except for one: Remove_ReducesTotalCapacity. Its code is as follows:

[Fact]
private void Remove_ReducesTotalCapacity()
{
// Arrange
var listPre = new List<string> { "A", "B", "C" };
var listDict = new DictionaryOfLists2<int, string>();
listDict.Add(1, listPre);

// Act
var capacityBefore = listDict.TotalCapacity;
listDict.Remove(1);

// Assert
Assert.True(listDict.TotalCapacity < capacityBefore);
}

To make this work there is a new property TotalCapacity to DictionaryOfLists2, which sums the Capacity of all lists. The property Capacity of a List<T> contains the total reserved slots for the list. As you can see, even though we removed the key, no memory is reclaimed.

As you probably gathered already, by keeping the empty lists around we have created a memory leak!

I mentioned before that List<T>they never shrink their capacity themselves and calling TrimExcess() is not for free either (as it has to allocate a new array).

So now we need to plug the leak.

Third attempt: weak references to the rescue

Let’s pause a moment and think about our problem: We want to kick out lists (more precisely: make them eligible to be garbage-collected) when they become empty.

On the other hand, we want to be able to resurrect and re-insert them if from the outside someone puts an element in again. To put an element in, other code needs to keep a reference to the list (or, more precisely, to the list proxy that is returned by our dictionary of lists). If such a reference exists, the list can not be garbage-collected after all.

These semantics are precisely those of the WeakReference<T> class! Weak references to an object do not prevent garbage collection of the object, but allow it to be resurrected. This works as follows:

var weakRef = new WeakReference<string>(new string("a string"));

// ... later ...
if (weakRef.TryGetTarget(out var s))
{
// s now contains a (strong) reference to our string
Console.WriteLine(s); // prints "a string"
}
else
{
// the tring was garbage-collected and s is now null
Console.WriteLine("R.I.P. my beloved string");
}

This requires a few modifications in our new DictionaryOfLists3. First, we need a new internal dictionary to hold the weak references of lists:

public class DictionaryOfLists3<TKey, TItem> : IDictionary<TKey, IList<TItem>> where TKey : notnull
{
private readonly Dictionary<TKey, ListProxy> _dict;
private readonly Dictionary<TKey, WeakReference<ListProxy>> _dictWeak; // <-- here
// ...
}

This field _dictWeak will hold the weak references to empty lists. Let's first look at the new. Add() Method:

public void Add(TKey key, IList<TItem> value)
{
if (_dict.ContainsKey(key))
throw new ArgumentException("An item with the same key has already been added.");

if (!value.Any()) return;

_dict.Add(key, new ListProxy(this, key, value));
_dictWeak.Remove(key);
}

If the list to be added contains elements, then we add it (and also remove any possibly existing weak reference, which is no longer needed). If the list is empty, then we do nothing. Next, let’s look at the indexer:

public IList<TItem> this[TKey key]
{
get
{
if (_dict.TryGetValue(key, out var listProxy))
{
return listProxy;
}
else if (_dictWeak.TryGetValue(key, out var weakRef) && weakRef.TryGetTarget(out var emptyListProxy))
{
return emptyListProxy;
}
else
{
var newCollection = new ListProxy(this, key);
_dictWeak[key] = new WeakReference<ListProxy>(newCollection);
return newCollection;
}
}

set
{
if (value.Any())
{
_dictWeak.Remove(key);
_dict[key] = new ListProxy(this, key, value);
}
else
{
_dict.Remove(key);
}
}
}

The getter does the following:

  1. First, we look if the key exists in _dict. If so, we return the associated list. This is as before.
  2. Otherwise, we look in _dictWeak and check if the target still exists. If so, we return it.
  3. If there is no weak reference or if it has been garbage-collected, then we create a new empty list for the key, store a weak reference to it _dictWeak and return it. Note that now that we have returned it, there is also a strong reference to it, so the cannot be garbage-collected until the returned strong reference goes out of scope.

The setter saves a list proxy in _dict or _dictWeak depending on whether the value has elements or not. Note that here it is important to remove keys as well to handle the cases where an entry already exists.

Also, the list proxy has new methods to handle the cases when the list becomes empty or non-empty again:

private void CheckReinsertIntoOwner()
{
_owner._dictWeak.Remove(_key);
_owner._dict[_key] = this;
}

private void CheckRemoveFromOwner()
{
if (_list.Count != 0) return;
_owner._dict.Remove(_key);
_owner._dictWeak[_key] = new WeakReference<ListProxy>(this);
}

These are called after the relevant operations in the listed proxy, for example in the following:

public void Add(TItem item)
{
_list.Add(item);
CheckReinsertIntoOwner();
}

public void Insert(int index, TItem item)
{
_list.Insert(index, item);
CheckReinsertIntoOwner();
}
public bool Remove(TItem item)
{
var result = _list.Remove(item);
CheckRemoveFromOwner();
return result;
}

Finally, the DictonaryOfLists3 the class has a new CleanUp() Method to remove old weak references whose lists have been garbage-collected:

public void CleanUp()
{
var keys = _dictWeak.Keys.ToList();
foreach (var key in keys)
{
var weakRef = _dictWeak[key];
if (!weakRef.TryGetTarget(out _))
_dictWeak.Remove(key);
}
}

This is not called automatically in the various methods since it is potentially expensive (it needs to go through all weak references) and so this should be reserved for a convenient position in the code, maybe after various operations have all been carried out.

There you go — this is a usable data structure for a dictionary of lists! And indeed it is this last design is the one that lives inside the “magic” Git merge engine of Divversion.

Postscript

In this post, we saw how to design a data structure for a dictionary of lists.

Going through three revisions, we continuously improved our design until it satisfied all the requirements and had a good performance profile.

In particular, we saw how to employ the proxy pattern to transparently use a list with added functionality. We also used weak references to handle empty lists efficiently.

Of course, you could also implement a DictionaryOfSets<TKey, TItem> in almost the same way now if you require it — just replace IList<T> with ISet<T> and List<T> with HashSet<T>.

Thanks for reading! 🙏

--

--

Fil
.Net Programming

Founder at Divversion, C# connoisseur, Git enthusiast, lover of all mathsy things.