Data Structure Stories: Sets
A set is a data structure based on the name-sake from mathematics. It is an assortment of items not necessarily stored in order. Their defining feature is that normal sets cannot contain duplicate items. Sets are usually not indexed in the same way an array or map is. They are often defined by some indicator function, which tells us about the members of the set. In set theory one often learns about sets with simple examples, like red or black marbles being in different sets. This could then be represented in a Venn diagram. The functions can get much more mathematical, but there are always easy examples to help with the intuition.
A more interesting example is the Power Set of a set (P(s)), which is the set containing all subsets of the current set (including the empty set). This means it has a much larger cardinality (size) than the original set. Though infinity is not itself a number, you can actually define larger infinities than infinity by taking advantage of power sets and the idea of cardinality. I won’t go into much more detail, but this is a little additional nugget of Maths. If your interested then Wikipedia has great articles about these theories. In particular have a look at the article on “Aleph Null”; it’s a great read!
The most common programming set operation is testing for membership of an element. In this case, the set is always finite. For instance, I may have a set of the first 5 prime numbers, and I want to test whether 7 belongs to it. My membership function will return true, as 7 is the fourth prime number. These sets are often static. This means they only permit query operations. However, sets can also be dynamic. Dynamic sets allow insertion and deletion operations as well as the standard query options.
Core set operations include Union, Intersection, Difference and Subset. For static sets we have a membership operation, a means of testing for empty sets and a retrieval operation for cardinality. Dynamic sets may include a create operation (for a new set), an add (if not present already), a remove (if present), and a check for possible capacity.
History
In Mathematics, a German mathematician named Georg Cantor was responsible for the development of sets and set theory in 1874. His paper was titled “On a Property of the Collection of All Real Algebraic Numbers”. Understanding sets is a process that relies heavily on Maths, so I would recommend that the avid learner familiarise themselves with the core ideas before using them in programming. Sets come up again and again in Computer Science, so I cannot stress that enough.
In terms of early programming usage, a high-level language called SETL was developed in the late 1960s which was based purely on set theory. It was invented by Jacob Schwartz at New York University. Many programming languages since have an implementation of a set. C++, Java, Objective-C, Python and the .NET framework are some notable examples.
Performance
Sets can be implemented with many different underlying data structures. Therefore, in performance terms, they are heavily reliant on this choice. Implementations may often focus less on providing good performance for insertion or deletion and instead focus on specific set operations like Union or Intersection.
One could use a list of elements. As a result, a membership test would require iteration and conditional checks of time complexity O(n). This would be equally true for deletion. But sets are often implemented with trees, tries or maps (hash tables based) as well. In a map based implementation each key would be a sentinel, simply describing the current set element. The value would be the set member. Maps as we know provide amortised average cases of O(1) for many operations.
The implementation in C++ uses a Binary Search Tree. Java has three implementations extending from the Set interface: HashSet, SortedSet and TreeSet. As you could probably guess, each implementation is backed by something different. A sorted set also allows one to have a set that is in order. C# also has HashSet and SortedSet classes.
Alternatives
We have seen one alternative already, a sorted set. Sorted sets allow the items of a set to be in order. There are also Multisets. Multisets, also known as a bag, allow repeated values. Sample operations on a Multiset may include “Contains” to check for item existence, “IsSubBag” to check if the input is a valid sub-bag, and “Count(x)” which returns the count of a specific element, x, from the Multiset.