A Python-style Set in simple Java
One of the highlights of my time as a high school teacher was having the opportunity to teach AP Computer Science A. I had no background in Java before taking on the class (just a lone course in C++ in college) and had to teach myself the language from the ground up, so when I took on AP Computer Science Principles, I decided to use that as an excuse to dive into another programming language: Python. One of the ways I bridged the gap between thinking in Python and thinking in Java was to reverse-engineer some of the behaviors of core Python objects that were outside of the curriculum I covered in Java. While reinventing the wheel like this is ill-advised for practical purposes, doing so for educational reasons is a great way to take a deep dive into nuance.
In this post, we’ll be exploring a Java implementation of some of the important behaviors of Python set
objects. The code is restricted to the AP Computer Science A subset of Java, and does require basic understanding of the ArrayList
class, the enhanced for
loop, and implementation of the Iterable
interface. You should also either understand (or just believe in the magic of) the Integer
wrapper class’ autoboxing and unboxing. In order to focus on the behavior of a Set
and avoid some rabbit-holes such as shallow versus deep copying, we will limit ourselves to a simple Set
for holding Integer
objects.
Both Set.java
and SetDemo.java
, which demonstrates its methods, are available on my github.
First, a UML diagram of the Set
class can shed light on some of its behavior:
This Set
class will implement the Iterable
interface, so we can iterate over its elements one by one in an enhanced for
loop (which implicitly calls the iterator
method). The elements contained in the Set
class will actually be stored in ArrayList
of Integer
objects — this is to highlight how a Set
is different from an ArrayList
. You will notice that some methods are simply calls to an underlying ArrayList
method, while other methods from the ArrayList
class are completely inaccessible.
The constructors, contains
, size
, and toString
behave exactly as we would expect. In fact, contains
and size
are simply calls to the ArrayList
object’s contains
and size
methods. The differences starts to appear when we examine the add
method:
public void add(int new_value){ if (!contains(new_value))
{ set.add((int)(Math.random()*size()), new_value);
}}
This method does the heavy lifting when it comes to implementing the two main properties that distinguish the Set
class from an ArrayList
of integers. In the source code, we can see that no value can be added to a Set
object without passing through this filter method. The very first thing add
does is check to see whether or not the value is already in the list — new_value
is added if and only if it is not already present in the Set
. This is the first major difference between a traditional ArrayList
and a Set
; while an ArrayList
can contain duplicates, a Set
, by definition, contains only unique values. This method does not refuse to add items to the Set
; instead, it checks to see if they are already represented.
The second thing this method does is reinforce that the order of elements in a Set
is arbitrary. In Python, elements in a set
are arranged in memory according to their hash value (which is why aset
holds only hashable objects), and that hash value can present them in unpredictable ways. Because hashing is outside the scope of APCSA, we instead randomize the order of the elements to simulate this unpredictability. When thinking about how a Set
differs from an ArrayList
, we need to throw the idea of ordered elements out the window, and this implementation forces us to do that.
Now that we’ve talked about adding elements to the Set
, let’s look at how we can remove them.
public void discard(int value){ for (int i = 0; i < set.size(); i++) { if (set.get(i) == value) { set.remove(i); } }}
If we want to remove a specific item from the set, we can use the discard
method, and pass the element you want removed from the Set
. If the element is not found in the Set
, the discard
method does nothing. While this is similar to the ArrayList
’s remove method, which deletes the first occurrence of the desired element, the first occurrence is the only occurrence in a Set
, due to the uniqueness of the elements.
public int pop(){ return set.remove((int)(Math.random() * set.size()));}
A second way to remove elements from the Set
is with the pop
method. This method removes and returns an arbitrary (implemented here as random) element from the Set
, and is useful in algorithms which process and remove elements from the Set
one-by-one, but in no particular order. Similar to Python, attempting to pop
an element from an empty Set
will result in a runtime error.
Now that we’ve looked at how we can move elements into and out of the Set
, let’s look at operations involving more than one Set
. We’ll start with methods that check conditions.
First up is the notion of a subset. A set S is a subset of a set T if every element in S is also in T.
The Set
lacks an implementation of order, so this differs from the idea of a subarray, in which the elements of the sub-array appear in exactly the same order. In order to check if the calling object is a subset of the passed object, we have the isSubset
method:
public boolean isSubset(Set other){ for (Integer element : set) { if (!other.contains(element)) { return false; } } return true;}
We check every element in our internal set
, and if we find any element in the calling Set
that is not present in other
, we return false
. Notice that in this implementation, we will always return true
if the calling Set
is empty (which should sound familiar to anyone familiar with Set theory). The isSuperset
method turns the tables, checking to see if every element in the passed Set
is also in the calling Set
.
Now that we have the notion of a subset in our toolkit, we can use a set-theoretic definition for the equals
method. Two sets S and T are equal if and only if every element of S is also in T, and every element of T is also in S. In other words, this
set is equal to some other
set if this.isSubset(other)
and other.isSubset(this)
are both true
:
public boolean equals(Set other){ return this.isSubset(other) && other.isSubset(this);}
With a Set
, the order is arbitrary, so the same elements only need to be present, not in a particular order. This is one of the features that really distinguishes a Set
from an ArrayList
, in which two objects are equal if and only if they contain exactly the same elements in exactly the same order. (I am deliberately avoiding the actual implementation of an overridden version of the Object
class’ equals
method in order to focus on the logic of the Set
class.)
Finally, let’s take a look at the methods that accept two Set
objects and return a thirdSet
: union
, intersection
, difference
, and symmetricDifference
. I have provided both a static
method and an instance method for each of these; in each case, the instance method merely calls the static
method, passing this
as the first argument and the instance method’s parameter as the second.
The simplest of these methods to understand is union
. The union of two Set
objects is a new Set
that contains every element that is in at least one of the original Set
objects. The logic of the union
method is as simple as that: create a new Set
object, then use the add
method to add all elements from setA
and all elements from setB
. As we saw above, the add
method filters out any duplicates for us.
public static Set union(Set setA, Set setB){ Set setUnion = new Set(); for (Integer element : setA) { setUnion.add(element); } for (Integer element: setB) { setUnion.add(element); } return setUnion;}
Notice that again, the union
method can both operate on and return an empty Set
.
The second useful binary operation on Set
objects is taking the intersection. While the union includes all elements that are in at least one of the two Set
objects, the intersection includes only the elements that are in both:
public static Set intersection(Set setA, Set setB){ Set setIntersection = new Set(); for (Integer element : setA) { if (setB.contains(element)) { setIntersection.add(element); } } return setIntersection;}
By definition, any element in the intersection must be in both setA
and setB
, so we start by examining the elements in setA
, and add only the elements that are also in setB
to setIntersection
. Notice that the order of the sets does not matter; had we passed setA
and setB
in the opposite order, we would still wind up returning the same unordered collection of unique elements — that is, the same Set
. The method testIntersection
in SetDemo.java
demonstrates that these are indeed the same. Furthermore, this method is capable of accepting and returning an empty Set
.
Next, we have the only method in which the order of the arguments matters: difference
. If S and T are two sets, the difference, denoted in math as S-T or S\T, is the set of elements in S that are not also in T. Because this expression is read from left to right, I have used those as the parameter names in the difference
method to keep things straight:
public static Set difference(Set left, Set right){ Set diff = new Set(); for (Integer element : left) { if (!right.contains(element)) { diff.add(element); } } return diff;}
As a fan of symmetry, I wanted to highlight how we can implement intersection
and difference
with virtually identical code, where the only change isthe !
operator in the condition. This binary condition actually partitions left
into two different subsets: one that overlaps right
, and one that does not. The overlapping subset is returned by intersection
, while the non-overlapping subset is returned by difference
. This also shows why the order of the arguments matters with difference
but not with intersection
: the overlapping subset returned by intersection
is included in both left
and right, but the non-overlapping subset returned by difference
is unique to left
.
Speaking of symmetry, the final method the Set
class is symmetricDifference
. This method accepts two Set
objects, and returns all the elements that are in exactly one of them. One way to think of this is the difference of the union and the intersection, and that is exactly how I have implemented it:
public static Set symmetricDifference(Set setA, Set setB){ return Set.difference(Set.union(setA, setB), Set.intersection(setA, setB));}
And there we have it! A Python-style Set
built from the ground up using high-school-level Java.
A few closing remarks:
- This code is not optimized in any sense of the word. Its purpose was not to run efficiently, but to be written readably.
- These implementations are not even close to unique. For instance, we could have written
symmetricDifference
to return the union of the differences instead of the difference of the union and the intersection. - To really emulate Python behavior, we should have been hashing our elements instead of simply storing them in an
ArrayList
. We emulated the unpredictable order by randomizing, but ordering of elements of a Pythonset
is arbitrary, not random. - In Python, a single
set
can hold objects of different data types, as long as they are hashable. We could have accomplished this with generics or polymorphism, but that would have distracted from the important behaviors that distinguish theSet
class fromArrayList
. - This version of a
Set
is based on the set found in Python 2, which has since been deprecated. I opted for this older version because the latestset
retains the fundamental behavior, but adds in a few additional features that could have made this post unruly.
For APCSA teachers and students, or Java programmers looking to get their toes wet with Python, I hope this post helped to shed some light on Python set
functionality.