Sets


Contents


Introduction

A set is an unordered collection of distinct items. Individual items are called elements or members. All elements are of a common element type E or they are of type Object. Duplicates are not allowed.

A set can be described in words. For example, the set of vowels, or the set of postive integers, or the set of real numbers, or the set of students who scored above 90% on an exam, etc. A set can also be described by enumerating (i.e. S = {A,E,I,O,U}) or by diagramming the elements the set:

circle with letters {A,E,I,O,U} inside

Sets as a Collection Type

Operation Description
boolean add(E item) add item to this Set if it is not already in this set and returns true if the item was added.
boolean add(Set set) add each distinct item of set to this set and returns true if all items were added.
boolean contains(E item) return true iff item is in the Set
(i.e., there is an item x in the Set such that x.equals(item))
int size() return the number of items in the Set
boolean isEmpty() return true iff the Set is empty
boolean remove(E item) remove the item from this Set. If the item is not present, returns false.

Java's Java's Set is an interface that extends Iterable with similar methods to the operations we have chosen to focus on. It has several additional methods.


TEST YOURSELF #1

Question 1: What should the behavior be for the add methods if any item is already in the set?

Question 2: What other operations on might be useful? Are they fundamental to all Sets? Define them by writing descriptions like those in the table above.

Question 3: What about equals() and hashCode() methods? Must they be overridden for a user-defined set type? If so, how?

solution


Sets vs Lists

In some ways, a Set is similar to a List: both Sets and lists are collections of objects and in both cases you can add or remove items. You can also find out how many items are in either a Set or a List (using the size method). But, a list is an ordered collection that allows duplicate items. While a set is an unordered collection that does not allow duplicate items.

Lists are easily implemented with arrays or linked chains of nodes. But, if a Set is implemented with an array or linked chain of nodes, it may be inefficient (O(N) to O(N2)) to perform many of the most common operations. In fact, just determining if an item being added is a duplicate will become a linear O(N) operation. For that reason alone, Set implementations will need to use an internal data structure that is more efficient for lookups.

A disadvantage of a Set compared to a List is that the order elements are produced for an iteration is not deterministic. Note: it may be consistent within your specific implementation, but there is no reason to believe that the same implementation on a different platform or a different implementation would or should produce the same iteration order. It is a conceptual error to write code that expects an iteration of a Set to produce its elements in a given order.


TEST YOURSELF #2

Question 1: Assume that variable words is a Set containing k strings, for some integer k greater than or equal to zero. Write code that creates a new Set named plural that contains the plural of each word. For this question, assume that to create a plural word, you simply append an s to the end of the word.

For example, if words contains "happy","birthday","to","you" before your code executes: then after your code executes, the two sets would be:

words = {"happy", "birthday", "to", "you"}
plurals = {"happys", "birthdays", "tos", "yous"}

Question 2: Assume that variable letters is a Set containing some letters. Write code that removes all elements found in the set vowels from letters.

solution


Set Terminology

There are several operations that can be performed on or between sets. A class that implements the Set interface can define methods for these additional operations, or external static methods can be defined. If the methods are external, they require that both sets be passed in as parameters.

Subset
A set A is a subset of another set B, if all members of A exist in B. Thus, a set is also a subset of itself while a proper subset is a subset that is not also equal to the original set. Example: The set of vowels (smaller circle) is a subset of the set of all letters (larger circle).
set of all letters with set of vowels circled inside that set
Superset
A set B is a superset of set A, if B contains all elements of A and additional elements that are not found in A. Example: The set of letters (larger circle) is a superset of the set of vowels (smaller circle).
Union
The union of two sets A and B is the set of all elements that are in either set A or set B or both sets. A.union(B) is equivalent to B.union(A).
set of all letters union set of numbers = set of letters and numbers inside a single set circle
Intersection
The intersection of two sets is the set of elements that are found in both set A and in set B. A.intersection(B) is equivalent to B.intersection(A). intersection({0,1,2,5,9,8,7,6},{1,2,4,6,7,3,8})={1,2,8,7,6}
Disjoint
Set A and set B are disjoint if there are no elements that are found in both sets. A.disjoint(B) is equivalent to B.disjoint(A).
two disjoint sets {S,X,J,M,V} and {F,W,Y,D,E,O,Z} disjoint
Disjoint Sets
Disjoint sets refer to a collection of elements where each element is found in exactly one of the sets. For example, in the set of students in my class, each student is in one of the following disjoint sets:
  • high-school student
  • undergraduate student
  • graduate student
  • special student
  • visiting scholar

A single student can not be a member of more than one of these disjoint sets.


several disjoint sets of students inside of a set
Union-Find
Union-Find can be used to detect cycles in an undirected graph. Use find to find if two elements are in the same or disjoint sets. Use union to join two disjoint sets into a single connected set. See https://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf for detailed explanation and examples.

Implementation Options

Now let's consider which internal data structure to use for implementing a Set. An array or linked list of nodes is perfectly fine for the basic data structure options of a Set. However, something as simple as not allowing duplicates requires a lookup. If we use a sorted array, checking for match is O(log2 N) but keeping the array sorted on each insert is O(N). If we use a linked chain of nodes, the lookup itself is O(N). Because every insert is going to also require a lookup, it is critical that lookups be as efficient as possible.

Hash Table

Using a hash table to store the elements of a Set means that we will have nearly constant time lookups, inserts, and removes. This efficiency is desired, but a hash table implementation has a few downsides.

Balanced Search Tree

Using a balanced search tree for the internal data structure will ensure that lookups to check for duplicates will be no worse than O(log2 N) for each lookup. For insert, we can link the new element into the tree in the same operation for no additional cost. Operations such as union and intersection between two sets with size M and N can be O(M log2 N).

There is an advantage to using a balanced search tree. The ability to efficiently perform an inorder traversals means that the elements can be returned in sorted order in linear time. Likewise, the order that elements are returned by an iterator can be deterministic though Set doesn't require this. Finding the items that fall within a range will also be completed in linear time. If these operations are needed than a balanced search tree is a good choice for the internal data structure of a Set.

The Java API and Sets

The Java.util package provides a Set interface (with many more methods than the ones given in our SetADT interface), as well as a number of classes that implement that interface, including the HashSet class and the TreeSet class.


TEST YOURSELF #3

Question 1: Write the public static <E> Set<E> union(Set<E> set1, Set<E> set2) method.

Question 2: Write the public static <E> Set<E> intersection(Set<E> set1, Set<E> set2) method.

solution


Summary