A set is an unordered collection of distinct items. Individual items are called elements or members. All elements are of a common element type E or they are of type Object. Duplicates are not allowed.
A set can be described in words. For example, the set of vowels, or the set of postive integers, or the set of real numbers, or the set of students who scored above 90% on an exam, etc. A set can also be described by enumerating (i.e. S = {A,E,I,O,U}) or by diagramming the elements the set:
Operation | Description |
---|---|
boolean add(E item) | add item to this Set if it is not already in this set and returns true if the item was added. |
boolean add(Set |
add each distinct item of set to this set and returns true if all items were added. |
boolean contains(E item) | return true iff item
is in the Set (i.e., there is an item x in the Set such that x.equals(item)) |
int size() | return the number of items in the Set |
boolean isEmpty() | return true iff the Set is empty |
boolean remove(E item) | remove the item from this Set. If the item is not present, returns false. |
Java's Java's Set is an interface that extends Iterable with similar methods to the operations we have chosen to focus on. It has several additional methods.
TEST YOURSELF #1
Question 1: What should the behavior be for the add methods if any item is already in the set?
Question 2: What other operations on might be useful? Are they fundamental to all Sets? Define them by writing descriptions like those in the table above.
Question 3: What about equals() and hashCode() methods? Must they be overridden for a user-defined set type? If so, how?
In some ways, a Set is similar to a List: both Sets and lists are collections of objects and in both cases you can add or remove items. You can also find out how many items are in either a Set or a List (using the size method). But, a list is an ordered collection that allows duplicate items. While a set is an unordered collection that does not allow duplicate items.
Lists are easily implemented with arrays or linked chains of nodes. But, if a Set is implemented with an array or linked chain of nodes, it may be inefficient (O(N) to O(N2)) to perform many of the most common operations. In fact, just determining if an item being added is a duplicate will become a linear O(N) operation. For that reason alone, Set implementations will need to use an internal data structure that is more efficient for lookups.
A disadvantage of a Set compared to a List is that the order elements are produced for an iteration is not deterministic. Note: it may be consistent within your specific implementation, but there is no reason to believe that the same implementation on a different platform or a different implementation would or should produce the same iteration order. It is a conceptual error to write code that expects an iteration of a Set to produce its elements in a given order.
TEST YOURSELF #2
Question 1: Assume that variable words is a Set containing k strings, for some integer k greater than or equal to zero. Write code that creates a new Set named plural that contains the plural of each word. For this question, assume that to create a plural word, you simply append an s to the end of the word.
For example, if words contains "happy","birthday","to","you" before your code executes: then after your code executes, the two sets would be:
words = {"happy", "birthday", "to", "you"}
plurals = {"happys", "birthdays", "tos", "yous"}
Question 2: Assume that variable letters is a Set containing some letters. Write code that removes all elements found in the set vowels from letters.
There are several operations that can be performed on or between sets. A class that implements the Set interface can define methods for these additional operations, or external static methods can be defined. If the methods are external, they require that both sets be passed in as parameters.
A single student can not be a member of more than one of these disjoint sets.
Now let's consider which internal data structure to use for implementing a Set. An array or linked list of nodes is perfectly fine for the basic data structure options of a Set. However, something as simple as not allowing duplicates requires a lookup. If we use a sorted array, checking for match is O(log2 N) but keeping the array sorted on each insert is O(N). If we use a linked chain of nodes, the lookup itself is O(N). Because every insert is going to also require a lookup, it is critical that lookups be as efficient as possible.
Using a hash table to store the elements of a Set means that we will have nearly constant time lookups, inserts, and removes. This efficiency is desired, but a hash table implementation has a few downsides.
Using a balanced search tree for the internal data structure will ensure that lookups to check for duplicates will be no worse than O(log2 N) for each lookup. For insert, we can link the new element into the tree in the same operation for no additional cost. Operations such as union and intersection between two sets with size M and N can be O(M log2 N).
There is an advantage to using a balanced search tree. The ability to efficiently perform an inorder traversals means that the elements can be returned in sorted order in linear time. Likewise, the order that elements are returned by an iterator can be deterministic though Set doesn't require this. Finding the items that fall within a range will also be completed in linear time. If these operations are needed than a balanced search tree is a good choice for the internal data structure of a Set.
The Java.util package provides a Set interface (with many more methods than the ones given in our SetADT interface), as well as a number of classes that implement that interface, including the HashSet class and the TreeSet class.
TEST YOURSELF #3
Question 1: Write the public static <E> Set<E> union(Set<E> set1, Set<E> set2) method.
Question 2: Write the public static <E> Set<E> intersection(Set<E> set1, Set<E> set2) method.