UW-Madison
Computer Sciences Dept.

International Collegiate Programming Contest

Advanced Data Structures and String Processing

This session focuses on advanced data structures and string processing. If you are not familiar with the relevant concepts and algorithms, or you need a refresher, you can use the resources below.

Problem Set

  1. Ladice
  2. Just for Sidekicks
  3. Movie Collection
  4. String Matching
  5. Suffix Sorting
  6. Chasing Subs

Note: All problems expect I/O to be done using standard input and standard output. Each solution must be written in a single file. If you are programming in Java, the application class should be called "Main" and should be in the default package.

Resources

Advanced Data Structures

There are a few data structures outside of what you typically learn in a data structures or algorithms course that are sometimes necessary in ICPC. These data structures are not supplied with the standard library of Java or C++. Because of this, we recommend having bug-free implementations in library code. In particular, we already have working implementations of the data structures discussed at https://github.com/atmorgan/ICPC2014.

Binary Indexed (Fenwick) Tree

Occasionally a problem based on an array or sequence will require range sum queries (RSQ); that is, it will ask for the sum of the elements between two indices in the sequence. If the array is static, the problem can be solved in O(1) time per query with O(n) time preprocessing, using dynamic programming (prefix sums). However, if the values of the array can change as part of the problem, then naive solutions will require either O(n) time per update or O(n) time per RSQ.
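
For the static case, the dynamic-programming solution is simply a prefix-sum array. A minimal sketch (class and method names are ours, purely illustrative):

    // Static RSQ via prefix sums: O(n) preprocessing, O(1) per query.
    class PrefixSums {
        private final long[] pre;                       // pre[i] = sum of a[0..i-1]

        PrefixSums(int[] a) {
            pre = new long[a.length + 1];
            for (int i = 0; i < a.length; i++) pre[i + 1] = pre[i] + a[i];
        }

        // Sum of a[l..r], inclusive, 0-indexed.
        long query(int l, int r) {
            return pre[r + 1] - pre[l];
        }
    }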

A more efficient approach is to use a Binary Indexed Tree (BIT), which achieves O(log n) time per update and O(log n) time per query. A BIT is an implicit binary tree stored in an array that uses bit operations in its updates and queries. Because of this, it is particularly fast in practice, and the code is very simple and short. Read more about Fenwick trees in the Wikipedia article here. To learn how to implement your own BIT, take a look at the following TopCoder tutorial. As seen in our library code, BITs can also be extended to work in two or more dimensions, and modified to support range updates, where a contiguous range of indices changes value.
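
Our library has a tested BIT; as a rough sketch of the core idea (this illustrative code is ours, not the library implementation), a point-update / range-sum BIT looks like:

    // Binary Indexed Tree (Fenwick tree): point update and prefix/range sums,
    // both in O(log n). Indices exposed to the caller are 0-based.
    class FenwickTree {
        private final long[] tree;                      // 1-indexed internally

        FenwickTree(int n) { tree = new long[n + 1]; }

        // Add delta to position i.
        void update(int i, long delta) {
            for (int x = i + 1; x < tree.length; x += x & (-x)) tree[x] += delta;
        }

        // Sum of elements in [0, i].
        long prefixSum(int i) {
            long s = 0;
            for (int x = i + 1; x > 0; x -= x & (-x)) s += tree[x];
            return s;
        }

        // Sum of elements in [l, r], inclusive.
        long rangeSum(int l, int r) {
            return prefixSum(r) - (l > 0 ? prefixSum(l - 1) : 0);
        }
    }

The expression x & (-x) isolates the lowest set bit of x, which is what makes both loops terminate after O(log n) steps.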

Segment Tree

Segment trees provide operations similar to BITs but with greater flexibility. In particular, segment trees can also be used for RSQ, but they are capable of much more (at the cost of additional complexity and more code). The basic idea of a segment tree is to use divide and conquer to again provide O(log n) time queries on a contiguous range of indices and O(log n) time updates. The difference between segment trees and BITs is that the query can be any function that can be computed (efficiently) by merging results of queries on smaller intervals. For example, segment trees support dynamic Range Minimum Queries (RMQ); that is, they can find the minimum between indices (i, j) in O(log n) time and update the value at a particular index in O(log n) time. As with BITs, segment trees can be modified to work in two or more dimensions, and can also support range updates via lazy propagation. The Wikipedia article for segment trees describes the computational-geometry segment tree for stabbing queries on a set of intervals, which is typically not needed in programming competitions; instead, you can read about segment trees as used in ICPC in the following article.
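
To make the divide-and-conquer structure concrete, here is a minimal recursive segment tree for RMQ with point updates (an illustrative sketch of ours, not the library implementation):

    // Segment tree over an array a[0..n-1] supporting range-minimum queries
    // and point updates, both in O(log n). Assumes a non-empty input array.
    class SegmentTree {
        private final int n;
        private final long[] min;                       // 1-indexed heap layout

        SegmentTree(long[] a) {
            n = a.length;
            min = new long[4 * n];
            build(1, 0, n - 1, a);
        }

        private void build(int node, int lo, int hi, long[] a) {
            if (lo == hi) { min[node] = a[lo]; return; }
            int mid = (lo + hi) / 2;
            build(2 * node, lo, mid, a);
            build(2 * node + 1, mid + 1, hi, a);
            min[node] = Math.min(min[2 * node], min[2 * node + 1]);
        }

        // Set a[i] = value, then recompute the O(log n) ancestors.
        void update(int i, long value) { update(1, 0, n - 1, i, value); }

        private void update(int node, int lo, int hi, int i, long value) {
            if (lo == hi) { min[node] = value; return; }
            int mid = (lo + hi) / 2;
            if (i <= mid) update(2 * node, lo, mid, i, value);
            else update(2 * node + 1, mid + 1, hi, i, value);
            min[node] = Math.min(min[2 * node], min[2 * node + 1]);
        }

        // Minimum of a[l..r], inclusive.
        long query(int l, int r) { return query(1, 0, n - 1, l, r); }

        private long query(int node, int lo, int hi, int l, int r) {
            if (r < lo || hi < l) return Long.MAX_VALUE;    // disjoint
            if (l <= lo && hi <= r) return min[node];       // fully covered
            int mid = (lo + hi) / 2;
            return Math.min(query(2 * node, lo, mid, l, r),
                            query(2 * node + 1, mid + 1, hi, l, r));
        }
    }

Replacing Math.min with any other associative merge (sum, maximum, gcd, and so on) gives the corresponding range query with no structural changes.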

A warning to ICPC contestants using a segment tree on a contest problem: segment trees can unfortunately be quite slow in practice. It is not uncommon to get Time Limit Exceeded with an implementation whose asymptotic complexity matches the model solution. For this reason, we have optimized our implementation with a few tricks that bring down the constant factors involved in working with segment trees. Take a look here.
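
We will not reproduce those optimizations here, but one widely used constant-factor trick (not necessarily the one in our library) is the iterative, bottom-up segment tree, which avoids recursion and uses exactly 2n array slots:

    // Iterative (bottom-up) segment tree for range-minimum queries with point
    // updates. Illustrative sketch; a common low-constant alternative to the
    // recursive version above.
    class IterativeSegmentTree {
        private final int n;
        private final long[] t;                         // leaves at t[n..2n-1]

        IterativeSegmentTree(long[] a) {
            n = a.length;
            t = new long[2 * n];
            for (int i = 0; i < n; i++) t[n + i] = a[i];
            for (int i = n - 1; i > 0; i--) t[i] = Math.min(t[2 * i], t[2 * i + 1]);
        }

        // Set a[i] = value and walk up to the root.
        void update(int i, long value) {
            i += n;
            t[i] = value;
            for (i /= 2; i > 0; i /= 2) t[i] = Math.min(t[2 * i], t[2 * i + 1]);
        }

        // Minimum of a[l..r], inclusive.
        long query(int l, int r) {
            long res = Long.MAX_VALUE;
            for (l += n, r += n + 1; l < r; l /= 2, r /= 2) {
                if ((l & 1) == 1) res = Math.min(res, t[l++]);
                if ((r & 1) == 1) res = Math.min(res, t[--r]);
            }
            return res;
        }
    }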

String Processing

Working with strings, parsing obscure input, and producing complicated output are all tasks that appear frequently in ICPC. Familiarity with how your language of choice accomplishes basic string manipulation is essential to being competitive in a contest setting.

In addition to basic string manipulation, string tasks in ICPC are often solvable with dynamic programming. The classic problems in this space are edit distance and the related longest common subsequence problem.
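
As a concrete example, the standard edit-distance DP (unit costs for insertion, deletion, and substitution) fits in a few lines:

    // Classic edit distance (Levenshtein distance) in O(|a| * |b|) time:
    // dp[i][j] = minimum number of edits turning a[0..i-1] into b[0..j-1].
    static int editDistance(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) dp[i][0] = i;      // delete all i characters
        for (int j = 0; j <= m; j++) dp[0][j] = j;      // insert all j characters
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = dp[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                dp[i][j] = Math.min(sub, Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1));
            }
        }
        return dp[n][m];
    }

The longest common subsequence problem uses the same table shape with a slightly different recurrence.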

More advanced ICPC problems may require efficient string-processing algorithms or data structures. Although such problems are less common, understanding these algorithms and having implementations handy is necessary to solve them.

Knuth-Morris-Pratt (KMP) Algorithm

The Knuth-Morris-Pratt (KMP) algorithm determines whether a query string Q of length m is a substring of a given string S of length n, and if so, the first index at which a match occurs. Although standard libraries have such string matching functions, the C++ and Java implementations run in worst-case O(nm) time. The KMP algorithm runs in worst-case O(n + m) time by preprocessing Q into a KMP table. This table has uses beyond the KMP algorithm itself, so it is worth understanding how it works. The Wikipedia article for the KMP algorithm explains the details of the algorithm. Furthermore, we have an implementation here.
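
A compact sketch of the algorithm (our own illustrative code; the library version linked above is the tested one):

    // KMP search: returns the first index at which q occurs in s, or -1.
    // Runs in O(n + m) time overall.
    static int kmpSearch(String s, String q) {
        int n = s.length(), m = q.length();
        if (m == 0) return 0;
        // fail[i] = length of the longest proper prefix of q[0..i] that is
        // also a suffix of q[0..i] (the KMP table).
        int[] fail = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && q.charAt(i) != q.charAt(k)) k = fail[k - 1];
            if (q.charAt(i) == q.charAt(k)) k++;
            fail[i] = k;
        }
        // Scan s once, using the table to avoid re-examining characters of s.
        for (int i = 0, k = 0; i < n; i++) {
            while (k > 0 && s.charAt(i) != q.charAt(k)) k = fail[k - 1];
            if (s.charAt(i) == q.charAt(k)) k++;
            if (k == m) return i - m + 1;               // match ends at index i
        }
        return -1;
    }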

Aho-Corasick Algorithm

The Aho-Corasick algorithm is a dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text, matching all of the strings simultaneously. Its running time is linear in the total length of the dictionary strings plus the length of the searched text plus the number of matches reported. Note that because all matches are found, there can be a quadratic number of matches in degenerate cases (e.g., dictionary = a, aa, aaa, aaaa and input text aaaa). The algorithm can be seen as a multi-string extension of the KMP algorithm mentioned above. Check this for the algorithm itself and the implementation.
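
The sketch below builds the automaton over the lowercase alphabet and, for simplicity, only counts the total number of matches rather than reporting their positions (class and method names are ours, not the library's):

    // Aho-Corasick over the lowercase alphabet. Illustrative sketch: counts
    // how many times any dictionary word occurs in the text.
    import java.util.ArrayDeque;
    import java.util.Arrays;

    class AhoCorasick {
        private static final int ALPHA = 26;
        private int[][] next = new int[1][ALPHA];       // trie/automaton transitions
        private int[] fail = new int[1];                // failure links
        private int[] cnt = new int[1];                 // words ending at each node
        private int nodes = 1;                          // node 0 is the root

        // Insert a dictionary word (call before build()).
        void addWord(String word) {
            int cur = 0;
            for (char c : word.toCharArray()) {
                int x = c - 'a';
                if (next[cur][x] == 0) {                // 0 = "no child" (root is never a child)
                    grow();
                    next[cur][x] = nodes++;
                }
                cur = next[cur][x];
            }
            cnt[cur]++;
        }

        private void grow() {
            if (nodes == next.length) {
                int cap = 2 * nodes;
                next = Arrays.copyOf(next, cap);
                for (int i = nodes; i < cap; i++) next[i] = new int[ALPHA];
                fail = Arrays.copyOf(fail, cap);
                cnt = Arrays.copyOf(cnt, cap);
            }
        }

        // BFS to compute failure links and complete the transition table so
        // that every node has an outgoing edge for every letter.
        void build() {
            ArrayDeque<Integer> queue = new ArrayDeque<>();
            for (int x = 0; x < ALPHA; x++)
                if (next[0][x] != 0) queue.add(next[0][x]);
            while (!queue.isEmpty()) {
                int u = queue.poll();
                cnt[u] += cnt[fail[u]];                 // inherit matches ending at the failure node
                for (int x = 0; x < ALPHA; x++) {
                    int v = next[u][x];
                    if (v != 0) {
                        fail[v] = next[fail[u]][x];
                        queue.add(v);
                    } else {
                        next[u][x] = next[fail[u]][x];
                    }
                }
            }
        }

        // Total number of (dictionary word, end position) matches in the text.
        long countMatches(String text) {
            long total = 0;
            int state = 0;
            for (char c : text.toCharArray()) {
                state = next[state][c - 'a'];
                total += cnt[state];
            }
            return total;
        }
    }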

Trie

A trie is a data structure for storing a set of strings. Each edge of a trie is labeled with a letter, and a path from the root to a marked node spells out a string in the set; words with shared prefixes share the same initial path. When used purely as a dictionary, tries are no better than the standard library hash table implementations. Tries can be useful, however, when more information about the prefixes themselves is needed. For more information about tries, consult the Wikipedia article.
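
A minimal sketch of a trie over lowercase letters that also tracks prefix counts, the kind of extra information a plain hash table cannot give you cheaply (names and design are ours, purely illustrative):

    // Trie over lowercase letters supporting insert, membership tests, and
    // counting how many stored words start with a given prefix.
    class Trie {
        private final Trie[] children = new Trie[26];
        private int wordsHere = 0;                      // words ending exactly at this node
        private int wordsBelow = 0;                     // words in this subtree

        void insert(String word) {
            Trie cur = this;
            cur.wordsBelow++;
            for (char c : word.toCharArray()) {
                int x = c - 'a';
                if (cur.children[x] == null) cur.children[x] = new Trie();
                cur = cur.children[x];
                cur.wordsBelow++;
            }
            cur.wordsHere++;
        }

        // Number of inserted words that start with the given prefix.
        int countPrefix(String prefix) {
            Trie cur = this;
            for (char c : prefix.toCharArray()) {
                int x = c - 'a';
                if (cur.children[x] == null) return 0;
                cur = cur.children[x];
            }
            return cur.wordsBelow;
        }

        boolean contains(String word) {
            Trie cur = this;
            for (char c : word.toCharArray()) {
                int x = c - 'a';
                if (cur.children[x] == null) return false;
                cur = cur.children[x];
            }
            return cur.wordsHere > 0;
        }
    }

Both contains and countPrefix run in time proportional to the length of the query string.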

Suffix Array

Suffix arrays are advanced string data structures that rarely come up in ICPC. However, they provide efficient operations that are not possible with elementary approaches. The idea behind a suffix array is to take all the suffixes of a given string S of length n and sort them in lexicographical order. Naively this would take (at least) O(n^2) time, but it can be done in O(n log^2 n) time fairly easily (a sketch of this construction appears after the list below), and optimized to O(n log n) when the alphabet is bounded by a constant; suffix arrays can in fact be constructed in linear time with more complicated algorithms. Once constructed, a suffix array supports the following operations:

  • String Matching (the same task KMP solves) in O(m log n) time, where n is the length of the string the suffix array was built from and m is the length of the query string
  • Longest Common Prefix in O(n) time
  • Longest Repeated Substring in O(n) time
  • Longest Common Substring in O(n) time
For an overview of suffix arrays, consult the following article. We have a limited suffix array implementation available here.
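
The sketch below shows the easy O(n log^2 n) prefix-doubling construction mentioned above (our own illustrative code using a boxed sort for brevity; the library implementation linked above is the tested one):

    // Suffix array construction by prefix doubling in O(n log^2 n) time.
    import java.util.Arrays;
    import java.util.Comparator;

    class SuffixArraySketch {
        // Returns the starting indices of the suffixes of s in sorted order.
        static Integer[] build(String s) {
            int n = s.length();
            int[] rank = new int[n];                    // current rank of each suffix
            int[] tmp = new int[n];
            Integer[] sa = new Integer[n];
            for (int i = 0; i < n; i++) { sa[i] = i; rank[i] = s.charAt(i); }
            for (int k = 1; k < n; k *= 2) {
                final int len = k;
                // Compare suffixes by their first 2k characters via (rank[i], rank[i+k]).
                Comparator<Integer> cmp = (a, b) -> {
                    if (rank[a] != rank[b]) return Integer.compare(rank[a], rank[b]);
                    int ra = a + len < n ? rank[a + len] : -1;
                    int rb = b + len < n ? rank[b + len] : -1;
                    return Integer.compare(ra, rb);
                };
                Arrays.sort(sa, cmp);
                // Recompute ranks from the new order (cmp still sees the old ranks).
                tmp[sa[0]] = 0;
                for (int i = 1; i < n; i++)
                    tmp[sa[i]] = tmp[sa[i - 1]] + (cmp.compare(sa[i - 1], sa[i]) < 0 ? 1 : 0);
                System.arraycopy(tmp, 0, rank, 0, n);
                if (rank[sa[n - 1]] == n - 1) break;    // all ranks distinct: fully sorted
            }
            return sa;
        }
    }

Each round doubles the prefix length used for comparison, so O(log n) rounds of O(n log n) sorting suffice.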

 