International Collegiate Programming
Contest
|
Advanced Data Structures and String Processing
This session focuses on advanced data structures and string processing. If you are not
familiar with the relevant concepts and algorithms, or you need a refresher,
you can use the resources below.
Problem Set
-
Ladice
-
Just for Sidekicks
-
Movie Collection
-
String Matching
-
Suffix Sorting
-
Chasing Subs
Note: All problems expect I/O to be done using standard input and standard
output. The solution to problems must be written in a single file.
If you are programming in Java the application class should be called
"Main" and should be in the default package.
Resources
Advanced Data Structures
There are a few data structures outside of what you typically learn in a
data structures or algorithms course that are sometimes necessary in ICPC.
These data structures are not supplied with the standard library of Java
or C++. Because of this, we recommend having bug-free implementations in
library code. In particular, we already have working implementations of
the data structures discussed at
https://github.com/atmorgan/ICPC2014.
Binary Indexed (Fenwick) Tree
Occasionally a problem based on an array or sequence will require range sum
queries (RSQ); that is, it will ask for the the sum of elements between
two indices in the sequence. If the array is static, the problem can be solved
in O(1) time per query with O(n) time preprocessing, using dynamic programming.
However, if the values of the array can change as part of the problem, then
naive solutions will either require O(n) time per update or O(n) time per RSQ.
A more efficient approach is to use a Binary-Indexed Tree (BIT), which can achieve
O(log n) time update and O(log n) time query. A BIT is a perfect binary tree
built into an array that utilizes bit operations in its updates and queries.
Because of this, it is particularly fast in practice. The code is also very
simple and short. Read more about Fenwick trees in the wikipedia article
here. To learn how
to implement your own BIT, take a look at
the following topcoder tutorial.
As seen by our library code,
BITs can additionally be extended to work in 2 or more dimensions, and
modified to support range-updates, where a contiguous sequence of indices
change value.
Segment Tree
Segment trees provide similar operations as BITs but with greater flexibility.
In particular, segment trees can also be used for RSQ, but they are capable of
a whole lot more (at a cost of additional complexity and more code).
The basic idea of a segment tree is to utilize divide and conquer
to again provide O(log n) time queries on a contiguous range of indices
and O(log n) time update. The difference with segment trees as opposed to
BITs is the query operation can be any function that can be computed (efficiently)
by merging results of queries on smaller intervals. For example, segment trees
are capable of dynamic Range Minimum Queries (RMQ); that is, they support finding
a minimum between indices (i, j) in O(log n) time, and updating a value at a particular
index in O(log n) time. As with BITs, segment trees can be modified to work in
2 or more dimensions, and can also support range updates via lazy propagation.
The wikipedia article for segment trees provides information about the computational-geometry
segment tree for stabbing queries on a set of intervals, which is typically not needed in
programming competitions. However, you can read about segment trees used in ICPC in
the following article.
As a warning to ICPC contestants utilizing a segment tree
for a contest problem, the trees can unfortunately be quite slow. It is not uncommon
to get Time Limit Exceeded on a problem with an implementation that asymptotically
runs in the same time complexity as a model solution. For this reason, we have optimized our
implementation with a few tricks that bring down some of the constants involved with
working with segment trees. Take a look
here.
String Processing
Working with strings, parsing obscure input, and producing complicated output are all
tasks that appear frequently in ICPC. Familiarity with how your language of choice accomplishes
basic string manipulation is essential to being competitive in a contest setting.
In addition to basic string manipulation, often string tasks in ICPC are
solvable with a dynamic program. The classic problems in this space are
edit distance and
the related
longest common subsequence
problem.
In more advanced tasks, ICPC problems may require efficient string-processing algorithms
or data structures. Although less common, understanding these algorithms
and having an implementation handy is necessary to solve these problems.
Knuth-Morris-Pratt (KMP) Algorithm
The Knuth-Morris-Pratt (KMP) algorithm is an algorithm to determine if a query string Q of
length m is a substring of a given string S of length n, and if so, the first index
at which a match occurs. Although standard libraries have
such string matching functions, the C++ and Java implementations run in worst
case O(nm) complexity. The KMP algorithm runs in worst case O(n + m) complexity by
preprocessing Q into a KMP table. This table can have uses for more than just the KMP
algorithm, so it is useful to understand how it works. The
wikipedia article for
the KMP algorithm explains the details of the algorithm. Furthermore, we have an
implementation here.
Aho-Corasick Algorithm
Aho-Corasick algorithm is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all strings simultaneously. The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa). It can be seen as a multi-string extension for the KMP algorithm we mentioned above. Check this for the algorithm itself and the implementation.
Trie
A trie is a data structure for storing a set of strings. Each node of a trie
is a letter, and a path from root to leaf represents a string in the
trie. Words with shared prefixes share the same beginning path.
When used purely as a dictionary, tries are no better than the standard
library implementations of hash tables. Tries can be useful, however, if more information regarding the prefixes
themselves is needed. For more information about tries, consult the
wikipedia article.
Suffix Array
Suffix arrays are advanced string data structures that rarely come up
in ICPC. However, they can provide efficient operations that are not possible
with elementary approaches. The idea behind a suffix array is to determine all the suffixes
of a given string S of length n, and sort them in lexicographical order. Naively
this would take (at least) O(n^2) time, but it turns out it can be done in O(n log^2 n)
time (easily), and optimized to O(n log n) when the alphabet is bounded by a constant
(suffix arrays can actually be constructed in linear time with more complicated algorithms).
Once constructed, a suffix array can achieve the following operations:
- String Matching (same as what KMP does) in O(m log n) time -
where n is the length of the preprocessed suffix array string and m is the length of the
query string
- Longest Common Prefix in O(n) time
- Longest Repeated Substring in O(n) time
- Longest Common Substring in O(n) time
For an overview of suffix arrays, consult the
following article.
We have a limited suffix array implementation available
here.
|