CS 787 Spring 2018 Course Log Lecture 1 1/24/18 Wednesday Handouts: Course blurb, topics list, books on reserve Questionnaire (return Friday 1/26/18) Course web site: pages.cs.wisc.edu/~cs787-1 Stable marriage problem Gale and Shapley, American Math. Monthly, 1962. We discussed a simplified version of their "proposal" algorithm in which there is only one proposal (from a currently unattached man) per round. Main properties: 1. Procedure stops (as soon as each woman has received a proposal) 2. O(n^2) proposals, faster than trying all pairings 3. Men's partners get worse, women's partners get better 4. End result is stable (no man and woman prefer each other to their current partner). Lecture 2 1/26/18 Friday Using induction to design algorithms U. Manber, Comm. ACM, 1998. Where do good algorithms come from? One principle is mathematical induction. Review of step-by-step induction Calculus example: d/dx(x^n) = n*x^(n-1). Use the product rule. Use in algorithm design: To solve an instance of size n, *assume* you can solve instances of size n-1. Example: Largest permuted subset (in the paper) Given f:X->X (X is finite), find a largest S on which f acts like a permutation. 2^|X| subsets, so exhaustive search is not a good idea. Try to reduce size of X by removing elements. Idea: Any y that is not hit by f cannot belong to S. Remove it and solve the problem recursively. E.g. f(x) = x^2 mod 5, on X={0,1,2,3,4} A largest S is {0,1}. Can implement in O(|X|) time using reference counts (see paper). Example: Balance factors in binary trees Height of a binary tree = max # of edges in a root-leaf path Balance factor = height(left subtree) - height(right subtree) Want to compute balance factors of all subtrees Obvious induction won't work, since a tree's balance factor is not determined by balance factors of subtrees. For example, let T_k be the complete binary tree of height k (has 2^{k+1}-1 nodes, and balance factor 0). The trees * * / \ / \ T_2 T_2 T_2 T_3 have balance factors 0 and -1. Fix by strengthening the induction hypothesis: To handle a tree w/ n nodes, assume we can find both balance factor and height for all trees with = 0, then (A1 * ... * An)^(1/n) <= (A1 + ... + An) / n. This is the same as: A1*...*An <= [ (A1 + ... + An) / n ]^n. Cauchy gives an algebraic proof for this (no calculus). If n=1 there is nothing to do If n=2, use the polarization identity: A1 A2 = 1/4 * [ (A1 + A2)^2 - (A1 - A2)^2 ] <= [ (A1 + A2) / 2 ]^2 Use the same idea to handle 4, 8, 16, ... Now use *reverse induction*: if it holds for n+1, it holds for n. This will fill in the gaps and take care of all n >= 1. Let n>=3. Let Abar = (A1 + ... + An)/n. Assume >0, it not, done. A1 * ... * An * Abar <= [ (A1 + ... + An + Abar) / (n+1) ] ^(n+1) = Abar ^ (n+1). Cancel Abar to get desired <= . Note: We descended step by step but Cauchy uses the same idea to go directly from a power of 2 to a number below it. Karatsuba's Subquadratic Multiplication Algorithms ~ 1960: It was believed that the "school" algorithm for multiplying n-bit integers, which uses O(n^2) bit operations, was optimal. Karatsuba found an algorithm with o(n^2) bops. Use polarization identity to reduce to squaring Let n=2^k, and x = x1 * 2^{n/2} + x0, where 0 <= xi < 2^{n/2}. x^2 = x1^2 * 2^n + 2 x0 x1 * 2^{n/2} + x0^2. ^ ^ ^ | | | A B C Compute A, C, D = (x0 + x1)^2. Then B = D - A - C. Reduces squaring an n-bit number to 3 squarings of n/2-bit numbers. Note: Suppose the sum overflows, i.e. x0+x1 = 2^{n/2} + y. Then use (x0+x1)^2 = 2^n + 2^{n/2+1} y + y^2. If S(n) is the number of bit ops needed to square n-bit numbers, S(n) <= 3 * S(n/2) + const * n S(1) = const. (The const*n is for the work of adding, subtracting, shifting.) Since we want a big-O estimate, we can take the constants to be 1, and consider the corresponding equality. Consider the recursion tree. The i-th level has 3^i problems of size n/2^i. Summing up the work by levels, we get S(n) = n * [ 1 + (3/2) + (3/2)^2 + ... + (3/2)^k ] <= 2 * n * (3/2)^k (sum geometric series) = 2 * 3^{log_2 n} = 2 * n^{log_2 3} = O( n^ 1.59... ). The same result holds for any n, since we can just pad by 0's and replace n by 2^ceiling(log_2 n} <= 2n. Further work has improved this bound. The best known result, due to Furer, is barely above O(n log n). Lecture 4 1/31/18 Wednesday Dynamic Programming K. Pruhs, SIGACT News v. 29, 1988, 32-35. Historical origins: Optimization Principles in Physics Principle of least time (Fermat, 1600's): Among all paths from A to B, a light ray will pick for which the transit time is least. Example: Refraction Two media separated by a plane boundary A is in the "fast" medium, B in the "slow" one. The light ray will bend so as to make its path longer in the fast medium. Can find the crossing point using algebra Leads to Snell's Law for refraction angles Similar ideas in mechanics (principle of least action). Feynman's Lectures on Physics, v. 2, has a wonderful lecture on this. Bellman (1950s) exploited this for computational problems Exploited the "principle of optimality:" Each part of an optimal solution must itself be optimal. For refraction example, suppose the crossing pt is X The A-X and X-B paths must be straight. Reduces the search space for finding an optimal path. Dynamic programs usually compute more information than the problem strictly demands. Example: To find a sortest path from s to t in a graph, we must know shortest paths from s to all other vertices. Subproblems are typically solved using a (multidimensional) recurrence relation. Even if we want to systematically solve subproblems, it is not obvious which ones to consider. Pruhs suggests attacking the problem "from above": Make enumeration tree of all feasible configurations Apply pruning rules to reduce the tree to a "small" size Develop a recurrence relation to generate the nodes level by level (from root to leaf). Example: Longest increasing subsequence of a1,...,an. Assume ai are distinct. Full tree has, at kth level, all the subsequences from a1,...,ak. Keep only increasing sequences If two level k seqs have the same length, keep only the one with smallest final element (it can be used in any extension where the other one can). Cuts tree down to O(n^2) nodes, with <= n+1 nodes per level. Let BEST(k,l) = the length l increasing subsequence with smallest final element, which occurs at level k. | BEST(k-1,l) /* don't use ak */ BEST(k,l) = one of | | if this increasing, | BEST(k,l) o ak /* use ak */ We choose the one with smallest final element This gives an O(n^3) algorithm. We can reduce to O(n^2) by storing, for each k,l: location of last element which of the above two choices was used Lecture 5 2/2/18 Friday Collision Algorithms J.M. Pollard, Math. Comp. 1978 Idea: In many situations, agreements between pairs are more numerous than specific kinds of single elements. Example: A sock drawer with m kinds of socks Pick one, look for another to match it ==> O(m) search time Pick socks until two of them match ==> O(sqrt m) Subset Sum Problem Given a1,...,an;b (all positive integers) Find S [ {1,...,n} for which sum_{i in S} a_i = b. IP feasibility version: sum_i x_i a_i = b, x_i in {0,1} 2^n choices for the x_i's Decision version (can we make sum <= b?) is NP-complete Solvable using dynamic programming in O(n b) arithmetic opns Collision idea: split (x1,...,x_n) into left and right halves. Try to solve sum_{i <= n/2} x_i a_i = b - sum_{i>n/2} x_i a_i Enumerate all left sides, all right sides. Sort and look for a match. Size of our sorting problem, N, is <= 4*2^{n/2} So this is an O(N log N) = O(n 2^{n/2}) algorithm. Much better when n is large. Discrete Logarithms (Pollard) Background on number theory: When p is prime, for some q (relatively prime to p), the powers 1,g,g^2,...,g^{p-2}, reduced mod p, exhaust 0,1,...,p-2. [Small example: p=7, g=3] g is called a primitive root or generator for p. If g^x == a mod p, we call x an index or discrete logarithm (base g) for a. Fermat's little theorem: when a,p are relatively prime, a^{p-1} == 1 mod p. (Prove by group theory, or show a^p == a by induction on a.) Therefore, the powers of g make a periodic periodic sequence, with period p-1. For this reason we only care about the index mod p-1. x --> g^x is easy (repeated squaring) Determining x from a is hard (no known poly time algorithm) Pollard's Kangaroo Method Choose hash fn Delta : {1,...,p-1} ---> {0,...,p-2} Let f(x) = x*g^Delta(x). Think of this as an irregular step with a known distance. In fact, log f(x) = log x + Delta(x) (mod p-1). Consider two sequences: 1) x_0 = 1, x_i = f(x_{i-1}), i=1,...,m. 2) y_0 = a, y_i = f(y_{i-1}), i=1,...,m. We hope that for some i,j, we have x_i = y_j. Then the two sequences are coupled and will agree forever. Pollard calls 1) the "kangaroo trap". We don't have remember the entire sequence, just the last element x_m (and its distance from x_0). Think of a hunter hiding at x_m: x_0 --> x_1 --------> x_2 -> x_3 ---- ... ---------> x_m. A second kangaroo that lands in the trap, and has identical jumping behavior, will end up at x_m. This can be detected by comparing each successive y_j to x_m. Analysis (only heuristic) Assume: All samples (except for starting points) are i.i.d. uniform on Zp* = {1,2,...,p-1}. Suppose x_1,...,x_m are distinct. Then Pr[y_j hits trap] = m/n (n = p-1). Pr[y_1,...,y_m avoid trap] = (1 - m/n)^m With m = sqrt(n) this is (1 - 1/sqrt(n))^sqrt(n) ~ e^{-1}, about 0.367. Also (for this m) expected number of duplicate pairs in the list x_1,...,x_m is (m choose 2)*(1/n) ~ 1/2. So with high probability the trap has about sqrt n elements. This gives (heurstically, anyway) a way to compute discrete logs using about sqrt(p) time and space for O(1) residues mod p. Note that exhaustive search would have been O(p). Lecture 6 2/5/18 Monday Linear programming Reference: Lectures 3-4 from R.M. Karp's course, Great Algorithms, Winter 2006. LP = Optimization of a linear function subject to linear constraints. 1939: Kantorovich, industrial production models ... 1947: Dantzig, USAF logistics problems Simplex algorithm Early example: optimal diets (Stigler 1945) Data = list of foods with cost/unit and amounts/unit of various nutrients Want to meet daily requirements at lowest cost With ad hoc analysis, found a "cheap" diet (about $2/day in 2017 dollars) using wheat flour, evaporated milk, cabbage, spinach, navy beans. (Optimal?) Relevance to our course: 1. Can reduce combinatorial problems to LP (e.g. network flow) 2. Structural insight (polyhedral combinatorics) 3. Can use for approximation algorithms 4. Makes papers by Karp, Edmonds,... more accessible Formal development R^n = n-dim'l real vectors inner product, length, angle vector inequalities: <=, < hyperplane: soln set to ax = b half-space: soln set to ax <= b polyhedron: intersection of finitely many half-spaces need not be bounded example: x <= 1, y <= 2, in R^2. polytope: bounded polyhedron Linear programming: Optimizing linear functions on polyhedra e.g. min of cx s.t. Ax >= b. c = row vector x = column vector A = m x n matrix b = column vector May or may not have a solution (e.g. for above polytope, consider max of x+y, min of x+y). Even if solution exists, it need not be unique (consider max of x). Vertices We'd expect to find solutions at "corners" of the polytope. Defn: x in P (polytope) is a *vertex* if there is no line segment a--b within P with x strictly between a and b. Note the "forbidden configuation." Theorems (not proved here) 1. x is a vertex of P iff for some c, x is the unique solution to max cx s.t. x in P. 2. Let x in P, where P [ R^n is the intersection of m half-spaces. Then x is a vertex iff exactly n of the defining constraints hold with equality. Degeneracy: If Ax <= b is a system of linear inequalities (x in R^n), a vertex of the polyhedron they define is called degenerate if >n of the constraints hold with equality. Note that this depends on how we represent the polytope If v is degenerate, must there be a redundant inequality? No -- consider the apex of a pyramid. Degeneracy is important because it can cause the simplex algorithm to misbehave. Lecture 7 1/7/18 Wednesday The Simplex Algorithm Karp, op. cit. LP: min cx s.t. x in P (polytope) P [ R^n given explicitly by linear inequalities Possibly P is empty (infeasible) or unbounded in the direction of decreasing cx (unbounded). Under "normal" conditions, P is the convex hull of finitely many extreme points (vertices): [draw a picture for n=2] Simplex algorithm: descent from vertex to vertex Start with some vertex v While there is an adjacent vertex w that improves objective, v <- w [move there] Return v The trick is to do this with algebra Equality form for LPs: min cx s.t. Ax = b >= 0, x >= 0. A is an m x n matrix of rank m. To put into this form: a) replace unconstrained vble by difference of non-negative vbles b) replace any ax <= b by ax + s = b, where s>=0 is a slack vble c) delete redundant rows from A d) flip signs to make b >= 0 Descent (Pivot) Step: The algebra in Karp's notes is pretty clear on this so we'll just give the highlights. Eqns are [ B N ] [x_B] = b [x_N] Assume B is mxm, invertible. Setting x_N = 0 and solving gives a vertex. Assume for simplicity that x_B > 0. Try to raise some x_j in x_N. Then new x_B = B^{-1}(b - A_j x_j) [A_j = A's j-th column] The idea is that x_j will be a local coordinate for the line segment linking the old vertex to the new one. new obj value = [old obj value] + (c_j - C_B B^{-1} A_j) x_j ^ | if this is >= 0 raising x_j won't help. if this is >= 0 for every j, stop (vertex is optimal). pick a j making the expn in ()'s negative. Let d = B^{-1} A_j. Then new x_B = [old x_B] - d x_j , needs to be >= 0. if d <= 0, we can increase x_j forever, so the problem is unbounded (stop). Suppose some d_i > 0. If the old x_B > 0, we can (strictly!) increase x_j until some x_i in X_B becomes 0. For simplicity, assume this occurs for a unique i. Then we have removed a column from B (the i-th one) and replaced it by one from N (the j-th one). This gives a new vertex. Exercise: Prove that the invertibility of B is an invariant of the algorithm. (This is easiest to prove in the nondegenerate case where x_j increases, but it actually holds in all cases.) Degeneracy The vertex [x_B] is *degenerate* if some entry of x_B is 0 [ 0 ] That is, the number of 0 entries is "too big." This is a special definition, just for the simplex algorithm. In this case, the new x_i could come in at 0, meaning that the line segment we wanted to traverse to improve the objective has collapsed to a point. This can cause cycling: remove x_i, then x_j, then x_i, ... How do we avoid this? Here are three strategies: 1. Perturb the problem. If the perturbation is small enough, you will get the same B at the end. 2. Randomly pick x_j from the available contenders. 3. Use a combinatorial anti-cycling rule, such as the following (due to R. Bland): always pick the smallest index. The Number of Descent Steps This is usually small, but examples are known with an exponential number of pivots (Klee & Minty, 1972). Therefore, simplex is not worst-case polynomial time. Lecture 8 2/9/18 Friday Review of simplex: Using notation of the last lecture, we pick a variable x_j in x_N (the "inactive" vbles, set to 0) and imagine raising x_j. We have new = old + (c_j - c_B B^{-1} A_j) x_j obj obj x_j is *eligible* if the number in ()'s is > 0. If there are no eligible variables, we can stop as the current vertex is optimal. Otherwise, an eligible x_j is raised as much as possible (might be not at all), with the effect of new B = old B with i-th column removed and j-th column added. Invertibility of B is an invariant of the algorithm (exercise). Starting the simplex method Make up a new LP with an "obvious" vertex min sum_i y_i s.t. Ax + I y = b (>= 0) x,y >= 0. Run simplex starting wih the vertex x=0, y=b. If min sum_i y_i > 0, original LP is infeasible. If min sum = 0, then all y_i = 0. Consider the columns that are now in B. i) If they are all columns of the original A, throw away the y's, and run simplex (with the original cost function) starting with these columns. ii) If some columns of B appear in I, they can be "swapped" with columns of A to achieve i). (Exercise for you.) Continue as in i). Duality Every minimizing LP has an associated maximizing one, called its dual. (Vice versa for maximizing LP's) Think of original LP (primal) and its dual as two halves of the same problem. P: min cx s.t D: max yb s.t. Ax = b, x >= 0. yA <= c (y unrestricted) Weak duality theorem: If x,y are feasible for P,D then yb = y Ax = yA x <= cx ^ | (since x>=0) Corollary: If P is unbounded, D is infeasible If D is unbounded, P is infeasible Strong duality theorem: If P has a finite optimum, then D has an optimum, and the two objective functions have the same value. Pf: Run simplex on P, to get c_B c_N [ B N ] [ x_B ] = [ b ] [ x_N ] <- 0 here [ ] min cx = c_B x_B = c_B B^{-1} b Termination ==> for all x_j in x_N, c_j - c_B B^{-1} A_j >= 0, all x_j in x_N. Let y = c_B B^{-1}. Then: 1) y is dual-feasible: yA = y c_B B^{-1} A, and B^{-1} A = B^{-1} [ B | A_j's for j>m] = [ I | B^{-1} A_j's for j>m] so c_B B^{-1} A = [ c_B | c_B B^{-1} A_j's for j>m] <= c_j (j is not eligible) That is, yA = c_B B^{-1} A <= c. 2) y is dual-optimal: By weak duality yb = c_B B^{-1} b = c_B x_B = cx, (so yb is as large as any dual objective can be) We can think of simplex as solving the dual along with the primal (!). Some implementations do this explicitly, by maintaining a representation of B^{-1}. Interpretation of dual problem Forming the dual is a mechanical procedure, which rearranges the data. It is more interesting and useful to ask ourselves what the dual *means*. This is application dependent. Example: optimal diet problem. Suppose the nutrition requirements are met exactly, so we have min cx s.t. Ax = b (>=0), x >= 0 Rows of A represent nutrients (how much is in each food), columns represent foods (how much of each nutrient it contains). Primal expresses the consumer's problem: meet nutritional goals at least cost. Now imagine a supplier of artifial nutrients enters the market. (For simplicity, assume he has enough on hand to sell.) He might charge y_i for 1 unit of the ith nutrient He wants to maximize his revenue yb yA <= c means he is competitive food-by-food. Suppose this is not the case. Then, for some j, y [j-th column of A] > c_j. and the consumer could satisfy a portion of his demands more cheaply by buying the j-th real food. Lecture 9 2/12/18 Monday Set Cover Problem V. Chvatal, Math. of OR, 1979 Setup: Given S1,...,Sn subsets of X = {1,...,m} c_j = cost of Sj (positive) want to choose a set J of indices so that the S_j, j in J, cover X at minimal cost Example: Car buying elts of X are features we'd like: video in back, CD player, ... sets are packages that include various features we can't buy features individually, only as part of a package Decision version of Set Cover is NP-complete (VERTEX COVER is a special case). So we don't expect to solve exactly, and look for approximations. That means (in this case) the sets we pick cover X the cost of these sets is not too far from the minimum cost Example (m=4, n=5): columns are The 0-1 array is called the sets incidence matrix 0 1 0 1 0 1 1 1 1 1 0 0 rows are elements 1 0 1 1 0 0 0 1 costs 3 1 4 5 cost ---- 3/2 1/3 2 5/4 element One thing we could do is just pick sets in decreasing cost order, skipping over any that don't give us any new elements. If we do that here, we incur cost 1 + 3 + 5 = 9. Note that S3 is skipped. Here is a more sophisticated strategy (GREEDY, should be called STINGY). At each step, choose a set that enlarges the collection, at minimal cost per "new" element. On the example above, GREEDY tells us to pick S2 (which covers 1,2,3). Then the costs (per element remaining) are revised to be cost ---------- 3 --- 4 5/2 new element Because elements are effectively being removed from the sets, the costs per element go up, or at least do not decrease. Now the next best choice is S4, which adds 2 elements at cost 5. So GREEDY's cover is S2, S4. Cost is 1+5 = 6. This is an optimal cover: choosing S4 is the only way to cover 5, and any additional set costs >= 1. We don't expect GREEDY to always find the optimum (since Set Cover is NP-hard), but we can ask how well it does compared to the best cover. To answer such questions (for minimizing problems), we need lower bounds on the cost of the optimum. LP duality gives us such lower bounds. Integer program (just like LP, but variables must be integral) for set cover: min sum_j c_j x_j s.t. Ax >= 1 [all 1's column vector] (A = 0-1 matrix) x_i >= 0, integral (Note: no harm in allowing x_i > 1 here.) Relax to a linear program: same, but x_i >= 0 (can be fractional). Intuitively, the LP is allowing the car buyer to purchase packages at a fractional level: say half of the sports package, half of the entertainment package, ... The dual of this is: max y1 = sum_i y_i s.t. yA <= c, y >= 0. What does it mean? We can think of Set Cover as a kind of optimal diet problem where the demand for each nutrient is the same (you need it, but any amount will do), and foods cannot be split (the grocer will not sell you half a cucumber). Using this analogy, y_i is the price for the i-th element. On the example, GREEDY paid (1/3 1/3 1/3 5/2 5/2). Is this dual-feasible? No. If we multiply this by the last column of the incidence matrix, we get 2/3 + 10/2 > 5. However, it can be "shrunk" by a logarithmic factor to become dual-feasible. Theorem: If y denotes the vector of prices paid by GREEDY, then y* = y/H_m is dual-feasible. (H_m = 1 + 1/2 + ... + 1/m.) Proof: Consider the set S = {e1,e2,...,ek} with the elements numbered in the order they were covered by GREEDY. (Resolve ties arbitrarily.) Right before ei was covered, S had >= t - (i-1) uncovered elements. (At most i-1 elements of S could have been covered before ei.) So GREEDY paid at most cost(S) / (t-i+1) for ei. (It might have covered ei as part of another set, but that would have been cheaper, as marginal costs only go up.) Therefore the total cost paid for elements of S is <= sum_{i=1}^t cost(S) / (t - i + 1) <= H_m * cost(S). [Since t <= m.] Suppose the optimum cover is S1,...,Sk. Then GREEDY's cost for all elements is <= sum_{j=1}^k GREEDY's total price for Sj <= sum_{j=1}^k H_m * cost(S) = H_m * [cost of optimal cover]. To get our approximation factor, we apply weak duality: (dual) (primal) . . . ..........x................|==========XX=====X==X=== = = = ^ ^ | | sum of yi* cost of best = GREEDY cost / H_m <= integral soln (also a fractional soln) so [GREEDY cost] <= H_m * [cost of min cover] Lecture 10 2/14/18 Wednesday Is there a better analysis for GREEDY? Integrality Gap Review of Set Cover problem (notation from last lecture) GREEDY's approximation factor [cost of GREEDY soln]/[cost of best cover] is <= H_m. This cannot be improved: Let X={1,...,m}. Let S_0 = X, with cost 1+epsilon and S_i = {i}, cost 1/i. GREEDY will pick S_m,...,S_2,S_1 in that order. To see why, consider the first step. This picks S_m and leaves the same problem but with m reduced to m-1. So cost of best cover = 1+epsilon Cost of greedy cover = H_m, so cost for GREEDY H_m Approximation ratio = ---------------- = --------- optimum cost 1+epsilon ~ (1 - epsilon) ln m. Now let's consider any Set Cover strategy, like GREEDY, that is based on comparing a dual solution to the best integral cover. The best we could hope for is to use the optimal dual value, which (by strong duality) is the cost of the best fractional cover. The performance of such a strategy will be limited by the difference [best integral cover cost] - [best fractional cover cost]. This is called the *integrality gap.* (Note: some authors use the ratio instead of the difference.) [Draw the picture.] A Set Cover example with integrality gap Omega(log m) Background: F2 = {0,1}, with + and * taken mod 2. [Give tables for +, *] F2 satisfies all the laws of high school algebra. Systems like this are called *fields.* [Give examples, such as distributive law, and existence of multiplicative inverses for nonzero elements.] Let V = F2 x F2 x ... x F2 (n times). Elements of V are sequences of 0's and 1's, and they can be considered "vectors" for the system of "scalars" F2. Note: you can do this for any field F. Calculus textbooks usually treat n=2,3 and F = R (the reals). Inner product: vw = sum_i v_i w_i (mod 2). Now we make a set cover problem. Let X = V - 0 (all bit vectors that are not all 0's) |X| = 2^n - 1 := m For each v in X, let S_v = {w : vw = 1} [note: w != 0 so S_v [ X] Each S_v contains 2^{n-1} elements of X Proof: Wolog v_1 = 1. We can pick w_2,...,w_n arbitrarily, in 2^{n-1} ways, and then solve v_1 w_1 + sum_{i >= 2} v_i w_i = 1, to get w_1 = 1 - sum_{i >= 2} v_i w_i = 1. By the symmetry of the inner product, each w in X is contained in 2^{n-1} S_v's. We will assign each S_v a cost of 1. Example: n=2, so m=2^n-1=3. sets _ *1 1* ** 01 1 0 1 elts 10 0 1 1 <--- 3x3 incidence matrix A, 11 1 1 0 in this case symmetric. In general, the number of 1's in each row and column of A is 2^{n-1}. [Can we always make A symmetric?] Let c be an all-1's row vector, and b an all-1's column vector. The LP relaxation for our set cover problem, and its dual, are P: min cx s.t. D: max yb s.t. Ax <= b yA >= c x >= 0 y >= 0. Let x = (1/2^{n-1},...,1/2^{n-1}) and y = x^T (transpose). Then x is primal feasible, y is dual feasible, and they have the common value m/2^{n-1} = 2 - 1/2^{n-1}. Weak duality tells us that x is the best fractional cover. Claim: We cannot cover X with fewer than n of the S_v's. Proof: Choose k= n-k (<= 1) - 0 = nonempty set . So S_1 u ... u S_k != X. Therefore, the integrality gap for this example is n - 2 + 1/2^{n-1} ~ log_2 m. Note that log_2 != ln. Examples are known with integrality gap ~ ln m. As a final note, it has been proved that no polynomial-time Set Cover algorithm has an o(log m) approximation factor, unless P=NP. See ???, Proc. STOC 2013. Lecture 11 2/16/18 Friday New Section: Paths, Flows, and Matchings Shortest Paths Let G = (V,E) be a directed graph. Each edge has a *cost* (>= 0), which we will think of as a distance. For given s,t in V we want a least-cost path from s to t. (For simplicity assume that s can reach t.) Example: Vertices s,a,b,t Edges s->a, s->b, a->b, a->t, b->t. e1 e2 e3 e4 e5 We can represent G by a *signed* incidence matrix A: e1 e2 e3 e4 e5 sign convention: s -1 -1 0 0 0 a 1 0 -1 -1 0 - ---> + b 0 1 1 0 -1 t 0 0 0 1 1 Lots of information about the graph can be read off this matrix. For example, the out-degree of a vertex is the number of -1's in its row. Think of a path from s to t as carrying 1 unit of some commodity from s to t. We let 1 if ei is in the path xi = 0 if not Then, Ax = b, where [-1 ] [ 0 ] b = [...] [ 0 ] [+1 ] The first and last equations say that s is a source and t is a sink. The other equations enforce Kirchoff's current law: if v != s,t the flow into v (number of path edges ending at t) equals the flow out of v (number of path edges beginning at t). Note that these paths need not be simple. Let cj be the cost for ej. As an IP, the shortest s-t path problem is: min cx s.t. Ax = b, xj in {0,1}. Since cj >= 0, we can simplify this to: min cx s.t. Ax = b, xj in Z_{>=0}. (Using extra "flow" will not reduce the cost.) LP relaxation: min cx s.t. Ax = b, x >= 0. This allows the use of fractional paths. As we'll see shortly, these don't improve the optimum. It can be shown (exercise for you) that A (signed incidence matrix for a digraph) is *totally unimodular*. This means that any r x r submatrix has determinant 0 or +- 1. For example, taking rows s,a,t and columns e1, e3, e5 in the above gives us e1 e2 e3 s | -1 0 0 | a | 1 -1 0 | = +1. t | 0 0 1 | Thm: Every vertex of the polytope Ax = b, x >= 0 has integral coordinates. Proof: Use the simplex algorithm. We can pick a cost function for which this vertex is optimal, and then we get (after running the algorithm), a vertex given by [B N][ x_B ] = b, [ x_N ] with x_N = 0 and B nonsingular. Solve B x_B = b by Cramer's rule. The denominators will be det(B), which equals +-1. So the (unique) solution to B x_B = b is integral. Every simple s-t path gives a vertex. Why? Let c_j = 1 if e_j is on the path, and M (very large) if x_j isn't on the path. Then, the vector x, given by 1 if e_j is on the path x_j = 0 if it isn't is the unique minimizer of cx. So it is a vertex. [Can we describe *all* of the vertices?] Dual of the LP: max yb s.t. yA <= c (y unrestricted) For our example we get [y_s y_a y_b y_t] [ -1 -1 0 0 0 ] <= [c1 c2 c3 c4 c5] [ 1 0 -1 -1 0 ] [ 0 1 1 0 -1 ] [ 0 0 0 1 1 ] The dual constraints are associated with edges. Considering a typical column, we have e v --> w ===> -y_v + y_w <= c_e, i.e. y_w <= y_v + c_e (*) This suggests we think of y_v as the shortest distance to v, since (*) says distance to w <= distance to v + cost(v,w). There is a symmetry: if y is dual-feasible, so is the vector y' obtained by adding a constant to each coordinate of y. Put another way, dual feasibility depends only on differences between dual coordinates. So let's normalize by taking y_s = 0. Minty's "analog" solution to the dual: Represent each edge by a piece of string of the appropriate length. Connect these strings at the vertices, and then pull s and t away from each other until some path becomes tight. Dijkstra's shortest-path algorithm: The strategy is to set dual variables, one by one, starting with s. We grow a tree T whose vertices are those whose distance from s is known. Initially: T = {s}, y_s = 0 For (s,v) in E, y_v = c(s,v) All other v get y_v = +infinity. We stop when T includes t. The algorithm maintains a list F of vertices that are one edge away from T, e.g. T: s / \ / p --> v Each v in F has a tentative / \ distance y_v, plus its parent --------- p by which this tentative distance was last assigned Tree growing step: Choose a v in F that minimizes y_v Add p-->v to T For each e = (v,w): If y_v + c_e < y_w /* violation of dual constraint */ then y_w := y_v + c_e /* "relaxation" step */ set w's parent to be v put w in F (if not already there) Observe that the dual variables decrease as the algorithm proceeds. You can think of each dual value as giving the best distance found so far. It can be shown, using induction on |T|, that for every v in T, y_v is the minimum distance from s to v. (This relies on the costs being non-negative.) Lecture 12 2/19/18 Monday Flows in Networks Reference: R.E. Tarjan, Data Structures and Network Algorithms, Chapter 8. Setup: Directed graph G = (V,E) with distinguished source s, sink t. We'll assume s != t. Think of each edge as a link through which some commodity can flow. We give each edge a *capacity* (>= 0). Goal: Send as much as possible (per unit time) from s to t, subject to conservation laws and capacity constraints. Example: 4 vertices, all capacities 1 a ^ | \ / | v s | t \ | ^ v v / b We can send 1 unit along s->a->t and 1 unit along s->b->t, for a total of 2. This is best possible since the capacity out of s equals 2. Formally: c(x,y) is the capacity of the edge (x,y), or 0 if no such edge exists. (Note: Our graph model doesn't allow self loops or multiple edges.) A *flow* is a function f : V x V --> R such that: i) f(x,y) = - f(y,x) [skew-symmetry] ii) sum_x f(v,x) = 0, all v != s,t [Kirchoff current law] Note: Skew-symmetry makes some formulas look nicer. Many authors say that a flow is a function from edges to R. Item ii) is a mass conservation law. To see this, split the sum into the contributions going into v and the contributions going out of v. These must cancel, which is saying that the commodity is neither created nor destroyed at v. Sometimes an additional requirement is included: iii) f(x,y) <= c(x,y) [flow bounded by capacity] Defn: The value of a flow is the net flow out of the source: value(f) = sum_{x in V} f(s,x). Thm: The value also equals the net flow into the sink. Physically evident. Also follows from the next theorem. Defn: If G is a flow network, a *cut* is a partition of V into S (containing s) and T (containing t). ("Partition" means that V = S n T and the intersection of S and T is empty.) E.g. The network above has 4 cuts: S T s a,b,t s,a b,t s,b a,t s,a,b t In general, if there are n vertices, there are 2^{n-2} cuts. The *capacity* of S,T is sum_{x in S, y in T} c(x,y). Note that we only count capacity in the forward direction. Thm: (Flow can be measured across any cut) value(f) = sum{x in S, y in T} f(x,y) This is also physically evident. (Draw a boundary around the vertices of S.) For a calculation to prove this, see Tarjan p. 97. Note: Unlike the definition of capacity, an edge carrying flow in the reverse direction (from T to S) will be included, but with a minus sign (because of skew symmetry). E.g. consider S . T . --------------> . --------------> . --------------> . <-------------- . <-------------- . . If each edge has 1 unit flowing toward the arrow, the net flow from S to T is 1+1+1-1-1 = 1. From the LP perspective, cuts are dual to flows (and vice versa) From now on we consider only flows that satisfy all of i)--iii). In particular, flow <= capacity along any edge. Weak Duality: If f is a flow, C is a cut, value(f) <= capacity(C). Proof: Imagine S . T sign of edge flow from S to T . --------------> + --------------> + --------------> + <-------------- - <-------------- - . . The value of the flow includes the negative terms, and if these are omitted, the sum can only get larger. For each positive term, the edge flow that it represents cannot exceed the capacity of the edge. Summing all these edge capacities gives the capacity of the cut. Strong Duality: (Max-flow / min-cut theorem) In any flow network, max flow value = min cut capacity. We will prove this by means of an algorithm that finds a maximum flow and a minimum cut, simultaneously. You should observe that this shows that finding the maximum flow value is a finite problem, since there is a finite number of cuts. Lecture 13 2/21/18 Wednesday Construction of max flows via path augmentation Flow deconstruction Tarjan op. cit. Review of flow networks (see last lecture). The "obvious" way to compute a maximum flow (repeatedly find s-t paths of positive capacity, and send flow through them) doesn't work. Consider the unit-capacity network from last time: a ^ | \ / | v s | t \ | ^ v v / b If we send 1 unit of flow along s->a->b->t, all s-t paths contain at least one saturated edge, so the algorithm is stuck. Ford and Fulkerson (1950s) realized that the algorithm could continue if we adjust capacities to allow for path flows to cancel. Defn: If f is a flow, the *reserve* capacity for (v,w) is r(v,w) c(v,w) - f(v,w). This is a function on vertex pairs. To see it is the "right" definition, consider one edge carrying flow; we'll assume there isn't an edge in the opposite direction. 0 <= f <= c v ---------> w r(v,w) = c(v,w) - f(v,w) (as we'd expect) r(w,v) = c(w,v) - f(w-v) = 0 -( -f(v,w) ) = f(v,w). That is, we could change the flow along the edge either by making more of it go in the same direction, or decreasing it. If f is a flow, its residual graph R has: same vertices, source, sink edges v-->w whenever r(v,w) > 0. For our example flow above, the residual graph is a / ^ \ v | v s | t \ | ^ v | / b This graph has the s-t path s->b->a->t. If we add 1 unit along this path, we get a maximum flow. Max Flow / Min Cut Theorem: The following are equivalent: 1) f is a maximum flow 2) there is no augmenting path in R 3) value(f) = capacity of some cut S,T Proof: We will show each implies the next one, in cyclic order. 1) ==> 2): Show ~2) ==> ~1) instead. If R has an augmenting path, flow can be increased, so f wasn't maximal. 2) ==> 3): Search in R from s. This will not reach t. Let S = reachable vertices, T = V-S. R cannot have an edge from S to T, or the search would have continued. So all S-T edges are used to maximum capacity, and value(f) = cut capacity. (Why is there no "backwards" flow?) 3) ==> 1): Any bigger flow would violate weak duality. This suggests a method to compute a maximum flow: Start with f = 0. Repeatedly look for an augmenting path in R, and when one is found, increase flow as much as possible along that path. This is due to Ford and Fulkerson. Note similarity to the simplex algorithm: We find something to increase, and make it as large as we can. Theorem: When all capacities are integral, there is an integral maximum flow. (This means all f(v,w)'s are integers.) Proof: Integrality of all f(v,w) is an invariant of the method. Each augmentation increases the flow value by >= 1, so the FF method terminates with a maximum flow. Is FF polynomial time? No. Take the above "diamond" graph and put capacity 1 on a->b, and capacity M on all other edges. Alternating between the two "zigzag" s-t paths requires 2M = 2^{log_2 M + 1} augmentations. Even worse, there are examples (with irrational capacities) where FF fails to terminate. By the way, this shows that FF is not a disguised version of the simplex algorithm. Lecture 14 2/23/18 Friday A strongly polynomial-time maximum flow algorithm Edmonds and Karp, J. ACM, 1972 Restriction for today: The network for the max-flow problem does not have any antiparallel edge pairs: ------> v w <------ This can be enforced by adding extra nodes if we wish. Review of Ford-Fulkerson procedure (see last lecture). This is not deterministic (arbitrary choice of augmenting path). To get a good running time bound, we can be smarter about which path to use. Theorem (Dinits, Edmonds & Karp). If we always choose a path with the fewest number of edges, FF runs in time poly(|V|,|E|). (This is a *strongly polynomial* bound: it does not involve the capacities.) Shortest paths. These can be found in O(|V|+|E|) steps using breadth-first search. This assigns each vertex a level, e.g. ------------------------- | | | | | ---------> a ---> t | / | / |-> s | / level 2 \ v / ---------> b has s at level 0, a,b at level 1, c at level 2. The presence of an edge implies a relation between levels: v ---> w ===> level(w) <= level(v) + 1 (this is just one of the dual constraints for a shortest path problem, in which all the edge costs are 1). Now we let G=G0, G1, G2, ... be the successive residual graphs resulting from shortest-path augmentation. (Part of the problem is to show this sequence is finite!) Let level_i(v) be the level of i in G_i, or +oo if v can't be reached from s in G_i. Lemma 1 (Propn 3 in E&K): For all vertices v, level_{i+1}(v) >= level_i(v) Thus, as the algorithm proceeds, vertices get further away from the source. Proof: Do this for i=1,2,... in turn. We use induction on the distance from s to v. Basis: v=s, whose level never changes (it's always 0). Induction step: Consider a shortest path to v: s ----> ... ----> u ----> v level_{i+1}(v) = level_{i+1}(u) +1 (shortest path) (1) >= level_i(u) + 1 (induction) Now, we consider the edge u ----> v. Case 1: u--->v is in G_i. Then level_i(v) <= level_i(u) + 1 (dual constraint) Case 2: u--->v is not in G_i. Then v--->u was on the ith augmenting path, and became saturated. This path was s ----> ... ----> v ----> u So level_i(v) = level_i(u)-1 <= level_i(u) + 1. In both cases, then (2) level_i(v) <= level_i(u) + 1 Combine (1) and (2) to get the result. Lemma 2 (Lemma 2 in E&K): Suppose, for the edge u-->v, we have u-->v . . . . . . . . . . . . . u-->v j>i. in G_i [not in G_{i+1},...,G_j] in G_{j+1} Then level_j(u) >= level_i(u) + 2. [Proofs of Lemmas 1 and 2 followed a lecture by Jeff Erickson.] Proof: u-->v is on the ith augmenting path (it disappeared) v-->u is on the j-th augmenting path (it disappeared) So level_j(u) = level_j(v)+1 >= level_i(v) + 1 = level_i(u) + 2. Proof of the theorem: Suppose that u-->v is saturated (and thus removed) k >= 1 times. Each of these events, except the last is followed by a reappearance of u--> v. Since level increases by 2 with each reapparance, 2(k-1) <= |V|-1, or k <= (|V|+1)/2. There are 2|E| possible edges (for each edge in G, its reversal might be needed, too). At each augmentation, mark one of the edges that disappears (we don't care which one). If there are more than |E|(|V|+1) augmentations, some edge gets marked (and therefore was saturated) more than (|V|+1)/2 times, by the pigeonhole principle. To summarize: If shortest-path augmentation is used, the number of augmentations is O(|E||V|), and the total work is O(|E|^2 |V|). Lecture 15 2/26/18 Monday Combinatorial applications of network flow Bipartite matching Baseball elimination (K. Wayne, SIAM J. Disc. Math., 2001) Matchings G = (V,E) directed graph A *matching* is a set M [ E with no two edges sharing an endpoint. That is, "bigamy" such as y | | x --- z is forbidden. Is is *perfect* if |M| = |V|/2. (No bigger matching can exist.) Perfect matchings need not exist, even if |V| is even. Example: a tree with only a root and leaves. G is *bipartite* if V can be partitioned into (disjoint) X,Y with all edges linking X to Y. Why are bipartite graphs important? Many real-world situations have an asymmetry, e.g. workers being matched to jobs. Many graph theory problems become easier for bipartite graphs. We can test if G is bipartite (and find suitable X,Y) using breadth-first search. This takes time O(|V|+|E|). There is a "forbidden subgraph" theorem here, too: G is bipartite iff it has no odd cycles. (Cycle means simple cycle: no repeated vertices.) Using network flow, we can find a maximum matching efficiently in any bipartite graph. Proof by example: Start with x1------y1 \ \ \ \ \ x1----- y2 Direct all edges from X to Y, and add new source and sink: x1----->y1 ^ \ \ / \ v s \ t \ \ ^ v v / x1----->y2 All capacities are 1. Using, say, the algorithm from last lecture, we find a max flow that is integral: 1 x1----->y1 ^ \ \ 1 1 / \ v s \ 0 t 1 \ \ ^ v v / 1 x1----->y2 1 The X-Y edges carrying flow form the matching. Note that bigamy is prevented by the capacity constraints on the "new" edges. This matching is maximum size: If there were a larger one, we could have found a flow of greater value. Suppose |S|=|Y|=n. Then since value(f) <= n, at most n flow augmentations are needed. Each is O(n^2), so the total work is O(n^3). I believe that versions of this technique go back to the 1930s. For a purely graph-theoretic presentation, see Section 5.4 of Bondy and Murty, Graph Theory with Applications. An application Many theorems of graph theory give sufficient conditions for a perfect matching. Here is one. Theorem: Let G be bipartite with |X|=|Y|=n. Suppose, for some d >= 1, each vertex has d neighbors. Then G has a perfect matching. Proof: In the resulting flow network, there is a fractional flow of value n: just put flow 1/d on each X-Y edge. Then, there is an integral flow of the same value, which gives the perfect matching. Baseball Elimination Imagine a baseball league in which teams play only within the league. No ties are allowed (as in US professional baseball). The pennant is won by the team that wins the most games, or shared if there is more than one such team. Problem: It is late in the season. Determine, from the record so far, and the remaining schedule, if a given team can win the pennant. Here is an example (n=4): wins wi games left sum remaining schedule wi gi wi + gi ATL PHI NYM MON ATL 83 8 91 - 1 6 1 PHI 79 4 83 1 - 0 3 NYM 78 7 85 6 0 - 1 MON 76 5 81 1 3 1 - (The losses accrued have been suppressed.) Any team that cannot win as many games as the league leader (defined by number of wins) is clearly eliminated. For example, Montreal cannot win. But this criterion is not sufficient (consider Philadelpha). A network flow model. Suppose there are g_ij remaining games scheduled between i and j. We can think of these as "wins" to be distributed between i and j. This suggests a flow network, one piece of which is: (i) ^ \ / \ oo / \ / v s -------> {i,j} t g_ij \ ^ oo \ / \ / v / (j) (The values of g_ij can be read off the schedule.) We include such a subgraph for every pair of teams.) Consider team k. If it can finish first at all, it can do so by winning every one of its remaining games. So let w = wk + gk, the maximum number it could end up having won. Team k is still a contender if there is a future in which no other team wins too many games. So we amend the network by adding team-to-sink capacities: (i) ^ \ / \ w - wi oo / \ / v s -------> {i,j} t g_ij \ ^ oo \ / w - wj \ / v / (j) If the flow out of i is within capacity, the number of future wins by i (call this xi) must satisfy xi <= w - wi, that is, wi + xi [# future wins by i] <= w . Theorem: Team k is viable iff the above network has an integral flow of value equal to the total number of remaining games. Note: It is possible to use a smaller network, from which team k has been eliminated. In that case, we want the flow value to be the number of remaining games not involving k. Wayne's paper has another interesting theorem. Suppose that, for two teams i and j, wi + gi <= wj + gj. Then, if j is eliminated, so is i. This means we can determine all eliminated teams with O(log n) max-flow computations. Even faster algorithms are possible (see the paper). Lecture 16 2/28/18 Wednesday Matchings and Coverings Ko"nig's Theorem (Strong duality in bipartite case) Review concepts of matching and vertex cover. Remember that we start with an undirected graph G = (V,E). Thm (weak duality): If M is a matching and C is a vertex cover, then |M| <= |C|. Proof: For every edge in the matching, at least one of its edges must be in the cover. This inequality can be strict: A triangle has |M| <= 1, |C| >= 2. Thm (constant factor approximation for vertex cover): |smallest vertex cover| <= 2 * |largest matching| Pf: Let M* be a martest matching: ... o ===== o ... ... o ===== o ... ... ... o ===== o ... Every edge is either in M* or adjacent to a vertex in it. (If not we could extend the matching) So the 2|M*| matched vertices cover all the edges. The minimum cover is no larger than this. Note: For this proof, M* need only be maximal in the sense of set theory (no proper superset is a matching). This gives a simple "greedy" 2-approximation for the minimum vertex cover. 2 is the best possible constant. It is attained by a graph consisting of n disjoint triangles. For a connected example, consider K_n, which has |max matching| = floor(n/2), |min vertex cover| = n-1 . Thm (Ko"nig, 1931): If G is bipartite, |largest matching| = |min cover|. Pf: Assume G has no perfect matching (if it does, we can just use X). Use a unit-cost flow network to compute a largest matching. Consider R, the residual graph for the resulting maximum flow. X: Y: W_X W_Y s t (edges not shown) D_X D_Y W_X ("wet") nodes in X are reachable in R, W_Y ("dry") nodes are not. Same for W_Y and D_Y. Claim: C = D_X u W_Y is a vertex cover. (How to remember this: nodes in X tend to be wet, nodes in Y tend to be dry. Nodes in C are on the "wrong" side.) Proof of claim: An uncovered edge of G, x --- y, would have x in W_X, y in D_y. So in R, x ---> y is not possible (y would be wet) Is x <--- y possible? No. This would mean x---y is in the matching, so s --> x is saturated (in the original network) Since x is wet, it got wet through some y' in W_Y. So the residual graph has x <--- y', and x is bigamous: x --- y' are both matching edges x --- y Now we check the size of C: X: Y: * W_X *W_Y s t (edges not shown) D_X* D_Y * * Each x in D_X is matched (if it weren't it would be wet, via s). Its partner is in D_Y (if not, x is wet). Each y in W_Y is matched (the search didn't reach t) and its partner is in W_X (where it got wet from). So |M| >= |D_X| + |W_Y| = |C|. By weak duality, |M| = |C|. Note: this pf gives a particular vertex cover from a given matching. Lecture 17 3/2/18 Friday Hall's Marriage Theorem Weighted Bipartite Matching (Assignment Problem) A recurring theme in graph theory is to find sufficient conditions for a graph to have a "nice" subgraph. Here is one by Philip Hall, from 1935. Thm: Let G be bipartite, with |X|=|Y|. G has a perfect matching iff for all A [ X, |N(A)| >= |A|. (N(A) denotes the set of neighbors for the vertices in A.) Proof: (=>) is clear, since each a in A is matched to a neighbor. (<=): Suppose G has no perfect matching. Build a flow network with the usual edges, but now with infinite capacity on each x --> y edge. The "outer" capacities are still 1, to prevent bigamy. Find a maximum flow, and from it, an s-t cut S,T of capacity < n. The picture is: X: Y: [ ] [ ] r edges Let A = X n S [ ] [ ] from here B = Y n S s [ ] [ ] to t [ ] [ ] ---|-|---------------------------|-|-|- v v [ ] ^ [ ] v v v [ ] | [ ] l edges [ ] no X to Y [ ] from s [ ] edges [ ] to here [ ] cross the [ ] [ ] cut here [ ] t Observe r+l < n, so r < n-l. N(A) [ B, so |N(A)| <= |B| = r < n-l = |A|. The constructive proof of Hall's theorem identifies an "obstruction" to the existence of a perfect matching. A more general version holds (exercise for you): If |X| <= |Y|, there is a matching for X iff the same condition holds on subsets of X. Assignment Problem This is one of the all-time classics of combinatorial optimization. We have n workers, n jobs. a_ij is the cost of assigning worker i to job j. We (management?) want to minimize sum_sigma a_{i, sigma(i)} over the n! permutations of 1..n. Without loss of generality, the a_ij are non-negative. IP version: min sum_ij a_ij x_ij s.t. sum_j x_ij = 1, all i sum_i x_ij = 1, all j x_ij in {0,1}. This looks really hard, but (as we will see soon), all vertices of the "assignment polytope" (same, but with x_ij >= 0) are integral. This is the Birkhoff-von Neumann theorem. Alternate statement of B-vN: Any doubly stochastic matrix (entries >= 0, row and column sums are 1) is a convex combination of permutation matrices (0-1 matrices with exactly one 1 in each row and each column). This suggests there is an efficient algorithm. Assignment Problems with Integer Data Suppose a_ij >= 0, integral. Key observation: Adding a (possibly negative) constant to the entries in any row or column doesn't change the identity of the optimal matching. The algorithm will work with an nxn matrix, and try to decrease the entries (a few may go up). All entries stay non-negative. At the end, there will be a matching of 0's, which gives the optimum solution. First let's translate our bipartite graph ideas into matrix language. Edge = position i,j with a 0 Matching = set of non-attacking rook positions Vertex cover = set of lines (rows or columns) Consider the matrix 3 4 7 1 2 5 3 1 1 Decreasing row and column entries, we get 3 4 7 3 4 7 0 2 5 1 2 5 --> 0 1 4 --> 0 2 5 2 0 0 2 0 0 2 0 0 Think of the 0's as indicating edges in a bipartite graph, and find a minimal vertex cover (indicated by *'s); o -- o* <- column 1 / / o o / / row 3 -> *o -- o On the matrix, this is | 0 2 5 0 2 5 -2-0-0- | Let delta = 1 be the smallest uncovered entry. To create a new 0 in the uncovered part of the matrix, we can: subtract delta/2 from each uncovered row add delta/2 to each covered row do same for the columns. This is called a *pivot*. For our matrix delta=2 so we get 0 0 3 0 0 3 4 0 0 which now can have 3 non-attacking rooks at (1,1), (2,2), and (3,3). Going back to the original numbers, the optimum assignment cost is 3+2+1 = 6. Thm. (Kuhn?) If the a_ij are integral, the algorithm terminates. Proof: Show that each pivot reduces the sum of the entries. Suppose there are n1 horizontal, n2 vertical lines. Let n1' = n - n1, n2' = n - n2. n1 n2 entries go up by delta, n1' n2' entries go down by delta. Net change (in units of delta) is: n1 n2 - n1' n2' = - n [n - (n1+n2)] (algebra) < 0 (since n1+n2 < n) So the total number of pivots is bounded by the initial sum, which is <= n^2 max{a_ij}. This is a *pseudopolynomial* time bound -- something that looks like polynomial time, but really isn't, because it is not polynomial in bit length. Lecture 18 3/5/18 Monday Strongly Polynomial Time Bound for Assignment Problem (Due to J. Munkres, 1957.) Review of Assignment Problem (last lecture) Kuhn's "Hungarian" Algorithm: while there's no size n rook placement (matching) on the 0's cover all 0's with the smallest number of lines (rows, columns) delta := minimum uncovered entry (> 0) subtract delta from uncovered entries add delta to covered entries Final rook placement gives the best assignment Example: 1 2 3 10 | delta = 1 v --0--1-- 2 9 | delta = 2 v | 0 1 0 8 | | delta = 1 v 0 0* 0* 8 Review of: Using max flow to find a largest matching Ko"nig's theorem K-cover: dry X-vertices and wet Y-vertices in residual gph Theorem: If the Hungarian algorithm uses the K-cover, the number of iterations is poly(n). First, translate into matrix terms: X-vertices == Rows of the matrix Y-vertices == Columns of the array G (bipartite) is the graph formed from the 0 locations On the matrix, the K-cover consists of dry rows and wet columns: (wet) | | | --------------- (dry) --------------- | | | | | | | | | (Remember that "dry" and "wet" refers to the last search, in the residual graph.) Lemma: After each iteration, either: max matching is larger or set of wet nodes becomes larger Let's see this in action, for our example. Vertices put between braces are wet. 0-graph G and its flow ntwk 1 2 o o s -->o o--> t 3 10 o o -->o o--> first flow is 0, its residual gph is s -->[o] o--> t -->[o] o--> so the cover is empty pivot using delta=1 to get 0 1 o---o s -->o-->o--> t 2 9 o o -->o o--> flow = 1 along top. residual gph is s <-- o <--o <-- t -->[o] o --> cover is the top row; the 2 9 row is uncovered. pivot using delta = 2 to get 0 1 o---o s -->o-->o--> t / ^ 0 7 / / o o -->o o--> flow unchanged. residual gph is s <--[o]<-[o]<-- t ^ / -->[o] o --> cover is the first column. pivot using delta = 1 to get 0 0* Has perfect matching (indicated by *'s) 0* 7 Proof of claim: Singly covered 0's are unaffected by pivoting. Consider the doubly covered 0 entries: (wet) | | | -0------------- (dry) ---0----------- | | | | | | | | | None can be a matching edge, since it would mean R has dry wet i <----- j Therefore, they are all like this in R: dry wet i -----> j In particular, the search in R did not go through these edges. Pivoting will remove these but since they were inessential to the search, no wet node becomes dry. Finally, consider a new 0 entry created by pivoting: (wet) j | | | --------------- (dry) --------------- | | | i | | | 0 | | | wet dry Its edge in R must be i ----> j so one more vertex becomes wet. If the connection from j to t is intact in R, flow (and hence matching size) is increased. Thm. If K-covers are used, the number of pivot steps is O(n^2). Proof: The matching can be enlarged <= n times, and in between, the number of wet nodes increases <= 2n times (since s is wet to start, and 2n+2 wet nodes indicates an augmenting path). So <= 2n^2 pivots. Each pivot requires O(n^2) work, and the cost to search R is also O(n^2), the total run time is strongly polynomial: O(n^4). The algorithm can be implemented to use O(n^3) steps. Unbalanced Matching Problems Let G = K_{2,3} with these weights: C D E A 3 1 4 B 1 5 9 We want the best assignment for A,B. To solve this, introduce a new vertex X with uniform costs in its row: C D E A 3 1 4 B 1 5 9 X 10 10 10 Solve the resulting assignment problem, and use the results found for A,B. This works because X can match any of A,B,C, and all choices have the same cost. Lecture 19 3/7/18 Wednesday Primal-Dual Method This is a "master plan" for the design of combinatorial algorithms. It has been very influential, particularly in the area of approximation. The idea: Express problem as LP Reduce so a sequence of "simpler" LPs The simpler LPs are often combinatorial and can be solved efficiently. Consider P: min cx s.t. D: max yb s.t. Ax = b (>=0) yA <= c x >= 0 (y unrestricted) Complementary slackness (we won't prove this): Feasible x,y are optimal iff y(Ax-b) = 0 (yA-c)x = 0 (each is orthogonal to the other's residual). The primal-dual method works by finding better and better dual feasible points: y -> y' -> y'' -> ... until the best y is reached, and we can determine corresponding x. Failure to find the x allows us to "improve" y. The Basic Cycle Start with dual-feasible y (e.g. y=0 if c>=0). Plug into second comp slax equality, which is easy to solve: yA-c = (0...0 **...* ) <0 here 0 = (yA-c)(x') <-- vector of r unknowns, anything >= 0 works (0 ) <-- forced Which x' will also satisfy the first comp slax condition? Let A' = first r columns of A. Our task reduces to solving Ax' = b, x' >= 0. As in the first phase of the simplex algorithm, we do this by solving another LP: RP: min sum_i w_i s.t. (A' I) (x') = b (w ) x', w >= 0 w is a vector of r slack variables. Think of sum_i w_i as measuring how far x' is from being feasible. RP stands for "restricted primal" (it isn't really, we have added some new variables). In many cases, RP is unweighted even if the original problem had weights. If the min = 0, we're done as (x') is primal feasible. (0 ) Otherwise, consider the dual of the new LP, which is DRP: max zb s.t. z(A' I) <= (0...0 1...1) (z unrestricted) DRP is "dual of restricted primal". The constraints can also be written as ZA' <= 0, z_i <= 1, all i. Let z solve this. By strong duality, zb = sum_i w_i > 0. We can use z to improve our original dual. Compare zA = z(A' A'') = (*...* ?...? ) <= 0 signs unknown to yA - c = (0...0 *...* ) < 0 Since the last part of yA-c consists of strictly negative entries, there will be some lambda > 0 for which ( lambda z + y ) A - c <= 0. Choose the largest possible lambda for which this is true, replace y by y' = lambda z + y, and continue. y' is a better dual vector, since y'b > yb. Comment: This is a lot to swallow. Every problem now generates *four* linear programs: P, D, RP, DRP. It is useful to see this in action. Transportation Problem Consider a small (2x2) network, with transportation costs as indicated: 0 (1) A ------ C (2) \ / 2 \ / \/ /\ 3 / \ / \ (3) B ------- D (2) 0 The parenthesized numbers indicate supplies (for the producers A,B) and demands (for the consumers C,D). Note that total supply equals total demand. We can also give the data for the problem in a table: 2 2 (demands) 1 0 2 3 3 0 (supplies) ^ | transportation costs We want a shipping schedule that meets the supply and demand constraints, and has minimal cost. The assignment problem is the special case of n producers, n consumers, and supply and demands all equal to 1. Write this as an equality-form LP: c x A x b min [0 2 3 0][x_11] s.t. [1 1 0 0][x_11] = [1] [x_12] [0 0 1 1][x_12] [3] [x_21] [1 0 1 0][x_21] [2] [x_22] [0 1 0 1][x_22] [2] x_ij denotes the amount shipped from producer i to consumer j. Since c>=0, y=0 is dual feasible, and yA - c = -c = [0 -2 -3 0] So we try to solve (yA-c)x = 0, which is 0 = [0 -2 -3 0] [x_11] <-- free [x_12] =0 [x_21] =0 [x_22] <-- free The restricted primal (you should write this out) is a max-flow problem: oo A -------> C w1 ^ f = x_11 \ 2 w3 1 / \ / v s t \ ^ 3 \ / 2 w2 v f = x_22 / w4 B -------> D oo Capacities are indicated, and the wi's are slack variables for unused capacity. Solving this by inspection, we get x_11 = x_22 = 1, w1 = 0, w2 = 1, w3 = 1, w4 = 0. sum w_i = 2. The restricted primal's dual is: max z1 + 3 z2 + 2 z3 + 2 z4 s.t. z1 + z3 <= 0 z2 + z4 <= 0 all zi <= 1 with solution z = (-1 1 1 -1). (It's optimal because the objective value is -1 + 3 + 2 - 2 = 2.) DRP can be viewed as the selection of a cut. Consider the residual graph for our flow: <------- A -------> C z1 = -1 ^ \^ z3 = +1 / \\ / v\ s t ^\ ^ \\ / z2 = +1 \v / z4 = -1 B -------> D <------- This gives the cut S = {s,B,D} and T = {A,C,t}. Note that +1 values go with edges inside the cut sets, and -1 values go with edges crossing the cut. In general, if we use this rule, we have (considering only edges from s or to t) cz = sum_{e inside cut} cap(e) - sum_{e across cut} cap(e) = sum_{all e} cap(e) - 2 * sum_{e across cut} cap(e) so max cz = total edge capacity - 2 (min cut capacity) = total edge capacity - 2 (max flow value) = sum of unused capacities We then get zA = [0 -2 2 0] yA - c = [0 -2 -3 0] From the third entry, 2 lambda - 3 <= 0, so max lambda = 3/2.3 Next y is y' = (3/2)z + 0 = [-3/2 3/2 3/2 -3/2]. For the next iteration (you should fill in details) y'A - c = (0 -5 0 0) so 0 = [0 -5 0 0] [x_11] <-- free [x_12] =0 [x_21] <-- free [x_22] <-- free Next RP is the max-flow problem oo f = x_11 A ------> C w1 ^ ^ \ 2 w3 1 / / \ / f = / v s x_21 / t \ / ^ 3 \ / / 2 w2 v / / w4 B ------> D f = x_11 oo with solution x_11 = 1, x_21 = 1, x_22 = 2, and all wi = 0. So we're done, and the optimal shipping schedule is 1 from A to C 1 from B to C 2 from B to D (cost 1*0 + 1*3 + 2*0 = 3.) Lecture 20 3/9/18 Friday Primal-Dual Approximation Algorithms D. Williamson, Math. Programming B, v. 91, 2002, pp. 447-478 Primal-Dual Method (for exact optimization): Repeatedly find better and better dual solutions. If a dual solution is not optimal (we can't solve the complementary slackness equalities), we solve another linear program to find a "direction" along which the dual can be improved. Earliest example: Kuhn's Hungarian algorithm for the assignment problem. Many combinatorial algorithms fit into this framework. Primal-Dual ideas can also be used for approximation. Hitting Set Problem E = set of n elements c_e (>=0) is the cost of element e Sets T_1,...,T_m [ E Find min-cost S [ E such that S n T_i is nonempty, all i. IP version A = 0-1 incidence matrix for the sets (each row represents a set). c = cost vector, b = all 1's "target" vector x = vector of non-negative integers. Indicates a set S (e is in if x_e > 0, out if x_e = 0). In practice x is a 0-1 vector, as there is no need to use more than one copy of any element. min cx s.t. Ax >= b x >= 0 (integral) This is a *covering* problem. x wants to be small, but it has to be big enough to touch all the given sets. Inequalities are associated with sets: sum_{e in T_i} x_e >= 1. Mathematically identical to the set cover problem (use columns of the matrix A to represent sets). Hitting Set can naturally express many other problems, for example: 1) Vertex Cover. E = vertices of an undirected graph sets are the edges {v,w} cost of each vertex is 1 This shows that Hitting Set is NP-hard, as Vertex Cover is. 2) Shortest s-t Path. E = edges in a graph (undirected) sets = crossing edges for s-t cuts. If c>0, the min-cost set must be an s-t path, which hits all specified sets. Note that there are exponentially many of these. (Williamson mentions many others.) Primal-Dual Approximations Relax the IP to LP (allow any x >= 0). The dual of the LP is max sum_i y_i s.t. yA <= c y >= 0 y wants to be large. Think of "packing" y values into the "container" given by yA <= c. We have a dual variable y_i for each set. Dual constraints are associated with elements: (y1...yn) A_e <= c_e ^ | column listing which sets element e is in The Strategy _ Want to find x bar (primal feasible and integral) and y (dual feasible) for which _ c x <= alpha yb (with alpha "small"). If we can to this we will have (for the least-cost hitting set x^) _ yb <= c x^ <= c x <= alpha * yb So _ c x <= alpha * yb <= alpha * cx^ (by weak duality). We say the approximation factor is <= alpha. Hochbaum's Algorithm (1980 -- published 1982) Find dual optimal y* S := {e : y* A_e = c_e} (elements whose dual constraint is tight). Claim: S is primal feasible. Follows from complementary slackness: y(Ax - b) = 0 (1) (yA - c)x = 0 (2) (These are necessary and sufficient conditions for optimality.) Let x* denote a primal optimal solution (might be fractional). By (2) tight < 0 (y*A - c)x* = (0 ... 0 * * ... *)( x_1* ) = 0. ( x_2* ) ... ( x_2* ) ( 0 ) ... ( 0 ) Since x* is feasible, Ax* >= 1. Note x*_e > 0 iff e is in S. We can assume no component of x* exceeds 1. _ Round each component upward to get incidence vector x for S. Cost estimate: Let's assume |T_i| <= alpha for each i. sum_{e in S} c_e = sum_{e in S} y* A_e = sum_{elts e in S} sum_{sets T_i containing e} y_i* = sum_{sets T_i} sum_{elts e in T_i} y_i* = sum_i |T_i n S| } y_i*. le alpha sum_i y_i*. For vertex cover, alpha <= 2, so we get a factor 2 approximation for this NP-hard problem. Bar-Yehuda and Even's Improvement (1981) Don't solve the dual LP. Just raise individual y_i as needed. y = 0 S = empty while S doesn't hit every T_i choose k with S n T_k empty (we "missed" T_k) increase y_k until the inequality for some e in T_k becomes tight S := S u {e} This is very fast. Still works, since the proofs for feasibility (S hits every T_i) and the cost bound only used tightness of of the dual constraints. Note that we increase y by lambda z = lambda (0 ... 0 1 0 ... 0). This is similar to P-D optimization, but we did not try to find the "best" z. Lecture 21 3/12/18 Monday New Section: Greedy Algorithms and Matroids Greed Before Edmonds Alan Hoffman, On Greedy Algorithms that Succeed, in Ian Anderson, ed., Surveys in Combinatorics, 1985, 97-112. Is it always best to make "locally advantageous" choices? No. Printers are an example. Initial outlay (cost of device) is insignificant compared to ongoing expense (cost of toner). Greedy strategies are easy to implement and often used as preliminary solutins. If we are going to use them, we need theory to guide us. A greedy strategy for linear programming Let c >= 0. We want to solve max cx s.t. Ax <= b, x >= 0, <---- call this P Can we order the variables x1, x2, ... and then raise each one to the maximum possible, without regret? x1^ = largest x1 s.t. for some x2,...,xn, (x1^,x2,...,xn) in P x2^ = largest x2 s.t. for some x3,...,xn, (x1^,x2^,...,xn) in P etc. [Draw picture in R^2. Note that for this to work, the LP is pretty special.] Earth Moving Problem (Monge, 1781) Given two regions A,B [ R^2. Dirt to be dug out from A, placed on B. (Assume constant depth.) We want precise directions telling us where each load of dirt goes. Monge observed that an optimal directive has no crossings: I --------- J X R -------------- S Indeed if R sent 1 load to J and I sent 1 load to S, it would have been cheaper (by the triangle inequality applied to triangles IXJ and RXS) to use I-J and R-S. Monge's Inequalities Consider a maximizing transportation problem: b1 b2 ... bn demands supplies a1 ... c_ij is profit gained from 1 shipment am from i to j. We'll assume total supply = total demand. LP: max sum c_ij x_ij s.t. x_ij row sum = ai, x_ij column sum = bj x_ij >= 0. As we know, choosing "most profitable" edge first need not work: e.g. assignment problem 100 99 w/ demands = supplies = 1 99 1 Defn: The array c_ij satisfies Monge's inequalities if i <= r, j <= s ==> c_ij + c_rs >= c_is + c_rj (Note that LHS - RHS is similar to a determinant.) Note that this depends on how rows and columns are ordered. For example, the 2x2 array above violates the one Monge inequality, but swapping the columns makes it hold. A greedy transportation algorithm Assume supplies and demands are integral. If Monge holds, the following algorithm finds an optimal schedule: x_11 := mu = min(a1, b1) a1 := a1 - mu, b1 := b1 - mu (one of these will be 0) if a1 = 0, cross off row 1 if b1 = 0, cross off column 1 process remaining array recursively The base case for this is one row or one column. Then, the allocation is forced. So we are doing induction on m+n. Proof. At least one optimal schedule is integral. (This follows from the algorithm presented Wed 3/7/18, if you solve max-flow "correctly"). Any such solution with y_11 < mu can be made "more greedy". Indeed, there must be r,s for which the solution is y_11 ... >0 (by supply/demand constraints) ... ... >0 ... y_rs Let delta = min{ y_r1, y_s1 } (> 0). We can do a pivot, and add delta ... -delta ... ... -delta ... delta without violating constraints. The change in objective is delta(c_11 + c_rs - c_r1 - c_rs) >= 0, so the new solution is still optimal. If the y_ij are integral (we can assume this, see end of lecture Wednesday 3/7/18, and make sure the maximum flows are all integral), the new solution will be too. Therefore, with a finite number of such modifications, we reach the greedy value for x_11. We can now use induction since the new problem is strictly smaller. A Generalization Let (i,j) >> (r,s) mean that we consider x_ij before we consider x_rs. Then, if (i,j) >> (i,s) (row) >> (r,j) (column) >> (r,s) (diagonal) the greedy algorithm still finds an optimal solution. Another Application: Coloring Interval Graphs I_1, I_2, ..., I_n are closed intervals in R^1, sorted by left endpoint. E.g. I_1 I_3 |-------------------| |------------| I_2 |------------------------| I_4 |--------------| We want to "color" each interval so that overlapping intervals get different colors. (This is basically a scheduling problem. Overlapping intervals can't be scheduled on the same machine.) We can reduce this coloring problem to an easy bipartite matching problem. Consider any coloring. The intervals given the same color must look like this: |------| |-------------| |-------| ... So we are just matching the end of each interval to the beginning of the next. If there are m such pairings, # of colors = n - m. Conversely, if we have such a matching M, we can use it to make a coloring with n - |M| colors. So min # colors + max matching size = n. To treat this formally, let B be a bipartite graph on {1,...,n} x {1,...,n} with (i,j) in E exactly when i I in II (inheritance) (ii) I,J in II with |I|+1 = |J| ("exchange" axiom) ==> there is an e in J-I with I+e in II. [draw pictures] Subsets in II are called *independent*. We've seen two examples: E = edges in a graph II = acyclic E = columns of a matrix over some field II = linearly independent We can get geometric insight by considering an mxn matrix over R. Columns are just vectors in R^m (i) says that if a set of vectors point in "independent" directions, the same holds for any subset (ii) says that if the span of a set J has dimension 1 lower then the span of another set I, we can always pick a vector from I that points "outside" the directions available to J. [draw pictures] Defn. A maximal independent set (maximal in the sense of set theory) is called a *basis*. For the graph example, when G is connected, basis means "spanning tree." All spanning trees have the same number of vertices (n-1 if there are n vertices). This observation generalizes. Thm. All bases of a matroid have the same number of elements. (Their common size is called the *rank* of the matroid.) Proof (by contradiction): Let B,B' be two bases with |B| < |B'|. Remove elements from B', if necessary, until |B'| = |B|+1. By (i) B' is still independent. By (ii) we can find e in B' for which B+e is independent, so B wasn't maximal. Counting Matroids Intuitively, independent sets in a matroid are not too "large". Illustrative example. Let E = {1,...,n}. Let k satisfy 0 <= k <= n. II = the subsets with <= k elements. This matroid has (n choose k) bases. Its rank is k. Two extreme cases: if k=0, only the empty set is independent if k=n, every subset is independent Suppose the elements are given positive weights. Then the greedy algorithm finds a subset of maximum total weight. Example: Sort the elements by weight, say 10 7 5 4 2 1 Choose the top k (here k=3): 10 7 5 4 2 1 ------ If we chose any other k-subset, there would be a "hole" where one of 10, 7, 5 wasn't used. This could be traded for the smallest element of the subset, and the result would be no worse. Continue this until we have 10,7,5. Lecture 23 3/16/18 Friday Properties of GREEDY (Lawler 7.6) Dual matroids (Lawler 7.8) Review of matroid axioms, bases, matroid rank (see previous lecture) Construction of a basis To find a basis, let S = empty set For each e in E If S + e is independent, add e to S. (Note: For this to work, it was crucial that the empty set be independent, and that E be finite.) Finding a maximum weight basis Suppose each element e has a weight w(e) >= 0. GREEDY: Sort elements by weight (largest first) B = empty set While unexamined elements remain: e = next such element If B + e is independent, B := B + e Else reject e. Theorem: GREEDY finds a basis B of maximum possible weight. Proof: Let B = {x_1, x_2, ..., x_r} at the end of the algorithm Let C = {y_1, y_2, ..., y_r} be a basis of greater weight Both are listed so that weights don't decrease Let i be first index for which w(y_i) > w(x_i) (has to exist). Take I = {x1,...,x_{i-1}} J = {y1,...,y_{i-1},y_i} Both are independent, so for some k, there is a y_k in J-I that we could add to I. But w(y_k) >= w(y_i) > w(x_i) so GREEDY would have considered it before x_i, and accepted it. This contradiction proves the theorem. Here is an extension (exercise for you): A greedy basis is Gale-optimal. What this means: Let x_1,...,x_r be a basis found by GREEDY (weights non-increasing), and let z_1,...,z_s be any other independent set (ditto). Then w(x_i) >= w(z_i) for 1 <= i <= s. What if we want a minimum weight basis? Replace w(e) by W - w(e), where W >= sum_e w(e). Since all bases have the same size, a max-weight basis for the new weights is a min-weight basis for the original weights. GREEDY characterizes matroids, in a certain sense. Theorem: If M is a subset system satisfying (i) but not (ii), then GREEDY will fail for some choice of non-negative weights. An equivalent statement is: If M satisfies (i) and GREEDY always finds a maximum-weight independent set, then M satisfies (ii). Proof: Let (ii) fail for I,J. That is, I,J are independent, |I|+1 = |J|, but no element of J-I can be added to I and preserve independence. Let |I| = m. Choose weights as follows: Elements of I get w=1 Elements of J-I get w=1-epsilon All other elements get w=0. where 0 < epsilon < 1/(m+1). GREEDY picks all of I rejects all of J-I may pick other elements of weight 0 GREEDY's solution has total weight m But weight(J) >= (m+1)(1 - epsilon) = (m+1) - (m+1)epsilon > m+1 - 1 [by choice of epsilon] = m. So GREEDY failed. [Question: Where did we use (i)?] We assumed nonnegative weights. What happens if some are negative? Then, the maximum-weight basis found by GREEDY need not be the maximum-weight independent set. Example: Counting matroid (independent means has <= k elements) Weights are 1,1,...,1 (k-1 times), -1,-1,...,-1 (n-k elements) GREEDY gets a basis of total weight k-2 (one -1 must be used) but the max-wt independent subset has weight k-1. If we want a max-weight independent set, we can modify GREEDY so that is rejects all elements of negative weight. Matroid Duality This is similar to linear programming. Every matroid has a closely associated "dual" matroid. Let M = (E,II) be a matroid. Its *dual* is the matroid M* = (E, II*), where II* consists of the complements of bases of M, together with the subsets of these complements. Theorem: M* is also a matroid. (See Lawler p. 280 for proof.) This allows us to solve more problems. Example: The Mayor's Dilemma. Let G be a connected graph with positive edge weights. Among the largest sets of edges whose removal leaves G connected, choose one of minimum total cost. The idea is that G represents a road network, with w(e) equal to the cost of rebuilding e. The mayor wants to fix as many roads as possible, without disconnecting anyone, at minimal cost. For this problem: E = edges of G II = edge sets containing no cycle M = (E,II) is the *cycle matroid* for the graph J is independent in M* <==> there is a spanning tree T s.t T n J is empty <==> (V,J') is connected (J' is the complement of J.) M* is called G's cocycle matroid. The minimizing version of GREEDY (STINGY?) will solve this. Example: 1 *------* m = 5 | / | \ n = 7 2 | 5/ | \ 3 | / 4| \ | / | \ *------*-----* 6 7 Accept Reject 1 2 3 4 5 6 7 (Note: Contrary to intuition, accepted edges are taken out of service.) So the city can fix 3 (= m+n-1) roads, at total cost 1+3+4 = 8, Observe that we also found a minimum spanning tree (with weight 2+5+6+7 = 20). Effectively, GREEDY solves two problems at once: Find a max-wt basis in M Find a min-wt basis in M*. Lecture 24 3/19/18 Monday More Matroid Examples: Polygamous Bipartite Matching Partition Matroids Matching Matroids Review of matroid axioms (i) and (ii); See 3/14/18 lecture. Inheritance (i) is pretty intuitive but it is hard to get a good sense for (ii). One idea is that more (larger) is better (the larger set has some "capability" that could be added to the smallet set). Review of matroid examples: vector spaces, acyclic edge sets in a graph, "small" subsets (counting matroids). 1) Polygamous Bipartite Matching G = (V,E) E [ X x Y Think of X as a set of men and Y a set of women E gives all the compatible pairs Each woman can have at most one husband, but a man can have multiple wives. Example: A--- a doesn't have a perfect matching, B--- b but we could "marry" A to a, \ and B to b and c. \ \ C c A set of edges is independent if no Y vertex commits bigamy. That is, there is a forbidden subgraph x -- y / / x' The analog of the (maximizing) assignment problem can be solved by a greedy algorithm: Consider each column in turn, and choose a maximum element from each column. E.g. for n=3 a b c A 3 7* 3 B 9* 5 5* starred cells are chosen. C 8 1 3 This uses n(n-1) ~ n^2 comparisons of entries. On the other hand, every algorithm for this problem must look at all the entries: if one number was unexamined, it might be the largest one in its column. 2) Partition Matroids Divide E into disjoint blocks B1, B2, ..., Bm. Block i has a capacity ci (non-negative integer). I is independent when |I n Bi | <= ci, all i. (Exercise: Check that this satisfies the matroid axioms). This is a generalization of the last example, for which E consists of pairs (x,y) with x in X, y in Y We bin the pairs by their last coordinate. All capacities are 1. The greedy algorithm can be interpreted as throwing balls into bins: | | | | | |----| |----| | | |----| | | | * | | |----| | * | * | | * | <-- if the next ball goes into B4, --------------------- it is rejected. B1 B2 B3 B4 c = 3 2 3 1 For polygamous matching, this changes the order in which columns are processed. For the above example, we would consider the cells labeled by 9,8,7,5,5,... and find a b c A 3 7* 3 starred cells are accepted B 9* 5x 5* cells with x got rejected C 8x 1 3 we stopped as soon as each column has a match. bins | B | A | B | Note that the end result is the same, we just found it differently. 3) Matching Matroids G = (V,E) undirected graph (need not be bipartite) I [ V is independent when there is a matching M of G that "hits" each vertex in I Note that this is a vertex model. Inheritance (i) is clear, but the other axiom needs proof. See Lawler p. 271. (It relies on the idea of alternating paths.) Example: a --- b --- c has 4 independent sets empty, a, b, c. So it is the k=1 counting matroid on 3 elements. Here is a non-obvious application of this. Unit Length Job Scheduling with Deadlines and Penalties Jobs j=1,...,n Each takes unit time Deadlines dj (positive integers) Penalties pj (non-negative reals) There is only one processor, and pre-emption is not allowed (once started, a job has to run to completion). A set J of jobs is called *independent* if all jobs in J can be completed by their deadlines. This looks awful (there are |J|! orderings of the jobs), but there is actually an easy test: If the deadlines for the jobs in J are d1 <= d2 <= ... <= dm, then they can be scheduled without penalty iff j <= dj holds for all j. (Exercise for you. The idea is that if a schedule is penalty free, so is the one with all slack times removed.) Let's think of pj as the profit (reward) gained by completing job j on time. Notice that we are just matching jobs to times. For example, if the data is jobs A B C D E deadlines 3 1 4 1 5 penalties 2 4 1 5 7 we need to solve the (maximizing) assignment problem given by 1 2 3 4 5 times jobs A 2 2 2 0 0 B 4 0 0 0 0 C 1 1 1 1 0 D 5 0 0 0 0 E 7 7 7 7 7 We can cover all nonzero entries with column 1, and rows A, C, E. By Ko"nig's theorem, the maximum number of independent jobs is 4. The greedy algorithm for this problem is: Consider jobs in penalty order. For each job, accept if iff there is a free slot before its deadline, and put it in the last available such position. Schedule all tardy jobs at the end (in any order). For the example, we'd consider jobs E, D, B, A, C and get t = 0 1 2 3 4 5 |----|----|----|----|----| D A C E B is rejected and will go last We can compress this to t = 0 1 2 3 4 5 |----|----|----|----|----| D A C E B (total penalty 4) Lecture 25 3/21/18 Wednesday Matroid Intersection R.E. Bixby and W.H. Cunningham, Matroid Optimization and Algorithms, in Handbook of Combinatorics, ed. R.L. Graham et al., Elsevier, 1975. Review of matroids and their use in "greedy" optimization. Using Two Matroids Many problems have constraints that can usefully be thought of as independence in a matroid. Sometimes, more than one matroid is required. Example 1: G = (V,E) bipartite graph with E [ X x Y. M [ E is a matching when: No pair shares a left endpoint x (independence in M1) No pair shares a right endpoint y (independence in M2) It is easy to find maximum independent sets in each of M1, M2. The trick is to find a largest set that is simultaneously independent in both of them. Example 2: Directed spanning trees. Let G be a directed graph, in which a distinguished "root" vertex r has in-degree 0. We seek a spanning tree with root r, in which each root-to-leaf path is directed. (Edmonds has called this a "branchocracy.") E.g. if G is r / \ / \ v v a --> b <-- one possible tree is r --> a --> b Thinking of this tree as a set of edges, there are two requirements. M1: In the underlying undirected graph, the edge set is acyclic. (Cycle matroid.) M2: No two edges point to the same vertex. (Partition matroid, in which all "bins" have capacity 1.) Although this problem can be solved by breadth-first search, the weighted version cannot, and so it is useful to have this description. Abstractly: Given matroids M1 and M2 with common universe E, choose, from among the sets that are independent in both M1 and M2, one of largest size. Note that the empty set qualifies, so this problem does have a solution. Additional Matroid Concepts Rank. Suppose that (E,II) is a matroid. If S [ E, its *rank* (written r(S)) is the size of a largest independent set within S. (Exercise: Check, using the axioms (i) and (ii), that all such subsets have the same size.) This agrees with the rank we defined earlier. The rank of the matroid is just r(E). Circuits. In a matroid, a *circuit* is a minimal dependent set. These can have different sizes, e.g. the cycle matroid for * *-----* / \ | | / \ | | *-----*-----*-----* has circuits of size 3 and size 4. Circuits can be found by taking a dependent set and removing elements until it becomes independent. If J is independent, and J+x is not, there is a unique circuit within J+x. We call this C(J,x). The elements of C(J,x) are "equivalent" in the following sense: if y is in C(J,x), then J+x-y is independent. (Proof: C-y is independent, and |C-y| <= |J|. If they are not the same size, add "new" elements from J, not in C-y, until the resulting independent set J' has the same size as J. This J' must be J+x-y.) Independence oracle. In discussing abstract matroid algorithms, we assume we have a test to recognize independent sets. For the examples we've seen, such a test can be done in polynomial time. Duality Results M1, M2 are two matroids on E, with rank functions r1 and r2. If J ranges over sets that are independent in M1 and in M2, and S ranges over subsets of E, we have |J| <= r1(S) + r2(S') (Weak Duality) and max{ |J| } = min{ r1(S) + r2(S') } (Strong Duality) Here S' = E-S denotes the complement of S. See Bixby and Cunningham, p. 9 for a proof. Finding a Largest Common Independent Set The first poly-time algorithm for this problem was given by Lawler. Let M1, M2 be two matroids on E. We seek a largest J [ E that belongs to both II1 and II2. The algorithm finds J by a sequence of shortest-path computations, each of which either: Replaces J by a larger set or Constructs an S giving a "strong duality" proof that J is optimal. Initially J is empty. Given J in II1 n II2, we construct a digraph D as follows: Vertices: E u {s,t}. Edges: e --> t for e in J', J+e in II1 s --> e for e in J', J+e in II2 e --> f for e in J', f in J with f in C1(J,e) f --> e for e in J', f in J with f in C2(J,e) Then: If s, e_1, f_1, ..., e_k, f_k, e_{k+1}, t is a shortest s-t path in D, the larger set J - f's + e's is in II1 n II2. If no such path exists, the set S of reachable vertices from E satisfies |J| = r1(S) + r2(S') (and so J is optimal). This gives a polynomial time algorithm: If |E| = n, there are <= n^2 edges in D; also |J| can increase at most n times. This will not be proved (you can look at Bixby and Cunningham, p. 10), but examining some simple cases makes the first claim plausible. 1) D has s --> e --> t. The first edge tells us that e can be added in M2, and the second tells us the same for M2. Therefore it is OK to replace J by J+e. 2) We have s | | <-- e can be added in M2 v ...e... . | . . | . <-- but makes ckt C1(J,e) in M1 . v . ...f... . | . . | . <-- ckt C2(J,e') in M2 . v . ...e'.. | | <-- we can add e' in M1 v t with e != e'. The element f is common to both circuits. In M1, J+e-f and J+e' are independent |J+e-f| < |J+e'| So we can add a "new" element from J+e'. This can't be f, because J+e-f+f = J+e is dependent So it must be e. Therefore, J+e-f+e' is independent. Similar argument applies to M2. Here is a small example. We want a maximum matching in X: Y: A -- a \\ \\ B -- b M1 is the matroid for no bigamy in X, M2 similarly for Y Suppose we start with J = {Ab}, which is indicated with a doubled edge. Then D is s | Aa and Ab indep in M2 v e: A -- a | C1(J,e) is Aa, Ab v f: A == b | C2(J,e) is Ab, Bb v e': B -- b | Ab, Bb indep in M1 v t So we can replace Ab by {Aa, Bb}. This is optimal. Next step: The only element not in J is Ab. This can't be added in M1 (A becomes bigamous) or in M2 (b does). So D has no connection from s to anywhere. This makes S empty, S' = E. r1(S) = 0 (the rank of an empty set is always 0) r2(S') = 2 (use the greedy algorithm from 1) of 3/19/18) So |J| = r1(S) + r2(S') and we're done. Weighted Matroid Intersection Suppose each element e in E is given a weight w(e) (>= 0). Among the largest sets that are independent in both matroid, we seek one of maximum total weight. This is similar to the assignment problem. It can be solved in polyomial time, by repeatedly searching in a weighted version of the digraph D defined above. As the algorithm proceeds, the weights are updated. See Bixby and Cunnningham, pp. 15-16. In particular, this gives a polynomial time algorithm for finding a directed spanning tree of maximum total weight. Lecture 26 3/23/18 Friday Greedoids Korte and Lovasz, SIAM J. ALgebraic and Discrete Methods, 1984 Bjorner and Ziegler, Introduction to Greedoids, in: Matroid Applications, ed. Neil White, 19XX. Greedoids were introduced by Korte and Lovasz in 1981. As motivation, consider two questions: 1) What happens if we relax the requirements for being a matroid? 2) Are there interesting greedy algorithms that are not matroid-based? The matroid axioms can be stated as follows. Let E be a finite set and II a set sytem (collection of subsets). Then (E,II) is a matroid if: (o) The empty set is a member of II (i) Full inheritance: X [ Y in II ==> X in II (ii) Exchange: X,Y in II with |X|>|Y| ==> there is x in X-Y with X+x in II. (Previously, we required (ii) only when |X|=|Y|+1. In the presence of (i), however, the stronger version of (ii) holds.) How might we relax these requirements? Defn. (E,FF) is a *greedoid* if axioms (o) and (ii) hold. The sets in FF are called *feasible*. Thm. If (E,FF) is a greedoid, then (iii) Deconstruction: If X in FF is nonempty, then there is an x in X such that X-x is in FF. The idea here is that not every subset needs to be feasible, but there is always one element whose removal preserves feasibility. To prove the theorem, build up. Start with the empty set (a proper subset of X, and use (ii) to successively add elements. Stop when you have a set that is one element shy of being X. Note: K+L remark that we could use (o), (iii), and (ii) for |X| = |Y|+1 as axioms. This is perhaps more satisfying because the axioms refer to incremental modification of feasible sets. There are many examples of greedoids. Here are a few. 1) Any matroid. 2) Connected graph with distinguished root r. E = {edges}, FF is the empty set plus the edge sets of trees that touch r. This is called the undirected branching greedoid. 3) The directed version of 2), in which edges point away from r. 4) Bipartite graph with E [ V x U, where we have an ordering u1,u2,...,u_n of U. The feasible sets are the A [ V that can be matched to u1,...,u|A| in U. Edmonds calls this the medieval marriage greedoid. In The Taming of the Shrew, Act 1, Scene 1, Baptista says: [I resolve] not to bestow my youngest daughter before I have a husband for the elder Before discussing optimization, we need a few more concepts. a) If (E,FF) is a subset system, its *hereditary closure* is (E,FF'), where FF' includes every subset of a set in FF. b) The *rank* of a set is defined as it is for matroids. r(A) is the size of a largest feasible subset of A. c) A *basis* is a maximal feasible subset. All bases have the same size. The same proof we used for matroids works here. d) A subset A of E is *closed* if {x in E : r(A+x) = r(x)} = A. Note that the set above always contains A. So the idea is that anything we try to add from outside A will increase the rank. For example, suppose A is a finite set of vectors in R^m. Then A is closed (in the linear-algebra matroid) if all vectors in E-A are outside the span of A. (Note that this is different from being topologically closed. Indeed E has the discrete topology, so all subsets of A are closed, in the usual sense.) Now we give each element e in E a weight w(e). GREEDY is start with S = empty set while S is not a basis: choose x to maximize w(x) for x in E-S and S+x in FF. (if there is no such x we can stop.) add x to S. To use GREEDY, we need a way to test if a set is feasible. This can be done easily in the cases noted above. When will GREEDY work? Here is one result. Theorem (in K+L 1984): Let (E,FF) be a greedoid with hereditary closure (E,FF'). GREEDY finds a max-weight basis for every weight function iff: 1) (E,FF) is a matroid, and 2) A closed set in (E,FF) remains so in (E,FF'). Let's apply this to an undirected branching greedoid of G=(V,E). We assume G is connected and denote the root by r. Ad 1): Any subset of E that is acyclic (in the usual sense) can be extended to a spanning tree. So (E,FF') is just the cycle matroid of G. Ad 2): The nonempty closed subsets in (E,FF) are just the trees T connecting to r (that is, their edge sets), augmented with any other edges both of whose ends are in T. Such a set is closed in (E,FF'). In this case, GREEDY is Prim's algorithm (1957): R = {r} (set of vertices) S = empty (set of edges) while S is not a basis (i.e. has < |V|-1 edges) choose a max-wt edge e = {v,w} with v in R, w not in R. add w to R add e to S (Historical note: Prim actually rediscovered the algorithm published in Czech by Jarnik in 1930.) In their paper, K+L also give a sufficient condition for GREEDY to work on a given weight function. Finally, we note that there is a "greedoid" interpretation for Dijkstra's single-source shortest path algorithm. For this problem, the weight function involves a sum over paths, so we use a version of GREEDY that considers entire sets, not just individual elements. Lecture 27 4/2/18 Monday Randomization (this is a new topic) Long used for entertainment, divination, ... 1940s: Use by Erdos to prove combinatorial theorems 1970s: Randomized algorithms popularized by Rabin Now a standard part of discrete math and algorithm design Today's applications Lower bound for Ramsey numbers (from N. Alon and J. Spencer, The Probabilistic Method) Probabilistic identity testing (J. Schwartz, J. ACM 1980) Ramsey Theory Started by Frank Ramsey (UK) in the 1930s. Basic theme is that in any sufficiently large object, structure is unavoidable. Ramsey's Theorem: Fix k>1. There is an n such that any graph with >=n vertices has either a k-clique or a k-anticlique (= independent set with k vertices). We will not prove this here. (A proof can be found in M. Hall, Combinatorial Theory, p. 57.) Note that the forced appearance of a clique or an anticlique "switches on" and then stays on forever. So we define R(k) to be the minimum n for which the assertion of the theorem is true. Theorem. R(k) >= 2^{k/2 - 1} if k >= 2. Proof: Start with n vertices. For each possible edge, flip a fair coin to determine if that edge is in the graph or not. These coin flips are independent. Pr[ there exists a k-subset S that is a clique or anticlique ] <= sum_{S of size k} Pr[S is a clique or an anticlique] (union bound) = (n choose k) * 2 (1/2)^(n choose k) (note factor 2) <= (n^k / 2) * 2 (1/2)^{n choose k} = n^k / 2^(k choose 2). If n<2^{k/2 - 1}, n^k < 2^{k(k-2)/2} < 2^{k choose 2} and this makes the probability above < 1. Therefore, there some point in the sample space (= way to put edges in the graph) that avoids both k-cliques and k-anticliques. The assertion follows. How good is this bound? Erdos and Sekeres (1935) showed that R(k) <= (2k - 2 choose k-1) = O(4^k / sqrt k), so the sequence R(k) grows exponentially. Very few values of R(k) are explicitly known As of 2007, the proved values were k 0 1 2 3 4 R(k) 0 1 2 6 18 (see sequence A120414 in OEIS). Note that this could have been done by counting, but the theory of probability gives us tools and intuition. Polynomial Identity Testing Suppose we find an efficient way to evaluate a complicated expression, such as | 1 x1 x1^2 ... x1^{n-1} | | 1 x2 x2^2 ... x2^{n-1} | | 1 x3 x3^3 ... x3^{n-1} | = product{i1. Then we can order the variables so that f = x1^d1 f1(x2,...,xn) + [terms of degree < d1 in x1] So |Z(f) n S^n| <= d1 |S|^(n-1) + |S| * (d - d1)|S|^(n-2) ^ ^ | | if f1 != 0, there by induction, the number of are <= d1 ways to pick ways to make f1 vanish is a1 bounded by this. For each, there are |S| ways to pick a1. = (d1 + d - d1)|S|^(n-1) = d |S|^(n-1). Lecture 28 4/4/18 Wednesday Continuing with randomized algorithms Finite fields Computing square roots mod p Today's problem nicely illustrates the connection between randomized algorithms and algebraic structure. We have already seen an example of this: the Schwartz algorithm for identity testing relies on the idea that a degree d polynomial f(X) in R[X] can have at most d zeroes. Finite Fields Defn. A *field* is a system (F,+,*,0,1) that obeys the laws of high school algebra. More precisely: + and * are commutative and associative + and * obey the usual distributive law 0 is the identity for + 1 is the identity for * for each x there is a y making x+y=0 for each x != 0 there is a y with x*y = 1. We also require that F have at least two elements. (As an exercise, show that 0 != 1.) Familiar examples include Q (rational numbers), R (real numbers), C (complex numbers). Let Z_n = {0,1,2,...,n-1} with addition and multiplication taken mod n. (This means that we do the operation, and if it overflows, we take the remainder after division by n.) Fact. Z_n is a field iff n is prime. Proof sketch: If n = n1 n2 with 1 < n1,n2 < n, we have n1*n2 == 0 (mod n) but n1 != 0. If n is prime, let 1 <= a <= n-1. Then we can use Euclid's algorithm to solve ax + ny = 1 in integers, because a is relatively prime to n. (This is proved in every book on elementary number theory.) Then ax == 1 (mod n). Field Extensions. Suppose that F is a field, and f(X) = X^2 + bX + c is irreducible over F. This means that you can't express it as a product of two degree 1 polys with coefficients in F. Then, we can make a bigger field F' as follows: The elements of F' are polys of degree <= 1, say uX+v w/ u,v in F Addition: just add corresponding coefficients Multiplication: multiply the polys as usual, then subtract the multiple of f that cancels the degree 2 term. The reason that F' is a field is similar to the case of integers mod p. (In a more thorough treatment, this would be proved.) Example: F = R, f = X^2 + 1. (This has no real zeroes.) (X+2) + (3X+4) = 4X + 6 (X+2) * (3X+4) = 3X^2 + 10X + 8 -3(X^2 + 1) --> 10x + 5 The new field F' is just the field of complex numbers, since X = sqrt(-1) in F'. Can we continue and get even bigger fields F'', F''', ... ? No. According to the fundamental theorem of algebra, every polynomial with complex coefficients has a complex zero. So the only irreducibles are linear. If F = Z_p, |F'| = p^2, because we can choose the coefficients u and v in p ways, independently. Note: The same process can be done using irreducibles of degree d. We only need the case d=2 today, though. Computing Square Roots mod p From now on, p denotes an odd prime. How might we solve the quadratic equation x^2 = a in Z_p? To get started, consider what squaring does for small p. E.g. p=11 x = 1* 2 3* 4* 5* 6 7 8 9* 10 x^2= 1 4 9 5 3 3 5 9 4 1 The squares (marked with a *) seem to be distributed irregularly. We could try brute force: try all x < p/2. This is exponential in the length of p. Cipolla's algorithm (1903) To solve x^2 == a (mod p) Choose t from Zp Form f = Z^2 - tZ + a (we hope it is irreducible) Compute x = Z^{(p+1)/2} mod f Return x as a square root of a Nowadays we imagine that t is chosen at random. Note that x can be computed efficiently using repeated squaring. So the number of Z_p operations required is O(log p). Why does it work? Suppose f is irreducible. Then we are working in the extension field F' (which has p^2 elements). In F', Z is a zero of f (plug it in). Z^p is also one. This takes a little argument. f(Z^p) = Z^(2p) - t Z^p + a = Z^(2p) - t^p Z^p + a^p = ( Z^2 - t Z + a )^p = 0. The calculation above relies on two facts. In Z_p, a^p = a (Fermat's little theorem) In F', (r+s)^p = r^p + s^p (Freshman's binomial theorem) Both are consequences of the following reault: If 1 <= i <= p-1, then p divides (p choose i). Also, Z != Z^p. (The equation x^p - x = 0 has only p roots, and they are all in F.) All of this implies that (Z^{(p+1)/2})2 = Z^p * Z = a since the constant term of an equation is the product of its roots. If a is a nonzero element of (Z_p)^2, then Z^{(p+1)/2} is in Z_p, since there are only two solutions to x^2 = a, and both are in Z_p. How likely is f to be irreducible? Thm. If t is chosen from the uniform distribution on Z_p, Pr[f is irreducible] = (p-1)/2p) ~ 1/2. Proof. We count the reducible monic polynomials (Z - alpha)(Z - beta) with alpha, beta in Z_p and alpha*beta = a. For two choices of alpha, alpha^2 = a, and we get the polynomials (Z - alpha)^2. There are p-3 ways to pick the remaining alpha, and then beta must equal a*alpha^{-1}. This gives (p-3)/2 polynomials, since alpha,beta and beta,alpha give the same polynomial So # of reducible polys is (p-3)/2 + 2 = (p+1)/2. # of irreducibles is p - (p+1)/2 = (p-1)/2. Lecture 29 4/6/18 Friday Detecting Perfect Matchings Mulmuley, Vazirani, Vazirani, Combinatorica, 1987 G = (V,E) undirected graph When G is bipartite, we can use network flow techniques to find a maximum matching in G. What can we say in general? Let PM = {G : G has a perfect matching} PM belongs to NP (just exhibit a matching) PM' = {G : G has no perfect matching} also belongs to NP (!) This is a consequence of a theorem proved by Tutte (1947). For X [ V let q(X) = # of connected components in G-V having an odd number of vertices. Example: suppose G is a triangle. Then q(triangle) = 0 q(*--*) = 1 q(point) = 0 q(empty set) = 1 Tutte's theorem says that G has a perfect matching iff for every X [ V, q(X) <= |X|. Note that this is similar to Hall's marriage theorem (proved 3/2/18). So a certificate for PM' is a set X violating the inequality. In a famous paper (Paths, Trees, and Flowers, 1965), Edmonds showed that PM belongs to P. We'll present a randomized algorithm for testing if G has a perfect matching. Review of Schwartz's theorem (see lecture 4/4/18). We proved this for real coefficients, but the result is actually true for polynomials over any field. The Tutte Matrix Let V = {1,...,n}. For each pair i Let 0, if {i,j} not in E t_ij = x_ij, if {i,j} is in E and ij The Tutte polynomial is det(t_ij) Example: Suppose G is 1 ---- 2 | | | | 3 ---- 4 Write x for x_12, y for x_13, z for z_24, w for x_34 The matrix is [ 0 x y 0] [-x 0 0 z] [-y 0 0 w] [ 0 -z -w 0] Note that this is the adjacency matrix of G but with the 1's replaced by +- variables. To compute the determinant | 0 x y 0| |-x 0 0 z| |-y 0 0 w| | 0 -z -w 0| | (shift rows up) v |-x 0 0 z| = - |-y 0 0 w| | 0 -z -w 0| | 0 x y 0| | (shift columns right) v | z -x 0 0| = | w -y 0 0| | 0 0 -z -w| | 0 0 x y| = |z -x| * |z -w| = (xw - yz)^2. |w -y| |x y| This always happens. Any skew-symmetric determinant is the square of a quantity called the *Pfaffian*. For a proof of this, see N. Jacobson, Basic Algebra I, section 6.2. Note that if n is even, det(t_ij) = P^2, and P has degree n/2. *--* | | So *--* belongs to PM, since its Tutte poly is != 0. On the other hand, a triangle has Tutte polynomial | 0 x y | |-x 0 z | = 0. |-y -z 0 | It can be shown that G is in PM <==> G's Tutte polynomial is != 0. Lovasz Randomized Perfect Matching Test (FCT 1979) Let |V| = n, n even Choose a prime p >= n (this is >= 2 * degree(P)). Let S = {0,1,...,n-1} [ Z_p Choose a_ij i.i.d. from the uniform distribution on S Evaluate f(a_ij's) mod p, where f is G's Tutte polynomial If this is 0, say G has no perfect matching If this is !=0, say G has a perfect matching This has one-sided error: If G is in PM, Pr[error] <= 1/2 If G is not in PM, Pr[error] = 0. We can evaluate f using O(n^3) arithmetic operations in Z_p. How large is p? There is always a prime between n and 2n. This is Bertrand's postulate (proved by Chebyshev). We can use the Sieve of Eratosthenes to find p. The idea is to cross off all larger multiples of each prime in turn. So when n=4, we'd have 2 3 4 5 6 7 8 (so 2 is prime) 2 3 5 7 (so 3 is prime) 2 3 5 7 (5 is prime, we stop since 5 >= n) Although better bounds are possible, the work for this can be estimated as O(n^2), since we make <= n passes on an array of length <= 2n. Lecture 30 4/9/18 Monday Review of Lovasz randomized perfect matching tester (see last lecture) Suppose we know (say, by running the Lovasz algorithm) that G has a perfect matching. How might we construct one? Self Reduction This is a very general technique, applicable to a lot of problems. Assume G has a perfect matching Choose a vertex v (it must be matched) For each w adjacent to v: If G' = G-v-w has a perfect matching, (*) recursively construct a perfect matching M' for G' return M = {v,w} + M' Else go on to the next w Since the matching test is randomized, this might go through all w without making a recursive call on a smaller graph. In this case, we can just try again. The probability that we are wrong on all edges linked to v is <= 1/2^{# of partners for v} <= 1/2 (since v has at least 1 partner) So E[# of time through the loop] <= 2. By linearity of expectation, the expected number of times we use the Lovasz graph tester is bounded by 2n + 2(n-2) + 2(n-4) + ... ~ 2 n^2 / 4 = n^2 / 2. So the *expected* run time to construct a matching is polynomial. Under the promise that G has a PM, this is a Las Vegas algorithm: always right, fast in expectation. Algorithms like the matching tester are Monte Carlo: always fast and probably right. Directly Constructing a Matching How might we get more information out of the Tutte polynomial? One idea is to use arithmetic. (Fields like Z_p only have trivial arithmetic: if x,y are nonzero, x divides y and vice versa.) Let's take our evaluation points from Z, and evaluate T(a_ij) as an integer. We will use powers of 2 as our evaluation points. Consider the example from last time: x = 2^3 * ------- * | | y = 2^1 | | z = 2^4 | | * ------- * w = 2^1 For the indicated evaluation points, T = (xw - yz)^2 = (2^4 - 2^5)^2 = 2^8. The term with the smallest exponent of 2 comes from the matching of smallest multiplicative weight. (The multiplicative weight is the product of all the weights of edges participating in the matching.) This matching is *---* *---* When the exponents of the 2-powers are spread out enough, our intuition is that only one matching has the minimum exponent. MVV recommend taking exponents from 1..2m, when G has m edges. (We'll justify this choice later). Assume this works, and that only one matching has the minimum exponent, say w. Then, if w_ij is the exponent for the edge {i,j}, det(T(2^{w_ij}) = 2^2w * [odd number] = 4^2 * [odd number]. By Cramer's rule, there is an integer matrix B for ahich B T(2^{w_ij}) = 4^w * E, where E = diag(epsilon1,...,epsilonn) is a diagonal matrix with odd entries. For the example, w=4 and we have [ 0 1 -8 0 ] [ 0 8 2 0 ] [ 24 ] [ 0 1 -8 0 ] * [-8 0 0 16 ] = [ 24 ] [-3 0 0 -3 ] [-2 0 0 2 ] [ 24 ] [ 0 1 4 0 ] [ 0 16 -2 0 ] [ 24 ] B * 2^{-5} T(2^{w_ij}) 2^3 * 3. MVV prove that B_ij * 2^{w_ij} {i,j} is in the min-wt matching <=> ----------------- is odd. 4 The expression on the right has entries from a Hadamard product: A (*) B = C, where C_ij = A_ij * B_ij. For the example: [ 0 8 -16 0 ] B * 2^{-5} (*) T = [ 24 0 0 -48 ] [-48 0 0 24 ] [ 0 16 -8 0 ] so B (*) T [ 0 1* -2 0 ] ------- = [ 3* 0 0 6 ] 2^8 [-6 0 0 3*] [ 0 -2 -1* 0 ] The odd entries are marked with a *. They are at 1,2, 2,1, 3,4, and 4,3. So M is {1,2}, {2,3}. Lecture 31 4/11/18 Wednesday The Isolation Lemma and Applications N. Ta-Shma, manuscript 2012. Lecture 32 4/13/18 Friday Introduction to randomized rounding Raghavan and Thompson, Combinatorica, 1987. This can be useful for getting approximate solutions to NP-hard optimization problems. Example 1: Vertex cover G = (V,E) undirected graph Want to choose a smallest set of vertices that hits all edges (imagine putting guards in an art gallery) Suppose G is a triangle. min is 2. As a 0-1 integer program: min x1 + x2 + x3 s.t. x1 + x2 >= 1 x1 + x3 >= 1 x2 + x3 >= 1 xi in {0,1}. Relax this to an LP (0 <= xi <= 1). This has the fractional solution x1 = x2 = x3 = 1/2. (Optimal, by considering the dual LP.) Let's interpret the xi's as probabilities, and include vertex i if its coin comes up "heads". The number of vertices chosen has a binomial distribution. So Pr[0 vertices] = 1/8 (infeasible) Pr[1 vertex] = 3/8 (ditto) Pr[2 vertices] = 3/8 the optimal choice Pr[3 vertices] = 1/8 feasible, not optimal. Pr[feasible] = 1/2. E[# vertices | feasible] = 2*3/4 + 3*1/4 = 9/4 = 2.25. This is about 13% worse than optimal. More generally, consider G = K_n (complete graph w/ n vertices). The optimal fractional solution is still xi = 1/2, for all i. Then Pr[feasible] = (n+1)/2^n (small for large n) E[# vertices | feasible] = n^2 / (n+1) E[# vertices | feasible] n^2 ------------------------ = ------------ ~ 1 + 1/n^2 . This is pretty good, provided we pick a feasible solution. However, the chance of feasibility is small. optimum 0-1 choice (n+1)(n-1) Example 2: Wire Routing G = (V,E) undirected graph (usually a grid) Given subsets S1, S2, ... S_k [ V For each i, there is a collection of trees that can be used to connect Si. Goal: Choose one tree for each S_i so as to minimize the maximum traffic over the edges. Note: This is a so-called bottleneck optimization problem. Our metric is the min, over edges e, of the traffic on e Example: 3 x 1 grid A *----*----*----* b S1 = {A,a} | | | | S2 = {B,b} B *----*----*----* a so we're connecting opposite corners Available trees: T11 T12 *----* *----*----* for S1 | | *----*----* *----* T21 T22 for S2 *----*----* *----* | | *----* *----*----* Any feasible choice has 2 links on some edge. (Use symmetry. Choose T11, then consider adding T21 or T22.) As a 0-1 integer program min W (width) s.t. x_ij in {0.1} x_ij = 1 means we use Tij sum_j x_ij >= 1, all i each set is connected by some tree sum_{ ij s.t. e in Tij } xij <= W (edge traffic <= W) LP relaxation: same but with 0 <= x_ij <= 1 sum_j xij = 1 (no reason to over pack) For the example the optimal fractional choice is xij = 1/2 (all ij) The width is 1, which is optimal. Randomized rounding recipe Solve the LP. For each i, we interpret the xij's as probabilities pij that sum to 1 Each set chooses its tree at random, but with (possibly) non-uniform probabilities Note that the choices for each set are independent, and that feasibility is guaranteed. E[traffic thru e] = sum_{Tij has e} p_ij <= W_LP (an LP constraint) We want the traffic thru edges to be *uniformly* small. This is made probable by exploiting independence. A concentration result Lemma: Let Z = Z1 + ... + Zt be a sum of independent 0-1 random vbles mu = sum_i E[Zi] (these means might be different) beta >= 0 Then Pr[ Z >= (1+beta)mu ] <= exp( - beta^2 mu / 3 ). This is a version of Azuma's inequality (sometimes associated with Bernstein, Chernoff, Hoeffding, ...). See Angluin and Valiant, JCSS, 1979. Think of Zi as saying that the tree for Si runs through e. So we are expressing local traffic as a sum of independent rv's. The quantity (1 + beta)mu is called a "large deviation" (despite how it looks!). When the expectations are all 1/2, the lemma says that the probability of observing a sum this far out decays exponentially to 0. This is consistent with the DeMoivre-Laplace limit theorem, which implies that with high probability, the sum is mu + O(sqrt mu). Theorem (RT). Let |E| = m. Let 0 < epsilon < 1 be such that 3 log(m/epsilon) <= W_LP (opt fractional width) Then with probability >= 1-epsilon, the width of the randomized rounding solution is <= OPT + sqrt( 3 OPT log(m/epsilon) ). Here (as is customary) OPT denotes the cost of the best integral solution. Proof. W_LP <= OPT, so it will be enough to show that w/ the indicated probability, width < W_LP + sqrt( 3 W_LP log(m/epsilon) ). (this constraint is more stringent). Consider the edge e, and let mu = E[traffic on e] Pr[e's traffic >= W_LP + sqrt( 3 W_LP log(m/epsilon) ) ] <= Pr[e's traffic >= mu + sqrt( 3 mu log(m/epsilon) ) ] (since mu <= W_LP from edge constraint) Rewrite this event as e's traffic >= mu + mu * sqrt( 3 log(m/epsilon) / mu ) <------ call this beta -----> By the lemma its probability is no more than exp( -3 log(m/epsilon) / mu * mu / 3 ) = exp( - log(m/epsilon) ) = epsilon / m One of the axioms of probability says that the probability of a union of events is no more than the sum of their individual probabilities. Applying this, Pr[ some edge has traffic >= ... ] <= epsilon. Lecture 33 4/16/18 Monday Randomized rounding for MAX-SAT Goemans and Williamson, SIAM J. Disc. Math., 1994 An Optimization Version of Boolean Satisfiability Consider formulas in conjunctive form, e.g. phi = (x v y' v z) ^ (x' v z') This is an AND of clauses. Each clause is an OR of literals (variables or their negations). Often called CNF (conjunctive normal form) even though it isn't unique. For example, I could replace the last clause by (x' v y v z') ^ (x' v y' v z'). Want to determine the maximum number of clauses that can be simultaneously satisfied. This is an NP-hard optimization problem, by the Cook-Levin theorem. (phi is satisfiable iff the maximum is the number of clauses.) For simplicity we'll assume there are no repeated clauses, and no repeated variables in a clause. Also write 1 for true and 0 for false. Random Assignment This is a simple strategy. Flip a coin for each variable and then set it to 1 if the coin was "heads" and to 0 otherwise. Let C be a clause with k literals. We can assume that C is x1 v x2 v ... v xk. Unless all xi are 0, the clause C is satisfied. So Pr[C is satisfied] = 1 - 2^{-k}. If there are m clauses, each with >= 2 literals, the expected number of satisfied clauses is >= (3/4) m. Using Random Rounding Integer program for MAX-SAT: Given clauses C1, C2, ..., Cm using variables x1, ..., xn. Let Cj+ be the variables appearing positively in Cj Let Cj- be the variables appearing negatively in Cj E.g. for C1 = x v y' v z C2 = x' v y' C1+ is x,z C2+ is empty C1- is z C2+ is x,y (Note these sets give another way to write the formula.) The IP is max sum_{j=1}^m zj s.t. sum_{xi in Cj+} yi + sum_{xi in Cj-} (1-yi) >= zj, all j yi, zj in {0,1} yi indicates whether xi is set to 1 or not zj is the indicator for satisfaction of the j-th clause The LP relaxation is same, but with 0 <= yi, zj <= 1. To get an assignment we solve the LP, obtaining possibly fractional values yi^, zj^. Then, set xi to 1 with probability yi^. (Note that feasibility is guaranteed.) The Probability of Satisfaction by Random Rounding Let beta_k = 1 - (1 - 1/k)^k. Recall zj^ is gotten by solving the LP Claim. If Cj has k literals, Pr[Cj is satisfied] >= beta_k zj^. Proof. We can assume Cj is x1 v ... v xk. As before, Pr[Cj satisfied] = 1 - Pr[all xi are 0] = 1 - prod_{i=1}^k (1 - yi^). Recall if alpha_1,...,alpha_k >= 0, sum alpha_i <= const, the product alpha_1 * ... * alpha_k is largest when all alpha_i are equal. (This is a consequence of the arithmetic-geometric mean inequality, for which see 1/23/18 lecture.) From the LP, sum_i yi^ >= zj which is equivalent to sum_i (1 - yi^) <= k - zj Therefore, prod_i (1 - yi^) <= prod_i (1 - zj^ / k) [the average of 1-yi^ is <= 1 - zj^ / k.] So Pr[Cj satisfied] >= 1 - (1 - zj^ / k)^k >= (1 - (1 - 1/k)^k) zj^ [2nd >= holds if 0 <= zj^ <= 1, prove this as an exercise.] = beta_k zj^. A Combined Strategy Run both procedures. Use the assignment that satisfies the largest number of clauses. Suppose random assignment satisfies n1 clauses random rounding satisfies n2 clauses Theorem. E[ max {n1, n2} ] >= 3/4 sum_j zj^ >= 3/4 * OPT of MAX-SAT Proof. The second inequality holds because we relaxed a maximizing problem to get the LP. E[n1] = sum_k sum_{Cj w/ k literals} (1 - 2^{-k}) >= sum_k sum_{Cj w/ k literals} (1 - 2^{-k}) zj^ . E[n2] >= sum_k sum_{Cj w/ k literals} beta_k zj^. Since max >= average, E[max{n1,n2}] is lower bounded by sum_k sum_{Cj w/ k literals} [(beta_k + 1-2{-k})/2] zj^. The expression in brackets is >= 3/4 for k=1,2,... (exercise for you), so E[max{n1,n2}] >= (3/4) sum_j zj^ >= (3/4)*OPT. Lecture 34 4/18/18 Wednesday New topic: Applications of nonlinear optimization Today: Introduction to convexity Papadimitriou and Steiglitz 1.5, 1.6 Fleming, Functions of Several Variables, 1-4, 1-5, 2-4. Why? Convexity is the natural language for discussing optimization problems that go beyond linear programming. In many cases, convex problems are "easy" to solve whereas non-convex problems may not be. Convex sets Defn. Let C be a subset of R^n. C is *convex* if, for every x,y in C, the line segment joining x to y is within C. (The line segment joining x and y is given parametrically by z = lambda x + (1 - lambda)y, 0 <= lambda <= 1.) In R^2, boxes and discs are convex. kidney-shaped sets aren't. [draw the pictures] In R^1, convex sets are the intervals (possibly unbounded). Every convex set is connected (meaning that any two points in the set can be linked by a path). Here the line segment between the points can serve as the path. Note: The defn of convexity involves universal quantifiers: for every x,y in C, for every lambda in [0,1] ... so the intersection of any family of convex sets is convex. Convex functions Defn. Let f : C (convex) --> R. f is *convex* if for each x,y in C and each lambda, 0 <= lambda <= 1, f(lambda x + (1-lambda)y) <= lambda f(x) + (1-lambda) f(y). The idea of this is that f(average) <= average(f). This has a probabilistic interpretation. Define X (a random variable) by X = x w.p. p = y w.p. q = 1-p. Then f(E[X]) <= E[ f(X) ]. This is an example of Jensen's inequality. On an open convex set, a convex function is continuous. (For proof, see Fleming, p. 26.) But it need not be smooth: consider f(x) = |x| on R. Warning: Continuity can fail at boundary points. Consider, for example, f(x) = 0, 0 <= x < 1 = 1, x = 1. If f is convex, the *sublevel set* {x : f(x) <= alpha} is convex. [draw the picture.] Typically we present a convex set as an intersection of sublevel sets: C = intersection_{i in I} {x : f_i(x) <= alpha_i }. Example: polytope, as an intersection of finitely many half planes. E.g. in R^2, x >= 0, y >= 0, x+y <= 2, same as -x <= 0, -y, <= 0, x+y <= 2. Convex optimization (generalizes LP): Given C,f (both convex) find x^ minimizing f(x) on C. Warning: If f is convex, -f need not be. So we can't interchange maximizing and minimizing problems as freely as we did for linear programming. Warning++: Even if C is bounded, the min need not exist. Consider min x+1 s.t. 0 < x < 1. If C is compact (closed and bounded) and f is continuous, then the min must be attained. This is true even if f isn't convex. By adding an extra dimension, we can make f(x) linear: min f(x) s.t. x in C <==> min t s.t. x in C, f(x) <= t. (taking the min over x,t) To understand why this is true, consider a picture: \ . . . . . / \. . . . ./ \ . . . / \. . ./ \ . / \./ * The set of (x,y) with f(x) <= y is called the *epigraph* of f. Each "fiber" of the epigraph is a closed half-line. We are looking for the *lowest* boundary point of these half-lines. Big theorem: Local minima are global minima Let x* be a local min for f : C --> R (both f and C convex). Then x* minimizes f over all of C. Proof: We will reduce to dimension 1 (this trick often works). Suppose (for contradiction) that x* isn't a global min. That is, there is a better x^ somewhere in C. Since C is convex, the line segment from x* to x is within C. We use lambda in [0,1] as the local coordinate for that line segment, with lambda=0 indicating x* and lambda=1 indicating x^. Let g(lambda) = (1-lambda)f(x*) + lamba f(x^). (Draw the picture -- the graph of g is a straight line lying above the graph of f, and g' = f(x^) - f(x*) < 0.) If 0 < lambda < 1, f((1-lambda) x* + lambda x^) <= (1-lambda)f(x*) + lambda f(x^) = g(lambda) < g(0) (g decreases!) = f(x*). Since lambda can be arbitrarily small, x* wasn't a local min. This is a big deal because it means that we only need to investigate f locally to verify that we have an optimum. (Of course, we still have to prove global convexity.) Some examples to ponder. 1) C = (a,b), f smooth on (a,b), f'' > 0. Any critical point (place where f' = 0) minimizes f on the interval. 2) C = [-1, 1]. f given by 0, -1 <= x <= 0 f(x) = exp(-1/x), 0 < x <= 1 (This is a famous example of a smooth function that is not equal to the sum of its Maclaurin series around x=0.) There is an interval of critical points: [-1,0]. Each is a global min. 3) Quadratic forms in R^2 f = a x^2 + b xy + c y^2. For simplicity assume a != 0. partial f / partial x = 2ax + by partial f / partial y = bx + 2cy These both vanish at the origin, which is always a critical point. Hessian (matrix of second partials) is ( 2a b ) ( b 2c ) A smooth function f is convex iff its Hessian matrix is positive semidefinite (means that z^T H z >= 0 for all z). So f is convex iff f(x,y) >= 0 for all x,y. When does this happen? Complete the square: af = (ax + b/2 y)^2 + (ac - b^2 / 4) y^2. From this we can see that f(x,y) >= 0 whenever: a > 0 and 4ac - b^2 >= 0. In this case, the local min at x=y=0 is also a global min. What if a = 0? The general result, which you can prove with similar reasoning, is: f is convex iff a, c, 4ac - b^2 >= 0. Lecture 35 4/20/18 Friday Semidefinite programming (SDP) Lovasz, Semidefinite Programs and Combinatorial Optimization, in Recent Advances in Algorithms and Combinatorics, ed. B. A. Reed and C. L. Sales, Springer-Verlag 2003, pp. 137-194. This will be on the course web site. Convex optimization (min f(x) s.t. x in C, both f and C convex) has nice theoretical properties. For example, any local minimum of f is also a global min. However, it includes all kinds of weird non-intuitive possibilities, such as 1 if x = 0 f(x) = x if 0 < x <= 1. Here, [0,1] is compact (closed and bounded) but the min doesn't exist. Like convex optimization, SDP is a generalization of linear programming, but it is better behaved. Defn. Let A be an nxn matrix with positive entries. A is *positive semidefinite* if x^T A x >= 0 for all x in R^n. S = {nxn positive semidefinite matrices} [ R^{n^2}. S is convex: if x^T A x, x^T B x >= 0, and 0 <= lambda <= 1, then x^T ( lambda A + (1 - lambda) B ) x >= 0. S is a cone: if x^T A x >= 0 and gamma >= 0, then x^T( gamma A ) x >= 0. Same for S' = {symmetric pos. semidefinite matrices} [ R^(n+1 choose 2) Lo"wner Ordering. A >> B means that A-B is positive semidefinite So A >> 0 means that A is pos. semidef. Some examples. (n=1) ax^2 >= 0 for all x iff a >= 0. [a b] (n=2) A = [b c] Associated quadratic form: ax ^2 + 2bxy+ cy^2. |a b| So A >> 0 <==> a >= 0 and |b c| >= 0. (general n): Let A be n x n, symmetric: a11 | a12 | a13 | ... a1n ----- | | a12 a22 | a23 | ... a2n ----------- | a13 a23 a33 | ... a3n ----------------- ... A >> 0 iff all n "upper left corner" determinants are >= 0. This means we can check whether A >> 0 or not in polynomial time -- O(n^4) operations -- if we use Gaussian elimination to compute these determinants. Note that optimization of linear functions over S or S' is pretty uneventful: the optimum is either 0 or the problem is unbounded. So we introduce further constraints. Defn. In R^n, an *affine subspace* is a translation of an ordinary linear subspace. These can be specified in two ways: 1) Solution set of linear equations 2) Image of an affine map t --> At + bt For example, the line through (1,0) and (0,1) is either: 1) Soln set of x+y = 1 or 2) All points (t,-t) + (1/2, -1/2). Semidefinite programming: minimize a linear function of the entries of a symmetric matrix X, subject to: i) affine constraints on entries, and ii) X >> 0. Just as with LP, SDP's can be presented in two ways, reflecting the two ways to specify the affine constraints. 1) min C o X such that Ai o X = bi, i=1,...,m X symmetric, X >> 0 ( C o X means sum_ij cij xij, the elementwise inner product. ) 2) Given A0, A1, ..., Am symmetric nxn matrices, c in R^m min cx s.t. A0 + A1x1 + ... + + Am xm >> 0. ^ x is (x1,...,xm) here Theorem: SDP has LP as a special case. Proof. The idea is that LP is SDP on the diagonal. We use the abbreviation diag(a1,...,an) for a diagonal matrix with the indicated entries. we translate min cx s.t. Ax = b, x >= 0 as follows: X = diag(x1,...,xn), so xij = 0 if i != j xi >= 0, all i iff X >> 0 ax = b iff diag(a1,...,an) o diag(x1,...,xn) = g cx is diag(c1,...,cn) o diag(x1,...,xn). Background on eigenvectors and eigenvalues Let A be a matrix. A number lambda for which there is a nonzero v making Av = lambda v is called an *eigenvalue* and the v is an *eigenvector*. The idea is that along the direction given by v, the action of A is extremely simple: it just stretches or shrinks v. The amount of stretch or shrink is the eigenvalue. An n x n symmetric matrix with real entries has n real eigenvalues lambda_1 >= lambda_2 >= ... >= lambda_n These are counted with multiplicity, and the corresponding eigenvectors are orthogonal. This is proved in G. Strang, Linear Algebra and its Applications, pp. 235 ff. The mother of all SDPs: Find the largest eigenvalue of a real symmetric positive definite matrix. Theorem. max { lambda_i } = min { lambda : lambda I - A >> 0 } This is like a strong duality result, but note that the max is over a discrete set and the min is over a half-line. Proof. for all x x^T( lambda I - A ) x >= 0 <==> for all x lambda ||x||^2 >= 0 <==> for all x != 0, x^T A x lambda >= --------- ||x||^2 (the Rayleigh quotient). Since the quotient is homogeneous (degree 2) we may as well let ||x|| = 1. Then, write x = v1 + ... + vn where the vi are eigenvectors for lambda_1,...,lanbda_n. Then A x = lambda_1 v_1 + ... + lambda_n vn x^T A x = lambda_1 ||v_1||^2 + ... + lambda_n ||vn||^2 <= lambda_1 ( ||v_1||^2 + ... + ||vn||^2 ) = lambda_1 ||x||^2 This upper bound is tight: take x = v1 / ||v1||. So, finding lambda_1 is an example of SDP (type (2)) with m=1: lambda I - A = -A + I lambda ^ ^ | | A0 A1 lambda = c lambda with c=1. [If A is not pos def can it easily be made so?] Lecture 36 4/23/18 Monday Using SDP to approximate maximum cuts Goemans and Williamson, J. ACM, 1995 Max-Cut Problem: G = undirected graph with edge weights (>= 0). Want to partition vertices into two sets and maximize the total weight of the crossing edges. (Note that this forces both sets to be nonempty.) Example: G = K_n (complete gph on n vertices) If |S|= k, |V-S| = n-k. There are k(n-k) crossing edges. We maximize this when k = floor(n/2). NP-hard, even if the weights are all 1 (Karp 1972). G+W found an approximation algorithm based on semidefinite programming. Review of SDP from last lecture pos semidef matrices SDP: minimize linear fn of pos semidef matrix coeffs, coeffs may be linearly constrained Factoring Positive Semidefinite Matrices The expression x^T A x looks like the square of a length. By factoring A we can make this explicit. Thm. If A >> 0 (symm pos semidef) there is a real, upper-triangular M s.t. A = M^T M. This is called a Cholesky factorization. Note that x^T A x = x^ (M^T M) x = ||Mx||^2. So in the right coordinates, sqrt(x^T A x) *is* a length. The matrix M can be computed by an algorithm using O(n^3) operations from {+,-,*,div,sqrt}. The square root is necessary, since for n=1 we see that a = m^2 implies m = +- sqrt(a). Golub and Van Loan have a discussion of this. Relaxing Max-Cut to SDP Let w_ij be the weight of {i,j}, or zero if there is no such edge. Here is a nonlinear integer program for Max-Cut: max (1/2) sum_{i X = Y Y^T, say with [row vector y1] Y = [ ... ] [row vector yn] X_ii = 1 <===> || yi ||^2 = 1 X_ij equals (yi, yj) So we can write the max-cut relaxation as an SDP in equality form: min sum_ij wij X_ij s.t. X >> 0, X_ii = 1, all i. Assume that we can solve this, and factor the resulting X to get the yi's. To get the approximate Max Cut: Choose a random hyperplane H thru the origin Label each vertex vi by +1 or -1, depending on which side of H the vector yi is on Use this vertex classification as the cut In expectation, the value of the G-W max cut is close to optimal. Let theta_i, theta_j be two unit vectors, theta_ij the angle between them, chosen so that 0 <= theta_ij <= pi. H intersects the plane of yi,yj in random fashion, so Pr[yi,yj on opposite sides of H] = theta_ij / pi. Lemma. There is a c>0 such that on [0, pi] theta/pi >= c (1 - cos theta)/2 (Exercise for you.) This is very plausible if you graph the functions on the left and right side. Observe that c=1 is too big. The largest c that works is alpha = 0.87856... and can be found numerically. Using this, E[G-W cut value] = sum_{i= sum_{i= alpha/2 sum_{i= alpha * max over yk in +-1 of (1/2) sum_{i b} Let C [ R^n, compact convex set A separation oracle for C takes in x in R^n, and either tells us that x is in C, or provides a hyperplane H containing x such that C is entirely within H_. (This means the distance from C to H is positive.) The idea: H: \ C: \ ** x **** \ **** \ **** \ Any polytope {x : Ax <= b} has a poly-time separation oracle. If Ax <= b, say that x is in C. Else, some inequality is violated, say A_i x > bi. Let H be {x : Ai x - b = 0}. Ellipsoid as a Generalization of Binary Search At all times, we have an ellipsoid E that contains C. H: E: | ---------- / | \ | C x | \ | / ---------- | We test x, the center of E. If x is not in C, we draw a new ellpsoid E' that encloses the "left half" of E. E' will have volume that is smaller than the volume of E, by a constant factor that depends only on u. Replace E by E' and continue. Since the volume of E shrinks geometrically, this cannot continue forever. For n=1, ellipsoids are intervals. So: E: C: |-----[***]--x------------| ^ test center (here) if x not in C, we reduce the length (= 1-dimensional volume) by 1/2. This is just binary search. Computational Details Affine map on R^n: x --> Mx + b, M nxn invertible solve y = Mx + b to get inverse: x = M^{-1}(y-b) = M^{-1} y - M^{-1} b. Ellipsoid: { x : (x - x0)^T A (x - x0) <= 1 }, A symm pos def Using a polynomial number of operations, we can find an affine change of coordinates that brings the ellipsoid to the form ||x||^2 <= 1. The idea of the algorithm is to repeatedly complete the square, and bring the quadratic expression x^T A x to diagonal form. Note that ||x||^2 = x^T I x. The center is x=0. Square root is needed, since x^2 <= 2 iff (x / sqrt 2)^2 <= 1. So we can assume E is {x : ||x|| <= 1}. Its two "slices" are E- = {x | ||x|| <= 1, x1 < 0} E+ = {x | ||x|| <= 1, x1 > 0} We get E' by taking the center (x=0) and moving it to the left along the x1-axis, maintaining the property that E' touches the left end of E, and meets the topmost points (0,x2,...,xn). For this we can draw the picture for the (x1,x2) plane. By symmetry it is enough to consider these coordinates. A good choice for the new center is x' = (-1/(n+1),0,...,0). The new radius (distance to left end of E) is 1 - 1/(n+1). We want (-1/(n+1),1) to be on E' (we are setting all other coordinates to 0), which is x1^2 / r1^2 + x2^2 / r2^2 = 1. Solving this for r2 (we know x1 = -1/(n+1), x2=1) we get r2 = (1 - 1/n^2)^{-1/2}. Suppose Vol(E) = c_n Vol(E') = c_n* r1 * r2^{n-1} Then Vol(E')/Vol(E) = r1 r2^{n-1} = (1 - 1/(n+1))(1 - 1/n^2)^{n/2} ... (computations skipped here) = 1 - 1/(2n) + O(1/n^2) So applying this 2n times will shrink volume by approximately (1 - 1/2n)^{2n} ~~ e^{-1}, about 0.37. Suppose we choose the initial ellipsoid so that C [ E_0 [ [-R,R]. Then (to first order), if there are k iterations, Vol(C) <= Vol(E_k) <= (1 - 1/2n)^k Vol(E_0) <= (1 - 1/2n)^k (2R)^n <= exp( - k/2n ) (2R)^n So log(Vol(C) <= -k/2n + n log 2R which rearranges into k <= 2n^2 log (2R) - 2n log( Vol(C) ). So after at most this many iterations, we will get an x in C. Lecture 39 4/30/18 Monday Continuing with Ellipsoid Algorithm Review of ellipsoid algorithm (last lecture) Getting Started Suppose C is the polyhedron {x : Ax <= b} [ R^n all entries of A,b integral, <= M in absolute value. C is a convex combination of its vertices (here is where we use that C is bounded). At each vertex v = (v1 ) there are n linearly independent tight (...) inequalities (vn ) So v is the unique solution to Bv = b, where B is a submatrix of A and det(B) != 0. (This follows from properties of the simplex algorithm). By Cramer's rule det(B with i-th col replaced by b) vi = --------------------------------- det(B) So (expand the determinant into n! terms) |vi| <= n! M^n Also, if |vi| = ri/si (fraction in lowest terms) we have log(ri), log(si) = O( n log(Mn) ). Suppose there are m vertices, and x in C. Then xi = sum_{j=1}^m lambda_j vi^(j), where sum_j lambda_j = 1, and all lambda_j >= 0. This plus the bound for vertex coordinates gives |xi| <= n! M^n implying ||x||^2 = sum_i |xi|^2 <= n (n!)^2 M^2n so ||x|| <= sqrt(n) n! M^n <---- make this R The initial ellipsoid can just be ||x||^2 <= R^2. As required, log(2R) = O( n log n + n log M) is polynomial in the number of bits needed to specify C. Note: Our argument that linear programming can be done in polynomial time is not complete, because we didn't examine the precision required in our numerical calculations. You can find a discussion of that in Papadimitriou and Steiglitz. Lecture 40 5/2/18 Wednesday Using the ellipsoid algorithm for SDP Lovasz, Semidefinite programming and combinatorial optimization (1995) So far we've used the ellipsoid algorithm for linear programming. The key concept is that of a separation oracle for a convex set C. Its job is to take a test point x, and report that either x is in C, or give a hyperplane H that goes through x and has C strictly on one side of it. (See lecture 3/28/18.) Using the separation oracle, we put C inside a equence of smaller and smaller ellipsoids, until we eventually find x in C. The ellipsoid algorithm can also be used for (well behaved) semi-definite programming problems. We have to be careful, since some familiar features of LP need not hold: Optima need not rational Example: eigenvalue problem, lambda_max is the largest root of an n-th degree poly equation. Strong duality need not hold In general, we only have weak duality Optimum points can be huge There is no analog of Cramer's rule. Let's look at how we can make separation oracles for SDPs. Simplest case: C = { A : A >> 0 } = cone of symmmetric positive definite matrices Factor A = U^T D U, D = diag(lambda_1,...,lambda_n). We can do this because A is symmetric (principal axis theorem) A is positive semidef iff all lambda_i >= 0. So if A is not pos semidef, lambda_1 (say) < 0. Let v = U^{-1} [1 ]. This a particular vector that depends [0 ] on A. [...] [0 ] Then v^T A v = [10...0] U^{-T} U^T D U U^{-1} [1 ] [0 ] [...] [0 ] = [10...0] D [1 ] = lambda_1 < 0. [0 ] [...] [0 ] Consider the linear map (X is a matrix) L(X) = -v^T X v. All pos semidef matrices X have L(X) <= 0. For the "bad" A (not in C), L(A) > 0. So take H = { X : -v^T X v + v^T A v = 0 }. Note that A is in H, and if X is pos semidef, -v^T X v + v^T A v < 0. ^ ^ | | <= 0 < 0 If we are going to use ellipsoid for an SDP problem we need a good a priori bound on the optimal matrix. This allows us to replace the (unbounded) cone by a compact set. Example: Goemans-Williamson max cut approximation (see lecture 4/23/18) Recall we needed max of (1/2) sum_{i> 0, X_ii = 1. (*) Here x_ij if {i,j} is an edge in the graph W = 0 if no such edge To get the a priori bound, factor X = Y^T Y (Y is called the Gram matrix), and let yi be the ith row of Y: [---y_1---] Y = [ ... ] [---y_n---] By Cauchy-Schwartz, |Y_ij| = |(yi,yj)| <= ||yi|| ||yj|| = 1. So |X_ij| <= n. [Need to discuss the separation oracle for this problem.] Now, suppose we have the SDP min L(x) s.t. x in K, where r and R are radii for enclosing the feasible set K. (That is, if x in K, then r <= ||x|| <= R.) The following complexity bound appears in a paper by Grotschel, Lovasz, and Schrijver, Combinatorica, v. 1, 1981, 169-197. (See eqn. (37).) Let X^ be the optimum for the above SDP. To get |L(x) - L(x^)| < epsilon, poly(R/r, log(1/epsilon), n) calls to the separation oracle are sufficient. For the max-cut SDP, our goal is |W o X - W o X^| < epsilon, where W is the weight matrix. What are good values of r and R? We can take R = n^2, since sum_ij |X_ij|^2 <= sum_ij n^2 = n^4. Finding r is more problematic. [Watch this space.] Although this gives a polynomial-time algorithm for approximating a maximum cut, in practice the ellipsoid algorithm is quite slow. This is a result of: 1) A volume shrink factor close to 1 (to first order, 1 - 1/(2n)) 2) The number of iterations is not sensitive to the data, so the average case is basically the worst case 3) The need for accurate arithmetic For these reasons, modern convex optimization routines use interior point methods. (These are beyond the scope of this course.) Lecture 41 5/2/18 Wednesday Lagrangian Relaxation This is a general-purpose approximation method to get bounds on solutions to "hard" problems. We'll consider minimizing problems min f(x) s.t. (A) x in X g(x) >= 0. The main idea is to move the constraint into the objective function, in effect introducing a reward for compliance: min f(x) - u g(x) s.t. x in X (B) u >= 0 Fact. Let x^ be an minimum for (A) and x~ an minimum for (B). (We'll assume all required optima exists.) Then for any u >= 0, f(x^) >= f(x^) - u g(x^) [x^ is feasible for A, so g(x^) >= 0] >= f(x~) - u g(x~) [x^ in X, so feasible for B] This is a lower bound with a parameter in it, which is very valuable. Let's make the lower bound as tight as possible: min f(x) s.t. >= max min [ f(x) - u g(x) ] x in X u >= 0 x in X g(x) >= 0 L(x,u) -- the Lagrangian -- is the expression in brackets theta(u) will denote min_{x in X} L(x,u), sometimes called the dual function Note that we have a "weak duality" result: min { f(x) : x in X, g(x) >= 0 } >= max { theta(u) : u >= 0 } Now we see how this unfolds for a particular problem. Time Constrained Shortest Path G = (V,E) is a DAG with distinguished source s and sink t. Each edge has a cost c and transit time t Given a deadline T, we wish to select, from among s-t paths taking total time <= T, one of least cost. Imagine a package delivery service accepting an order T hours before Christmas. This is an NP-complete problem (Garey & Johnson, p. 214) For example (all edges directed left to right): (4,2) o ------ o labels are (1,1) / \ / \(1,3) <-- (cost,time) /(2,2)\ / \ s \/ t \ /\(3,1)/ (2,1) \ / \ /(1,3) \ / \ / o ------o (6,7) Using dynamic programming can show in s-t time 5 There is one path taking 5 time units: s o \ 1/ \3 total cost 2+3+1=6 1\ / \ o t If we take use 6 time units, two more paths become available: o t 1/ \2 /3 total cost 1+2+1=4 / \ / s o and 2 o ----- o 1/ \ 3 total cost 1+4+1=6 / \ s t To put this in our framework, let P = {s-t paths}, c = cost vector t = transit time vector By weak duality min cx s.t. >= max min { cx - u(T - tx) } x in P u >= 0 x in P tx <= T If we fix u, finding min of L(x,u) is a shortest path problem, in which path cost = (c + ut)x - uT. ^ ^ new cost same for vector all paths We can solve this by dynamic programming (remember G is acyclic). For each x in P, u --> L(x,u) is an affine-linear function of u. It is shown in convex analysis that the pointwise infimum of any collection of affine-linear functions (u --> au + b) is concave. (We have finitely many x in P, so the inf is actually the min.) So theta(u) is concave. We can maximize it by a variation on binary search. The idea is to maintain three samples. Leaving out configurations forbidden by concavity, there are three possible cases: * / * max can't be in left half (right end is bigger) / so remove this. next sample is between the middle * and the right points. * \ * (symmetric) \ * * / \ take 2 samples that bisect the "legs" / \ * * More efficient search strategies can also be used, such as Fibonacci search. Upshot: if 0 <= optimum u <= U, then we can get the best Lagrangian relaxation bound for this problem in O(E log U) steps.