CS 787
Spring 2018
Course Log

Lecture 1
1/24/18 Wednesday

   Handouts: Course blurb, topics list, books on reserve
             Questionnaire (return Friday 1/26/18)

   Course web site: pages.cs.wisc.edu/~cs787-1

   Stable marriage problem

      Gale and Shapley, American Math. Monthly, 1962.

      We discussed a simplified version of their "proposal" algorithm
      in which there is only one proposal (from a currently unattached
      man) per round.

      Main properties:

          1. Procedure stops (as soon as each woman has received a proposal)
          2. O(n^2) proposals, faster than trying all pairings
          3. Men's partners get worse, women's partners get better
          4. End result is stable (no man and woman prefer each other
             to their current partner).

Lecture 2
1/26/18 Friday

    Using induction to design algorithms

       U. Manber, Comm. ACM, 1998.

    Where do good algorithms come from?  One principle is mathematical 
    induction.

    Review of step-by-step induction

         Calculus example: d/dx(x^n) = n*x^(n-1).  Use the product rule.

    Use in algorithm design: To solve an instance of size n, *assume*
                             you can solve instances of size n-1.
                             

    Example: Largest permuted subset (in the paper)

             Given f:X->X (X is finite), find a largest S on which
             f acts like a permutation.

             2^|X| subsets, so exhaustive search is not a good idea.
 
             Try to reduce size of X by removing elements.

             Idea: Any y that is not hit by f cannot belong to S.
             Remove it and solve the problem recursively.

             E.g. f(x) = x^2 mod 5, on X={0,1,2,3,4}
             A largest S is {0,1}.

             Can implement in O(|X|) time using reference counts (see paper).

    Example: Balance factors in binary trees

             Height of a binary tree = max # of edges in a root-leaf path
             Balance factor = height(left subtree) - height(right subtree)

             Want to compute balance factors of all subtrees

             Obvious induction won't work, since a tree's balance factor is
             not determined by balance factors of subtrees.  For example,
             let T_k be the complete binary tree of height k (has 2^{k+1}-1
             nodes, and balance factor 0).  The trees

                     *                         *
                    /  \                      /  \
                   T_2  T_2                  T_2  T_3

             have balance factors 0 and -1.

             Fix by strengthening the induction hypothesis: To handle a tree
             w/ n nodes, assume we can find both balance factor and height
             for all trees with <n nodes.

             Resulting pseudocode:  If T has subtrees L and R,

               height(T) = max{height(L), height(R)} + 1
               balance factor(T) = height(L) - height(R).
 
             The base cases are not so obvious.  A single node has height
             0, and to make the recurrence relation work, we say that
             an empty node has height -1.  Both have balance factor 0.

Lecture 3
1/29/18 Monday

    Constant-factor induction.  See the course web page for

        A. Cauchy, in G. Polya and G. Szego, Problems and Theorems in Analysis,
                   p. 64.
        Karatsuba and Ofman, Soviet Physics Doklady, 1963.

    Arithmetic-geometric mean inequality:

        If A1,...,An >= 0, then (A1 * ... * An)^(1/n) <= (A1 + ... + An) / n.

        This is the same as: A1*...*An <= [ (A1 + ... + An) / n ]^n.

        Cauchy gives an algebraic proof for this (no calculus).

           If n=1 there is nothing to do

           If n=2, use the polarization identity:

                A1 A2 = 1/4 * [ (A1 + A2)^2 - (A1 - A2)^2 ]
                      <= [ (A1 + A2) / 2 ]^2

           Use the same idea to handle 4, 8, 16, ...

           Now use *reverse induction*: if it holds for n+1, it holds for n.
           This will fill in the gaps and take care of all n >= 1.

           Let n>=3.  Let Abar = (A1 + ... + An)/n.  Assume >0, it not, done.

           A1 * ... * An * Abar <= [ (A1 + ... + An + Abar) / (n+1) ] ^(n+1)

                                = Abar  ^ (n+1).

           Cancel Abar to get desired <= .

        Note: We descended step by step but Cauchy uses the same idea to go
        directly from a power of 2 to a number below it.

     Karatsuba's Subquadratic Multiplication Algorithms

        ~ 1960: It was believed that the "school" algorithm for
                multiplying n-bit integers, which uses O(n^2) bit
                operations, was optimal.

        Karatsuba found an algorithm with o(n^2) bops.

           Use polarization identity to reduce to squaring

           Let n=2^k, and x = x1 * 2^{n/2} + x0, where 0 <= xi < 2^{n/2}.

           x^2 = x1^2 * 2^n + 2 x0 x1 * 2^{n/2} + x0^2.
                  ^              ^                 ^
                  |              |                 |
                  A              B                 C

           Compute A, C, D = (x0 + x1)^2.  Then B = D - A - C.

           Reduces squaring an n-bit number to 3 squarings of
           n/2-bit numbers.

              Note: Suppose the sum overflows, i.e. x0+x1 = 2^{n/2} + y.
                    Then use (x0+x1)^2 = 2^n + 2^{n/2+1} y + y^2.

           If S(n) is the number of bit ops needed to square n-bit numbers,

              S(n) <= 3 * S(n/2) + const * n
              S(1) = const.

              (The const*n is for the work of adding, subtracting, shifting.)

           Since we want a big-O estimate, we can take the constants to be 1,
           and consider the corresponding equality.

           Consider the recursion tree.  The i-th level has 3^i problems of
           size n/2^i.  Summing up the work by levels, we get

              S(n) = n * [ 1 + (3/2) + (3/2)^2 + ... + (3/2)^k ]
                   <= 2 * n * (3/2)^k    (sum geometric series)
                    = 2 * 3^{log_2 n}
                    = 2 * n^{log_2 3}
                    = O( n^ 1.59... ).

           The same result holds for any n, since we can just pad by 0's
           and replace n by 2^ceiling(log_2 n} <= 2n.

        Further work has improved this bound.  The best known result,
        due to Furer, is barely above O(n log n).

Lecture 4
1/31/18 Wednesday

    Dynamic Programming

    K. Pruhs, SIGACT News v. 29, 1988, 32-35.

    Historical origins: Optimization Principles in Physics

       Principle of least time (Fermat, 1600's): Among all paths from
       A to B, a light ray will pick for which the transit time is least.

           Example: Refraction

                    Two media separated by a plane boundary
                    A is in the "fast" medium, B in the "slow" one.
                    The light ray will bend so as to make its path
                       longer in the fast medium.
                 
                    Can find the crossing point using algebra
           
                    Leads to Snell's Law for refraction angles

        Similar ideas in mechanics (principle of least action).
        Feynman's Lectures on Physics, v. 2, has a wonderful lecture
        on this.

     Bellman (1950s) exploited this for computational problems

        Exploited the "principle of optimality:" Each part of an
        optimal solution must itself be optimal.

            For refraction example, suppose the crossing pt is X
            The A-X and X-B paths must be straight.  Reduces the
            search space for finding an optimal path.

        Dynamic programs usually compute more information than the
        problem strictly demands.

            Example: To find a sortest path from s to t in a graph,
                     we must know shortest paths from s to all other
                     vertices.

        Subproblems are typically solved using a (multidimensional)
        recurrence relation.

     Even if we want to systematically solve subproblems, it is not
     obvious which ones to consider.

     Pruhs suggests attacking the problem "from above":

          Make enumeration tree of all feasible configurations
          Apply pruning rules to reduce the tree to a "small" size
          Develop a recurrence relation to generate the nodes
            level by level (from root to leaf).

          Example: Longest increasing subsequence of a1,...,an.
                   Assume ai are distinct.

          Full tree has, at kth level, all the subsequences from a1,...,ak.

             Keep only increasing sequences
             If two level k seqs have the same length, keep only the
                one with smallest final element (it can be used in any
                extension where the other one can).

          Cuts tree down to O(n^2) nodes, with <= n+1 nodes per level.

          Let BEST(k,l) = the length l increasing subsequence with smallest
                          final element, which occurs at level k.

                             | BEST(k-1,l)  /* don't use ak */
          BEST(k,l) = one of |
                             | if this increasing,
                             | BEST(k,l) o ak  /* use ak */

                 We choose the one with smallest final element

          This gives an O(n^3) algorithm.  We can reduce to O(n^2)
          by storing, for each k,l:

                 location of last element
                 which of the above two choices was used

Lecture 5
2/2/18 Friday

    Collision Algorithms

    J.M. Pollard, Math. Comp. 1978

    Idea: In many situations, agreements between pairs are more numerous
    than specific kinds of single elements.

    Example: A sock drawer with m kinds of socks
             Pick one, look for another to match it ==> O(m) search time
             Pick socks until two of them match ==> O(sqrt m)

    Subset Sum Problem

        Given a1,...,an;b (all positive integers)
        Find S [ {1,...,n} for which sum_{i in S} a_i = b.
        IP feasibility version: sum_i x_i a_i = b, x_i in {0,1}
 
        2^n choices for the x_i's
        Decision version (can we make sum <= b?) is NP-complete
        Solvable using dynamic programming in O(n b) arithmetic opns

        Collision idea: split (x1,...,x_n) into left and right halves.
        Try to solve

                sum_{i <= n/2} x_i a_i = b - sum_{i>n/2} x_i a_i

        Enumerate all left sides, all right sides.  Sort and look for
        a match.

        Size of our sorting problem, N, is <= 4*2^{n/2}

        So this is an O(N log N) = O(n 2^{n/2}) algorithm.  Much better
        when n is large.

    Discrete Logarithms (Pollard)

        Background on number theory:

          When p is prime, for some q (relatively prime to p),
          the powers 1,g,g^2,...,g^{p-2}, reduced mod p,
          exhaust 0,1,...,p-2.

          [Small example: p=7, g=3]

          g is called a primitive root or generator for p.

          If g^x == a mod p, we call x an index or discrete logarithm
          (base g) for a.

          Fermat's little theorem: when a,p are relatively prime,
          a^{p-1} == 1 mod p.  (Prove by group theory, or show a^p == a
          by induction on a.)

          Therefore, the powers of g make a periodic periodic sequence,
          with period p-1.  For this reason we only care about the index
          mod p-1.

          x --> g^x is easy (repeated squaring)

          Determining x from a is hard (no known poly time algorithm)

        Pollard's Kangaroo Method

          Choose hash fn Delta : {1,...,p-1} ---> {0,...,p-2}
          Let f(x) = x*g^Delta(x).  Think of this as an irregular
          step with a known distance.  In fact,

             log f(x) = log x + Delta(x)   (mod p-1).

          Consider two sequences:

           1) x_0 = 1, x_i = f(x_{i-1}), i=1,...,m.
           2) y_0 = a, y_i = f(y_{i-1}), i=1,...,m.

          We hope that for some i,j, we have x_i = y_j.
          Then the two sequences are coupled and will agree forever.

          Pollard calls 1) the "kangaroo trap".  We don't have
          remember the entire sequence, just the last element x_m
          (and its distance from x_0).  Think of a hunter hiding at
          x_m:

           x_0 --> x_1 --------> x_2 -> x_3 ---- ... ---------> x_m.

          A second kangaroo that lands in the trap, and has identical
          jumping behavior, will end up at x_m.  This can be detected
          by comparing each successive y_j to x_m.

          Analysis (only heuristic)

             Assume: All samples (except for starting points) are
                     i.i.d. uniform on Zp* = {1,2,...,p-1}.

             Suppose x_1,...,x_m are distinct.  Then

                 Pr[y_j hits trap] = m/n    (n = p-1).

                 Pr[y_1,...,y_m avoid trap] = (1 - m/n)^m

                 With m = sqrt(n) this is (1 - 1/sqrt(n))^sqrt(n) ~ e^{-1},
                      about 0.367.

                 Also (for this m) expected number of duplicate pairs in
                 the list x_1,...,x_m is (m choose 2)*(1/n) ~ 1/2. 
                 So with high probability the trap has about sqrt n elements.

              This gives (heurstically, anyway) a way to compute discrete
              logs using about sqrt(p) time and space for O(1) residues
              mod p. 

              Note that exhaustive search would have been O(p).

Lecture 6
2/5/18 Monday

   Linear programming

   Reference: Lectures 3-4 from R.M. Karp's course, Great 
   Algorithms, Winter 2006.

   LP = Optimization of a linear function subject to linear
        constraints.

   1939: Kantorovich, industrial production models
   ...
   1947: Dantzig, USAF logistics problems
                  Simplex algorithm

   Early example: optimal diets (Stigler 1945)

          Data = list of foods with cost/unit and amounts/unit
                   of various nutrients
          Want to meet daily requirements at lowest cost 
          With ad hoc analysis, found a "cheap" diet (about $2/day
           in 2017 dollars) using wheat flour, evaporated milk, cabbage, 
           spinach, navy beans.  (Optimal?)

   Relevance to our course:

        1. Can reduce combinatorial problems to LP (e.g. network flow)
        2. Structural insight (polyhedral combinatorics)
        3. Can use for approximation algorithms
        4. Makes papers by Karp, Edmonds,... more accessible

   Formal development

        R^n = n-dim'l real vectors
        inner product, length, angle
        vector inequalities: <=, <

        hyperplane: soln set to ax = b
        half-space: soln set to ax <= b

        polyhedron: intersection of finitely many half-spaces
                    need not be bounded

        example: x <= 1, y <= 2, in R^2.

        polytope: bounded polyhedron

        Linear programming: Optimizing linear functions on polyhedra

             e.g. min of cx s.t. Ax >= b.

                c = row vector
                x = column vector
                A = m x n matrix
                b = column vector

        May or may not have a solution (e.g. for above polytope,
        consider max of x+y, min of x+y).  Even if solution exists,
        it need not be unique (consider max of x).

   Vertices

        We'd expect to find solutions at "corners" of the polytope.

        Defn: x in P (polytope) is a *vertex* if there is no line segment
        a--b within P with x strictly between a and b.

            Note the "forbidden configuation."

        Theorems (not proved here)

        1. x is a vertex of P iff for some c, x is the unique solution
           to max cx s.t. x in P.

        2. Let x in P, where P [ R^n is the intersection of m half-spaces. 
           Then x is a vertex iff exactly n of the defining constraints 
           hold with equality.

        Degeneracy: If Ax <= b is a system of linear inequalities (x in R^n),
        a vertex of the polyhedron they define is called degenerate if
        >n of the constraints hold with equality.

            Note that this depends on how we represent the polytope

            If v is degenerate, must there be a redundant inequality?
            No -- consider the apex of a pyramid.

        Degeneracy is important because it can cause the simplex algorithm
        to misbehave.

Lecture 7
1/7/18 Wednesday

    The Simplex Algorithm

    Karp, op. cit.

    LP: min cx s.t. x in P (polytope)
        P [ R^n given explicitly by linear inequalities

    Possibly P is empty (infeasible) or unbounded in the direction of
    decreasing cx (unbounded).  Under "normal" conditions, P is
    the convex hull of finitely many extreme points (vertices):

            [draw a picture for n=2]

    Simplex algorithm: descent from vertex to vertex

            Start with some vertex v
            While there is an adjacent vertex w that improves objective,
               v <- w [move there]
            Return v

    The trick is to do this with algebra

    Equality form for LPs:    min cx s.t. Ax = b >= 0, x >= 0.
                              A is an m x n matrix of rank m.

    To put into this form:

          a) replace unconstrained vble by difference of non-negative vbles
          b) replace any ax <= b by ax + s = b, where s>=0 is a slack vble
          c) delete redundant rows from A
          d) flip signs to make b >= 0

    Descent (Pivot) Step:

        The algebra in Karp's notes is pretty clear on this so we'll 
        just give the highlights.

        Eqns are  [ B N ] [x_B] = b
                          [x_N]

        Assume B is mxm, invertible.  Setting x_N = 0 and solving gives
        a vertex.  Assume for simplicity that x_B > 0.

        Try to raise some x_j in x_N.  Then

               new x_B = B^{-1}(b - A_j x_j)   [A_j = A's j-th column]

        The idea is that x_j will be a local coordinate for the line
        segment linking the old vertex to the new one.

            new obj value = [old obj value] + (c_j - C_B B^{-1} A_j) x_j

                                                     ^
                                                     |
                                            if this is >= 0 raising x_j
                                            won't help.

                                            if this is >= 0 for every j,
                                            stop (vertex is optimal).

            pick a j making the expn in ()'s negative.

        Let d = B^{-1} A_j.  Then

            new x_B = [old x_B] - d x_j ,  needs to be >= 0.

            if d <= 0, we can increase x_j forever, so the problem
            is unbounded (stop).

            Suppose some d_i > 0.  If the old x_B > 0,
            we can (strictly!) increase x_j until some x_i in X_B
            becomes 0.

        For simplicity, assume this occurs for a unique i.

        Then we have removed a column from B (the i-th one) and
        replaced it by one from N (the j-th one).

        This gives a new vertex.

        Exercise: Prove that the invertibility of B is an invariant
                  of the algorithm.  (This is easiest to prove in
                  the nondegenerate case where x_j increases, but it 
                  actually holds in all cases.)

   Degeneracy

     The vertex [x_B] is *degenerate* if some entry of x_B is 0
                [ 0 ]

       That is, the number of 0 entries is "too big."

       This is a special definition, just for the simplex algorithm.

     In this case, the new x_i could come in at 0, meaning that
     the line segment we wanted to traverse to improve the objective
     has collapsed to a point.

     This can cause cycling: remove x_i, then x_j, then x_i, ...

     How do we avoid this?  Here are three strategies:

        1. Perturb the problem.  If the perturbation is small enough,
           you will get the same B at the end.
        2. Randomly pick x_j from the available contenders.
        3. Use a combinatorial anti-cycling rule, such as the following
           (due to R. Bland): always pick the smallest index.

   The Number of Descent Steps

     This is usually small, but examples are known with an exponential
        number of pivots (Klee & Minty, 1972).
         
     Therefore, simplex is not worst-case polynomial time.

Lecture 8
2/9/18 Friday

    Review of simplex:  Using notation of the last lecture,
    we pick a variable x_j in x_N (the "inactive" vbles, set to 0)
    and imagine raising x_j.  We have

       new  =  old + (c_j - c_B B^{-1} A_j) x_j 
       obj     obj
                       x_j is *eligible*
                       if the number in 
                       ()'s is > 0.

    If there are no eligible variables, we can stop as the current
    vertex is optimal.

    Otherwise, an eligible x_j is raised as much as possible (might
    be not at all), with the effect of 

           new B = old B with i-th column removed and j-th column added.

    Invertibility of B is an invariant of the algorithm (exercise).

    Starting the simplex method

        Make up a new LP with an "obvious" vertex

                min sum_i y_i s.t.
                Ax + I y = b (>= 0)
                x,y >= 0.

        Run simplex starting wih the vertex x=0, y=b.

        If min sum_i y_i > 0, original LP is infeasible.

        If min sum = 0, then all y_i = 0.  Consider the columns
        that are now in B.

          i) If they are all columns of the original A, 
             throw away the y's, and run simplex (with the original
             cost function) starting with these columns.

          ii) If some columns of B appear in I, they can be "swapped"
              with columns of A to achieve i).  (Exercise for you.)  
              Continue as in i).

    Duality

       Every minimizing LP has an associated maximizing one, called its
       dual.  (Vice versa for maximizing LP's)

       Think of original LP (primal) and its dual as two halves of the
       same problem.

       P:   min cx s.t                     D: max yb s.t.
            Ax = b, x >= 0.                   yA <= c  (y unrestricted)

       Weak duality theorem:  If x,y are feasible for P,D then

             yb = y Ax = yA x <= cx

                              ^
                              |

                          (since x>=0)

       Corollary: If P is unbounded, D is infeasible

                  If D is unbounded, P is infeasible

       Strong duality theorem: If P has a finite optimum, then
       D has an optimum, and the two objective functions have the
       same value.

       Pf: Run simplex on P, to get

               c_B  c_N

             [ B    N  ] [ x_B ]               = [ b ]
                         [ x_N ] <- 0 here       [   ]

           min cx = c_B x_B = c_B B^{-1} b

           Termination ==> for all x_j in x_N,

                      c_j - c_B B^{-1} A_j >= 0,    all x_j in x_N.

           Let y = c_B B^{-1}.  Then:

           1) y is dual-feasible:

                  yA = y c_B B^{-1} A,  

              and

                  B^{-1} A = B^{-1} [ B  |   A_j's for j>m]
                           =        [ I  |   B^{-1} A_j's for j>m]

              so
                  c_B B^{-1} A = [ c_B  |   c_B B^{-1} A_j's for j>m]
                                            
                                            <= c_j  (j is not eligible)

              That is, yA = c_B B^{-1} A <= c.

            2) y is dual-optimal:  By weak duality

               yb = c_B B^{-1} b = c_B x_B = cx,

               (so yb is as large as any dual objective can be)

       We can think of simplex as solving the dual along with the
       primal (!).  Some implementations do this explicitly, by
       maintaining a representation of B^{-1}.

    Interpretation of dual problem

       Forming the dual is a mechanical procedure, which rearranges
       the data.  It is more interesting and useful to ask ourselves
       what the dual *means*.  This is application dependent.

       Example: optimal diet problem.  Suppose the nutrition requirements
                are met exactly, so we have

                  min cx s.t. Ax = b (>=0),   x >= 0

                Rows of A represent nutrients (how much is in each food),
                columns represent foods (how much of each nutrient it
                contains).

                Primal expresses the consumer's problem: meet nutritional
                goals at least cost.

                Now imagine a supplier of artifial nutrients enters the
                market.  (For simplicity, assume he has enough on hand
                to sell.)  

                  He might charge y_i for 1 unit of the ith nutrient

                  He wants to maximize his revenue yb

                  yA <= c means he is competitive food-by-food.

                  Suppose this is not the case.  Then, for some j,

                        y [j-th column of A] > c_j.

                  and the consumer could satisfy a portion of his demands
                  more cheaply by buying the j-th real food.

Lecture 9
2/12/18 Monday
      
    Set Cover Problem
    V. Chvatal, Math. of OR, 1979

    Setup: Given S1,...,Sn subsets of X = {1,...,m}
           c_j = cost of Sj (positive)
           want to choose a set J of indices so that
             the S_j, j in J, cover X at minimal cost

    Example: Car buying
             elts of X are features we'd like: video in back, CD player, ...
             sets are packages that include various features
             we can't buy features individually, only as part of a package

    Decision version of Set Cover is NP-complete (VERTEX COVER is
    a special case).  So we don't expect to solve exactly, and look
    for approximations.  That means (in this case)

         the sets we pick cover X
         the cost of these sets is not too far from the minimum cost

    Example (m=4, n=5):

                 columns are         The 0-1 array is called the
                    sets             incidence matrix

                   0   1   0   1
                   0   1   1   1
                   1   1   0   0    rows are elements
                   1   0   1   1
                   0   0   0   1

           costs   3   1   4   5

           cost
           ----    3/2 1/3 2 5/4
           element

     One thing we could do is just pick sets in decreasing cost order,
     skipping over any that don't give us any new elements.  If we do 
     that here, we incur cost 1 + 3 + 5 = 9.   Note that S3 is skipped.

     Here is a more sophisticated strategy (GREEDY, should be called STINGY).
     At each step, choose a set that enlarges the collection, at minimal
     cost per "new" element.

        On the example above, GREEDY tells us to pick S2 (which covers
        1,2,3).  Then the costs (per element remaining) are revised to be

           cost
       ----------  3   --- 4 5/2
       new element

          Because elements are effectively being removed from the sets,
          the costs per element go up, or at least do not decrease.  

        Now the next best choice is S4, which adds 2 elements at cost 5.

        So GREEDY's cover is S2, S4.   Cost is 1+5 = 6.

        This is an optimal cover: choosing S4 is the only way to cover
        5, and any additional set costs >= 1.

      We don't expect GREEDY to always find the optimum (since Set Cover
      is NP-hard), but we can ask how well it does compared to the
      best cover.

      To answer such questions (for minimizing problems), we need lower
      bounds on the cost of the optimum.  LP duality gives us such
      lower bounds.

    Integer program (just like LP, but variables must be integral) for
    set cover:

                  min sum_j c_j x_j s.t.

                  Ax >= 1 [all 1's column vector]    (A = 0-1 matrix)
                  x_i >= 0, integral

                  (Note: no harm in allowing x_i > 1 here.)

    Relax to a linear program:

                  same, but x_i >= 0 (can be fractional).

          Intuitively, the LP is allowing the car buyer to purchase
          packages at a fractional level: say half of the sports package,
          half of the entertainment package, ...

    The dual of this is:

                max y1 = sum_i y_i s.t.

                yA <= c, y >= 0.

        What does it mean?  We can think of Set Cover as a kind of
        optimal diet problem where the demand for each nutrient is
        the same (you need it, but any amount will do), and foods cannot
        be split (the grocer will not sell you half a cucumber).
        Using this analogy, y_i is the price for the i-th element.

        On the example, GREEDY paid

             (1/3 1/3 1/3 5/2 5/2).

        Is this dual-feasible?  No.  If we multiply this by the
        last column of the incidence matrix, we get 2/3 + 10/2 > 5. 
        However, it can be "shrunk" by a logarithmic factor to become
        dual-feasible.

   Theorem: If y denotes the vector of prices paid by GREEDY, then 
            y* = y/H_m is dual-feasible.  (H_m = 1 + 1/2 + ... + 1/m.)

   Proof: Consider the set S = {e1,e2,...,ek} with
          the elements numbered in the order they were covered by
          GREEDY.  (Resolve ties arbitrarily.)

          Right before ei was covered, S had >= t - (i-1) uncovered 
          elements.  (At most i-1 elements of S could have been covered
          before ei.)

          So GREEDY paid at most cost(S) / (t-i+1) for ei.  (It might have
          covered ei as part of another set, but that would have been
          cheaper, as marginal costs only go up.)

          Therefore the total cost paid for elements of S is 

                <= sum_{i=1}^t cost(S) / (t - i + 1)
                <= H_m * cost(S).                     [Since t <= m.]

          Suppose the optimum cover is S1,...,Sk.  Then GREEDY's cost
          for all elements is

            <= sum_{j=1}^k GREEDY's total price for Sj
            <= sum_{j=1}^k H_m * cost(S)
             = H_m * [cost of optimal cover].

    To get our approximation factor, we apply weak duality:

                   (dual)                    (primal)

    . . . ..........x................|==========XX=====X==X=== = = =
      
                    ^                           ^
                    |                           |
                                                 
             sum of yi*                  cost of best
          = GREEDY cost / H_m   <=      integral soln
                                        (also a fractional
                                         soln)

         so [GREEDY cost] <= H_m * [cost of min cover]

Lecture 10
2/14/18 Wednesday

   Is there a better analysis for GREEDY? 
   Integrality Gap
      
   Review of Set Cover problem (notation from last lecture)

   GREEDY's approximation factor [cost of GREEDY soln]/[cost of best cover]
   is <= H_m.  This cannot be improved:

           Let X={1,...,m}.  Let
           S_0 = X, with cost 1+epsilon and
           S_i = {i}, cost 1/i.

           GREEDY will pick S_m,...,S_2,S_1 in that order.  To see why,
           consider the first step.  This picks S_m and leaves the
           same problem but with m reduced to m-1.

           So cost of best cover = 1+epsilon
           Cost of greedy cover = H_m, so
          
                                    cost for GREEDY    H_m
             Approximation ratio = ---------------- = ---------
                                    optimum cost      1+epsilon

                                 ~ (1 - epsilon) ln m.

   Now let's consider any Set Cover strategy, like GREEDY, that
   is based on comparing a dual solution to the best integral
   cover.  The best we could hope for is to use the optimal
   dual value, which (by strong duality) is the cost of the best
   fractional cover.  The performance of such a strategy will be
   limited by the difference

        [best integral cover cost] - [best fractional cover cost].

   This is called the *integrality gap.*  (Note: some authors use
   the ratio instead of the difference.)

           [Draw the picture.]

   A Set Cover example with integrality gap Omega(log m)

       Background: F2 = {0,1}, with + and * taken mod 2.

          [Give tables for +, *]

       F2 satisfies all the laws of high school algebra.
       Systems like this are called *fields.*

          [Give examples, such as distributive law, and existence
           of multiplicative inverses for nonzero elements.]

       Let V = F2 x F2 x ... x F2  (n times).

       Elements of V are sequences of 0's and 1's, and they can
       be considered "vectors" for the system of "scalars" F2.

           Note: you can do this for any field F.  Calculus textbooks
           usually treat n=2,3 and F = R (the reals).  

       Inner product: vw = sum_i v_i w_i (mod 2).

       Now we make a set cover problem.

          Let X = V - 0   (all bit vectors that are not all 0's)
          |X| = 2^n - 1 := m
          For each v in X, let

             S_v = {w : vw = 1}     [note: w != 0 so S_v [ X]

          Each S_v contains 2^{n-1} elements of X

             Proof: Wolog v_1 = 1.  We can pick w_2,...,w_n
                    arbitrarily, in 2^{n-1} ways, and then solve
                    v_1 w_1 + sum_{i >= 2} v_i w_i = 1, to get
                    w_1 = 1 - sum_{i >= 2} v_i w_i = 1.

          By the symmetry of the inner product, each w in X is contained
          in 2^{n-1} S_v's.

          We will assign each S_v a cost of 1.

      Example: n=2, so m=2^n-1=3.

                           sets

                                _
                      *1   1*  **

                   01  1    0    1

             elts  10  0    1    1   <--- 3x3 incidence matrix A,

                   11  1    1    0        in this case symmetric.

      In general, the number of 1's in each row and column of A
      is 2^{n-1}.  [Can we always make A symmetric?]

      Let c be an all-1's row vector, and b an all-1's column vector.
      The LP relaxation for our set cover problem, and its dual, are

          P:    min cx s.t.          D:   max yb s.t.
                   Ax <= b                yA >= c
                    x >= 0                y >= 0.

      Let x = (1/2^{n-1},...,1/2^{n-1}) and y = x^T (transpose).
      Then x is primal feasible, y is dual feasible, and they
      have the common value

                m/2^{n-1} = 2 - 1/2^{n-1}.

      Weak duality tells us that x is the best fractional cover.

      Claim: We cannot cover X with fewer than n of the S_v's.

        Proof: Choose k<n such sets, and call them S_1,...,S_k.
               Writing ' for the complement of a set relative to X,
               we have

                   (S_1 u S_2 u ... u S_k)' 
                 = S_1' n ... n S_k'
                 = [intersection of k (n-1)-diml vector spaces of V]  - 0 
                 = subspace of dimension >= n-k (<= 1) - 0
                 = nonempty set .

               So S_1 u ... u S_k != X.

       Therefore, the integrality gap for this example is

            n - 2 + 1/2^{n-1} ~ log_2 m.

       Note that log_2 != ln.  Examples are known with integrality
       gap ~ ln m.    

       As a final note, it has been proved that no polynomial-time
       Set Cover algorithm has an o(log m) approximation factor,
       unless P=NP.  See ???, Proc. STOC 2013.

Lecture 11
2/16/18 Friday

  New Section: Paths, Flows, and Matchings

  Shortest Paths

     Let G = (V,E) be a directed graph.  Each edge has a *cost* (>= 0),
     which we will think of as a distance.  For given s,t in V we want
     a least-cost path from s to t.  (For simplicity assume that s
     can reach t.)

       Example: Vertices s,a,b,t
                Edges s->a, s->b, a->b, a->t, b->t.
                       e1    e2    e3    e4    e5

     We can represent G by a *signed* incidence matrix A:


                  e1  e2  e3  e4  e5            sign convention:
              s   -1  -1   0   0   0
              a    1   0  -1  -1   0               - ---> +
              b    0   1   1   0  -1
              t    0   0   0   1   1

     Lots of information about the graph can be read off this matrix.
     For example, the out-degree of a vertex is the number of -1's
     in its row.

     Think of a path from s to t as carrying 1 unit of some
     commodity from s to t.  We let

                  1 if ei is in the path
            xi = 
                  0 if not

     Then, Ax = b, where 

                [-1 ]
                [ 0 ]
            b = [...]
                [ 0 ]
                [+1 ]

     The first and last equations say that s is a source and t is a sink.

     The other equations enforce Kirchoff's current law: if v != s,t
     the flow into v (number of path edges ending at t) equals
     the flow out of v (number of path edges beginning at t).

     Note that these paths need not be simple.

     Let cj be the cost for ej.
     As an IP, the shortest s-t path problem is:

           min cx s.t. Ax = b, xj in {0,1}.

     Since cj >= 0, we can simplify this to:

           min cx s.t. Ax = b, xj in Z_{>=0}.

           (Using extra "flow" will not reduce the cost.)

     LP relaxation: min cx s.t. Ax = b, x >= 0.  This allows the
     use of fractional paths.  As we'll see shortly, these don't
     improve the optimum.

     It can be shown (exercise for you) that A (signed incidence matrix
     for a digraph) is *totally unimodular*.  This means that any
     r x r submatrix has determinant 0 or +- 1.  For example, taking
     rows s,a,t and columns e1, e3, e5 in the above gives us

                e1 e2 e3
           s  | -1  0  0  |
           a  |  1 -1  0  |  = +1.
           t  |  0  0  1  |

     Thm: Every vertex of the polytope Ax = b, x >= 0 has integral
     coordinates.
                          
         Proof: Use the simplex algorithm.  We can pick a cost function
         for which this vertex is optimal, and then we get (after running
         the algorithm), a vertex given by

            [B  N][ x_B ] = b,
                  [ x_N ]

         with x_N = 0 and B nonsingular.  Solve B x_B = b by Cramer's rule.
         The denominators will be det(B), which equals +-1.  So the
         (unique) solution to B x_B = b is integral.

     Every simple s-t path gives a vertex.

         Why? Let c_j = 1 if e_j is on the path, and M (very large)
         if x_j isn't on the path.  Then, the vector x, given by

                      1 if e_j is on the path
               x_j = 
                      0 if it isn't

         is the unique minimizer of cx.  So it is a vertex.

         [Can we describe *all* of the vertices?]

     Dual of the LP:  max yb s.t. yA <= c  (y unrestricted)

     For our example we get

         [y_s y_a y_b y_t] [ -1 -1  0  0  0 ] <= [c1 c2 c3 c4 c5]
                           [  1  0 -1 -1  0 ]
                           [  0  1  1  0 -1 ]
                           [  0  0  0  1  1 ]

     The dual constraints are associated with edges.  Considering a 
     typical column, we have

           e
        v --> w     ===>    -y_v + y_w <= c_e, i.e.
                             y_w <= y_v + c_e          (*)

     This suggests we think of y_v as the shortest distance to v, since 
     (*) says

             distance to w <= distance to v + cost(v,w).

     There is a symmetry: if y is dual-feasible, so is the vector y'
     obtained by adding a constant to each coordinate of y.  Put
     another way, dual feasibility depends only on differences between
     dual coordinates.  So let's normalize by taking y_s = 0.

     Minty's "analog" solution to the dual: Represent each edge by
     a piece of string of the appropriate length.  Connect these strings
     at the vertices, and then pull s and t away from each other until
     some path becomes tight.  

     Dijkstra's shortest-path algorithm:

        The strategy is to set dual variables, one by one, starting with s.

        We grow a tree T whose vertices are those whose distance from s
        is known.

        Initially: T = {s}, y_s = 0
                   For (s,v) in E, y_v = c(s,v) 
                   All other v get y_v = +infinity.

        We stop when T includes t.

        The algorithm maintains a list F of vertices that are one edge
        away from T, e.g.

                T:      s
                       / \
                      /   p --> v       Each v in F has a tentative
                     /     \            distance y_v, plus its parent
                    ---------           p by which this tentative distance 
                                        was last assigned

        Tree growing step:

             Choose a v in F that minimizes y_v
             Add p-->v to T
             For each e = (v,w):
                 If y_v + c_e < y_w /* violation of dual constraint */ then

                    y_w := y_v + c_e       /* "relaxation" step */
                    set w's parent to be v
                    put w in F (if not already there)

         Observe that the dual variables decrease as the algorithm proceeds.
         You can think of each dual value as giving the best distance found
         so far.

         It can be shown, using induction on |T|, that for every v in T,
         y_v is the minimum distance from s to v.  (This relies on the
         costs being non-negative.) 

Lecture 12
2/19/18 Monday

     Flows in Networks
    
     Reference: R.E. Tarjan, Data Structures and Network Algorithms,
                Chapter 8.

  Setup: Directed graph G = (V,E) with distinguished source s, sink t.
         We'll assume s != t.

         Think of each edge as a link through which some commodity can
         flow.  We give each edge a *capacity* (>= 0).

         Goal: Send as much as possible (per unit time) from s to t,
               subject to conservation laws and capacity constraints.

         Example:             4 vertices, all capacities 1
                   a
                 ^ | \
                /  |  v
               s   |   t  
                \  |  ^
                 v v /        
                   b

          We can send 1 unit along s->a->t and 1 unit along s->b->t,
          for a total of 2.  This is best possible since the capacity
          out of s equals 2.

  Formally:
  
      c(x,y) is the capacity of the edge (x,y), or 0 if no such
      edge exists.  (Note: Our graph model doesn't allow self loops
      or multiple edges.)

      A *flow* is a function f : V x V --> R  such that:

         i)    f(x,y) = - f(y,x)     [skew-symmetry]

         ii)   sum_x f(v,x) = 0, all v != s,t  [Kirchoff current law]

      Note: Skew-symmetry makes some formulas look nicer.  Many
      authors say that a flow is a function from edges to R.

      Item ii) is a mass conservation law.  To see this, split
      the sum into the contributions going into v and the contributions
      going out of v.  These must cancel, which is saying that the
      commodity is neither created nor destroyed at v.

      Sometimes an additional requirement is included:

          iii)  f(x,y) <= c(x,y)      [flow bounded by capacity]

      Defn: The value of a flow is the net flow out of the source:

           value(f) = sum_{x in V} f(s,x).

      Thm: The value also equals the net flow into the sink.

           Physically evident.  Also follows from the next theorem.

      Defn: If G is a flow network, a *cut* is a partition of V
      into S (containing s) and T (containing t).  ("Partition" means
      that V = S n T and the intersection of S and T is empty.)

           E.g. The network above has 4 cuts:

                       S       T
                       s       a,b,t
                       s,a     b,t
                       s,b     a,t
                       s,a,b   t

           In general, if there are n vertices, there are 2^{n-2} cuts.

      The *capacity* of S,T is

              sum_{x in S, y in T} c(x,y).

      Note that we only count capacity in the forward direction.

      Thm: (Flow can be measured across any cut)

           value(f) = sum{x in S, y in T} f(x,y)

           This is also physically evident.  (Draw a boundary around
           the vertices of S.)  For a calculation to prove this, see
           Tarjan p. 97.

      Note: Unlike the definition of capacity, an edge carrying flow
            in the reverse direction (from T to S) will be included,
            but with a minus sign (because of skew symmetry).

          E.g. consider


            S        .          T
                     .
               -------------->
                     .
               -------------->
                     .
               -------------->
                     .
               <--------------
                     .
               <--------------
                     .
                     .

            If each edge has 1 unit flowing toward the arrow,
            the net flow from S to T is 1+1+1-1-1 = 1.

      From the LP perspective, cuts are dual to flows (and vice versa)

      From now on we consider only flows that satisfy all of i)--iii).
      In particular, flow <= capacity along any edge.

      Weak Duality: If f is a flow, C is a cut,

          value(f) <= capacity(C).

      Proof: Imagine

            S        .          T       sign of edge flow from S to T
                     .
               -------------->           + 
               -------------->           +
               -------------->           +  
               <--------------           -
               <--------------           -
                     .
                     .

             The value of the flow includes the negative terms, and
             if these are omitted, the sum can only get larger.
             For each positive term, the edge flow that it represents
             cannot exceed the capacity of the edge.  Summing all these
             edge capacities gives the capacity of the cut.

      Strong Duality: (Max-flow / min-cut theorem) 
      In any flow network,

          max flow value = min cut capacity.

          We will prove this by means of an algorithm that finds a
          maximum flow and a minimum cut, simultaneously.

      You should observe that this shows that finding the maximum
      flow value is a finite problem, since there is a finite
      number of cuts.

Lecture 13
2/21/18 Wednesday
   
     Construction of max flows via path augmentation
     Flow deconstruction
 
     Tarjan op. cit.

     Review of flow networks (see last lecture).

     The "obvious" way to compute a maximum flow (repeatedly find
     s-t paths of positive capacity, and send flow through them)
     doesn't work.   Consider the unit-capacity network from last time:

                   a
                 ^ | \
                /  |  v
               s   |   t  
                \  |  ^
                 v v /        
                   b

      If we send 1 unit of flow along s->a->b->t, all s-t paths contain
      at least one saturated edge, so the algorithm is stuck.

      Ford and Fulkerson (1950s) realized that the algorithm could
      continue if we adjust capacities to allow for path flows to cancel.

      Defn: If f is a flow, the *reserve* capacity for (v,w) is

                 r(v,w) c(v,w) - f(v,w).

      This is a function on vertex pairs.  To see it is the "right"
      definition, consider one edge carrying flow; we'll assume there
      isn't an edge in the opposite direction.

                0 <= f <= c
               v ---------> w

          r(v,w) = c(v,w) - f(v,w)   (as we'd expect)
          r(w,v) = c(w,v) - f(w-v)
                 =    0   -( -f(v,w) )
                 = f(v,w).

      That is, we could change the flow along the edge either
      by making more of it go in the same direction, or decreasing
      it.

      If f is a flow, its residual graph R has:

           same vertices, source, sink
           edges v-->w whenever r(v,w) > 0.

      For our example flow above, the residual graph is

                   a
                 / ^ \
                v  |  v
               s   |   t  
                \  |  ^
                 v | /        
                   b

      This graph has the s-t path s->b->a->t.   If we add 1 unit
      along this path, we get a maximum flow.

      Max Flow / Min Cut Theorem:  The following are equivalent:

             1) f is a maximum flow
             2) there is no augmenting path in R
             3) value(f) = capacity of some cut S,T

      Proof: We will show each implies the next one, in cyclic order.

      1) ==> 2):  Show  ~2) ==> ~1) instead.  If R has an augmenting path,
                  flow can be increased, so f wasn't maximal.

      2) ==> 3):  Search in R from s.  This will not reach t.   Let
                  S = reachable vertices, T = V-S.  R cannot have an
                  edge from S to T, or the search would have continued.
                  So all S-T edges are used to maximum capacity, and
                  value(f) = cut capacity.  (Why is there no "backwards"
                  flow?)

      3) ==> 1): Any bigger flow would violate weak duality.

      This suggests a method to compute a maximum flow: Start with
      f = 0.  Repeatedly look for an augmenting path in R, and when
      one is found, increase flow as much as possible along that path.

         This is due to Ford and Fulkerson.

         Note similarity to the simplex algorithm: We find something
         to increase, and make it as large as we can.

      Theorem: When all capacities are integral, there is an integral
               maximum flow.  (This means all f(v,w)'s are integers.)

         Proof: Integrality of all f(v,w) is an invariant of the method.
                Each augmentation increases the flow value by >= 1, 
                so the FF method terminates with a maximum flow.

      Is FF polynomial time?  No.  Take the above "diamond" graph
      and put capacity 1 on a->b, and capacity M on all other edges.
      Alternating between the two "zigzag" s-t paths requires 
      2M = 2^{log_2 M + 1} augmentations.

      Even worse, there are examples (with irrational capacities)
      where FF fails to terminate.

         By the way, this shows that FF is not a disguised version
         of the simplex algorithm.
     
Lecture 14
2/23/18 Friday

   A strongly polynomial-time maximum flow algorithm

   Edmonds and Karp, J. ACM, 1972

   Restriction for today: The network for the max-flow problem
   does not have any antiparallel edge pairs:

             ------>
            v       w
             <------

   This can be enforced by adding extra nodes if we wish.

   Review of Ford-Fulkerson procedure (see last lecture).
   This is not deterministic (arbitrary choice of augmenting path).
   To get a good running time bound, we can be smarter about which
   path to use.

   Theorem (Dinits, Edmonds & Karp).  If we always choose a path with
   the fewest number of edges, FF runs in time poly(|V|,|E|).  (This
   is a *strongly polynomial* bound: it does not involve the capacities.)
          
   Shortest paths.

      These can be found in O(|V|+|E|) steps using breadth-first search.
      This assigns each vertex a level, e.g.

       -------------------------
       |                       |
       |                       |
       |     ---------> a ---> t
       |    /           |   /
       |-> s            |  /   level 2
            \           v /
             ---------> b 

       has s at level 0, a,b at level 1, c at level 2.

       The presence of an edge implies a relation between levels:

               v ---> w  ===>  level(w) <= level(v) + 1

       (this is just one of the dual constraints for a shortest path
        problem, in which all the edge costs are 1).

    Now we let G=G0, G1, G2, ... be the successive residual graphs
    resulting from shortest-path augmentation.  (Part of the problem
    is to show this sequence is finite!)  Let level_i(v) be the level
    of i in G_i, or +oo if v can't be reached from s in G_i.

    Lemma 1 (Propn 3 in E&K):  For all vertices v, 

               level_{i+1}(v) >= level_i(v)

    Thus, as the algorithm proceeds, vertices get further away from 
    the source.

    Proof: Do this for i=1,2,... in turn.  We use induction on the
    distance from s to v.

        Basis: v=s, whose level never changes (it's always 0).

        Induction step: Consider a shortest path to v:

             s ----> ... ----> u ----> v 

              level_{i+1}(v) = level_{i+1}(u) +1   (shortest path)
         (1)                 >= level_i(u) + 1     (induction)

           Now, we consider the edge u ----> v.

             Case 1: u--->v is in G_i.  Then

                  level_i(v) <= level_i(u) + 1    (dual constraint)

             Case 2: u--->v is not in G_i.   Then v--->u was on the 
                     ith augmenting path, and became saturated.  This
                     path was

                        s ----> ... ----> v ----> u 

                     So level_i(v) = level_i(u)-1 <= level_i(u) + 1.

           In both cases, then

         (2)  level_i(v) <= level_i(u) + 1

         Combine (1) and (2) to get the result.

    Lemma 2 (Lemma 2 in E&K):  Suppose, for the edge u-->v, we have

        u-->v . . . . . . . . . . . . .   u-->v      j>i.
        in G_i  [not in G_{i+1},...,G_j]  in G_{j+1}
            
    Then level_j(u) >= level_i(u) + 2.

    [Proofs of Lemmas 1 and 2 followed a lecture by Jeff Erickson.]

    Proof:    u-->v is on the ith augmenting path (it disappeared)
              v-->u is on the j-th augmenting path (it disappeared)

       So level_j(u) = level_j(v)+1 >= level_i(v) + 1 = level_i(u) + 2.

    Proof of the theorem: Suppose that u-->v is saturated (and thus
    removed) k >= 1 times.  Each of these events, except the last
    is followed by a reappearance of u--> v.  Since level increases by
    2 with each reapparance, 2(k-1) <= |V|-1, or

                k <= (|V|+1)/2.
    
    There are 2|E| possible edges (for each edge in G, its reversal
    might be needed, too).  At each augmentation, mark one of the edges
    that disappears (we don't care which one).  If there are more
    than |E|(|V|+1) augmentations, some edge gets marked (and therefore
    was saturated) more than (|V|+1)/2 times, by the pigeonhole principle.

    To summarize: If shortest-path augmentation is used, the number
    of augmentations is O(|E||V|), and the total work is O(|E|^2 |V|).

Lecture 15
2/26/18 Monday

   Combinatorial applications of network flow

         Bipartite matching
         Baseball elimination (K. Wayne, SIAM J. Disc. Math., 2001)

   Matchings

      G = (V,E) directed graph
      A *matching* is a set M [ E with no two edges sharing an endpoint.
      That is, "bigamy" such as

                   y
                   |
                   |
             x --- z

      is forbidden.  
      Is is *perfect* if |M| = |V|/2.  (No bigger matching can exist.)

      Perfect matchings need not exist, even if |V| is even.

          Example: a tree with only a root and leaves.

      G is *bipartite* if V can be partitioned into (disjoint) X,Y
      with all edges linking X to Y.

          Why are bipartite graphs important?  Many real-world situations
          have an asymmetry, e.g. workers being matched to jobs.  Many
          graph theory problems become easier for bipartite graphs.

      We can test if G is bipartite (and find suitable X,Y) using
      breadth-first search.  This takes time O(|V|+|E|).

          There is a "forbidden subgraph" theorem here, too: G is bipartite
          iff it has no odd cycles.  (Cycle means simple cycle: no repeated
          vertices.)

      Using network flow, we can find a maximum matching efficiently
      in any bipartite graph.

          Proof by example:   Start with

                     x1------y1
                       \ 
                        \
                         \
                          \
                           \
                     x1----- y2

          Direct all edges from X to Y, and add new source and sink:

                     x1----->y1
                    ^  \       \
                   /    \       v
                  s      \       t
                   \      \     ^
                    v      v   /
                     x1----->y2

          All capacities are 1.  Using, say, the algorithm from last
          lecture, we find a max flow that is integral:

                         1
                     x1----->y1
                    ^  \       \  1
                1  /    \       v
                  s      \ 0     t
                 1 \      \     ^
                    v      v   /  1
                     x1----->y2
                         1

           The X-Y edges carrying flow form the matching.  Note
           that bigamy is prevented by the capacity constraints
           on the "new" edges.

           This matching is maximum size: If there were a larger one,
           we could have found a flow of greater value.

           Suppose |S|=|Y|=n.  Then since value(f) <= n, at most
           n flow augmentations are needed.  Each is O(n^2), so the
           total work is O(n^3).

      I believe that versions of this technique go back to the 1930s.
      For a purely graph-theoretic presentation, see Section 5.4 of
      Bondy and Murty, Graph Theory with Applications.

   An application

      Many theorems of graph theory give sufficient conditions for
      a perfect matching.  Here is one.

      Theorem: Let G be bipartite with |X|=|Y|=n.  Suppose, for some
      d >= 1, each vertex has d neighbors.  Then G has a perfect matching.

         Proof: In the resulting flow network, there is a fractional
         flow of value n: just put flow 1/d on each X-Y edge.  Then,
         there is an integral flow of the same value, which gives
         the perfect matching.

   Baseball Elimination

      Imagine a baseball league in which teams play only within
      the league.  No ties are allowed (as in US professional baseball).
      The pennant is won by the team that wins the most games, or
      shared if there is more than one such team.

      Problem: It is late in the season.  Determine, from the record
      so far, and the remaining schedule, if a given team can win the
      pennant.

      Here is an example (n=4):

              wins wi    games left    sum       remaining schedule
               wi            gi      wi + gi

                                                 ATL PHI NYM MON
       ATL     83             8           91       -   1   6   1
       PHI     79             4           83       1   -   0   3
       NYM     78             7           85       6   0   -   1
       MON     76             5           81       1   3   1   -

       (The losses accrued have been suppressed.)

       Any team that cannot win as many games as the league leader
       (defined by number of wins) is clearly eliminated.  For example,
       Montreal cannot win.  But this criterion is not sufficient
       (consider Philadelpha).

       A network flow model.

          Suppose there are g_ij remaining games scheduled between i
          and j.  We can think of these as "wins" to be distributed
          between i and j.  This suggests a flow network, one piece
          of which is:

                                 (i)
                                ^   \
                               /     \
                           oo /       \
                             /         v
             s -------> {i,j}          t 
                 g_ij        \        ^
                           oo \      /
                               \    /
                                v  /
                                 (j)

          (The values of g_ij can be read off the schedule.) 
          We include such a subgraph for every pair of teams.)

          Consider team k.  If it can finish first at all, it can
          do so by winning every one of its remaining games.  So let

                 w = wk + gk,

          the maximum number it could end up having won.  Team k is
          still a contender if there is a future in which no other
          team wins too many games.  So we amend the network by
          adding team-to-sink capacities:


                                 (i)
                                ^   \
                               /     \  w - wi
                           oo /       \
                             /         v
             s -------> {i,j}          t
                 g_ij        \        ^
                           oo \      /  w - wj
                               \    /
                                v  /
                                 (j)

           If the flow out of i is within capacity, the number of future
           wins by i (call this xi) must satisfy

                  xi <= w - wi,

           that is,

                  wi + xi [# future wins by i] <= w .

           Theorem: Team k is viable iff the above network has
           an integral flow of value equal to the total number
           of remaining games.

           Note: It is possible to use a smaller network, from which
           team k has been eliminated.  In that case, we want the
           flow value to be the number of remaining games not involving k.

        Wayne's paper has another interesting theorem.  Suppose
        that, for two teams i and j,

            wi + gi <= wj + gj.

        Then, if j is eliminated, so is i.  This means we can determine
        all eliminated teams with O(log n) max-flow computations.
        Even faster algorithms are possible (see the paper).

Lecture 16
2/28/18 Wednesday

   Matchings and Coverings
   Ko"nig's Theorem (Strong duality in bipartite case)

   Review concepts of matching and vertex cover.  Remember that we start
   with an undirected graph G = (V,E).

   Thm (weak duality): If M is a matching and C is a vertex cover, then
                       |M| <= |C|.

        Proof: For every edge in the matching, at least one of its
               edges must be in the cover.

   This inequality can be strict: A triangle has |M| <= 1, |C| >= 2.

   Thm (constant factor approximation for vertex cover): 
       |smallest vertex cover| <= 2 * |largest matching|

       Pf: Let M* be a martest matching:

              ... o ===== o ...
              ... o ===== o ...
                     ...
              ... o ===== o ...

           Every edge is either in M* or adjacent to a vertex in it.
           (If not we could extend the matching)
           So the 2|M*| matched vertices cover all the edges.
           The minimum cover is no larger than this.

   Note: For this proof, M* need only be maximal in the sense of set
   theory (no proper superset is a matching).   This gives a simple
   "greedy" 2-approximation for the minimum vertex cover.

   2 is the best possible constant.  It is attained by a graph consisting
   of n disjoint triangles.  For a connected example, consider K_n, which
   has
          |max matching| = floor(n/2),    |min vertex cover| = n-1 .
 

   Thm (Ko"nig, 1931): If G is bipartite, 

       |largest matching| = |min cover|.

   Pf: Assume G has no perfect matching (if it does, we can just use X).
   Use a unit-cost flow network to compute a largest matching.
   Consider R, the residual graph for the resulting maximum flow.

            X:           Y:
        
            W_X          W_Y
     s                            t       (edges not shown)
            D_X          D_Y

    W_X ("wet") nodes in X are reachable in R, W_Y ("dry") nodes are not.
    Same for W_Y and D_Y.

    Claim: C = D_X u W_Y is a vertex cover.

           (How to remember this: nodes in X tend to be wet, nodes in Y
            tend to be dry.  Nodes in C are on the "wrong" side.) 

       Proof of claim: An uncovered edge of G, x --- y, would have
       x in W_X, y in D_y.  So in R,
  
                x ---> y is not possible   (y would be wet)
            Is  x <--- y possible?  No.

                 This would mean x---y is in the matching, so 
                 s --> x is saturated (in the original network)
                 Since x is wet, it got wet through some y' in W_Y.
                 So the residual graph has x  <--- y', and x is
                 bigamous:

			 x --- y'    are both matching edges
			 x --- y

     Now we check the size of C:

            X:           Y:
                        * 
            W_X         *W_Y
     s                            t       (edges not shown)
            D_X*         D_Y
               *
               *

     Each x in D_X is matched (if it weren't it would be wet, via s).
     Its partner is in D_Y (if not, x is wet).

     Each y in W_Y is matched (the search didn't reach t) and its
     partner is in W_X (where it got wet from).

     So |M| >= |D_X| + |W_Y| = |C|.  By weak duality, |M| = |C|.

     Note: this pf gives a particular vertex cover from a given
     matching.

Lecture 17
3/2/18 Friday

   Hall's Marriage Theorem
   Weighted Bipartite Matching (Assignment Problem)

   A recurring theme in graph theory is to find sufficient conditions
   for a graph to have a "nice" subgraph.  Here is one by Philip Hall,
   from 1935.

   Thm: Let G be bipartite, with |X|=|Y|.  G has a perfect matching iff
   for all A [ X, |N(A)| >= |A|.  (N(A) denotes the set of neighbors
   for the vertices in A.)

      Proof: (=>) is clear, since each a in A is matched to a neighbor.

      (<=): Suppose G has no perfect matching.  Build a flow network
            with the usual edges, but now with infinite capacity on
            each x --> y edge.  The "outer" capacities are still 1,
            to prevent bigamy.

            Find a maximum flow, and from it, an s-t cut S,T of capacity
            < n.  The picture is:

                        X:             Y:
			[ ]            [ ]   r edges     Let A = X n S
			[ ]            [ ]   from here       B = Y n S
		s       [ ]            [ ]   to t
			[ ]            [ ]
	    ---|-|---------------------------|-|-|-
	       v v      [ ]      ^     [ ]   v v v
			[ ]      |     [ ]
	      l edges   [ ] no X to Y  [ ]  
	      from s    [ ] edges      [ ]
	      to here   [ ] cross the  [ ]
			[ ] cut here   [ ]           t
 
            Observe r+l < n, so r < n-l.

            N(A) [ B, so |N(A)| <= |B| = r < n-l = |A|.

     The constructive proof of Hall's theorem identifies an "obstruction"
     to the existence of a perfect matching.

     A more general version holds (exercise for you): If |X| <= |Y|,
     there is a matching for X iff the same condition holds on subsets of X.

     Assignment Problem

       This is one of the all-time classics of combinatorial optimization.
       We have n workers, n jobs.  a_ij is the cost of assigning worker
       i to job j.  We (management?) want to minimize

             sum_sigma a_{i, sigma(i)}

       over the n! permutations of 1..n.   Without loss of generality,
       the a_ij are non-negative.

       IP version:

            min sum_ij a_ij x_ij s.t.
            sum_j x_ij = 1, all i
            sum_i x_ij = 1, all j
            x_ij in {0,1}.

       This looks really hard, but (as we will see soon), all vertices
       of the "assignment polytope" (same, but with x_ij >= 0) are
       integral.  This is the Birkhoff-von Neumann theorem.

          Alternate statement of B-vN: Any doubly stochastic matrix
          (entries >= 0, row and column sums are 1) is a convex
          combination of permutation matrices (0-1 matrices with
          exactly one 1 in each row and each column).

          This suggests there is an efficient algorithm.

   Assignment Problems with Integer Data

     Suppose a_ij >= 0, integral.  

     Key observation: Adding a (possibly negative) constant to the
     entries in any row or column doesn't change the identity of
     the optimal matching.  

     The algorithm will work with an nxn matrix, and try to decrease
     the entries (a few may go up).  All entries stay non-negative.
     At the end, there will be a matching of 0's, which gives the
     optimum solution.

     First let's translate our bipartite graph ideas into matrix language.

         Edge = position i,j with a 0
         Matching = set of non-attacking rook positions
         Vertex cover = set of lines (rows or columns)

     Consider the matrix

          3 4 7
          1 2 5
          3 1 1

     Decreasing row and column entries, we get

          3 4 7       3 4 7       0 2 5
          1 2 5  -->  0 1 4  -->  0 2 5
          2 0 0       2 0 0       2 0 0

     Think of the 0's as indicating edges in a bipartite graph, and
     find a minimal vertex cover (indicated by *'s);

                  o -- o*  <- column 1
                     /
                    /
                  o    o
                     /
                    /
       row 3 ->  *o -- o

      On the matrix, this is

          |
          0 2 5
          0 2 5
         -2-0-0-
          |


     Let delta = 1 be the smallest uncovered entry.  To create a new
     0 in the uncovered part of the matrix, we can:

	   subtract delta/2 from each uncovered row
           add delta/2 to each covered row
           do same for the columns.

     This is called a *pivot*.  For our matrix delta=2 so we get

          0 0 3
          0 0 3
          4 0 0 

     which now can have 3 non-attacking rooks at (1,1), (2,2), and (3,3).
     Going back to the original numbers, the optimum assignment cost
     is 3+2+1 = 6.

   Thm. (Kuhn?) If the a_ij are integral, the algorithm terminates.

     Proof: Show that each pivot reduces the sum of the entries.
            Suppose there are n1 horizontal, n2 vertical lines.
            Let n1' = n - n1, n2' = n - n2.  n1 n2 entries go up
            by delta, n1' n2' entries go down by delta.   Net change
            (in units of delta) is:

               n1 n2 - n1' n2' = - n [n - (n1+n2)] (algebra)
                               < 0                 (since n1+n2 < n)

            So the total number of pivots is bounded by the initial 
            sum, which is <= n^2 max{a_ij}.

    This is a *pseudopolynomial* time bound -- something that looks 
    like polynomial time, but really isn't, because it is not polynomial
    in bit length.

Lecture 18
3/5/18 Monday

   Strongly Polynomial Time Bound for Assignment Problem
   (Due to J. Munkres, 1957.)

   Review of Assignment Problem (last lecture)

   Kuhn's "Hungarian" Algorithm:

        while there's no size n rook placement (matching) on the 0's

            cover all 0's with the smallest number of lines (rows, columns)
            delta := minimum uncovered entry (> 0)
            subtract delta from uncovered entries
            add delta to covered entries

        Final rook placement gives the best assignment

   Example:  

       1  2  
       3 10 

         |  delta = 1         
         v

     --0--1-- 
       2  9 

         |  delta = 2    
         v

       |
       0  1   
       0  8  
       |

         |  delta = 1
         v

       0  0* 
       0* 8 

   Review of: Using max flow to find a largest matching
              Ko"nig's theorem
              K-cover: dry X-vertices and wet Y-vertices in residual gph

   Theorem: If the Hungarian algorithm uses the K-cover, the number
            of iterations is poly(n).

   First, translate into matrix terms:

               X-vertices ==  Rows of the matrix
               Y-vertices ==  Columns of the array

          G (bipartite) is the graph formed from the 0 locations

          On the matrix, the K-cover consists of dry rows and wet columns:

                (wet)
                | | | 
               ---------------  (dry)
               ---------------
                | | | 
                | | | 
                | | | 

           (Remember that "dry" and "wet" refers to the last search,
            in the residual graph.)

    Lemma: After each iteration, either:

               max matching is larger
           or 
               set of wet nodes becomes larger


    Let's see this in action, for our example.  Vertices put between
    braces are wet.

                      0-graph G     and    its flow ntwk

       1  2           o   o              s -->o  o--> t
       3 10           o   o                -->o  o-->

                           first flow is 0,  its residual gph is

                                        s -->[o] o--> t
                                          -->[o] o-->

                           so the cover is empty
                           pivot using delta=1 to get

       0  1           o---o              s -->o-->o--> t
       2  9           o   o                -->o   o-->

                           flow = 1 along top.   residual gph is

                                        s <-- o <--o <-- t
                                          -->[o]   o -->

                           cover is the top row; the 2 9 row is uncovered.
                           pivot using delta = 2 to get

       0  1           o---o              s -->o-->o--> t
                         /                      ^
       0  7             /                      / 
                      o   o                -->o   o-->

                           flow unchanged.   residual gph is

                                        s <--[o]<-[o]<-- t
                                                  ^
                                                 /
                                          -->[o]   o -->

                           cover is the first column.
                           pivot using delta = 1 to get

        0  0*      Has perfect matching (indicated by *'s)
        0* 7

   Proof of claim:

      Singly covered 0's are unaffected by pivoting.
      
      Consider the doubly covered 0 entries:

                (wet) 
                | | | 
               -0-------------  (dry)
               ---0-----------
                | | | 
                | | |   
                | | | 

      None can be a matching edge, since it would mean R has

                 dry      wet
                  i <----- j

      Therefore, they are all like this in R: 

                 dry      wet
                  i -----> j

      In particular, the search in R did not go through these edges.
      Pivoting will remove these but since they were inessential
      to the search, no wet node becomes dry.

      Finally, consider a new 0 entry created by pivoting:

                (wet)   j 
                | | | 
               ---------------  (dry)
               ---------------
                | | | 
         i      | | |   0   
                | | | 

                              wet     dry 
      Its edge in R must be    i ----> j

      so one more vertex becomes wet.  If the connection from j to t
      is intact in R, flow (and hence matching size) is increased.
     
   Thm.  If K-covers are used, the number of pivot steps is O(n^2).

       Proof: The matching can be enlarged <= n times, and in between,
       the number of wet nodes increases <= 2n times (since s is wet
       to start, and 2n+2 wet nodes indicates an augmenting path).
       So <= 2n^2 pivots.

   Each pivot requires O(n^2) work, and the cost to search R is
   also O(n^2), the total run time is strongly polynomial: O(n^4).

   The algorithm can be implemented to use O(n^3) steps.

   Unbalanced Matching Problems

        Let G = K_{2,3} with these weights:

                       C  D  E
                    A  3  1  4
                    B  1  5  9

        We want the best assignment for A,B.  To solve this,
        introduce a new vertex X with uniform costs in its row:

                       C  D  E
                    A  3  1  4
                    B  1  5  9
                    X 10 10 10

        Solve the resulting assignment problem, and use the
        results found for A,B.  This works because X can match
        any of A,B,C, and all choices have the same cost.

Lecture 19
3/7/18 Wednesday

   Primal-Dual Method

      This is a "master plan" for the design of combinatorial algorithms.
      It has been very influential, particularly in the area of approximation.

      The idea:  Express problem as LP
                 Reduce so a sequence of "simpler" LPs
                 The simpler LPs are often combinatorial and can be
                     solved efficiently.

      Consider

         P:  min cx s.t.            D:   max yb s.t.
             Ax = b (>=0)                yA <= c
              x >= 0                     (y unrestricted)

      Complementary slackness (we won't prove this):

         Feasible x,y are optimal iff 

                    y(Ax-b) = 0
                    (yA-c)x = 0

         (each is orthogonal to the other's residual).

       The primal-dual method works by finding better and better dual
       feasible points:

             y -> y' -> y'' -> ...

       until the best y is reached, and we can determine corresponding x.
       Failure to find the x allows us to "improve" y.

   The Basic Cycle

      Start with dual-feasible y (e.g. y=0 if c>=0).
      Plug into second comp slax equality, which is easy to solve:

           yA-c = (0...0 **...* )
                         <0 here

         0 = (yA-c)(x') <-- vector of r unknowns, anything >= 0 works
                   (0 ) <-- forced

      Which x' will also satisfy the first comp slax condition?

      Let A' = first r columns of A.  Our task reduces to solving

         Ax' = b,  x' >= 0.

      As in the first phase of the simplex algorithm, we do this by
      solving another LP:

         RP:   min sum_i w_i s.t.
               (A' I) (x') = b
                      (w )
               x', w >= 0
              
        w is a vector of r slack variables.  Think of sum_i w_i as
        measuring how far x' is from being feasible.

        RP stands for "restricted primal" (it isn't really, we have
        added some new variables).

        In many cases, RP is unweighted even if the original problem had
        weights.

      If the min = 0, we're done as (x') is primal feasible.
                                    (0 )

      Otherwise, consider the dual of the new LP, which is

         DRP:   max zb  s.t.
                z(A' I) <= (0...0 1...1)
                (z unrestricted)

         DRP is "dual of restricted primal".  The constraints can also
         be written as ZA' <= 0, z_i <= 1, all i.
              
      Let z solve this.  By strong duality, zb = sum_i w_i > 0.
      We can use z to improve our original dual.  Compare

             zA = z(A' A'') = (*...* ?...? )
                               <= 0  signs
                                     unknown
      to
             yA - c         = (0...0 *...* )
                                      < 0

      Since the last part of yA-c consists of strictly negative entries,
      there will be some lambda > 0 for which

             ( lambda z + y ) A - c  <= 0.

      Choose the largest possible lambda for which this is true, replace
      y by
              y' = lambda z + y,

      and continue.  y' is a better dual vector, since y'b > yb.

   Comment: This is a lot to swallow.  Every problem now generates
   *four* linear programs: P, D, RP, DRP.   It is useful to see
   this in action.

   Transportation Problem

      Consider a small (2x2) network, with transportation costs
      as indicated:

                0
        (1) A ------ C (2)
              \    /
             2 \  /
                \/
                /\
             3 /  \
              /    \
        (3) B ------- D (2)
                0

       The parenthesized numbers indicate supplies (for the producers
       A,B) and demands (for the consumers C,D).  Note that total supply
       equals total demand.   We can also give the data for the problem
       in a table:

                  2  2 (demands)

               1  0  2
               3  3  0
      (supplies)   ^
                   |
                transportation
                costs

       We want a shipping schedule that meets the supply and demand
       constraints, and has minimal cost.

       The assignment problem is the special case of n producers,
       n consumers, and supply and demands all equal to 1.

       Write this as an equality-form LP:

                 c      x             A      x       b

         min [0 2 3 0][x_11] s.t. [1 1 0 0][x_11] = [1]
                      [x_12]      [0 0 1 1][x_12]   [3]
                      [x_21]      [1 0 1 0][x_21]   [2]
                      [x_22]      [0 1 0 1][x_22]   [2]

         x_ij denotes the amount shipped from producer i to consumer j.

         Since c>=0, y=0 is dual feasible, and

             yA - c = -c = [0 -2 -3 0] 

         So we try to solve (yA-c)x = 0, which is

             0 = [0 -2 -3 0] [x_11] <-- free
                             [x_12] =0
                             [x_21] =0
                             [x_22] <-- free

         The restricted primal (you should write this out) is a 
         max-flow problem:

                      oo
                  A -------> C
          w1     ^  f = x_11  \ 2   w3
              1 /              \ 
               /                v
              s                  t
               \                ^
              3 \              / 2
          w2     v  f = x_22  /     w4
                  B -------> D
                      oo

            Capacities are indicated, and the wi's are slack variables
            for unused capacity.

         Solving this by inspection, we get

              x_11 = x_22 = 1, w1 = 0, w2 = 1, w3 = 1, w4 = 0.
              sum w_i = 2.

         The restricted primal's dual is:

              max z1 + 3 z2 + 2 z3 + 2 z4  s.t.

                  z1 + z3 <= 0
                  z2 + z4 <= 0
                  all zi <= 1

         with solution z = (-1 1 1 -1).   (It's optimal because
         the objective value is -1 + 3 + 2 - 2 = 2.) 

             DRP can be viewed as the selection of a cut.  Consider the
             residual graph for our flow:

				    <-------
				  A -------> C
		      z1 = -1    ^            \^    z3 = +1
				/              \\ 
			       /                v\
			      s                  t
			      ^\                ^
			       \\              /  
		      z2 = +1   \v            /     z4 = -1
				  B -------> D
				    <-------
			       
              This gives the cut S = {s,B,D} and T = {A,C,t}.
              Note that +1 values go with edges inside the cut sets,
              and -1 values go with edges crossing the cut.

              In general, if we use this rule, we have (considering only
              edges from s or to t)

                cz = sum_{e inside cut} cap(e) - sum_{e across cut} cap(e)  
                   = sum_{all e} cap(e)      - 2 * sum_{e across cut} cap(e)  

              so 
                  max cz = total edge capacity - 2 (min cut capacity)
                         = total edge capacity - 2 (max flow value)
                         = sum of unused capacities

         We then get

              zA = [0 -2  2  0]
          yA - c = [0 -2 -3  0]
   
         From the third entry, 2 lambda - 3 <= 0, so max lambda = 3/2.3

         Next y is y' = (3/2)z + 0 = [-3/2 3/2 3/2 -3/2].

         For the next iteration (you should fill in details)

               y'A - c = (0 -5 0 0)

         so
             0 = [0 -5 0 0] [x_11] <-- free
                            [x_12] =0
                            [x_21] <-- free
                            [x_22] <-- free

         Next RP is the max-flow problem

                      oo
                    f = x_11 
                  A ------> C
          w1     ^         ^ \ 2   w3
              1 /         /   \ 
               /  f =    /     v
              s   x_21  /       t
               \       /       ^
              3 \     /       / 2
          w2     v   /       /     w4
                  B ------> D
                    f = x_11 
                      oo

          with solution x_11 = 1, x_21 = 1, x_22 = 2, and
          all wi = 0.

          So we're done, and the optimal shipping schedule is

              1 from A to C
              1 from B to C
              2 from B to D

          (cost 1*0 + 1*3 + 2*0 = 3.)

Lecture 20
3/9/18 Friday

   Primal-Dual Approximation Algorithms
   D. Williamson, Math. Programming B, v. 91, 2002, pp. 447-478

   Primal-Dual Method (for exact optimization):

         Repeatedly find better and better dual solutions.
         If a dual solution is not optimal (we can't solve the
         complementary slackness equalities), we solve another linear
         program to find a "direction" along which the dual can be
         improved.

         Earliest example: Kuhn's Hungarian algorithm for the assignment
         problem.  

         Many combinatorial algorithms fit into this framework.

    Primal-Dual ideas can also be used for approximation.

    Hitting Set Problem

         E = set of n elements
         c_e (>=0) is the cost of element e
         Sets T_1,...,T_m [ E
         Find min-cost S [ E such that S n T_i is nonempty, all i.

         IP version

           A = 0-1 incidence matrix for the sets (each row represents
               a set).

           c = cost vector, b = all 1's "target" vector

           x = vector of non-negative integers.  Indicates a set S
               (e is in if x_e > 0, out if x_e = 0).  In practice
               x is a 0-1 vector, as there is no need to use
               more than one copy of any element.

			   min cx s.t. 
			   Ax >= b
			   x >= 0 (integral)

           This is a *covering* problem.  x wants to be small, but
           it has to be big enough to touch all the given sets.

           Inequalities are associated with sets: sum_{e in T_i} x_e >= 1.

        Mathematically identical to the set cover problem (use columns
        of the matrix A to represent sets).

        Hitting Set can naturally express many other problems, for example:

          1) Vertex Cover.  E = vertices of an undirected graph 
                            sets are the edges {v,w}
                            cost of each vertex is 1

             This shows that Hitting Set is NP-hard, as Vertex Cover is.

          2) Shortest s-t Path.  E = edges in a graph (undirected)
                                 sets = crossing edges for s-t cuts.
                                 If c>0, the min-cost set must be an
                                 s-t path, which hits all specified
                                 sets.  Note that there are exponentially
                                 many of these.

           (Williamson mentions many others.)

   Primal-Dual Approximations

      Relax the IP to LP (allow any x >= 0).  The dual of the LP is

             max sum_i y_i s.t.
             yA <= c
             y >= 0

             y wants to be large.  Think of "packing" y values into 
             the "container" given by yA <= c.

             We have a dual variable y_i for each set.

             Dual constraints are associated with elements:

                     (y1...yn) A_e <= c_e
                                ^
                                |
                           column listing which
                           sets element e is in

   The Strategy
                    _
       Want to find x bar (primal feasible and integral) and
       y (dual feasible) for which 
                  _
                c x <= alpha yb      (with alpha "small").

       If we can to this we will have (for the least-cost hitting set x^)
      
                                _
                yb <= c x^ <= c x <= alpha * yb

           So 
                  _
                c x <= alpha * yb <= alpha * cx^   (by weak duality).

           We say the approximation factor is <= alpha.

   Hochbaum's Algorithm (1980 -- published 1982)

       Find dual optimal y*
       S := {e : y* A_e = c_e}   (elements whose dual constraint is tight).

       Claim: S is primal feasible.  Follows from complementary slackness:

                y(Ax - b) = 0    (1)
                (yA - c)x = 0    (2)

       (These are necessary and sufficient conditions for optimality.)

       Let x* denote a primal optimal solution (might be fractional).  
       By (2)
                          tight    < 0
          (y*A - c)x* = (0 ... 0 * * ... *)( x_1* ) = 0.
                                           ( x_2* )
                                             ...
                                           ( x_2* )
                                           (  0   )
                                             ...
                                           (  0   )

        Since x* is feasible, Ax* >= 1.   Note x*_e > 0 iff e is in S.
        We can assume no component of x* exceeds 1.         _
        Round each component upward to get incidence vector x for S.

        Cost estimate:  Let's assume |T_i| <= alpha for each i.

                  sum_{e in S} c_e 
                = sum_{e in S} y* A_e
                = sum_{elts e in S} sum_{sets T_i containing e} y_i*
                = sum_{sets T_i} sum_{elts e in T_i} y_i*
                = sum_i |T_i n S| } y_i*.
                le alpha sum_i y_i*.

        For vertex cover, alpha <= 2, so we get a factor 2 approximation
        for this NP-hard problem.

   Bar-Yehuda and Even's Improvement (1981)

        Don't solve the dual LP.  Just raise individual y_i as needed.

             y = 0
             S = empty
             while S doesn't hit every T_i
                 choose k with S n T_k empty (we "missed" T_k)
                 increase y_k until the inequality for some e in T_k 
                                    becomes tight
                 S := S u {e}

        This is very fast.  Still works, since the proofs for feasibility
        (S hits every T_i) and the cost bound only used tightness of
        of the dual constraints.

        Note that we increase y by lambda z = lambda (0 ... 0 1 0 ... 0).
        This is similar to P-D optimization, but we did not try to find
        the "best" z.

Lecture 21
3/12/18 Monday

   New Section: Greedy Algorithms and Matroids

   Greed Before Edmonds
   Alan Hoffman, On Greedy Algorithms that Succeed, in Ian Anderson, ed.,
                 Surveys in Combinatorics, 1985, 97-112.

   Is it always best to make "locally advantageous" choices?

      No.  Printers are an example.  Initial outlay (cost of device)
      is insignificant compared to ongoing expense (cost of toner).

      Greedy strategies are easy to implement and often used as
      preliminary solutins.

      If we are going to use them, we need theory to guide us.

   A greedy strategy for linear programming

      Let c >= 0.  We want to solve

              max cx  s.t.  
              Ax <= b, x >= 0,    <---- call this P

      Can we order the variables x1, x2, ... and then raise each one
      to the maximum possible, without regret?

             x1^ = largest x1 s.t. for some x2,...,xn, (x1^,x2,...,xn) in P
             x2^ = largest x2 s.t. for some x3,...,xn, (x1^,x2^,...,xn) in P
             etc.
   
      [Draw picture in R^2.  Note that for this to work, the LP is pretty
       special.]

    Earth Moving Problem (Monge, 1781)

       Given two regions A,B [ R^2.  Dirt to be dug out from A, placed on B.
       (Assume constant depth.)  We want precise directions telling us
       where each load of dirt goes.

       Monge observed that an optimal directive has no crossings:

                I --------- J
                      X
                R -------------- S

       Indeed if R sent 1 load to J and I sent 1 load to S, it would have
       been cheaper (by the triangle inequality applied to triangles 
       IXJ and RXS) to use I-J and R-S.

    Monge's Inequalities

       Consider a maximizing transportation problem:

                          b1 b2 ... bn   demands
            supplies a1
                     ...        c_ij is profit gained from 1 shipment
                     am              from i to j.

       We'll assume total supply = total demand.

            LP:  max sum c_ij x_ij s.t.
                 x_ij row sum = ai,
                 x_ij column sum = bj
                 x_ij >= 0.

       As we know, choosing "most profitable" edge first need not work:

                 e.g. assignment problem

                 100 99  w/ demands = supplies = 1
                  99  1

       Defn: The array c_ij satisfies Monge's inequalities if

             i <= r, j <= s    ==>    c_ij + c_rs >= c_is + c_rj

             (Note that LHS - RHS is similar to a determinant.)

       Note that this depends on how rows and columns are ordered.
       For example, the 2x2 array above violates the one Monge inequality,
       but swapping the columns makes it hold.

    A greedy transportation algorithm

       Assume supplies and demands are integral.  If Monge holds, the
       following algorithm finds an optimal schedule:

            x_11 := mu = min(a1, b1)
            a1 := a1 - mu, b1 := b1 - mu (one of these will be 0)
            if a1 = 0, cross off row 1
            if b1 = 0, cross off column 1
            process remaining array recursively

       The base case for this is one row or one column.  Then, the 
       allocation is forced.  So we are doing induction on m+n.

       Proof.  At least one optimal schedule is integral.  (This follows 
               from the algorithm presented Wed 3/7/18, if you solve 
               max-flow "correctly").

               Any such solution with y_11 < mu can be made "more greedy".
               Indeed, there must be r,s for which the solution is

                    y_11 ... >0        (by supply/demand constraints)
                     ...    ...
                     >0  ... y_rs

               Let delta = min{ y_r1, y_s1 } (> 0).  We can do a pivot,
               and add

                    delta ...  -delta
                     ...        ...
                   -delta ...   delta

               without violating constraints.  The change in objective is

                    delta(c_11 + c_rs - c_r1 - c_rs) >= 0,

               so the new solution is still optimal.

               If the y_ij are integral (we can assume this, see
               end of lecture Wednesday 3/7/18, and make sure the
               maximum flows are all integral), the new solution will be
               too.  Therefore, with a finite number of such modifications,
               we reach the greedy value for x_11.

               We can now use induction since the new problem is strictly
               smaller.

        A Generalization

             Let (i,j) >> (r,s) mean that we consider x_ij before we
             consider x_rs.  Then, if

                (i,j) >> (i,s)   (row)
                      >> (r,j)   (column)
                      >> (r,s)   (diagonal)

             the greedy algorithm still finds an optimal solution.
           
    Another Application: Coloring Interval Graphs

        I_1, I_2, ..., I_n are closed intervals in R^1, sorted by
        left endpoint.  E.g.

             I_1                       I_3
             |-------------------|     |------------|
                   I_2
                   |------------------------|
                                          I_4
                                          |--------------|

        We want to "color" each interval so that overlapping intervals
        get different colors.  (This is basically a scheduling problem.
        Overlapping intervals can't be scheduled on the same machine.)

        We can reduce this coloring problem to an easy bipartite matching
        problem.  

        Consider any coloring.  The intervals given the same color
        must look like this:

             |------| |-------------|    |-------|    ...

        So we are just matching the end of each interval to
        the beginning of the next.  If there are m such pairings,

             # of colors = n - m.

        Conversely, if we have such a matching M, we can use it to
        make a coloring with n - |M| colors.  So

           min # colors + max matching size = n.

        To treat this formally, let B be a bipartite graph on 
        {1,...,n} x {1,...,n} with

             (i,j) in E   exactly when    i<j and I_i n I_j = empty

        (equivalently, it is consistent to give I_i and I_j the same
        color).   For the example (draw it) the edges are (1,2) and (1,4).

        We want a maximum matching in B.

        The adjacency matrix for B has monotonic rows (0's precede 1's).

             For our example,

                    0  0  1  1
                    0  0  0  0
                    0  0  0  0
                    0  0  0  0

        We can view this as the cost matrix for a (maximizing) assignment
        problem.

        Suppose we process the rows in order.  Within each row, consider
        the 1's left to right and then the 0's.  It can be shown that
        this is a Monge ordering (case analysis).

              For the example above, one "greedy" assignment is

                    0  0  1* 1
                    0* 0  0  0
                    0  0* 0  0
                    0  0  0  0*

              The locations with x_ij = 1 give the matching.  In this
              case we match I_1 to I_3 (clearly maximal).

              This gives the coloring:

                        I_1                       I_3
                (red)   |-------------------|     |------------|
                              I_2
                (white)       |------------------------|
                                                     I_4
                (blue)                               |--------------|

        In practice, it is convenient to sweep through time, and
        give each "new" interval the smallest color that is not
        currently used.  This results in an O(n log n) algorithm.

Lecture 22
3/14/18 Wednesday

  Introduction to Matroids

         J. Edmonds, Math. Prog. 1971 (work from 1967)
         Lawler, Chapter 7 (especially 7.3, 7.5)

  Kruskal's Algorithm

    The best known "greedy" algorithm was published by Kruskal in 1956.
    Suppose G is a connected undirected graph, with non-negative edge weights.
    Think of w(e) as the "profit" gained by using the edge e.

    To find a max-wt spanning tree: 

              Consider edges in weight order, largest first.
              Add an edge (to the current forest) unless it is redundant, in
                the sense of making a cycle with edges already chosen.

         Example:    
                         1
                    A *------* B
                      | \    |
                    4 |  \3  | 2
                      |   \  |
                      |    \ |
                    D *------* C
                          5

          We pick edges 5,4, reject 3, pick 2, reject 1.
          Total weight = 5+4+2 = 11.

    There can be exponentially many spanning trees (e.g. K_n has n^{n-2}
    of them).  So some "magic" is guiding us to the best tree.

    How to detect cycles?  

           One way is to use linear algebra.
           Let M = vertex-edge incidence matrix for G.  For our example

                     e1 e2 e3 e4 e5 
                A  [ 1  0  1  1  0 ] 
                B  [ 1  1  0  0  0 ] 
                C  [ 0  1  1  0  1 ]
                D  [ 0  0  0  1  1 ]

           Fact. A set S of edges is acyclic iff the corresponding set of
                 columns has rank |S| over the integers mof 2.

                  E.g.

                     e2 e5 e4 
                A  [ 0  0  0 ]
                B  [ 1  0  0 ] <--- this 
                C  [ 1  1  0 ] <--- 3x3 piece
                D  [ 0  1  1 ] <--- has odd determinant

                  vs.

                     e1 e2 e3
                A  [ 1  0  1 ]   here e1+e2 = e3, so rank is <= 2. 
                B  [ 1  1  0 ] 
                C  [ 0  1  1 ]
                D  [ 0  0  0 ]

           So we can state our problem more generally?

             Let M = 0-1 matrix with positive weights on columns.
             Among the columns sets that are linearly independent mod 2,
               choose one of maximum total weight.

    Common to both problems is the idea that the eligible subsets
    avoid redundancies.  In 1935 the topologist (and graph theorist)
    Whitney noted that this could be captured axiomatically.

    Defn.  Let E = finite set.  II = collection of subsets of E
           Both are nonempty.  (II is supposed to be "script I")
 
           (E,II) is a matroid if:

              (i) I [ J, J in II ==> I in II   (inheritance)

              (ii)  I,J in II with |I|+1 = |J|  ("exchange" axiom)
                    ==> there is an e in J-I
                        with I+e in II.

              [draw pictures]

           Subsets in II are called *independent*.

    We've seen two examples:

             E = edges in a graph
             II = acyclic

             E = columns of a matrix over some field
             II = linearly independent

    We can get geometric insight by considering an mxn matrix over R.
    Columns are just vectors in R^m

         (i) says that if a set of vectors point in "independent"
             directions, the same holds for any subset

         (ii) says that if the span of a set J has dimension 1 lower
              then the span of another set I, we can always pick a vector
              from I that points "outside" the directions available to J.

         [draw pictures]

    Defn. A maximal independent set (maximal in the sense of set theory)
          is called a *basis*.

    For the graph example, when G is connected, basis means "spanning tree."
    All spanning trees have the same number of vertices (n-1 if there are n
    vertices).  This observation generalizes.

    Thm. All bases of a matroid have the same number of elements.  (Their common
         size is called the *rank* of the matroid.)

    Proof (by contradiction):

         Let B,B' be two bases with |B| < |B'|.
         Remove elements from B', if necessary, until |B'| = |B|+1.
         By (i) B' is still independent.  By (ii) we can find e in B'
         for which B+e is independent, so B wasn't maximal.

    Counting Matroids

      Intuitively, independent sets in a matroid are not too "large".

      Illustrative example.  Let E = {1,...,n}.  Let k satisfy 0 <= k <= n.
                             II = the subsets with <= k elements.
                             This matroid has (n choose k) bases.
                             Its rank is k.

      Two extreme cases: if k=0, only the empty set is independent
                         if k=n, every subset is independent

      Suppose the elements are given positive weights.  Then the greedy
      algorithm finds a subset of maximum total weight.

          Example: Sort the elements by weight, say

                       10 7 5 4 2 1

                   Choose the top k (here k=3):

                       10 7 5 4 2 1
                       ------

                   If we chose any other k-subset, there would be a "hole"
                   where one of 10, 7, 5 wasn't used.  This could be traded
                   for the smallest element of the subset, and the result would
                   be no worse.  Continue this until we have 10,7,5.

Lecture 23
3/16/18 Friday

   Properties of GREEDY (Lawler 7.6)
   Dual matroids (Lawler 7.8)

   Review of matroid axioms, bases, matroid rank (see previous lecture)

   Construction of a basis

      To find a basis, let S = empty set
      For each e in E
          If S + e is independent, add e to S.

      (Note: For this to work, it was crucial that the empty set
      be independent, and that E be finite.)

   Finding a maximum weight basis

      Suppose each element e has a weight w(e) >= 0.

               GREEDY: Sort elements by weight (largest first)
                       B = empty set
                       While unexamined elements remain:
                             e = next such element
                             If B + e is independent, B := B + e
                             Else reject e.

      Theorem: GREEDY finds a basis B of maximum possible weight.

         Proof: Let B = {x_1, x_2, ..., x_r} at the end of the algorithm
                Let C = {y_1, y_2, ..., y_r} be a basis of greater weight
                Both are listed so that weights don't decrease

                Let i be first index for which w(y_i) > w(x_i)  (has to exist).
                Take
                        I = {x1,...,x_{i-1}}
                        J = {y1,...,y_{i-1},y_i}

                Both are independent, so for some k, there is a y_k in J-I
                that we could add to I.  But

                   w(y_k) >= w(y_i) > w(x_i)

                so GREEDY would have considered it before x_i, and accepted it.
                This contradiction proves the theorem.

      Here is an extension (exercise for you): A greedy basis is Gale-optimal.
      What this means: Let x_1,...,x_r be a basis found by GREEDY (weights
      non-increasing), and let z_1,...,z_s be any other independent set (ditto).
      Then
                  w(x_i) >= w(z_i)   for 1 <= i <= s.

      What if we want a minimum weight basis?  Replace w(e) by W - w(e), where
      W >= sum_e w(e).  Since all bases have the same size, a max-weight
      basis for the new weights is a min-weight basis for the original weights.

      GREEDY characterizes matroids, in a certain sense.

      Theorem: If M is a subset system satisfying (i) but not (ii),
               then GREEDY will fail for some choice of non-negative weights.

         An equivalent statement is: If M satisfies (i) and GREEDY always finds
         a maximum-weight independent set, then M satisfies (ii).

         Proof: Let (ii) fail for I,J.  That is, I,J are independent, |I|+1 = |J|, 
                but no element of J-I can be added to I and preserve independence.

                Let |I| = m.  Choose weights as follows:

                        Elements of I get w=1
                        Elements of J-I get w=1-epsilon
                        All other elements get w=0.

                        where 0 < epsilon < 1/(m+1).

                GREEDY picks all of I
                       rejects all of J-I
                       may pick other elements of weight 0

                GREEDY's solution has total weight m
                But
                         weight(J) >= (m+1)(1 - epsilon) 
                                    = (m+1) - (m+1)epsilon
                                    > m+1 - 1              [by choice of epsilon]
                                    = m.
                So GREEDY failed.

      [Question: Where did we use (i)?]

      We assumed nonnegative weights.  What happens if some are negative?
      Then, the maximum-weight basis found by GREEDY need not be the maximum-weight
      independent set.

           Example: Counting matroid (independent means has <= k elements)
                    Weights are 1,1,...,1 (k-1 times), -1,-1,...,-1 (n-k elements)
                    GREEDY gets a basis of total weight k-2 (one -1 must be used)
                    but the max-wt independent subset has weight k-1.

           If we want a max-weight independent set, we can modify GREEDY so
           that is rejects all elements of negative weight.

   Matroid Duality

      This is similar to linear programming.  Every matroid has a closely
      associated "dual" matroid.

	  Let M = (E,II) be a matroid.  Its *dual* is the matroid

	      M* = (E, II*),

	  where II* consists of the complements of bases of M, together
          with the subsets of these complements.

      Theorem: M* is also a matroid.  (See Lawler p. 280 for proof.)

      This allows us to solve more problems.

      Example: The Mayor's Dilemma.  Let G be a connected graph with
               positive edge weights.  Among the largest sets of edges whose
               removal leaves G connected, choose one of minimum total cost.

               The idea is that G represents a road network, with w(e) equal
               to the cost of rebuilding e.  The mayor wants to fix as many
               roads as possible, without disconnecting anyone, 
               at minimal cost.

      For this problem: E = edges of G
                        II = edge sets containing no cycle
                        M = (E,II) is the *cycle matroid* for the graph

                        J is independent in M* 
                          <==> there is a spanning tree T s.t T n J is empty
                          <==> (V,J') is connected  (J' is the complement of J.)

                        M* is called G's cocycle matroid.

      The minimizing version of GREEDY (STINGY?) will solve this.

           Example:    
                         1
                      *------*               m = 5
                      |    / | \             n = 7
                    2 |  5/  |  \ 3
                      |  /  4|   \
                      | /    |    \
                      *------*-----*
                          6     7

                                      Accept    Reject

                                           1
                                                     2
                                           3
                                           4
                                                     5
                                                     6
                                                     7

            (Note: Contrary to intuition, accepted edges are taken out of service.)

            So the city can fix 3 (= m+n-1) roads, at total cost 1+3+4 = 8,

            Observe that we also found a minimum spanning tree (with weight
                    2+5+6+7 = 20).

            Effectively, GREEDY solves two problems at once:

                     Find a max-wt basis in M
                     Find a min-wt basis in M*.

Lecture 24
3/19/18 Monday

   More Matroid Examples:

       Polygamous Bipartite Matching
       Partition Matroids
       Matching Matroids

   Review of matroid axioms (i) and (ii); See 3/14/18 lecture.
   Inheritance (i) is pretty intuitive but it is hard to get a good sense
   for (ii).  One idea is that more (larger) is better (the larger set
   has some "capability" that could be added to the smallet set).

   Review of matroid examples: vector spaces, acyclic edge sets in a graph,
   "small" subsets (counting matroids).

   1) Polygamous Bipartite Matching

           G = (V,E)      E [ X x Y
           Think of X as a set of men and Y a set of women
           E gives all the compatible pairs
           Each woman can have at most one husband, but a man
           can have multiple wives.

           Example:     A--- a              doesn't have a perfect matching,
                        B--- b              but we could "marry" A to a,
                          \                 and B to b and c.
                           \
                            \
                        C    c

           A set of edges is independent if no Y vertex commits bigamy.  That
           is, there is a forbidden subgraph

                        x -- y
                           /
                          /
                        x'
                         
           The analog of the (maximizing) assignment problem can be solved 
           by a greedy algorithm: Consider each column in turn, and choose 
           a maximum element from each column.  E.g. for n=3

                      a  b  c
                   A  3  7* 3
                   B  9* 5  5*           starred cells are chosen.
                   C  8  1  3 

           This uses n(n-1) ~ n^2 comparisons of entries.  
           On the other hand, every algorithm for this problem must
           look at all the entries: if one number was unexamined, it might
           be the largest one in its column.

   2) Partition Matroids

           Divide E into disjoint blocks B1, B2, ..., Bm.
           Block i has a capacity ci (non-negative integer).
           I is independent when |I n Bi | <= ci, all i.
           (Exercise: Check that this satisfies the matroid axioms).

           This is a generalization of the last example, for which

                 E consists of pairs (x,y) with x in X, y in Y
                 We bin the pairs by their last coordinate.
                 All capacities are 1.

           The greedy algorithm can be interpreted as throwing balls into bins:

                  |    |    |    |    |
                  |----|    |----|    |
                  |    |----|    |    | 
                  | *  |    |    |----|
                  | *  | *  |    | *  |  <-- if the next ball goes into B4,
                  ---------------------      it is rejected.
                    B1   B2   B3   B4
                c =  3    2    3    1

           For polygamous matching, this changes the order in which columns
           are processed.  For the above example, we would consider the cells
           labeled by 9,8,7,5,5,... and find
           
                      a   b   c
                   A  3   7*  3          starred cells are accepted
                   B  9*  5x  5*         cells with x got rejected
                   C  8x  1   3          we stopped as soon as each column
                                          has a match.

               bins | B | A | B |

           Note that the end result is the same, we just found it differently.

   3) Matching Matroids

           G = (V,E) undirected graph (need not be bipartite)
           I [ V is independent when there is a matching M of G that
                                     "hits" each vertex in I

           Note that this is a vertex model.  Inheritance (i) is clear, but
           the other axiom needs proof.  See Lawler p. 271.  (It relies
           on the idea of alternating paths.)

           Example:   a --- b --- c    has 4 independent sets

                        empty, a, b, c.

                      So it is the k=1 counting matroid on 3 elements.

           Here is a non-obvious application of this. 

           Unit Length Job Scheduling with Deadlines and Penalties

                 Jobs j=1,...,n   Each takes unit time
                 Deadlines dj     (positive integers)
                 Penalties pj     (non-negative reals)

                 There is only one processor, and pre-emption is not allowed
                 (once started, a job has to run to completion).

                 A set J of jobs is called *independent* if all jobs in J
                 can be completed by their deadlines.

                 This looks awful (there are |J|! orderings of
                 the jobs), but there is actually an easy test: If the deadlines
                 for the jobs in J are d1 <= d2 <= ... <= dm, then they can
                 be scheduled without penalty iff j <= dj holds for all j.
                 (Exercise for you.  The idea is that if a schedule is
                 penalty free, so is the one with all slack times removed.)

                 Let's think of pj as the profit (reward) gained by completing
                 job j on time.  Notice that we are just matching jobs to times.
                 For example, if the data is

                          jobs        A   B   C   D   E
                          deadlines   3   1   4   1   5
                          penalties   2   4   1   5   7

                 we need to solve the (maximizing) assignment problem given by

                             1  2  3  4  5   times
                   jobs   A  2  2  2  0  0 
                          B  4  0  0  0  0
                          C  1  1  1  1  0
                          D  5  0  0  0  0
                          E  7  7  7  7  7

                 We can cover all nonzero entries with column 1, and rows A, C, E.
                 By Ko"nig's theorem, the maximum number of independent jobs is 4.
                 
                 The greedy algorithm for this problem is: Consider jobs in
                 penalty order.  For each job, accept if iff there is a free
                 slot before its deadline, and put it in the last available such
                 position.  Schedule all tardy jobs at the end (in any order).

                 For the example, we'd consider jobs E, D, B, A, C and get

                    t = 0    1    2    3    4    5
                        |----|----|----|----|----|
                           D         A    C    E
                           B is rejected and will go last

                 We can compress this to

                    t = 0    1    2    3    4    5
                        |----|----|----|----|----|
                           D    A    C    E    B     (total penalty 4)
   
Lecture 25
3/21/18 Wednesday

  Matroid Intersection

  R.E. Bixby and W.H. Cunningham, Matroid Optimization and Algorithms,
  in Handbook of Combinatorics, ed. R.L. Graham et al., Elsevier, 1975.

  Review of matroids and their use in "greedy" optimization.

  Using Two Matroids

     Many problems have constraints that can usefully be thought of
     as independence in a matroid.  Sometimes, more than one matroid
     is required.

     Example 1: G = (V,E) bipartite graph with E [ X x Y.
                M [ E is a matching when:

                  No pair shares a left endpoint x (independence in M1)

                  No pair shares a right endpoint y (independence in M2)

                It is easy to find maximum independent sets in each
                of M1, M2.  The trick is to find a largest set that is
                simultaneously independent in both of them.

     Example 2: Directed spanning trees.

                Let G be a directed graph, in which a distinguished
                "root" vertex r has in-degree 0.  We seek a spanning tree
                with root r, in which each root-to-leaf path is
                directed.  (Edmonds has called this a "branchocracy.")

                E.g. if G is             r
                                        / \
                                       /   \
                                      v     v
                                      a --> b
                                        <-- 

                     one possible tree is r --> a --> b

                Thinking of this tree as a set of edges, there are two
                requirements.

                M1: In the underlying undirected graph, the edge set
                    is acyclic.  (Cycle matroid.)

                M2: No two edges point to the same vertex.  (Partition
                    matroid, in which all "bins" have capacity 1.)
                
                Although this problem can be solved by breadth-first search,
                the weighted version cannot, and so it is useful to
                have this description.

      Abstractly:  Given matroids M1 and M2 with common universe E,
                   choose, from among the sets that are independent
                   in both M1 and M2, one of largest size.

                   Note that the empty set qualifies, so this problem
                   does have a solution.
     
  Additional Matroid Concepts

     Rank.   Suppose that (E,II) is a matroid.  If S [ E, its *rank*
             (written r(S)) is the size of a largest independent set
             within S.  (Exercise: Check, using the axioms (i) and (ii),
             that all such subsets have the same size.)

             This agrees with the rank we defined earlier.  The rank
             of the matroid is just r(E).
    
     Circuits.  In a matroid, a *circuit* is a minimal dependent set.
                These can have different sizes, e.g. the cycle matroid
                for
                      *        *-----*
                     / \       |     |
                    /   \      |     |
                   *-----*-----*-----*

                has circuits of size 3 and size 4.

                Circuits can be found by taking a dependent set and
                removing elements until it becomes independent.

                If J is independent, and J+x is not, there is a unique
                circuit within J+x.  We call this C(J,x).  The elements
                of C(J,x) are "equivalent" in the following sense: if
                y is in C(J,x), then J+x-y is independent.  (Proof:
                C-y is independent, and |C-y| <= |J|.  If they are not
                the same size, add "new" elements from J, not in C-y,
                until the resulting independent set J' has the same size
                as J.  This J' must be J+x-y.)
                
     Independence oracle.  In discussing abstract matroid algorithms, we
                           assume we have a test to recognize independent
                           sets.  For the examples we've seen, such a test
                           can be done in polynomial time.

  Duality Results

     M1, M2 are two matroids on E, with rank functions r1 and r2.
     If J ranges over sets that are independent in M1 and in M2,
     and S ranges over subsets of E, we have

           |J| <= r1(S) + r2(S')       (Weak Duality)
     and

        max{ |J| } =  min{ r1(S) + r2(S') }  (Strong Duality)

     Here S' = E-S denotes the complement of S.

     See Bixby and Cunningham, p. 9 for a proof.

  Finding a Largest Common Independent Set

     The first poly-time algorithm for this problem was given by Lawler.

     Let M1, M2 be two matroids on E.  We seek a largest J [ E that
     belongs to both II1 and II2.  The algorithm finds J by a sequence
     of shortest-path computations, each of which either:

         Replaces J by a larger set
     or
         Constructs an S giving a "strong duality" proof that J is optimal.

     Initially J is empty.

     Given J in II1 n II2, we construct a digraph D as follows:

           Vertices: E u {s,t}.

           Edges:
                   e --> t for e in J', J+e in II1
                   s --> e for e in J', J+e in II2
                   e --> f for e in J', f in J with f in C1(J,e)
                   f --> e for e in J', f in J with f in C2(J,e)

     Then:

       If s, e_1, f_1, ..., e_k, f_k, e_{k+1}, t is a shortest s-t path
       in D, the larger set J - f's + e's is in II1 n II2.

       If no such path exists, the set S of reachable vertices from E
       satisfies |J| = r1(S) + r2(S') (and so J is optimal).

     This gives a polynomial time algorithm: If |E| = n, there are <= n^2
     edges in D; also |J| can increase at most n times.

     This will not be proved (you can look at Bixby and Cunningham, p. 10),
     but examining some simple cases makes the first claim plausible.

     1) D has s --> e --> t.   The first edge tells us that e can be added
        in M2, and the second tells us the same for M2.  Therefore it is
        OK to replace J by J+e.

     2) We have                 s
                                |
                                |        <-- e can be added in M2
                                v
                             ...e...
                             .  |  .
                             .  |  .     <-- but makes ckt C1(J,e) in M1
                             .  v  .
                             ...f...
                             .  |  .
                             .  |  .     <-- ckt C2(J,e') in M2
                             .  v  .
                             ...e'..
                                |
                                |        <-- we can add e' in M1
                                v
                                t

         with e != e'.

         The element f is common to both circuits.

         In M1, J+e-f and J+e' are independent
                |J+e-f| < |J+e'|
                So we can add a "new" element from J+e'.
                This can't be f, because J+e-f+f = J+e is dependent
                So it must be e.
                Therefore, J+e-f+e' is independent.

         Similar argument applies to M2.

    Here is a small example.  We want a maximum matching in

		 X:   Y:

		 A -- a
		   \\
		    \\
		 B -- b

        M1 is the matroid for no bigamy in X, M2 similarly for Y

        Suppose we start with J = {Ab}, which is indicated with a 
        doubled edge.

        Then D is
                                s
                                |          Aa and Ab indep in M2
                                v
		         e:   A -- a  
                                |          C1(J,e) is Aa, Ab
                                v
                         f:   A == b
                                |          C2(J,e) is Ab, Bb
                                v
		         e':  B -- b
                                |          Ab, Bb indep in M1
                                v
                                t

        So we can replace Ab by {Aa, Bb}.  This is optimal.

        Next step: The only element not in J is Ab.  This can't be
                   added in M1 (A becomes bigamous) or in M2 (b does).
                   So D has no connection from s to anywhere.
                   This makes S empty, S' = E.
                   r1(S) = 0  (the rank of an empty set is always 0)
                   r2(S') = 2 (use the greedy algorithm from 1) of 3/19/18)
                   So

                       |J| = r1(S) + r2(S')

                   and we're done.

  Weighted Matroid Intersection

     Suppose each element e in E is given a weight w(e) (>= 0).
     Among the largest sets that are independent in both matroid,
     we seek one of maximum total weight.

     This is similar to the assignment problem.  It can be solved
     in polyomial time, by repeatedly searching in a weighted version
     of the digraph D defined above.  As the algorithm proceeds, the
     weights are updated.   See Bixby and Cunnningham, pp. 15-16.

     In particular, this gives a polynomial time algorithm for finding
     a directed spanning tree of maximum total weight.

Lecture 26
3/23/18 Friday

   Greedoids

      Korte and Lovasz, SIAM J. ALgebraic and Discrete Methods, 1984

      Bjorner and Ziegler, Introduction to Greedoids, in: Matroid Applications,
      ed. Neil White, 19XX.

   Greedoids were introduced by Korte and Lovasz in 1981.  As motivation, consider
   two questions: 1) What happens if we relax the requirements for being a 
   matroid? 2) Are there interesting greedy algorithms that are not matroid-based?

   The matroid axioms can be stated as follows.  Let E be a finite set
   and II a set sytem (collection of subsets).  Then (E,II) is a matroid if:

        (o)  The empty set is a member of II
        (i)  Full inheritance: X [ Y in II ==> X in II
        (ii) Exchange: X,Y in II with |X|>|Y| ==> there is x in X-Y
                                                  with X+x in II.

   (Previously, we required (ii) only when |X|=|Y|+1.  In the presence of (i),
    however, the stronger version of (ii) holds.)

   How might we relax these requirements?

   Defn. (E,FF) is a *greedoid* if axioms (o) and (ii) hold.
         The sets in FF are called *feasible*.

   Thm. If (E,FF) is a greedoid, then

        (iii) Deconstruction: If X in FF is nonempty, then there is an
              x in X such that X-x is in FF.

   The idea here is that not every subset needs to be feasible, but there
   is always one element whose removal preserves feasibility.

   To prove the theorem, build up.  Start with the empty set (a proper subset
   of X, and use (ii) to successively add elements.  Stop when you have a set
   that is one element shy of being X.

   Note: K+L remark that we could use (o), (iii), and (ii) for |X| = |Y|+1
   as axioms.  This is perhaps more satisfying because the axioms refer to
   incremental modification of feasible sets.

   There are many examples of greedoids.  Here are a few.

      1) Any matroid.

      2) Connected graph with distinguished root r.  E = {edges}, FF is
         the empty set plus the edge sets of trees that touch r.  This
         is called the undirected branching greedoid.

      3) The directed version of 2), in which edges point away from r.

      4) Bipartite graph with E [ V x U, where we have an ordering
         u1,u2,...,u_n of U.  The feasible sets are the A [ V
         that can be matched to u1,...,u|A| in U.

         Edmonds calls this the medieval marriage greedoid.  In The Taming
         of the Shrew, Act 1, Scene 1, Baptista says:

            [I resolve]
            not to bestow my youngest daughter
            before I have a husband for the elder

   Before discussing optimization, we need a few more concepts.

       a) If (E,FF) is a subset system, its *hereditary closure* is
          (E,FF'), where FF' includes every subset of a set in FF.

       b) The *rank* of a set is defined as it is for matroids.
          r(A) is the size of a largest feasible subset of A.

       c) A *basis* is a maximal feasible subset.  All bases have the
          same size.  The same proof we used for matroids works here.

       d) A subset A of E is *closed* if

          {x in E : r(A+x) = r(x)} = A.

          Note that the set above always contains A.  So the idea is that
          anything we try to add from outside A will increase the rank.

          For example, suppose A is a finite set of vectors in R^m.  Then
          A is closed (in the linear-algebra matroid) if all vectors in
          E-A are outside the span of A.  (Note that this is different from being
          topologically closed.  Indeed E has the discrete topology, so
          all subsets of A are closed, in the usual sense.)

   Now we give each element e in E a weight w(e).   GREEDY is

                 start with S = empty set
                 while S is not a basis:
                    choose x to maximize w(x) for x in E-S and S+x in FF.
                    (if there is no such x we can stop.)
                    add x to S.

   To use GREEDY, we need a way to test if a set is feasible.
   This can be done easily in the cases noted above.  

   When will GREEDY work?  Here is one result.

   Theorem (in K+L 1984): Let (E,FF) be a greedoid with hereditary closure
                          (E,FF').  GREEDY finds a max-weight basis for every
                          weight function iff:

                          1) (E,FF) is a matroid, and 

                          2) A closed set in (E,FF) remains so in (E,FF').

   Let's apply this to an undirected branching greedoid of G=(V,E).
   We assume G is connected and denote the root by r.

     Ad 1): Any subset of E that is acyclic (in the usual sense) can be
            extended to a spanning tree.  So (E,FF') is just the cycle
            matroid of G.

     Ad 2): The nonempty closed subsets in (E,FF) are just the trees T
            connecting to r (that is, their edge sets), augmented with
            any other edges both of whose ends are in T.  Such a set is
            closed in (E,FF').

    In this case, GREEDY is Prim's algorithm (1957):

                 R = {r} (set of vertices)
                 S = empty (set of edges)
                 while S is not a basis (i.e. has < |V|-1 edges)
                    choose a max-wt edge e = {v,w} with v in R, w not in R.
                    add w to R
                    add e to S

    (Historical note: Prim actually rediscovered the algorithm published
    in Czech by Jarnik in 1930.)

    In their paper, K+L also give a sufficient condition for GREEDY to work
    on a given weight function.  

    Finally, we note that there is a "greedoid" interpretation for Dijkstra's
    single-source shortest path algorithm.  For this problem, the weight
    function involves a sum over paths, so we use a version of GREEDY that
    considers entire sets, not just individual elements.
 
Lecture 27 
4/2/18 Monday

   Randomization (this is a new topic)

      Long used for entertainment, divination, ...
      1940s: Use by Erdos to prove combinatorial theorems   
      1970s: Randomized algorithms popularized by Rabin
      Now a standard part of discrete math and algorithm design

   Today's applications

      Lower bound for Ramsey numbers (from N. Alon and J. Spencer, The
                                      Probabilistic Method)
      Probabilistic identity testing (J. Schwartz, J. ACM 1980)

   Ramsey Theory

      Started by Frank Ramsey (UK) in the 1930s.  Basic theme is that
      in any sufficiently large object, structure is unavoidable.

      Ramsey's Theorem: Fix k>1. There is an n such that any graph   
                        with >=n vertices has either a k-clique or
                        a k-anticlique (= independent set with k vertices).

      We will not prove this here.  (A proof can be found in M. Hall, 
      Combinatorial Theory, p. 57.)

      Note that the forced appearance of a clique or an anticlique 
      "switches on" and then stays on forever.  So we define R(k) 
      to be the minimum n for which the assertion of the theorem is true.

      Theorem. R(k) >= 2^{k/2 - 1} if k >= 2.

      Proof: Start with n vertices.  For each possible edge, flip a fair coin
             to determine if that edge is in the graph or not.  These
             coin flips are independent.

          Pr[ there exists a k-subset S that is a clique or anticlique ]

          <= sum_{S of size k} Pr[S is a clique or an anticlique]  (union bound)

           = (n choose k) * 2 (1/2)^(n choose k)  (note factor 2)

           <= (n^k / 2) * 2 (1/2)^{n choose k}

            = n^k / 2^(k choose 2).

              If n<2^{k/2 - 1}, n^k < 2^{k(k-2)/2} < 2^{k choose 2} and
              this makes the probability above < 1.

              Therefore, there some point in the sample space (= way to put
              edges in the graph) that avoids both k-cliques and k-anticliques.
              The assertion follows.

       How good is this bound?  Erdos and Sekeres (1935) showed that

              R(k) <= (2k - 2 choose k-1) = O(4^k / sqrt k),

       so the sequence R(k) grows exponentially.  Very few 
       values of R(k) are explicitly known  As of 2007, the proved values were 

                  k   0 1 2 3 4
                 R(k) 0 1 2 6 18

       (see sequence A120414 in OEIS).

       Note that this could have been done by counting, but the theory
       of probability gives us tools and intuition.

   Polynomial Identity Testing

       Suppose we find an efficient way to evaluate a complicated expression,
       such as

            | 1 x1 x1^2 ... x1^{n-1} |
            | 1 x2 x2^2 ... x2^{n-1} |
            | 1 x3 x3^3 ... x3^{n-1} | =  product{i<j} (xi - xj)
            |           ...          |
            | 1 xn xn^2 ... xn^{n-1} |

       How might we check it?

       1) Brute force.  Expand and simplify both sides and see if they are
          the same.  Tedious and error prone.  Here the determinant has
          n! terms in it.

       2) Look for a clever proof.  Since R[x1,...,xn] (polynomials in n variables
          with real coefficients) has unique factorization, we can observe that

           LHS vanishes if xi-xj,   so (xi - xj) divides the right hand side

          Both sides have degree (n choose 2), so

                determinant = const * product

          The constant can be figured out by considering one term, such
          as the "diagonal" product 1 x2 x3^2 ... xn^{n-1}.  (As an exercise
          you should do this.)

          This only solves one problem, and requires ingenuity and specialized
          knowledge.

       3) Plug in random values.  For example, when n=1 and x1=0, x2=1, x3 = -1,
          both sides evaluate to 2.  

          This is very easy, but how reliable is it?

       Intuitively, we expect this to be pretty good, since the zero set of
       a polynomial that doesn't vanish identically has measure 0. 
       (If you like analysis, this is a good exercise to try.)  This means that
       if we choose a "box" in R^n, and throw a dart at it, the probability 
       of landing on a zero of f equals 0.  

       In 1980, Schwartz observed that if f in R[x1,...,xn] is not identically 0,
       and we choose the ai's at random from a large enough set, then
       Pr[ f(a1,...,an) = 0 ] <= 1/2.

       How large is enough?  When the degree of f is d, any set with 2d or more
       elements will do.  This follows from the result below.

       Theorem.  Assume f in R[x1,...,xn] has degree d and is not identically 0
                 Let S be a subset of R.  If we choose a1,...,an independently
                 from the uniform distribution on S, the probability that
                 f(a1,...,an) = 0 is <= d/|S|.

       Put another way, |Z(f) n S^n| <= d |S|^(n-1).

       Proof: Use induction on n.

          When n=1, |Z(f)| <= d by the remainder theorem.

          Consider n>1.  Then we can order the variables so that

            f = x1^d1 f1(x2,...,xn) + [terms of degree < d1 in x1]

          So

          |Z(f) n S^n| <= d1 |S|^(n-1)      +   |S| * (d - d1)|S|^(n-2)

                           ^                              ^
                           |                              |

                     if f1 != 0, there            by induction, the number of
                     are <= d1 ways to pick       ways to make f1 vanish is
                     a1                           bounded by this.  For each,
                                                  there are |S| ways to pick a1.

                         = (d1 + d - d1)|S|^(n-1)

                         = d |S|^(n-1).

Lecture 28 
4/4/18 Wednesday

   Continuing with randomized algorithms

           Finite fields
           Computing square roots mod p

   Today's problem nicely illustrates the connection between randomized
   algorithms and algebraic structure.  We have already seen an example
   of this: the Schwartz algorithm for identity testing relies on
   the idea that a degree d polynomial f(X) in R[X] can have at most
   d zeroes.

   Finite Fields

      Defn.  A *field* is a system (F,+,*,0,1) that obeys the laws
             of high school algebra.  More precisely:

                + and * are commutative and associative
                + and * obey the usual distributive law
                0 is the identity for +
                1 is the identity for *
                for each x there is a y making x+y=0
                for each x != 0 there is a y with x*y = 1.

             We also require that F have at least two elements.
             (As an exercise, show that 0 != 1.)

      Familiar examples include Q (rational numbers), R (real numbers),
      C (complex numbers).

      Let Z_n = {0,1,2,...,n-1} with addition and multiplication taken
      mod n.  (This means that we do the operation, and if it overflows,
      we take the remainder after division by n.)

      Fact. Z_n is a field iff n is prime.

          Proof sketch:  If n = n1 n2 with 1 < n1,n2 < n, we have
                         n1*n2 == 0 (mod n) but n1 != 0.

                         If n is prime, let 1 <= a <= n-1.  Then
                         we can use Euclid's algorithm to solve

                                  ax + ny = 1

                         in integers, because a is relatively prime to n.
                         (This is proved in every book on elementary 
                         number theory.)  Then ax == 1 (mod n).


      Field Extensions.

        Suppose that F is a field, and f(X) = X^2 + bX + c is 
        irreducible over F.  This means that you can't express it
        as a product of two degree 1 polys with coefficients in F.
        Then, we can make a bigger field F' as follows:

             The elements of F' are polys of degree <= 1, say uX+v w/ u,v in F
             Addition: just add corresponding coefficients
             Multiplication: multiply the polys as usual, then subtract
                             the multiple of f that cancels the degree 2
                             term.
             The reason that F' is a field is similar to the case of 
             integers mod p.  (In a more thorough treatment, this would
             be proved.)

        Example: F = R, f = X^2 + 1.  (This has no real zeroes.)

                 (X+2) + (3X+4) = 4X + 6

                 (X+2) * (3X+4) = 3X^2 + 10X + 8
                                -3(X^2       + 1)

                            -->          10x + 5

                 The new field F' is just the field of complex numbers,
                 since X = sqrt(-1) in F'.

                 Can we continue and get even bigger fields F'', F''', ... ?
                 No.  According to the fundamental theorem of algebra,
                 every polynomial with complex coefficients has a complex
                 zero.  So the only irreducibles are linear.

        If F = Z_p, |F'| = p^2, because we can choose the coefficients
        u and v in p ways, independently.

        Note: The same process can be done using irreducibles of degree d.
              We only need the case d=2 today, though.
      
   Computing Square Roots mod p

      From now on, p denotes an odd prime.

      How might we solve the quadratic equation x^2 = a in Z_p?

      To get started, consider what squaring does for small p.  E.g.

            p=11

            x = 1* 2  3* 4* 5* 6  7  8  9* 10
           x^2= 1  4  9  5  3  3  5  9  4   1

            The squares (marked with a *) seem to be distributed irregularly.

            We could try brute force: try all x < p/2.  This is exponential
            in the length of p.
                
      
      Cipolla's algorithm (1903)

            To solve x^2 == a (mod p)
            Choose t from Zp
            Form f = Z^2 - tZ + a (we hope it is irreducible)
            Compute x = Z^{(p+1)/2} mod f
            Return x as a square root of a

         Nowadays we imagine that t is chosen at random.

         Note that x can be computed efficiently using repeated squaring.
         So the number of Z_p operations required is O(log p).

         Why does it work?

            Suppose f is irreducible.  Then we are working in the 
            extension field F' (which has p^2 elements).

            In F', Z is a zero of f (plug it in).

                   Z^p is also one.  This takes a little argument.

                      f(Z^p) = Z^(2p) - t Z^p + a 
                             = Z^(2p) - t^p Z^p + a^p 
                             = ( Z^2 - t Z + a )^p 
                             = 0.

                      The calculation above relies on two facts.

                      In Z_p, a^p = a (Fermat's little theorem)
                      In F', (r+s)^p = r^p + s^p (Freshman's binomial theorem)

                      Both are consequences of the following reault:
                      If 1 <= i <= p-1, then p divides (p choose i).

                   Also, Z != Z^p.  (The equation x^p - x = 0 has
                   only p roots, and they are all in F.)
                  
            All of this implies that

               (Z^{(p+1)/2})2 = Z^p * Z = a  

            since the constant term of an equation is the product of its
            roots.

            If a is a nonzero element of (Z_p)^2, then Z^{(p+1)/2} is
            in Z_p, since there are only two solutions to x^2 = a, and
            both are in Z_p.

         How likely is f to be irreducible?

              Thm. If t is chosen from the uniform distribution on Z_p,
                   Pr[f is irreducible] = (p-1)/2p) ~ 1/2.

              Proof.  We count the reducible monic polynomials

                          (Z - alpha)(Z - beta)

                      with alpha, beta in Z_p and alpha*beta = a.

                      For two choices of alpha, alpha^2 = a, and 
                      we get the polynomials (Z - alpha)^2.

                      There are p-3 ways to pick the remaining alpha,
                      and then beta must equal a*alpha^{-1}.
                      This gives (p-3)/2 polynomials, since alpha,beta
                      and beta,alpha give the same polynomial

                      So # of reducible polys is (p-3)/2 + 2 = (p+1)/2.

                         # of irreducibles is p - (p+1)/2 = (p-1)/2.

Lecture 29 
4/6/18 Friday

   Detecting Perfect Matchings

        Mulmuley, Vazirani, Vazirani, Combinatorica, 1987

   G = (V,E) undirected graph

     When G is bipartite, we can use network flow techniques to find
     a maximum matching in G.  What can we say in general?

   Let PM = {G : G has a perfect matching}

     PM belongs to NP (just exhibit a matching)

     PM' = {G : G has no perfect matching} also belongs to NP (!)

         This is a consequence of a theorem proved by Tutte (1947).
         For X [ V let q(X) = # of connected components in G-V having
         an odd number of vertices.

         Example: suppose G is a triangle.  Then

                   q(triangle) = 0
                   q(*--*) = 1
                   q(point) = 0
                   q(empty set) = 1

         Tutte's theorem says that G has a perfect matching iff for
         every X [ V, q(X) <= |X|.  Note that this is similar to Hall's
         marriage theorem (proved 3/2/18).
      
         So a certificate for PM' is a set X violating the inequality. 

     In a famous paper (Paths, Trees, and Flowers, 1965), Edmonds showed
     that PM belongs to P.

     We'll present a randomized algorithm for testing if G has a perfect
     matching.

     Review of Schwartz's theorem (see lecture 4/4/18).  We proved
     this for real coefficients, but the result is actually true
     for polynomials over any field.

   The Tutte Matrix

     Let V = {1,...,n}.  For each pair i<j, choose a variable x_ij>
     Let
                         0,       if {i,j} not in E
                 t_ij =  x_ij,    if {i,j} is in E and i<j
                        -x_ij     if {i,j} is in E and i>j

      The Tutte polynomial is det(t_ij)

      Example:   Suppose G is

                       1 ---- 2
                       |      |
                       |      |
                       3 ---- 4

                 Write x for x_12, y for x_13, z for z_24, w for x_34

                 The matrix is

                      [ 0 x  y  0]
                      [-x 0  0  z]
                      [-y 0  0  w]
                      [ 0 -z -w 0]

                 Note that this is the adjacency matrix of G but with
                 the 1's replaced by +- variables.

                 To compute the determinant

                              | 0 x  y  0|
                              |-x 0  0  z|
                              |-y 0  0  w|
                              | 0 -z -w 0|
        
                                   | (shift rows up)
                                   v
        
                              |-x 0  0  z|
                       =    - |-y 0  0  w|
                              | 0 -z -w 0|
                              | 0 x  y  0|
        
                                   | (shift columns right)
                                   v
        
                              | z -x 0  0|
                       =      | w -y 0  0|
                              | 0 0 -z -w|
                              | 0  0 x  y|
        
                       = |z -x| * |z -w|  = (xw - yz)^2.
                         |w -y|   |x  y|
        
      This always happens.  Any skew-symmetric determinant is the square
      of a quantity called the *Pfaffian*.   For a proof of this, see
      N. Jacobson, Basic Algebra I, section 6.2.  Note that if n is even,
      det(t_ij) = P^2, and P has degree n/2.

          *--*  
          |  |
      So  *--*  belongs to PM, since its Tutte poly is != 0.

      On the other hand, a triangle has Tutte polynomial

                      | 0  x  y |
                      |-x  0  z |  = 0.
                      |-y -z  0 |

      It can be shown that G is in PM <==> G's Tutte polynomial is != 0.

   Lovasz Randomized Perfect Matching Test (FCT 1979)

      Let |V| = n, n even
      Choose a prime p >= n (this is >= 2 * degree(P)).
      Let S = {0,1,...,n-1} [ Z_p
      Choose a_ij i.i.d. from the uniform distribution on S
      Evaluate f(a_ij's) mod p, where f is G's Tutte polynomial
      If this is   0, say G has no perfect matching
      If this is !=0, say G has a  perfect matching

      This has one-sided error:

                 If G is in     PM, Pr[error] <= 1/2
                 If G is not in PM, Pr[error]  = 0.

      We can evaluate f using O(n^3) arithmetic operations in Z_p.

      How large is p?

          There is always a prime between n and 2n.  This is Bertrand's 
          postulate (proved by Chebyshev).

          We can use the Sieve of Eratosthenes to find p.  The idea is
          to cross off all larger multiples of each prime in turn.
          So when n=4, we'd have

               2 3 4 5 6 7 8  (so 2 is prime)
               2 3   5   7    (so 3 is prime)
               2 3   5   7    (5 is prime, we stop since 5 >= n)

          Although better bounds are possible, the work for this 
          can be estimated as O(n^2), since we make <= n passes on 
          an array of length <= 2n.

Lecture 30 
4/9/18 Monday

   Review of Lovasz randomized perfect matching tester (see last lecture)

   Suppose we know (say, by running the Lovasz algorithm) that G has
   a perfect matching.  How might we construct one?

   Self Reduction

       This is a very general technique, applicable to a lot of problems.

       Assume G has a perfect matching
       Choose a vertex v (it must be matched)
         For each w adjacent to v:
             If G' = G-v-w has a perfect matching,    (*) 
               recursively construct a perfect matching M' for G'
               return M = {v,w} + M'
             Else go on to the next w

       Since the matching test is randomized, this might go through all
           w without making a recursive call on a smaller graph. 
           In this case, we can just try again.

       The probability that we are wrong on all edges linked to v is

          <= 1/2^{# of partners for v} <= 1/2   (since v has at least 1 partner)

       So E[# of time through the loop] <= 2.

       By linearity of expectation, the expected number of times we use
       the Lovasz graph tester is bounded by

           2n + 2(n-2) + 2(n-4) + ... ~ 2 n^2 / 4 = n^2 / 2.

       So the *expected* run time to construct a matching is polynomial.

       Under the promise that G has a PM, this is a Las Vegas algorithm:
       always right, fast in expectation.  Algorithms like the matching
       tester are Monte Carlo: always fast and probably right.
  
   Directly Constructing a Matching

       How might we get more information out of the Tutte polynomial?
       One idea is to use arithmetic.  (Fields like Z_p only have trivial
       arithmetic: if x,y are nonzero, x divides y and vice versa.)
       Let's take our evaluation points from Z, and evaluate T(a_ij)
       as an integer. 

       We will use powers of 2 as our evaluation points.

       Consider the example from last time:

                          x = 2^3
                        * ------- *
                        |         |
               y = 2^1  |         | z = 2^4
                        |         |
                        * ------- *
                          w = 2^1

         For the indicated evaluation points, 

            T = (xw - yz)^2 = (2^4 - 2^5)^2 = 2^8.

            The term with the smallest exponent of 2 comes from the
            matching of smallest multiplicative weight.
            (The multiplicative weight is the product of all the weights
            of edges participating in the matching.)
            This matching is

                               *---*

                               *---*
             
         When the exponents of the 2-powers are spread out enough, our
         intuition is that only one matching has the minimum exponent.
         MVV recommend taking exponents from 1..2m, when G has m edges.
         (We'll justify this choice later).  Assume this works, and
         that only one matching has the minimum exponent, say w.

         Then, if w_ij is the exponent for the edge {i,j},

             det(T(2^{w_ij}) = 2^2w * [odd number] = 4^2 * [odd number].

         By Cramer's rule, there is an integer matrix B for ahich

                B T(2^{w_ij}) = 4^w * E,

         where E = diag(epsilon1,...,epsilonn) is a diagonal matrix
         with odd entries.

         For the example, w=4 and we have

          [ 0  1  -8  0 ]   [ 0  8  2  0 ]   [ 24          ]
          [ 0  1  -8  0 ] * [-8  0  0 16 ] = [    24       ]
          [-3  0   0 -3 ]   [-2  0  0  2 ]   [       24    ]
          [ 0  1   4  0 ]   [ 0 16 -2  0 ]   [          24 ]

               B * 2^{-5}      T(2^{w_ij})       2^3 * 3.

         MVV prove that
                                                   B_ij  * 2^{w_ij}
            {i,j} is in the min-wt matching <=>   ----------------- is odd.
                                                         4

            The expression on the right has entries from a Hadamard
            product:  A (*) B = C, where C_ij = A_ij * B_ij.

        For the example:

                                  [  0  8 -16  0 ]
               B * 2^{-5} (*) T = [ 24  0  0 -48 ]
                                  [-48  0  0  24 ]
                                  [  0 16 -8  0  ]

        so

             B (*) T   [ 0  1* -2  0 ]
             ------- = [ 3* 0   0  6 ]
               2^8     [-6  0   0  3*]
                       [ 0 -2  -1* 0 ]

        The odd entries are marked with a *.  They are at
        1,2, 2,1, 3,4, and 4,3.  So M is {1,2}, {2,3}.

Lecture 31 
4/11/18 Wednesday

   The Isolation Lemma and Applications
   N. Ta-Shma, manuscript 2012.

Lecture 32 
4/13/18 Friday

Introduction to randomized rounding

Raghavan and Thompson, Combinatorica, 1987.

This can be useful for getting approximate solutions to NP-hard
optimization problems.

Example 1: Vertex cover

G = (V,E) undirected graph

Want to choose a smallest set of vertices that hits all edges
     (imagine putting guards in an art gallery)

Suppose G is a triangle.  min is 2.  As a 0-1 integer program:

	  min x1 + x2 + x3 s.t.
	   
	      x1 + x2         >= 1
	      x1      + x3    >= 1
		   x2 + x3    >= 1

	      xi in {0,1}.

	  Relax this to an LP (0 <= xi <= 1).
	  This has the fractional solution x1 = x2 = x3 = 1/2.
	  (Optimal, by considering the dual LP.)

	  Let's interpret the xi's as probabilities, and include
		vertex i if its coin comes up "heads".

	  The number of vertices chosen has a binomial distribution.
	  So

	       Pr[0 vertices] = 1/8   (infeasible)
	       Pr[1 vertex]   = 3/8   (ditto)
	       Pr[2 vertices] = 3/8   the optimal choice
	       Pr[3 vertices] = 1/8   feasible, not optimal.

	   Pr[feasible] = 1/2.

	   E[# vertices | feasible] = 2*3/4 + 3*1/4 = 9/4 = 2.25.
	   This is about 13% worse than optimal.

More generally, consider G = K_n (complete graph w/ n vertices).
The optimal fractional solution is still xi = 1/2, for all i.
Then

	 Pr[feasible] = (n+1)/2^n     (small for large n)

	 E[# vertices | feasible] = n^2 / (n+1)

	 E[# vertices | feasible]       n^2
	 ------------------------ = ------------ ~ 1 + 1/n^2 .

	 This is pretty good, provided we pick a feasible solution.
	 However, the chance of feasibility is small.
	    optimum 0-1 choice       (n+1)(n-1)
  
Example 2: Wire Routing

G = (V,E) undirected graph (usually a grid)

Given subsets S1, S2, ... S_k  [  V
For each i, there is a collection of trees that can be used to
	   connect Si.

Goal: Choose one tree for each S_i so as to minimize the maximum
     traffic over the edges.

Note: This is a so-called bottleneck optimization problem.
     Our metric is the min, over edges e, of the traffic on e

Example:  3 x 1 grid

   A *----*----*----* b         S1 = {A,a}
     |    |    |    |           S2 = {B,b}
   B *----*----*----* a         so we're connecting opposite corners

   Available trees:

     T11                       T12
     *----*                    *----*----*              for S1
	  |                              |    
	  *----*----*                    *----* 

     T21                       T22                      for S2
	  *----*----*                    *----* 
	  |                              |    
     *----*                    *----*----*

    Any feasible choice has 2 links on some edge.  (Use symmetry.
    Choose T11, then consider adding T21 or T22.)

As a 0-1 integer program

	  min W (width) s.t.
	  x_ij in {0.1}              x_ij = 1 means we use Tij
	  sum_j x_ij >= 1, all i     each set is connected by some tree
	  sum_{ ij s.t. e in Tij } xij <= W  (edge traffic <= W)

LP relaxation: same but with 
	      0 <= x_ij <= 1
	      sum_j xij = 1    (no reason to over pack)

For the example the optimal fractional choice is xij = 1/2 (all ij)
The width is 1, which is optimal. 

Randomized rounding recipe

   Solve the LP.
   For each i, we interpret the xij's as probabilities pij that sum to 1
   Each set chooses its tree at random, but with (possibly) non-uniform
					probabilities
   Note that the choices for each set are independent, and that
	     feasibility is guaranteed.

   E[traffic thru e] = sum_{Tij has e} p_ij <= W_LP (an LP constraint)

   We want the traffic thru edges to be *uniformly* small.
   This is made probable by exploiting independence.

A concentration result

   Lemma: Let Z = Z1 + ... + Zt be a sum of independent 0-1 random vbles
	      mu = sum_i E[Zi]   (these means might be different)
	      beta >= 0

	  Then Pr[ Z >= (1+beta)mu ] <= exp( - beta^2 mu / 3 ).

   This is a version of Azuma's inequality (sometimes associated with
   Bernstein, Chernoff, Hoeffding, ...).  See Angluin and Valiant,
   JCSS, 1979.

   Think of Zi as saying that the tree for Si runs through e.

   So we are expressing local traffic as a sum of independent rv's.

   The quantity (1 + beta)mu is called a "large deviation" (despite
       how it looks!).  When the expectations are all 1/2,
       the lemma says that the probability of observing a sum this
       far out decays exponentially to 0.  This is consistent with
       the DeMoivre-Laplace limit theorem, which implies that with 
       high probability, the sum is mu + O(sqrt mu).
   
Theorem (RT).  Let |E| = m.  Let 0 < epsilon < 1 be such that

		     3 log(m/epsilon) <= W_LP   (opt fractional width)

	      Then with probability >= 1-epsilon, the width of the
	      randomized rounding solution is

		  <= OPT + sqrt( 3 OPT log(m/epsilon) ).

	      Here (as is customary) OPT denotes the cost of the best
	      integral solution.

Proof.  W_LP <= OPT, so it will be enough to show that w/ the
	indicated probability,

	      width < W_LP + sqrt( 3 W_LP log(m/epsilon) ).

	(this constraint is more stringent).

	Consider the edge e, and let mu = E[traffic on e]

	 Pr[e's traffic >=   W_LP + sqrt( 3 W_LP log(m/epsilon) ) ]

      <= Pr[e's traffic >=   mu + sqrt( 3 mu log(m/epsilon) ) ]

		 (since mu <= W_LP from edge constraint)

	 Rewrite this event as

	    e's traffic >=   mu + mu * sqrt( 3 log(m/epsilon) / mu )
				       <------ call this beta ----->

	 By the lemma its probability is no more than

	       exp( -3 log(m/epsilon) / mu   *   mu / 3 )
	     = exp( - log(m/epsilon) )
	     = epsilon / m

	 One of the axioms of probability says that the probability
	 of a union of events is no more than the sum of their
	 individual probabilities.  Applying this,

	     Pr[ some edge has traffic >= ... ] <= epsilon.

Lecture 33 
4/16/18 Monday

    Randomized rounding for MAX-SAT
    Goemans and Williamson, SIAM J. Disc. Math., 1994

    An Optimization Version of Boolean Satisfiability

       Consider formulas in conjunctive form, e.g.

              phi = (x v y' v z) ^ (x' v z')

              This is an AND of clauses.  Each clause is an OR of
              literals (variables or their negations).

              Often called CNF (conjunctive normal form) even though
              it isn't unique.  For example, I could replace the
              last clause by (x' v y v z') ^ (x' v y' v z').

       Want to determine the maximum number of clauses that can
       be simultaneously satisfied.

       This is an NP-hard optimization problem, by the Cook-Levin
       theorem.  (phi is satisfiable iff the maximum is the
       number of clauses.)

       For simplicity we'll assume there are no repeated clauses,
       and no repeated variables in a clause.  Also write 1 for
       true and 0 for false.

    Random Assignment

       This is a simple strategy.  Flip a coin for each variable
       and then set it to 1 if the coin was "heads" and to 0
       otherwise.

       Let C be a clause with k literals.  We can assume that
       C is x1 v x2 v ... v xk.

       Unless all xi are 0, the clause C is satisfied.  So
       Pr[C is satisfied] = 1 - 2^{-k}. 

       If there are m clauses, each with >= 2 literals, the expected
       number of satisfied clauses is >= (3/4) m.
                
    Using Random Rounding

       Integer program for MAX-SAT:

               Given clauses C1, C2, ..., Cm using variables x1, ..., xn.

               Let Cj+ be the variables appearing positively in Cj
               Let Cj- be the variables appearing negatively in Cj

               E.g. for C1 = x v y' v z          C2 = x' v y'

                        C1+ is x,z               C2+ is empty
                        C1- is z                 C2+ is x,y

                  (Note these sets give another way to write the formula.)

               The IP is

                   max sum_{j=1}^m zj    s.t.

                   sum_{xi in Cj+} yi + sum_{xi in Cj-} (1-yi) >= zj, all j

                   yi, zj in {0,1}

                   yi indicates whether xi is set to 1 or not
                   zj is the indicator for satisfaction of the j-th clause
       
               The LP relaxation is

                   same, but with 0 <= yi, zj <= 1.

               To get an assignment we solve the LP, obtaining possibly
               fractional values yi^, zj^.  Then, set xi to 1 with 
               probability yi^.  (Note that feasibility is guaranteed.)

       The Probability of Satisfaction by Random Rounding

           Let beta_k = 1 - (1 - 1/k)^k.   
           Recall zj^ is gotten by solving the LP

           Claim.  If Cj has k literals, 

                        Pr[Cj is satisfied] >= beta_k zj^.

           Proof.  We can assume Cj is x1 v ... v xk.  As before,

                   Pr[Cj satisfied] = 1 - Pr[all xi are 0]

                                    = 1 - prod_{i=1}^k (1 - yi^).

                   Recall if alpha_1,...,alpha_k >= 0, sum alpha_i <= const,
                   the product alpha_1 * ... * alpha_k is largest when all
                   alpha_i are equal.  (This is a consequence of the
                   arithmetic-geometric mean inequality, for which see
                   1/23/18 lecture.)

                   From the LP,

                      sum_i yi^ >= zj

                   which is equivalent to

                      sum_i (1 - yi^) <= k - zj

                   Therefore, 

                      prod_i (1 - yi^) <= prod_i (1 - zj^ / k)

                              [the average of 1-yi^ is <= 1 - zj^ / k.]

                   So 

                     Pr[Cj satisfied] >= 1 - (1 - zj^ / k)^k

                                      >= (1 - (1 - 1/k)^k) zj^

                                      [2nd >= holds if 0 <= zj^ <= 1,
                                       prove this as an exercise.]

                                       =  beta_k zj^.

       A Combined Strategy

           Run both procedures.  Use the assignment that satisfies
           the largest number of clauses.

           Suppose random assignment satisfies n1 clauses
                   random rounding satisfies n2 clauses

           Theorem. E[ max {n1, n2} ] >= 3/4 sum_j zj^ >= 3/4 * OPT of MAX-SAT

           Proof. The second inequality holds because we relaxed a
                  maximizing problem to get the LP.

                  E[n1] = sum_k sum_{Cj w/ k literals} (1 - 2^{-k})
                       >= sum_k sum_{Cj w/ k literals} (1 - 2^{-k}) zj^ .

                  E[n2] >= sum_k sum_{Cj w/ k literals} beta_k zj^.

                  Since max >= average, E[max{n1,n2}] is lower bounded by

                    sum_k sum_{Cj w/ k literals} [(beta_k +  1-2{-k})/2] zj^.

                  The expression in brackets is >= 3/4 for k=1,2,...
                  (exercise for you), so

                  E[max{n1,n2}] >= (3/4) sum_j zj^ >= (3/4)*OPT.

Lecture 34
4/18/18 Wednesday

   New topic: Applications of nonlinear optimization

   Today: Introduction to convexity

          Papadimitriou and Steiglitz 1.5, 1.6
          Fleming, Functions of Several Variables, 1-4, 1-5, 2-4.

   Why? Convexity is the natural language for discussing optimization
   problems that go beyond linear programming.  In many cases, convex
   problems are "easy" to solve whereas non-convex problems may not be.

   Convex sets

     Defn. Let C be a subset of R^n.  C is *convex* if, for every x,y in C,
     the line segment joining x to y is within C.

     (The line segment joining x and y is given parametrically
      by z = lambda x + (1 - lambda)y, 0 <= lambda <= 1.)

             In R^2, boxes and discs are convex.
                     kidney-shaped sets aren't.
                     
             [draw the pictures]

       In R^1, convex sets are the intervals (possibly unbounded).

       Every convex set is connected (meaning that any two points
       in the set can be linked by a path).  Here the line segment
       between the points can serve as the path.

       Note: The defn of convexity involves universal quantifiers:
       for every x,y in C, for every lambda in [0,1] ...
       so the intersection of any family of convex sets is convex.

   Convex functions
     
     Defn. Let f : C (convex) --> R.  f is *convex* if for each x,y in C
           and each lambda, 0 <= lambda <= 1,

              f(lambda x + (1-lambda)y) <= lambda f(x) + (1-lambda) f(y).

     The idea of this is that f(average) <= average(f).

     This has a probabilistic interpretation.  Define X (a random variable)
     by

                  X = x w.p. p
                    = y w.p. q = 1-p.

     Then  f(E[X]) <= E[ f(X) ].  This is an example of Jensen's inequality.

     On an open convex set, a convex function is continuous. (For proof,
     see Fleming, p. 26.)  But it need not be smooth: consider 
     f(x) = |x| on R. 

        Warning: Continuity can fail at boundary points.  Consider, for
                 example,

                          f(x) = 0,  0 <= x < 1
                               = 1,  x = 1.

     If f is convex, the *sublevel set* {x : f(x) <= alpha} is convex.

                [draw the picture.]

     Typically we present a convex set as an intersection of
     sublevel sets:

            C = intersection_{i in I} {x : f_i(x) <= alpha_i }.

     Example: polytope, as an intersection of finitely many half planes.

            E.g. in R^2, x >= 0, y >= 0, x+y <= 2, same as

                         -x <= 0, -y, <= 0, x+y <= 2.

   Convex optimization (generalizes LP):

     Given C,f (both convex) find x^ minimizing f(x) on C.

     Warning: If f is convex, -f need not be.  So we can't interchange
              maximizing and minimizing problems as freely as we did
              for linear programming.

     Warning++: Even if C is bounded, the min need not exist. Consider
                min x+1 s.t. 0 < x < 1.  

                If C is compact (closed and bounded) and f is continuous,
                then the min must be attained.  This is true even if f isn't
                convex.

     By adding an extra dimension, we can make f(x) linear:

         min f(x) s.t. x in C  <==>  min t s.t. x in C, f(x) <= t.   

                                     (taking the min over x,t)

         To understand why this is true, consider a picture:

                   \ . . . . . /
                    \. . . . ./
                     \ . . . /
                      \. . ./
                       \ . /
                        \./
                         *

         The set of (x,y) with f(x) <= y is called the *epigraph* of f.
         Each "fiber" of the epigraph is a closed half-line.  We are
         looking for the *lowest* boundary point of these half-lines.

   Big theorem: Local minima are global minima

     Let x* be a local min for f : C --> R (both f and C convex).
     Then x* minimizes f over all of C.

     Proof: We will reduce to dimension 1 (this trick often works).

            Suppose (for contradiction) that x* isn't a global min.
            That is, there is a better x^ somewhere in C.  Since C
            is convex, the line segment from x* to x is within C.
            We use lambda in [0,1] as the local coordinate for that
            line segment, with lambda=0 indicating x* and lambda=1
            indicating x^.

            Let g(lambda) = (1-lambda)f(x*) + lamba f(x^).  (Draw the
            picture -- the graph of g is a straight line lying above
            the graph of f, and g' = f(x^) - f(x*) < 0.)

            If 0 < lambda < 1, 

              f((1-lambda) x* + lambda x^) <= (1-lambda)f(x*) + lambda f(x^)
                                            = g(lambda)
                                            < g(0)       (g decreases!)
                                            = f(x*).

            Since lambda can be arbitrarily small, x* wasn't a local min.

     This is a big deal because it means that we only need to
     investigate f locally to verify that we have an optimum.
     (Of course, we still have to prove global convexity.)

   Some examples to ponder.

    1) C = (a,b), f smooth on (a,b), f'' > 0.  Any critical point
       (place where f' = 0) minimizes f on the interval.

    2) C = [-1, 1].  f given by

                      0,          -1 <= x <= 0
               f(x) = 
                      exp(-1/x),  0 < x <= 1

       (This is a famous example of a smooth function that is not
        equal to the sum of its Maclaurin series around x=0.)

       There is an interval of critical points: [-1,0].  Each is
       a global min.

    3) Quadratic forms in R^2

       f = a x^2 + b xy + c y^2.  For simplicity assume a != 0.

       partial f / partial x = 2ax + by
       partial f / partial y =  bx + 2cy

          These both vanish at the origin, which is always
          a critical point.

       Hessian (matrix of second partials) is

             ( 2a    b )
             (  b   2c )

       A smooth function f is convex iff its Hessian matrix
       is positive semidefinite (means that z^T H z >= 0 for all z).
       So f is convex iff f(x,y) >= 0 for all x,y.

       When does this happen?  Complete the square:

          af = (ax + b/2 y)^2 + (ac - b^2 / 4) y^2.

       From this we can see that f(x,y) >= 0 whenever:

          a > 0  and  4ac - b^2 >= 0.

       In this case, the local min at x=y=0 is also a global min.

       What if a = 0?  The general result, which you can prove with
       similar reasoning, is:  f is convex iff a, c, 4ac - b^2 >= 0.

Lecture 35
4/20/18 Friday

  Semidefinite programming (SDP)

     Lovasz, Semidefinite Programs and Combinatorial Optimization, in
     Recent Advances in Algorithms and Combinatorics, ed. B. A. Reed
     and C. L. Sales, Springer-Verlag 2003, pp. 137-194.  This will be
     on the course web site.

  Convex optimization (min f(x) s.t. x in C, both f and C convex) has
  nice theoretical properties.  For example, any local minimum of f is
  also a global min.  However, it includes all kinds of weird non-intuitive
  possibilities, such as

                             1   if x = 0
                      f(x) =
                             x   if 0 < x <= 1.

  Here, [0,1] is compact (closed and bounded) but the min doesn't exist.

  Like convex optimization, SDP is a generalization of linear programming,
  but it is better behaved.

  Defn.  Let A be an nxn matrix with positive entries.   
         A is *positive semidefinite* if x^T A x >= 0 for all x in R^n.

         S = {nxn positive semidefinite matrices} [ R^{n^2}.

         S is convex:  if x^T A x, x^T B x >= 0, and 0 <= lambda <= 1, then
                       x^T ( lambda A + (1 - lambda) B ) x >= 0.

         S is a cone:  if x^T A x >= 0 and gamma >= 0, then 
                       x^T( gamma A ) x >= 0.

         Same for S' = {symmetric pos. semidefinite matrices} [ R^(n+1 choose 2)

  Lo"wner Ordering.    A >> B means that A-B is positive semidefinite
                       So A >> 0 means that A is pos. semidef.

  Some examples.

       (n=1)  ax^2 >= 0 for all x  iff  a >= 0.

                  [a b]
       (n=2)  A = [b c]    Associated quadratic form: ax ^2 + 2bxy+ cy^2.

                                        |a b|
              So A >> 0 <==> a >= 0 and |b c| >= 0.

       (general n):  Let A be n x n, symmetric:

                         a11 | a12 | a13 |  ... a1n
                         -----     |     |
                         a12   a22 | a23 |  ... a2n
                         -----------     |
                         a13   a23   a33 |  ... a3n
                         -----------------
                               ...

                     A >> 0 iff all n "upper left corner" determinants
                                are >= 0.

                     This means we can check whether A >> 0 or not
                     in polynomial time -- O(n^4) operations -- if we
                     use Gaussian elimination to compute these determinants.

  Note that optimization of linear functions over S or S' is pretty
  uneventful: the optimum is either 0 or the problem is unbounded.
  So we introduce further constraints.

  Defn. In R^n, an *affine subspace* is a translation of an ordinary
        linear subspace.  These can be specified in two ways:

           1) Solution set of linear equations

           2) Image of an affine map t --> At + bt

        For example, the line through (1,0) and (0,1) is either:

           1) Soln set of x+y = 1
        or
           2) All points (t,-t) + (1/2, -1/2).

  Semidefinite programming: minimize a linear function of the entries of
                            a symmetric matrix X, subject to:
                            i) affine constraints on entries, and
                            ii) X >> 0.

  Just as with LP, SDP's can be presented in two ways, reflecting the two
  ways to specify the affine constraints.

       1) min C o X such that Ai o X = bi, i=1,...,m
                              X symmetric, X >> 0

          ( C o X means sum_ij cij xij, the elementwise inner product. )

       2) Given A0, A1, ..., Am symmetric nxn matrices, c in R^m

          min cx s.t. A0 + A1x1 + ... + + Am xm >> 0.
               ^
           x is (x1,...,xm) here

  Theorem: SDP has LP as a special case.

           Proof. The idea is that LP is SDP on the diagonal.
                  We use the abbreviation diag(a1,...,an) for a
                  diagonal matrix with the indicated entries.

                  we translate min cx s.t. Ax = b, x >= 0 as follows:

                     X = diag(x1,...,xn),  so xij = 0 if i != j
                     xi >= 0, all i  iff  X >> 0
                     ax = b iff diag(a1,...,an) o diag(x1,...,xn) = g
                     cx is diag(c1,...,cn) o diag(x1,...,xn).

  Background on eigenvectors and eigenvalues

        Let A be a matrix.

        A number lambda for which there is a nonzero v making Av = lambda v
        is called an *eigenvalue* and the v is an *eigenvector*.

        The idea is that along the direction given by v, the action of A
        is extremely simple: it just stretches or shrinks v.  The amount
        of stretch or shrink is the eigenvalue.

        An n x n symmetric matrix with real entries has n real eigenvalues

                 lambda_1 >= lambda_2 >= ... >= lambda_n

        These are counted with multiplicity, and the corresponding eigenvectors
        are orthogonal.

        This is proved in G. Strang, Linear Algebra and its Applications, 
        pp. 235 ff.

  The mother of all SDPs: Find the largest eigenvalue of a real symmetric
                          positive definite matrix.

      Theorem.   max { lambda_i } = min { lambda : lambda I - A >> 0 }

                 This is like a strong duality result, but note that
                 the max is over a discrete set and the min is over a
                 half-line.

      Proof.       for all x   x^T( lambda I - A ) x >= 0
              <==> for all x   lambda ||x||^2 >= 0
              <==> for all x != 0,

                                            x^T A x
                                 lambda >= ---------
                                            ||x||^2

                                           (the Rayleigh quotient).

              Since the quotient is homogeneous (degree 2) we may as well
              let ||x|| = 1.  Then, write x = v1 + ... + vn where the vi
              are eigenvectors for lambda_1,...,lanbda_n.  Then

                    A x = lambda_1 v_1 + ... + lambda_n vn
                x^T A x = lambda_1 ||v_1||^2 + ... + lambda_n ||vn||^2
                       <= lambda_1 ( ||v_1||^2 + ... + ||vn||^2 )
                        = lambda_1 ||x||^2

              This upper bound is tight: take x = v1 / ||v1||.

      So, finding lambda_1 is an example of SDP (type (2)) with m=1:

                lambda I - A = -A + I lambda
                                ^   ^
                                |   |
                               A0  A1

                lambda = c lambda with c=1.

      [If A is not pos def can it easily be made so?]
 

Lecture 36
4/23/18 Monday

   Using SDP to approximate maximum cuts
   Goemans and Williamson, J. ACM, 1995

   Max-Cut Problem:

         G = undirected graph with edge weights (>= 0).
         Want to partition vertices into two sets and maximize the total
           weight of the crossing edges.
         (Note that this forces both sets to be nonempty.)

         Example: G = K_n (complete gph on n vertices)
                  If |S|= k, |V-S| = n-k.  
                  There are k(n-k) crossing edges.
                  We maximize this when k = floor(n/2).

         NP-hard, even if the weights are all 1 (Karp 1972).

         G+W found an approximation algorithm based on semidefinite programming.

   Review of SDP from last lecture

         pos semidef matrices
         SDP: minimize linear fn of pos semidef matrix coeffs, coeffs may be
              linearly constrained

   Factoring Positive Semidefinite Matrices

         The expression x^T A x looks like the square of a length.
         By factoring A we can make this explicit.

         Thm.  If A >> 0 (symm pos semidef) there is a real, upper-triangular M
               s.t. A = M^T M.

         This is called a Cholesky factorization.  

         Note that x^T A x = x^ (M^T M) x = ||Mx||^2.  So in the right 
         coordinates, sqrt(x^T A x) *is* a length.

         The matrix M can be computed by an algorithm using O(n^3) operations
         from {+,-,*,div,sqrt}.  The square root is necessary, since for n=1
         we see that a = m^2 implies m = +- sqrt(a).  Golub and Van Loan
         have a discussion of this.

   Relaxing Max-Cut to SDP

         Let w_ij be the weight of {i,j}, or zero if there is no such edge.

         Here is a nonlinear integer program for Max-Cut:

               max  (1/2) sum_{i<j} w_ij ( 1 - yi yj)  s.t. yk in {+-1}

         Note that we pick up 2 w_ij when signs disagree, i.e. when
         vertices i and j are on opposite sides of the cut.  The factor
         1/2 is there to make the objective values agree.

         Now we relax to a continuous problem:

               max  (1/2) sum_{i<j} w_ij ( 1 - (yi, yj) )  s.t. ||yk|| = 1
                                                                yk in R^n.

               The vectors yk could be chosen from a pair of antipodal 
               points on the unit sphere, making the inner product +1
               if the points are the same, and -1 if they are opposite.
               So this is a genuine relaxation.

         G+W observe that this is a disguised version of SDP.  Note that:

               X (symmetric) is pos semidef <==> X = Y Y^T,  say with

                                                     [row vector y1]
                                                 Y = [    ...      ]
                                                     [row vector yn]


               X_ii = 1                     <===>  || yi ||^2 = 1

               X_ij                         equals   (yi, yj)

         So we can write the max-cut relaxation as an SDP in equality form:

               min   sum_ij wij X_ij   s.t.   X >> 0,   X_ii = 1, all i.

         Assume that we can solve this, and factor the resulting X to
         get the yi's.  To get the approximate Max Cut:

                      Choose a random hyperplane H thru the origin
                      Label each vertex vi by +1 or -1, depending on
                        which side of H the vector yi is on
                      Use this vertex classification as the cut

         In expectation, the value of the G-W max cut is close to optimal.

                 Let theta_i, theta_j be two unit vectors, theta_ij the angle
                 between them, chosen so that 0 <= theta_ij <= pi.

                 H intersects the plane of yi,yj in random fashion, so

                      Pr[yi,yj on opposite sides of H] = theta_ij / pi.

                 Lemma.  There is a c>0 such that on [0, pi]

                              theta/pi >= c (1 - cos theta)/2

                         (Exercise for you.)

                         This is very plausible if you graph the functions
                         on the left and right side.  Observe that c=1 is
                         too big.  The largest c that works is alpha = 0.87856...
                         and can be found numerically.

                 Using this,

                   E[G-W cut value] = sum_{i<j} w_ij theta_ij / pi

                                   >= sum_{i<j} w_ij alpha (1 - cos theta_ij)/2

                                   >= alpha/2 sum_{i<j} w_ij (1 - (yi, yj)

                                   >= alpha * max over yk in +-1 of
                                      (1/2) sum_{i<j} w_ij (1 - yi yj)

                                    = alpha [maximum cut value]
                                    
         To finish up, let's see how we can make our random hyperplane H.  

                  Let Z = (Z1,...,Zn) where the Zi are i.i.d. normal with mean 0
                  Then U = Z / ||Z|| is a unit vector in a random direction.
                  H is the hyperplane of vectors normal to U, and
                  we give vi the sign of the inner product (yi,H).

                  Degeneracy (some yi on H) is possible, but this event has
                  probability 0.  If it occurs we can simply try again with
                  a different Z.

Lecture 37
4/25/18 Wednesday

    Search vs. Decision for LP
    Klee-Minty "bad" instances for naive simplex.

Lecture 38
4/27/18 Friday

    The Ellipsoid Algorithm

         Papadimitriou & Steiglitz, 8.2
         Lawler, Mathematical Intelligencer, v. 2, 1980
         Aspvall and Stone, J. Algorithms, 1980

    Used by Khachiyan to give the first proof that LP can be done in poly time

    Solves the "feasible point" problem: given a nonempty polytope C [ R^n,
          find an x in Co
    
    We will assume for simplicity that C is compact (closed and bounded)
          and has potitive volume

    Separation Oracles

        Hyperplane: {x : ax = b}.   This defines two other sets,

               H_ = {x : ax < b}
               H+ = {x : ax > b}

        Let C [ R^n, compact convex set

        A separation oracle for C takes in x in R^n, and either tells us that
        x is in C, or provides a hyperplane H containing x such that C is 
        entirely within H_.  (This means the distance from C to H is positive.)
        The idea:

                            H:
                               \
               C:               \
                     **          x
                    ****          \
                    ****           \
                    ****            \

         Any polytope {x : Ax <= b} has a poly-time separation oracle.
         If Ax <= b, say that x is in C.  Else, some inequality is violated,
         say A_i x > bi.  Let H be {x : Ai x - b = 0}.

    Ellipsoid as a Generalization of Binary Search

         At all times, we have an ellipsoid E that contains C.  


                  H:
          E:       |
              ----------
             /     |    \
             | C   x    |
             \     |    /
              ----------
                   |

         We test x, the center of E.  If x is not in C, we draw a new
         ellpsoid E' that encloses the "left half" of E.  E' will have
         volume that is smaller than the volume of E, by a constant factor
         that depends only on u.  Replace E by E' and continue.  Since
         the volume of E shrinks geometrically, this cannot continue forever. 

         For n=1, ellipsoids are intervals.  So:

                     
          E:        C:
              |-----[***]--x------------|
                           ^
                       test center (here)
                       if x not in C, we reduce the length (= 1-dimensional
                       volume) by 1/2.

          This is just binary search.
                       
    Computational Details

       Affine map on R^n:    x --> Mx + b, M nxn invertible

                             solve y = Mx + b to get inverse:

                              x = M^{-1}(y-b) = M^{-1} y - M^{-1} b.

       Ellipsoid:  { x : (x - x0)^T A (x - x0) <= 1 },  A symm pos def

       Using a polynomial number of operations, we can find an affine change
       of coordinates that brings the ellipsoid to the form ||x||^2 <= 1.
       The idea of the algorithm is to repeatedly complete the square, and
       bring the quadratic expression x^T A x to diagonal form.

       Note that ||x||^2 = x^T I x.    The center is x=0.

       Square root is needed, since x^2 <= 2 iff (x / sqrt 2)^2 <= 1.

       So we can assume E is {x : ||x|| <= 1}.  Its two "slices" are
                        E- = {x | ||x|| <= 1, x1 < 0}
                        E+ = {x | ||x|| <= 1, x1 > 0}

       We get E' by taking the center (x=0) and moving it to the left
       along the x1-axis, maintaining the property that E' touches the
       left end of E, and meets the topmost points (0,x2,...,xn).

            For this we can draw the picture for the (x1,x2) plane.
            By symmetry it is enough to consider these coordinates.

            A good choice for the new center is x' = (-1/(n+1),0,...,0).
          
               The new radius (distance to left end of E) is 1 - 1/(n+1).
               We want (-1/(n+1),1) to be on E' (we are setting all other
               coordinates to 0), which is

                          x1^2 / r1^2 + x2^2 / r2^2 = 1.

               Solving this for r2 (we know x1 = -1/(n+1), x2=1) we get

                          r2 = (1 - 1/n^2)^{-1/2}.

            Suppose 
                      Vol(E)  = c_n
                      Vol(E') = c_n* r1 * r2^{n-1}

           Then

                 Vol(E')/Vol(E) = r1 r2^{n-1}

                                = (1 - 1/(n+1))(1 - 1/n^2)^{n/2}

                                      ... (computations skipped here)

                                = 1 - 1/(2n) + O(1/n^2)


        So applying this 2n times will shrink volume by approximately

                (1 - 1/2n)^{2n} ~~ e^{-1}, about 0.37.

        Suppose we choose the initial ellipsoid so that

                C [ E_0 [ [-R,R].

        Then (to first order), if there are k iterations,

           Vol(C) <= Vol(E_k)
                  <= (1 - 1/2n)^k Vol(E_0)
                  <= (1 - 1/2n)^k (2R)^n
                  <= exp( - k/2n ) (2R)^n

        So
             log(Vol(C) <= -k/2n + n log 2R

        which rearranges into

             k <= 2n^2 log (2R) - 2n log( Vol(C) ).

        So after at most this many iterations, we will get an x in C.

Lecture 39
4/30/18 Monday

    Continuing with Ellipsoid Algorithm

    Review of ellipsoid algorithm (last lecture)

    Getting Started

       Suppose C is the polyhedron {x : Ax <= b} [ R^n
                                   all entries of A,b integral, 
                                       <= M in absolute value.

       C is a convex combination of its vertices (here is where we use
       that C is bounded).

       At each vertex v = (v1 ) there are n linearly independent tight
                          (...)           inequalities
                          (vn )

       So v is the unique solution to Bv = b, where B is a submatrix of A
       and det(B) != 0. (This follows from properties of the simplex 
       algorithm).

       By Cramer's rule

                              det(B with i-th col replaced by b)
                         vi = ---------------------------------
                                          det(B)

       So (expand the determinant into n! terms)

                        |vi| <= n! M^n

       Also, if |vi| = ri/si (fraction in lowest terms) we have
       log(ri), log(si) = O( n log(Mn) ).

       Suppose there are m vertices, and x in C.  Then

             xi = sum_{j=1}^m lambda_j vi^(j),

       where sum_j lambda_j = 1, and all lambda_j >= 0.  This plus
       the bound for vertex coordinates gives

             |xi| <= n! M^n

       implying

           ||x||^2 = sum_i |xi|^2 <= n (n!)^2 M^2n

       so 
           ||x|| <= sqrt(n) n! M^n  <---- make this R

       The initial ellipsoid can just be ||x||^2 <= R^2.

       As required, log(2R) = O( n log n + n log M) is polynomial
       in the number of bits needed to specify C.

    Note: Our argument that linear programming can be done in polynomial
    time is not complete, because we didn't examine the precision required
    in our numerical calculations.  You can find a discussion of that
    in Papadimitriou and Steiglitz.

Lecture 40
5/2/18 Wednesday

    Using the ellipsoid algorithm for SDP
    Lovasz, Semidefinite programming and combinatorial optimization (1995)

    So far we've used the ellipsoid algorithm for linear programming.
    The key concept is that of a separation oracle for a convex set C.
    Its job is to take a test point x, and report that either x is in C,
    or give a hyperplane H that goes through x and has C strictly on one side
    of it.  (See lecture 3/28/18.)  Using the separation oracle, we put
    C inside a equence of smaller and smaller ellipsoids, until we
    eventually find x in C.

    The ellipsoid algorithm can also be used for (well behaved) semi-definite
    programming problems.  We have to be careful, since some familiar
    features of LP need not hold:


         Optima need not rational              Example: eigenvalue problem,
                                               lambda_max is the largest
                                               root of an n-th degree poly
                                               equation.

         Strong duality need not hold          In general, we only have
                                               weak duality

         Optimum points can be huge            There is no analog of
                                               Cramer's rule.

    Let's look at how we can make separation oracles for SDPs. 

    Simplest case: C = { A : A >> 0 } = cone of symmmetric positive definite
                                        matrices

            Factor A = U^T D U,  D = diag(lambda_1,...,lambda_n).
            We can do this because A is symmetric (principal axis theorem)
            A is positive semidef iff all lambda_i >= 0.

            So if A is not pos semidef, lambda_1 (say) < 0. 

            Let v = U^{-1} [1  ].  This a particular vector that depends 
                           [0  ]   on A.
                           [...]
                           [0  ]

            Then

             v^T A v = [10...0] U^{-T} U^T D U U^{-1} [1  ]
                                                      [0  ]
                                                      [...]
                                                      [0  ]
            
                     = [10...0] D [1  ] = lambda_1 < 0.
                                  [0  ]
                                  [...]
                                  [0  ]

            Consider the linear map (X is a matrix)

                 L(X) = -v^T X v.

              All pos semidef matrices X have L(X) <= 0.
              For the "bad" A (not in C), L(A) > 0.

            So take

                H = { X : -v^T X v + v^T A v = 0 }.

            Note that A is in H, and if X is pos semidef,

                -v^T X v + v^T A v  < 0. 

                    ^         ^
                    |         |
                   <= 0      < 0
                                    
    If we are going to use ellipsoid for an SDP problem we need a good
    a priori bound on the optimal matrix.  This allows us to replace
    the (unbounded) cone by a compact set.

    Example: Goemans-Williamson max cut approximation (see lecture 4/23/18)

       Recall we needed max of (1/2) sum_{i<j} w_ij (1 - (yi,yj)),
       where the max is taken over ||yk||^2 = 1.

       Equivalent to finding

               min sum_{i<j} w_ij (yi,yj)   s.t. ||yk||^2 = 1.

       As an SDP (o denotes elementwise inner product of matrices)

                min W o X   s.t.  X  >> 0,  X_ii = 1.           (*)

           Here 
                       x_ij   if {i,j} is an edge in the graph
                  W = 
                       0      if no such edge

       To get the a priori bound, factor X = Y^T Y (Y is called the Gram
       matrix), and let yi be the ith row of Y:

                     [---y_1---]
                 Y = [   ...   ]
                     [---y_n---]

       By Cauchy-Schwartz, |Y_ij| = |(yi,yj)| <= ||yi|| ||yj|| = 1.

       So |X_ij| <= n.

   [Need to discuss the separation oracle for this problem.]
   
   Now, suppose we have the SDP min L(x) s.t. x in K, where r and R
   are radii for enclosing the feasible set K.  (That is, if 
   x in K, then r <= ||x|| <= R.) The following complexity bound 
   appears in a paper by Grotschel, Lovasz, and Schrijver, 
   Combinatorica, v. 1, 1981, 169-197.  (See eqn. (37).)

      Let X^ be the optimum for the above SDP.  To get 

                 |L(x) - L(x^)| < epsilon,

      poly(R/r, log(1/epsilon), n) calls to the separation oracle
      are sufficient.

   For the max-cut SDP, our goal is

                 |W o X - W o X^| < epsilon,

   where W is the weight matrix.

   What are good values of r and R?

          We can take R = n^2, since

              sum_ij |X_ij|^2 <= sum_ij n^2 = n^4.

          Finding r is more problematic.   [Watch this space.]

   Although this gives a polynomial-time algorithm for approximating
   a maximum cut, in practice the ellipsoid algorithm is quite slow.  
   This is a result of:

        1) A volume shrink factor close to 1 (to first order, 1 - 1/(2n))

        2) The number of iterations is not sensitive to the data, so
           the average case is basically the worst case

        3) The need for accurate arithmetic

   For these reasons, modern convex optimization routines use interior
   point methods.  (These are beyond the scope of this course.)

Lecture 41
5/2/18 Wednesday

 Lagrangian Relaxation

   This is a general-purpose approximation method to get bounds
   on solutions to "hard" problems.

   We'll consider minimizing problems

       min f(x) s.t.                         (A)
           x in X
           g(x) >= 0.

   The main idea is to move the constraint into the objective function,
   in effect introducing a reward for compliance: 

       min f(x) - u g(x)  s.t.   x in X      (B)
                                 u >= 0

   Fact. Let x^ be an minimum for (A) and x~ an minimum for (B).  (We'll
   assume all required optima exists.)  Then for any u >= 0,

       f(x^) >= f(x^) - u g(x^)    [x^ is feasible for A, so g(x^) >= 0]
             >= f(x~) - u g(x~)    [x^ in X, so feasible for B]

   This is a lower bound with a parameter in it, which is very valuable.
   Let's make the lower bound as tight as possible:

       min f(x) s.t.       >=     max     min     [ f(x) - u g(x) ]
           x in X                 u >= 0  x in X
           g(x) >= 0


       L(x,u) -- the Lagrangian -- is the expression in brackets
       theta(u) will denote min_{x in X} L(x,u), sometimes called the
                                                 dual function

   Note that we have a "weak duality" result:

         min { f(x) : x in X, g(x) >= 0 } >= max { theta(u) : u >= 0 }

   Now we see how this unfolds for a particular problem.
                            

 Time Constrained Shortest Path

   G = (V,E) is a DAG with distinguished source s and sink t.
   Each edge has a cost c and transit time t
   Given a deadline T, we wish to select, from among s-t paths
   taking total time <= T, one of least cost.

   Imagine a package delivery service accepting an order T 
   hours before Christmas.

   This is an NP-complete problem (Garey & Johnson, p. 214)

   For example (all edges directed left to right):


               (4,2)         
            o ------ o             labels are
    (1,1) /   \    /   \(1,3)  <-- (cost,time)
         /(2,2)\  /     \
        s       \/      t
         \      /\(3,1)/
    (2,1) \    /  \   /(1,3)
           \  /    \ /
            o ------o
               (6,7) 

   Using dynamic programming can show in s-t time 5

   There is one path taking 5 time units:

           s     o
            \  1/ \3          total cost 2+3+1=6
            1\ /   \
              o     t

   If we take use 6 time units, two more paths become available:
       
               o     t
             1/ \2  /3        total cost 1+2+1=4
             /   \ /
            s     o

       and
                   2
               o ----- o
             1/         \ 3    total cost 1+4+1=6
             /           \
            s             t

   To put this in our framework, let P = {s-t paths}, c = cost vector
                                     t = transit time vector

   By weak duality

      min cx s.t.   >=   max      min  { cx - u(T - tx) }
          x in P       u >= 0   x in P
          tx <= T

   If we fix u, finding min of L(x,u) is a shortest path problem,
   in which

              path cost = (c + ut)x - uT.

                             ^         ^
                         new cost    same for
                         vector      all paths

   We can solve this by dynamic programming (remember G is acyclic).

   For each x in P,  u --> L(x,u) is an affine-linear function of u.

   It is shown in convex analysis that the pointwise infimum of any
   collection of affine-linear functions (u --> au + b) is concave.
   (We have finitely many x in P, so the inf is actually the min.)

   So theta(u) is concave.  We can maximize it by a variation on 
   binary search.  The idea is to maintain three samples.  Leaving
   out configurations forbidden by concavity, there are three
   possible cases:


           *
          /
         *         max can't be in left half (right end is bigger)
        /          so remove this.  next sample is between the middle
       *           and the right points.

       *
        \ 
         *         (symmetric)
          \
           * 

         *
        / \        take 2 samples that bisect the "legs"
       /   \
      *     *

      More efficient search strategies can also be used, such as
      Fibonacci search.

   Upshot: if 0 <= optimum u <= U, then we can get the best Lagrangian
   relaxation bound for this problem in O(E log U) steps.