Here is an overview, in two sections, of what I've been working on, so we have something to discuss. As you will see, both sections stem from what we discussed at our last meeting, but they go in almost completely different directions.

I. Simplicial complexes.

Problem. Given a list of candidate Betti numbers beta = {beta_i in NN : i = 0,...,n}, characterize the preimage of beta under the ``Betti operator'' BET, which sends an abstract simplicial complex to its sequence of Betti numbers; here KK denotes the collection of all abstract simplicial complexes and NN the nonnegative integers, so BET maps KK into finite sequences from NN. (This can be thought of as a special case of the problem I described at our last meeting when all pairwise distances between vertices are equal. A small code sketch of BET appears at the end of this section.) In particular, at least four subproblems seem of interest:

(i) Give an algorithm to enumerate the elements of the preimage.
(ii) Better: Give a simple formula that specifies the elements of the preimage in terms of beta.
(iii) Give an efficient algorithm to find the ``smallest'' element of the preimage, i.e. the one with the fewest i-simplices, i = 0,...,n.
(iv) Better: Give a simple formula for (iii).

It is easy to give an algorithm for (i), but it will require exponential running time, because the number of elements to be enumerated grows exponentially. I have been working intensively on problems (ii)-(iv), and related issues, for the past few weeks. A slight modification (add backtracking) of the algorithm given in my previous email to you solves (iii), but requires worst-case exponential running time. I believe (the details are not 100% worked out) I have mostly obtained an efficient algorithm for (iii), along with several interesting combinatorial facts needed for the proof. Unfortunately, although I had searched the literature earlier, only this past weekend did I find Bjoerner and Kalai, ``An extended Euler-Poincare theorem'', Acta Mathematica, 1988, which solves (ii) and (iv), along with many other problems; their solution to (iv) seems to lead easily to an efficient algorithm for (iii). So it seems I have wasted, at least from a short-term perspective, a large amount of time by not finding that article at the outset. (It uses terminology I was not familiar with and did not search for.)

[One positive thing: I had worried that the above was not of much interest; in fact last week I wrote, and then abandoned, a note to you describing my progress, with the subject line ``Small problems for small minds''. But the authors of the article above are quite well known, and they seem proud of their result (e.g., it is listed as one of a few ``representative publications''), so at least I found a problem that is interesting to someone besides me, though unfortunately it was of interest twenty years ago.]

Open(?) and interesting(?) problems.

(a) It could be of interest to solve (iii) and (iv) with a different definition of ``smallest'', e.g., when the 0-simplices are taken to be elements of a metric space and the ``bigness'' of a simplicial complex K with i-simplices K_i, i = 0,...,n, is given by

  sum_i sum_{s in K_i} (mean pairwise distance between vertices of s).

This will be much harder, though; I doubt that a polynomial-time algorithm (or a simple closed-form characterization) exists.

(b) The combinatorial results about the preimage of BET could be used to give a hardness result of the type I describe below in section II, but for a discrete space Omega with uniform pairwise distances between its elements. Initial problem: given samples from Omega, partition Omega. New problem: let K_0 be a subset of Omega whose elements are the 0-simplices of a simplicial complex K with Betti numbers beta; given beta and samples from K_0, partition Omega. Question: is the new problem easier, in a sense to be made precise, than the initial problem?
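Here is the code sketch of BET promised above. It is a minimal sketch of my own (nothing to do with the Bjoerner-Kalai paper): it computes Betti numbers over GF(2), so torsion is ignored, via ranks of the boundary matrices, using beta_i = dim ker d_i - dim im d_{i+1}. The function names and the hollow-triangle example are just for illustration, and the input is assumed to already be closed under taking subsets.

def gf2_rank(bitrows):
    """Rank over GF(2) of a matrix whose rows are given as Python int bitmasks."""
    basis = {}                              # leading-bit position -> reduced row
    for row in bitrows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]
    return len(basis)

def betti(complex_faces, top_dim):
    """Betti numbers beta_0,...,beta_top_dim over GF(2) of an abstract simplicial
    complex, given as an iterable of faces (tuples of vertex labels) that is
    closed under taking subsets."""
    all_faces = [tuple(sorted(f)) for f in complex_faces]
    faces = {d: sorted({f for f in all_faces if len(f) == d + 1})
             for d in range(top_dim + 2)}
    index = {d: {f: i for i, f in enumerate(faces[d])} for d in faces}

    def boundary_rank(d):
        """Rank of the boundary map d_d : C_d -> C_{d-1} over GF(2)."""
        if d == 0 or not faces[d]:
            return 0
        rows = []
        for f in faces[d]:
            mask = 0
            for omit in range(len(f)):          # drop one vertex at a time
                mask |= 1 << index[d - 1][f[:omit] + f[omit + 1:]]
            rows.append(mask)
        return gf2_rank(rows)

    # beta_d = dim C_d - rank(d_d) - rank(d_{d+1})
    return [len(faces[d]) - boundary_rank(d) - boundary_rank(d + 1)
            for d in range(top_dim + 1)]

# Hollow triangle: one connected component, one 1-dimensional hole.
hollow = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(betti(hollow, top_dim=1))   # expected: [1, 1]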
II. Manifold learning is harder than partition learning. (By ``partition learning'' I mean clustering or classification.)

Conjecture: It is not possible, in general, to benefit from manifold learning as a feature-selection step prior to classification or clustering, unless strong assumptions are made about the manifold to be learned. In other words, the nonparametric approaches that seem to be popular with authors up to now, e.g. Isomap, LLE, Laplacian eigenmaps, are doomed to failure except on toy problems which happen to be favorable special cases.

(Implication: For manifold-based feature selection to be useful, one must either assume the underlying manifold belongs to a parametric family of manifolds and try to learn its parameters, as you suggested at our last meeting, or one must make some other strong assumption about the manifold, e.g. about its Betti numbers, though I am increasingly unconvinced that this is really the right thing to do. Perhaps one should assume something about the ``smoothness'' or ``curvature'' of the manifold, not just for short-circuit prevention, but in order to address the above problem. Before we can decide any of this in a fundamental and general way, we need to understand precisely what the problem is and what causes it.)

It is not immediately obvious how to formulate the conjecture rigorously, and I have not worked everything out; the intuition behind it is as follows. First, let us make the setting precise. We are given nothing but points and labels D_n = {(x_i, y_i) in XX times {0,1} : i = 1,...,n} (binary classification), or points D_n = {x_i in XX : i = 1,...,n} (binary clustering), and a metric ||.||_XX on XX (assume for concreteness that XX is euclidean). In both cases we assume there is an unknown joint measure mu on XX times {0,1} and that D_n is drawn iid according to mu as usual. The goal is to partition the space XX, i.e., to learn a map g_n : XX -> {0,1}, so as to minimize the risk R_n, defined by R_n = Pr[g_n(X) neq Y] (binary classification), or R_n = min(Pr[g_n(X) neq Y], Pr[g_n(X) = Y]) (binary clustering, where error is only judged up to flipping the classes), the random pair (X, Y) being distributed according to mu. In both cases we define R* = inf_g Pr[g(X) neq Y] and g* = arg inf_g Pr[g(X) neq Y].

Suppose we take the following approach to solving the problem. Let eta(x) = Pr[Y=1 | X=x], and let eta_n be a regression estimate of eta, based on D_n, that is weakly consistent, i.e. that satisfies lim_{n->infty} E[(eta_n(X) - eta(X))^2] = 0. (For clustering it is enough that eta_n be weakly consistent for either eta or its label-flipped counterpart 1 - eta, i.e. that lim_{n->infty} E[(eta_n(X) - eta(X))^2] = 0 or lim_{n->infty} E[(eta_n(X) - (1 - eta(X)))^2] = 0.) The partition is then the plug-in rule g_n(x) = I[eta_n(x) > 1/2].

Theorem [Devroye, Gyorfi, and Lugosi (DGL), A probabilistic theory of pattern recognition, Theorem 6.5]. With the above definitions,

                               E[R_n] - R*
    lim_{n -> infty}  --------------------------------  =  0.
                      sqrt(E[(eta_n(X) - eta(X))^2])

In other words, the excess risk of the partition converges to zero faster than the L_2 error of the regression estimate that the partition is based on. [DGL give the theorem only for classification; the definitions for the classification case above also follow their exposition.] This is not surprising, because in the partition problem we only need to worry about the decision boundary, not about most of XX.
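As a sanity check on how I read this theorem, here is a toy simulation, entirely my own construction: the posterior eta, the k-nearest-neighbor estimate, the sample sizes, and the Monte Carlo setup are arbitrary choices, not anything from DGL. It estimates the excess risk of the plug-in rule and the L_2 error of eta_n on a 1-d problem; if I have the statement right, the printed ratio should shrink as n grows.

import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # a smooth posterior on [0, 1], chosen arbitrarily; X ~ Uniform(0, 1)
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

def knn_eta(x_train, y_train, x_test, k):
    # brute-force k-nearest-neighbor regression estimate of eta
    d = np.abs(x_test[:, None] - x_train[None, :])
    idx = np.argsort(d, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

x_grid = np.linspace(0, 1, 1000)       # test points; grid average approximates E_X
eta_grid = eta(x_grid)
for n in (100, 1000, 5000):
    excess, l2 = [], []
    for _ in range(20):                # average over independent training sets
        x_tr = rng.random(n)
        y_tr = (rng.random(n) < eta(x_tr)).astype(float)
        eta_n = knn_eta(x_tr, y_tr, x_grid, k=int(np.sqrt(n)))
        wrong = (eta_n > 0.5) != (eta_grid > 0.5)    # where g_n disagrees with g*
        excess.append(np.mean(np.abs(2 * eta_grid - 1) * wrong))  # R_n - R*
        l2.append(np.mean((eta_n - eta_grid) ** 2))
    ratio = np.mean(excess) / np.sqrt(np.mean(l2))
    print("n=%5d   E[R_n]-R*=%.4f   E[(eta_n-eta)^2]=%.4f   ratio=%.3f"
          % (n, np.mean(excess), np.mean(l2), ratio))

Nothing here depends on the particular regression estimate; any weakly consistent eta_n could be dropped in.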
Now, the key step in manifold learning is to estimate a geodesic distance metric ||.||_MM on the manifold MM subset XX. (As a second step, we may try to find a space with a euclidean metric that approximates ||.||_MM and then project into that space; since this approximation step can only increase error in general, we ignore it for now.) How can we estimate ||.||_MM based solely on ||.||_XX and D_n? Here are some options:

1. Ignore the labels y_i if they are given.
   (a) Construct a k-nearest-neighbor graph G on the x_i in D_n, weighting each edge of G by the distance in XX between its endpoints. (A code sketch of this option appears at the end of this section.)
   (b) Let G be the graph with vertices {x_i}; if the ball B_eps(x_i) of radius eps intersects B_eps(x_j), add an edge between x_i and x_j with weight ||x_i - x_j||_XX.
   (c) Center an RBF with bandwidth eps above each point x_i in D_n, and let f : XX -> [0,1] be the kernel density estimate obtained by mixing these RBFs uniformly. Let G be the fully connected graph with vertices {x_i}, with the edge between x_i and x_j weighted by int_{line segment from x_i to x_j} f(x) dx.
   In each of (a)-(c), define ||x_i - x_j||_MM as the length of the shortest path in G between x_i and x_j.
2. Let ||x_i||_MM = |g_n(x_i)|, where g_n is a k-nearest-neighbor classifier, or g_n is determined by a weighted majority vote among the training data, weighted as in 1(a)-(c).
3. Let ||.||_MM be ||.||_XX.

Options 1(a) and 1(b) are standard; option 1(c) is a little strange, but along the same lines, just smoothed. Options 2 and 3 are unorthodox as far as I know. In a classification setting, it seems option 2 should lead to lower error in general than option 1, because it takes the same ``approach'' to the data as option 1 but additionally uses the labels.

Questions and beliefs: Can all of options 1, 2, and 3 be made to fit into DGL's theorem above? My belief is that for any map from ||.||_XX to ||.||_MM which fits into DGL's theorem, we can find a partition function of XX based on ||.||_XX (and D_n) alone, without worrying about MM at all; and that if the map from ||.||_XX to ||.||_MM cannot be made to fit into DGL's theorem, i.e., it is not weakly consistent with respect to the true a posteriori probability function eta(X), then using ||.||_MM instead of ||.||_XX cannot help us with the partition task.

I realize that this section is not completely coherent and that some of it is likely incorrect. I do think that something like the above must be true, though.
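Here is the sketch promised under option 1(a): a rough, brute-force version of the geodesic-estimation step (essentially what Isomap does before its embedding step). The function name, the default k, and the circle example are my own choices for illustration; only numpy is assumed.

import numpy as np

def geodesic_distances(x, k=10):
    """x: (n, d) array of samples; returns the (n, n) matrix of graph geodesics."""
    n = len(x)
    d_xx = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))   # ||.||_XX
    w = np.full((n, n), np.inf)
    np.fill_diagonal(w, 0.0)
    nn = np.argsort(d_xx, axis=1)[:, 1:k + 1]    # k nearest neighbors, excluding self
    for i in range(n):
        w[i, nn[i]] = d_xx[i, nn[i]]
        w[nn[i], i] = d_xx[i, nn[i]]             # symmetrize the kNN graph
    for m in range(n):                           # Floyd-Warshall shortest paths
        w = np.minimum(w, w[:, m:m + 1] + w[m:m + 1, :])
    return w                                     # np.inf marks disconnected pairs

# Example: on a sample from the unit circle, the graph geodesic between two
# nearly antipodal points approaches half the circumference (about pi),
# whereas ||.||_XX between them is close to 2.
rng = np.random.default_rng(1)
theta = np.sort(rng.uniform(0, 2 * np.pi, 400))
circle = np.c_[np.cos(theta), np.sin(theta)]
print(geodesic_distances(circle, k=8)[0, 200],
      np.linalg.norm(circle[0] - circle[200]))

Options 1(b) and 1(c) differ only in how the graph G and its edge weights are built; the shortest-path step at the end is the same.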