Here is an overview, in two sections, of what I've been working on, so we have something to discuss. As you will see, both sections stem from what we discussed at our last meeting, but they go in almost completely different directions.

I. Simplicial complexes.

Problem. Given a list of candidate Betti numbers beta = {beta_i in NN : i = 0,...,n}, characterize the preimage of beta under the ``Betti operator'' BET, which sends an abstract simplicial complex to its sequence of Betti numbers; here KK denotes the collection of all abstract simplicial complexes and NN the nonnegative integers, so BET maps KK into finite sequences from NN. (This can be thought of as a special case of the problem I described at our last meeting when all pairwise distances between vertices are equal. A small code sketch of BET appears at the end of this section.) In particular, at least four subproblems seem of interest:

(i) Give an algorithm to enumerate the elements of the preimage.
(ii) Better: Give a simple formula that specifies the elements of the preimage in terms of beta.
(iii) Give an efficient algorithm to find the ``smallest'' element of the preimage, i.e. the one with the fewest i-simplices, i = 0,...,n.
(iv) Better: Give a simple formula for (iii).

It is easy to give an algorithm for (i), but it will require exponential running time, because the number of elements to be enumerated grows exponentially. I have been working intensively on problems (ii)-(iv), and related issues, for the past few weeks. A slight modification (add backtracking) of the algorithm given in my previous email to you solves (iii), but requires worst-case exponential running time. I believe (the details are not 100% worked out) I have mostly obtained an efficient algorithm for (iii), along with several interesting combinatorial facts needed for the proof. Unfortunately, although I had searched the literature earlier, only this past weekend did I find Bjoerner and Kalai, ``An extended Euler-Poincare theorem'', Acta Mathematica, 1988, which solves (ii) and (iv), along with many other problems; their solution to (iv) seems to lead easily to an efficient algorithm for (iii). So it seems I have wasted, at least from a short-term perspective, a large amount of time by not finding that article at the outset. (It uses terminology I was not familiar with and did not search for.)

[One positive thing: I had worried that the above was not of much interest; in fact last week I wrote, and then abandoned, a note to you describing my progress, with the subject line ``Small problems for small minds''. But the authors of the article above are quite well known, and they seem proud of their result (e.g., it is listed as one of a few ``representative publications''), so at least I found a problem that is interesting to someone besides me, though unfortunately it was of interest twenty years ago.]

Open(?) and interesting(?) problems.

(a) It could be of interest to solve (iii) and (iv) with a different definition of ``smallest'', e.g., when the 0-simplices are taken to be elements of a metric space and the ``bigness'' of a simplicial complex K with i-simplices K_i, i = 0,...,n, is given by

  sum_i sum_{s in K_i} (mean pairwise distance between vertices of s).

This will be much harder, though; I doubt that a polynomial-time algorithm (or a simple closed-form characterization) exists.

(b) The combinatorial results about the preimage of BET could be used to give a hardness result of the type I describe below in section II, but for a discrete space Omega with uniform pairwise distances between its elements. Initial problem: given samples from Omega, partition Omega. New problem: let K_0 be a subset of Omega whose elements are the 0-simplices of a simplicial complex K with Betti numbers beta; given beta and samples from K_0, partition Omega. Question: is the new problem easier, in a sense to be made precise, than the initial problem?
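Here is the code sketch of BET promised above. It is a minimal sketch of my own (nothing to do with the Bjoerner-Kalai paper): it computes Betti numbers over GF(2), so torsion is ignored, via ranks of the boundary matrices, using beta_i = dim ker d_i - dim im d_{i+1}. The function names and the hollow-triangle example are just for illustration, and the input is assumed to already be closed under taking subsets.

def gf2_rank(bitrows):
    """Rank over GF(2) of a matrix whose rows are given as Python int bitmasks."""
    basis = {}                              # leading-bit position -> reduced row
    for row in bitrows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]
    return len(basis)

def betti(complex_faces, top_dim):
    """Betti numbers beta_0,...,beta_top_dim over GF(2) of an abstract simplicial
    complex, given as an iterable of faces (tuples of vertex labels) that is
    closed under taking subsets."""
    all_faces = [tuple(sorted(f)) for f in complex_faces]
    faces = {d: sorted({f for f in all_faces if len(f) == d + 1})
             for d in range(top_dim + 2)}
    index = {d: {f: i for i, f in enumerate(faces[d])} for d in faces}

    def boundary_rank(d):
        """Rank of the boundary map d_d : C_d -> C_{d-1} over GF(2)."""
        if d == 0 or not faces[d]:
            return 0
        rows = []
        for f in faces[d]:
            mask = 0
            for omit in range(len(f)):          # drop one vertex at a time
                mask |= 1 << index[d - 1][f[:omit] + f[omit + 1:]]
            rows.append(mask)
        return gf2_rank(rows)

    # beta_d = dim C_d - rank(d_d) - rank(d_{d+1})
    return [len(faces[d]) - boundary_rank(d) - boundary_rank(d + 1)
            for d in range(top_dim + 1)]

# Hollow triangle: one connected component, one 1-dimensional hole.
hollow = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(betti(hollow, top_dim=1))   # expected: [1, 1]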
II. Manifold learning is harder than partition learning. (By ``partition learning'' I mean clustering or classification.)

Conjecture: It is not possible, in general, to benefit from manifold learning as a feature-selection step prior to classification or clustering, unless strong assumptions are made about the manifold to be learned. In other words, the nonparametric approaches that seem to be popular with authors up to now, e.g. Isomap, LLE, Laplacian eigenmaps, are doomed to failure except on toy problems which happen to be favorable special cases.

(Implication: For manifold-based feature selection to be useful, one must either assume the underlying manifold belongs to a parametric family of manifolds and try to learn its parameters, as you suggested at our last meeting, or one must make some other strong assumption about the manifold, e.g. about its Betti numbers, though I am increasingly unconvinced that this is really the right thing to do. Perhaps one should assume something about the ``smoothness'' or ``curvature'' of the manifold, not just for short-circuit prevention, but in order to address the above problem. Before we can decide any of this in a fundamental and general way, we need to understand precisely what the problem is and what causes it.)

It is not immediately obvious how to formulate the conjecture rigorously, and I have not worked everything out; the intuition behind it is as follows. First, let us make the setting precise. We are given nothing but points and labels D_n = {(x_i, y_i) in XX times {0,1} : i = 1,...,n} (binary classification), or points D_n = {x_i in XX : i = 1,...,n} (binary clustering), and a metric ||.||_XX on XX (assume for concreteness that XX is euclidean). In both cases we assume there is an unknown joint measure mu on XX times {0,1} and that D_n is drawn iid according to mu as usual. The goal is to partition the space XX, i.e., to learn a map g_n : XX -> {0,1}, so as to minimize the risk R_n, defined by R_n = Pr[g_n(X) neq Y] (binary classification), or R_n = min(Pr[g_n(X) neq Y], Pr[g_n(X) = Y]) (binary clustering, where error is only judged up to flipping the classes), the random pair (X, Y) being distributed according to mu. In both cases we define R* = inf_g Pr[g(X) neq Y] and g* = arg inf_g Pr[g(X) neq Y].

Suppose we take the following approach to solving the problem. Let eta(x) = Pr[Y=1 | X=x], and let eta_n be a regression estimate of eta, based on D_n, that is weakly consistent, i.e. that satisfies lim_{n->infty} E[(eta_n(X) - eta(X))^2] = 0. (For clustering it is enough that eta_n be weakly consistent for either eta or its label-flipped counterpart 1 - eta, i.e. that lim_{n->infty} E[(eta_n(X) - eta(X))^2] = 0 or lim_{n->infty} E[(eta_n(X) - (1 - eta(X)))^2] = 0.) The partition is then the plug-in rule g_n(x) = I[eta_n(x) > 1/2].

Theorem [Devroye, Gyorfi, and Lugosi (DGL), A probabilistic theory of pattern recognition, Theorem 6.5]. With the above definitions,

                               E[R_n] - R*
    lim_{n -> infty}  --------------------------------  =  0.
                      sqrt(E[(eta_n(X) - eta(X))^2])

In other words, the excess risk of the partition converges to zero faster than the L_2 error of the regression estimate that the partition is based on. [DGL give the theorem only for classification; the definitions for the classification case above also follow their exposition.] This is not surprising, because in the partition problem we only need to worry about the decision boundary, not about most of XX.
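As a sanity check on how I read this theorem, here is a toy simulation, entirely my own construction: the posterior eta, the k-nearest-neighbor estimate, the sample sizes, and the Monte Carlo setup are arbitrary choices, not anything from DGL. It estimates the excess risk of the plug-in rule and the L_2 error of eta_n on a 1-d problem; if I have the statement right, the printed ratio should shrink as n grows.

import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # a smooth posterior on [0, 1], chosen arbitrarily; X ~ Uniform(0, 1)
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

def knn_eta(x_train, y_train, x_test, k):
    # brute-force k-nearest-neighbor regression estimate of eta
    d = np.abs(x_test[:, None] - x_train[None, :])
    idx = np.argsort(d, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

x_grid = np.linspace(0, 1, 1000)       # test points; grid average approximates E_X
eta_grid = eta(x_grid)
for n in (100, 1000, 5000):
    excess, l2 = [], []
    for _ in range(20):                # average over independent training sets
        x_tr = rng.random(n)
        y_tr = (rng.random(n) < eta(x_tr)).astype(float)
        eta_n = knn_eta(x_tr, y_tr, x_grid, k=int(np.sqrt(n)))
        wrong = (eta_n > 0.5) != (eta_grid > 0.5)    # where g_n disagrees with g*
        excess.append(np.mean(np.abs(2 * eta_grid - 1) * wrong))  # R_n - R*
        l2.append(np.mean((eta_n - eta_grid) ** 2))
    ratio = np.mean(excess) / np.sqrt(np.mean(l2))
    print("n=%5d   E[R_n]-R*=%.4f   E[(eta_n-eta)^2]=%.4f   ratio=%.3f"
          % (n, np.mean(excess), np.mean(l2), ratio))

Nothing here depends on the particular regression estimate; any weakly consistent eta_n could be dropped in.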
Now, the key step in manifold learning is to estimate a geodesic distance metric ||.||_MM on the manifold MM subset XX. (As a second step, we may try to find a space with a euclidean metric that approximates ||.||_MM and then project into that space; since this approximation step can only increase error in general, we ignore it for now.) How can we estimate ||.||_MM based solely on ||.||_XX and D_n? Here are some options:

1. Ignore the labels y_i if they are given.
   (a) Construct a k-nearest-neighbor graph G on the x_i in D_n, weighting each edge of G by the distance in XX between its endpoints. (A code sketch of this option appears at the end of this section.)
   (b) Let G be the graph with vertices {x_i}; if the ball B_eps(x_i) of radius eps intersects B_eps(x_j), add an edge between x_i and x_j with weight ||x_i - x_j||_XX.
   (c) Center an RBF with bandwidth eps above each point x_i in D_n, and let f : XX -> [0,1] be the kernel density estimate obtained by mixing these RBFs uniformly. Let G be the fully connected graph with vertices {x_i}, with the edge between x_i and x_j weighted by int_{line segment from x_i to x_j} f(x) dx.
   In each of (a)-(c), define ||x_i - x_j||_MM as the length of the shortest path in G between x_i and x_j.
2. Let ||x_i||_MM = |g_n(x_i)|, where g_n is a k-nearest-neighbor classifier, or g_n is determined by a weighted majority vote among the training data, weighted as in 1(a)-(c).
3. Let ||.||_MM be ||.||_XX.

Options 1(a) and 1(b) are standard; option 1(c) is a little strange, but along the same lines, just smoothed. Options 2 and 3 are unorthodox as far as I know. In a classification setting, it seems option 2 should lead to lower error in general than option 1, because it takes the same ``approach'' to the data as option 1 but additionally uses the labels.

Questions and beliefs: Can all of options 1, 2, and 3 be made to fit into DGL's theorem above? My belief is that for any map from ||.||_XX to ||.||_MM which fits into DGL's theorem, we can find a partition function of XX based on ||.||_XX (and D_n) alone, without worrying about MM at all; and that if the map from ||.||_XX to ||.||_MM cannot be made to fit into DGL's theorem, i.e., it is not weakly consistent with respect to the true a posteriori probability function eta(X), then using ||.||_MM instead of ||.||_XX cannot help us with the partition task.

I realize that this section is not completely coherent and that some of it is likely incorrect. I do think that something like the above must be true, though.
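Here is the sketch promised under option 1(a): a rough, brute-force version of the geodesic-estimation step (essentially what Isomap does before its embedding step). The function name, the default k, and the circle example are my own choices for illustration; only numpy is assumed.

import numpy as np

def geodesic_distances(x, k=10):
    """x: (n, d) array of samples; returns the (n, n) matrix of graph geodesics."""
    n = len(x)
    d_xx = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))   # ||.||_XX
    w = np.full((n, n), np.inf)
    np.fill_diagonal(w, 0.0)
    nn = np.argsort(d_xx, axis=1)[:, 1:k + 1]    # k nearest neighbors, excluding self
    for i in range(n):
        w[i, nn[i]] = d_xx[i, nn[i]]
        w[nn[i], i] = d_xx[i, nn[i]]             # symmetrize the kNN graph
    for m in range(n):                           # Floyd-Warshall shortest paths
        w = np.minimum(w, w[:, m:m + 1] + w[m:m + 1, :])
    return w                                     # np.inf marks disconnected pairs

# Example: on a sample from the unit circle, the graph geodesic between two
# nearly antipodal points approaches half the circumference (about pi),
# whereas ||.||_XX between them is close to 2.
rng = np.random.default_rng(1)
theta = np.sort(rng.uniform(0, 2 * np.pi, 400))
circle = np.c_[np.cos(theta), np.sin(theta)]
print(geodesic_distances(circle, k=8)[0, 200],
      np.linalg.norm(circle[0] - circle[200]))

Options 1(b) and 1(c) differ only in how the graph G and its edge weights are built; the shortest-path step at the end is the same.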