Misha's Homepage

The foundation model (FM) paradigm—pretraining a large model on massive data and then adapting it to downstream tasks—transformed vision and language, and is now spreading rapidly to specialized domains across the sciences, engineering, and healthcare. In natural language processing, the pretraining of the BERT model heralded the FM paradigm shift because fine-tuning it decisively outperformed models trained from scratch; after its release, supervised learning was rarely viewed as a reasonable baseline for text tasks. This raises the same question for the newer domains: have they had their own BERT moment, i.e. do foundation models outperform the traditional supervised-learning workflow of task-specific model development, tuning, and training? Answering it fairly requires comparing FMs not against weak defaults but against strong, carefully tuned baselines that use only data from the target task.

We study three specialized modalities—genomics, satellite imaging, and time series—pitting a range of recent domain-specific FMs against a standard supervised pipeline of model development, hyperparameter tuning, and training on the target task alone; we automate the model development component while staying below the fine-tuning budget by using the lightweight architecture search method DASH. Across all three domains, we find it is consistently possible to train simple supervised models—no more complicated than a lightly modified wide ResNet or U-Net—that match or even outperform the latest foundation models. To make such comparisons easy and reproducible for others, we release two automated, open-source workflows, DASHA and Auto-AR, for building strong baselines with minimal effort.

The takeaway from this work was not that pretraining is hopeless in these domains, but that its promised benefits had yet to be realized, and that the fields must evaluate new FMs on benchmarks that properly assess their quality and against baselines respected by practitioners. For example, in the year following our publication, time series has arguably had its BERT moment due to the latest generation of models such as Toto, whose development was supported by more rigorous new benchmarks such as GIFT-Eval. This is despite the fact that time series was the setting of perhaps the most dramatic result in our paper, in which multi-million-parameter FMs barely matched a 513-parameter basic linear auto-regression (AR).
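To give a sense of how small such a baseline is, here is a minimal sketch of a linear auto-regression fit by least squares. It is not the Auto-AR workflow itself; the function names and the least-squares fitting choice are assumptions made for illustration. With a lookback of 512 plus an intercept the model has 513 parameters, which matches the count quoted above, though whether that is the exact configuration used in the paper is also an assumption.

```python
import numpy as np

def fit_linear_ar(series, lookback):
    """Least-squares fit of y_t = b + sum_k w_k * y_{t-k}; returns lookback + 1 parameters."""
    series = np.asarray(series, dtype=float)
    # Row i holds series[i : i + lookback] (oldest to newest); its target is series[i + lookback].
    windows = np.lib.stride_tricks.sliding_window_view(series[:-1], lookback)
    X = np.hstack([windows, np.ones((windows.shape[0], 1))])  # last column is the intercept
    y = series[lookback:]
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    return params

def forecast(series, params, horizon):
    """Roll the fitted AR model forward, feeding each prediction back in as input."""
    lookback = len(params) - 1
    history = list(np.asarray(series, dtype=float)[-lookback:])
    preds = []
    for _ in range(horizon):
        nxt = float(np.dot(params[:-1], history[-lookback:]) + params[-1])
        preds.append(nxt)
        history.append(nxt)
    return np.array(preds)
```

Calling `fit_linear_ar(series, lookback=512)` on a series longer than 512 points returns a vector of 513 parameters, and `forecast` rolls it forward for any horizon.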

The field of algorithms with predictions (a.k.a. learning-augmented algorithms) designs algorithms whose performance (e.g. runtime) improves with a good prediction or hint about the instance they are run on. Our work develops a systematic way to answer a crucial question in this area: where do the predictions come from? As the alternative name suggests, predictions often come from learning, but prior to our work the question of learnability—i.e. whether predictions can be learned with polynomially many instances or sublinear regret—had been under-addressed. We show that for many algorithms with predictions, existing cost functions that are challenging to analyze can be relaxed into surrogate bounds that are easy to online-learn, all while incurring only a small penalty.

As a consequence, we show dramatically improved learning-theoretic results for several graph algorithms; e.g. for minimum-weight bipartite matching with predictions on graphs with at most n nodes, our sample complexity guarantee is O(n²) times better than the previous best bounds. We also study several online algorithms and show the first learning guarantees for learning-augmented caching and online page migration. In most cases, our results also extend to learning linear models from instance features to predictions, or linear auto-regressive models from past states to actions in online algorithms.

Overall, our relaxation-based approach suggests that learning-augmented guarantees might best be viewed as surrogate algorithmic losses. Just as we never optimize the actual 0-1 error when training a binary classifier, preferring instead a convex objective like the log-loss, for data-driven algorithms we can also get away with optimizing a nice surrogate function rather than the actual cost of the algorithm, which is rarely convex or even continuous. Since its publication, the ideas in our work have directly inspired learning-theoretic guarantees for several problems in learning-augmented discrete convex optimization. In my own work, we have used a similar approach to design algorithms for learning to release differentially private statistics (e.g. quantiles or covariance estimates) across related datasets.
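The classification analogy can be checked numerically. The sketch below only illustrates the 0-1 versus log-loss relationship, not any of the algorithmic surrogates from the paper; the function names are chosen for illustration. It uses the base-2 logistic loss, which upper-bounds the 0-1 error at every margin while being convex and smooth.

```python
import numpy as np

def zero_one_loss(margin):
    """1 if the classifier is wrong (margin <= 0), else 0; discontinuous and non-convex."""
    return (np.asarray(margin) <= 0).astype(float)

def log_loss(margin):
    """Base-2 logistic loss log2(1 + exp(-margin)), computed stably with logaddexp."""
    return np.logaddexp(0.0, -np.asarray(margin, dtype=float)) / np.log(2.0)

# The surrogate dominates the true error on a grid of margins
# (a tiny tolerance absorbs rounding at margin = 0, where both equal 1).
margins = np.linspace(-3.0, 3.0, 13)
gap = log_loss(margins) - zero_one_loss(margins)
print(bool(np.all(gap >= -1e-12)))
```

Minimizing the surrogate therefore also controls the quantity we actually care about, which is the role the relaxed bounds play for algorithmic costs.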

In federated learning (FL), the goal is to train a model on the data of a heterogeneous network of devices without sending all their data to a central server. Among other complications, this poses many challenges for a crucial aspect of ML: hyperparameter tuning. In particular, we often can't even compute a good estimate of validation performance because devices are often unavailable, and the high cost of training on weak devices makes the multiple runs required by standard approaches such as random search exorbitantly expensive. Can we tune hyperparameters in only a few training runs while making good use of noisy validation data?

We propose FedEx, a method that tackles this problem for local hyperparameters, a subset of all hyperparameters that arises because of the local training approach used by most federated optimizers such as FedAvg. At each communication round, such optimizers run identically initialized local SGD on each device in a batch, and then move the initialization closer to some aggregation (e.g. the average) of the last iterates. FedEx works by running these local SGD algorithms with different randomly sampled hyperparameters on each device and using the last iterates' performances on device validation data as signal to update the hyperparameter distribution.
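The loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the released FedEx implementation: the only local hyperparameter is the step size, devices hold small least-squares problems, the distribution is categorical over a fixed list of candidates, and it is updated with an importance-weighted exponentiated-gradient step (an assumed choice of update rule). All function and dictionary-key names are hypothetical.

```python
import numpy as np

def local_sgd(w, X, y, step, epochs=5):
    """A few full-batch gradient steps on squared loss, starting from the shared initialization."""
    w = w.copy()
    for _ in range(epochs):
        w -= step * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def fedex_sketch(devices, step_sizes, rounds=50, batch=5, server_lr=1.0, dist_lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(devices[0]["X_train"].shape[1])
    logits = np.zeros(len(step_sizes))  # log-weights of the hyperparameter distribution
    for _ in range(rounds):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        iterates, losses, picks = [], [], []
        for d in rng.choice(len(devices), size=batch, replace=False):
            dev = devices[d]
            k = rng.choice(len(step_sizes), p=probs)  # each device samples its own step size
            w_local = local_sgd(w, dev["X_train"], dev["y_train"], step_sizes[k])
            iterates.append(w_local)
            losses.append(mse(w_local, dev["X_val"], dev["y_val"]))  # noisy validation signal
            picks.append(k)
        # Move the shared initialization toward the average of the last iterates.
        w = w + server_lr * (np.mean(iterates, axis=0) - w)
        # Importance-weighted loss estimate with a mean baseline, then a multiplicative update.
        baseline = np.mean(losses)
        grad = np.zeros(len(step_sizes))
        for k, loss in zip(picks, losses):
            grad[k] += (loss - baseline) / (probs[k] * batch)
        logits -= dist_lr * grad
    probs = np.exp(logits - logits.max())
    return w, probs / probs.sum()
```

Here `devices` is a list of dictionaries with `X_train`, `y_train`, `X_val`, and `y_val` arrays. Candidate step sizes should be small enough for local training to stay stable, since this sketch does not guard against divergence. The point of the design is that the model and the hyperparameter distribution are tuned within a single training run.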

Drawing upon the connection between (personalized) FL and meta-learning, we use ARUBA to show that FedEx provably finds a good local step-size in the convex setting when devices have sufficiently similar data. Empirically, our method is applicable to any local hyperparameter, and we find that it consistently improves on the performance of vanilla tuning across several standard FL tasks. More recently, the utility of FedEx has been independently confirmed by the authors of FedHPO-Bench, a benchmark dedicated to hyperparameter tuning in FL; they show that applying our method is beneficial in 11 out of 12 evaluation settings.

This paper introduces a framework called ARUBA (Average Regret-Upper-Bound Analysis) that synthesizes ideas from our previous work, which showed some of the first provable guarantees for gradient-based meta-learning methods, into an algorithm design framework for learning across multiple tasks, a crucial component of methods in areas such as federated and few-shot learning. The key idea is to optimize a well-designed surrogate bound on the task-averaged regret, the main performance measure in online learning, yielding algorithms that improve upon the single-task baseline if the learning tasks are similar (e.g. if their model parameters are close in Euclidean distance) while not being much worse if they are not. Crucially, the approach is adaptive in that the extent of the task similarity doesn't need to be known beforehand, and it is applicable to numerous settings, including dynamically evolving and non-Euclidean task distributions.
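A toy version of this idea, for learning only the initialization of online gradient descent (OGD), might look as follows. It is a sketch under simplifying assumptions rather than the algorithm from the paper: each task has a fixed quadratic loss, the task optimum is observed in hindsight, and the step size is fixed. The surrogate here is the distance term of the standard OGD regret bound, ||θ* − φ||²/(2η) + ηG²m/2 for m rounds with G-Lipschitz losses, and follow-the-leader on that term sets the next initialization φ to the mean of past optima.

```python
import numpy as np

def run_task(init, optimum, steps=20, lr=0.1):
    """OGD on the fixed loss 0.5 * ||theta - optimum||^2; returns the cumulative loss."""
    theta, total = init.copy(), 0.0
    for _ in range(steps):
        total += 0.5 * float(np.sum((theta - optimum) ** 2))
        theta -= lr * (theta - optimum)
    return total

def meta_learned_inits(optima):
    """Follow-the-leader on sum_s ||optimum_s - phi||^2: phi_t is the mean of the past optima."""
    inits = [np.zeros(optima.shape[1])]  # nothing has been seen before the first task
    for t in range(1, len(optima)):
        inits.append(optima[:t].mean(axis=0))
    return np.array(inits)

def task_averaged_loss(optima, inits, steps=20, lr=0.1):
    return float(np.mean([run_task(phi, opt, steps, lr) for phi, opt in zip(inits, optima)]))

# Similar tasks: optima clustered around a common point far from the origin.
rng = np.random.default_rng(0)
optima = 3.0 + 0.1 * rng.normal(size=(30, 4))
meta = task_averaged_loss(optima, meta_learned_inits(optima))
single_task = task_averaged_loss(optima, np.zeros_like(optima))
print(meta < single_task)
```

When the optima are close together the learned initialization pays roughly their spread rather than their distance from the origin; when they are scattered, the mean is still no worse than a constant factor away from any fixed initialization in hindsight.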

In the past few years, many works have built upon ARUBA to show guarantees in numerous settings beyond gradient-based learning, including private, discontinuous, multi-agent, reinforcement, and bandit meta-learning, as well as for federated hyperparameter optimization and meta-learning in games. In each case, ARUBA yields custom meta-algorithms that induce setting-specific notions of task-similarity; for example, in the last instance, tasks (games) are similar if they have nearby equilibria. Because our approach also often yields efficient meta-algorithms, in many of these papers the theory is used to design practically useful methods, as in our work on FedEx.

On a technical note, while showing how to simultaneously learn to initialize and pre-condition OGD, we showed a result of independent interest in online learning by generalizing the classic logarithmic regret bound of follow-the-leader on sequences of quadratics to sequences of Bregman divergences with adversarially chosen first arguments. Notably, this holds even when the Bregman divergence (and thus the function sequence) is non-convex in the second argument, which is (e.g.) true of the Tsallis divergence used to get optimal rates for multi-armed bandits. Bregman divergences appear frequently in guarantees for mirror descent, an important algorithm for (among other things) equilibrium-finding, and so our result has since been used to show guarantees for meta-learning in games and multi-agent RL. The cleanest (in my view) statement of the result can be found in Lemma A.1 here.
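One ingredient that makes follow-the-leader tractable in this setting is a well-known property of Bregman divergences: the sum of divergences from a set of first arguments is minimized over the second argument by their mean, even when the divergence is not convex in that argument. The sketch below checks this numerically for the Itakura-Saito divergence (generated by φ(x) = −log x), chosen here only as a simple non-convex example; it is not the Tsallis case discussed above, and the grid search is purely illustrative.

```python
import numpy as np

def itakura_saito(x, y):
    """Bregman divergence of phi(x) = -log(x); not convex in its second argument y."""
    return x / y - np.log(x / y) - 1.0

def leader(first_args, grid):
    """Grid-search minimizer over y of sum_t itakura_saito(x_t, y)."""
    first_args = np.asarray(first_args, dtype=float)
    totals = itakura_saito(first_args[:, None], grid[None, :]).sum(axis=0)
    return float(grid[np.argmin(totals)])

xs = np.array([0.5, 1.0, 4.0, 2.5])        # first arguments, possibly chosen adversarially
grid = np.linspace(0.1, 10.0, 9901)        # spacing of 0.001
print(leader(xs, grid), float(xs.mean()))  # the two values agree up to the grid spacing
```

So the leader at each round is simply the running average of the first arguments seen so far, exactly as in the quadratic case.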

Self-supervised learning is a key driver of the recent success of large-scale pretraining. An important approach—used in e.g. the ten-year-old word2vec and the more recent SimCLR—is contrastive learning, in which the objective is minimized when the representations of any "positive" pair of similar inputs are close together while those of any random or "negative" pair are far apart. Here "similarity" is determined by the self-supervision, e.g. sentences are similar if they appear next to each other in a corpus. Why does such an approach lead to useful representations for downstream tasks such as classification?

We study this problem via a generative model with an underlying distribution across classes from which both downstream tasks and positive/negative pairs are sampled. For example, in topic modeling, sentences from the same document are likely to be in the same class in a classification task. Informally, the paper shows that for any (e.g. neural) representation, the task-averaged loss of the best linear classifier is bounded by the contrastive pretraining objective, which pushes inner products of representations of positive pairs to be large and those of negative pairs to be small. This implies a bound on the task-averaged risk consisting of the error of a linear classifier on a (small) number of downstream samples plus the (Rademacher) complexity of the class of representations divided by the (large) number of unsupervised samples; this will be much smaller than the bound for training the full model (both the representation and the linear head) on the task directly, demonstrating the utility of contrastive learning.
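For concreteness, one common form of such an objective is the logistic loss on the gap between the positive and negative inner products, with a single negative per example. The sketch below assumes that form; the exact loss and the function name are illustrative choices rather than something specified in the summary above.

```python
import numpy as np

def contrastive_logistic_loss(f_x, f_pos, f_neg):
    """Mean of log(1 + exp(-(f(x)·f(x+) - f(x)·f(x-)))) over a batch of (x, x+, x-) triples.

    Each argument is an array of shape (batch, dim) holding representations.
    """
    pos_scores = np.sum(f_x * f_pos, axis=1)  # inner products with the similar inputs
    neg_scores = np.sum(f_x * f_neg, axis=1)  # inner products with the random inputs
    return float(np.mean(np.logaddexp(0.0, -(pos_scores - neg_scores))))
```

The loss shrinks as positive pairs become more aligned than negative ones, and equals log 2 when the representation cannot tell them apart.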

The approach of simultaneously modeling the pretraining and downstream tasks has since been used in other efforts to understand the success of large-scale unsupervised pretraining, such as why next-word prediction is such a useful self-supervision signal. However, Nikunj and his co-authors have also shown that a full understanding will only be possible by looking at the inductive biases of the representation class and pretraining algorithm, which our paper did not consider.

Word embeddings such as word2vec and GloVe have been remarkably successful as semantic representations of text and were in many ways the forerunners of modern self-supervised language modeling. One drawback, however, is that the vocabulary is fixed: what if we are given a sentence with a word that wasn't in the text corpus used to train the embeddings, or did not occur enough times to be included in the training objective? Can we still use the provided context to compute a good embedding for that word?

A simple baseline is to just use what is called the context embedding: the sum of the embeddings of the words in the target word's context. However, this still results in the closest words to "cutting-edge" being "cut" and "edges," whereas we might want them to be something like "innovative" and "technology." We came up with a simple alternative called à la carte embedding that uses linear regression on the original corpus to compute a matrix mapping each word's context embeddings to its original embedding. Then, when faced with an unseen word, we can compute its context embedding, apply this linear transform, and get what turns out to be a quite decent vector for the new word, as evaluated on several tasks requiring embeddings of rare words (or other text features).
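The two steps above, fitting the linear map and then applying it to an unseen word, can be sketched in a few lines. This is an illustrative reimplementation, not the released à la carte code; the function names are hypothetical, and `embeddings` is assumed to be a dictionary from words to vectors.

```python
import numpy as np

def context_embedding(context_words, embeddings):
    """Sum of the embeddings of the in-vocabulary words in a target word's context."""
    vecs = [embeddings[w] for w in context_words if w in embeddings]
    if not vecs:
        raise ValueError("no context word is in the vocabulary")
    return np.sum(vecs, axis=0)

def fit_transform_matrix(context_embs, word_embs):
    """Least-squares matrix A minimizing sum_w ||word_embs[w] - A @ context_embs[w]||^2.

    Both inputs have shape (num_words, dim), with one row per word in the original corpus.
    """
    A_T, *_ = np.linalg.lstsq(context_embs, word_embs, rcond=None)  # context_embs @ A.T ≈ word_embs
    return A_T.T

def embed_new_word(context_words, embeddings, A):
    """Induce a vector for an unseen word from the words around it."""
    return A @ context_embedding(context_words, embeddings)
```

Because the regression is fit once on words whose embeddings are already known, inducing a vector for a new word costs only one sum and one matrix-vector product.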

While language technology has advanced rapidly beyond word embeddings since 2018, the simplicity of à la carte and its ability to (sample-efficiently) induce the meaning of words from (specific) contexts has led to its continued use in statistical text analysis. In particular, due to the work of Pedro Rodriguez, Arthur Spirling, Brandon Stewart, and Elisa Wirsching, the regression-based approach has found significant application in computational social science, where it can be used to compare word meanings across different populations or time periods. For example, it has been used to study whether protest influences political speech in the UK and to analyze how judicial investigations affect news coverage in Kenya.