I am a Professor of Statistics at the University of Wisconsin–Madison, with courtesy appointments in the School of Journalism and Mass Communication, Electrical & Computer Engineering, and Educational Psychology. I’m also part of a broad and inclusive community of machine learning researchers at UW–Madison and affiliate faculty for IFDS. Previously, I was an Associate Editor at JASA and JRSS-B.

I like making things for statisticians out of ideas that are very old (PCA and varimax) and very new (Large Language Models). I have lots to say about LLMs and how we (scientists/statisticians/academics) can use them productively, but nothing public yet.

100 years of data analysis: from Experiments to Embeddings.

My work is motivated by the changing nature of data analysis. The midcentury-modern era of Academic Statistics was created primarily to support experimentalists (e.g. Fisher in agriculture, hypothesis testing, randomization, causal inference, small samples). In contrast, in the era of Data Science, we do not necessarily generate data by running experiments. Instead, we clean and combine data sources that were previously collected and shared, sometimes for entirely other purposes (e.g. accounting). This is the “post-modern era” of data analysis, with new challenges and new opportunities.

We made a new name for it, and rightly so. It’s called Data Science… Here is what I tell students… the internet has made it super easy to share both data and software. This enables a rich web of dependencies in both data and software, and data science is an emergent property of that rich network. The full story is at the link above.

In this new era, the machine learning community has found great use for embeddings, which represent the things you want to analyze (e.g. words / sentences / documents / DNA / proteins / images / videos / graphs / click-streams / etc.) as high-dimensional vectors. Neural networks can then do linear algebra on these vectors; this is the linear-algebraic logic of neural networks. But linear algebra is not special to neural networks; it is at the heart of so many statistical ideas.
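To make that linear-algebraic logic concrete, here is a minimal sketch in Python. The vectors are made up for illustration (real embedding models produce hundreds or thousands of dimensions, not four); the point is only that once things are vectors, “similarity” becomes an inner product:

```python
import numpy as np

# Toy, made-up vectors (not the output of any real model): each word
# is a point in a high-dimensional space. Four dimensions stand in
# for the hundreds or thousands an actual embedding model would use.
embeddings = {
    "cat":   np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.4, 0.1]),
    "stock": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine_similarity(u, v):
    # The linear-algebraic logic: similarity is a (normalized) inner product.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))    # high (~0.98)
print(cosine_similarity(embeddings["cat"], embeddings["stock"]))  # low (~0.10)
```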

At their heart, embeddings bear a deep similarity to spectral techniques and Principal Components Analysis (PCA). Statisticians have long studied PCA, but when the goal was supporting experimentalists who collect their own data, PCA’s use case was often unclear. In our new Data Science era of data analysis, my work bets heavily on the future of PCA, making it useful as an embedding for a wider audience. This is the key motivation for PCA for the People.
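A minimal sketch of that connection, using simulated data (the matrix X and the dimension k below are placeholders, not from any particular application): PCA, computed via the SVD of the centered matrix, embeds each row of a data matrix as a low-dimensional vector.

```python
import numpy as np

rng = np.random.default_rng(0)
# A stand-in data matrix: 200 "documents" measured on 50 "features".
# In practice this might be a document-term or adjacency matrix.
X = rng.poisson(1.0, size=(200, 50)).astype(float)

# PCA via the SVD of the column-centered matrix: the left singular
# vectors, scaled by the singular values, give each row a k-dim vector.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 5
embedding = U[:, :k] * s[:k]   # one 5-dimensional vector per row
print(embedding.shape)         # (200, 5)
```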

In my work, I want to make PCA/embeddings more accessible and legible. There are multiple components to this. For example, compared to linear regression, PCA is much harder to (1) motivate, (2) perform, and (3) explain/justify. I want using PCA/embeddings to be just as fluid as using linear regression.

You can find a list of my lab’s publications linked above, and other lists on arXiv or Google Scholar. We put our code on GitHub. My CV is here.

If we cross paths soon, let’s talk about graphs or Large Language Models or Twitter or #blacklivesmatter or eigenvectors or reproducibility or public opinion or psychotherapy or Raspberry Pis or Bitcoin or your data or the reproducibility crisis or p-values or longpca or peer-review or…

Thank you to the National Science Foundation and the Army Research Office for their support under grants DMS-1309998, DMS-1612456, DMS-1916378, W911NF-15-1-0423, and W911NF-20-1-0051.

If you are on Twitter or Bluesky, come find me @karlrohe.