Karl Rohe
Professor of Statistics, UW–Madison · courtesy appointment in Educational Psychology
Currently building datamint.ing — codebook-driven data extraction with AI, where the user experience is focused on harnessing AI uncertainty.
Researchers in different quantitative disciplines often speak a language of statistics that is partly shared with other disciplines and partly tailored to their own purposes, and thus disjointed. You can imagine the same method evolving in different areas, sometimes converging and sometimes diverging. Sometimes the researchers are aware of this and sometimes they are not. Those evolutions accumulate highly adapted know-how and specialization. I'm inspired by this mess and its ability to produce inspired and well-adapted statistical approaches and methodologies. In my work, I look for methodologies that are particularly useful across many disciplines, each often with extensively adapted know-how. I then aggregate those different understandings and add a layer of statistical work to modernize them. Two current areas of my research are discussed below. The first is statistical reading comprehension, which is what I'm spending most of my time on now. The second is the geometry of embeddings: older, slower, still moving.
In a lot of empirical research, the data isn't measured or scraped — it's read. Court opinions, clinical notes, interview transcripts, free-text survey answers. Most research on this kind of data runs on a ritual. Gather a corpus of documents and write down a set of questions that you want answered about each of them. That list of questions is called a codebook. Hand the codebook to two readers who cannot talk to each other. Measure how often they agree. Argue over the disagreements. Refine the codebook. Repeat. Cochrane formalized one version for medicine in the 1990s; qualitative researchers have done it under names like content analysis since the 1950s. Psychometricians, political scientists coding party platforms, legal scholars building citation datasets, education researchers reading classroom transcripts, historians annotating archives — every one of these fields built its own variant of the same machine.
I'll call what they all built statistical reading comprehension: reading a corpus of documents not to understand it in a literary sense, but to produce a structured dataset that supports inference. Protocols. Codebooks. Reliability coefficients. Intercoder agreement studies. Validity arguments. These are the instruments those fields developed so that "I read it" could become "I read it reproducibly."
Two pieces of this machinery have my attention right now: Cohen's kappa — the reliability statistic from 1960 that nearly every codebook study still cites — and the codebook itself, treated as an object that evolves under refinement. With AI in the loop, both are open to new uses, new ways of understanding, and more evolution.
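For readers who have not computed it by hand, here is a minimal sketch of Cohen's kappa for two coders answering one codebook question. It uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each coder's own label frequencies. The function name and the toy labels are illustrative assumptions, not anything taken from datamint.ing.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same documents.

    Illustrative helper (hypothetical name): inputs are two equal-length
    lists of categorical labels, one label per document.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)

    # Observed agreement: share of documents where the coders match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: product of the coders' marginal label frequencies,
    # summed over categories.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy data: ten documents, the coders agree on eight of them.
coder_a = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"]
coder_b = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "Y"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))  # observed 0.8, chance 0.5, so kappa is 0.6
```

Raw agreement here is 80%, but half of that is what two coders with these label frequencies would reach by chance, which is why kappa reports 0.6 rather than 0.8.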
Statistical inference is a kind of "user interface" for uncertainty and complexity. Ideally, you don't need to look at the complexities and vagaries inside the data; their deleterious downstream effects get characterized and managed for you. The technician should look at the raw data. But for everyone else, you get the interface — a number or a simple summary, with measures of uncertainty. In this way, statistical inference is the interface between you and the complexity.
Sometimes the interface is far too complicated — too much math, too many buttons/tuning-parameters, no clear insight. But sometimes it works, and when it does it communicates and enables a kind of trust. Not the trust that everything is right all the time, but the trust that most of the time we're close enough. At its heart, this is a large part of what statistics is about. A good statistical interface tells you both the "size" and the "shape" of the uncertainty. And if you squint, that's exactly what we need for using AI to do statistical reading comprehension.
Many are unforgiving of AI: "it hallucinates," "it makes stuff up," "it's confidently wrong." I suspect that this aggression reflects a vulnerable fear more than it reflects productive scientific assessment. AI's uncertainty has a different flavor than we'd have imagined, closer in spirit to the kinds of errors we make on our bad days than to anything alien. But if you think about this Statistically, this is just another wild kind of uncertainty. The question isn't whether the uncertainty is there; it is always there, in any measurement. It's whether we can harness it: characterize its size and shape so the uncertainty becomes relatable, and then make sure what's left is still useful.
I've been playing with AI seriously since GPT-3 in 2022/2023, and the common critiques miss the mark. The uncertainty is far more interesting than "confidently wrong." Yes, AI does all of those things sometimes. But the overwhelming "errors" are far more forgivable than that: they are ambiguities in the codebook. The instructions allowed more than one response. I may have a preference among the interpretations, but more often I hadn't considered that one was needed — and considering it takes real thinking.
From this stance, I built datamint.ing for AI-assisted statistical reading comprehension. Five components make it work.
(1) Data Mint. Under the hood, Data Mint is carefully built to reduce hallucinations across a wide variety of documents.
(2) Data Minting a spreadsheet. Given your codebook and your documents, Data Mint produces a spreadsheet. Each row is a document, each column is a codebook question, each cell an answer. Click any cell for the evidence trail — key quotes, reasoning, assumptions, and the rest.
Parts (1) and (2) don't yet capture the size and shape of the uncertainty. Reading every cell's evidence trail by hand and carefully inspecting the raw document could capture it, in principle. In practice, however, that reading produces a little bit of false confidence: I have literally never seen an evidence trail that wasn't reasonable given the instructions. The uncertainty is more epistemic than that. It comes from ambiguities in the codebook, the kind a single AI reader (n=1) cannot reveal. datamint.ing handles this by having multiple AI readers answer every cell (currently n=3, balancing cost and precision). The remaining three components follow from this insight.
(3) Three independent readers per cell. Three independent AI readers answer each cell, each producing its own chain of reasoning, key quotes, and assumptions.
(4) A fourth reader, looking at the first three. A fourth AI reader evaluates the three responses and reports back the size and shape of any disagreements.
(5) A UX for crafting the codebook. All of this is wrapped in a user experience that lets you focus on the codebook itself: building it, using it, refining it. The goal is a codebook that encapsulates a protocol flexible enough to cover your corpus and precise enough to rule out ambiguity. You craft it iteratively, refining by using.
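To make components (3) and (4) concrete, here is a toy sketch of how three answers per cell can be turned into a per-question disagreement rate. It is only a stand-in under stated assumptions: the real fourth reader is an AI that evaluates reasoning and quotes, whereas this sketch just tallies the final answers. The function names, document ids, question names, and answers are all hypothetical.

```python
from collections import Counter


def summarize_cell(answers):
    """Tally one cell's answers from several independent readers."""
    counts = Counter(answers)
    top_answer, top_votes = counts.most_common(1)[0]
    return {
        # A majority exists only if one answer has more than half the votes.
        "majority": top_answer if top_votes > len(answers) / 2 else None,
        "unanimous": len(counts) == 1,
        "distinct": sorted(counts),
    }


def disagreement_by_question(table):
    """Share of documents with a non-unanimous cell, per codebook question."""
    totals, split = Counter(), Counter()
    for (_doc_id, question), answers in table.items():
        totals[question] += 1
        if not summarize_cell(answers)["unanimous"]:
            split[question] += 1
    return {q: split[q] / totals[q] for q in totals}


# Hypothetical readings: (document, codebook question) -> three answers.
readings = {
    ("doc1", "outcome"): ["affirmed", "affirmed", "affirmed"],
    ("doc2", "outcome"): ["reversed", "reversed", "reversed"],
    ("doc1", "primary_issue"): ["contract", "contract", "tort"],
    ("doc2", "primary_issue"): ["tort", "property", "contract"],
}
rates = disagreement_by_question(readings)
for question, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{question}: {rate:.0%} of documents have a split cell")
```

In this toy table the readers never split on `outcome` but always split on `primary_issue`, which points at that question's wording as the place to refine the codebook first.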
In my experience, refining a codebook to address these ambiguities promotes a real amount of reflection. It's enjoyable because you spend your time thinking about the most interesting edge cases — the things you didn't think to imagine. The process sharpens your understanding of the codebook and of all the ways your corpus isn't yet quite aligned with what you thought you meant. Not only is this fun; it is good for science, because sharpening your codebook makes your work more reproducible. The codebook captures the full intent — nothing left to the practices of the human coders, nothing left to their training, nothing left to the lab culture. The result is reproducible and extendible because the AI does the time-consuming part. That frees us to focus on the most important part: carefully crafting our instructions.
Your craft moves from creating your own data/sample to creating and refining your codebook.
Zooming out: AI is going to become instrumentation we use for large-scale measurement of text. As with any measurement, AI produces noise. And as with other instruments, AI becomes far more useful when its uncertainty is relatable — when users can develop a relationship to it. My hope is that users can begin to have a realistic and trusted relationship with their datamint.ing data — not the kind that trusts every cell, but the kind that trusts the process.
R.A. Fisher told statisticians to design the experiment, generate the data, analyze the result. A century later, most of us run in reverse. The data already exists — scraped, deposited, published, leaked — and the analyst's job begins with cleaning and combining what's already out there. I think about what statistics looks like on that side of the table.
We named the new era Data Science, and rightly so. The internet made it easy to share both data and software; from that rich web of dependencies, data science emerges as a property of the network. Here's the longer version I tell students.
In this new era, the machine learning community has found great use for embeddings: representing the things you want to analyze (words, sentences, documents, DNA, proteins, images, videos, graphs, click-streams) as high-dimensional vectors. Neural networks then do linear algebra on those vectors. But linear algebra isn't only special for neural networks; it's at the heart of many statistical ideas. And at the heart of embeddings, there's a deep similarity to spectral techniques, Principal Components Analysis (PCA), and varimax.
Statisticians have long studied PCA, but when supporting experimentalists, its use case was often unclear. My work bets heavily on PCA's future — making it useful as an embedding for a wider audience. The bet is PCA for the People (longpca). Compared to linear regression, PCA is harder to motivate, harder to perform, and harder to explain. I want it to be as fluid to use as linear regression is.
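For a feel of what a varimax rotation does to an embedding, here is a toy two-dimensional sketch. It assumes the raw varimax criterion (the sum over columns of the variance of the squared loadings) and replaces the usual iterative algorithm with a plain grid search over rotation angles, which only works because there are two columns. It is not the longpca implementation, and the loadings are made up.

```python
import math


def rotate(loadings, theta):
    """Rotate every two-dimensional row of `loadings` by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[x * c - y * s, x * s + y * c] for x, y in loadings]


def varimax_criterion(loadings):
    """Sum over columns of the variance of the squared loadings."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squares = [row[j] ** 2 for row in loadings]
        mean_sq = sum(squares) / p
        total += sum(v * v for v in squares) / p - mean_sq ** 2
    return total


def best_rotation(loadings, steps=900):
    """Grid search on [0, pi/2); with two columns the criterion repeats there."""
    grid = [k * (math.pi / 2) / steps for k in range(steps)]
    return max(grid, key=lambda t: varimax_criterion(rotate(loadings, t)))


# Made-up "simple structure": each row loads on exactly one factor.
sparse = [[1.0, 0.0], [0.8, 0.0], [0.9, 0.0],
          [0.0, 1.0], [0.0, 0.7], [0.0, 0.9]]

# Hide the structure with a 30 degree rotation, as an arbitrary basis would.
mixed = rotate(sparse, math.radians(30))

best = best_rotation(mixed)
recovered = rotate(mixed, best)
for row in recovered:
    print([round(abs(v), 3) for v in row])
```

The recovered rows again have one entry that is zero up to rounding, although the two columns may come back swapped or with flipped signs. That ambiguity is inherent to rotations: the criterion cannot distinguish column order or sign.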
If we cross paths soon, let's talk about graphs or Large Language Models or twitter or #blacklivesmatter or eigenvectors or reproducibility or public opinion or psychotherapy or raspberry pis or your data or the reproducibility crisis or p-values or longpca or peer-review or curiosity driven research.
Karl Rohe & Muzhe Zeng (2023). Reply to the Discussion of "Vintage factor analysis with varimax performs statistical inference." Journal of the Royal Statistical Society: Series B, 85(4), p. 1094–.
Original paper: JRSS-B, 85(4), 1037–1060.
Statistics as a field has been tremendously successful in creating and propagating the quantitative theories, techniques, and tools to do quantitative science. Because of our success, our community is continually fragmented into methodological subdisciplines within other fields — psychometrics, signal processing, econometrics, epidemiology, demography, chemometrics, actuarial sciences, machine learning, bioinformatics, and so on — each a certain type of universe in which methodologies are continuously evolving (almost independently) in parallel. This phenomenon has happened slowly over time and as a result, we think it is time to reappraise the role of Statisticians with a capital S (i.e. in Statistics departments). In this vein, the history of Varimax provides a parable.
Statisticians often perceive our field as producing methodology by deriving it from our foundational theories (e.g. Maximum Likelihood, Bayesian, etc.). And then, other fields consume our methodologies. One issue with this perception is that the methodological subdisciplines are producing their own methodologies; what about the Statisticians with a capital S? Sometimes our foundational theories help them. Sometimes our theories do not help them. Sometimes their theories go beyond our own and then we call them Statisticians (with a capital S). Regardless, these are all skilled craftspeople and we fool ourselves when we pretend that Statisticians are the all-knowing producers.
The history of factor analysis is an antidote to our illusions of grandeur. In 1935, Thurstone (a psychologist) was inspired by the idea of multiple types of intelligence and set out to create a way to measure them. Inspired by this and without an "electronic" computer, he whittled a blunt tool, a way of iteratively plotting data and cleverly picking a tiny bit of a rotation and then iterating again. This was not derived from our theories. Moreover, this process was "subjective" at every iteration. In 1956, Anderson and Rubin wrote a Gaussian theory that discounted factor rotations; their theory appears in every textbook on Multidimensional Statistics. But already in 1953, Carroll introduced an optimisation problem to "objectively" pick a rotation using 4th-order moments. Then, Varimax came in 1958. Despite the protesting Statisticians, psychologists used factor rotations and conveyed them to future generations because factor rotations solved a problem. They did not need a "theoretical foundation" and they should not need one now. We hope that the "theoretical foundation" we have started to provide might convince researchers in other disciplines to try using factor rotations on their problems.
This parable of factor analysis is extreme because Statisticians have opposed it for nearly 90 years. Our fundamental claim is more general. Successful methodologies, the ones that spread, will have a consilience of a product-market-fit, "statistical theory" (of models, theorems, and algorithms), and practical-know-how. In the successful diffusion of a methodology, Statistics as a discipline has an essential role. Ideally, we would be in a position to develop methodologies with product-market-fit, but often we are not. Our primary strength is that we can (1) provide a "statistical theory", and then (2) leverage our central position in the academic network of methodological subdisciplines to convey this methodology and statistical theory.
Before (1) developing a theory and (2) propagating it, there is a step (0). We believe Statisticians need to get better at step zero for our field to continue flourishing. The zero-problem is this: which methodologies should we support? Where should we direct our attention? Our journals are (unfortunately) filled with methodologies that lack a product-market-fit.
There is a direct path to ensuring the methodology is fit: we can learn from others. This alternative path leverages and reinforces our central role in the network of quantitative subdisciplines. When we learn popular techniques from others, they already have product-market-fit and likely already have practical-know-how. Our job is to give it a model and a theory (broadly interpreted) and to make it into something that other researchers might enjoy. Maybe this also inspires new algorithms and new estimators. Maybe it does not. By developing this framework, we enable other fields to learn. Importantly, this is not an entirely new way of doing Statistics; it's the way it is already happening.
In this process, methodologies are not derived, but rather methodologies evolve. We Statisticians can play a fundamental role in methodologies evolving and reaching consilience, but we should stop assuming that product-market-fit is easy. Instead, we should recognise the more realistic role that we play in the evolution of quantitative methodologies and leverage the subdisciplines that are simultaneously and endlessly refining numerous different methodologies and testing their product-market-fit. Varimax is a parable for this point.