Professor of Statistics, UW–Madison · courtesy appointment in Educational Psychology

Currently building datamint.ing — codebook-driven data extraction with AI, where the user experience is focused on harnessing AI uncertainty.

Researchers in different quantitative disciplines each speak a language of statistics that is partly shared with other disciplines and partly tailored to their own purposes, and so the languages are disjointed. You can imagine the same method evolving in different areas, sometimes converging and sometimes diverging. Sometimes the researchers are aware of this, and sometimes they are not. Those evolutions accumulate highly adapted know-how and specialization. I'm inspired by this mess and by its ability to produce well-adapted statistical approaches and methodologies. In my work, I look for methodologies that are useful across many disciplines, often with extensively adapted know-how in each, and then aggregate those different understandings while adding a layer of statistical work to modernize them. Two current areas of my research are discussed below: statistical reading comprehension, which is where I'm spending most of my time now, and the geometry of embeddings, which is older, slower, and still moving.

01 Statistical Reading Comprehension with AI

In a lot of empirical research, the data isn't measured or scraped — it's read. Court opinions, clinical notes, interview transcripts, free-text survey answers. Most research on this kind of data runs on a ritual. Gather a corpus of documents and write down a set of questions that you want answered about each of them. That list of questions is called a codebook. Hand the codebook to two readers who cannot talk to each other. Measure how often they agree. Argue over the disagreements. Refine the codebook. Repeat. Cochrane formalized one version for medicine in the 1990s; qualitative researchers have done it under names like content analysis since the 1950s. Psychometricians, political scientists coding party platforms, legal scholars building citation datasets, education researchers reading classroom transcripts, historians annotating archives — every one of these fields built its own variant of the same machine.

I'll call what they all built statistical reading comprehension: reading a corpus of documents not to understand it in a literary sense, but to produce a structured dataset that supports inference. Protocols. Codebooks. Reliability coefficients. Intercoder agreement studies. Validity arguments. These are the instruments those fields developed so that "I read it" could become "I read it reproducibly."

Two pieces of this machinery have my attention right now: Cohen's kappa — the reliability statistic from 1960 that nearly every codebook study still cites — and the codebook itself, treated as an object that evolves under refinement. With AI in the loop, both are open to new uses, new ways of understanding, and more evolution.
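For readers who have not computed it by hand, here is a minimal sketch of Cohen's kappa for two readers coding the same documents. It compares the observed agreement rate with the agreement expected by chance from each reader's marginal rates. The function name, the example labels, and the convention of returning 1.0 when chance agreement is already perfect are illustrative assumptions, not part of any particular study or of datamint.ing.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two readers who coded the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of documents both readers coded identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two readers' marginal rates,
    # summed over categories.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if p_e == 1.0:
        # Assumed convention: both readers used one identical category.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes from two readers on eight documents.
reader_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
reader_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
kappa = cohens_kappa(reader_1, reader_2)
print(kappa)  # observed agreement 0.75, chance agreement 0.5, so kappa is 0.5
```

In this toy example the readers agree on six of eight documents, but half of that agreement is what two coin-flipping readers with the same marginals would produce, which is why kappa lands well below the raw agreement rate.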

Statistical inference as a user interface

Statistical inference is a kind of "user interface" for uncertainty and complexity. Ideally, you don't need to look at the complexities and vagaries inside the data; their deleterious downstream effects get characterized and managed for you. The technician should look at the raw data. But for everyone else, you get the interface — a number or a simple summary, with measures of uncertainty. In this way, statistical inference is the interface between you and the complexity.

Sometimes the interface is far too complicated — too much math, too many buttons/tuning-parameters, no clear insight. But sometimes it works, and when it does it communicates and enables a kind of trust. Not the trust that everything is right all the time, but the trust that most of the time we're close enough. At its heart, this is a large part of what statistics is about. A good statistical interface tells you both the "size" and the "shape" of the uncertainty. And if you squint, that's exactly what we need for using AI to do statistical reading comprehension.

AI uncertainty needs a user interface

Many are unforgiving of AI: "it hallucinates," "it makes stuff up," "it's confidently wrong." I suspect that this aggression has more to do with a vulnerable fear than with productive scientific assessment. AI's uncertainty has a different flavor than we'd have imagined, closer in spirit to the kinds of errors we make on our bad days than to anything alien. But if you think about it statistically, this is just another wild kind of uncertainty. The question isn't whether the uncertainty is there; it is always there, in any measurement. The question is whether we can harness it: characterize its size and shape so the uncertainty becomes relatable, and then make sure what's left is still useful.

I've been playing with AI seriously since GPT-3 in 2022/2023, and the common critiques miss the mark. The uncertainty is far more interesting than "confidently wrong." Yes, AI does all of those things sometimes. But the overwhelming majority of "errors" are far more forgivable than that: they are ambiguities in the codebook. The instructions allowed more than one response. I may have a preference among the interpretations, but more often I hadn't considered that a preference was needed, and considering it takes real thinking.

datamint.ing

From this stance, I built datamint.ing for AI-assisted statistical reading comprehension. Five components make it work.

(1) Data Mint. Under the hood, Data Mint is carefully built to reduce hallucinations across a wide variety of documents.

More on the extraction techniques and how AI extraction works.

(2) Data Minting a spreadsheet. Given your codebook and your documents, Data Mint produces a spreadsheet. Each row is a document, each column is a codebook question, each cell an answer. Click any cell for the evidence trail — key quotes, reasoning, assumptions, and the rest.

Parts (1) and (2) don't yet capture the size and shape of the uncertainty. Reading every cell's evidence trail by hand and carefully inspecting the raw document could capture it, in principle. In practice, however, it produces a little bit of false confidence: I have literally never seen an evidence trail that wasn't reasonable given the instructions. The uncertainty is more epistemic than that. It comes from ambiguities in the codebook, the kind a single AI reader (n=1) cannot reveal. datamint.ing handles this by having multiple AI readers answer every cell (currently n=3, balancing cost and precision). The remaining three components follow from this insight.

(3) Three independent readers per cell. Three independent AI readers answer each cell, each producing its own chain of reasoning, key quotes, and assumptions.

(4) A fourth reader, looking at the first three. A fourth AI reader evaluates the three responses and reports back the size and shape of any disagreements.
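To make "size and shape" concrete, here is a minimal sketch of how disagreement among several readers could be summarized once their answers are in hand. It is not datamint.ing's implementation: the fourth reader there is an AI, whereas this sketch only counts answers. The table layout, the function names, and the definition of "size" as the share of readers who depart from the most common answer are all assumptions made for illustration.

```python
from collections import Counter

def summarize_cell(answers):
    """Summarize one cell's answers from several independent readers."""
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "modal_answer": modal_answer,
        # Size: share of readers who departed from the most common answer.
        "size": 1 - modal_count / len(answers),
        # Shape: which answers appeared, and how often.
        "shape": dict(counts),
    }

def rank_questions_by_disagreement(table):
    """Rank codebook questions by mean disagreement size across documents.

    `table` maps (document, question) to the list of reader answers.
    """
    sizes = {}
    for (_document, question), answers in table.items():
        sizes.setdefault(question, []).append(summarize_cell(answers)["size"])
    means = {q: sum(s) / len(s) for q, s in sizes.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

# Hypothetical answers from three readers on two documents.
table = {
    ("opinion_01", "outcome"): ["affirmed", "affirmed", "affirmed"],
    ("opinion_01", "cites_precedent"): ["yes", "no", "yes"],
    ("opinion_02", "outcome"): ["reversed", "reversed", "reversed"],
    ("opinion_02", "cites_precedent"): ["no", "yes", "unclear"],
}
ranking = rank_questions_by_disagreement(table)
print(ranking[0][0])  # the question whose wording most needs attention
```

Ranking by question rather than by cell reflects the point made above: when readers split, the likely culprit is an ambiguous instruction, so the useful summary is which codebook question to revisit.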

(5) A UX for crafting the codebook. All of this is wrapped in a user experience that lets you focus on the codebook itself: building it, using it, refining it. The goal is a codebook that encapsulates a protocol flexible enough to cover your corpus and precise enough to rule out ambiguity. You craft it iteratively, refining by using.

In my experience, refining a codebook to address these ambiguities prompts real reflection. It's enjoyable because you spend your time thinking about the most interesting edge cases: the things you didn't think to imagine. The process sharpens your understanding of the codebook and of all the ways your corpus isn't yet quite aligned with what you thought you meant. Not only is this fun; it is good for science, because sharpening your codebook makes your work more reproducible. The codebook captures the full intent, with nothing left to the practices of the human coders, nothing left to their training, and nothing left to the lab culture. The result is reproducible and extensible because the AI does the time-consuming part. That frees us to focus on the most important part: carefully crafting our instructions.

Your craft moves from creating your own data/sample to creating and refining your codebook.

Zooming out: AI is going to become instrumentation we use for large-scale measurement of text. As with any measurement, AI produces noise. And as with other instruments, AI becomes far more useful when its uncertainty is relatable — when users can develop a relationship to it. My hope is that users can begin to have a realistic and trusted relationship with their datamint.ing data — not the kind that trusts every cell, but the kind that trusts the process.

02 100 Years of Data Analysis: From Experiments to Embeddings

R.A. Fisher told statisticians to design the experiment, generate the data, analyze the result. A century later, most of us run in reverse. The data already exists — scraped, deposited, published, leaked — and the analyst's job begins with cleaning and combining what's already out there. I think about what statistics looks like on that side of the table.

We named the new era Data Science, and rightly so. The internet made it easy to share both data and software; from that rich web of dependencies, data science emerges as a property of the network. Here's the longer version I tell students.

In this new era, the machine learning community has found great use for embeddings: representing the things you want to analyze (words, sentences, documents, DNA, proteins, images, videos, graphs, click-streams) as high-dimensional vectors. Neural networks then do linear algebra on those vectors. But linear algebra isn't only special for neural networks; it's at the heart of many statistical ideas. And at the heart of embeddings, there's a deep similarity to spectral techniques such as Principal Components Analysis (PCA) with varimax rotation.
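As a rough illustration of that similarity, the sketch below computes a PCA embedding from the singular value decomposition of a centered data matrix and then applies a textbook varimax rotation to the loadings. It uses only NumPy and is not the longpca interface; the function names, the random example data, and the choice of two components are assumptions made for illustration.

```python
import numpy as np

def pca_embedding(X, k):
    """Rank-k PCA via the SVD of the column-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]  # k-dimensional embedding of each row
    loadings = Vt[:k].T        # how each column loads on each component
    return scores, loadings

def varimax(L, max_iter=100, tol=1e-8):
    """Orthogonal rotation R that increases the varimax criterion of L @ R."""
    p, k = L.shape
    R = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        LR = L @ R
        # Gradient-like matrix for the varimax criterion (Kaiser's form).
        G = L.T @ (LR**3 - LR @ np.diag((LR**2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt  # nearest orthogonal matrix to G
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return R

# Hypothetical data: 200 observations of 6 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
scores, loadings = pca_embedding(X, k=2)
R = varimax(loadings)
rotated_loadings = loadings @ R
rotated_scores = scores @ R
```

Because the rotation is orthogonal, the rotated scores and loadings reproduce exactly the same rank-k approximation of the data as the unrotated ones; varimax only changes the coordinate system, with the aim of making each axis easier to interpret.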

Statisticians have long studied PCA, but when supporting experimentalists, its use case was often unclear. My work bets heavily on PCA's future — making it useful as an embedding for a wider audience. The bet is PCA for the People (longpca). Compared to linear regression, PCA is harder to motivate, harder to perform, and harder to explain. I want it to be as fluid to use as linear regression is.


If we cross paths soon, let's talk about graphs or Large Language Models or Twitter or #blacklivesmatter or eigenvectors or reproducibility or public opinion or psychotherapy or Raspberry Pis or your data or the reproducibility crisis or p-values or longpca or peer review or curiosity-driven research.