Photo credit: Hector Garcia-Molina


Theodoros (Theo) Rekatsinas

Update: I am currently on leave at Apple.

I am an assistant professor and a member of the UW-Madison Database Group. My lab works on the foundations of machine learning-based data preparation systems.

I am a co-founder of inductiv inc. Inductiv is based in Waterloo, ON and is focusing on automating data quality ops for analytical pipelines. [Inductiv is now part of Apple.]

Our mission is to go beyond research and put new technology in the hands of users:

Management of Noisy Data: My group is exploring the fundamental connections between data cleaning, statistical learning, and probabilistic inference. This work is part of the HoloClean project. You can find an overview of our work for managing noisy data here.

AI assistants for accelerating knowledge discovery: Recently, we started exploring the use of modern, contextual AI to build COSMOS, an AI assistant that extracts and assimilates data from heterogeneous sources to accelerate analytics and knowledge discovery.

Email: thodrek [at]  /  Office: CS4361 @ Computer Sciences


  • Upcoming and Recent talks: Stanford MLSys Seminar Series Mar 2021, CMU (ML with Large Datasets) Feb 2021, MSR Redmond, MLOps@MLSys, AKBC 2020
  • In our recent work we show how to train graph embeddings for billion-edge graphs using a single machine; no need for expensive distributed training over multiple multi-GPU machines. We are currently working on open-sourcing the engine. Stay tuned!
  • In our new paper in SIGMOD 2020, we close the book on Functional Dependency discovery! Of course, by adopting a statistical perspective: [pdf] [Source Code]
  • Redundancy due to structure in data is key to more accurate robust mean estimation! Check out our new theoretical results (pre-print)!
  • Attention can be the key to simple and effective imputation of missing data (over mixed distributions). Our work on how attention-based networks are used in HoloClean will appear in MLSys 2020! [pdf]
  • Check out the COSMOS project, our effort to build an AI assistant for accelerating knowledge discovery.
  • Our vision on using the noisy channel model to manage noisy data is available here.
  • The whitepaper on the vision around SysML is out!
  • Excited to be giving a talk at ETH on new formal frameworks for managing noisy databases. You can see the recorded talk here.
  • The slides of our tutorial on the synergy between ML and data integration are available here.
  • Excited to release HoloClean as an open-source project! Check it out here!


NEW! Learning Large-Scale Graph Embeddings on a Single Machine
Jason Mohoney, Roger Waleffe, Henry Xu, Theodoros Rekatsinas, and Shivaram Venkataraman
Under Submission, 2020

NEW! Principal Component Networks: Parameter Reduction Early in Training
Roger Waleffe and Theodoros Rekatsinas
Pre-print, 2020

NEW! Picket: Self-supervised Data Diagnostics for ML Pipelines
Zifan Liu, Zhechun Zhou, and Theodoros Rekatsinas
Under Submission, 2020 [code]

Robust Mean Estimation under Coordinate-level Corruption with Missing Entries
Zifan Liu, Jongho Park, Nils Palumbo, Theodoros Rekatsinas, and Christos Tzamos
Under Submission, 2020

Unsupervised Relation Extraction from Language Models using Constrained Cloze Completion
Ankur Goswami, Akshata Bhat, Hadar Ohana, and Theodoros Rekatsinas
EMNLP-Findings, 2020

Record fusion: A learning approach
Alireza Heidari, George Michalopoulos, Shrinu Kushagra, Ihab Ilyas and Theodoros Rekatsinas
Manuscript, 2020

Data-Dependent Differentially Private Parameter Learning for Directed Graphical Models
Amrita Roy Chowdhury, Theodoros Rekatsinas, and Somesh Jha
ICML 2020

A Statistical Perspective on Discovering Functional Dependencies in Noisy Data
Yunjia Zhang, Zhihan Guo, and Theodoros Rekatsinas

Attention-based Learning for Missing Data Imputation in HoloClean
Richard Wu, Aoqian Zhang, Ihab F. Ilyas, and Theodoros Rekatsinas
MLSys 2020

CRUX: Adaptive Querying for Efficient Crowdsourced Data Extraction
Theodoros Rekatsinas, Amol Deshpande, and Aditya Parameswaran
CIKM 2019

Approximate Inference in Structured Instances with Noisy Categorical Observations
Alireza Heidari, Ihab F. Ilyas, and Theodoros Rekatsinas
UAI 2019

Unsupervised Functional Dependency Discovery for Data Preparation
Zhihan Guo and Theodoros Rekatsinas
ICLR, Learning from Limited Data Workshop 2019 [arxiv]

HoloDetect: Few-Shot Learning for Error Detection
Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, and Theodoros Rekatsinas
SIGMOD 2019 (to appear)

A Formal Framework For Probabilistic Unclean Databases
Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré and Theodoros Rekatsinas
ICDT 2019

Data Integration and Machine Learning: A Natural Synergy
Xin Luna Dong and Theodoros Rekatsinas
Tutorial@SIGMOD 2018, @VLDB2018, and @KDD2019 (to appear)

Deep Learning For Entity Matching: A Design Space Exploration
Sidharth Mudgal, Han Li, Anhai Doan, Theodoros Rekatsinas, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra
SIGMOD 2018. Code is available here.

Fonduer: Knowledge Base Construction from Richly Formatted Data
Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis and Christopher Ré

HoloClean: Holistic Data Repairs with Probabilistic Inference
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas and Christopher Ré
VLDB 2017

SLiMFast: Guaranteed Results for Data Fusion and Source Reliability
Theodoros Rekatsinas, Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran and Christopher Ré

Forecasting Rare Disease Outbreaks from Open Source Indicators
Theodoros Rekatsinas, Saurav Ghosh, Sumiko Mekaru, Elaine Nsoesie, John Brownstein, Lise Getoor and Naren Ramakrishnan
Journal of Statistical Analysis and Data Mining, Best of SDM Special Issue, 2016

SourceSight: Enabling Effective Source Selection
Theodoros Rekatsinas, Amol Deshpande, Xin Luna Dong, Lise Getoor and Divesh Srivastava

HawkesTopic: A Joint Model for Network Inference and Topic Modeling from Text-Based Cascades
Xinran He, Theodoros Rekatsinas, James Foulds, Lise Getoor, and Yan Liu
International Conference on Machine Learning (ICML), 2015

StoryPivot: Comparing and Contrasting Story Evolution
Anja Gruenheid, Donald Kossmann, Theodoros Rekatsinas, and Divesh Srivastava

SourceSeer: Forecasting Rare Disease Outbreaks Using Multiple Data Sources (Best Paper Award)
Theodoros Rekatsinas, Saurav Ghosh, Sumiko Mekaru, Elaine Nsoesie, John Brownstein, Lise Getoor and Naren Ramakrishnan
SIAM International Conference on Data Mining (SDM), 2015

Finding Quality in Quantity: The Challenge of Discovering Valuable Sources for Integration
Theodoros Rekatsinas, Xin Luna Dong, Lise Getoor and Divesh Srivastava
7th Biennial Conference on Innovative Data Systems Research (CIDR), 2015

Characterizing and selecting fresh data sources
Theodoros Rekatsinas, Xin Luna Dong and Divesh Srivastava

SPARSI: partitioning sensitive data amongst multiple adversaries
Theodoros Rekatsinas, Amol Deshpande and Ashwin Machanavajjhala
Proceedings of the VLDB Endowment Volume 6 Issue 13, 2013

Multi-relational Learning Using Weighted Tensor Decomposition with Modular Loss
Ben London, Theodoros Rekatsinas, Bert Huang and Lise Getoor
NIPS 2012 Workshop on Spectral Algorithms for Latent Variable Models

Local structure and determinism in probabilistic databases
Theodoros Rekatsinas, Amol Deshpande and Lise Getoor

Fuzzy rule based neuro-dynamic programming for mobile robot skill acquisition on the basis of a nested multi-agent architecture (Best of Conference)
John Karigiannis, Theodoros Rekatsinas and Costas S. Tzafestas
IEEE International Conference on Robotics and Biomimetics (ROBIO), 2010


Adaptive Querying Strategies for Efficient Crowdsourced Data Extraction
Theodoros Rekatsinas, Amol Deshpande and Aditya Parameswaran, 2016

Quality-Aware Data Source Management
Theodoros Rekatsinas, Doctoral Dissertation, 2015


Current PhD Students:

Current MS and Undergraduate Students:

Friends and Collaborators:


  • Zhihan Guo (UW-Madison, currently advised by Prof. Xiangyao Yu)
  • Joshua McGrath (BS 2019, University of Waterloo, PhD)
  • Jordan Vonderwell (BS 2019, Google)
  • Sidharth Mudgal (MS 2018, Amazon)
  • Sherine Zhang (BS 2018, Stanford for MS)


CS839: Modern Data Management and Machine Learning Systems, Spring 2020

CS639: Data Management for Data Science, Spring 2019

CS839: Probabilistic Graphical Models, Fall 2018

CS839: Data Management for Machine Learning, Spring 2018

CS564: Database Management Systems, Fall 2017


Organizing Committee: ICDE 2019, SysML 2019

PC-Member: SIGMOD 2017-2019, VLDB 2017, ICDE 2018, NIPS 2015-2017, ICML 2018, IJCAI 2016, CIKM 2017-2018