Magellan – Research

Research

This page describes the research projects that fall under the Magellan umbrella. If you have worked on one of these projects and your name is not listed, please accept our apologies and let us know.

SparkMatcher (2023–Present)

SparkMatcher is our latest and most advanced entity matching (EM) platform. It provides blocking and matching tools that scale to hundreds of millions of tuples using Spark and AI.

Papers

Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching, D. Paulsen, Y. Govind, A. Doan. VLDB-23.

Software

SparkMatcher consists of four open-source packages designed to support end-to-end EM workflows at scale.

The following packages support the blocking step:

Sparkly: Performs blocking using TF/IDF-based similarity, and has been shown to outperform many state-of-the-art blocking approaches in accuracy.
Delex: Allows users to combine multiple blocking strategies within a single workflow. It provides a declarative language for specifying blocking rules. Delex is currently in beta testing.
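
To make the idea of TF/IDF-based blocking concrete, here is a minimal sketch in plain Python. It is not Sparkly's API or implementation. The function names (tokenize, weigh, tfidf_vectors, tfidf_block), the whitespace tokenizer, the smoothed IDF formula, and the toy records are all assumptions made for illustration. For each record in one table, the sketch keeps the top-k records of the other table ranked by cosine similarity of their TF/IDF vectors, using an inverted index so that only records sharing a token are scored.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative choice)."""
    return text.lower().split()


def weigh(doc, idf):
    """Weigh a document's term counts by IDF and normalize to unit length."""
    tf = Counter(tokenize(doc))
    # Tokens never seen when the IDF table was built are dropped.
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def tfidf_vectors(docs):
    """Return one unit-length TF/IDF vector per document, plus the IDF table."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    # Smoothed IDF (an assumption; other variants work as well).
    idf = {tok: math.log((1 + n) / (1 + cnt)) + 1.0 for tok, cnt in df.items()}
    return [weigh(doc, idf) for doc in docs], idf


def tfidf_block(table_a, table_b, k=2):
    """For each record of table_b, return indices of its top-k table_a records."""
    vecs_a, idf = tfidf_vectors(table_a)
    # Inverted index: token -> list of (index in table_a, weight).
    index = {}
    for i, vec in enumerate(vecs_a):
        for tok, w in vec.items():
            index.setdefault(tok, []).append((i, w))
    candidates = {}
    for j, doc in enumerate(table_b):
        scores = Counter()
        for tok, w in weigh(doc, idf).items():
            for i, w_a in index.get(tok, []):
                scores[i] += w * w_a  # accumulate the cosine similarity
        candidates[j] = [i for i, _ in scores.most_common(k)]
    return candidates


# Hypothetical toy tables.
table_a = ["Apple iPhone 13 128GB blue", "Samsung Galaxy S21 phone", "Dell XPS 13 laptop"]
table_b = ["iphone 13 blue apple", "dell xps laptop 13 inch"]
print(tfidf_block(table_a, table_b, k=1))
```

The candidate pairs produced this way would then be passed to the matching step. A system that scales to hundreds of millions of tuples distributes this work rather than looping in a single process.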

The following packages support the matching step:

MatchFlow: A library for experimenting with a wide range of workflows for the matching step. It provides modular components that can be composed into flexible matching pipelines. MatchFlow is hosted at MadMatcher.
ActiveMatcher: Uses active learning to train high-accuracy matchers with minimal manual labeling. It automatically selects informative tuple pairs for labeling and scales to very large candidate sets produced by blocking, often containing hundreds of millions of pairs. ActiveMatcher is currently in beta testing.
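
The active learning loop described above can be sketched as follows. This is not ActiveMatcher's code or API. It assumes uncertainty sampling as the selection strategy, a toy one-feature model (a threshold on a similarity score), and a hypothetical oracle function standing in for the human labeler. The seed set must contain at least one match and one non-match.

```python
import math


def fit_threshold(labeled):
    """Fit a toy model: the midpoint between the mean scores of the two classes."""
    pos = [s for s, y in labeled if y]
    neg = [s for s, y in labeled if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def match_probability(score, threshold, sharpness=10.0):
    """Map a similarity score to a pseudo-probability with a logistic curve."""
    return 1.0 / (1.0 + math.exp(-sharpness * (score - threshold)))


def select_most_uncertain(scores, threshold, batch_size):
    """Pick the pairs whose predicted match probability is closest to 0.5."""
    ranked = sorted(
        scores,
        key=lambda pid: abs(match_probability(scores[pid], threshold) - 0.5),
    )
    return ranked[:batch_size]


def active_learning(scores, oracle, seed_ids, rounds=3, batch_size=2):
    """Alternate between fitting the model and labeling the most uncertain pairs."""
    labeled = {pid: oracle(pid) for pid in seed_ids}
    for _ in range(rounds):
        threshold = fit_threshold([(scores[p], y) for p, y in labeled.items()])
        unlabeled = {p: s for p, s in scores.items() if p not in labeled}
        if not unlabeled:
            break
        for pid in select_most_uncertain(unlabeled, threshold, batch_size):
            labeled[pid] = oracle(pid)  # ask the (simulated) user for a label
    threshold = fit_threshold([(scores[p], y) for p, y in labeled.items()])
    return threshold, labeled


# Hypothetical candidate pairs: pair id -> similarity score.
scores = {0: 0.05, 1: 0.15, 2: 0.30, 3: 0.45, 4: 0.55,
          5: 0.62, 6: 0.70, 7: 0.85, 8: 0.95}


def oracle(pid):
    """Simulated labeler: pairs with id 5 or higher are true matches."""
    return pid >= 5


threshold, labeled = active_learning(scores, oracle, seed_ids=[0, 8])
print(threshold, sorted(labeled))
```

The point of the design is that labels are requested only for the pairs the current model is least sure about, so a small labeling budget is spent near the decision boundary. A real matcher would use a richer feature vector and classifier in place of the single threshold.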

Users

A variant of Sparkly has been integrated into widely used industrial EM software and is currently used by hundreds of customers.
MatchFlow has been used to build matchers for the Environmental Data Initiative (EDI), a major data lake serving environmental scientists.

Data

BigGoat: A benchmark for evaluating the scalability of blocking methods.

Startup

MadMatcher: Founded by Dev Ahluwalia (2025–Present)

Team

Dev Ahluwalia, Derek Paulsen, Yash Govind, AnHai Doan

CloudMatcher (2017–2019)

CloudMatcher is a hands-off, self-service, cloud-based EM platform. Users upload two tables to be matched and label a small number of tuple pairs as match or no-match. The system then automatically performs blocking and matching using the labeled data and outputs the resulting matches. This design enables business users to perform EM with minimal technical expertise. CloudMatcher was acquired by Informatica in 2020.

Papers

CloudMatcher: A Hands-Off Cloud/Crowd Service for Entity Matching, Y. Govind, E. Paulson, P. Nagarajan, P. Suganthan G.C., A. Doan, Y. Park, G. Fung, D. Conathan, M. Carter, M. Sun. VLDB-18. Demo paper.
CloudMatcher: A Cloud/Crowd Service for Entity Matching, Y. Govind, E. Paulson, M. Ashok, P. Suganthan G.C., A. Hitawala, A. Doan, Y. Park, P. Peissig, E. LaRose, J. Badger. BIGDAS Workshop @ KDD-17. slides.

Software

CloudMatcher (no longer available, acquired by Informatica in 2020)

Users

CloudMatcher was used by several domain science teams, hospitals, and companies. See Table 2 of this paper for details.
This article describes how CloudMatcher was used at American Family Insurance.

Startup

GreenBay Technologies: Co-founded by AnHai Doan, Yash Govind, Derek Paulsen (2019–2020)

Team

Yash Govind, Erik Paulson, Derek Paulsen, P. Nagarajan, Paul Suganthan G.C., Mukilan Ashok, A. Hitawala

PyMatcher (2015–2025)

PyMatcher is an EM platform built on Python data science libraries (e.g., pandas, scikit-learn) and designed to run on a single machine. It targets small to medium-sized tables, typically up to a few million tuples per table.

Papers: Overall Vision, Progress, and Demos

Magellan: Toward Building Entity Matching Management Systems, P. Konda, S. Das, P. Suganthan G.C., A. Doan, and others. VLDB-16. extended version, slides. [492 citations as of 3/31/2026]
Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks, P. Konda, S. Das, and others. VLDB-16, demo paper. Jupyter notebook & datasets for demo.
Magellan: Toward Building Entity Matching Management Systems, P. Konda, S. Das, and others. SIGMOD Record, 2018.
Entity Matching Meets Data Science: A Progress Report from the Magellan Project, Y. Govind, P. Konda, and others. SIGMOD-19. Industrial paper.
Magellan: Toward Building Ecosystems of Entity Matching Solutions, A. Doan, P. Konda, P. Suganthan G.C., Y. Govind, D. Paulsen, K. Chandrasekhar, P. Martinkus, M. Christie. Communications of the ACM, 2020.

Other Papers

MatchCatcher: A Debugger for Blocking in Entity Matching, H. Li, P. Konda, and others. EDBT-18. extended version, slides.
Executing Entity Matching End to End: A Case Study, P. Konda, S. Seshadri, E. Segarra, B. Hueth, A. Doan. EDBT-19. Industrial paper.

Software

py_stringmatching: Implements a wide range of string tokenizers and string similarity functions.
py_strsimjoin: Efficiently finds all matching string pairs between two large sets of strings.
py_entitymatching: Performs entity matching between two tables by identifying all matching tuple pairs.
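
The building blocks named above (a tokenizer, a similarity function, and a join that keeps string pairs above a threshold) can be illustrated with a short plain-Python sketch. It does not use or reproduce the APIs of the packages listed. The function names (qgrams, jaccard, similarity_join), the choice of unpadded 3-grams, and the sample strings are assumptions made for illustration.

```python
def qgrams(text, q=3):
    """Return the set of character q-grams of text (no padding)."""
    text = text.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(left, right, threshold=0.5, q=3):
    """Naive all-pairs join: return (l, r, score) where Jaccard >= threshold."""
    right_tokens = [(r, qgrams(r, q)) for r in right]
    results = []
    for l in left:
        l_tokens = qgrams(l, q)
        for r, r_tokens in right_tokens:
            score = jaccard(l_tokens, r_tokens)
            if score >= threshold:
                results.append((l, r, score))
    return results


# Hypothetical sample strings.
pairs = similarity_join(["madison"], ["madisen", "chicago"], threshold=0.4)
print(pairs)
```

This sketch compares every pair, so its cost grows with the product of the two set sizes. An efficient join over large string sets needs filtering or indexing to avoid most of those comparisons.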

Users

PyMatcher was used by several domain science teams and companies. See Table 1 of this paper for details.
This paper describes how economists at UW-Madison applied PyMatcher to match grants.
Appendix B of this paper describes its use in matching cattle ranches as part of a project aimed at reducing deforestation in the Amazon.

Team

Pradap Konda, Sanjib Das, Paul Suganthan G.C., Adel Ardalan, Jeff Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Shishir Prasad

Corleone and Falcon (2013–2018)

This project explored EM solutions that leverage crowdsourcing to enable hands-off matching of large tables at scale. The ideas developed here inspired the design of CloudMatcher.

Papers

Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services, S. Das, P. Suganthan G.C., A. Doan, and others. SIGMOD-17. extended version, slides. [139 citations as of 3/31/2026]
Corleone: Hands-off Crowdsourcing for Entity Matching, C. Gokhale, S. Das, A. Doan, and others. SIGMOD-14. extended report, slides. [345 citations as of 3/31/2026]

Team

Chaitanya Gokhale, Sanjib Das, Paul Suganthan G.C.

Deep Learning (2017–2022)

This project explored using deep learning for both the blocking and matching steps of EM. It investigated a broad design space of neural architectures and training strategies, including pre-trained language models.

Papers

Deep Learning for Blocking in Entity Matching: A Design Space Exploration, S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, A. Doan. VLDB-21. [165 citations as of 3/31/2026]
Deep Entity Matching with Pre-Trained Language Models, Y. Li, J. Li, Y. Suhara, A. Doan, W. Tan. VLDB-20. [656 citations as of 3/31/2026]
Deep Learning for Entity Matching: A Design Space Exploration, S. Mudgal, H. Li, T. Rekatsinas, A. Doan, and others. SIGMOD-18. extended version. [887 citations as of 3/31/2026]

Software

DeepMatcher: Implements the solutions in the SIGMOD-18 paper. No longer maintained.
Ditto: Implements the solutions in the VLDB-20 paper.

String Matching, Schema Matching, Ontology Matching, and Related Problems

Although designed for entity matching, SparkMatcher and related Magellan software can be applied to a wide range of semantic matching tasks, including string matching, schema matching, and ontology matching.

Examples

Smurf: Self-Service String Matching Using Random Forests, P. Suganthan G.C., A. Ardalan, A. Doan, A. Akella. VLDB-19.
A variant of SparkMatcher has been incorporated into an industrial schema matching system.
SparkMatcher has been used to match table columns with ontology concepts for the Environmental Data Initiative (EDI); this solution is now in production at EDI.