Software

SparkMatcher (2023 – Present)

SparkMatcher is our latest and most advanced entity matching (EM) platform. It provides blocking and matching tools that scale to hundreds of millions of tuples using Spark and AI.

SparkMatcher consists of four open-source packages designed to support end-to-end EM workflows at scale.

The following packages support the blocking step:

The following packages support the matching step:

PyMatcher (2015–2025)

PyMatcher is an EM platform built on Python data science libraries (e.g., pandas, sklearn) and designed to run on a single machine. It targets small to medium-sized tables—typically up to a few million tuples per table. PyMatcher provides tools for sampling data, using those samples to design accurate EM pipelines, and then applying the pipelines to match the full tables.
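The sample-then-apply workflow described above can be illustrated with a minimal sketch. It uses plain pandas rather than the PyMatcher API, and the tables, the `zip` blocking key, the word-level Jaccard score, and the 0.5 threshold are all hypothetical choices made for illustration only.

```python
import pandas as pd


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def block(left: pd.DataFrame, right: pd.DataFrame, key: str) -> pd.DataFrame:
    """Blocking: keep only candidate pairs that agree on `key`."""
    return left.merge(right, on=key, suffixes=("_l", "_r"))


def match(pairs: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Matching: keep candidate pairs whose name similarity reaches the threshold."""
    scores = [jaccard(l, r) for l, r in zip(pairs["name_l"], pairs["name_r"])]
    scored = pairs.assign(score=scores)
    return scored[scored["score"] >= threshold].reset_index(drop=True)


# Two small hypothetical tables to be matched.
A = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Dave Smith", "Joe Wilson", "Ann Lee"],
    "zip": ["53703", "53706", "53703"],
})
B = pd.DataFrame({
    "id": [10, 11, 12],
    "name": ["Dave Smith Jr", "Joe Wilson", "Anna Li"],
    "zip": ["53703", "53706", "53703"],
})

# Step 1: sample. Take a few rows of A and the rows of B that could pair with them.
sample_a = A.sample(n=2, random_state=0)
sample_b = B[B["zip"].isin(sample_a["zip"])]

# Step 2: design the pipeline on the sample (here, just inspect one threshold).
dev_matches = match(block(sample_a, sample_b, "zip"), threshold=0.5)

# Step 3: apply the same pipeline to the full tables.
candidates = block(A, B, "zip")
matches = match(candidates, threshold=0.5)
print(matches[["id_l", "id_r", "score"]])
```

Blocking on `zip` reduces the nine possible pairs to five candidates before any similarity score is computed. A real pipeline would typically replace the fixed threshold with a matcher trained on labeled sample pairs.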

PyMatcher consists of three open-source Python packages:

Other Software

CloudMatcher (2017–2019) is a cloud-based EM platform. It was acquired by Informatica in 2020.

DeepMatcher (2018) is an EM package that uses deep learning. It is designed primarily for researchers and is no longer maintained. If you are interested in deep-learning-based EM, consider Ditto, developed by Megagon Labs.