Software
SparkMatcher (2023–Present)
SparkMatcher is our latest and most advanced entity matching (EM) platform. It provides blocking and matching tools that scale to hundreds of millions of tuples using Spark and AI.
SparkMatcher consists of four open-source packages designed to support end-to-end EM workflows at scale.
The following packages support the blocking step:
- Sparkly: Performs blocking using TF/IDF-based similarity, and has been shown to outperform many state-of-the-art blocking approaches in accuracy.
- Delex: Allows users to combine multiple blocking strategies within a single workflow. It provides a declarative language for specifying blocking rules. Delex is currently in beta testing.
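To make the idea of TF/IDF-based blocking concrete, here is a minimal single-machine sketch in plain Python. It is not Sparkly's implementation or API: the function names (`tokenize`, `build_index`, `weigh`, `cosine`, `block`), the whitespace tokenizer, the smoothed IDF formula, and the top-k cutoff are all assumptions chosen for illustration. For each record in one table, it keeps the k most similar records of the other table as candidate pairs.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def weigh(tokens, idf):
    """L2-normalized TF-IDF weights for one record.

    Tokens that never occur in the indexed table are ignored.
    """
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(records):
    """Compute IDF weights and TF-IDF vectors for the indexed table."""
    token_lists = [tokenize(r) for r in records]
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    # Smoothed IDF (an assumption; other variants work as well).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return idf, [weigh(toks, idf) for toks in token_lists]


def cosine(u, v):
    # Both vectors are already normalized, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def block(table_a, table_b, k=2):
    """Return candidate pairs (i, j): for each record j of table_b,
    the k records i of table_a with the highest nonzero similarity."""
    idf, vectors_a = build_index(table_a)
    candidates = []
    for j, rec in enumerate(table_b):
        q = weigh(tokenize(rec), idf)
        scored = sorted(
            ((cosine(q, v), i) for i, v in enumerate(vectors_a)),
            reverse=True,
        )
        candidates.extend((i, j) for score, i in scored[:k] if score > 0)
    return candidates
```

This sketch scores every record of `table_a` for each record of `table_b`, which is quadratic and only suitable for small inputs; a scalable blocker would rely on an inverted index and distributed execution instead.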
The following packages support the matching step:
- MatchFlow: A library for experimenting with a wide range of workflows for the matching step, across different runtime environments. It provides modular components that can be composed into flexible matching pipelines. MatchFlow is hosted at MadMatcher, a recent EM startup.
- ActiveMatcher: Uses active learning to train high-accuracy matchers with minimal manual labeling. It automatically selects informative tuple pairs for labeling and scales to very large candidate sets produced by blocking, often containing hundreds of millions of pairs. ActiveMatcher is currently in beta testing.
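The active learning loop described above can be sketched as follows. This is a toy illustration, not ActiveMatcher's code or API: it assumes each candidate pair is summarized by a single similarity feature, uses a hypothetical threshold "matcher" (`fit_threshold`, `match_probability`), and selects pairs by uncertainty sampling, that is, the unlabeled pairs whose predicted match probability is closest to 0.5. The `oracle` callback stands in for the human labeler.

```python
import math


def fit_threshold(features, labels):
    """Toy matcher: the midpoint between the mean similarity of the
    labeled matches and the mean similarity of the labeled non-matches.
    Requires at least one labeled match and one labeled non-match."""
    pos = [features[i] for i, y in labels.items() if y]
    neg = [features[i] for i, y in labels.items() if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def match_probability(x, threshold, steepness=10.0):
    # Logistic curve centered on the threshold (steepness is an assumption).
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))


def select_for_labeling(features, labels, threshold, batch_size):
    """Uncertainty sampling: pick the unlabeled pairs whose predicted
    match probability is closest to 0.5."""
    unlabeled = [i for i in range(len(features)) if i not in labels]
    unlabeled.sort(
        key=lambda i: abs(match_probability(features[i], threshold) - 0.5)
    )
    return unlabeled[:batch_size]


def active_learn(features, oracle, seed_labels, rounds=3, batch_size=2):
    """Alternate between training the matcher and asking the oracle
    to label the most informative pairs."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        threshold = fit_threshold(features, labels)
        for i in select_for_labeling(features, labels, threshold, batch_size):
            labels[i] = oracle(i)
    return fit_threshold(features, labels), labels
```

The point of the selection step is that labeling effort goes to the pairs the current matcher is least sure about, rather than to pairs it already classifies confidently.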
PyMatcher (2015–2025)
PyMatcher is an EM platform built on Python data science libraries (e.g., pandas, sklearn) and designed to run on a single machine. It targets small to medium-sized tables—typically up to a few million tuples per table. PyMatcher provides tools for sampling data, using those samples to design accurate EM pipelines, and then applying the pipelines to match the full tables.
PyMatcher consists of three open-source Python packages:
- py_stringmatching: Implements a wide range of string tokenizers and string similarity functions.
- py_strsimjoin: Uses py_stringmatching to efficiently find all matching string pairs between two large sets of strings.
- py_entitymatching: Performs entity matching between two tables by identifying all matching tuple pairs, building on the above two packages.
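As an illustration of how tokenizers, similarity functions, and string similarity joins fit together, here is a minimal sketch in plain Python. It deliberately does not use the APIs of the packages above: the names `qgrams`, `jaccard`, and `sim_join`, the choice of 3-grams, and the 0.5 threshold are assumptions made for the example. The join uses a simple inverted index so that only string pairs sharing at least one token are compared.

```python
from collections import defaultdict


def qgrams(s, q=3):
    """Tokenize a string into its set of lowercase character q-grams."""
    s = s.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_join(left, right, threshold=0.5, q=3):
    """Return (i, j, score) for all pairs left[i], right[j] whose
    Jaccard similarity over q-grams is at least the threshold."""
    right_tokens = [qgrams(s, q) for s in right]

    # Inverted index: token -> positions of right strings containing it.
    index = defaultdict(set)
    for j, toks in enumerate(right_tokens):
        for t in toks:
            index[t].add(j)

    matches = []
    for i, s in enumerate(left):
        toks = qgrams(s, q)
        # Only strings sharing at least one token can have a nonzero score.
        candidates = set().union(*(index.get(t, ()) for t in toks))
        for j in sorted(candidates):
            score = jaccard(toks, right_tokens[j])
            if score >= threshold:
                matches.append((i, j, score))
    return matches
```

For example, joining `["University of Wisconsin", "Stanford University"]` with `["university of wisconsin", "MIT", "Stanford Univ"]` at the default threshold pairs the first string with its lowercase variant and "Stanford University" with "Stanford Univ".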
Other Software
CloudMatcher (2017–2019) is a cloud-based EM platform. It was acquired by Informatica in 2020.
DeepMatcher (2018) is an EM package that uses deep learning. It is designed primarily for researchers and is no longer maintained. If you are interested in deep-learning-based EM, consider Ditto, developed by Megagon Labs.