About
Entity Matching
Entity Matching (EM) is the problem of finding data instances that refer to the same real-world entity. For example, given two tables A and B shown here, find all tuple pairs (a in A, b in B) that match, such as (Dave Smith, Madison, WI) and (David D. Smith, Madison, WI). We call these pairs matches.
Example of entity matching between two tables A and B.
EM arises frequently in data science and AI. Many projects must integrate multiple datasets into a single clean unified dataset before analysis or model training can take place. Solving the entity matching problem is often a necessary step in this integration process.
The EM problem is challenging for two main reasons. First, matching instances may be represented in different ways—using different names, formats, or levels of detail—making high-accuracy matching difficult. Second, real-world datasets are often very large: tables commonly contain hundreds of millions of tuples, which makes achieving reasonable runtime and scalability a major challenge.
Blocking and Matching
For large tables, considering all possible pairs between tables A and B is computationally infeasible. As a result, entity matching is typically performed in two stages: blocking and matching.
In the blocking stage, inexpensive heuristics are used to quickly eliminate the vast majority of tuple pairs that are unlikely to match. The goal is to dramatically reduce the search space while retaining most true matches.
In the matching stage, a more expensive rule-based or machine-learning–based matcher is applied to the remaining candidate pairs to predict whether each pair is a match or a non-match.
For example, in the figure above, the blocking step retains only pairs that share the same state (which can be done efficiently using an index on the State column). The matching step then applies a matcher that correctly predicts pairs (a1,b1) and (a3,b2) as matches.
Variations of EM and Related Problems
Variations of EM are known as entity resolution, record linkage, deduplication, and more. We use the term entity matching because it emphasizes the common structure shared by these tasks and aligns with a broader set of related problems whose names end with matching.
Specifically, string matching finds strings that refer to the same real-world concept, such as "UW-Madison" and "Univ of Wisc Madison". Schema matching finds similar columns across tables, such as "address" and "location". Ontology matching finds similar concepts across ontologies, such as "car" and "automobile". Other related problems include matching between ontology concepts and table columns, matching tables, and more.
While Magellan is designed primarily for EM, it can also be used for these related matching problems.
The Magellan Project
The Magellan project was launched in 2015 at the University of Wisconsin–Madison. At the time, despite a large body of research on entity matching, we found little industrial-strength software that could be readily used in practice. Our goal has therefore been to develop open-source, production-quality entity matching software, evaluate it through collaborations with real users, use this feedback to refine the software, and publish the resulting findings.
A central objective of Magellan is to release EM software that achieves widespread adoption. While publishing research is an important part of the project, it is viewed as a consequence of executing this iterative software–user–refinement cycle, rather than an end in itself.
Team and Contact
Many people have contributed to the Magellan project over the years. A list of contributors can be found on the Research page.
For inquiries or collaboration requests, please contact us at entitymatchinginfo@gmail.com.