Cymphony: A Crowdsourcing Platform for Data Integration

Motivation

Data integration (DI) tasks such as information extraction, data cleaning, entity matching, and schema matching are difficult to automate fully. They often require human judgment to achieve high accuracy. Crowdsourcing — leveraging human workers to solve such tasks — has therefore become a foundational technique in modern data science pipelines.

Despite widespread use, crowdsourcing for data integration remains poorly systematized. Academic research has largely focused on narrow algorithmic problems without producing reusable, end-to-end systems. Industrial platforms such as Amazon Mechanical Turk expose only limited abstractions and offer little support for complex multi-stage workflows. Contract-based services are generally accessible only to large organizations.

As a result, many data science teams still rely on ad-hoc scripts, emailed spreadsheets, and manual coordination to manage human-in-the-loop workflows — solutions that are brittle and difficult to maintain.

Real-world crowdsourcing workflows are rarely simple. They often involve multiple rounds of human input interleaved with machine processing such as sampling, SQL-based transformations, and quality estimation. Implementing such workflows from scratch imposes significant overhead and slows iteration.

We argue that crowdsourcing for data integration should be treated as a first-class systems problem. Rather than viewing human annotation as a peripheral service or collection of scripts, we model it as a structured execution substrate with explicit semantics and modular abstractions.

The Cymphony Solution

Cymphony is a project started in 2019 jointly with the Qatar Computing Research Institute (QCRI). Its goal is to build a general-purpose crowdsourcing platform tailored to data integration workflows. Cymphony models these workflows as directed acyclic graphs (DAGs) of operators over relational tables.

Human operators encapsulate task assignment, annotation collection, and aggregation of noisy labels. The core operator 3a_kn solicits up to n votes per task and returns a final label when at least k votes agree. A variant, 3a_amt, integrates with Amazon Mechanical Turk for external workforce scaling. A third operator, 3a_knlm, supports role-aware aggregation with heterogeneous worker populations — for example, treating regular workers and trusted data stewards differently.
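The k-of-n rule behind 3a_kn can be illustrated with a small Python sketch. This is not Cymphony's implementation: the function name aggregate_kn, the early-stopping behavior, and the choice to return None when no label reaches k votes are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def aggregate_kn(votes: Iterable[str], k: int, n: int) -> Optional[str]:
    """Hypothetical k-of-n aggregation: consume at most n votes and return
    the first label that collects k agreeing votes.

    Returns None if no label reaches k votes within the first n votes
    (an assumption; the source does not say what happens in that case).
    """
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    counts = Counter()
    for i, label in enumerate(votes):
        if i >= n:
            break  # vote budget exhausted
        counts[label] += 1
        if counts[label] >= k:
            return label  # early stop: remaining votes are not needed
    return None


# Example: ask for up to 3 votes, accept a label once 2 workers agree.
final_label = aggregate_kn(["match", "no_match", "match"], k=2, n=3)
```

Stopping as soon as k votes agree means a task with two quick agreeing answers never consumes its third vote, which is one plausible reason to phrase the rule as "up to n" rather than "exactly n".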

Machine operators perform deterministic transformations. sample_random draws a random subset of tuples for quality estimation or expert review. exec_sql executes arbitrary SQL queries to join, partition, or compute metrics over tables.
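As a rough illustration of what such machine operators do, the sketch below mimics sample_random and exec_sql on top of Python's built-in sqlite3 module. The function signatures, the target-table arguments, and the use of SQLite are assumptions for this example and do not reflect Cymphony's actual interface or storage engine.

```python
import random
import sqlite3


def sample_random(conn, source, target, size, seed=None):
    """Hypothetical stand-in: copy a random subset of rows from table
    `source` into a new table `target` and return the number of rows copied.
    Table names are interpolated directly, so they must be trusted."""
    rowids = [r[0] for r in conn.execute(f"SELECT rowid FROM {source} ORDER BY rowid")]
    rng = random.Random(seed)  # a seed makes the sample reproducible
    chosen = rng.sample(rowids, min(size, len(rowids)))
    # Create an empty table with the same columns, then copy the chosen rows.
    conn.execute(f"CREATE TABLE {target} AS SELECT * FROM {source} WHERE 0")
    conn.executemany(
        f"INSERT INTO {target} SELECT * FROM {source} WHERE rowid = ?",
        [(r,) for r in chosen],
    )
    return len(chosen)


def exec_sql(conn, target, query):
    """Hypothetical stand-in: materialize the result of `query` as table `target`."""
    conn.execute(f"CREATE TABLE {target} AS {query}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (pair_id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO labels VALUES (?, ?)",
    [(i, "match" if i % 2 == 0 else "no_match") for i in range(10)],
)

# Draw 4 labeled pairs for expert review, then compute a per-label count.
n_sampled = sample_random(conn, "labels", "labels_sample", size=4, seed=0)
exec_sql(conn, "label_counts", "SELECT label, COUNT(*) AS cnt FROM labels GROUP BY label")
counts = dict(conn.execute("SELECT label, cnt FROM label_counts"))
```

Both functions write their result to a named table instead of returning rows, mirroring the idea that every intermediate result lives in a relational table that later operators can read.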

Workflows are specified declaratively as Cymphony programs and executed end-to-end by the system. All execution artifacts — task assignments, raw annotations, aggregated labels, and intermediate tables — are materialized as relational tables, enabling transparency, reproducibility, and seamless integration with downstream data tools.
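To make the DAG execution model concrete, here is a minimal Python sketch of running operators in dependency order while keeping every intermediate result as a named table. It does not show Cymphony's program syntax, which is not given here. The run_workflow function, the dictionary-based workflow description, and the in-memory list-of-rows tables are hypothetical, and the "label" step is a placeholder for a human operator.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(operators, deps, tables):
    """Run each operator after all of its upstream operators.

    operators: name -> function(tables) returning a table (list of row dicts)
    deps:      name -> set of upstream operator names
    tables:    name -> table; seeded with the inputs, extended with outputs
    Returns the execution order. A cyclic `deps` raises graphlib.CycleError.
    """
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        # Each output is stored under the operator's name, so every
        # intermediate result remains inspectable after the run.
        tables[name] = operators[name](tables)
    return order


tables = {"pairs": [{"id": 1}, {"id": 2}, {"id": 3}]}
operators = {
    # Placeholder for a human operator: pretend every pair was labeled "match".
    "label": lambda t: [dict(row, label="match") for row in t["pairs"]],
    # Machine operator: take the first two labeled rows for review.
    "sample": lambda t: t["label"][:2],
    # Machine operator: compute a simple metric over the sample.
    "count": lambda t: [{"n": len(t["sample"])}],
}
deps = {"label": set(), "sample": {"label"}, "count": {"sample"}}

order = run_workflow(operators, deps, tables)
```

After the run, tables holds the input and the output of every operator, which loosely corresponds to materializing all execution artifacts for later inspection.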

Cymphony provides full lifecycle support: task instantiation, worker coordination, intermediate state management, aggregation, and output production. Workers interact through a web-based GUI or programmatic API. The system coordinates concurrent worker participation using a hybrid relational–filesystem data management layer.

Data Catalogs: A "Killer App" for Cymphony

We found data catalogs to be a natural "killer app" for crowdsourcing. Building and maintaining a data catalog often requires multiple crowdsourcing workflows, which can be complex — spanning multiple stages and involving different types of workers, including data stewards, in-house staff, and crowd workers (e.g., on Amazon Mechanical Turk).

Cymphony is well suited for this setting. We are exploring its use in SmartCat, our data catalog management system, where it orchestrates these crowdsourcing workflows. In this architecture, SmartCat retains control over user interaction and metadata management, while delegating workflow orchestration to Cymphony.

This design highlights Cymphony's role as a reusable human-in-the-loop backend for larger data management systems.

People and Funding

Amanpreet Singh Saini, Mark Tervo (UW-Madison), Mourad Ouzzani (QCRI), Nan Tang (HKUST, formerly QCRI).

Software

Cymphony is currently experimental and in beta testing.

Publications and Talks

Cymphony: Toward a Crowdsourcing Platform for Data Integration, A. Saini. PhD Dissertation, University of Wisconsin–Madison, 2026.
Cymphony: Toward a Crowdsourcing Platform for Data Science, A. Saini. PhD Defense.