Crowdsourcing

Crowdsourcing (2002–2015)

This project was among the first in the database research community to explore crowdsourcing for data management, with a focus on building data integration systems.

In many ways, it was nearly a decade ahead of its time: broad interest in crowdsourcing within the database community did not emerge until around 2011.

The project examined how to apply crowdsourcing to problems such as schema matching, knowledge base construction, entity matching, and social media analysis.

In 2019, I started a new project called Cymphony, in collaboration with the Qatar Computing Research Institute, to build a general-purpose crowdsourcing platform. That project is ongoing.

Early Work (2002–2004)

I focused on applying crowdsourcing to schema matching. At the time, the term "crowdsourcing" had not yet been coined, so this approach was referred to as "mass collaboration." The core idea, however, was the same: pose a question, collect multiple responses, and aggregate them (e.g., through majority voting).
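As a rough illustration of the pose-collect-aggregate idea, here is a minimal Python sketch of majority voting over the responses to a single question. It is not taken from any of the papers listed here; the function name `majority_vote`, the example question, and the sample responses are all hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of votes it received.

    Ties are broken in favor of the answer that appears first in `answers`.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical question posed to five users:
# "Does attribute 'phone' in schema A match 'contact_no' in schema B?"
responses = ["yes", "yes", "no", "yes", "no"]
label, support = majority_vote(responses)
print(label, support)
```

In practice one would likely also weight users by reliability or keep asking until the vote margin is large enough; this sketch treats every response equally.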

Building Data Integration Systems via Mass Collaboration, R. McCann, A. Doan, A. Kramnik, and V. Varadarajan. Proc. of the Int. Workshop on Web and Databases (WebDB-03). [94 citations as of 3/31/2026]
Building Data Integration Systems: A Mass Collaboration Approach, A. Doan and R. McCann. Proc. of the IJCAI-03 Workshop on Information Integration on the Web.
Integrating Data from Disparate Sources: A Mass Collaboration Approach, R. McCann, A. Kramnik, W. Shen, V. Varadarajan, O. Sobulo, A. Doan. ICDE-05. Poster.
Matching Schemas in Online Communities: A Web 2.0 Approach, R. McCann, W. Shen, A. Doan. ICDE-08. [187 citations as of 3/31/2026]

Work on Crowdsourcing Knowledge Bases (2005–2009)

Subsequently, I focused on using crowdsourcing to build community-centric knowledge bases, and deployed one such knowledge base, called DBLife.

Community Information Management, A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Databases, 29(1), 2006. Invited. [120 citations as of 3/31/2026]
Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach, P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB-07. [112 citations as of 3/31/2026]
Building Community Wikipedias: A Human-Machine Approach, P. DeRose, X. Chai, B. Gao, W. Shen, A. Doan, P. Bohannon, J. Zhu. ICDE-08.
Efficiently Incorporating User Feedback into Information Extraction and Integration Programs, X. Chai, B. Vuong, A. Doan, J. Naughton. SIGMOD-09.

Surveys

Crowdsourcing Systems on the World-Wide Web, A. Doan, R. Ramakrishnan, A. Halevy. Communications of the ACM, 2011. [2143 citations as of 3/31/2026]

Crowdsourcing Work in Silicon Valley (2010–2015)

I worked extensively on crowdsourcing in industry from 2010 to 2015, first at Kosmix and later at WalmartLabs.

Social Media Analytics: the Kosmix Story, with many authors. IEEE Data Engineering Bulletin, Sept 2013.
Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach, A. Gattani, D. Lamba, N. Garera, M. Tiwari, X. Chai, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. VLDB-13, industrial paper. [slides] [198 citations as of 3/31/2026]
Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches, O. Deshpande, D. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, A. Doan. SIGMOD-13, industrial paper. [slides] [160 citations as of 3/31/2026]
Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing, C. Sun, N. Rampalli, F. Yang, A. Doan. VLDB-14, industrial paper. [148 citations as of 3/31/2026]
Corleone: Hands-Off Crowdsourcing for Entity Matching, C. Gokhale, S. Das, A. Doan, J. Naughton, N. Rampalli, J. Shavlik, J. Zhu. SIGMOD-14. [345 citations as of 3/31/2026]