Knowledge Graphs (2004–2015)
This project was among the first in the database research community to study how to build knowledge graphs, which were commonly called knowledge bases at the time.
We focused on building community knowledge bases. Our prototype was DBLife, a community knowledge base for the database research community.
This work influenced my subsequent work at Kosmix and WalmartLabs, where I worked on knowledge graphs for social media analytics and e-commerce.
Community Knowledge Bases
Vision and System
- Community Information Management. A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, W. Shen. IEEE Data Eng. Bull. 29(1): 64–72 (2006) [120 citations as of 3/31/2026]
- Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach. P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB 2007 [112 citations as of 3/31/2026]
- DBLife: A Community Information Management Platform for the Database Research Community (Demo). P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan. CIDR 2007 [111 citations as of 3/31/2026]
- User-Centric Research Challenges in Community Information Management Systems. A. Doan, P. Bohannon, R. Ramakrishnan, X. Chai, P. DeRose, B. J. Gao, W. Shen. IEEE Data Eng. Bull. 30(2): 32–40 (2007)
- Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS. F. Niu, C. Ré, A. Doan, J. W. Shavlik. Proc. VLDB Endow. 4(6): 373–384 (2011) [364 citations as of 3/31/2026]
Information Extraction
- Managing information extraction: state of the art and research directions. A. Doan, R. Ramakrishnan, S. Vaithyanathan. SIGMOD 2006 [115 citations as of 3/31/2026]
- Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. W. Shen, A. Doan, J. F. Naughton, R. Ramakrishnan. VLDB 2007 [268 citations as of 3/31/2026]
- A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. E. Chu, A. Baid, T. Chen, A. Doan, J. F. Naughton. VLDB 2007
- Efficient Information Extraction over Evolving Text Data. F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE 2008
- Toward best-effort information extraction. W. Shen, P. DeRose, R. McCann, A. Doan, R. Ramakrishnan. SIGMOD 2008
- Information extraction challenges in managing unstructured data. A. Doan, J. F. Naughton, R. Ramakrishnan, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. DeRose, B. J. Gao, C. Gokhale, J. Huang, W. Shen, B. Vuong. SIGMOD Rec. 37(4): 14–20 (2008) [106 citations as of 3/31/2026]
- On the provenance of non-answers to queries over extracted data. J. Huang, T. Chen, A. Doan, J. F. Naughton. Proc. VLDB Endow. 1(1): 736–747 (2008) [216 citations as of 3/31/2026]
- Optimizing complex extraction programs over evolving text data. F. Chen, B. J. Gao, A. Doan, J. Yang, R. Ramakrishnan. SIGMOD 2009
- Join Optimization of Information Extraction Output: Quality Matters! A. Jain, P. G. Ipeirotis, A. Doan, L. Gravano. ICDE 2009
Entity Matching
- Source-aware Entity Matching: A Compositional Approach. W. Shen, P. DeRose, L. H. Vu, A. Doan, R. Ramakrishnan. ICDE 2007
Crowdsourcing
- Building Community Wikipedias: A Machine-Human Partnership Approach. P. DeRose, X. Chai, B. J. Gao, W. Shen, A. Doan, P. Bohannon, X. Zhu. ICDE 2008
- Matching Schemas in Online Communities: A Web 2.0 Approach. R. McCann, W. Shen, A. Doan. ICDE 2008 [187 citations as of 3/31/2026]
- Efficiently incorporating user feedback into information extraction and integration programs. X. Chai, B. Vuong, A. Doan, J. F. Naughton. SIGMOD 2009
Querying
- Combining keyword search and forms for ad hoc querying of databases. E. Chu, A. Baid, X. Chai, A. Doan, J. F. Naughton. SIGMOD 2009 [149 citations as of 3/31/2026]
- Toward industrial-strength keyword search systems over relational data. A. Baid, I. Rae, A. Doan, J. F. Naughton. ICDE 2010
- Toward Scalable Keyword Search over Relational Data. A. Baid, I. Rae, J. Li, A. Doan, J. F. Naughton. Proc. VLDB Endow. 3(1): 140–149 (2010) [96 citations as of 3/31/2026]
Knowledge Bases for Social Media Analytics
From 2010 to 2011, I was on leave from UW-Madison, working as Chief Scientist of Kosmix, a social media analytics startup.
I worked on several projects that built Web-scale knowledge bases for social media analytics. Parts of these projects are described in the following papers:
- Social Media Analytics: the Kosmix Story, with many authors. IEEE Data Engineering Bulletin, Sept 2013.
- Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach, A. Gattani, D. Lamba, N. Garera, M. Tiwari, X. Chai, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. VLDB-13, industrial paper (slides available). [198 citations as of 3/31/2026]
- Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches, O. Deshpande, D. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, A. Doan. SIGMOD-13, industrial paper (slides available). [160 citations as of 3/31/2026]
- Muppet: MapReduce-Style Processing of Fast Data, W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, A. Doan. VLDB-12, industrial paper. [199 citations as of 3/31/2026]
I also worked on event detection and monitoring for social media. The following talk gives a high-level overview of the work at Kosmix: Social Media, Data Integration, and Human Computation.
Knowledge Bases for E-Commerce
Kosmix was acquired by Walmart in 2011 and turned into WalmartLabs, the research and development lab for e-commerce at Walmart.
From 2011 to 2014, I was Chief Scientist of WalmartLabs, where I worked on building product knowledge bases, focusing on product matching, information extraction, data/rule cleaning, and crowdsourcing. Parts of these projects are described in the following papers:
- Corleone: Hands-Off Crowdsourcing for Entity Matching. C. Gokhale, S. Das, A. Doan, J. Naughton, N. Rampalli, J. Shavlik, J. Zhu. SIGMOD 2014 [345 citations as of 3/31/2026]
- Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing. C. Sun, N. Rampalli, F. Yang, A. Doan. Proc. VLDB Endow. 7(13): 1529–1540 (2014) [148 citations as of 3/31/2026]
- Why Big Data Industrial Systems Need Rules and What We Can Do About It. P. Suganthan G. C., C. Sun, K. Gayatri K., H. Zhang, F. Yang, N. Rampalli, S. Prasad, E. Arcaute, G. Krishnan, R. Deep, V. Raghavendra, A. Doan. SIGMOD 2015
The work at WalmartLabs motivated the Magellan project on entity matching, which is still ongoing.