Professor, Database Group
Department of Computer Sciences, University of Wisconsin
Room 4355, 1210 W. Dayton St, Madison WI 53706
email@example.com, (608) 262 9759, fax: (608) 262 9777
- Jul 2014: The Beckman Report on Database Research is out.
- May 2014: Wrapped up my three-year stint as Chief Scientist of WalmartLabs. That was fun (and hectic).
- Feb 2014: Updated CV.
- Nov 2013: The CS department launched the professional
Master's Program. Applications are being accepted for enrollment in
Fall 2014. A professional certificate program will be launched in May
- Oct 2013: Mike Carey and I co-organized the Beckman Database
Research Self-Assessment Meeting.
- Jul 2012: Morgan Kaufmann published our textbook,
"Principles of Data Integration" (with Alon Halevy and Zack Ives).
- Sep 2012: Returned to UW-Madison, while continuing as Chief Scientist of WalmartLabs.
- Apr 2011: Kosmix was acquired by Walmart and turned into
WalmartLabs. Continued as Chief Scientist of WalmartLabs.
- Jun 2010: Took leave from UW-Madison to work as Chief Scientist of Kosmix.
I work in data integration, data science, and big data. Data
integration discovers and exploits semantic relationships across
disparate data, or combines disparate data into a single unified
database. Data science studies how to explore and analyze data. Big
data refers to the phenomenon where companies rush to acquire and
process a lot of data, to glean new knowledge to drive business. The
three topics are interrelated. In particular, a data science process
often must integrate data, and use big data tools to scale. For these
topics, I am equally interested in developing the science, techniques,
and practical tools. Ongoing projects:
- 2013-?: New directions in data integration: the focus here is a new vision on where the field should go next.
- 2014-?: Magellan, an entity matching management system.
- 2014-?: Data cleaning: how to effectively use humans (e.g., crowd workers, analysts).
- 2014-?: Developing crowdsourced data management systems.
- 2013-?: Developing cloud-based infrastructures for data science and the masses.
- 2013-?: Developing knowledge bases for e-commerce.
Past projects:
- 2010-2011: Work done at Kosmix on social media analytics, information extraction and integration, and Web-scale knowledge bases
- 2005-2010: Cimple/DBLife (Web-scale knowledge bases, community
information management, information extraction and integration, text management)
- 1999-2009: Schema/ontology matching
Selected Awards and Honors
Selected Recent Publications (Google Scholar Entry)
- Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing,
C. Sun, N. Rampalli, F. Yang, A. Doan. VLDB-14, industrial paper.
- Corleone: Hands-off Crowdsourcing for Entity Matching,
C. Gokhale, S. Das, A. Doan, J. Naughton, N. Rampalli, J. Shavlik, J. Zhu.
- Modeling Entity Evolution for Temporal Record Matching,
Y. Chiang, A. Doan, J. Naughton. SIGMOD-14.
- Tracking Entities in the Dynamic World: a Fast Algorithm for Matching Temporal Records,
Y. Chiang, A. Doan, J. Naughton. VLDB-14.
- Social Media Analytics: the Kosmix Story, with many authors.
IEEE Data Engineering Bulletin, Sept 2013.
- Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach,
A. Gattani, D. Lamba, N. Garera, M. Tiwari, X. Chai, S. Das, S. Subramaniam, A. Rajaraman,
V. Harinarayan, A. Doan. VLDB-13, industrial paper.
- Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches,
O. Deshpande, D. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman,
V. Harinarayan, A. Doan. SIGMOD-13, industrial paper.
- Muppet: MapReduce-Style Processing of Fast Data,
W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, A. Doan. VLDB-12, industrial paper.
- Crowdsourcing Systems on the World-Wide Web,
A. Doan, R. Ramakrishnan, A. Halevy. Communications of the ACM.
- Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS,
F. Niu, C. Re, A. Doan, J. Shavlik. VLDB-11.
- Toward Scalable Keyword Search over Relational Data,
A. Baid, I. Rae, J. Li, A. Doan, J. Naughton. VLDB-10.
- Toward Industrial-Strength Keyword Search Systems over Relational Data,
A. Baid, I. Rae, A. Doan, J. Naughton. ICDE-10. Short paper.
- Optimizing Complex Extraction Programs over Evolving Text Data,
F. Chen, B. Gao, A. Doan, J. Yang, R. Ramakrishnan. SIGMOD-09.
- Efficiently Incorporating User Feedback into Information Extraction and Integration Programs,
X. Chai, B. Vuong, A. Doan, J. Naughton. SIGMOD-09.
- Combining Keyword Search and Forms for Ad Hoc Querying of Databases,
E. Chu, A. Baid, X. Chai, A. Doan, J. Naughton. SIGMOD-09.
- The Case for a Structured Approach to Managing Unstructured Data,
A. Doan, J. F. Naughton, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. DeRose, B. Gao,
C. Gokhale, J. Huang, W. Shen, B. Vuong. CIDR-09.
Current students: Sanjib Kumar Das, Paul Suganthan GC
Alumni (and first jobs):
- Ba-Quy Vuong, 2012, WalmartLabs
- Xiaoyong Chai, 2011, Google
- Fei Chen (co-advised with Raghu Ramakrishnan), 2010, HP Research
- Yoonkyong Lee, 2009, Samsung R&D, Korea
- Warren Shen, 2009, Google Research
- Pedro DeRose (co-advised with Raghu Ramakrishnan), 2009, Microsoft
- Doug Burdick (co-advised with Raghu Ramakrishnan), 2007, MITRE
- Wensheng Wu (co-advised with Clement Yu), 2006, IBM Almaden Research
- Robert McCann, 2005, Microsoft
I usually teach CS 784 (Advanced Topics in Data Management) and CS 564 (Introduction to RDBMSs).
- I have been working with Karu Sankaralingam and Jeff Naughton to
set up a professional MS program and a professional certificate program in
Computer Sciences at UW-Madison. The MS program is now up and running
(as of 2014), and I am the current director of both programs.
- I chaired the industrial program at SIGMOD-12 and will co-chair the same program
at VLDB-15. I co-chaired the 2013 Beckman meeting with Mike Carey.
I'm also serving on the ICDE 10-Year Most Influential Paper Award Committee (since 2013).
- I co-authored a data integration textbook with Alon Halevy and Zack Ives in 2012.