- Feb 2014: Updated CV.
- Nov 2013: The CS department has launched
the Professional Master's Program. Applications are being accepted
for Fall 2014 enrollment. A professional certificate program will be
launched in May 2014.
- Oct 2013: Mike Carey and I co-organized the Beckman Database
Research Self-Assessment Meeting. A report will be out soon.
- Jul 2012: Morgan Kaufmann published our textbook,
"Principles of Data Integration" (with Alon Halevy and Zack Ives).
- Sep 2012: Returned to UW-Madison, while continuing as Chief Scientist of WalmartLabs.
- Apr 2011: Kosmix was acquired by Walmart and turned into
WalmartLabs. Continued as Chief Scientist of WalmartLabs.
- Jun 2010: Took leave from UW-Madison to work as Chief Scientist of Kosmix.
Databases, AI, and Web. I am especially interested in developing
principles and tools to manage the growing universe of messy and heterogeneous data.
Currently I focus on data integration, data/schema/ontology matching,
information extraction, text management, building knowledge bases, and
crowdsourcing. Ongoing projects:
- 2013-?: Data matching
- 2013-?: Building and using large-scale knowledge bases (joint with various industrial partners)
- 2003-?: Crowdsourcing
- 2010-2011: Work done at Kosmix on social media analytics and Web-scale knowledge bases
- 2005-2010: Cimple/DBLife (Web-scale knowledge bases, community
information management, information extraction & integration, text management)
- 1999-2009: Schema/ontology matching
Selected Recent Publications
(Google Scholar Entry)
- Corleone: Hands-off Crowdsourcing for Entity Matching,
C. Gokhale, S. Das, A. Doan, J. Naughton, R. Rampalli, J. Shavlik, J. Zhu.
- Modeling Entity Evolution for Temporal Record Matching,
Y. Chiang, A. Doan, J. Naughton. SIGMOD-14.
- Tracking Entities in the Dynamic World: a Fast Algorithm for
Matching Temporal Records, Y. Chiang, A. Doan, J. Naughton. VLDB-14.
- Social Media Analytics: the Kosmix Story, with many authors.
IEEE Data Engineering Bulletin, Sept 2013.
- Entity Extraction,
Linking, Classification, and Tagging for Social Media: A
Wikipedia-Based Approach, A. Gattani, D. Lamba, N. Garera,
M. Tiwari, X. Chai, S. Das, S. Subramaniam, A. Rajaraman,
V. Harinarayan, and A. Doan. VLDB-13, industrial paper.
- Building, Maintaining, and Using
Knowledge Bases: A Report from the Trenches, O. Deshpande,
D. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman,
V. Harinarayan, A. Doan. SIGMOD-13, industrial paper.
- Muppet: MapReduce-Style
Processing of Fast Data, W. Lam, L. Liu, S. Prasad,
A. Rajaraman, Z. Vacheri, A. Doan. VLDB-12, industrial paper.
- Crowdsourcing Systems on the World-Wide Web,
A. Doan, R. Ramakrishnan, A. Halevy. Communications of the ACM.
- Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS,
F. Niu, C. Re, A. Doan, J. Shavlik. VLDB-11.
- Toward Scalable
Keyword Search over Relational Data, A. Baid, I. Rae, J. Li,
A. Doan, J. Naughton. VLDB-10.
- Toward Industrial-Strength Keyword Search Systems over Relational Data,
A. Baid, I. Rae, A. Doan, J. Naughton. ICDE-10. Short paper.
- Optimizing Complex
Extraction Programs over Evolving Text Data, F. Chen, B. Gao,
A. Doan, J. Yang, R. Ramakrishnan. SIGMOD-09.
- Efficiently Incorporating User Feedback into Information
Extraction and Integration Programs, X. Chai, B. Vuong, A. Doan, J. Naughton. SIGMOD-09.
- Combining Keyword Search and Forms for Ad Hoc Querying of Databases,
E. Chu, A. Baid, X. Chai, A. Doan, J. Naughton. SIGMOD-09.
- The Case for a Structured Approach to Managing Unstructured Data,
A. Doan, J. F. Naughton, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. DeRose, B. Gao,
C. Gokhale, J. Huang, W. Shen, B. Vuong. CIDR-09.
Sanjib Kumar Das,
Paul Suganthan GC
Alumni (and first jobs):
- Ba-Quy Vuong, 2012, WalmartLabs
- Xiaoyong Chai, 2011, Google
- Fei Chen (co-advised with Raghu Ramakrishnan), 2010, HP Research
- Yoonkyong Lee, 2009, Samsung R&D, Korea
- Warren Shen, 2009, Google Research
- Pedro DeRose (co-advised with Raghu Ramakrishnan), 2009, Microsoft
- Doug Burdick (co-advised with Raghu Ramakrishnan), 2007, MITRE
- Wensheng Wu (co-advised with Clement Yu), 2006, IBM Almaden Research
- Robert McCann, 2005, Microsoft
I usually teach CS 784, Advanced Topics in Data Management, and CS 564, Introduction to RDBMSs.
- I have been working with Karu Sankaralingam and Jeff Naughton to
set up a professional MS program and a professional certificate program in
Computer Sciences at UW-Madison. As of 2014 the MS program is up and running,
and I am the current director of both programs.
- I chaired the industrial program at SIGMOD-12 and will co-chair the same program
at VLDB-15. I co-chaired the 2013 Beckman meeting with Mike Carey.
I'm also serving on the ICDE 10-Year Most Influential Paper Award Committee (since 2013).
- I co-authored a data integration textbook with Alon Halevy and Zack Ives in 2012.