AnHai Doan's HomePage

      AnHai Doan

      Vilas Distinguished Achievement Professor
      Gurindar S. Sohi Professor
      Department of Computer Science, University of Wisconsin
      Room 4355, 1210 W. Dayton St, Madison WI 53706
      anhai@cs.wisc.edu, (608) 262 9759
      Bio/Personal    Database Group   UW, CS, Living in Madison

News

Aug 2020: This homepage and the project pages have been seriously out of date, as way too much has happened in the past two years:
- I was heavily involved in the effort to set up the School of Computer, Data, and Information Sciences at UW-Madison (for which I received the CS Department's Service Award, jointly with two colleagues).
- I co-chaired SIGMOD-2020 (with Wang-Chiew Tan), a full-time job in itself for a year.
- I co-founded GreenBay Technologies to commercialize Magellan. GreenBay has been acquired by Informatica.
Jun 2018: The DeepMatcher package, which applies deep learning to EM, is released as a part of Magellan. See deepmatcher.ml for the code (and here for the paper).
Jun 2018: Our CloudMatcher/Magellan code is being deployed at American Family Insurance, a Fortune 500 company.
Jun 2018: A short paper on a system building agenda for data integration and data science. Invited to IEEE DEB Special Issue on Large-Scale Data Integration. (Another invited paper discusses BigGorilla.)
May 2018: The Magellan VLDB paper received a SIGMOD Research Highlight Award. Here's a shortened version of that paper.
Dec 2017: Discussed misc issues about UW, CS, and living in Madison.
Sep 2017: Revised homepage to reflect recent work on data cleaning/integration and data science.
Oct 2016: A talk on a system building agenda for data integration (and data science). The Magellan system described below is an example of realizing this agenda for entity matching.
Jul 2016: Launching Magellan, a new project to build an end-to-end entity matching system.
Old news

Research (Group's Homepage)

My work has charted new directions or bet on emerging directions that I believe would become fundamental for data management. Solving problems in these directions often requires a combination of machine learning, scalable data management, effective human-data interaction, and cloud technologies. Current directions:

Data cleaning & integration: I build end-to-end data integration systems as parts of the Python ecosystem of open-source data tools. I also leverage these systems to build cloud/crowd data integration services for lay users.
Data science: This direction is increasingly critical to the data management community, yet no clear agenda exists today. I'm working on an agenda that integrates research, system building, education, and outreach. This agenda currently focuses on data quality and builds on the above work in data cleaning/integration.
Quick links: DI agenda paper and talk, Magellan homepage and paper, code (py_entitymatching, py_stringsimjoin, py_stringmatching), data sets, data science course, BigGorilla repository of DI tools, DI textbook

Past directions: knowledge bases/graphs (2004-2012), crowdsourcing (2002-2015), schema/ontology matching (2000-2010). In between, from 2010 to 2014, I spent some time in Silicon Valley, putting my work in these directions to use and learning a ton about doing things "in the wild".

Selected Recent Publications (DBLP Entry Google Scholar Entry)

Magellan: toward building ecosystems of entity matching solutions, AnHai Doan, Pradap Konda, Paul Suganthan G. C., Yash Govind, Derek Paulsen, Kaushik Chandrasekhar, Philip Martinkus, Matthew Christie, Communications of the ACM, 2020.
Deep Entity Matching with Pre-Trained Language Models, Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan. VLDB-21.
Data Curation with Deep Learning, Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, AnHai Doan. EDBT-20.
Manually Detecting Errors for Data Cleaning Using Adaptive Crowdsourcing Strategies, Haojun Zhang, Chengliang Chai, AnHai Doan, Paris Koutris, Esteban Arcaute. EDBT-20.
Entity Matching Meets Data Science: A Progress Report from the Magellan Project, Y. Govind, P. Konda, and others. SIGMOD-19. Industrial paper.
Executing Entity Matching End to End: A Case Study, P. Konda, S. Seshadri, E. Segarra, B. Hueth, A. Doan. EDBT-19. Industrial paper.
Smurf: Self-Service String Matching Using Random Forests, P. Suganthan G.C., A. Ardalan, A. Doan, A. Akella. VLDB-19.
CloudMatcher: A Hands-Off Cloud/Crowd Service for Entity Matching, Y. Govind, E. Paulson, P. Nagarajan, P. Suganthan G.C., A. Doan, Y. Park, G. Fung, D. Conanthan, M. Carter, M. Sun. VLDB-18. Demo paper.
Toward a System Building Agenda for Data Integration (and Data Science), A. Doan, P. Konda, P. Suganthan G.C., A. Ardalan, J. Ballard, S. Das, Y. Govind, H. Li, P. Martinkus, S. Mudgal, E. Paulson, H. Zhang. IEEE Data Engineering Bulletin, Special Issue on Large-Scale Data Integration, 2018. Invited paper.
BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration, C. Chen, B. Golshan, A. Halevy, W. Tan, A. Doan. IEEE Data Engineering Bulletin, Special Issue on Large-Scale Data Integration, 2018. Invited paper.
Deep Learning for Entity Matching: A Design Space Exploration, S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra. SIGMOD-18. extended version.
MatchCatcher: A Debugger for Blocking in Entity Matching, H. Li, P. Konda, P. Suganthan G.C., A. Doan, B. Snyder, Y. Park, G. Krishnan, R. Deep, V. Raghavendra. EDBT-18. extended version, slides.
Human-in-the-Loop Data Analysis: A Personal Perspective, A. Doan. HILDA Workshop @ SIGMOD-18.
Magellan: Toward Building Entity Matching Management Systems, P. Konda, S. Das, P. Suganthan G.C., P. Martinkus, A. Doan, A. Ardalan, J. R. Ballard, Y. Govind, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. SIGMOD Record, 2018. 8-page version summarizing the progress on the Magellan system by Dec 2017.
- Magellan: Toward Building Entity Matching Management Systems, P. Konda, S. Das, P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16. extended version, slides. the conference and tech report versions.
- Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks, P. Konda, S. Das, P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16, demo paper. Jupyter notebook & datasets for demo. the demo proposal.
CloudMatcher: A Cloud/Crowd Service for Entity Matching, Y. Govind, E. Paulson, M. Ashok, P. Suganthan G.C., A. Hitawala, A. Doan, Y. Park, P. Peissig, E. LaRose, J. Badger. BIGDAS Workshop @ KDD-17. slides
Human-in-the-Loop Challenges for Entity Matching: A Midterm Report, A. Doan, A. Ardalan, J. Ballard, S. Das, Y. Govind, P. Konda, H. Li, S. Mudgal, E. Paulson, P. Suganthan G.C., H. Zhang. HILDA Workshop @ SIGMOD-17.
Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services, S. Das, P. Suganthan G.C., A. Doan, J. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Y. Park. SIGMOD-17. extended version, slides
Towards Interactive Debugging of Rule-Based Entity Matching, F. Panahi, W. Wu, A. Doan, J. Naughton. EDBT-17.
The Beckman Report on Database Research, with many authors. Communications of the ACM, 2016. extended version
Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing, C. Sun, N. Rampalli, F. Yang, A. Doan. VLDB-14, industrial paper. slides
Corleone: Hands-off Crowdsourcing for Entity Matching, C. Gokhale, S. Das, A. Doan, J. Naughton, N. Rampalli, J. Shavlik, J. Zhu. SIGMOD-14. slides, extended report

Selected Awards and Honors

Gurindar S. Sohi Professorship, 2020
Vilas Distinguished Achievement Professorship, 2018
SIGMOD Research Highlight Award, 2017
Vilas Associate, UW-Madison, 2016
Alfred P. Sloan Research Fellowship, 2007
IBM Faculty Award, 2007, 2008
NSF CAREER Award, 2004
ACM Doctoral Dissertation Award, 2003
William Chan Memorial Dissertation Award, Univ. of Washington, 2003

Teaching

Recent classes include data science at the undergrad and grad levels, and CS 564 (Introduction to RDBMSs).

Service

Selected recent service for the data management community:
- member, SIGMOD Advisory Board,
- member, ICDE 10-Year Most Influential Paper Award Committee,
- associate editor, VLDB-16,
- co-chair, industrial program, VLDB-15,
- co-chair, Beckman meeting (with Mike Carey), 2013,
- chair, industrial program, SIGMOD-12,
- data integration textbook (with Alon Halevy and Zack Ives), 2012.
I spent 3 years (2011-2014) setting up a professional MS program and a certificate program in CS at UW-Madison (with help from Karu Sankaralingam, Jeff Naughton, and Suman Banerjee). These programs have been highly successful, enrolling hundreds of students.
I've participated in or led several important strategic initiatives for CS at UW-Madison. These include growth to 50 tenure-track faculty, a re-organization of the department's governance structure, and data science initiatives. I have served as an associate chair since 2017. I've also worked on several strategic initiatives for UW-Madison, including serving on the steering committees of several data science centers and initiatives, and serving on the 2017-2018 campus task force for growth strategies for computing at UW-Madison.

Misc

For advice on research, communication, job hunt, and more, see the mother of all advice collections by Tao Xie at Illinois.
My academic job application package (2002) and academic job talk slides (2003).