AnHai Doan
 Vilas Distinguished Achievement Professor
 Gurindar S. Sohi Professor
 Department of Computer Science, University of Wisconsin
 Room 4355, 1210 W. Dayton St, Madison WI 53706
 anhai@cs.wisc.edu, (608) 262 9759
 Bio/Personal   
Database Group   UW, CS, Living in Madison
News
    Aug 2020: This homepage and the project pages are seriously
    out of date, as way too much has happened in the past two years.
    Jun 2018: The DeepMatcher package, which applies deep learning to EM, has been released as part of Magellan. See
      deepmatcher.ml for the code (and
      here for the paper).
    Jun 2018: Our CloudMatcher/Magellan code is being deployed at American Family Insurance,
      a Fortune 500 company.
    Jun 2018: A short paper
      on a system building agenda for data integration and data science. Invited to IEEE DEB Special Issue on Large-Scale Data Integration.
      (Another invited paper discusses BigGorilla.)
    May 2018: The Magellan VLDB paper received a SIGMOD Research Highlight Award. Here's a shortened version of that paper.
    Dec 2017: Discussed misc issues about UW, CS, and living in Madison.
    Sep 2017: Revised homepage to reflect recent work on data cleaning/integration and data science. 
    Oct 2016: A talk on a system building agenda for data integration (and data science). 
    The Magellan system described below is an example of realizing this agenda for entity matching.
    Jul 2016: Launching Magellan,
    a new project to build an end-to-end entity matching system.
     Old news
Research   (Group's Homepage)
My work has charted new directions or bet on emerging directions that
I believe would become fundamental for data management. Solving
problems in these directions often requires a combination of machine
learning, scalable data management, effective human-data interaction,
and cloud technologies. 
Current directions:
-  Data cleaning & integration:
  I build end-to-end data integration systems as parts of the Python ecosystem of open-source data tools. I also leverage these systems to build cloud/crowd data integration services for lay users.

-  Data science: This direction is increasingly critical to the data
  management community, yet no clear agenda exists today. I'm working
  on an agenda that integrates research, system
  building, education, and outreach. This agenda currently focuses
  on data quality and builds on the above
  work in data cleaning/integration.

- Quick links: DI agenda paper and talk, Magellan homepage and paper,
code (py_entitymatching,
py_stringsimjoin,
py_stringmatching),
data sets,
data science course,
BigGorilla repository of DI tools,
DI textbook
Past directions:
knowledge bases/graphs (2004-2012),
crowdsourcing (2002-2015),
schema/ontology matching (2000-2010).
In between, from 2010 to 2014, I spent
some time in Silicon Valley, putting my work in these directions
to use and learning a ton about doing things "in the wild".
Selected Recent Publications
(DBLP Entry, Google Scholar Entry)
-  Magellan: toward building ecosystems of entity matching
  solutions, AnHai Doan, Pradap Konda, Paul Suganthan G. C., Yash
  Govind, Derek Paulsen, Kaushik Chandrasekhar, Philip Martinkus,
  Matthew Christie, Communications of the ACM, 2020.
  
-  Deep Entity Matching with Pre-Trained Language Models, Yuliang
  Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew
  Tan, VLDB-21.
  
-  Data Curation with Deep Learning, Saravanan Thirumuruganathan,
  Nan Tang, Mourad Ouzzani, AnHai Doan, EDBT-20.
  
-  Manually Detecting Errors for Data Cleaning Using Adaptive
  Crowdsourcing Strategies, Haojun Zhang, Chengliang Chai, AnHai
  Doan, Paris Koutris, Esteban Arcaute, EDBT-20.
  
-  Entity Matching Meets Data Science: A Progress Report from the Magellan Project,
    Y. Govind, P. Konda, and others.
    SIGMOD-19. Industrial paper.
    
-  Executing Entity Matching End to End: A Case Study,
    P. Konda, S. Seshadri, E. Segarra, B. Hueth, A. Doan.
    EDBT-19. Industrial paper.
  
-  Smurf: Self-Service String Matching Using Random Forests,
    P. Suganthan G.C., A. Ardalan, A. Doan, A. Akella.
    VLDB-19.
  
-  CloudMatcher: A
  Hands-Off Cloud/Crowd Service for Entity Matching, Y. Govind,
  E. Paulson, P. Nagarajan, P. Suganthan G.C., A. Doan, Y. Park,
  G. Fung, D. Conanthan, M. Carter, M. Sun. VLDB-18. Demo
  paper. 
  
-  Toward a System Building Agenda for Data Integration (and Data Science),
    A. Doan, P. Konda, P. Suganthan G.C., A. Ardalan, J. Ballard, S. Das,
    Y. Govind, H. Li, P. Martinkus, S. Mudgal, E. Paulson, H. Zhang. IEEE Data Engineering Bulletin, Special Issue
      on Large-Scale Data Integration, 2018. Invited paper. 
  
-  BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration,
    C. Chen, B. Golshan, A. Halevy, W. Tan, A. Doan.
      IEEE Data Engineering Bulletin, Special Issue
      on Large-Scale Data Integration, 2018. Invited paper.
      
  
-  Deep Learning for
  Entity Matching: A Design Space Exploration, S. Mudgal, H. Li,
  T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute,
  V. Raghavendra. SIGMOD-18. Extended version.
  
-  MatchCatcher: A
  Debugger for Blocking in Entity Matching, H. Li, P. Konda,
  P. Suganthan G.C., A. Doan, B. Snyder, Y. Park, G. Krishnan,
  R. Deep, V. Raghavendra. EDBT-18. Extended version, slides.
  
-  Human-in-the-Loop Data Analysis: A Personal Perspective, A. Doan.
    HILDA Workshop @ SIGMOD-18. 
  
-  Magellan: Toward
  Building Entity Matching Management Systems, P. Konda, S. Das,
  P. Suganthan G.C., P. Martinkus, A. Doan, A. Ardalan, J. R. Ballard,
  Y. Govind, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad,
    G. Krishnan, R. Deep, V. Raghavendra. SIGMOD Record, 2018.
    8-page version summarizing the progress on the Magellan system
      by Dec 2017.
    
-  Magellan: Toward Building
  Entity Matching Management Systems, P. Konda, S. Das,
  P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li,
  F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep,
    V. Raghavendra. VLDB-16. extended version,
    slides. the conference and tech report versions.
  
-  Magellan: Toward
  Building Entity Matching Management Systems over Data Science
  Stacks, P. Konda, S. Das, P. Suganthan G.C., A. Doan,
  A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton,
  S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16,
    demo paper. Jupyter notebook & datasets for demo.
    the demo proposal.
  
 
-  CloudMatcher: A Cloud/Crowd Service for Entity Matching,
    Y. Govind, E. Paulson, M. Ashok, P. Suganthan G.C., A. Hitawala, A. Doan, Y. Park, P. Peissig, E. LaRose, J. Badger.
    BIGDAS Workshop @ KDD-17. slides
-  Human-in-the-Loop Challenges for Entity Matching: A Midterm Report,
    A. Doan, A. Ardalan, J. Ballard, S. Das, Y. Govind, P. Konda, H. Li, S. Mudgal, E. Paulson, P. Suganthan G.C., H. Zhang.
    HILDA Workshop @ SIGMOD-17. 
  
  
-  Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services,
    S. Das, P. Suganthan G.C., A. Doan, J. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Y. Park.
    SIGMOD-17. extended version, slides
  
-  Towards Interactive
  Debugging of Rule-Based Entity Matching, F. Panahi, W. Wu,
  A. Doan, J. Naughton, EDBT-17.
    
  
-  The Beckman Report on Database Research,
  with many authors. Communications of the ACM, 2016. extended version
  
-  Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing, 
  C. Sun, N. Rampalli, F. Yang, A. Doan. VLDB-14, industrial paper. slides
  
-  Corleone: Hands-off Crowdsourcing for Entity Matching, 
  C. Gokhale, S. Das, A. Doan, J. Naughton, N. Rampalli, J. Shavlik, J. Zhu. 
  SIGMOD-14. slides, extended report
Selected Awards and Honors
-  Gurindar S. Sohi Professorship, 2020
-  Vilas Distinguished Achievement Professorship, 2018
-  SIGMOD Research Highlight Award, 2017
-  Vilas Associate, UW-Madison, 2016
-  Alfred P. Sloan Research Fellowship, 2007
-  IBM Faculty Award, 2007, 2008
-  NSF CAREER Award, 2004
-  ACM Doctoral Dissertation Award, 2003
-  William Chan Memorial Dissertation Award, Univ. of Washington, 2003
Teaching 
Recent classes include data science at the undergrad and grad
levels, and CS 564 (Introduction to RDBMSs).
Service
  
-  Selected recent service for the data management community:
    -  member, SIGMOD Advisory Board,
    -  member, ICDE 10-Year Most Influential Paper Award Committee,
    -  associate editor, VLDB-16,
    -  co-chair, industrial program, VLDB-15,
    -  co-chair, Beckman meeting (with Mike Carey), 2013,
    -  chair, industrial program, SIGMOD-12,
    -  data integration textbook (with Alon Halevy and Zack Ives), 2012.
-  I spent 3 years (2011-2014) setting up a professional MS program and a
  certificate program in CS at UW-Madison (with help from Karu Sankaralingam,
  Jeff Naughton, and Suman Banerjee). These programs have been highly
  successful, enrolling hundreds of students.
-  I've participated in or led several important strategic
  initiatives for CS at UW-Madison. These include growth to 50
  tenure-track faculty, a re-organization of the department's
  governance structure, and data science initiatives. I have served as an
  associate chair since 2017. I've also worked on several strategic
  initiatives for UW-Madison, including serving on the steering
  committees of several data science centers and initiatives, and
  serving on the 2017-2018 campus task
  force for growth strategies for computing at UW-Madison.
Misc