AnHai Doan
Vilas Distinguished Achievement Professor
Gurindar S. Sohi Professor
Department of Computer Science, University of Wisconsin
Room 4355, 1210 W. Dayton St, Madison WI 53706
anhai@cs.wisc.edu, (608) 262 9759
Bio/Personal
Database Group UW, CS, Living in Madison
News
- Aug 2020: This homepage and the project pages are seriously
out of date, as way too much has happened in the past two years.
- Jun 2018: The DeepMatcher package, which applies deep learning to entity matching (EM), has been released as part of Magellan. See deepmatcher.ml for the code (and here for the paper).
- Jun 2018: Our CloudMatcher/Magellan code is being deployed at American Family Insurance, a Fortune 500 company.
- Jun 2018: A short paper
on a system building agenda for data integration and data science. Invited to IEEE DEB Special Issue on Large-Scale Data Integration.
(Another invited paper discusses BigGorilla.)
- May 2018: The Magellan VLDB paper received a SIGMOD Research Highlight Award. Here's a shortened version of that paper.
- Dec 2017: Discussed misc issues about UW, CS, and living in Madison.
- Sep 2017: Revised homepage to reflect recent work on data cleaning/integration and data science.
- Oct 2016: A talk on a system building agenda for data integration (and data science).
The Magellan system described below is an example of realizing this agenda for entity matching.
- Jul 2016: Launching Magellan,
a new project to build an end-to-end entity matching system.
- Old news
Research (Group's Homepage)
My work has charted new directions or bet on emerging directions that
I believe would become fundamental for data management. Solving
problems in these directions often requires a combination of machine
learning, scalable data management, effective human-data interaction,
and cloud technologies.
Current directions:
- Data cleaning & integration:
I build end-to-end data integration systems as parts of the Python ecosystem of open-source data tools. I also leverage these systems to build cloud/crowd data integration services for lay users.
- Data science: This direction is increasingly critical to the data management community, yet no clear agenda exists today. I'm working on an agenda that integrates research, system building, education, and outreach. This agenda currently focuses on data quality and builds on the above work in data cleaning/integration.
- Quick links: DI agenda paper and talk, Magellan homepage and paper,
code (py_entitymatching,
py_stringsimjoin,
py_stringmatching),
data sets,
data science course,
BigGorilla repository of DI tools,
DI textbook
Past directions:
knowledge bases/graphs (2004-2012),
crowdsourcing (2002-2015),
schema/ontology matching (2000-2010).
In between, from 2010 to 2014, I spent some time in Silicon Valley,
putting my work in these directions to use and learning a ton about
doing things "in the wild".
Selected Recent Publications
(DBLP Entry, Google Scholar Entry)
- Magellan: toward building ecosystems of entity matching
solutions, AnHai Doan, Pradap Konda, Paul Suganthan G. C., Yash
Govind, Derek Paulsen, Kaushik Chandrasekhar, Philip Martinkus,
Matthew Christie, Communications of the ACM, 2020.
- Deep Entity Matching with Pre-Trained Language Models, Yuliang
Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew
Tan, VLDB-21
- Data Curation with Deep Learning, Saravanan Thirumuruganathan,
Nan Tang, Mourad Ouzzani, AnHai Doan, EDBT-20
- Manually Detecting Errors for Data Cleaning Using Adaptive
Crowdsourcing Strategies, Haojun Zhang, Chengliang Chai, AnHai
Doan, Paris Koutris, Esteban Arcaute, EDBT-20
- Entity Matching Meets Data Science: A Progress Report from the Magellan Project,
Y. Govind, P. Konda, and others.
SIGMOD-19. Industrial paper.
- Executing Entity Matching End to End: A Case Study,
P. Konda, S. Seshadri, E. Segarra, B. Hueth, A. Doan.
EDBT-19. Industrial paper.
- Smurf: Self-Service String Matching Using Random Forests,
P. Suganthan G.C., A. Ardalan, A. Doan, A. Akella.
VLDB-19.
- CloudMatcher: A
Hands-Off Cloud/Crowd Service for Entity Matching, Y. Govind,
E. Paulson, P. Nagarajan, P. Suganthan G.C., A. Doan, Y. Park,
G. Fung, D. Conanthan, M. Carter, M. Sun. VLDB-18. Demo
paper.
- Toward a System Building Agenda for Data Integration (and Data Science),
A. Doan, P. Konda, P. Suganthan G.C., A. Ardalan, J. Ballard, S. Das,
Y. Govind, H. Li, P. Martinkus, S. Mudgal, E. Paulson, H. Zhang. IEEE Data Engineering Bulletin, Special Issue
on Large-Scale Data Integration, 2018. Invited paper.
- BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration,
C. Chen, B. Golshan, A. Halevy, W. Tan, A. Doan.
IEEE Data Engineering Bulletin, Special Issue
on Large-Scale Data Integration, 2018. Invited paper.
- Deep Learning for
Entity Matching: A Design Space Exploration, S. Mudgal, H. Li,
T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute,
V. Raghavendra. SIGMOD-18. extended
version.
- MatchCatcher: A
Debugger for Blocking in Entity Matching, H. Li, P. Konda,
P. Suganthan G.C., A. Doan, B. Snyder, Y. Park, G. Krishnan,
R. Deep,
V. Raghavendra. EDBT-18. extended
version, slides.
- Human-in-the-Loop Data Analysis: A Personal Perspective, A. Doan.
HILDA Workshop @ SIGMOD-18.
- Magellan: Toward
Building Entity Matching Management Systems, P. Konda, S. Das,
P. Suganthan G.C., P. Martinkus, A. Doan, A. Ardalan, J. R. Ballard,
Y. Govind, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad,
G. Krishnan, R. Deep, V. Raghavendra. SIGMOD Record, 2018.
8-page version summarizing the progress on the Magellan system
by Dec 2017.
- Magellan: Toward Building
Entity Matching Management Systems, P. Konda, S. Das,
P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li,
F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep,
V. Raghavendra. VLDB-16. Extended version, slides, and
the conference and tech report versions.
- Magellan: Toward
Building Entity Matching Management Systems over Data Science
Stacks, P. Konda, S. Das, P. Suganthan G.C., A. Doan,
A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton,
S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16,
demo paper. Jupyter notebook & datasets for demo, and
the demo proposal.
- CloudMatcher: A Cloud/Crowd Service for Entity Matching,
Y. Govind, E. Paulson, M. Ashok, P. Suganthan G.C., A. Hitawala, A. Doan, Y. Park, P. Peissig, E. LaRose, J. Badger.
BIGDAS Workshop @ KDD-17. slides
- Human-in-the-Loop Challenges for Entity Matching: A Midterm Report,
A. Doan, A. Ardalan, J. Ballard, S. Das, Y. Govind, P. Konda, H. Li, S. Mudgal, E. Paulson, P. Suganthan G.C., H. Zhang.
HILDA Workshop @ SIGMOD-17.
- Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services,
S. Das, P. Suganthan G.C., A. Doan, J. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Y. Park.
SIGMOD-17. extended version, slides
- Towards Interactive
Debugging of Rule-Based Entity Matching, F. Panahi, W. Wu,
A. Doan, J. Naughton, EDBT-17.
- The Beckman Report on Database Research,
with many authors. Communications of the ACM, 2016. extended version
- Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing,
C. Sun, N. Rampalli, F. Yang, A. Doan. VLDB-14, industrial paper. slides
- Corleone: Hands-off Crowdsourcing for Entity Matching,
C. Gokhale, S. Das, A. Doan, J. Naughton, N. Rampalli, J. Shavlik, J. Zhu.
SIGMOD-14. slides, extended report
Selected Awards and Honors
- Gurindar S. Sohi Professorship, 2020
- Vilas Distinguished Achievement Professorship, 2018
- SIGMOD Research Highlight Award, 2017
- Vilas Associate, UW-Madison, 2016
- Alfred P. Sloan Research Fellowship, 2007
- IBM Faculty Award, 2007, 2008
- NSF CAREER Award, 2004
- ACM Doctoral Dissertation Award, 2003
- William Chan Memorial Dissertation Award, Univ. of Washington, 2003
Teaching
Recent classes include data science at the undergrad and grad levels, and CS 564 (Introduction to RDBMSs).
Service
- Selected recent service for the data management community:
- member, SIGMOD Advisory Board,
- member, ICDE 10-Year Most Influential Paper Award Committee,
- associate editor, VLDB-16,
- co-chair, industrial program, VLDB-15,
- co-chair, Beckman meeting (with Mike Carey), 2013,
- chair, industrial program, SIGMOD-12,
- data integration textbook (with Alon Halevy and Zack Ives), 2012.
- I spent 3 years (2011-2014) setting up a professional MS program
and a certificate program in CS at UW-Madison (with help from
Karu Sankaralingam, Jeff Naughton, and Suman Banerjee). These
programs have been highly successful, enrolling hundreds of students.
- I've participated in or led several important strategic
initiatives for CS at UW-Madison. These include growth to 50
tenure-track faculty, a reorganization of the department's
governance structure, and data science initiatives. I have served as
an associate chair since 2017. I've also worked on several strategic
initiatives for UW-Madison, including serving on the steering
committees of several data science centers and initiatives, and
serving on the 2017-2018 campus task force for growth strategies
for computing at UW-Madison.
Misc