Vilas Distinguished Achievement Professor
Gurindar S. Sohi Professor
Department of Computer Science, University of Wisconsin
Room 4355, 1210 W. Dayton St, Madison WI 53706
email@example.com, (608) 262 9759
Database Group UW, CS, Living in Madison
- Aug 2020: This homepage and project pages have been seriously
out of date, as way too much stuff happened in the past two years.
- Jun 2018: The DeepMatcher package, which applies deep learning to EM, is released as a part of Magellan. See
deepmatcher.ml for the code (and
here for the paper).
- Jun 2018: Our CloudMatcher/Magellan code is
being deployed at American Family Insurance,
a Fortune 500 company.
- Jun 2018: A short paper
on a system building agenda for data integration and data science. Invited to IEEE DEB Special Issue on Large-Scale Data Integration.
(Another invited paper discusses BigGorilla.)
- May 2018:
The Magellan VLDB paper received a SIGMOD Research Highlight Award. Here's a shortened version of that paper.
- Dec 2017: Discussed misc issues about UW, CS, and living in Madison.
- Sep 2017: Revised homepage to reflect recent work on data cleaning/integration and data science.
- Oct 2016: A talk on a system building agenda for data integration (and data science).
The Magellan system described below is an example of realizing this agenda for entity matching.
- Jul 2016: Launching Magellan,
a new project to build an end-to-end entity matching system.
- Old news
Research (Group's Homepage)
My work has charted new directions or bet on emerging directions that
I believe would become fundamental for data management. Solving
problems in these directions often requires a combination of machine
learning, scalable data management, effective human-data interaction,
and cloud technologies.
Earlier directions include knowledge bases/graphs (2004-2012) and
schema/ontology matching (2000-2010).
In between, from 2010 to 2014, I spent
some time in Silicon Valley, putting my work in these directions
to use, and learning a ton about doing things "in the wild".
- Data cleaning & integration:
I build end-to-end data integration systems as parts of the Python ecosystem of open-source data tools. I also leverage these systems to build cloud/crowd data integration services for lay users.
- Data science: This direction is increasingly critical to the data
management community, yet no clear agenda exists today. I'm working
on an agenda that integrates research, system
building, education, and outreach. This agenda currently focuses
on data quality and builds on the above
work in data cleaning/integration.
- Quick links: DI agenda paper and talk, Magellan homepage and paper,
data science course,
BigGorilla repository of DI tools,
Selected Recent Publications (Google Scholar Entry)
- Magellan: toward building ecosystems of entity matching
solutions, AnHai Doan, Pradap Konda, Paul Suganthan G. C., Yash
Govind, Derek Paulsen, Kaushik Chandrasekhar, Philip Martinkus,
Matthew Christie, Communications of the ACM, 2020.
- Deep Entity Matching with Pre-Trained Language Models, Yuliang
Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew
- Data Curation with Deep Learning, Saravanan Thirumuruganathan,
Nan Tang, Mourad Ouzzani, AnHai Doan, EDBT-20.
- Manually Detecting Errors for Data Cleaning Using Adaptive
Crowdsourcing Strategies, Haojun Zhang, Chengliang Chai, AnHai
Doan, Paris Koutris, Esteban Arcaute, EDBT-20.
- Entity Matching Meets Data Science: A Progress Report from the Magellan Project,
Y. Govind, P. Konda, and others.
SIGMOD-19. Industrial paper.
- Executing Entity Matching End to End: A Case Study,
P. Konda, S. Seshadri, E. Segarra, B. Hueth, A. Doan.
EDBT-19. Industrial paper.
- Smurf: Self-Service String Matching Using Random Forests,
P. Suganthan G.C., A. Ardalan, A. Doan, A. Akella.
- CloudMatcher: A
Hands-Off Cloud/Crowd Service for Entity Matching, Y. Govind,
E. Paulson, P. Nagarajan, P. Suganthan G.C., A. Doan, Y. Park,
G. Fung, D. Conanthan, M. Carter, M. Sun. VLDB-18. Demo
- Toward a System Building Agenda for Data Integration (and Data Science),
A. Doan, P. Konda, P. Suganthan G.C., A. Ardalan, J. Ballard, S. Das,
Y. Govind, H. Li, P. Martinkus, S. Mudgal, E. Paulson, H. Zhang. IEEE Data Engineering Bulletin, Special Issue
on Large-Scale Data Integration, 2018. Invited paper.
- BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration,
C. Chen, B. Golshan, A. Halevy, W. Tan, A. Doan.
IEEE Data Engineering Bulletin, Special Issue
on Large-Scale Data Integration, 2018. Invited paper.
- Deep Learning for
Entity Matching: A Design Space Exploration, S. Mudgal, H. Li,
T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute,
V. Raghavendra. SIGMOD-18. extended
- MatchCatcher: A
Debugger for Blocking in Entity Matching, H. Li, P. Konda,
P. Suganthan G.C., A. Doan, B. Snyder, Y. Park, G. Krishnan,
V. Raghavendra. EDBT-18. extended
- Human-in-the-Loop Data Analysis: A Personal Perspective, A. Doan.
HILDA Workshop @ SIGMOD-18.
- Magellan: Toward
Building Entity Matching Management Systems, P. Konda, S. Das,
P. Suganthan G.C., P. Martinkus, A. Doan, A. Ardalan, J. R. Ballard,
Y. Govind, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad,
G. Krishnan, R. Deep, V. Raghavendra. SIGMOD Record, 2018.
8-page version summarizing the progress on the Magellan system
by Dec 2017.
- Magellan: Toward Building
Entity Matching Management Systems, P. Konda, S. Das,
P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li,
F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep,
V. Raghavendra. VLDB-16. extended version,
slides. the conference and tech report versions.
- Magellan: Toward
Building Entity Matching Management Systems over Data Science
Stacks, P. Konda, S. Das, P. Suganthan G.C., A. Doan,
A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton,
S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16,
demo paper. Jupyter notebook & datasets for demo.
the demo proposal.
- CloudMatcher: A Cloud/Crowd Service for Entity Matching,
Y. Govind, E. Paulson, M. Ashok, P. Suganthan G.C., A. Hitawala, A. Doan, Y. Park, P. Peissig, E. LaRose, J. Badger.
BIGDAS Workshop @ KDD-17. slides
- Human-in-the-Loop Challenges for Entity Matching: A Midterm Report,
A. Doan, A. Ardalan, J. Ballard, S. Das, Y. Govind, P. Konda, H. Li, S. Mudgal, E. Paulson, P. Suganthan G.C., H. Zhang.
HILDA Workshop @ SIGMOD-17.
- Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services,
S. Das, P. Suganthan G.C., A. Doan, J. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Y. Park.
SIGMOD-17. extended version, slides
- Towards Interactive
Debugging of Rule-Based Entity Matching, F. Panahi, W. Wu,
A. Doan, J. Naughton, EDBT-17.
- The Beckman Report on Database Research,
with many authors. Communications of the ACM, 2016. extended version
- Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing,
C. Sun, N. Rampalli, F. Yang, A. Doan. VLDB-14, industrial paper. slides
- Corleone: Hands-off Crowdsourcing for Entity Matching,
C. Gokhale, S. Das, A. Doan, J. Naughton, N. Rampalli, J. Shavlik, J. Zhu.
SIGMOD-14. slides, extended report
Selected Awards and Honors
- Gurindar S. Sohi Professorship, 2020
- Vilas Distinguished Achievement Professorship, 2018
- SIGMOD Research Highlight Award, 2017
- Vilas Associate, UW-Madison, 2016
- Alfred P. Sloan Research Fellowship, 2007
- IBM Faculty Award, 2007, 2008
- NSF CAREER Award, 2004
- ACM Doctoral Dissertation Award, 2003
- William Chan Memorial Dissertation Award, Univ. of Washington, 2003
Recent classes include data science at the undergrad
levels, and CS 564 (Introduction to RDBMSs).
- Selected recent service for the data management community:
- member, SIGMOD Advisory Board,
- member, ICDE 10-Year Most Influential Paper Award Committee,
- associate editor, VLDB-16,
- co-chair, industrial program, VLDB-15,
- co-chair, Beckman meeting (with Mike Carey), 2013,
- chair, industrial program, SIGMOD-12,
- data integration textbook (with Alon Halevy and Zack Ives), 2012.
- I spent 3 years (2011-2014) setting up a
professional MS program and a
program in CS at UW-Madison (with help from Karu Sankaralingam,
Jeff Naughton, and Suman Banerjee). These programs have been highly
successful, enrolling hundreds of students.
- I've participated in or led several important strategic
initiatives for CS at UW-Madison. These include growth to 50
tenure-track faculty, a re-organization of the department's
governance structure, and data science initiatives. I have served as an
associate chair since 2017. I've also worked on several strategic
initiatives for UW-Madison, including serving on the steering
committees of several data science centers and initiatives, and
serving on the 2017-2018 campus task
force for growth strategies for computing at UW-Madison.