AI-Driven Data Catalog Management Systems
Modern organizations, including companies, scientific domains, and government agencies, increasingly rely on data catalogs for two key reasons.
- Data Discovery. Organizations often manage a large number of datasets, making it difficult to identify the right ones for a data science or AI project. A data catalog addresses this challenge by profiling datasets and constructing a catalog graph that captures their metadata and relationships. Users can then explore this graph to discover relevant datasets through browsing, keyword search, and natural language queries.
- Data Governance. As data becomes a critical organizational asset, effective governance is essential. Data catalogs play a central role by helping organizations track their datasets, improve data quality, protect against loss and misuse, and conform to government regulations.
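To make the data discovery idea concrete, here is a minimal sketch of profiling datasets, linking them into a catalog graph, and running a keyword search over the result. It is an illustration only, not SmartCat's implementation: the names `DatasetProfile`, `profile_dataset`, `build_catalog_graph`, and `keyword_search` are hypothetical, datasets are assumed to be small in-memory lists of rows, and "relationship" is simplified to "shares at least one column name".

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Metadata recorded for one dataset (hypothetical schema)."""
    name: str
    columns: set
    row_count: int
    description: str = ""


def profile_dataset(name, rows, description=""):
    """Profile a dataset given as a list of dict rows."""
    columns = set()
    for row in rows:
        columns.update(row.keys())
    return DatasetProfile(name, columns, len(rows), description)


def build_catalog_graph(profiles):
    """Link two datasets when they share at least one column name.

    Returns an adjacency map: dataset name -> {neighbor name: shared columns}.
    This is a simplifying assumption; real catalogs use richer relationships.
    """
    graph = {p.name: {} for p in profiles}
    for i, a in enumerate(profiles):
        for b in profiles[i + 1:]:
            shared = a.columns & b.columns
            if shared:
                graph[a.name][b.name] = shared
                graph[b.name][a.name] = shared
    return graph


def keyword_search(profiles, keyword):
    """Return sorted names of datasets whose name, description,
    or column names contain the keyword (case-insensitive)."""
    kw = keyword.lower()
    hits = []
    for p in profiles:
        text = " ".join([p.name, p.description, *sorted(p.columns)]).lower()
        if kw in text:
            hits.append(p.name)
    return sorted(hits)


# Example with three toy datasets (made-up data for illustration).
profiles = [
    profile_dataset("patients", [{"patient_id": 1, "age": 40}], "Patient demographics"),
    profile_dataset("visits", [{"visit_id": 7, "patient_id": 1}], "Clinic visits"),
    profile_dataset("weather", [{"station": "MSN", "temp_c": 3.5}], "Daily weather readings"),
]
graph = build_catalog_graph(profiles)
print(graph["patients"])                    # neighbors of "patients" in the graph
print(keyword_search(profiles, "patient"))  # datasets mentioning "patient"
```

In this sketch, browsing corresponds to following edges in `graph`, and keyword search scans the profiled metadata rather than the underlying data. Natural language queries would require an additional layer that is not shown here.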
Much academic research has addressed individual components of data catalog systems, but little has focused on integrating these advances into end-to-end solutions. SmartCat is a new project (started in 2025) at UW–Madison that aims to bridge this gap. SmartCat is distinguished by:
- Building open-source, end-to-end data catalog systems
- Deploying with real users and learning from real-world feedback
- Targeting domain science, government, and small and medium-sized business (SMB) settings
- Exploring the role of generative AI in catalog construction
- Combining existing research with new solutions where needed
- Contributing back through data releases and publications
Overall, we believe that building data catalog management systems is a compelling direction for data management research. It brings together and advances multiple previously disparate research areas, while also producing software that serves an increasingly critical real-world need.